Comment about "Be flexible when referencing Unicode!"

From: Richard Ishida <ishida@w3.org>
Date: Wed, 20 Feb 2008 13:45:39 -0000
To: <public-i18n-core@w3.org>
Message-ID: <004901c873c6$e1877950$a4966bf0$@org>

I'm forwarding this to the i18n list, rather than handling as a blog comment, in case it contains comments we need to consider wrt HTML5. (With permission from the author.)

RI

-----Original Message-----
From: WordPress [mailto:wordpress@rishida.net] 
Sent: 18 February 2008 03:41
To: ishida@w3.org
Subject: [ishida >> blog] Please moderate: "Be flexible when referencing Unicode!"

A new comment on the post #135 "Be flexible when referencing Unicode!" is waiting for your approval
http://rishida.net/blog/?p=135

Author : Leif Halvard Silli (IP: 84.208.108.246 , cm-84.208.108.246.getinternet.no)
E-mail : lhs@malform.no
URL    : 
Whois  : http://ws.arin.net/cgi-bin/whois.pl?queryinput=84.208.108.246
Comment: 
With the 5th edition of XML 1.0, <a href='http://www.w3.org/XML/2008/02/xml10_5th_edition_background.html' rel="nofollow">this UNICODE 2.0 limitation is lifted</a>. But the internationalisation aspect of HREF, ID and NAME is still somewhat unclear to me. Hence some questions and comments in that regard, if I may:

   First, for XHTML, I did not know that one could trust the W3 Validator to check that one only uses characters from UNICODE 2.0 in the respective attributes. However, the Validator also tells us that for HTML 5, <em>all</em> UNICODE characters are permitted (including those only found in UNICODE 3 to 5, as HTML 4 is not limited to UNICODE 2.0). But can I trust the Validator to be correct in this regard? Or, what is it that the Validator tells us about the content of <code>ID</code>, <code>name</code> and <code>href</code>?

   What about HTML 5? The NAME attribute is currently not in the draft for HTML 5. But on the other hand, I don't see any mention, in the section about <code>ID</code>, of restrictions on the content of <code>ID</code> in HTML 5. Does that mean that <code>ID</code> in the text/html serialisation of HTML 5 will not have the same limitations as in HTML 4?

   <blockquote cite='http://www.w3.org/TR/xhtml1/#C_8'>[…] the collection of legal values in XML 1.0 Section 2.3, production 5 is much larger than that permitted to be used in the <code>ID</code> and <code>NAME</code> types defined in HTML 4. When defining fragment identifiers to be backward-compatible, only strings matching the pattern [A-Za-z][A-Za-z0-9:_.-]* should be used. See Section 6.2 of [HTML4] for more information.</blockquote>

   <blockquote cite='http://www.w3.org/TR/html401/struct/links.html#h-12.2.3'>[…] the name attribute may contain character references. Thus, the value D&#xfc;rst is a valid name attribute value, as is D&amp;uuml;rst . The id attribute, on the other hand, may not contain character references. […] The name attribute allows richer anchor names (with entities).</blockquote>
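
The two <code>name</code> values in the HTML 4 quotation above can be compared with a minimal Python sketch. It assumes that Python's standard <code>html.unescape</code> is an acceptable stand-in for how a user agent resolves character references; the variable names are illustrative only and come from neither specification.

```python
import html

# The two spellings quoted from HTML 4 section 12.2.3.
numeric_ref = "D&#xfc;rst"  # hexadecimal numeric character reference
named_ref = "D&uuml;rst"    # named character entity reference

# Resolve both references to plain Unicode text.
decoded_numeric = html.unescape(numeric_ref)
decoded_named = html.unescape(named_ref)

# Both spellings should resolve to the same anchor name.
print(decoded_numeric, decoded_named, decoded_numeric == decoded_named)
```

Under that assumption both spellings denote the same name, which is what lets <code>NAME</code> carry a non-ASCII letter while the source text itself stays ASCII.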

   <blockquote cite='http://www.w3.org/TR/html401/appendix/notes.html#h-B.2.1'>Although URIs do not contain non-ASCII values (see [URI], section 2.1) authors sometimes specify them in attribute values expecting URIs (i.e., defined with %URI; in the DTD). For instance, the following href value is illegal:
   &#x3c;A href=&#x22;http://foo.org/H&#xe5;kon&#x22;&#x3e;...&#x3c;/A&#x3e;</blockquote>

   <blockquote cite='http://www.w3.org/TR/html4/types.html#h-6.2'>ID and NAME tokens must begin with a letter ([A-Za-z]) and may be followed by any number of letters, digits ([0-9]), hyphens ("-"), underscores ("_"), colons (":"), and periods (".").</blockquote>

   Thus, the «richness» of the <code>NAME</code> attribute is tied to the fact that it can contain character references.
    Even so, it seems as if HTML 4 expects that the value of <code>NAME</code> always begins with A-Za-z. In other words, the «richness» is in reality limited to character references that can «spice up» words that begin with one of the letters A to Z or a to z. So, words that begin with a Dano-Norwegian Ø (e.g. Ørn) or with a Cyrillic letter are not permitted as values of <code>NAME</code>, while e.g. JØrgen would be permitted. And a Cyrillic word preceded by a Latin letter …
     At the same time, this is contradicted by the fact that the HTML 4 specification itself uses <code>NAME</code> attributes that contain only digits. Try <a href='http://www.w3.org/TR/html4/appendix/changes.html#19991224' rel="nofollow">this URL</a>.
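
The quoted pattern can be tried against the examples in this comment with a short Python sketch. It applies the pattern from XHTML 1.0 section C.8 literally to the whole value; the helper name <code>is_html4_name_token</code> is hypothetical and not part of any specification or validator.

```python
import re

# Pattern quoted in XHTML 1.0 section C.8 for backward-compatible identifiers.
HTML4_NAME_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9:_.-]*")


def is_html4_name_token(value: str) -> bool:
    """Return True if the whole value matches the quoted pattern."""
    return HTML4_NAME_TOKEN.fullmatch(value) is not None


# Report which of the candidate values a literal reading would accept.
for candidate in ["section-1", "Ørn", "JØrgen", "19991224"]:
    print(candidate, is_html4_name_token(candidate))
```

Read this literally, the pattern rejects the digits-only anchor and Ørn, and it rejects JØrgen as well. The more permissive reading of <code>NAME</code> given here depends on the character-reference allowance quoted from section 12.2.3, not on the pattern itself.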

   Because of all this, I don't understand why the W3 Validator doesn't give an error on HTML 4 documents with non-ASCII characters inside HREF and NAME, as the above quotes tell us that non-ASCII characters inside <code>NAME</code> are only permitted if they are inserted as character references, whereas the URI standard, which is referenced by HTML 4, says that URLs must be percent-encoded because of the ASCII limitation in URIs.

   None of the <code>NAME</code> attributes in this blog article (which is in XHTML format) have their non-ASCII letters inserted as character references, so how can this document be legal, even if you restrict yourself to UNICODE 2.0 only?

   I do understand that the W3 Validator, for those scripts and characters you mention, simply doesn't accept these characters regardless of whether they are inserted as character references or typed directly, simply because XML 1.0 is tied to UNICODE 2.0, where these letters don't exist. (I tested it, and the W3 Validator gives the exact same error message whether you insert them as character references or type them directly.)

   Now, I also see that, <em>perhaps</em>, since your article uses the UTF-8 character encoding, there really ought not to be any limitation on what you can put inside <code>NAME</code> (in XHTML and HTML 4 documents) or in <code>ID</code> (in XHTML 1 documents), since the values of ID and NAME end up being interpreted as percent-encoded UTF-8 sequences anyhow. (Is it possible that HTML 4 is colored by the fact that UTF-8 was not in common use when it was written?) But where is this stated? Where is it stated that we do not need to care about what HTML 4 says about these things? I mean, the fact that XML 1.0 was tied to UNICODE 2.0 seems like a quite stupid error. And one could not really expect anyone to respect that part of the XML 1.0 standard, as it seems very illogical. I would not refrain from using my own characters only in order to be validated by the W3 Validator. And likewise, when we look at HTML 4, the limitations on what <code>NAME</code> may contain are not logical.

   Yet, at the same time, there are many who seem content with only ASCII values in these attributes. And non-ASCII values in these attributes aren't very common, are they, whether in XHTML or HTML? (This is not an argument against their use. But it is a point that needs to be made in order to say that merely expanding XML 1.0 to use UNICODE above 2.0 is perhaps a huge step for W3C, but still a small step for humanity. It takes more, and more practical, steps to get authors to use non-ASCII IDs. The discrimination against South-Asian languages is a theoretical one. In reality, all non-ASCII scripts are discriminated against.)

   If the W3 Validator is picky about characters not existing in UNICODE 2.0, then it should also be picky about <strong>how</strong> <code>NAME</code> and <code>ID</code> values are inserted, don't you think?

   Or, what is it that the Validator tells us when we try to validate this blog post of yours? Is it in fact so that it only checks <em>which</em> characters you have used, but abstains from telling us whether you have inserted them the correct way? (The Validator has some limitations with regard to checking the content of attributes.)

   I think the logical thing would be that authors could insert all sensible UNICODE letters in <code>HREF</code>, <code>NAME</code> or <code>ID</code>, and that it becomes the User Agent's responsibility to send them correctly through the HTTP protocol, as percent-encoded characters.
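
As a rough illustration of what such a user agent step might look like, the Python sketch below encodes a path segment as UTF-8 and then percent-escapes the bytes. This is the convention HTML 4 appendix B.2.1 recommends for non-ASCII characters in URI attribute values. The helper name <code>percent_encode_path_segment</code> is hypothetical, and the example reuses the Håkon URL from the quotation above.

```python
from urllib.parse import quote, unquote


def percent_encode_path_segment(segment: str) -> str:
    """Encode the segment as UTF-8, then percent-escape every byte
    that is not an unreserved URI character."""
    return quote(segment, safe="", encoding="utf-8")


# The author types the readable form; the UA would send the escaped form.
encoded = percent_encode_path_segment("Håkon")
print("http://foo.org/" + encoded)

# Decoding the escaped form recovers the original text.
print(unquote(encoded))
```

Under this convention the author keeps the readable form in the markup and only the escaped form travels over HTTP.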

   Is this in fact the difference between XML and HTML? That XML expects the <abbr title='user agent'>UA</abbr> to handle this? 

   I am sorry if these things are clear to everyone but me ... I would like to see someone putting these things straight …

Received on Wednesday, 20 February 2008 13:42:26 GMT