IRI status of HTML

A W3C i18n article says [1]:

    "For IRIs to work, there are four main requirements:
    1. the syntax of the format where IRIs are used (eg. HTML, XML, SVG,
    etc) must support the use of non-ASCII characters in Web addresses."

After that it jumps to a (surprising) conclusion regarding HTML 4:

    "Various document formats and specifications already support IRIs.
    Examples include HTML 4.0, [...] "

How come? For the record, the document "HTML 5 differences from HTML 4" 
says [2]:

    "HTML now has native support for IRIs. In HTML 4 this was only
    handled as error handling."

So, does the article take "Web addresses" to mean "content of the href
attribute"? Even *then*, the statement seems *formally* untrue, as
HTML 4 (even if no one seems to care) only allows URI content in HREF
attributes [3]. The spec describes how a UA should work around HREF
attributes containing non-ASCII characters (represent them as UTF-8
and %-encode the bytes), but that is not the same as "support" for
IRIs in the *document format*, since HTML 4 seems to expect a much
closer relationship between the HTTP protocol and the document format
than this i18n article does [4].
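
To make that workaround concrete, here is a minimal Python sketch of
what [4] describes: represent each non-ASCII character as UTF-8 and
%-encode the resulting bytes. The function name and the sample path are
my own illustrations, not taken from the spec, and leaving "/" unescaped
is an assumption on my part.

```python
from urllib.parse import quote

def href_to_uri_path(path: str) -> str:
    """Percent-encode a path roughly as the HTML 4 note suggests:
    UTF-8 bytes, each escaped as %HH. '/' is left alone (assumption)."""
    return quote(path, safe="/", encoding="utf-8")

print(href_to_uri_path("/wiki/Å"))  # /wiki/%C3%85
```

The point is that the conversion happens in the UA; the document format
itself still only declares URI content for HREF.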

The article should at least say on what basis it claims that HTML
accepts IRIs. Is the statement based on post-specification
developments, as is the case for language tags?

I think the article should mention document format support for IRI-style
fragment identifiers as a fifth requirement, as this is both bound up
with the format syntax and (un)related to the document encoding. At any
rate, as of now, the article fails to discuss fragment identifiers.

The current status for IRIs and fragment identifiers in HTML is this:

    * When the specification talks about NAME='' as a compatibility
      attribute for older User Agents (i.e. if you equip the A element
      with both attributes simultaneously), it emphasizes that NAME
      is restricted to the characters of the ID attribute, since the
      two attributes share the same namespace. [5]
    * When it speaks about NAME="" versus ID="" as an either-or, that
      is, a feature choice, it says that NAME can contain CDATA = the
      entire Unicode range of characters, including character entities. [6]
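
The difference between the two cases can be sketched in Python. The
pattern below is my reading of the ID/NAME token rule in [5] (an ASCII
letter first, then letters, digits, hyphens, underscores, colons and
periods); the helper name is made up for illustration.

```python
import re

# ID/NAME token rule as I read it in [5]: an ASCII letter, followed by
# any number of ASCII letters, digits, "-", "_", ":" and ".".
HTML4_ID_TOKEN = re.compile(r"[A-Za-z][A-Za-z0-9\-_:.]*")

def is_html4_id_token(value: str) -> bool:
    """Return True if value fits the HTML 4 ID token rule (hypothetical helper)."""
    return HTML4_ID_TOKEN.fullmatch(value) is not None

for candidate in ["intro", "sec:1.2", "Å", "1st"]:
    print(candidate, is_html4_id_token(candidate))
```

Under this reading "Å" is rejected as an ID value, whereas as CDATA in a
stand-alone NAME attribute it would be allowed.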

The benefit of the name="" attribute is undercommunicated. The
general perception, as far as I can tell, is that fragment identifiers
have to be pure ASCII. The fact that Wikipedia does not use IRI fragment
identifiers, when they are otherwise so clever in their use of I18N and
L10N techniques, is disappointing to note in that respect. After all,
Wikipedia consistently uses <a name="idref"> instead of e.g. <p
id="idref"> when it comes to creating intertextual links. And worse,
they could have inserted the fragments as pure %-encoded identifiers -
such identifiers display as readable text in Firefox and Opera (yes,
despite Opera's bug), but instead they somehow escape the escapes, so
that name="Å" becomes name=".C3.85" instead of name="%C3%85". Anyway ...
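
For illustration, the two forms of the anchor for "Å" can be reproduced
in Python. The %-encoded form is the UTF-8 bytes C3 85 written as
%-escapes; the dotted form is simply my guess at how the Wikipedia
output could be derived (swap "%" for "."), and both function names are
hypothetical.

```python
from urllib.parse import quote, unquote

def percent_encoded_anchor(name: str) -> str:
    """UTF-8 %-encoding of an anchor name."""
    return quote(name, safe="", encoding="utf-8")

def dot_escaped_anchor(name: str) -> str:
    """Assumed reconstruction of the dotted form: same bytes, '.' for '%'."""
    return percent_encoded_anchor(name).replace("%", ".")

print(percent_encoded_anchor("Å"))  # %C3%85
print(dot_escaped_anchor("Å"))      # .C3.85
print(unquote("%C3%85"))            # Å
print(unquote(".C3.85"))            # .C3.85 (not a %-escape, so left as-is)
```

Only the %-encoded form can be turned back into readable text by a
generic unescaping step, which is why the dotted form cannot be
displayed as "Å" by a browser.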

Now, what about HTML 5?

    * "URL", as used inside the HTML 5 draft, covers both IRIs and URIs.
    * name='' is so far not included in HTML 5.
    * the experimental HTML 5 validation (see validator.w3.org) does
      so far not throw any error if you insert either character
      references or non-ASCII characters inside the ID attribute, quite
      unlike the W3C HTML 4 validator, which does. (And the HTML 5
      validation engine is extremely detailed and checks many more
      attributes than the "old" W3C validator, where I would not always
      trust the lack of error messages.)
    * the draft does not say that non-ASCII letters are forbidden in the
      ID attribute.


Just to throw out some questions: XHTML accepts non-ASCII characters in
the ID attribute. Can we take for granted that HTML 5 will also do that?
What needs to be specified in that regard? As far as I can see,
HTML 5 must say something direct about how to create internationalized
fragment identifiers.

[1] http://w3.org/International/articles/idn-and-iri/
[2] http://www.w3.org/TR/html5-diff/#syntax-misc
[3] http://w3.org/TR/html401/sgml/loosedtd.html#URI
[4] http://w3.org/TR/html401/appendix/notes.html#non-ascii-chars
[5] http://w3.org/TR/html401/types.html#type-name
[6] http://w3.org/TR/html401/struct/links.html#h-12.2.3
-- 
leif halvard silli

Received on Wednesday, 17 December 2008 07:20:45 UTC