- From: Leif Halvard Silli <lhs@malform.no>
- Date: Wed, 17 Dec 2008 08:20:04 +0100
- To: www-international@w3.org
A W3C i18n article says [1]: "For IRIs to work, there are four main requirements: 1. the syntax of the format where IRIs are used (eg. HTML, XML, SVG, etc) must support the use of non-ASCII characters in Web addresses."

After that it jumps to a (surprising) conclusion regarding HTML 4: "Various document formats and specifications already support IRIs. Examples include HTML 4.0, [...] "

How come? For the record, the document "HTML 5 differences from HTML 4" says [2]: "HTML now has native support for IRIs. In HTML 4 this was only handled as error handling."

So, does the article take "Web addresses" to mean "content of the href attribute"? But even *then*, this statement seems *formally* untrue, as HTML (even if no one seems to care) only allows URI content in HREF attributes [3]. The spec mentions how a UA should work around HREF attributes containing non-ASCII characters (UTF-8 representation and %-encoding), but that is not the same as "support" of IRIs in the *document format*, since HTML 4 seems to expect a much closer relationship between the HTTP protocol and the document format than this i18n article does [4].

The document should at least tell on what basis it says that HTML accepts IRIs. Is the statement based on post-specification developments, as is the case when it comes to the language tags?

I think the article should mention document format support for IRI-style fragment identifiers as a fifth requirement, as this is bound up with the format syntax and is (un)related to the document encoding. At any rate, as of now, the article fails to discuss fragment identifiers.

The current status for IRIs and fragment identifiers in HTML is this:

* When the specification talks about NAME='' as a compatibility attribute for older User Agents (i.e. if you equip the A element with both attributes simultaneously), it emphasizes that NAME is restricted to the characters of the ID attribute, since the two attributes share the same namespace. [5]
* When it speaks about NAME="" versus ID="" as an either-or, that is, a feature choice, it says that NAME can contain CDATA, i.e. the entire Unicode range of characters, including character entities. [6]

The benefit of the name="" attribute is undercommunicated. And the general perception, as I perceive it, is that fragment identifiers have to be pure ASCII.

The fact that Wikipedia does not use IRI fragment identifiers, when they otherwise are so clever in their use of I18N and L10N techniques, is disappointing to note in that respect. After all, Wikipedia consistently uses <a name="idref"> instead of e.g. <p id="idref"> when it comes to creating intertextual links. And worse, they could have inserted the fragments as pure %-encoded identifiers (such identifiers display as readable text in Firefox and Opera, yes, despite Opera's bug), but instead they somehow escape the escapes, so that name="Å" becomes name=".C3.85" instead of name="%C3%85".

Anyway ... Now, what about HTML 5?

* "URL", as used inside the HTML 5 draft, covers both IRIs and URIs.
* name='' is so far not included in HTML 5.
* The experimental HTML 5 validation (see validator.w3.org) does so far not throw any error if you insert either character references or non-ASCII characters inside the ID attribute, quite unlike the W3C HTML 4 validator, which does. (And the HTML 5 validation engine is extremely detailed and checks many more attributes than the "old" W3C validator, where I would not always trust the lack of error messages.)
* The draft does not say that non-ASCII letters are forbidden in the ID attribute.

Just to throw out some questions: XHTML accepts non-ASCII characters in the ID attribute. Can we take for granted that HTML 5 will also do that? What things need to be specified in that regard? As far as I can see, HTML 5 must say something direct about how to create internationalized fragment identifiers.
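To make the escaping question above concrete, here is a minimal Python sketch of the two forms of the fragment identifier "Å". The helper names percent_encode_fragment and dot_encode_fragment are made up for this illustration, and the dot variant only imitates the ".C3.85" result mentioned above by swapping "%" for "."; it is an assumption about the visible outcome, not a description of how Wikipedia's software actually produces it.

```python
from urllib.parse import quote, unquote


def percent_encode_fragment(name: str) -> str:
    """UTF-8 encode the name, then %-escape every byte that is not unreserved."""
    return quote(name, safe="")


def dot_encode_fragment(name: str) -> str:
    """Hypothetical imitation of the '.C3.85' style: '%' replaced by '.'."""
    return percent_encode_fragment(name).replace("%", ".")


fragment = "Å"  # U+00C5, which is the two bytes C3 85 in UTF-8

print(percent_encode_fragment(fragment))  # %C3%85
print(dot_encode_fragment(fragment))      # .C3.85

# A generic URI decoder recovers the letter from the %-form only;
# the dot form is left untouched because it contains no escapes.
print(unquote("%C3%85"))  # Å
print(unquote(".C3.85"))  # .C3.85
```

The last two lines show the practical difference: software that understands %-escapes can turn the first form back into readable text, while the dot form stays opaque unless the software knows about that particular convention.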
[1] http://w3.org/International/articles/idn-and-iri/
[2] http://www.w3.org/TR/html5-diff/#syntax-misc
[3] http://w3.org/TR/html401/sgml/loosedtd.html#URI
[4] http://w3.org/TR/html401/appendix/notes.html#non-ascii-chars
[5] http://w3.org/TR/html401/types.html#type-name
[6] http://w3.org/TR/html401/struct/links.html#h-12.2.3

--
leif halvard silli
Received on Wednesday, 17 December 2008 07:20:45 UTC