- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Tue, 01 Oct 2002 11:34:13 +0200
- To: www-html@w3.org
Hi,

HTML and XHTML allow non-ASCII characters in anchor names (the name attribute of the a element in HTML, the id attribute in XHTML), but they fail to define how fragment identifiers must be encoded and decoded to ensure correct interpretation. Consider, for example, an HTML document with the following anchor:

  <h1><a name='Bj&ouml;rn'>...</a></h1>

Document-internal links to this fragment might look like

  <ol>
  <!--1--><li><a href='#Björn'       >none       nfc</a> <!-- invalid -->
  <!--2--><li><a href='#Björn'       >none       nfd</a> <!-- invalid -->
  <!--3--><li><a href='#Bj%f6rn'     >iso-8859-1 nfc</a>
  <!--4--><li><a href='#Bj%c3%b6rn'  >utf-8      nfc</a>
  <!--5--><li><a href='#Bjo%cc%88rn' >utf-8      nfd</a>
  </ol>

Cases 1 and 2 are invalid (non-ASCII character in a URI reference). However, appendix B.2.1 of the HTML 4 specification suggests fixing them like this:

  We recommend that user agents adopt the following convention for
  handling non-ASCII characters in such cases:

  1. Represent each character in UTF-8 (see [RFC2279]) as one or
     more bytes.
  2. Escape these bytes with the URI escaping mechanism (i.e., by
     converting each byte to %HH, where HH is the hexadecimal
     notation of the byte value).

This would make

  Case 1 == Case 4
  Case 2 == Case 5

I tested the above cases in various browsers (x = test case passed). The encoding of the document was set to ISO-8859-1:

  +-------------+---+---+---+---+---+
  |             | 1 | 2 | 3 | 4 | 5 |
  +-------------+---+---+---+---+---+
  | Mozilla 1.0 | x |   | x | x |   |
  | Opera 6.0   | x |   |   |   |   |
  | Amaya 6.4   | x |   |   |   |   |
  | IE 6.0      | x | x |   |   |   |
  +-------------+---+---+---+---+---+

In a UTF-8 encoded document using

  <h1><a name='Björn'>...</a></h1>

(a literal character) instead of a named character reference, Mozilla no longer passes test case 3; the other results remain the same.
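To make the difference between the test cases concrete, here is a minimal Python sketch of the two B.2.1 steps, with an optional normalization step in front of them. The helper name escape_fragment and its parameters are only an illustration and are not defined by any of the specifications; B.2.1 itself covers only the UTF-8 branch and says nothing about normalization.

```python
import unicodedata
from urllib.parse import quote, unquote, unquote_to_bytes


def escape_fragment(name: str, form: str = "NFC", encoding: str = "utf-8") -> str:
    """Hypothetical helper: normalize, encode to bytes, then %HH-escape.

    With encoding="utf-8" this follows steps 1 and 2 of appendix B.2.1.
    The normalization form and the ISO-8859-1 option are assumptions
    added here to reproduce test cases 3 to 5.
    """
    normalized = unicodedata.normalize(form, name)
    # safe="" escapes every byte outside the unreserved set
    return quote(normalized.encode(encoding), safe="")


name = "Bj\u00f6rn"  # precomposed o-diaeresis (U+00F6)

case3 = escape_fragment(name, "NFC", "iso-8859-1")  # single byte F6
case4 = escape_fragment(name, "NFC", "utf-8")       # bytes C3 B6
case5 = escape_fragment(name, "NFD", "utf-8")       # 'o' followed by bytes CC 88

print(case3, case4, case5)

# Decoding is ambiguous unless the byte encoding is known in advance:
as_latin1 = unquote_to_bytes("Bj%f6rn").decode("iso-8859-1")  # recovers the name
as_utf8 = unquote("Bj%f6rn")  # F6 is not valid UTF-8, yields U+FFFD
```

Hex digits in escapes are case-insensitive, so the uppercase output of quote() is equivalent to the lowercase escapes in the test cases. The last two lines show why case 3 can only match if the user agent guesses that the escaped byte is ISO-8859-1 rather than UTF-8.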
Using

  <h1><a name='Björn'>...</a></h1>

(the name in decomposed form, NFD) changes the results as follows:

  +-------------+---+---+---+---+---+
  |             | 1 | 2 | 3 | 4 | 5 |
  +-------------+---+---+---+---+---+
  | Mozilla 1.0 |   | x |   |   | x |
  | Opera 6.0   |   | x |   |   |   |
  | Amaya 6.4   |   | x |   |   |   |
  | IE 6.0      | x | x |   |   |   |
  +-------------+---+---+---+---+---+

The test document is always delivered as text/html with a correct charset parameter.

RFC 3236 (application/xhtml+xml) refers to XPointer for possible future interpretations of fragment identifiers. Past drafts of XPointer had a section (4.1.2) [1] on how to deal with non-ASCII characters. Since the draft was split into several parts, there is no longer such a section. I wonder whether this a) conforms to CharMod and b) will cause problems, since it is still not clear how to interpret non-ASCII escapes in a URI reference.

Another problem is that the available resources (mostly drafts) disagree on whether NFC normalization must, must not or should happen and, if so, when it should happen.

I'd like to see this situation improved already in HTML 4, but at least for application/xhtml+xml something needs to be done, IMO.

[1] http://www.w3.org/TR/2001/CR-xptr-20010911/#uri-escaping

regards.
Received on Tuesday, 1 October 2002 05:34:09 UTC