- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Wed, 18 Feb 2009 10:37:45 -0500
- To: "Roy T. Fielding" <fielding@gbiv.com>
- CC: HTML WG <public-html@w3.org>
Roy T. Fielding wrote:
> The id attribute in HTML5 is defined to be an opaque string,
> presumably in the document character encoding. Therefore, either
> the data in the fragment has to be converted to the document character
> encoding, or the data in the id has to be converted to the URI encoding,
> before the two can be compared as opaque strings.
>
> The name attribute in HTML4 is defined to be cdata in the document
> character encoding. Therefore, either the data in the fragment has
> to be converted to the document character encoding, or the data in
> the name attribute has to be converted to the URI encoding, before
> the two can be compared as opaque strings.
>
> Firefox is doing what was recommended by HTML4:
>
>   http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
>
>   Note. The same conversion based on UTF-8 should be applied to
>   values of the name attribute for the A element.

For the record, that is not quite what Firefox is doing. What Firefox
is doing is the following:

1) During parsing, ids are converted from bytes to Unicode characters
   using the document encoding.

2) During parsing, names are converted from bytes to Unicode characters
   using the document encoding. For the <a> element the name is then
   encoded as UTF-8, the resulting byte array has all %-escapes
   replaced by the relevant bytes (as indicated by the escape), and
   the resulting byte-array is converted to Unicode by assuming that
   it's UTF-8.

3) When asked to scroll to a fragment identifier, the fragment
   identifier is fetched from the URI object it's in. This gives a
   byte array. All %-escapes are replaced with the corresponding
   bytes. Then the byte array is converted to Unicode by assuming it
   to be UTF-8, and the resulting string matched against names and ids
   in the document. If no match is found, the same byte array is
   converted to Unicode by using the originating character encoding of
   the URI, and the matching is tried again.
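The three steps above can be sketched roughly as follows. This is an illustrative Python sketch, not Gecko's actual code: the function names (decode_attr, normalize_anchor_name, find_target) are hypothetical, and replacing undecodable bytes with U+FFFD is an assumption made for the example.

```python
from typing import Optional, Set
from urllib.parse import unquote_to_bytes


def decode_attr(raw: bytes, doc_encoding: str) -> str:
    """Steps 1 and 2: attribute bytes -> Unicode via the document encoding."""
    # Assumption: undecodable bytes become U+FFFD rather than raising.
    return raw.decode(doc_encoding, errors="replace")


def normalize_anchor_name(name: str) -> str:
    """Step 2, <a name> only: encode as UTF-8, replace %-escapes with the
    bytes they denote, then reinterpret the result as UTF-8."""
    unescaped = unquote_to_bytes(name.encode("utf-8"))
    return unescaped.decode("utf-8", errors="replace")


def find_target(fragment: str, targets: Set[str],
                origin_encoding: str) -> Optional[str]:
    """Step 3: unescape the fragment to bytes, then try UTF-8 first and
    the URI's originating character encoding second."""
    raw = unquote_to_bytes(fragment)
    for encoding in ("utf-8", origin_encoding):
        candidate = raw.decode(encoding, errors="replace")
        if candidate in targets:
            return candidate
    return None  # neither interpretation matched a name or id
```

Under these assumptions, a document in ISO-8859-1 with a target named "café" would be matched by both #caf%C3%A9 (first pass, UTF-8) and #caf%E9 (second pass, originating encoding).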
In practice, it turns out that matching when the author didn't expect it
is rarely a problem (because it's rare that there are names or fragment
identifiers around that are the same up to escaping but meant to NOT
match), while not matching when the author expected a match causes "web
site doesn't work" issues. Hence the above algorithm, which attempts to
match as broadly as possible...

As an aside, now that I reread the HTML4 note in the light of day it
sounds like it's aimed at authors, not UAs.

-Boris
Received on Wednesday, 18 February 2009 15:38:31 UTC