Re: fragid navigation and pct-encoded

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Wed, 18 Feb 2009 10:37:45 -0500
Message-ID: <499C2B49.8060302@mit.edu>
To: "Roy T. Fielding" <fielding@gbiv.com>
CC: HTML WG <public-html@w3.org>

Roy T. Fielding wrote:
> The id attribute in HTML5 is defined to be an opaque string,
> presumably in the document character encoding.  Therefore, either
> the data in the fragment has to be converted to the document character
> encoding, or the data in the id has to be converted to the URI encoding,
> before the two can be compared as opaque strings.
> 
> The name attribute in HTML4 is defined to be cdata in the document
> character encoding.  Therefore, either the data in the fragment has
> to be converted to the document character encoding, or the data in
> the name attribute has to be converted to the URI encoding, before
> the two can be compared as opaque strings.
> 
> Firefox is doing what was recommended by HTML4:
> 
>   http://www.w3.org/TR/html4/appendix/notes.html#non-ascii-chars
> 
>     Note. The same conversion based on UTF-8 should be applied to
>     values of the name attribute for the A element.

For the record, that is not quite what Firefox is doing.

What Firefox is doing is the following:

1)  During parsing, ids are converted from bytes to Unicode characters
     using the document encoding.
2)  During parsing, names are converted from bytes to Unicode characters
     using the document encoding.  For the <a> element the name is then
     encoded as UTF-8, the resulting byte array has all %-escapes
     replaced by the relevant bytes (as indicated by the escape), and
     the resulting byte array is converted to Unicode by assuming that
     it's UTF-8.
3)  When asked to scroll to a fragment identifier, the fragment
     identifier is fetched from the URI object it's in.  This gives
     a byte array.  All %-escapes are replaced with the corresponding
     bytes.  Then the byte array is converted to Unicode by assuming
     it to be UTF-8, and the resulting string matched against names
     and ids in the document.  If no match is found, the same byte
     array is converted to Unicode by using the originating character
     encoding of the URI, and the matching is tried again.
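
The three steps above can be sketched roughly as follows.  This is only
an illustration of the description in this message, not Firefox's actual
code: the function names, the use of plain Python strings for ids and
names, and the handling of byte sequences that are not valid UTF-8 are
all assumptions made for the example.

```python
from urllib.parse import unquote_to_bytes


def normalize_anchor_name(name: str) -> str:
    """Step 2 (hypothetical helper): normalize the name of an <a> element.

    The name is encoded as UTF-8, %-escapes are replaced by the bytes
    they denote, and the result is decoded as UTF-8 again.
    """
    raw = unquote_to_bytes(name.encode("utf-8"))
    # Assumption: invalid UTF-8 becomes U+FFFD; the message does not say
    # what happens in that case.
    return raw.decode("utf-8", errors="replace")


def find_target(fragment, ids, names, origin_charset):
    """Step 3 (hypothetical helper): return the matching id/name or None.

    `fragment` is the fragment identifier as it appears in the URI,
    `ids` and `names` are already decoded with the document encoding
    (step 1 and the first half of step 2), and `origin_charset` is the
    originating character encoding of the URI.
    """
    raw = unquote_to_bytes(fragment)  # replace %-escapes with bytes
    # ids are used as-is; only <a> names get the unescaping of step 2.
    targets = set(ids) | {normalize_anchor_name(n) for n in names}
    # Try UTF-8 first, then fall back to the originating encoding.
    for charset in ("utf-8", origin_charset):
        try:
            candidate = raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            # Assumption: an undecodable byte array simply does not match.
            continue
        if candidate in targets:
            return candidate
    return None
```

With these assumptions, find_target("caf%E9", ["café"], [], "iso-8859-1")
fails the UTF-8 attempt and then matches through the fallback to the
originating encoding, while a fragment of "caf%C3%A9" matches on the
first attempt.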

In practice, it turns out that matching when the author didn't expect it
is rarely a problem (it's rare for names or fragment identifiers to be
the same up to escaping and yet meant NOT to match), while failing to
match when the author expected a match causes "web site doesn't work"
issues.  Hence the above algorithm, which attempts to match as broadly
as possible.

As an aside, now that I reread the HTML4 note in the light of day, it
sounds like it's aimed at authors, not UAs.

-Boris
Received on Wednesday, 18 February 2009 15:38:31 UTC