Re: fragid navigation and pct-encoded from Boris Zbarsky on 2009-02-18 (public-html@w3.org from February 2009)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Wed, 18 Feb 2009 11:09:33 -0500
To: Ian Hickson <ian@hixie.ch>
CC: HTML WG <public-html@w3.org>
Message-ID: <499C32BD.8090304@mit.edu>
Ian Hickson wrote:
> It's not clear to me why anything would get canonicalized in the fragment 
> identifier. IE doesn't canonicalize anything, and none of the specs seem 
> to expect the fragment identifier to be canonicalized.

RFC 2396 (URI) says:

   fragment      = *uric
   uric          = reserved | unreserved | escaped
   reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
                 "$" | ","
   unreserved  = alphanum | mark
   mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
   escaped     = "%" hex hex

alphanum has the usual [A-Za-z0-9] expansion.

Per RFC 2396, ASCII space is not allowed unescaped inside a fragment 
identifier.  No non-ASCII byte is allowed in a fragment identifier.
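
To make that concrete, here is a minimal sketch of a checker for the RFC 2396 fragment production quoted above. The function name is made up for illustration, and this only checks the grammar; it is not how any browser validates fragments.

```python
import re

# fragment = *uric, where uric = reserved | unreserved | escaped
# reserved:   ; / ? : @ & = + $ ,
# unreserved: alphanum plus the "mark" characters - _ . ! ~ * ' ( )
# escaped:    "%" hex hex
_URIC_RUN = re.compile(r"(?:[A-Za-z0-9;/?:@&=+$,\-_.!~*'()]|%[0-9A-Fa-f]{2})*")


def is_rfc2396_fragment(fragment: str) -> bool:
    """Return True if `fragment` (without the leading '#') matches *uric."""
    return _URIC_RUN.fullmatch(fragment) is not None


if __name__ == "__main__":
    for candidate in ["a%20a", "a a", "caf\u00e9", ""]:
        print(repr(candidate), is_rfc2396_fragment(candidate))
```

The unescaped space and the non-ASCII character are both rejected, which is the point made in the paragraph above.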

RFC 3987 (IRI) says:

   ifragment      = *( ipchar / "/" / "?" )
   ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                        / "*" / "+" / "," / ";" / "="
   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

ucschar is defined as various Unicode character ranges (but not all of 
Unicode; for example iprivate is not allowed here).  ALPHA and DIGIT are 
actually not defined in the grammar, but the only sane assumption for 
them is [A-Za-z] and [0-9] respectively.

Again, ASCII space is not allowed unescaped inside a fragment 
identifier.  When converting to a URI (e.g. for placement into an HTTP 
request), ucschar must be encoded as UTF-8 and %-escaped.  So any IRI 
that round-trips via such a medium will have its fragment identifier 
converted to URI-compatible form.
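
A rough sketch of that IRI-to-URI step for the fragment alone, using Python's urllib.parse.quote. The function name and the choice of "safe" characters are my assumptions: the safe set is the ASCII punctuation that ifragment allows, plus "%" so that existing pct-encoded sequences are left alone. It also escapes spaces, which are not valid in an IRI to begin with; that is a convenience of the sketch, not something RFC 3987 asks for.

```python
from urllib.parse import quote

# ASCII punctuation permitted by ifragment (sub-delims, ":", "@", "/", "?"),
# plus "~" and "%". Keeping "%" safe preserves existing pct-encoded
# sequences, but it also means a stray "%" is passed through untouched.
_FRAGMENT_SAFE = "!$&'()*+,;=:@/?%~"


def iri_fragment_to_uri(fragment: str) -> str:
    """Encode non-ASCII characters as UTF-8 and %-escape the resulting bytes."""
    return quote(fragment, safe=_FRAGMENT_SAFE, encoding="utf-8")


if __name__ == "__main__":
    print(iri_fragment_to_uri("caf\u00e9"))
    print(iri_fragment_to_uri("a a"))
    print(iri_fragment_to_uri("a%20a"))
```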

Gecko's URI objects automatically put the string they're given into at 
least the IRI form, so spaces are always %-escaped.

> Given:
> 
>    <a href="#a a">...</a>
> 
> ...my brief testing suggests that IE would look for:
> 
>    name="a a"
> 
> ...while Safari would look for:
> 
>    name="a%20a"
> 
> It appears Gecko would look for either.

As far as I can tell, correct, including situations like foo.html 
containing <a href="bar.html#a a">....  It's not clear to me whether 
IE's behavior is compatible with the abovementioned RFCs, but it might 
be if it never actually constructs a URI or anything like that.

Also note that if the valid URI file:///tmp/test.html#a%20a is pasted 
into the URL bar in IE, it will not find the name="a a" anchor, while if 
the invalid URI "file:///tmp/test.html#a a" is pasted, it will.

IE will also treat an IRI and its equivalent URI differently in terms of 
matching target anchors, last I checked.  That is, since it does exact 
matching on the fragid it will either match the URI (%-escaped) version 
or the IRI (direct ucschar) version but not both.

> Given:
> 
>    <a href="#a%20a">...</a>
> 
> ...my brief testing suggests that IE and Safari would look for:
> 
>    name="a%20a"
> 
> ...while Gecko would again look for either that or:
> 
>    name="a a"

Yes, I believe that is correct.
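
The observations in this thread can be summarized as a toy model. This is only a restatement of the brief testing quoted above, not a description of any browser's actual code; the function name, the browser labels, and the escaping rule are assumptions made for illustration.

```python
from urllib.parse import quote, unquote

# Characters left unescaped; "%" is kept so "a%20a" is not double-escaped.
_FRAGMENT_SAFE = "!$&'()*+,;=:@/?%~"


def candidate_names(fragment: str, browser: str) -> set:
    """Return the name="" values the given browser was observed to look for.

    `fragment` is the text after '#' exactly as written in the href.
    """
    escaped = quote(fragment, safe=_FRAGMENT_SAFE)
    if browser == "ie":
        # Exact match on the fragment as written.
        return {fragment}
    if browser == "safari":
        # Matches only the %-escaped form.
        return {escaped}
    if browser == "gecko":
        # Matches either the %-escaped form or its decoded form.
        return {escaped, unquote(escaped)}
    raise ValueError("unknown browser label: " + browser)


if __name__ == "__main__":
    for href_fragment in ["a a", "a%20a"]:
        for label in ["ie", "safari", "gecko"]:
            print(href_fragment, label, sorted(candidate_names(href_fragment, label)))
```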

>> It seems like it would lead to odd effects when getElementsByName is 
>> used, but maybe that's ok.
> 
> That section doesn't seem to suggest actually changing the actual 
> attribute, unless I'm misreading it.

It does seem to, to me (and it seems to be author-directed, not 
UA-directed, now that I read it again).

> "Let /decoded fragid/ be the result of expanding any sequences of 
> percent-encoded octets in fragid that are valid UTF-8 sequences into 
> Unicode characters as defined by UTF-8. If any percent-encoded octets in 
> that string are not valid UTF-8 sequences, then skip this step and the 
> next one."
> 
> (The next step is the one that looks for ids that equal /decoded fragid/.)

OK.  I just looked through the CVS history of our "fall back to URI 
origin charset" codepath, and don't actually see an obvious reason it 
was done that way other than the general "better to match too much here 
than too little" philosophy...  So always using UTF-8 is probably fine.
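
For reference, a minimal sketch of the "always UTF-8" decoding step as described in the quoted spec text: expand the percent-encoded octets, and give up on the decoded lookup if they are not valid UTF-8. The function name is hypothetical, and returning None to mean "skip this step and the next one" is my own convention.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_fragid(fragid: str) -> Optional[str]:
    """Return the decoded fragid, or None if its octets are not valid UTF-8."""
    # unquote_to_bytes expands %XX sequences into raw bytes; any unescaped
    # non-ASCII characters in the input are encoded as UTF-8 first.
    raw = unquote_to_bytes(fragid)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        # Not valid UTF-8: the caller should skip the decoded-id lookup.
        return None


if __name__ == "__main__":
    for fragid in ["a%20a", "caf%C3%A9", "%E9"]:
        print(repr(fragid), repr(decode_fragid(fragid)))
```

Note there is no fallback to the document's origin charset here; "%E9" (Latin-1 for e-acute) is simply not decoded.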

-Boris
Received on Wednesday, 18 February 2009 16:10:22 UTC