- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Wed, 18 Feb 2009 11:09:33 -0500
- To: Ian Hickson <ian@hixie.ch>
- CC: HTML WG <public-html@w3.org>
Ian Hickson wrote:
> It's not clear to me why anything would get canonicalized in the fragment
> identifier. IE doesn't canonicalize anything, and none of the specs seem
> to expect the fragment identifier to be canonicalized.
RFC 2396 (URI) says:
fragment = *uric
uric = reserved | unreserved | escaped
reserved = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
"$" | ","
unreserved = alphanum | mark
mark = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
escaped = "%" hex hex
alphanum has the usual [A-Za-z0-9] expansion.
Per RFC 2396, ASCII space is not allowed unescaped inside a fragment
identifier. No non-ASCII byte is allowed in a fragment identifier.
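As a rough illustration of the grammar above, the following Python sketch checks whether a string is a valid RFC 2396 fragment. The function name is_rfc2396_fragment is made up for this example, and the regular expression is only a direct transcription of the productions quoted here, not code from any browser.

```python
import re

# uric = reserved | unreserved | escaped, transcribed from the quoted grammar.
_RESERVED = r";/?:@&=+$,"
_MARK = r"\-_.!~*'()"
_URIC = rf"(?:[{_RESERVED}A-Za-z0-9{_MARK}]|%[0-9A-Fa-f]{{2}})"
_FRAGMENT_RE = re.compile(rf"{_URIC}*")


def is_rfc2396_fragment(fragment: str) -> bool:
    """Return True if `fragment` matches `fragment = *uric` (hypothetical helper)."""
    return _FRAGMENT_RE.fullmatch(fragment) is not None
```

Under this check "a%20a" is accepted, while "a a" and any string containing a non-ASCII character are rejected.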
RFC 3987 (IRI) says:
ifragment = *( ipchar / "/" / "?" )
ipchar = iunreserved / pct-encoded / sub-delims / ":" / "@"
sub-delims = "!" / "$" / "&" / "'" / "(" / ")"
/ "*" / "+" / "," / ";" / "="
iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar is defined as various Unicode character ranges (but not all of
Unicode; for example iprivate is not allowed here). ALPHA and DIGIT are
actually not defined in the grammar, but the only sane assumption for
them is [A-Za-z] and [0-9] respectively.
Again, ASCII space is not allowed unescaped inside a fragment
identifier. When converting to a URI (e.g. for placement into an HTTP
request), ucschar must be encoded as UTF-8 and %-escaped. So any IRI
that round-trips via such a medium will have its fragment identifier
converted to URI-compatible form.
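The IRI-to-URI step for a fragment can be sketched roughly as follows. This is a simplified illustration, not Gecko's code: the name iri_fragment_to_uri is invented here, and the sketch escapes every non-ASCII character rather than checking membership in ucschar. ASCII characters, including existing %-escapes, are passed through untouched.

```python
def iri_fragment_to_uri(fragment: str) -> str:
    """Percent-escape the UTF-8 bytes of each non-ASCII character (hypothetical helper)."""
    out = []
    for ch in fragment:
        if ord(ch) < 0x80:
            # ASCII is left as-is; this sketch does not validate it.
            out.append(ch)
        else:
            # Encode as UTF-8, then %-escape each resulting byte.
            out.extend(f"%{b:02X}" for b in ch.encode("utf-8"))
    return "".join(out)
```

For example, the fragment "café" would come out as "caf%C3%A9", which is the form that survives a round trip through a URI-only medium.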
Gecko's URI objects automatically put the string they're given into at
least the IRI form, so spaces are always %-escaped.
> Given:
>
> <a href="#a a">...</a>
>
> ...my brief testing suggests that IE would look for:
>
> name="a a"
>
> ...while Safari would look for:
>
> name="a%20a"
>
> It appears Gecko would look for either.
As far as I can tell, correct, including situations like foo.html
containing <a href="bar.html#a a">.... It's not clear to me whether
IE's behavior is compatible with the abovementioned RFCs, but it might
be if it never actually constructs a URI or anything like that.
Also note that if the valid URI file:///tmp/test.html#a%20a is pasted
into the URL bar in IE it will not find the name="a a" anchor, while if
the invalid URI "file:///tmp/test.html#a a" is pasted it will.
IE will also treat an IRI and its equivalent URI differently in terms of
matching target anchors, last I checked. That is, since it does exact
matching on the fragid it will either match the URI (%-escaped) version
or the IRI (direct ucschar) version but not both.
> Given:
>
> <a href="#a%20a">...</a>
>
> ...my brief testing suggests that IE and Safari would look for:
>
> name="a%20a"
>
> ...while Gecko would again look for either that or:
>
> name="a a"
Yes, I believe that is correct.
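To make the two cases above concrete, here is a toy Python model of which anchor names each browser would look for, based only on the observations reported in this thread. The function names are invented, the escaping rule is an assumption (spaces and non-ASCII get %-escaped, existing escapes are kept), and none of this is taken from any browser's source.

```python
from urllib.parse import quote, unquote

# Characters left alone when escaping: "%" (so existing escapes survive)
# plus the RFC 2396 reserved and mark characters.
_SAFE = "%;/?:@&=+$,!*'()~-_."


def escape_fragid(fragid: str) -> str:
    """Assumed escaping step: %-escape anything not allowed in a URI fragment."""
    return quote(fragid, safe=_SAFE)


def targets_ie(fragid: str) -> set:
    # Reported: exact match on the fragment as written.
    return {fragid}


def targets_safari(fragid: str) -> set:
    # Reported: matches only the %-escaped form.
    return {escape_fragid(fragid)}


def targets_gecko(fragid: str) -> set:
    # Reported: matches either the escaped or the unescaped form.
    escaped = escape_fragid(fragid)
    return {escaped, unquote(escaped)}
```

With href="#a a" this model gives {"a a"} for IE, {"a%20a"} for Safari, and both names for Gecko; with href="#a%20a" IE and Safari both give {"a%20a"} and Gecko again gives both.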
>> It seems like it would lead to odd effects when getElementsByName is
>> used, but maybe that's ok.
>
> That section doesn't seem to suggest actually changing the actual
> attribute, unless I'm misreading it.
It seems to, to me (and seems to be author-directed, not ua-directed, now
that I read it again).
> "Let /decoded fragid/ be the result of expanding any sequences of
> percent-encoded octets in fragid that are valid UTF-8 sequences into
> Unicode characters as defined by UTF-8. If any percent-encoded octets in
> that string are not valid UTF-8 sequences, then skip this step and the
> next one."
>
> (The next step is the one that looks for ids that equal /decoded fragid/.)
OK. I just looked through the CVS history of our "fall back to URI
origin charset" codepath, and don't actually see an obvious reason it
was done that way other than the general "better to match too much here
than too little" philosophy... So always using UTF-8 is probably fine.
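The always-UTF-8 decoding step quoted above could be sketched in Python roughly like this. The name decode_fragid is made up, and returning None is just this sketch's way of signalling "skip this step and the next one"; it is not wording from the spec.

```python
from typing import Optional
from urllib.parse import unquote_to_bytes


def decode_fragid(fragid: str) -> Optional[str]:
    """Expand %-escapes as UTF-8; return None if they are not valid UTF-8."""
    # unquote_to_bytes turns %XX into raw bytes and encodes any literal
    # non-ASCII characters as UTF-8.
    raw = unquote_to_bytes(fragid)
    try:
        # Strict decoding: any invalid sequence means the step is skipped.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return None
```

So "a%20a" decodes to "a a" and "caf%C3%A9" to "café", while "%E9" (a lone Latin-1 byte) yields None and no lookup by decoded fragid happens.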
-Boris
Received on Wednesday, 18 February 2009 16:10:22 UTC