- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Wed, 18 Feb 2009 11:09:33 -0500
- To: Ian Hickson <ian@hixie.ch>
- CC: HTML WG <public-html@w3.org>
Ian Hickson wrote:
> It's not clear to me why anything would get canonicalized in the fragment
> identifier. IE doesn't canonicalize anything, and none of the specs seem
> to expect the fragment identifier to be canonicalized.

RFC 2396 (URI) says:

  fragment   = *uric
  uric       = reserved | unreserved | escaped
  reserved   = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" | "$" | ","
  unreserved = alphanum | mark
  mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
  escaped    = "%" hex hex

alphanum has the usual [A-Za-z0-9] expansion.

Per RFC 2396, ASCII space is not allowed unescaped inside a fragment identifier. No non-ASCII byte is allowed in a fragment identifier.

RFC 3987 (IRI) says:

  ifragment   = *( ipchar / "/" / "?" )
  ipchar      = iunreserved / pct-encoded / sub-delims / ":" / "@"
  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")" / "*" / "+" / "," / ";" / "="
  iunreserved = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

ucschar is defined as various Unicode character ranges (but not all of Unicode; for example iprivate is not allowed here). ALPHA and DIGIT are actually not defined in the grammar, but the only sane assumption for them is [A-Za-z] and [0-9] respectively.

Again, ASCII space is not allowed unescaped inside a fragment identifier.

When converting to a URI (e.g. for placement into an HTTP request), ucschar must be encoded as UTF-8 and %-escaped. So any IRI that round-trips via such a medium will have its fragment identifier converted to URI-compatible form.

Gecko's URI objects automatically put the string they're given into at least the IRI form, so spaces are always %-escaped.

> Given:
>
>    <a href="#a a">...</a>
>
> ...my brief testing suggests that IE would look for:
>
>    name="a a"
>
> ...while Safari would look for:
>
>    name="a%20a"
>
> It appears Gecko would look for either.

As far as I can tell, correct, including situations like foo.html containing <a href="bar.html#a a">....
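As a rough illustration of the IRI-to-URI conversion described above (UTF-8 encode, then %-escape), here is a minimal Python sketch. The function name iri_fragment_to_uri_fragment is hypothetical, and the safe-character set is an assumption taken from the sub-delims, ":", "@", "/" and "?" productions quoted above, plus "%" so that existing escapes pass through untouched. It does not validate stray "%" characters and is not a description of Gecko's actual code.

```python
from urllib.parse import quote

# Characters left unescaped: sub-delims, ":", "@", "/", "?" and "%"
# (the last one so already-escaped sequences are not double-escaped).
# Letters, digits, "-", ".", "_" and "~" are never escaped by quote().
FRAGMENT_SAFE = "!$&'()*+,;=:@/?%"

def iri_fragment_to_uri_fragment(fragment: str) -> str:
    """Hypothetical helper: UTF-8 encode and %-escape everything else."""
    return quote(fragment, safe=FRAGMENT_SAFE, encoding="utf-8")
```

With this sketch, "a a" becomes "a%20a", an already-escaped "a%20a" is left alone, and a ucschar such as "é" becomes its %-escaped UTF-8 bytes.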
It's not clear to me whether IE's behavior is compatible with the abovementioned RFCs, but it might be if it never actually constructs a URI or anything like that.

Also note that if the valid URI file:///tmp/test.html#a%20a is pasted into the URL bar in IE it will not find the name="a a" anchor, while if the invalid URI "file:///tmp/test.html#a a" is pasted it will.

IE will also treat an IRI and its equivalent URI differently in terms of matching target anchors, last I checked. That is, since it does exact matching on the fragid it will either match the URI (%-escaped) version or the IRI (direct ucschar) version but not both.

> Given:
>
>    <a href="#a%20a">...</a>
>
> ...my brief testing suggests that IE and Safari would look for:
>
>    name="a%20a"
>
> ...while Gecko would again look for either that or:
>
>    name="a a"

Yes, I believe that is correct.

>> It seems like it would lead to odd effects when getElementsByName is
>> used, but maybe that's ok.
>
> That section doesn't seem to suggest actually changing the actual
> attribute, unless I'm misreading it.

It seems to me that it does (and it seems to be author-directed, not ua-directed, now that I read it again).

> "Let /decoded fragid/ be the result of expanding any sequences of
> percent-encoded octets in fragid that are valid UTF-8 sequences into
> Unicode characters as defined by UTF-8. If any percent-encoded octets in
> that string are not valid UTF-8 sequences, then skip this step and the
> next one."
>
> (The next step is the one that looks for ids that equal /decoded fragid/.)

OK. I just looked through the CVS history of our "fall back to URI origin charset" codepath, and don't actually see an obvious reason it was done that way other than the general "better to match too much here than too little" philosophy... So always using UTF-8 is probably fine.

-Boris
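For illustration, the quoted "decoded fragid" step combined with the lenient "look for either" matching discussed in this message could be sketched in Python as below. The names decode_fragid and find_anchor are hypothetical, and the order of the lookups (decoded form first, then the raw fragid) is an assumption. Decoding the whole string at once is a simplification of the quoted wording, and none of this is any browser's actual implementation.

```python
from typing import Iterable, Optional
from urllib.parse import unquote_to_bytes

def decode_fragid(fragid: str) -> Optional[str]:
    """Expand %-escapes as UTF-8; return None if the bytes are not valid UTF-8."""
    try:
        return unquote_to_bytes(fragid).decode("utf-8")
    except UnicodeDecodeError:
        return None  # "skip this step and the next one"

def find_anchor(fragid: str, anchor_names: Iterable[str]) -> Optional[str]:
    """Return the first anchor name equal to the decoded or the raw fragid."""
    names = set(anchor_names)
    # Assumed order: try the decoded form first, then fall back to the raw fragid.
    for candidate in (decode_fragid(fragid), fragid):
        if candidate is not None and candidate in names:
            return candidate
    return None
```

Under these assumptions, the fragid "a%20a" matches either name="a a" or name="a%20a", while a fragid such as "%FF" skips the decoded lookup and can only match literally.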
Received on Wednesday, 18 February 2009 16:10:22 UTC