- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 18 Feb 2009 05:30:25 +0000 (UTC)
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- Cc: HTML WG <public-html@w3.org>
On Tue, 17 Feb 2009, Boris Zbarsky wrote:
> Ian Hickson wrote:
> > Instead, I have made HTML5 require id="" attributes to be matched
> > after decoding the fragment identifier, and name="" attributes to be
> > matched before decoding the fragment identifier.
>
> How does that work when the fragment identifier contains non-ASCII
> characters, or spaces, which end up canonicalized as escaped UTF-8 in
> URIs?

It's not clear to me why anything would get canonicalized in the
fragment identifier. IE doesn't canonicalize anything, and none of the
specs seem to expect the fragment identifier to be canonicalized.

> Since no such canonicalization happens for name attribute values, that
> would effectively mean that they never match... In fact, the space
> issue is why Gecko unescapes the fragment identifier of the URI; see
> <https://bugzilla.mozilla.org/show_bug.cgi?id=46190>. What's general UA
> behavior here?

Given:

   <a href="#a a">...</a>

...my brief testing suggests that IE would look for:

   name="a a"

...while Safari would look for:

   name="a%20a"

It appears Gecko would look for either.

Given:

   <a href="#a%20a">...</a>

...my brief testing suggests that IE and Safari would look for:

   name="a%20a"

...while Gecko would again look for either that or:

   name="a a"

(Opera isn't considered here since apparently Opera's behaviour, which
would be to escape the space in the URI like Gecko and Safari, but then
unescape it again per spec when looking for attributes, is incompatible
with Web content which expects #%2F to match name="%2F".)

> That said, HTML4 appendix B section B.2.1 does suggest URL-escaping name
> attributes of <a> that contain non-ASCII characters. Do some UAs do
> that?

Apparently Gecko, Safari, and Opera do what this section suggests, yes.
IE apparently does not. (This section effectively amounts to doing what
the IRI spec suggests.)

It would be helpful if browser vendors (especially Microsoft) could
comment here.
I'm happy to change the spec to be closer to what Gecko does, if that's
what browser vendors are going to implement.

> It seems like it would lead to odd effects when getElementsByName is
> used, but maybe that's ok.

That section doesn't seem to suggest changing the attribute itself,
unless I'm misreading it.

> A related question: when unescaping, what encoding is used to convert
> the resulting bytes to Unicode?

The spec says, of looking for id="" attributes:

   "Let /decoded fragid/ be the result of expanding any sequences of
   percent-encoded octets in fragid that are valid UTF-8 sequences into
   Unicode characters as defined by UTF-8. If any percent-encoded octets
   in that string are not valid UTF-8 sequences, then skip this step and
   the next one."

(The next step is the one that looks for ids that equal /decoded fragid/.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
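
For illustration only, the matching rule described in this message (id="" compared after decoding, name="" compared before decoding) could be sketched roughly as below. This is a simplified reading of the quoted spec text, not the spec's algorithm in full; the function names decode_fragid and find_target, and the representation of a document as two sets of strings, are assumptions made for the example.

```python
import re
from typing import Optional, Set, Tuple

# One or more consecutive percent-encoded octets, e.g. "%E2%98%BA".
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def decode_fragid(fragid: str) -> Optional[str]:
    """Expand percent-encoded octets as UTF-8 (hypothetical helper).

    Returns None when any percent-encoded run is not valid UTF-8,
    mirroring "skip this step and the next one" in the quoted text.
    """
    def expand(match: "re.Match[str]") -> str:
        raw = bytes.fromhex(match.group(0).replace("%", ""))
        return raw.decode("utf-8")  # raises UnicodeDecodeError if invalid

    try:
        return _PCT_RUN.sub(expand, fragid)
    except UnicodeDecodeError:
        return None


def find_target(fragid: str, ids: Set[str],
                names: Set[str]) -> Optional[Tuple[str, str]]:
    """Look for id="" after decoding, then name="" before decoding."""
    decoded = decode_fragid(fragid)
    if decoded is not None and decoded in ids:
        return ("id", decoded)
    if fragid in names:  # raw, undecoded comparison
        return ("name", fragid)
    return None
```

Under this sketch, href="#a%20a" would find id="a a" or name="a%20a" but not name="a a", and href="#%2F" would still find name="%2F", which is the Web-content expectation mentioned above.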
Received on Wednesday, 18 February 2009 05:31:00 UTC