Re: fragid navigation and pct-encoded

On Tue, 17 Feb 2009, Boris Zbarsky wrote:
> Ian Hickson wrote:
> > Instead, I have made HTML5 require id="" attributes to be matched 
> > after decoding the fragment identifier, and name="" attributes to be 
> > matched before decoding the fragment identifier.
> 
> How does that work when the fragment identifier contains non-ASCII 
> characters, or spaces, which end up canonicalized as escaped UTF-8 in 
> URIs?

It's not clear to me why anything would get canonicalized in the fragment 
identifier. IE doesn't canonicalize anything, and none of the specs seem 
to expect the fragment identifier to be canonicalized.


> Since no such canonicalization happens for name attribute values, that 
> would effectively mean that they never match...  In fact, the space 
> issue is why Gecko unescapes the fragment identifier of the URI; see 
> <https://bugzilla.mozilla.org/show_bug.cgi?id=46190>.  What's general UA 
> behavior here?

Given:

   <a href="#a a">...</a>

...my brief testing suggests that IE would look for:

   name="a a"

...while Safari would look for:

   name="a%20a"

It appears Gecko would look for either.

Given:

   <a href="#a%20a">...</a>

...my brief testing suggests that IE and Safari would look for:

   name="a%20a"

...while Gecko would again look for either that or:

   name="a a"

(Opera isn't considered here. Its behaviour is apparently to escape the 
space in the URI, as Gecko and Safari do, but then to unescape it again 
per spec when looking for attributes, which is incompatible with Web 
content that expects #%2F to match name="%2F".)
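
To make the comparison above easier to follow, here is a rough Python 
model of which name="" values each engine would look for. It only 
restates the brief test results above and is not how any browser is 
actually implemented. The function name candidate_names and the engine 
labels are made up for illustration.

```python
from urllib.parse import quote, unquote


def candidate_names(engine: str, href_fragment: str) -> set[str]:
    """Return the name="" values a given engine would appear to look for.

    Hypothetical model of the observations in this message only.
    """
    # Escape characters such as the space, leaving existing %HH escapes alone.
    escaped = quote(href_fragment, safe="%")
    if engine == "ie":
        # IE appears to use the fragment exactly as written.
        return {href_fragment}
    if engine == "safari":
        # Safari appears to escape the fragment and compare the escaped form.
        return {escaped}
    if engine == "gecko":
        # Gecko appears to accept either the escaped or the unescaped form.
        return {escaped, unquote(escaped)}
    if engine == "opera":
        # Opera apparently escapes, then unescapes again before matching.
        return {unquote(escaped)}
    raise ValueError(f"unknown engine: {engine}")


for engine in ("ie", "safari", "gecko", "opera"):
    print(engine, sorted(candidate_names(engine, "a a")),
          sorted(candidate_names(engine, "a%20a")))
```

In this model the Opera branch maps "%2F" to "/", so it never matches 
name="%2F", which is the incompatibility mentioned above.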


> That said, HTML4 appendix B section B.2.1 does suggest URL-escaping name 
> attributes of <a> that contain non-ASCII characters. Do some UAs do 
> that?

Apparently Gecko, Safari, and Opera do what this section suggests, yes. IE 
apparently does not. (This section effectively amounts to doing what the 
IRI spec suggests.)

It would be helpful if browser vendors (especially Microsoft) could 
comment here. I'm happy to change the spec to be closer to what Gecko 
does, if that's what browser vendors are going to implement.


> It seems like it would lead to odd effects when getElementsByName is 
> used, but maybe that's ok.

That section doesn't seem to suggest changing the attribute value 
itself, unless I'm misreading it.


> A related question: when unescaping, what encoding is used to convert 
> the resulting bytes to Unicode?

The spec says, of looking for id="" attributes:

"Let /decoded fragid/ be the result of expanding any sequences of 
percent-encoded octets in fragid that are valid UTF-8 sequences into 
Unicode characters as defined by UTF-8. If any percent-encoded octets in 
that string are not valid UTF-8 sequences, then skip this step and the 
next one."

(The next step is the one that looks for ids that equal /decoded fragid/.)

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 18 February 2009 05:31:00 UTC