Re: fragid navigation and pct-encoded from Ian Hickson on 2009-04-25 (public-html@w3.org from April 2009)

From: Ian Hickson <ian@hixie.ch>
Date: Sat, 25 Apr 2009 21:57:18 +0000 (UTC)
To: Boris Zbarsky <bzbarsky@MIT.EDU>
Cc: HTML WG <public-html@w3.org>
Message-ID: <Pine.LNX.4.62.0904252152160.10370@hixie.dreamhostps.com>

On Wed, 18 Feb 2009, Boris Zbarsky wrote:
> Ian Hickson wrote:
> > It's not clear to me why anything would get canonicalized in the fragment
> > identifier. IE doesn't canonicalize anything, and none of the specs seem to
> > expect the fragment identifier to be canonicalized.
> 
> RFC 2396 (URI) says:
> 
>   fragment      = *uric
>   uric          = reserved | unreserved | escaped
>   reserved    = ";" | "/" | "?" | ":" | "@" | "&" | "=" | "+" |
>                 "$" | ","
>   unreserved  = alphanum | mark
>   mark        = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
>   escaped     = "%" hex hex
> 
> alphanum has the usual [A-Za-z0-9] expansion.
> 
> Per RFC 2396, ASCII space is not allowed unescaped inside a fragment
> identifier.  No non-ASCII byte is allowed in a fragment identifier.
> 
> RFC 3987 (IRI) says:
> 
>   ifragment      = *( ipchar / "/" / "?" )
>   ipchar         = iunreserved / pct-encoded / sub-delims / ":" / "@"
>   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
>                        / "*" / "+" / "," / ";" / "="
>   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
> 
> ucschar is defined as various Unicode character ranges (but not all of
> Unicode; for example iprivate is not allowed here).  ALPHA and DIGIT are
> actually not defined in the grammar, but the only sane assumption for them is
> [A-Za-z] and [0-9] respectively.
> 
> Again, ASCII space is not allowed unescaped inside a fragment identifier.
> When converting to a URI (e.g. for placement into an HTTP request), ucschar
> must be encoded as UTF-8 and %-escaped.  So any IRI that round-trips via such
> a medium will have its fragment identifier converted to URI-compatible form.
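
To make the conversion step in that last quoted paragraph concrete, here is a minimal Python sketch. The function name is hypothetical, and it assumes the input is already a syntactically valid ifragment, so existing %-escapes and the ASCII characters the quoted grammar allows are passed through untouched:

```python
from urllib.parse import quote

# ASCII characters allowed unescaped by the ifragment rule quoted above
# (sub-delims, ":", "@", "/", "?"), plus "%" so existing escapes survive.
# Letters, digits, "-", ".", "_" and "~" are never escaped by quote().
_IFRAGMENT_SAFE = "%!$&'()*+,;=:@/?"

def iri_fragment_to_uri_fragment(ifragment):
    """Encode non-ASCII characters as UTF-8 and %-escape them (hypothetical helper)."""
    return quote(ifragment, safe=_IFRAGMENT_SAFE, encoding="utf-8")

print(iri_fragment_to_uri_fragment("caf\u00e9"))
print(iri_fragment_to_uri_fragment("a%20a"))
```

Under exact string matching, the input and output of the first call are two different fragment identifiers, even though one is the URI form of the other.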

Just because the URL is invalid doesn't mean it has to be canonicalised. 
There are plenty of other URLs that are syntactically invalid that Gecko 
doesn't fix up, for example:

   http://example.com/%
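
As a rough illustration of "invalid but left alone", this Python sketch checks a string against the RFC 2396 fragment grammar quoted above and only reports whether it is valid; it never rewrites anything. The names are made up for this example, and the lone "%" is tested here as a fragment purely for illustration:

```python
import re

# uric = reserved | unreserved | escaped, following the quoted RFC 2396 grammar.
_URIC = r"(?:[;/?:@&=+$,]|[A-Za-z0-9]|[-_.!~*'()]|%[0-9A-Fa-f]{2})"
_FRAGMENT = re.compile(rf"{_URIC}*")

def is_valid_rfc2396_fragment(fragment):
    """Return True if the whole string matches fragment = *uric (hypothetical helper)."""
    return _FRAGMENT.fullmatch(fragment) is not None

for candidate in ["a%20a", "a a", "%"]:
    # Validity is reported; the string itself is passed through unchanged.
    print(repr(candidate), is_valid_rfc2396_fragment(candidate))
```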


> > Given:
> > 
> >    <a href="#a a">...</a>
> > 
> > ...my brief testing suggests that IE would look for:
> > 
> >    name="a a"
> > 
> > ...while Safari would look for:
> > 
> >    name="a%20a"
> > 
> > It appears Gecko would look for either.
> 
> As far as I can tell, correct, including situations like foo.html containing
> <a href="bar.html#a a">....  It's not clear to me whether IE's behavior is
> compatible with the abovementioned RFCs, but it might be if it never actually
> constructs a URI or anything like that.

The RFCs don't say how to do error handling, so they're somewhat 
irrelevant here.


> IE will also treat an IRI and its equivalent URI differently in terms of
> matching target anchors, last I checked.  That is, since it does exact
> matching on the fragid it will either match the URI (%-escaped) version or the
> IRI (direct ucschar) version but not both.

Right, that's basically what this shows:

> > Given:
> > 
> >    <a href="#a%20a">...</a>
> > 
> > ...my brief testing suggests that IE and Safari would look for:
> > 
> >    name="a%20a"
> > 
> > ...while Gecko would again look for either that or:
> > 
> >    name="a a"
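
The observations in the two quoted test cases can be summarised with a toy Python model. This is only a sketch that reproduces the results of the brief testing described above; it is not a description of what any of these browsers actually implement, and the function name and browser labels are invented for the example:

```python
from urllib.parse import quote, unquote

def candidate_names(fragment, browser):
    """Return the name="" values the toy model would match for a fragment."""
    # Escape characters such as space, but leave existing %XX sequences alone.
    escaped = quote(fragment, safe="%")
    if browser == "ie":
        # Exact match on the fragment as written.
        return {fragment}
    if browser == "safari":
        # Matches the escaped form only.
        return {escaped}
    if browser == "gecko":
        # Matches the fragment as written, its escaped form, or its unescaped form.
        return {fragment, escaped, unquote(fragment)}
    raise ValueError("unknown browser label: " + browser)

for frag in ["a a", "a%20a"]:
    for browser in ["ie", "safari", "gecko"]:
        print(frag, browser, sorted(candidate_names(frag, browser)))
```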

Anyway. Is the algorithm at:

   http://www.whatwg.org/specs/web-apps/current-work/#the-indicated-part-of-the-document

satisfactory?


> > > It seems like it would lead to odd effects when getElementsByName is 
> > > used, but maybe that's ok.
> > 
> > That section doesn't seem to suggest actually changing the actual 
> > attribute, unless I'm misreading it.
> 
> It seems to, to me (and seems to be author-directed, not ua-directed, now 
> that I read it again).

I don't understand.

-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
Received on Saturday, 25 April 2009 21:57:53 UTC