RE: How browsers display URIs with %-encoding (Opera/Firefox FAIL) from Phillips, Addison on 2011-07-26 (public-iri@w3.org from July 2011)

From: Phillips, Addison <addison@lab126.com>
Date: Tue, 26 Jul 2011 08:54:35 -0700
To: Boris Zbarsky <bzbarsky@MIT.EDU>, "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <131F80DEA635F044946897AFDA9AC3476A94986470@EX-SEA31-D.ant.amazon.com>
(Excuse me for top-posting)

I think I agree with your comment on the %FC case, Boris: if you have an explicit byte escape, one has to assume that the explicit escape is what is desired.

The other problem this thread is trying to deal with, though, is the use of "real" IRIs. If you have "/Dürst" in your path, the u-umlaut character might be represented in the server's file system by the byte 0xFC, but it's really the character U+00FC and should become the two-byte UTF-8 sequence 0xC3 0xBC (that is, %C3%BC) when the IRI is converted to a URI. The fact that legacy servers expect bytes in some unknown encoding is actually a problem that IRI helps to solve. Yes, those bytes generally encode characters and not random binary stuff. But by using a well-known encoding (UTF-8) on the wire, the server has a chance of being able to decode the bytes to use in the path: the server has better information about its filesystems than does the remote user.
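
To make the difference concrete, here is a small sketch in Python. It assumes nothing beyond the standard library's urllib.parse.quote, and the path is just the example from this thread. The same character produces different escapes depending on which encoding is applied before percent-encoding:

```python
from urllib.parse import quote

path = "/Dürst"

# IRI-to-URI conversion: encode as UTF-8, then percent-encode the bytes.
utf8_uri = quote(path, safe="/", encoding="utf-8")

# What a legacy Latin-1 page or server would produce for the same character.
latin1_uri = quote(path, safe="/", encoding="iso-8859-1")

print(utf8_uri)    # /D%C3%BCrst
print(latin1_uri)  # /D%FCrst
```

Only the first form is what the IRI rules call for; the second is a byte escape whose meaning depends on knowing the server's encoding.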

One problem: if you unescape a URI into an IRI, you lose the original byte value. There is nothing that says that the %FC URI has to appear in a Latin-1 page. It could just as well be a link pasted into a Japanese document in Shift_JIS. Or in a UTF-8 document.
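
A quick way to see the ambiguity, sketched in Python (the helper name try_decode is mine, purely for illustration): the same escaped byte decodes differently, or not at all, depending on which encoding the reader assumes.

```python
from urllib.parse import unquote_to_bytes

# The explicit escape yields one raw byte, 0xFC, with no encoding attached.
raw = unquote_to_bytes("D%FCrst")

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are invalid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# In ISO-8859-1 the byte is u-umlaut; in UTF-8 a lone 0xFC is never valid,
# so try_decode returns None. Shift_JIS is printed for comparison only.
for encoding in ("iso-8859-1", "shift_jis", "utf-8"):
    print(encoding, repr(try_decode(raw, encoding)))
```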

I guess what I'm saying is: 

1. The IRI rules should apply when going from Unicode string to URI (encode as UTF-8, then percent-encode, up to the query)

2. The URI rules should apply when going from URI to IRI (decode to bytes)

  a. if the bytes form a valid UTF-8 sequence (they should in an IRI), then interpret the sequence of bytes as characters. That is, %C3%BC -> ü

 ["valid UTF-8" means "a byte sequence that completely decodes to a sequence of Unicode code points according to the rules in Unicode including the non-shortest-form rule." Unlike most encodings, the bit pattern of UTF-8 allows one to say if a sequence is malformed.]

  b. If the bytes do NOT form a valid UTF-8 sequence, then interpret the sequence of bytes as bytes. User agents may choose to apply an arbitrary character encoding (such as the page encoding) to the display of the URI, but MUST NOT reinterpret the bytes as characters in an IRI. That is, %FC -> 0xFC (which may be displayed as u-umlaut in a Latin-1 context). Copy or follow operations use the original %FC. The author must have put that byte there for a reason---if they meant u-umlaut, they would have used an IRI. Yes, this has implications for the address bar of browsers.
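
Rule 2 could be sketched roughly as follows in Python. This is only an illustration under some assumptions of mine, not a proposal for spec text: the function name uri_to_iri is made up, validity is checked per run of consecutive escapes rather than per URI, escapes of ASCII bytes (such as %2F) are left alone so that delimiters do not change meaning, and characters that an IRI should not carry even when validly encoded (controls, for example) are not filtered.

```python
import re

# A run of one or more percent-escapes whose byte values are >= 0x80.
_NON_ASCII_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def uri_to_iri(uri: str) -> str:
    """Unescape only those escape runs that are valid UTF-8 (rule 2a);
    leave every other escape exactly as written (rule 2b)."""
    def convert(match):
        escaped = match.group(0)
        raw = bytes.fromhex(escaped.replace("%", ""))
        try:
            # Strict decoding rejects malformed and non-shortest-form sequences.
            return raw.decode("utf-8")
        except UnicodeDecodeError:
            # Not UTF-8: keep the original byte escapes untouched.
            return escaped
    return _NON_ASCII_RUN.sub(convert, uri)

print(uri_to_iri("/People/D%C3%BCrst"))  # /People/Dürst
print(uri_to_iri("/People/D%FCrst"))     # /People/D%FCrst
```

Whatever a user agent chooses to draw on screen for the second case is a separate display step; the string used for copy or follow remains the one with %FC in it.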

Addison

> -----Original Message-----
> From: public-iri-request@w3.org [mailto:public-iri-request@w3.org] On Behalf
> Of Boris Zbarsky
> Sent: Tuesday, July 26, 2011 6:08 AM
> To: "Martin J. Dürst"; PUBLIC-IRI@W3.ORG
> Subject: Re: How browsers display URIs with %-encoding (Opera/Firefox FAIL)
> 
> On 7/25/11 2:25 AM, "Martin J. Dürst" wrote:
> >> In this case, the status bar (and url bar, for that matter) display
> >> here is done by Firefox UI code, not Gecko, for what it's worth.
> >> There are various situations in which the UI code "unescapes" when
> >> the core code would not, and once you start doing that you have to
> >> start guessing at encodings. Right now Gecko does NOT implement IRI
> >> across the board (and last I checked when we tried it broke sites).
> >
> > I can understand that you might not yet implement IRIs everywhere. But
> > the case I showed you, displaying "/People/Dürst" for
> > "/People/D%FCrst", seems to be a clear bug, because "/People/D%FCrst"
> > goes somewhere else than "/People/Dürst" when entered in a
> > location/address bar.
> 
> Stop.  Back up.  When you enter it in the location bar, you're entering Unicode,
> which is then encoded as UTF-8 on the wire.  What the page contains is an
> explicit escape.  I think you're making the mistake that user _input_ somehow
> needs to match what the user is shown.  This is a nice property to have, but
> there are other considerations here too.  See below.
> 
> > If you think it's not a bug, can you explain why?
> 
> See below.
> 
> > Imagine a server where all the resource names are encoded in
> > iso-8859-1 (or any other legacy (single-byte) encoding). What you tell
> > http (or whatever other scheme/protocol) by using %FC is that you want
> > the resource with the name with the <0xFC> byte in it.
> 
> Precisely.
> 
> Now the thing is, servers speak bytes but humans do NOT.  So when such a URI
> is used in a web page and the browser UI wants to present that URI to the user
> it has two options:
> 
> 1)  Leave all %-escapes as they are.  This generally makes URIs on many web
> pages out there (which by and large use %-escapes, not actual Unicode chars)
> look like gibberish to users.
> 
> 2)  Unescape %-escapes.  That gives you bytes and then you have to worry
> about how to convert those bytes to chars, since users don't deal with bytes.
> 
> Option 1 leads to a _really_ bad user experience.  So in practice you want
> option 2.  At this point you have several choices.  You can only convert "valid
> UTF-8 sequences" (whatever that means) to chars and reescape all other bytes.
> This seems to be what you would prefer to happen.  Or you can take a guess at
> what the page "really means" (e.g. by treating the bytes as bytes in the page
> encoding).  This is what Firefox does.
> Turns out, this commonly _is_ what the page really means, because lots of
> pages and servers do in fact use legacy encodings as you describe above.  And
> if the page and server use the same legacy encoding and you unescape the %-
> escapes and treat the result as bytes in that legacy encoding, then the string
> you present to the user is fairly sensible: it is in fact the Unicode string
> that was encoded on the page, and is
> very likely what the person saving the file on the server saw as the filename
> when they saved the file.  There may be other options here with more
> complicated heuristics that decide when to have which of those behaviors; this
> seems like something UAs should be able to tailor as needed based on their
> users' expectations and what gives the best user experience, and doesn't seem
> to be something that specifications should be defining, just like they don't
> define exactly what browsers should show in their url bar.
> 
> Your original post in this thread also had this bit:
> 
> > but this would not be interoperable  because if I copy
> 
> etc.  But if you _copy_ the anchor href in this document in Firefox:
> 
>    <!DOCTYPE html>
>    <meta charset="ISO-8859-1">
>    <a href="http://www.w3.org/People/D%FCrst">Click me</a>
> 
> then the string you get is "http://www.w3.org/People/D%FCrst", precisely
> because of the round-tripping concerns.  No display-unescaping is done on copy.
> And you can't copy the string that's shown in the status bar when you mouse
> over a link, so there are no interop constraints on what that status bar shows.
> 
> There's the open question of what the URL bar shows after you click such a link;
> hard to test in this case because of the server's redirect behavior here.
> 
> Anyway, the upshot of the whole thing is that what's shown to the user needs to
> be the thing that makes sense to the user, while what's communicated to other
> software needs to be what makes sense to software.  The mapping between the
> two can be complicated and not time-invariant (also not the same for all
> users!) and hence imo should not be fixed by a spec.
> 
> Hope that helps,
> Boris
Received on Tuesday, 26 July 2011 15:55:08 UTC