Re: How browsers display URIs with %-encoding (Opera/Firefox FAIL) from Boris Zbarsky on 2011-07-26 (public-iri@w3.org from July 2011)

From: Boris Zbarsky <bzbarsky@MIT.EDU>
Date: Tue, 26 Jul 2011 09:08:15 -0400
To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Message-ID: <4E2EBC3F.5050006@mit.edu>
On 7/25/11 2:25 AM, "Martin J. Dürst" wrote:
>> In this case, the status bar (and url bar, for that matter) display here
>> is done by Firefox UI code, not Gecko, for what it's worth. There are
>> various situations in which the UI code "unescapes" when the core code
>> would not, and once you start doing that you have to start guessing at
>> encodings. Right now Gecko does NOT implement IRI across the board (and
>> last I checked when we tried it broke sites).
>
> I can understand that you might not yet implement IRIs everywhere. But
> the case I showed you, displaying "/People/Dürst" for "/People/D%FCrst",
> seems to be a clear bug, because "/People/D%FCrst" goes somewhere else
> than "/People/Dürst" when entered in a location/address bar.

Stop.  Back up.  When you enter it in the location bar, you're entering 
Unicode, which is then encoded as UTF-8 on the wire.  What the page 
contains is an explicit escape.  I think you're making the mistake of 
assuming that user _input_ somehow needs to match what the user is 
shown.  This is a nice property to have, but there are other 
considerations here too.  See below.
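
To make the difference concrete, here is a minimal sketch in Python (my choice of language for illustration, nothing to do with how Firefox is implemented). It assumes the typed path is percent-encoded as UTF-8, as described above, and compares the resulting bytes with the explicit escape from the page. The variable names are made up for the example.

```python
from urllib.parse import quote, unquote_to_bytes

# What the user types in the location bar: Unicode text.
typed = "/People/Dürst"

# Assumption for this sketch: typed text goes out as percent-encoded UTF-8.
on_wire = quote(typed)          # quote() encodes as UTF-8 by default

# What the page contains: an explicit escape for the single byte 0xFC.
in_page = "/People/D%FCrst"

typed_bytes = unquote_to_bytes(on_wire)
page_bytes = unquote_to_bytes(in_page)

print(on_wire)                    # /People/D%C3%BCrst
print(typed_bytes == page_bytes)  # False: two different byte strings
```

The two strings name different byte sequences on the server, even though both can reasonably be shown to a human as "/People/Dürst".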

> If you think it's not a bug, can you explain why?

See below.

> Imagine a server where all the resource names are encoded in
> iso-8859-1 (or any other legacy (single-byte) encoding). What you
> tell http (or whatever other scheme/protocol) by using %FC is that
> you want the resource with the name with the <0xFC> byte in it.

Precisely.

Now the thing is, servers speak bytes but humans do NOT.  So when such a 
URI is used in a web page and the browser UI wants to present that URI 
to the user it has two options:

1)  Leave all %-escapes as they are.  This generally makes URIs on many 
web pages out there (which by and large use %-escapes, not actual 
Unicode chars) look like gibberish to users.

2)  Unescape %-escapes.  That gives you bytes and then you have to worry 
about how to convert those bytes to chars, since users don't deal with 
bytes.

Option 1 leads to a _really_ bad user experience.  So in practice you 
want option 2.  At this point you have several choices.  You can convert 
only "valid UTF-8 sequences" (whatever that means) to chars and 
re-escape all other bytes.  This seems to be what you would prefer to 
happen.  Or you can take a guess at what the page "really means" (e.g. 
by treating the bytes as bytes in the page encoding).  This is what 
Firefox does.

Turns out, this commonly _is_ what the page really means, because lots 
of pages and servers do in fact use legacy encodings as you describe 
above.  And if the page and server use the same legacy encoding and you 
unescape the %-escapes and treat the result as bytes in that legacy 
encoding, then the string you present to the user is fairly sensible: it 
is in fact the Unicode string that was encoded on the page, and is very 
likely what the person saving the file on the server saw as the filename 
when they saved the file.

There may be other options here with more complicated heuristics that 
decide when to have which of those behaviors; this seems like something 
UAs should be able to tailor as needed based on their users' 
expectations and what gives the best user experience, and doesn't seem 
to be something that specifications should be defining, just like they 
don't define exactly what browsers should show in their URL bar.
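
The two display choices can be sketched roughly as follows. This is an illustration in Python under simplifying assumptions, not Firefox's actual UI code: the function names are hypothetical, and a real UI would also need to leave reserved delimiters such as %2F escaped, which this sketch ignores.

```python
from urllib.parse import unquote_to_bytes

def display_strict_utf8(url: str) -> str:
    """Show only valid UTF-8 sequences as chars; re-escape every other byte."""
    raw = unquote_to_bytes(url)
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN, so invalid bytes can be found again afterwards.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        f"%{ord(ch) - 0xDC00:02X}" if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )

def display_page_encoding(url: str, page_encoding: str) -> str:
    """Guess that the escaped bytes are text in the page's own encoding."""
    return unquote_to_bytes(url).decode(page_encoding, errors="replace")

href = "/People/D%FCrst"
print(display_strict_utf8(href))                  # /People/D%FCrst
print(display_page_encoding(href, "iso-8859-1"))  # /People/Dürst
```

With the first choice the user sees the raw escape; with the second the user sees the name that was most likely meant, at the cost of the displayed string not being something you can type back in and reach the same resource.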

Your original post in this thread also had this bit:

> but this would not be interoperable  because if I copy

etc.  But if you _copy_ the anchor href in this document in Firefox:

   <!DOCTYPE html>
   <meta charset="ISO-8859-1">
   <a href="http://www.w3.org/People/D%FCrst">Click me</a>

then the string you get is "http://www.w3.org/People/D%FCrst", precisely 
because of the round-tripping concerns.  No display-unescaping is done 
on copy.  And you can't copy the string that's shown in the status bar 
when you mouse over a link, so there are no interop constraints on what 
that status bar shows.

There's the open question of what the URL bar shows after you click such 
a link; hard to test in this case because of the server's redirect 
behavior here.

Anyway, the upshot of the whole thing is that what's shown to the user 
needs to be the thing that makes sense to the user, while what's 
communicated to other software needs to be what makes sense to software. 
  The mapping between the two can be complicated and not time-invariant 
(also not the same for all users!) and hence imo should not be fixed by 
a spec.

Hope that helps,
Boris
Received on Tuesday, 26 July 2011 13:08:57 UTC