Re: How browsers display URIs with %-encoding (Opera/Firefox FAIL)

Hello Boris,

On 2011/07/26 22:08, Boris Zbarsky wrote:
> On 7/25/11 2:25 AM, "Martin J. Dürst" wrote:
>>> In this case, the status bar (and url bar, for that matter) display here
>>> is done by Firefox UI code, not Gecko, for what it's worth. There are
>>> various situations in which the UI code "unescapes" when the core code
>>> would not, and once you start doing that you have to start guessing at
>>> encodings. Right now Gecko does NOT implement IRI across the board (and
>>> last I checked when we tried it broke sites).
>>
>> I can understand that you might not yet implement IRIs everywhere. But
>> the case I showed you, displaying "/People/Dürst" for "/People/D%FCrst",
>> seems to be a clear bug, because "/People/D%FCrst" goes somewhere else
>> than "/People/Dürst" when entered in a location/address bar.
>
> Stop. Back up. When you enter it in the location bar, you're entering
> Unicode, which is then encoded as UTF-8 on the wire. What the page
> contains is an explicit escape. I think you're making the mistake that
> user _input_ somehow needs to match what the user is shown. This is a
> nice property to have, but there are other considerations here too. See
> below.


Glad to see you are saying it's a nice property.
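To make the distinction Boris draws concrete, here is a small sketch in Python (my illustration, not any browser's code): typing "Dürst" into a location bar puts UTF-8 on the wire, which is not the same URI as the explicit escape %FC found in the page.

```python
from urllib.parse import quote, unquote_to_bytes

# What the user types gets UTF-8 percent-encoded on the wire.
typed = quote("Dürst")
print(typed)                            # D%C3%BCrst

# What the page contains is an explicit (here Latin-1) escape,
# i.e. a different byte sequence, hence a different resource.
page_escape = "D%FCrst"
print(unquote_to_bytes(page_escape))    # b'D\xfcrst'
```

So "/People/D%FCrst" and "/People/Dürst" (as typed) really do go to different places, which is the bug I was pointing at.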


>> If you think it's not a bug, can you explain why?
>
> See below.
>
>> Imagine a server where all the resource names are encoded in
>> iso-8859-1 (or any other legacy (single-byte) encoding). What you
>> tell http (or whatever other scheme/protocol) by using %FC is that
>> you want the resource with the name with the <0xFC> byte in it.
>
> Precisely.
>
> Now the thing is, servers speak bytes but humans do NOT. So when such a
> URI is used in a web page and the browser UI wants to present that URI
> to the user it has two options:
>
> 1) Leave all %-escapes as they are. This generally makes URIs on many
> web pages out there (which by and large use %-escapes, not actual
> Unicode chars) look like gibberish to users.
>
> 2) Unescape %-escapes. That gives you bytes and then you have to worry
> about how to convert those bytes to chars, since users don't deal with
> bytes.
>
> Option 1 leads to a _really_ bad user experience. So in practice you
> want option 2. At this point you have several choices. You can only
> convert "valid UTF-8 sequences" (whatever that means) to chars and
> reescape all other bytes. This seems to be what you would prefer to
> happen.

Yes. This seems to be suboptimal locally. But it will lead to better 
interoperability and a better user experience globally.
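For clarity, the policy I'm advocating can be sketched like this (a simplified illustration, decoding the whole path at once, whereas a real browser would work per component):

```python
from urllib.parse import unquote_to_bytes

def display_form(uri_path: str) -> str:
    """Unescape %-escapes only if the resulting bytes are valid
    UTF-8; otherwise leave the original escapes untouched."""
    raw = unquote_to_bytes(uri_path)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return uri_path

print(display_form("/People/D%C3%BCrst"))   # /People/Dürst  (valid UTF-8)
print(display_form("/People/D%FCrst"))      # /People/D%FCrst (not UTF-8)
```

This way, what the user sees is exactly what they can re-enter and get the same resource.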


> Or you can take a guess at what the page "really means" (e.g. by
> treating the bytes as bytes in the page encoding). This is what Firefox
> does. Turns out, this commonly _is_ what the page really means, because
> lots of pages and servers do in fact use legacy encodings as you
> describe above. And if the page and server use the same legacy encoding
> and you unescape the %-escapes and treat the result as bytes in that
> legacy encoding, then the string you present to the user is fairly
> sensible: it is in fact the Unicode string that was encoded on the page,
> and is very likely what the person saving the file on the server saw as
> the filename when they saved the file.

I agree with 'very likely'. But there are lots of ifs. In essence, it 
just gives the user a warm, fuzzy feeling locally, while in a bigger 
context, it is going to fail badly (and disappoint and confuse the user). 
Let's say the user looks at the displayed URI, decides it looks good, and 
puts it on a business card. Then a business partner tries to reach that 
page. It will fail, because typing the displayed form sends UTF-8, not 
the legacy bytes. I guess nobody would call that a good user experience.
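The business-card failure mode is easy to demonstrate (again my sketch, with iso-8859-1 standing in for the guessed page encoding):

```python
from urllib.parse import quote, unquote_to_bytes

# The browser guesses the page encoding when displaying the URI...
shown = unquote_to_bytes("D%FCrst").decode("iso-8859-1")
print(shown)                 # Dürst -- what goes on the business card

# ...but anyone re-entering that string sends UTF-8 on the wire.
retyped = quote(shown)
print(retyped)               # D%C3%BCrst -- not D%FCrst: wrong resource
```

The round trip display -> paper -> location bar silently changes the URI.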


> There may be other options here
> with more complicated heuristics that decide when to have which of those
> behaviors; this seems like something UAs should be able to tailor as
> needed based on their users' expectations and what gives the best user
> experience, and doesn't seem to be something that specifications should
> be defining, just like they don't define exactly what browsers should
> show in their url bar.

The argument to not standardize UA aspects is usually a very good one, 
and I usually agree with it wholeheartedly. However, URIs and IRIs are 
about more than 'on-the-wire' interoperability, they are supposed to 
work over the phone and on the side of a bus.

As for the address/location bar, I think the main reason there's no 
standard for it is that no such thing was needed up to now. At least as 
long as we were in an ASCII world, it just showed the URI it used to get 
the page. If there are cases where it didn't, I didn't notice, but you 
might know some cases.


> Your original post in this thread also had this bit:
>
>> but this would not be interoperable because if I copy
>
> etc. But if you _copy_ the anchor href in this document in Firefox:
>
> <!DOCTYPE html>
> <meta charset="ISO-8859-1">
> <a href="http://www.w3.org/People/D%FCrst">Click me</a>
>
> then the string you get is "http://www.w3.org/People/D%FCrst", precisely
> because of the round-tripping concerns. No display-unescaping is done on
> copy.

That's great. But when I wrote 'copy', I was also talking about copying 
by hand (or via paper, etc.). I should have been more explicit about that.


> And you can't copy the string that's shown in the status bar when
> you mouse over a link, so there are no interop constraints on what that
> status bar shows.

I can't copy that string with a copy command, but I can of course copy 
that down onto a napkin. Which leads to interop problems.


> There's the open question of what the URL bar shows after you click such
> a link; hard to test in this case because of the server's redirect
> behavior here.

A quick check here indicates that it shows D%FCrst. That's the right 
thing to do, but it will confuse the user, who was just shown 'Dürst' in 
the status bar. Being consistent and always showing D%FCrst would be 
less confusing.


> Anyway, the upshot of the whole thing is that what's shown to the user
> needs to be the thing that makes sense to the user, while what's
> communicated to other software needs to be what makes sense to software.
> The mapping between the two can be complicated and not time-invariant
> (also not the same for all users!) and hence imo should not be fixed by
> a spec.

People use software, and software gets used by people. The two have to 
work together to get the job done. If there's something that makes sense 
to the user but not to software, then something is wrong, because the 
user is going to use it again in software, and then it's going to stop 
making sense for the user, too.

IRIs were designed to make sense to the user and to make sense in 
software. The problem is that that's only possible if we nail down the 
encoding for the conversion (to UTF-8 as it happens), and therewith give 
up on converting for other encodings. We definitely tried to figure out 
something that would work better, but we couldn't. If you can come up 
with something better, that would be great (you might want to look at 
Appendix A of RFC 3987 for some approaches that we thought about).
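The nailed-down conversion is simple enough to state in a few lines. Here is a rough sketch of the RFC 3987 IRI-to-URI mapping for illustration (simplified: a full implementation handles each URI component separately):

```python
from urllib.parse import quote

def iri_to_uri(iri: str) -> str:
    """Encode non-ASCII characters as UTF-8 and percent-escape them,
    leaving URI syntax characters and existing escapes alone."""
    return quote(iri, safe="/:?#[]@!$&'()*+,;=%~-._")

print(iri_to_uri("http://www.w3.org/People/Dürst"))
# http://www.w3.org/People/D%C3%BCrst
```

Because the encoding is fixed to UTF-8, this mapping is the same everywhere, so the IRI on the business card and the URI on the wire always correspond.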

Regards,   Martin.

Received on Wednesday, 27 July 2011 01:04:41 UTC