- From: Boris Zbarsky <bzbarsky@MIT.EDU>
- Date: Tue, 26 Jul 2011 09:08:15 -0400
- To: "Martin J. Dürst" <duerst@it.aoyama.ac.jp>, "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
On 7/25/11 2:25 AM, "Martin J. Dürst" wrote:
>> In this case, the status bar (and url bar, for that matter) display here
>> is done by Firefox UI code, not Gecko, for what it's worth. There are
>> various situations in which the UI code "unescapes" when the core code
>> would not, and once you start doing that you have to start guessing at
>> encodings. Right now Gecko does NOT implement IRI across the board (and
>> last I checked when we tried it broke sites).
>
> I can understand that you might not yet implement IRIs everywhere. But
> the case I showed you, displaying "/People/Dürst" for "/People/D%FCrst",
> seems to be a clear bug, because "/People/D%FCrst" goes somewhere else
> than "/People/Dürst" when entered in a location/address bar.

Stop. Back up. When you enter it in the location bar, you're entering Unicode, which is then encoded as UTF-8 on the wire. What the page contains is an explicit escape.

I think you're making the mistake that user _input_ somehow needs to match what the user is shown. This is a nice property to have, but there are other considerations here too. See below.

> If you think it's not a bug, can you explain why?

See below.

> Imagine a server where all the resource names are encoded in
> iso-8859-1 (or any other legacy (single-byte) encoding). What you
> tell http (or whatever other scheme/protocol) by using %FC is that
> you want the resource with the name with the <0xFC> byte in it.

Precisely.

Now the thing is, servers speak bytes but humans do NOT. So when such a URI is used in a web page and the browser UI wants to present that URI to the user it has two options:

1) Leave all %-escapes as they are. This generally makes URIs on many web pages out there (which by and large use %-escapes, not actual Unicode chars) look like gibberish to users.

2) Unescape %-escapes. That gives you bytes and then you have to worry about how to convert those bytes to chars, since users don't deal with bytes.
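The difference between typing Unicode into the location bar and following an explicit %-escape can be checked with a minimal Python sketch. It uses only the standard `urllib.parse` module and is not how Firefox does it; the variable names `typed` and `explicit` are illustrative assumptions.

```python
from urllib.parse import quote, unquote_to_bytes

# Typing "/People/Dürst" into a location bar: the Unicode string is
# encoded as UTF-8 and then percent-escaped (quote() defaults to UTF-8).
typed = quote("/People/Dürst", safe="/")

# The explicit escape written in the page names a single 0xFC byte.
explicit = unquote_to_bytes("/People/D%FCrst")

print(typed)                                # /People/D%C3%BCrst
print(explicit)                             # b'/People/D\xfcrst'
print(unquote_to_bytes(typed) == explicit)  # False: different bytes on the wire
```

The two requests carry different bytes, so a server that stores names in iso-8859-1 sees two different resource names. Note also that unescaping yields bytes, not characters, which is where the encoding guess comes in.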
Option 1 leads to a _really_ bad user experience. So in practice you want option 2.

At this point you have several choices. You can only convert "valid UTF-8 sequences" (whatever that means) to chars and reescape all other bytes. This seems to be what you would prefer to happen. Or you can take a guess at what the page "really means" (e.g. by treating the bytes as bytes in the page encoding). This is what Firefox does.

Turns out, this commonly _is_ what the page really means, because lots of pages and servers do in fact use legacy encodings as you describe above. And if the page and server use the same legacy encoding and you unescape the %-escapes and treat the result as bytes in that legacy encoding, then the string you present to the user is fairly sensible: it is in fact the Unicode string that was encoded on the page, and is very likely what the person saving the file on the server saw as the filename when they saved the file.

There may be other options here with more complicated heuristics that decide when to have which of those behaviors; this seems like something UAs should be able to tailor as needed based on their users' expectations and what gives the best user experience, and doesn't seem to be something that specifications should be defining, just like they don't define exactly what browsers should show in their url bar.

Your original post in this thread also had this bit:

> but this would not be interoperable because if I copy etc.

But if you _copy_ the anchor href in this document in Firefox:

  <!DOCTYPE html>
  <meta charset="ISO-8859-1">
  <a href="http://www.w3.org/People/D%FCrst">Click me</a>

then the string you get is "http://www.w3.org/People/D%FCrst", precisely because of the round-tripping concerns. No display-unescaping is done on copy. And you can't copy the string that's shown in the status bar when you mouse over a link, so there are no interop constraints on what that status bar shows.
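To make the two display choices described above concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not Firefox's actual code: the function names `display_utf8_only` and `display_page_encoding` are hypothetical, and only escapes for non-ASCII bytes are touched.

```python
import re
from urllib.parse import unquote_to_bytes

# Runs of percent-escapes for non-ASCII bytes (0x80-0xFF). ASCII escapes
# such as %2F or %20 are left alone, since unescaping them can change meaning.
HIGH_ESCAPES = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")

def display_utf8_only(url: str) -> str:
    """Show valid UTF-8 sequences as chars; re-escape every other byte."""
    def fix(match):
        raw = unquote_to_bytes(match.group(0))
        # surrogateescape maps each undecodable byte 0xNN to U+DCNN.
        text = raw.decode("utf-8", errors="surrogateescape")
        return "".join(
            "%{:02X}".format(ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
            for ch in text
        )
    return HIGH_ESCAPES.sub(fix, url)

def display_page_encoding(url: str, page_encoding: str) -> str:
    """Guess that the escaped bytes are in the page's encoding."""
    def fix(match):
        raw = unquote_to_bytes(match.group(0))
        try:
            return raw.decode(page_encoding)
        except UnicodeDecodeError:
            return match.group(0)  # leave the escapes as written
    return HIGH_ESCAPES.sub(fix, url)

href = "http://www.w3.org/People/D%FCrst"
print(display_utf8_only(href))                    # http://www.w3.org/People/D%FCrst
print(display_page_encoding(href, "iso-8859-1"))  # http://www.w3.org/People/Dürst
```

Either function only produces a display string. The href itself is left untouched, which matches the copy behavior described above.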
There's the open question of what the URL bar shows after you click such a link; hard to test in this case because of the server's redirect behavior here.

Anyway, the upshot of the whole thing is that what's shown to the user needs to be the thing that makes sense to the user, while what's communicated to other software needs to be what makes sense to software. The mapping between the two can be complicated and not time-invariant (also not the same for all users!) and hence imo should not be fixed by a spec.

Hope that helps,
Boris
Received on Tuesday, 26 July 2011 13:08:57 UTC