- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 27 Jul 2011 10:03:11 +0900
- To: Boris Zbarsky <bzbarsky@MIT.EDU>
- CC: "PUBLIC-IRI@W3.ORG" <PUBLIC-IRI@w3.org>
Hello Boris,

On 2011/07/26 22:08, Boris Zbarsky wrote:
> On 7/25/11 2:25 AM, "Martin J. Dürst" wrote:
>>> In this case, the status bar (and url bar, for that matter) display here
>>> is done by Firefox UI code, not Gecko, for what it's worth. There are
>>> various situations in which the UI code "unescapes" when the core code
>>> would not, and once you start doing that you have to start guessing at
>>> encodings. Right now Gecko does NOT implement IRI across the board (and
>>> last I checked when we tried it broke sites).
>>
>> I can understand that you might not yet implement IRIs everywhere. But
>> the case I showed you, displaying "/People/Dürst" for "/People/D%FCrst",
>> seems to be a clear bug, because "/People/D%FCrst" goes somewhere else
>> than "/People/Dürst" when entered in a location/address bar.
>
> Stop. Back up. When you enter it in the location bar, you're entering
> Unicode, which is then encoded as UTF-8 on the wire. What the page
> contains is an explicit escape. I think you're making the mistake that
> user _input_ somehow needs to match what the user is shown. This is a
> nice property to have, but there are other considerations here too. See
> below.

Glad to see you are saying it's a nice property.

>> If you think it's not a bug, can you explain why?
>
> See below.
>
>> Imagine a server where all the resource names are encoded in
>> iso-8859-1 (or any other legacy (single-byte) encoding). What you
>> tell http (or whatever other scheme/protocol) by using %FC is that
>> you want the resource with the name with the <0xFC> byte in it.
>
> Precisely.
>
> Now the thing is, servers speak bytes but humans do NOT. So when such a
> URI is used in a web page and the browser UI wants to present that URI
> to the user it has two options:
>
> 1) Leave all %-escapes as they are. This generally makes URIs on many
> web pages out there (which by and large use %-escapes, not actual
> Unicode chars) look like gibberish to users.
>
> 2) Unescape %-escapes.
> That gives you bytes and then you have to worry
> about how to convert those bytes to chars, since users don't deal with
> bytes.
>
> Option 1 leads to a _really_ bad user experience. So in practice you
> want option 2. At this point you have several choices. You can only
> convert "valid UTF-8 sequences" (whatever that means) to chars and
> reescape all other bytes. This seems to be what you would prefer to
> happen.

Yes. This may seem suboptimal locally, but it will lead to better
interoperability and a better user experience globally.

> Or you can take a guess at what the page "really means" (e.g. by
> treating the bytes as bytes in the page encoding). This is what Firefox
> does. Turns out, this commonly _is_ what the page really means, because
> lots of pages and servers do in fact use legacy encodings as you
> describe above. And if the page and server use the same legacy encoding
> and you unescape the %-escapes and treat the result as bytes in that
> legacy encoding, then the string you present to the user is fairly
> sensible: it is in fact the Unicode string that was encoded on the page,
> and is very likely what the person saving the file on the server saw as
> the filename when they saved the file.

I agree with 'very likely'. But there are lots of ifs. In essence, it's
just giving the user a warm, fuzzy feeling locally, while in a bigger
context, it's going to fail badly (and disappoint and confuse the user).
Let's say the user looks at it, decides it looks good, and puts it on a
business card. And let's say a business partner then tries to reach that
page. It will fail. I guess nobody would call that a good user experience.
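To make the two display strategies concrete, here is a small Python sketch (my own illustration, not Gecko's actual code): one function implements the "only decode valid UTF-8, re-escape everything else" behavior argued for above, and `urllib.parse.unquote` with an explicit encoding stands in for the Firefox-style guess at the page's legacy encoding.

```python
from urllib.parse import unquote, unquote_to_bytes

def display_utf8_only(uri_path: str) -> str:
    """Unescape %-escapes, but only where the bytes form valid UTF-8;
    re-escape every other byte."""
    raw = unquote_to_bytes(uri_path)
    out, i = [], 0
    while i < len(raw):
        for length in (1, 2, 3, 4):          # UTF-8 chars are 1-4 bytes
            try:
                out.append(raw[i:i + length].decode("utf-8"))
                i += length
                break
            except UnicodeDecodeError:
                continue
        else:
            out.append("%%%02X" % raw[i])    # not valid UTF-8: keep escaped
            i += 1
    return "".join(out)

# Strategy preferred above: only valid UTF-8 is shown as characters.
print(display_utf8_only("D%FCrst"))      # 'D%FCrst'  (0xFC alone is not UTF-8)
print(display_utf8_only("D%C3%BCrst"))   # 'Dürst'

# Firefox-style guess: interpret the bytes in the page's (legacy) encoding.
print(unquote("D%FCrst", encoding="iso-8859-1"))  # 'Dürst'
```

Both strategies show "Dürst" for a UTF-8-escaped URI; they differ only for legacy-encoded escapes like %FC, which is exactly the case under discussion.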
> There may be other options here
> with more complicated heuristics that decide when to have which of those
> behaviors; this seems like something UAs should be able to tailor as
> needed based on their users' expectations and what gives the best user
> experience, and doesn't seem to be something that specifications should
> be defining, just like they don't define exactly what browsers should
> show in their url bar.

The argument not to standardize UA aspects is usually a very good one,
and I usually agree with it wholeheartedly. However, URIs and IRIs are
about more than 'on-the-wire' interoperability; they are supposed to work
over the phone and on the side of a bus.

As for the address/location bar, I think the main reason there's no
standard for it is that no such thing was needed up to now. At least as
long as we were in an ASCII world, it just showed the URI it used to get
the page. If there are cases where it didn't, I didn't notice, but you
might know some cases.

> Your original post in this thread also had this bit:
>
>> but this would not be interoperable because if I copy
>
> etc. But if you _copy_ the anchor href in this document in Firefox:
>
> <!DOCTYPE html>
> <meta charset="ISO-8859-1">
> <a href="http://www.w3.org/People/D%FCrst">Click me</a>
>
> then the string you get is "http://www.w3.org/People/D%FCrst", precisely
> because of the round-tripping concerns. No display-unescaping is done on
> copy.

That's great. But when I wrote 'copy', I was also speaking about copying
by hand (or via paper, ...). I should have been more explicit about that.

> And you can't copy the string that's shown in the status bar when
> you mouse over a link, so there are no interop constraints on what that
> status bar shows.

I can't copy that string with a copy command, but I can of course copy it
down onto a napkin. Which leads to interop problems.
> There's the open question of what the URL bar shows after you click such
> a link; hard to test in this case because of the server's redirect
> behavior here.

A quick check here indicates that it shows D%FCrst. That's the right
thing to do, but it will confuse your user. Being consistent and always
showing D%FCrst would be less confusing.

> Anyway, the upshot of the whole thing is that what's shown to the user
> needs to be the thing that makes sense to the user, while what's
> communicated to other software needs to be what makes sense to software.
> The mapping between the two can be complicated and not time-invariant
> (also not the same for all users!) and hence imo should not be fixed by
> a spec.

People use software, and software gets used by people. The two have to
work together to get the job done. If there's something that makes sense
to the user but not to software, then something is wrong, because the
user is going to use it again in software, and then it's going to stop
making sense for the user, too.

IRIs were designed to make sense to the user and to make sense in
software. The problem is that that's only possible if we nail down the
encoding for the conversion (to UTF-8, as it happens), and therewith give
up on converting for other encodings. We definitely tried to figure out
something that would work better, but we couldn't. If you can come up
with something better, that would be great (you might want to look at
Appendix A of RFC 3987 for some approaches that we thought about).

Regards,   Martin.
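The round-tripping point can be sketched in a few lines of Python: an IRI converted to a URI via UTF-8 (as RFC 3987 specifies) survives being displayed and re-entered, while the legacy %FC form does not, which is exactly the business-card failure described above.

```python
from urllib.parse import quote, unquote

# RFC 3987 maps an IRI to a URI by percent-encoding non-ASCII characters
# as UTF-8 -- one fixed encoding, so display form and wire form round-trip.
iri_path = "/People/D\u00fcrst"            # /People/Dürst
uri_path = quote(iri_path, safe="/")       # '/People/D%C3%BCrst'
assert unquote(uri_path) == iri_path       # round-trips under UTF-8

# The legacy form does NOT round-trip: someone who reads the displayed
# 'Dürst' off a business card and re-types it produces a different URI
# than the original %FC one.
legacy_uri = "/People/D%FCrst"
retyped = quote(unquote(legacy_uri, encoding="iso-8859-1"), safe="/")
print(retyped)                             # '/People/D%C3%BCrst', not %FC
```

This is the trade-off named above: nailing the conversion to UTF-8 buys the round-trip property at the cost of giving up unescaped display for legacy-encoded URIs.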
Received on Wednesday, 27 July 2011 01:04:41 UTC