Re: How browsers display URIs with %-encoding (Opera/Firefox FAIL) from Martin J. Dürst on 2011-07-27 (public-iri@w3.org from July 2011)

From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Date: Wed, 27 Jul 2011 19:15:37 +0900
To: Bjoern Hoehrmann <derhoermi@gmx.net>
CC: "public-iri@w3.org" <public-iri@w3.org>
Message-ID: <4E2FE549.1040800@it.aoyama.ac.jp>

Hello Björn,

On 2011/07/27 11:00, Bjoern Hoehrmann wrote:
> * Martin J. Dürst wrote:
>> The idea is that because %-encoding in URIs has to be interpreted as
>> UTF-8 when converting to IRIs [...]
>
> Converting `data:image/png,...%C3%B6...` to `data:image/png,...ö...`
> is semantically wrong, there is no character "ö" in this, it's just
> bytes.

Agreed.

> Sure, if you use UTF-8 and don't unicode-normalize, you can
> round-trip in this manner, but that doesn't make it any more right.

Yes. But then you'd also have to say that data:image/png,...abcd... is 
wrong, and it should be data:image/png,...%61%62%63%64..., which is 
possible, but nobody is doing that because they want their data URIs to 
be short.

> If you have `http://.../%C3%B6` the situation is no different, there
> is no reason for `%C3%B6` to actually mean `ö` in any sense beyond
> round-tripping, "converting to IRIs" may be wrong in some situations.

It may indeed be wrong in some situations. But the chance that it's
right is quite high, first because of the properties of UTF-8 bit
patterns (see
http://www.ifi.unizh.ch/arvo/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf),
and second because the Web on various levels is moving towards UTF-8.
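
To illustrate the bit-pattern point, here is a minimal sketch in Python
(the language and the function name `looks_like_utf8` are only chosen
for illustration, they are not from any spec). It assumes that a strict
UTF-8 decode of the %-decoded bytes is an acceptable stand-in for "these
octets were probably meant as characters":

```python
from urllib.parse import unquote_to_bytes


def looks_like_utf8(percent_encoded: str) -> bool:
    """Guess whether the %xx octets in a URI component are UTF-8 text.

    Hypothetical helper: returns True only if there is at least one
    non-ASCII octet and the whole octet sequence is well-formed UTF-8.
    """
    raw = unquote_to_bytes(percent_encoded)
    # Pure ASCII carries no evidence either way, so do not claim a match.
    if all(b < 0x80 for b in raw):
        return False
    try:
        raw.decode("utf-8")  # strict decoding rejects malformed sequences
    except UnicodeDecodeError:
        return False
    return True


print(looks_like_utf8("%C3%B6"))  # "ö" in UTF-8
print(looks_like_utf8("%F6"))     # "ö" in ISO-8859-1, not valid UTF-8
```

Non-ASCII text in a legacy encoding usually fails the strict decode,
which is why a successful decode is reasonably strong evidence. It is
still only a guess and can be wrong for arbitrary binary data.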

> I do understand what outcome you desire, but I do not understand how
> you would get around this problem short of one or more of, accepting
> wrong results like in the data: case above, relying on complicated
> and probably unreliable heuristics, or abandoning the idea that some
> of the time %xx sequences stand for octets while at other times they
> stand for characters (turned into bytes by some character encoding).

As for data:, base64 is explicitly allowed and usually shorter (see
http://jimbojw.com/wiki/index.php?title=Data_URIs_and_Inline_Images)
unless the data is text in the first place, and people prefer short
data: URIs (it's gibberish anyway).
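
The length claim can be checked with a small sketch like the following
(Python; the helper name `data_uri_lengths` and the
application/octet-stream media type are assumptions made for the
example). It builds both forms of a data: URI for the same payload and
returns their lengths:

```python
import base64
from typing import Tuple
from urllib.parse import quote_from_bytes


def data_uri_lengths(payload: bytes) -> Tuple[int, int]:
    """Return (length with %-encoding, length with base64).

    Hypothetical helper for comparison only.
    """
    prefix = "data:application/octet-stream"
    # safe="" so that everything except unreserved characters is escaped.
    percent = prefix + "," + quote_from_bytes(payload, safe="")
    b64 = prefix + ";base64," + base64.b64encode(payload).decode("ascii")
    return len(percent), len(b64)


binary = bytes(range(256))   # every octet value once
text = b"hello world"        # mostly unreserved ASCII

print(data_uri_lengths(binary))
print(data_uri_lengths(text))
```

For the binary payload the base64 form comes out shorter, because most
octets cost three characters each when %-encoded. For the short ASCII
text the %-encoded form is shorter.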

As for heuristics, they are often quite reliable.

And yes, %xx sequences sometimes (actually most of the time) stand for
characters (via an encoding), and occasionally for bytes that have no
connection with characters. That has been the case since the first
URI/URL spec.

> I argued for the last option eight years ago, unsuccessfully, and I
> do not like the first option. Do you think about this in terms of
> the heuristics option and are saying the heuristics are not perfect,
> or is there some other dimension to it? In your example you discuss
> this only in terms of round-tripping, but that is not how I look at
> this at all -- I want to get away from talking about bytes here.

There was a proposal to use something like %uXXXX (XXXX being a
hexadecimal Unicode code point value) in URIs, and I guess that would
have been to your liking. Some version of JavaScript even supported
it, and it's probably still there because such stuff dies slowly. I
assume something like this is what you are talking about. However,
that didn't work well, because the server side (in particular Apache
on Unix/Linux) essentially works with bytes and not with characters.
That discussion took place roughly 14 years ago.
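
As a rough sketch of the difference (Python; `percent_u_escape` is a
hypothetical function written for this example, and treating code
points beyond U+FFFF as simply more hex digits is an assumption, not
something the proposal defined):

```python
from urllib.parse import quote, unquote_to_bytes

UNRESERVED = "-._~"


def percent_u_escape(text: str) -> str:
    """Escape non-unreserved characters as %uXXXX (code point in hex).

    Hypothetical illustration of the character-based scheme.
    """
    out = []
    for ch in text:
        if ch.isascii() and (ch.isalnum() or ch in UNRESERVED):
            out.append(ch)
        else:
            out.append("%u{:04X}".format(ord(ch)))
    return "".join(out)


# Character-based form: names a code point, says nothing about bytes.
print(percent_u_escape("ö"))

# Byte-based form: UTF-8 octets, which a server can use as they are.
utf8_form = quote("ö", safe="")
print(utf8_form)
print(unquote_to_bytes(utf8_form))
```

With the %uXXXX form, a byte-oriented server still has to pick an
encoding before it can compare the result with a file name. With the
%xx form the octets are already there.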

Regards,    Martin.

Received on Wednesday, 27 July 2011 10:17:02 UTC