- From: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Date: Wed, 27 Jul 2011 19:15:37 +0900
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- CC: "public-iri@w3.org" <public-iri@w3.org>
Hello Björn,

On 2011/07/27 11:00, Bjoern Hoehrmann wrote:
> * Martin J. Dürst wrote:
>> The idea is that because %-encoding in URIs has to be interpreted as
>> UTF-8 when converting to IRIs [...]
>
> Converting `data:image/png,...%C3%B6...` to `data:image/png,...ö...`
> is semantically wrong, there is no character "ö" in this, it's just
> bytes.

Agreed.

> Sure, if you use UTF-8 and don't unicode-normalize, you can
> round-trip in this manner, but that doesn't make it any more right.

Yes. But then you'd also have to say that data:image/png,...abcd... is
wrong, and it should be data:image/png,...%61%62%63%64..., which is
possible, but nobody is doing that because they want their data URIs to
be short.

> If you have `http://.../%C3%B6` the situation is no different, there
> is no reason for `%C3%B6` to actually mean `ö` in any sense beyond
> round-tripping, "converting to IRIs" may be wrong in some situations.

It may indeed be wrong in some situations. But the chance it's right
gets quite a bit higher, first because of the properties of UTF-8 bit
patterns (see
http://www.ifi.unizh.ch/arvo/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf),
and second because the Web on various levels is moving towards UTF-8.

> I do understand what outcome you desire, but I do not understand how
> you would get around this problem short of one or more of, accepting
> wrong results like in the data: case above, relying on complicated
> and probably unreliable heuristics, or abandoning the idea that some
> of the time %xx sequences stand for octets while at other times they
> stand for characters (turned into bytes by some character encoding).

As for data:, base64 is explicitly possible and usually shorter (see
http://jimbojw.com/wiki/index.php?title=Data_URIs_and_Inline_Images)
unless the data is text in the first place, and people prefer short
data: URIs (it's gibberish anyway).

As for heuristics, they are often quite reliable.
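
[Editorial illustration, not part of the original message.] One heuristic
of the kind discussed here can be sketched as follows: unescape a run of
%xx sequences only when its bytes form strictly valid UTF-8, and keep
everything else escaped. The function name `unescape_if_utf8` and the
example URIs are hypothetical. This is a minimal sketch and not the
conversion procedure of any specification; a full URI-to-IRI conversion
would also have to keep certain characters escaped (for example, bidi
formatting characters).

```python
import re

# A run of one or more consecutive %XX escapes.
_ESCAPE_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def unescape_if_utf8(uri: str) -> str:
    """Turn %XX runs into characters only if the bytes are valid UTF-8.

    Hypothetical helper for illustration. ASCII bytes are re-emitted as
    escapes (in upper-case hex), since they may stand for reserved
    characters such as "/" whose escaping must be preserved.
    """

    def replace(match: re.Match) -> str:
        run = match.group(0)
        raw = bytes.fromhex(run.replace("%", ""))
        try:
            # Strict decoding rejects stray continuation bytes, overlong
            # forms, and surrogates, which random binary data usually hits.
            text = raw.decode("utf-8")
        except UnicodeDecodeError:
            return run  # not UTF-8: treat as plain bytes, leave untouched
        return "".join(
            ch if ord(ch) >= 0x80 else "%{:02X}".format(ord(ch))
            for ch in text
        )

    return _ESCAPE_RUN.sub(replace, uri)


print(unescape_if_utf8("http://example.org/%C3%B6"))  # valid UTF-8 pair
print(unescape_if_utf8("http://example.org/%F6"))     # lone byte, not UTF-8
```

Note that the sketch would also turn `%C3%B6` inside a data: URI into
"ö", which is exactly the false positive raised in the quoted text; the
check only tells you that the bytes could be UTF-8, not that they were
meant as characters.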
And yes, %xx sequences stand sometimes (actually most of the time) for
characters (via an encoding), and occasionally for bytes that have no
connection with characters. That has been the case since the first
URI/URL spec.

> I argued for the last option eight years ago, unsuccessfully, and I
> do not like the first option. Do you think about this in terms of
> the heuristics option and are saying the heuristics are not perfect,
> or is there some other dimension to it? In your example you discuss
> this only in terms of round-tripping, but that is not how I look at
> this at all -- I want to get away from talking about bytes here.

There was a proposal to use something like %uXXXX (XXXX being a
hexadecimal Unicode code point value) in URIs, and I guess that would
have been to your liking. There was even some support for that in some
version of JavaScript, I guess it's still there because such stuff dies
slowly. I guess it's something like this that you are talking about.

However, that didn't work well, because the server side (in particular
Apache on Unix/Linux) essentially works with bytes and not with
characters. That discussion was done something like 14 or so years ago.

Regards,   Martin.
Received on Wednesday, 27 July 2011 10:17:02 UTC