Re: UTF-8 in URIs

Julian,

Consider a file named "exposé.html" served by www.example.com. The URI for this file can be encoded in many different ways depending on the character set and (for Unicode) the normalization form used, for example:

    ISO-8859-1      http://www.example.com/expos%E9.html

    UTF-8 NFD       http://www.example.com/expose%CC%81.html

    UTF-8 NFC       http://www.example.com/expos%C3%A9.html
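
For reference, a short Python 3 sketch that reproduces the three
encodings above (the filename is just the example from this message):

    import unicodedata
    from urllib.parse import quote

    name = "expos\u00e9.html"   # "exposé.html" with precomposed U+00E9

    # ISO-8859-1: é is the single octet 0xE9
    print(quote(name.encode("iso-8859-1")))    # expos%E9.html

    # UTF-8 NFD: "e" + combining acute (U+0301) -> octets CC 81
    nfd = unicodedata.normalize("NFD", name)
    print(quote(nfd.encode("utf-8")))          # expose%CC%81.html

    # UTF-8 NFC: precomposed é (U+00E9) -> octets C3 A9
    nfc = unicodedata.normalize("NFC", name)
    print(quote(nfc.encode("utf-8")))          # expos%C3%A9.html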

Today, you have no guarantee that typing "http://www.example.com/exposé.html" in your web browser will work, since the browser's choice of character set and normalization form may not match the server's, and the server may not be able to guess correctly.  An intermediate proxy will likewise have difficulty caching the content efficiently and correctly.
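
As a minimal sketch of the kind of normalization an intermediate cache
could apply (assuming the octets are UTF-8, with an ISO-8859-1 fallback;
a real implementation would also have to leave reserved characters such
as %2F alone), all three spellings above can be reduced to one key:

    import unicodedata
    from urllib.parse import quote, unquote_to_bytes

    def cache_key(path):
        # Percent-decode to the raw octets
        octets = unquote_to_bytes(path)
        # Guess UTF-8 first, fall back to ISO-8859-1 (illustration only)
        try:
            text = octets.decode("utf-8")
        except UnicodeDecodeError:
            text = octets.decode("iso-8859-1")
        # Normalize to NFC and re-encode one consistent way
        return quote(unicodedata.normalize("NFC", text).encode("utf-8"))

    for path in ("/expos%E9.html", "/expose%CC%81.html", "/expos%C3%A9.html"):
        print(cache_key(path))   # /expos%C3%A9.html every time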


On Jan 16, 2014, at 4:08 AM, Julian Reschke <julian.reschke@gmx.de> wrote:

> On 2014-01-15 21:54, Poul-Henning Kamp wrote:
>> In message <CACuKZqF0oxcpJWYnDzzVSwzeJgQ4K18gZCynyYh0uJwY=4xHtA@mail.gmail.com>
>> , Zhong Yu writes:
>> 
>>> Can you give an example where an intermediary benefits from decoding
>>> URI octets into unicodes?
>> 
>> Not necessarily from converting them into unicodes, but normalising
>> them using whatever rule we might prefer, so that cache-lookups
>> will always find the same object, no matter how the URI was mangled
>> with encodings.
> 
> Why do you need more normalization than you have right now?
> 
> /me confused
> 
> 

_________________________________________________________
Michael Sweet, Senior Printing System Engineer, PWG Chair
