
Re: UTF-8 in URIs

From: Michael Sweet <msweet@apple.com>
Date: Thu, 16 Jan 2014 09:29:19 -0500
Cc: Poul-Henning Kamp <phk@phk.freebsd.dk>, Zhong Yu <zhong.j.yu@gmail.com>, Gabriel Montenegro <Gabriel.Montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <OSAMAM@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <Michael.Bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
Message-id: <8ADC0330-992B-4912-93A4-844D12E32906@apple.com>
To: Julian Reschke <julian.reschke@gmx.de>

Consider a file named "exposé.html", served by www.example.com. This URI can be encoded in many different ways depending on the character set and (for Unicode) normalization form used, for example:

    ISO-8859-1      http://www.example.com/expos%E9.html

    UTF-8 NFD       http://www.example.com/expose%CC%81.html

    UTF-8 NFC       http://www.example.com/expos%C3%A9.html
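
A minimal Python sketch of how these three forms arise from the same file name, using only the standard library. The helper name `encode_path` is hypothetical and chosen for illustration:

```python
import unicodedata
from urllib.parse import quote

# The file name with a precomposed e-acute (U+00E9).
NAME = "expos\u00e9.html"

def encode_path(text, charset, form=None):
    """Percent-encode text after optional Unicode normalization.

    encode_path is a hypothetical helper, not an existing API.
    """
    if form is not None:
        text = unicodedata.normalize(form, text)
    # quote() accepts bytes and escapes every octet outside the
    # unreserved set, using uppercase hex digits.
    return quote(text.encode(charset))

print(encode_path(NAME, "iso-8859-1"))    # expos%E9.html
print(encode_path(NAME, "utf-8", "NFD"))  # expose%CC%81.html
print(encode_path(NAME, "utf-8", "NFC"))  # expos%C3%A9.html
```

All three strings name the same file, yet they differ octet for octet.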

Today, you have no guarantee that typing "http://www.example.com/exposé.html" in your web browser will work, since the browser's choice of character set and normalization form may not match the server's, and it may not be possible for the server to guess correctly.  An intermediate proxy will likewise have difficulty caching the content efficiently and correctly.
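
To make the guessing problem concrete, here is a rough Python sketch of the kind of heuristic a server or cache might apply. It assumes the octets are UTF-8 if they decode cleanly and ISO-8859-1 otherwise, then maps everything to UTF-8 NFC. The function name `canonical_path` is hypothetical, and the fallback is only a guess: octets from some other legacy character set would be silently misread.

```python
import unicodedata
from urllib.parse import quote, unquote_to_bytes

def canonical_path(path):
    """Map a percent-encoded path to a UTF-8 NFC form (heuristic sketch).

    canonical_path is a hypothetical helper, not an existing API.
    """
    raw = unquote_to_bytes(path)
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is ISO-8859-1.
        # This guess can be wrong for other legacy character sets.
        text = raw.decode("iso-8859-1")
    return quote(unicodedata.normalize("NFC", text).encode("utf-8"))

variants = [
    "/expos%E9.html",      # ISO-8859-1
    "/expose%CC%81.html",  # UTF-8 NFD
    "/expos%C3%A9.html",   # UTF-8 NFC
]
# All three variants collapse to one key under this heuristic.
print({canonical_path(v) for v in variants})  # {'/expos%C3%A9.html'}
```

Nothing in the request tells the recipient which character set was used, so the heuristic has to infer it from the octets alone.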

On Jan 16, 2014, at 4:08 AM, Julian Reschke <julian.reschke@gmx.de> wrote:

> On 2014-01-15 21:54, Poul-Henning Kamp wrote:
>> In message <CACuKZqF0oxcpJWYnDzzVSwzeJgQ4K18gZCynyYh0uJwY=4xHtA@mail.gmail.com>, Zhong Yu writes:
>>> Can you give an example where an intermediary benefits from decoding
>>> URI octets into unicodes?
>> Not necessarily from converting them into unicodes, but normalising
>> them using whatever rule we might prefer, so that cache-lookups
>> will always find the same object, no matter how the URI was mangled
>> with encodings.
> Why do you need more normalization than you have right now?
> /me confused

Michael Sweet, Senior Printing System Engineer, PWG Chair

Received on Thursday, 16 January 2014 14:29:51 UTC
