
Re: *** GMX Spamverdacht *** Re: UTF-8 in URIs

From: Julian Reschke <julian.reschke@gmx.de>
Date: Thu, 16 Jan 2014 16:11:47 +0100
Message-ID: <52D7F6B3.6080508@gmx.de>
To: Nicolas Mailhot <nicolas.mailhot@laposte.net>
CC: Zhong Yu <zhong.j.yu@gmail.com>, Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On 2014-01-16 15:57, Nicolas Mailhot wrote:
> ...
>>> And it's useless if you can't interpret it reliably. May as well log the
>>> output of /dev/random at the time. Don't have time to get humans to comb
>>> millions of log lines to fix encoding errors.
>>
>> Define "encoding error" in the context of a URI.
>
> Any URI that can not be reliably decoded in the textual representation the
> URL creator preferred by a random http processor (web site, intermediary,
> web client) without outside help.

A valid URI is all US-ASCII. There's nothing that needs to be decoded at 
all.
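
To illustrate, here is a minimal sketch in Python (the language and the path "/café" are made up for this example; nothing here comes from an actual server). It percent-encodes the same text under two charsets. Both results are pure US-ASCII and therefore valid on the wire. Only a recipient that tries to turn the octets back into text has to guess the charset, and a strict "treat it as UTF-8" check rejects the Latin-1 variant.

```python
from urllib.parse import quote, unquote_to_bytes


def percent_encode(text: str, charset: str) -> str:
    """Percent-encode a path using the given charset (hypothetical helper)."""
    return quote(text, safe="/", encoding=charset)


def is_strict_utf8(path: str) -> bool:
    """Return True if the percent-decoded octets form valid UTF-8."""
    try:
        unquote_to_bytes(path).decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The same human-readable path, encoded by two differently configured systems.
utf8_path = percent_encode("/caf\u00e9", "utf-8")
latin1_path = percent_encode("/caf\u00e9", "latin-1")

# Both URIs consist of US-ASCII characters only.
both_ascii = utf8_path.isascii() and latin1_path.isascii()

# Only the UTF-8 variant survives a strict UTF-8 decode of its octets.
utf8_ok = is_strict_utf8(utf8_path)
latin1_ok = is_strict_utf8(latin1_path)
```

The URI itself carries no hint of which charset produced the octets, which is why the question "how do you translate a URI with unknown encoding to UTF-8?" has no general answer.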

> And there *is* a preferred textual representation because you know, people
> do not enter URLs in binary editors.
>
>>>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in
>>>>> XML,
>>>>> that's one part of the XML spec that worked very well) and require
>>>>> http/1
>>>>> to 2 bridges to translate to the canonical form. Helping clients push
>>>>> local 8bits encodings will just perpetuate pre-2000 legacy mess.
>>>>
>>>> How do you translate a URI with unknown URI encoding to UTF-8?
>>>
>>> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with an
>>> error. That will make people fix their encodings quickly.
>>
>> This is not going to work:
>>
>> a) People may have chosen a non-UTF8 encoding by accident (system locale
>> etc) and can't change it retroactively,
>
> They can add a UTF-8 translator gateway at http/2 adoption time. No
> different and much easier than the mass of documents that needed to be
> fixed once people started exchanging them in binary not dead wood form.

And that translator rewrites all URIs that appear in payloads?

> Some past mistakes need correction; you can't grandfather them eternally at
> the cost of eternal future interop problems.

The only interop problem I'm aware of is when clients construct *new* 
URIs, such as is common in WebDAV. The way to fix this is to *advocate* 
UTF-8. But even if everybody agrees on UTF-8 there's still the NFC/NFD 
mismatch between OSX and the rest of the world.
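
A small sketch of that NFC/NFD mismatch, again in Python and with a made-up one-character name. It assumes both sides already use UTF-8 and differ only in Unicode normalization form.

```python
import unicodedata
from urllib.parse import quote

# "é" as a single precomposed code point (NFC) and as "e" followed by a
# combining acute accent (NFD). Both render identically on screen.
nfc_name = unicodedata.normalize("NFC", "\u00e9")
nfd_name = unicodedata.normalize("NFD", "\u00e9")

# Percent-encode both as UTF-8, as a client constructing a new URI might.
nfc_uri = quote(nfc_name)
nfd_uri = quote(nfd_name)

# The resulting URIs differ, so a byte-wise comparison treats them as
# two distinct resources.
same_resource = nfc_uri == nfd_uri
```

So a client that builds a URI from an NFD file name and a server that stored the name in NFC can both be "UTF-8 everywhere" and still fail to match.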

>> b) There might be actual *binary* data in the URI.
>
> So just define the canonical binary-to-utf8 mapping. If you don't your URL
> will crash as soon as it needs to be displayed in an address bar, network
> console or activity log. Again, much easier to define a single
> binary-to-utf8 mapping than random encoding to random display encoding
> rules (hint: it is not possible and that's the core problem).

I still have no clue what problem you are trying to solve. Sorry.

> ...
>> Hm, no. They just happen to work in a way different from your
>> preference, but they do just work fine.
>
> No, they don't work. Working is not "avoid any automated processing and
> for christsakes never use anything but ASCII because the state is
> undefined and things will randomly break"

The state is fully defined. It's just that you don't like that state.

> ...

It seems we aren't getting anywhere.

Can somebody else help me understand what this is all about? :-)

Best regards, Julian
Received on Thursday, 16 January 2014 15:12:20 UTC
