- From: Julian Reschke <julian.reschke@gmx.de>
- Date: Thu, 16 Jan 2014 11:28:49 +0100
- To: Nicolas Mailhot <nicolas.mailhot@laposte.net>
- CC: Zhong Yu <zhong.j.yu@gmail.com>, Gabriel Montenegro <gabriel.montenegro@microsoft.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>, Osama Mazahir <osamam@microsoft.com>, Dave Thaler <dthaler@microsoft.com>, Mike Bishop <michael.bishop@microsoft.com>, Matthew Cox <macox@microsoft.com>
On 2014-01-16 11:24, Nicolas Mailhot wrote:
> On Thu, 16 January 2014 11:06, Julian Reschke wrote:
>> On 2014-01-16 10:52, Nicolas Mailhot wrote:
>>>
>>> On Wed, 15 January 2014 21:46, Zhong Yu wrote:
>>>> Can you give an example where an intermediary benefits from decoding
>>>> URI octets into unicodes?
>>>
>>> Intermediaries can not perform URL-based filtering if they can not
>>> decode URLs reliably. Intermediaries need to normalise URLs to a single
>>> encoding if they log them (for debugging or policy purposes). A
>>> unix-like "just a bunch of bytes with no encoding indication" is an
>>> i18n disaster supported only by users of ASCII scripts.
>>
>> Well, you could log what you got on the wire. It's ASCII.
>
> And it's useless if you can't interpret it reliably. May as well log the
> output of /dev/random at the time. We don't have time to get humans to
> comb millions of log lines to fix encoding errors.

Define "encoding error" in the context of a URI.

>>> I favour making URLs UTF-8 by default in HTTP/2 (just as it was in XML;
>>> that's one part of the XML spec that worked very well) and requiring
>>> HTTP/1-to-2 bridges to translate to the canonical form. Helping clients
>>> push local 8-bit encodings will just perpetuate the pre-2000 legacy
>>> mess.
>>
>> How do you translate a URI with unknown URI encoding to UTF-8?
>
> You treat it as UTF-8. If it fails UTF-8 sanity rules you fail with an
> error. That will make people fix their encodings quickly.

This is not going to work:

a) people may have chosen a non-UTF-8 encoding by accident (system locale,
etc.) and can't change it retroactively;

b) there might be actual *binary* data in the URI.

>>> Whenever someone specifies a new, better encoding it will be time for
>>> HTTP/3. The Unicode specs are way more complex than HTTP; changes won't
>>> happen quicker than HTTP revisions.
>>
>> The problem here is that HTTP URIs are octet sequences, not character
>> sequences.
>
> The problem is that octet sequences are useless by themselves if you can
> not decode them.

Hm, no. They just happen to work in a way different from your preference,
but they do work just fine.

Best regards, Julian

(Yes, I'd prefer that there were more UTF-8 in HTTP, but there are problems
that are hard to solve without breaking things.)
Received on Thursday, 16 January 2014 10:29:23 UTC
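The disagreement above is easy to reproduce. Below is a minimal sketch in Python (not part of the original thread; the percent-encoded paths are hypothetical examples) of the ambiguity Julian describes: percent-decoding a URI yields raw octets with no encoding label, so octets produced by a Latin-1 client fail the strict "treat it as UTF-8" rule that Nicolas proposes, even though they form a perfectly legal URI.

```python
# Sketch: the same URI syntax can carry octets from different character
# encodings, and nothing in the octets says which one was used.
from urllib.parse import unquote_to_bytes

# "café" percent-encoded by a client using Latin-1 (ISO-8859-1)
latin1_path = "/caf%E9"
# the same word percent-encoded by a client using UTF-8
utf8_path = "/caf%C3%A9"

for path in (latin1_path, utf8_path):
    octets = unquote_to_bytes(path)    # the raw octets an intermediary sees
    try:
        text = octets.decode("utf-8")  # the "treat it as UTF-8" rule
        print(path, "->", repr(text))
    except UnicodeDecodeError as err:
        # A strict UTF-8 gateway would have to reject this request,
        # even though the octets are valid URI content on the wire.
        print(path, "-> rejected:", err)
```

Under these assumptions the Latin-1 path is rejected with a UnicodeDecodeError while the UTF-8 path decodes to "/café"; the intermediary cannot tell from the octets alone whether it is looking at legacy-encoded text, differently normalised UTF-8, or plain binary data.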