- From: James M Snell <jasnell@gmail.com>
- Date: Sat, 9 Feb 2013 23:37:29 -0800
- To: Willy Tarreau <w@1wt.eu>
- Cc: Mark Nottingham <mnot@mnot.net>, Martin Dürst <duerst@it.aoyama.ac.jp>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
- Message-ID: <CABP7RbfgR4u+n9_K1DqYqf8HUPuXWGLyHOOAPGwWxKs7M_dmKw@mail.gmail.com>
Keep in mind that allowing headers to potentially contain UTF-8 does not
change the definitions of existing headers. Those that are currently
defined with ASCII-only values would likely remain ASCII-only; to change
that, we would need to either update the definitions of those existing
headers (breaking backwards compatibility) or define new UTF-8-compatible
replacements. All we need to decide at this point, really, is (a) are
UTF-8 header values important to us, and (b) does/will our basic header
encoding allow for UTF-8 if the answer to (a) is yes.

On Feb 9, 2013 11:26 PM, "Willy Tarreau" <w@1wt.eu> wrote:

> Hello Martin,
>
> On Sun, Feb 10, 2013 at 02:02:46PM +0900, "Martin J. Dürst" wrote:
> > > The encoding can become inefficient to transport for other
> > > charsets by inflating data by up to 50%
> >
> > Well, that's actually an urban myth. The 50% is for CJK
> > (Chinese/Japanese/Korean).
>
> With the fast development of China, it is perfectly imaginable that
> in 10 years, a significant portion of the web traffic is made with
> Chinese URLs, so we must not ignore that.
>
> > For the languages/scripts of India, South East Asia, and a few more
> > places, it can be 200%. (For texts purely in an alphabet in the
> > Supplemental planes such as Old Italic, Shavian, Osmanya,..., it can
> > be 300%, but I guess we can ignore these.) But these numbers only
> > apply to cases that don't contain any ASCII at all.
>
> I don't see how this is possible since you have 6 bits of data per byte
> plus a few bits on the first byte, and you need 3 bytes to transport 16
> bits, which is 50% for me :-)
>
> > > and may make compression less efficient.
> >
> > That depends very much on the method of compression that's used.
>
> I agree, but adding unused bits or entropy in general will make
> compression algorithms less efficient.
>
> > > I'm not saying I'm totally against UTF-8 in HTTP/2 (even though I
> > > hate using it), I'm saying that it's not *THE* solution to every
> > > problem. It's just *A* solution to *A* problem: "how to extend
> > > character sets in existing documents without having to re-encode
> > > them all". I don't think this specific problem is related to the
> > > scope of the HTTP/2 work, so at first glance, I'd say that UTF-8
> > > doesn't seem to solve a known problem here.
> >
> > The fact that I mentioned Websockets may have led to a
> > misunderstanding. I'm not proposing to use UTF-8 only in bodies, just
> > in headers (I wouldn't object, though). My understanding was that
> > James was talking about headers, and I was doing so, too.
>
> I was talking about header values too. As a developer of intermediaries,
> I'm not interested in the body at all. I'm seeing people do ugly things
> all the time, like regex-matching hosts with ".*\.example\.com" without
> being aware how slow it is to do that on each and every Host header
> field. Typically doing that with a UTF-8-aware library is even slower.
>
> That's why I'm having some concerns.
>
> Ideally, everything we transport should be in its original form. If
> hosts come from DNS, they should appear encoded as they were returned by
> the DNS server (even with the ugly IDN format). If paths are supposed to
> be UTF-8, let them be sent in their raw original UTF-8 form without
> changing the format. But then we don't want to mix Host and path, and we
> want to put as a first rule that only the shortest forms are allowed. If
> most header fields are pure ASCII (e.g. encodings), declare them as
> such. If some header fields are enums, use enums and not text. Etc...
>
> Regards,
> Willy
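The 50%, 200% and 300% figures quoted above can be checked with a few lines of Python. This is only a rough sketch: the helper names are made up for illustration, and the baseline sizes are an assumed reading of the thread (2 bytes per character for CJK in a legacy double-byte encoding or UTF-16, 1 byte per character for a script that fits a legacy 8-bit encoding). Neither message spells out its baseline.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per code point in text."""
    return len(text.encode("utf-8")) / len(text)


def utf8_overhead_percent(text: str, baseline_bytes_per_char: int) -> float:
    """Size increase of UTF-8 relative to an assumed fixed-width baseline.

    baseline_bytes_per_char is an assumption supplied by the caller,
    not something the thread defines.
    """
    utf8_size = len(text.encode("utf-8"))
    baseline_size = baseline_bytes_per_char * len(text)
    return round((utf8_size / baseline_size - 1) * 100, 1)


# Sample strings containing no ASCII at all, paired with the assumed baseline.
samples = {
    "CJK": ("中文网址", 2),                        # 3 bytes in UTF-8 vs 2
    "Devanagari": ("हिन्दी", 1),                    # 3 bytes in UTF-8 vs 1
    "Old Italic": ("\U00010300\U00010301", 1),    # 4 bytes in UTF-8 vs 1
}

for name, (text, baseline) in samples.items():
    print(name, utf8_bytes_per_char(text), utf8_overhead_percent(text, baseline))
```

Under these assumptions both views in the thread hold: 3 bytes to carry a 16-bit code unit is a 50% increase, while 3 or 4 bytes against a 1-byte legacy encoding is 200% or 300%. Any ASCII mixed into the value lowers the average, as noted above.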
Received on Sunday, 10 February 2013 07:37:57 UTC