Re: Delta Compression and UTF-8 Header Values

On Sat, Feb 09, 2013 at 02:04:30PM +0000, Poul-Henning Kamp wrote:
> In message <20130209133341.GA8712@1wt.eu>, Willy Tarreau writes:
> 
> >I'm not saying I'm totally against UTF-8 in HTTP/2 [...]
> 
> What and where do you mean when you say "UTF-8" in HTTP/2 ?
> 
> I think we need to be more precise, to avoid misunderstandings.
> 
> In HTTP/1, there is a peculiar mix between protocol-mechanics, and
> metadata:  If I add a custom bit of metadata, it must follow certain
> rules, since otherwise it will break the protocol mechanics.
> 
> For instance, I cannot define a custom header called:
> 
> 	 "FOO" CRNL CRNL ": " [8 zero bytes]
> 
> If we define HTTP/2 as a "binary" protocol in some sensible way,
> this restriction could go away, and we'd just move something like:
> 
> 	<HDR nlen=7,blen=8> "FOO" CRNL CRNL \0\0\0\0\0\0\0\0
> 
> down the wire, and not care about what it is, what it means or
> what character set, if any, it is encoded in.
> 
> It is only the metadata that needs inspection along the way where
> we need to decide about UTF-8, and it really isn't that much.

Prefixing values with their lengths is generally the most efficient
way to work, CPU-wise: the parser can jump straight to the next field
instead of scanning every byte for a delimiter.
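
To illustrate the length-prefixed framing discussed above, here is a
minimal sketch in Python. The layout (two 16-bit big-endian lengths
followed by the raw name and value bytes) and the function names are
assumptions made for this example only; they are not taken from any
HTTP/2 draft.

```python
import struct

# Hypothetical frame layout: 2-byte name length, 2-byte value length
# (both big-endian), then the raw name bytes and the raw value bytes.
_PREFIX = struct.Struct("!HH")


def encode_header(name: bytes, value: bytes) -> bytes:
    """Frame one header; both byte strings are copied through untouched."""
    return _PREFIX.pack(len(name), len(value)) + name + value


def decode_header(buf: bytes, offset: int = 0):
    """Return (name, value, next_offset) without scanning for delimiters."""
    nlen, blen = _PREFIX.unpack_from(buf, offset)
    start = offset + _PREFIX.size
    end = start + nlen + blen
    if end > len(buf):
        raise ValueError("truncated header frame")
    name = buf[start:start + nlen]
    value = buf[start + nlen:end]
    return name, value, end


# The example from the quoted message: a 7-byte name containing CR/LF
# pairs and an 8-byte value made of zero bytes.
frame = encode_header(b"FOO\r\n\r\n", b"\x00" * 8)
name, value, next_offset = decode_header(frame)
```

Because the decoder only reads two lengths and slices, the CR/LF bytes
and zero bytes in the payload cannot confuse it.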

> Host:
> 	Why would we care about the character set ?  We're
> 	just going to pass it to DNS anyway.
> 
> URI:
> 	At least the query strings, possibly all of it ?
> 	But do we really care ?  Provided we take the Host
> 	part out, as proposed, we treat this as a unit.

I'd be cautious about mixing the URI path and query strings. Too often
I see people rewrite requests to replace the question mark with a
slash. They don't realize they may be mixing two distinct encodings,
and yet they do it anyway!
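
A small sketch of how that goes wrong, assuming (for illustration
only) a path that was percent-encoded from UTF-8 and a query string
that came from a form submitted in ISO-8859-1. The sample values and
the helper name are made up for this example.

```python
from urllib.parse import quote, unquote

# Assumed inputs: path encoded from UTF-8, query encoded from ISO-8859-1.
path = "/" + quote("caf\u00e9", encoding="utf-8")        # "/caf%C3%A9"
query = "q=" + quote("\u00e9", encoding="iso-8859-1")    # "q=%E9"

original = path + "?" + query
# The rewrite described above: the "?" is replaced with a "/", so the
# query bytes now look like part of the path.
rewritten = path + "/" + query


def decodes_as_utf8(component: str) -> bool:
    """True if the percent-decoded bytes form valid UTF-8."""
    try:
        unquote(component, encoding="utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False
```

Here `decodes_as_utf8(path)` holds, but `decodes_as_utf8(rewritten)`
does not: the rewritten "path" now carries bytes from two encodings and
no single charset decodes it cleanly.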

> Cache-Control:
> 	And what good would UTF-8 do here in the first place ?

No need: this field should carry tokens, and tokens can be an enum.
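
As a rough sketch of the tokens-as-enum idea: each directive maps to a
small integer, so no character set is involved on the wire. The
numeric codes and the `parse_directive` helper below are hypothetical
and not taken from any specification.

```python
from enum import IntEnum


class CacheDirective(IntEnum):
    # Hypothetical code points, chosen only for this example.
    NO_CACHE = 1
    NO_STORE = 2
    MAX_AGE = 3
    PRIVATE = 4
    PUBLIC = 5


# Map the textual HTTP/1 token ("no-cache") to its enum member.
_BY_TOKEN = {d.name.lower().replace("_", "-"): d for d in CacheDirective}


def parse_directive(token: str) -> CacheDirective:
    """Look up a directive token case-insensitively; KeyError if unknown."""
    return _BY_TOKEN[token.strip().lower()]


# A directive then fits in a single byte.
wire = bytes([parse_directive("No-Store")])
```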

> So where is it you want UTF-8, and what difference will it make ?

Hey Poul-Henning, please do not put words in my mouth. I'm not
saying I want UTF-8, OK? As I said, I don't like this encoding at
all. What I'm saying is that if we have to transport such encoded
data, I'd prefer that we pass it as-is, in its original form, rather
than having to decode and re-encode it. For example, if it becomes the
norm that the URI, Location or Referer are UTF-8-encoded, let's pass
them untransformed.
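
To make the difference concrete, here is a small sketch comparing an
intermediary that forwards bytes untouched with one that decodes and
re-encodes them. Both functions are hypothetical; the transcoding one
assumes UTF-8 with replacement of invalid bytes, which is only one of
several ways such a hop could behave.

```python
def forward_opaque(value: bytes) -> bytes:
    """Pass the value through exactly as received."""
    return value


def forward_transcoded(value: bytes) -> bytes:
    """Hypothetical hop that assumes UTF-8 and re-encodes the value."""
    return value.decode("utf-8", errors="replace").encode("utf-8")


# A URI whose non-ASCII byte is ISO-8859-1, not UTF-8.
latin1_uri = "/caf\u00e9".encode("iso-8859-1")

unchanged = forward_opaque(latin1_uri)
mangled = forward_transcoded(latin1_uri)
```

The opaque path delivers the original bytes, while the transcoding
path replaces the stray byte with U+FFFD and the origin server never
sees what the client actually sent.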

But in general, I think that 20 years of the web have shown that the
protocol does not need this at all to succeed.

Regards,
Willy

Received on Saturday, 9 February 2013 15:05:48 UTC