Re: UTF-8 or ASCII Header Names? from Roberto Peon on 2013-08-16 (ietf-http-wg@w3.org from July to September 2013)

From: Roberto Peon <grmocg@gmail.com>
Date: Fri, 16 Aug 2013 09:57:38 -0700
To: James M Snell <jasnell@gmail.com>
Cc: Martin Thomson <martin.thomson@gmail.com>, Martin J. Dürst <duerst@it.aoyama.ac.jp>, Fred Akalin <akalin@google.com>, "ietf-http-wg@w3.org" <ietf-http-wg@w3.org>
Message-ID: <CAP+FsNcZg==LDg19m0BhfdOpK_6tN1fXfjRqHFztHbW1=TbVZg@mail.gmail.com>

In addition to compressing the byte strings, the compressor would have to
validate UTF-8. To me that is nearly the same complexity as normalization
(which was proposed earlier): I now get to scan things yet another time,
increasing CPU utilization... for what? Basically nothing in return if the
upper level doesn't care about it.

If the upper level cares about it, then validation should be a prerequisite
of feeding something into the compressor. If not, then it shouldn't be.
Either way, these concerns belong outside the compressor.
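
A minimal sketch of that layering, under stated assumptions: the function
names are hypothetical, and zlib is only a stand-in for the header
compressor (it is not the compressor under discussion). The upper layer
decides whether to validate; the compressor sees opaque bytes either way.

```python
import zlib


def is_valid_utf8(value: bytes) -> bool:
    """Upper-layer check: costs one extra scan over the value."""
    try:
        value.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def compress_opaque(value: bytes) -> bytes:
    """Stand-in compressor (zlib): treats the value as opaque bytes."""
    return zlib.compress(value)


def send_header_value(value: bytes, require_utf8: bool) -> bytes:
    """Hypothetical upper layer: validate only if this layer cares."""
    if require_utf8 and not is_valid_utf8(value):
        raise ValueError("header value is not valid UTF-8")
    return compress_opaque(value)
```

With require_utf8=False, arbitrary bytes such as b"\xff\xfe" pass through
untouched and round-trip through the compressor; the extra scan is paid
only by callers that ask for it.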

-=R


On Fri, Aug 16, 2013 at 9:49 AM, James M Snell <jasnell@gmail.com> wrote:

> On Fri, Aug 16, 2013 at 9:29 AM, Roberto Peon <grmocg@gmail.com> wrote:
> > I view it as liberating-- as the compressor is now freed from worrying
> > about normalization, etc. which, if done, should be done at a higher layer.
> >
>
> FWIW, I don't believe anyone had said anything about normalization...
> valid UTF-8 octets, yes, but not normalization. The compression
> mechanism is really not affected by whether or not we say UTF-8
> here...
>
> - James
>
> > There is currently exactly one field that the compressor makes assumptions
> > about, and we could change that by requiring that the HTTP layer do the
> > transformation of cookie into cookie-crumbs instead of having the
> > compressor do it. Right now the compressor knows nothing, semantically,
> > about anything else.
> >
> > The Huffman encoder that we had and will likely add back worked on bytes.
> > It mostly encountered ASCII, and thus the frequency table was skewed to
> > compress ASCII better than other things, but it could still handle UTF-8,
> > raw binary, whatever.
> >
> > I could certainly see an eventual future where some values are just raw
> > binary.
> > Sure, the Huffman-based encoder would not compress that very well, but that
> > is OK-- the binary representation should already be fairly small in
> > comparison to the Base64 encoding we do today (I'd rather have the data
> > remain the same size than get a 30% decrease after a 4/3 expansion, which
> > is what would happen today...), and an escape valve of not having to use
> > the Huffman encoding has always been the plan.
> >
> > We could still allow for compressors to do things with semantic knowledge,
> > but there is no need to *require* it by declaring the type of all values
> > a priori.
> > Simply require that any transformation the compressor does must not change
> > the semantic meaning of the value. Problem solved, I think.
> >
> > -=R
> >
> >
> > On Fri, Aug 16, 2013 at 9:19 AM, Martin Thomson <martin.thomson@gmail.com>
> > wrote:
> >>
> >> On 16 August 2013 08:44, Roberto Peon <grmocg@gmail.com> wrote:
> >> > The keys should be ASCII, and the values bytes.
> >>
> >> That's a fairly narrow view.  If the values were (for example) ASCII,
> >> then you'd have an opportunity to compress better.  At worst, you can
> >> wipe the high order bit from every octet.
> >>
> >> At some level you are going to need to either make assumptions about
> >> the properties of values, or rely on specific knowledge about them if
> >> you are going to compress effectively.  Even if it were the case that
> >> the bytes were UTF-8, you could still make some gains over pure bytes
> >> (even just by exploiting the fact that certain byte sequences are not
> >> possible in UTF-8).
> >
> >
>
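
The remark in the earliest quoted message, that certain byte sequences are
not possible in UTF-8, can be made concrete. Per RFC 3629, the octets 0xC0,
0xC1 and 0xF5 through 0xFF never appear in valid UTF-8. The sketch below
only illustrates that fact; the names are hypothetical and it is not a
compression scheme from the thread.

```python
# Octets that never occur in well-formed UTF-8 (RFC 3629):
# 0xC0/0xC1 could only begin overlong encodings, and 0xF5..0xFF
# could only begin sequences beyond U+10FFFF.
NEVER_IN_UTF8 = frozenset({0xC0, 0xC1}) | frozenset(range(0xF5, 0x100))


def contains_impossible_byte(value: bytes) -> bool:
    """True if the value holds an octet that valid UTF-8 never uses."""
    return any(b in NEVER_IN_UTF8 for b in value)


def usable_byte_values() -> int:
    """How many of the 256 octet values valid UTF-8 can actually use."""
    return 256 - len(NEVER_IN_UTF8)
```

An encoder that knew its input was UTF-8 could in principle assign no code
space to those 13 octets, which is the kind of gain the quoted message has
in mind; an encoder working on opaque bytes cannot assume that.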

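The Base64 comparison in the quoted text can be checked with a little
arithmetic. Base64 emits 4 output characters for every 3 input bytes, so a
roughly 30% reduction applied afterwards lands close to the original size.
The sketch below takes the 30% figure from the message as given rather than
measuring it, and the helper names are hypothetical.

```python
import base64


def encoded_size(raw_len: int) -> int:
    """Length of the Base64 encoding of raw_len arbitrary bytes."""
    return len(base64.b64encode(bytes(raw_len)))


def size_after_reduction(size: int, percent: int) -> int:
    """Size after shrinking by the given percentage (integer math)."""
    return size * (100 - percent) // 100


raw_len = 300                                  # example binary value
b64_len = encoded_size(raw_len)                # 400: a 4/3 expansion
final_len = size_after_reduction(b64_len, 30)  # 280: near the original 300
```

Sending the 300 raw bytes uncompressed costs about the same as encoding
and then compressing them, without the extra work on either side.
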
Received on Friday, 16 August 2013 16:58:06 UTC