Re: Delta Compression and UTF-8 Header Values from Willy Tarreau on 2013-02-09 (ietf-http-wg@w3.org from January to March 2013)

From: Willy Tarreau <w@1wt.eu>
Date: Sat, 9 Feb 2013 15:58:34 +0100
To: Martin Nilsson <nilsson@opera.com>
Cc: ietf-http-wg@w3.org
Message-ID: <20130209145834.GB8712@1wt.eu>

On Sat, Feb 09, 2013 at 03:12:32PM +0100, Martin Nilsson wrote:
> On Sat, 09 Feb 2013 14:33:41 +0100, Willy Tarreau <w@1wt.eu> wrote:
> 
> >Also, processing it is
> >particularly inefficient as you have to parse each and every byte to find
> >a length, making string comparisons quite slow.
> 
> You don't need to know the length in characters to compare strings. Just  
> comparing byte on byte works fine.

This is exactly what you want to avoid when comparing against lots of strings.
It's generally more efficient to first compare lengths, then compare byte by
byte only if the lengths match. The same holds when checking for regex
patterns such as "/cache/dir/../..../" where "." denotes a character. And
last but not least, the Boyer-Moore search is much less efficient with
UTF-8 encoding than it is with non-encoded data.
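
To make the length argument concrete, here is a minimal sketch in Python. The
function names (utf8_char_count, equal_length_first, match_known) are made up
for illustration, and the character count assumes the input is already valid
UTF-8. It only shows the two costs discussed above: a byte length is known up
front, while a character length needs a pass over every byte.

```python
from typing import Sequence


def utf8_char_count(data: bytes) -> int:
    """Count characters in (assumed valid) UTF-8 by scanning every byte.

    Continuation bytes have the form 10xxxxxx and do not start a character,
    so each byte has to be inspected before the length is known.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)


def equal_length_first(a: bytes, b: bytes) -> bool:
    """Reject on length mismatch before touching any byte."""
    if len(a) != len(b):
        return False
    # Only reached when the lengths match: compare byte by byte.
    for x, y in zip(a, b):
        if x != y:
            return False
    return True


def match_known(value: bytes, known: Sequence[bytes]) -> int:
    """Return the index of value in known, or -1 if it is not there."""
    for i, candidate in enumerate(known):
        if equal_length_first(value, candidate):
            return i
    return -1


if __name__ == "__main__":
    raw = "café".encode("utf-8")
    print(len(raw))              # byte length, available without scanning
    print(utf8_char_count(raw))  # character length, needs a full scan
    print(match_known(b"host", [b"accept", b"cookie", b"host"]))
```

With a table of many candidate strings, most candidates are rejected by the
length check alone, which is the case the byte-on-byte-only approach misses.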

I'm really all for just transporting raw data as much as possible, that
only the two ends need to understand and agree upon when it comes to the
encoding. However, if some data come from commonly UTF-8 encoded sources,
then I'd rather keep them as-is than have to re-encode them.

Willy

Received on Saturday, 9 February 2013 14:59:18 UTC