timestamps encoding (was: Re: Straw Poll: Restore Header Table and Static Table Indices)

On Tue, Oct 21, 2014 at 12:20:16PM +0200, Willy Tarreau wrote:
> On Tue, Oct 21, 2014 at 12:02:39PM +0200, Julian Reschke wrote:
> > On 2014-10-21 11:44, Willy Tarreau wrote:
> > >...
> > >I know, and you remember, that was one of the basic points of our
> > >proposal 2 years ago. I think we'll hardly propose this here since
> > >it changes the ability to pass certain invalid values. However, I
> > > ...
> > 
> > Right. But maybe we could try something the represents valid values 
> > well, but still allows transporting invalid values?
> 
> I hadn't thought about that option honnestly and I find that it could be
> very efficient and even save a lot of CPU by avoiding to process anything
> complicated or out of range. I mean, if the date respects a very precise
> format, we encode it and transport it in binary. Otherwise we leave it
> in plain text mode. We can have an encoding starting with something
> impossible in text mode (eg: LF byte) to mark that what follows is a
> 32-bit binary encoding of a timestamp. A variable-length encoding would
> allow to pass end of the current epoch and to use less bytes currently.
> This method would even be compatible with existing encoding and parsers.

I just experimented with this : variable length encoding of the date using
255/256 possible values for each byte, so that it remains compatible with
string processing in programs (no zero byte is produced).

The example code below converts valid strings to their compacted form
and leaves the non-convertable ones intact. The converted string's format
indicates what format it's in (ie: if it begins with 0xA its a compact one).
The gains are interesting, just tested on example.com, there are 3 dates
there, 24 bytes each, and reduced to 5 bytes each, that's 57 bytes saved
in one response header (19 bytes saved per valid header).

The encoding function works this way : it tries to parse a valid HTTP date
and returns either the original string or the compacted one :

	const char *http_date_encode(const char *date)
	{
		struct tm tm;
		static char buf[10];

		if (strptime(date, "%a, %d %b %Y %H:%M:%S %Z", &tm) == NULL)
			return date;

		if (!nz_encode(buf + 1, sizeof(buf) - 1, mktime(&tm)))
			return date;

		buf[0] = '\n';
		return buf;
	}

The decoder does the opposite, it checks whether the date string starts
with an LF character and returns either its expansion or the string itself :

	const char *http_date_decode(const char *date)
	{
		struct tm *tm;
		static char buf[200];
		time_t t;

		if (*date != '\n')
			return date;

		t = nz_decode(date + 1);
		tm = gmtime(&t);
		strftime(buf, sizeof(buf), "%a, %d %b %Y %H:%M:%S %Z", tm);
		return buf;
	}

The non-zero encoding/decoding functions are simple base255 encoders which
ensure that we won't add an extra byte for a long time, but if someone doesn't
feel at ease with these, it's possible to use another format (and waste more
space) :

	/* decode a non-zero encoded string into a long int */
	unsigned long nz_decode(const char *buf)
	{
		unsigned long v;
		unsigned long mul = 1;

		for (v = 0; *buf; buf++) {
			v += mul * (unsigned char)(*buf - 1);
			mul *= 255;
		}
		return v;
	}

	/* encode value <v> into a non-zero string */
	const char *nz_encode(char *buf, int size, unsigned long v)
	{
		char *out;

		for (out = buf; size >= 1;) {
			if (!v) {
				*out = 0;
				return buf;
			}
			*(out++) = 1 + (v % 255);
			v /= 255;
		}
		return NULL;
	}

Then the calling code which converts the date both ways and compares
outputs.

	int main(int argc, char **argv)
	{
		const char *in;
		const char *enc;
		const char *dec;

		if (argc < 2)
			return 1;

		in = argv[1];
		enc = http_date_encode(argv[1]);
		dec = http_date_decode(enc);

		printf("in =<%s> (%d chars)\n", in, strlen(in));
		printf("enc=<%s> (%d chars) savings=%d\n", enc, strlen(enc), strlen(in)-strlen(enc));
		printf("dec=<%s> (%d chars) strcmp=%d\n", dec, strlen(dec), strcmp(in, dec));
		return 0;
	}

$ curl -I example.com
HTTP/1.1 200 OK
Accept-Ranges: bytes
Cache-Control: max-age=604800
Content-Type: text/html
Date: Tue, 21 Oct 2014 14:20:53 GMT
Etag: "359670651"
Expires: Tue, 28 Oct 2014 14:20:53 GMT
Last-Modified: Fri, 09 Aug 2013 23:54:35 GMT
Server: ECS (iad/182A)
X-Cache: HIT
x-ec-custom-error: 1
Content-Length: 1270

These 3 dates (Date, Expires, Last-Modified) can then be shrunk by 57 bytes
total without the risk of altering an invalid content. What's nice with this
method is that only the producer needs to know what headers it wants to
compact, the recipient doesn't need to know them since it can use the LF
byte to know if a value needs to be expanded or not. An if-modified-since
in a request could be reduced by 19 bytes, which is nice as well.

Thoughts ?

Willy

Received on Tuesday, 21 October 2014 14:26:27 UTC