HTML case canonicalization and compression: small is beautiful

I have some new performance results that may be of interest to implementors
of HTML authoring tools...

As part of our paper "Network Performance Effects of HTTP/1.1, CSS1, and
PNG" we looked into compression of HTML to see how much we can save in
bytes and hence time to transfer the data. The paper is available from

http://www.w3.org/pub/WWW/Protocols/HTTP/Performance/Pipeline.html

We have made some simple tests on how zlib compression is affected by case
canonicalizing HTML tags. The figures are available at

http://www.w3.org/pub/WWW/Protocols/HTTP/Performance/Compression/HTMLCanon.html

From this very small test, _lowercase_ canonicalization of HTML tags gives
the best compression. This is not surprising: most of the actual text in
the document is lowercase, so lowercase tag names are more likely to match
strings already in the compression dictionary than uppercase tag names are.
Uppercase tags are, however, what most editors produce today.
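
For anyone who wants to try this on their own pages, here is a minimal
sketch in Python of the kind of comparison described above. It is not the
script used for the figures; the regular expression, the function names and
the SAMPLE document are all assumptions made for illustration, and the
pattern is deliberately naive (it does not know about comments or scripts).

```python
import re
import zlib

# Matches the name in an opening or closing tag, e.g. "<TITLE" or "</Title".
# Naive on purpose: it does not skip comments, scripts or attribute values.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")


def canonicalize_tags(html, upper=False):
    """Rewrite tag names (not attributes or text) to a single case."""
    def fix(match):
        name = match.group(2)
        return match.group(1) + (name.upper() if upper else name.lower())
    return TAG_NAME.sub(fix, html)


def deflated_size(html, level=-1):
    """Number of bytes after zlib compression (-1 is zlib's default level)."""
    return len(zlib.compress(html.encode("latin-1"), level))


# Hypothetical stand-in document; replace it with a real page.
SAMPLE = """<HTML><Head><TITLE>Microscape</TITLE></Head>
<body><H1>Welcome</H1>
<P>The title of this page is in the head, and the body holds a table.</P>
<Table><TR><td>table data</td></TR></Table>
</body></HTML>
"""

if __name__ == "__main__":
    variants = [
        ("as written", SAMPLE),
        ("lowercase", canonicalize_tags(SAMPLE)),
        ("uppercase", canonicalize_tags(SAMPLE, upper=True)),
    ]
    for label, text in variants:
        print(label, len(text), "bytes ->", deflated_size(text), "compressed")
```

The three variants have the same uncompressed length, so any difference in
the compressed size comes from the tag case alone.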

Optimizing the compression algorithm for size does not have a significant
impact compared to the default compression.
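
A rough way to check this yourself, assuming that "optimizing for size"
corresponds to zlib's highest compression level (9) and "default" to its
default level (-1). The function name and the test input are made up for
this sketch.

```python
import zlib


def sizes_by_level(data):
    """Compressed size of the same bytes at three zlib levels."""
    return {
        "fastest": len(zlib.compress(data, 1)),
        "default": len(zlib.compress(data, -1)),  # zlib's default level
        "best": len(zlib.compress(data, 9)),      # optimized for size
    }


if __name__ == "__main__":
    # Hypothetical input; substitute the bytes of a real HTML page.
    page = b"<p>some lowercase text in a paragraph</p>\n" * 50
    for name, size in sizes_by_level(page).items():
        print(name, size)
```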

These data should be taken with a grain of salt as the data set is very
small (exactly one file, which is the top page of our "microscape" test site).

Other things that are interesting to investigate are experimenting with
different dictionaries and other types of canonicalizations.
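
As a starting point for the dictionary experiments, zlib lets both ends
agree on a preset dictionary in advance. The sketch below uses the `zdict`
argument of Python's zlib module; the contents of HTML_DICT are an arbitrary
guess at common HTML strings, not a tuned or measured dictionary, and the
helper names are hypothetical.

```python
import zlib

# Hypothetical preset dictionary: strings we guess are common in HTML pages.
HTML_DICT = (
    b'</title></head><body><table><tr><td></td></tr></table>'
    b'<a href="http://www.'
)


def compress_with_dict(data, zdict=HTML_DICT):
    """Deflate data with a preset dictionary known to both ends."""
    compressor = zlib.compressobj(zdict=zdict)
    return compressor.compress(data) + compressor.flush()


def decompress_with_dict(blob, zdict=HTML_DICT):
    """Inflate data produced by compress_with_dict with the same dictionary."""
    decompressor = zlib.decompressobj(zdict=zdict)
    return decompressor.decompress(blob) + decompressor.flush()


if __name__ == "__main__":
    page = (b"<html><head><title>x</title></head><body><table><tr>"
            b"<td>cell</td></tr></table></body></html>")
    print("plain:", len(zlib.compress(page)))
    print("with dictionary:", len(compress_with_dict(page)))
```

Note that the receiver must already have exactly the same dictionary, so
this only makes sense where client and server can agree on one beforehand.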

Thanks

Henrik
--
Henrik Frystyk Nielsen, <frystyk@w3.org>
World Wide Web Consortium, MIT/LCS NE43-346
545 Technology Square, Cambridge MA 02139, USA

Received on Wednesday, 26 February 1997 13:48:03 UTC