Re: Pipelining and compression effect on HTTP/1.1 proxies

At 03:44 PM 4/22/97 -0700, Benjamin Franz wrote:

>That is an *exceptionally large* HTML document - about 10 times the size
>of the average HTML document based on the results from our webcrawling
>robot here (N ~= 5,000 HTML documents found by webcrawling). Very few web
>designers would put that much on a single page because they are aiming for
>a target of 30-50K TOTAL for a page - including graphics.

It would be interesting to get a better picture of what the actual size
distribution of web pages is. A sample of 5,000 is not big enough to put
*'s around your conclusions. I know that there are many cache maintainers
and maybe even indexers on this mailing list. Benjamin, what if you asked
these people to take a snapshot of their caches and report the sizes of
the HTML pages in them? That would be very useful information to a lot
of us!

>As noted: deflate and other compression schemes do much better on large
>text/* documents than small ones. Using an overly large document gives a
>misleading comparison against the short window compression that modems
>perform by basically allowing deflate a 'running start'. You should do the
>comparison using 3-4K HTML documents: The whole test document should be
>only 3-5K uncompressed and 1-2K compressed. 

I tried to do this with the page

  http://www.w3.org/pub/WWW/Protocols/HTTP/Performance/Compression/PPP.html

which is 4312 bytes uncompressed and 1759 bytes compressed. It still gives
a 30% increase in speed and a 35% reduction in the number of TCP packets.
Below that size the packet counts with and without compression begin to be
the same, and therefore little difference is to be expected.
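
As a back-of-the-envelope illustration (my own sketch, assuming a typical
dialup MSS of 536 bytes; the real segment sizes depend on the link):

  # Rough TCP segment counts for the PPP.html test document above.
  MSS = 536                          # assumed dialup maximum segment size
  for label, size in (("uncompressed", 4312), ("compressed", 1759)):
      segments = -(-size // MSS)     # ceiling division
      print(label, size, "bytes ->", segments, "segments")

  # uncompressed 4312 bytes -> 9 segments
  # compressed 1759 bytes -> 4 segments

Once a document is small enough to fit in one or two segments either way,
compression cannot reduce the packet count any further.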

Note that this is using deflate's default settings _including_ building the
dictionary from scratch. Smarter tricks can be played by supplying a
pre-defined, HTML-aware dictionary, in which case the win will be bigger;
see the sketch below.
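
For example (a minimal sketch using Python's zlib bindings; the dictionary
contents here are hypothetical, and a real one would be tuned against a
corpus of pages):

  import zlib

  # Hypothetical preset dictionary seeded with strings common in HTML.
  HTML_DICT = (b'<html><head><title></title></head><body><table><tr><td>'
               b'</td></tr></table><a href="http://"><img src="" alt="">'
               b'<p><br><hr></body></html>')

  def deflate(data, zdict=None):
      if zdict is None:
          c = zlib.compressobj(9)
      else:
          c = zlib.compressobj(9, zlib.DEFLATED, zlib.MAX_WBITS,
                               zlib.DEF_MEM_LEVEL,
                               zlib.Z_DEFAULT_STRATEGY, zdict)
      return c.compress(data) + c.flush()

  page = open('PPP.html', 'rb').read()
  print('plain deflate:         ', len(deflate(page)))
  print('with preset dictionary:', len(deflate(page, HTML_DICT)))

The catch is that the decompressor has to know the exact same dictionary,
so it would have to be standardized (or negotiated) between client and
server.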

>> 3) On the more speculative side, I don't consider the current composition
>> of data formats in caches being constant. The paper describes the potential
>> benefits of using style sheets and other data formats than the more
>> traditional gif and jpeg. Style sheets are just starting to be deployed and
>> it may change the contents significantly over the next 6 months. CSS1 style
>> style sheets compress just as well as HTML, so there is yet another point
>> counting for compression.
>
>Again, the document used was around 10 times the size of the typical HTML
>document. This should be re-done with more typical test documents. In
>fact, it would probably be a good idea to test multiple sizes of documents
>as well as realistic mixes of text/* and image/* to understand how
>document size and mix affect the results of compression and pipelining.

My point here was that the test document's size may not be that atypical
after all, considering the effect of style sheets. As style sheets can be
included directly in the HTML document, the overall size of HTML documents
may well increase. At the same time, a lot of graphics will go away as
they are replaced by style sheets.

>> So, the _actual_data_ that we have now for the effect of compression seems
>> to indicate with little doubt that it is worth doing!
>
>No - it only indicates that it may be worth doing. Or may not. Your PPP
>and LAN tests were done using atypical input data - their results may be
>(probably are) atypical as well. 

If you look closely at the LAN data, the document size is actually of
little importance. The main point is that putting more information into
the first TCP packet increases the chance of avoiding a delayed TCP
acknowledgement. That alone can account for delays of up to 200ms, and it
can happen even with a 2-3K document. For comparison, on a 10MBit Ethernet
it takes about 50ms to transfer 40K, so the transfer time itself is not
nearly as significant a delay.
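
For scale (a rough comparison using my own numbers, not measurements from
the paper):

  # Wire time on 10Mbit Ethernet vs. the delayed-ACK timer.
  ETHERNET_BPS = 10 * 1000 * 1000    # 10 Mbit/s, ignoring framing overhead
  DELAYED_ACK_MS = 200               # typical BSD delayed-ACK timeout
  for size in (2500, 40000):         # bytes: a 2-3K page vs. a 40K page
      wire_ms = size * 8.0 / ETHERNET_BPS * 1000
      print(size, "bytes:", round(wire_ms, 1), "ms on the wire vs.",
            DELAYED_ACK_MS, "ms delayed ACK")

  # 2500 bytes: 2.0 ms on the wire vs. 200 ms delayed ACK
  # 40000 bytes: 32.0 ms on the wire vs. 200 ms delayed ACK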

Henrik
--
Henrik Frystyk Nielsen, <frystyk@w3.org>
World Wide Web Consortium, MIT/LCS NE43-346
545 Technology Square, Cambridge MA 02139, USA
