- From: Ville Skyttä <ville.skytta@iki.fi>
- Date: Tue, 8 Apr 2008 01:33:36 +0300
- To: public-qa-dev@w3.org
- Message-Id: <200804080133.37074.ville.skytta@iki.fi>
On Friday 04 April 2008, olivier Thereaux wrote:

> * for much larger document (e.g the huge HTML5 spec - Content-Length:
> 2032139) the bottlenecks are more evenly distributed. For the html5
> validated on my (old) computer:
>
> 46.42s HTML/Encoding.pm

Huh, that much in HTML::Encoding!

I took a look at what H::E does, and from a brief look at the code I got the impression that large documents that do not have </head> (such as the HTML 5 spec) are the pathological case: encoding_from_meta_element processes the whole document in 8kB chunks, appends each chunk to a big buffer as it receives it, and then decodes the whole big buffer, not just the last chunk, after grabbing each chunk (8kB in the first pass, 16kB in the next, etc...). And encoding_from_meta_element might be called several times from a loop in encoding_from_html_document (6 times if the list of encodings is not passed in)...

I may not be aware of all the quirks that need to be taken into account, and the patches below can thus introduce some bugs, but here are some numbers taken on my box, running encoding_from_meta_element on the HTML 5 spec:

Vanilla 0.56: about 20 seconds.

0.56 patched to decode only the last retrieved 8kB chunk, not maintaining the big buffer at all: about 0.4 seconds. See attached decode-last-chunk-only.patch.

0.56 patched to use HTML::HeadParser: about 0.13 seconds. See attached use-headparser.patch.

0.56 patched with both decode-last-chunk-only.patch and use-headparser.patch: no measurable difference from the use-headparser.patch-only case.
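In rough code terms, the difference between the two strategies looks something like the sketch below. This is not the actual H::E code; the subroutine names, the charset-sniffing regex, and the fixed 8kB chunk size are simplified stand-ins, but it shows why the vanilla behaviour is quadratic in the document size when no </head> is found:

    use strict;
    use warnings;
    use Encode qw(decode);

    # Vanilla pattern: the buffer grows by 8kB per pass, and the
    # *entire* buffer is re-decoded on every pass, so the total work
    # is quadratic in the document size if no match is ever found.
    sub find_meta_quadratic {
        my ($octets, $encoding) = @_;
        my $buf = '';
        for (my $off = 0; $off < length $octets; $off += 8192) {
            $buf .= substr($octets, $off, 8192);
            my $chars = decode($encoding, $buf);  # re-decodes everything so far
            return $1 if $chars =~ /charset=["']?([^"'\s;>]+)/i;
        }
        return;
    }

    # decode-last-chunk-only idea: decode each 8kB chunk exactly once
    # and keep no growing buffer, so the work stays linear. A meta
    # element or multibyte sequence straddling a chunk boundary could
    # be missed here, which is the kind of quirk/bug mentioned above.
    sub find_meta_linear {
        my ($octets, $encoding) = @_;
        for (my $off = 0; $off < length $octets; $off += 8192) {
            my $chars = decode($encoding, substr($octets, $off, 8192));
            return $1 if $chars =~ /charset=["']?([^"'\s;>]+)/i;
        }
        return;
    }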
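And the HTML::HeadParser route, again roughly (a simplified stand-in for what use-headparser.patch does, with a hypothetical charset_from_head wrapper). HeadParser stops parsing on its own once the head section ends, which is presumably where most of the win comes from:

    use strict;
    use warnings;
    use HTML::HeadParser;

    sub charset_from_head {
        my ($chars) = @_;  # decoded document text; an encoding to try
                           # still has to come from elsewhere, as with
                           # encoding_from_meta_element
        my $p = HTML::HeadParser->new;
        $p->parse($chars); # returns false once the head has been parsed
        # <meta http-equiv="Content-Type" ...> shows up as a pseudo-header
        my $ct = $p->header('Content-Type') or return;
        return $1 if $ct =~ /charset=["']?([^"'\s;]+)/i;
        return;
    }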
Attachments
- text/x-diff attachment: decode-last-chunk-only.patch
- text/x-diff attachment: use-headparser.patch
Received on Monday, 7 April 2008 22:34:38 UTC