Re: profiling the Markup Validator with Devel::NYTProf

On Friday 04 April 2008, olivier Thereaux wrote:
>
> * for much larger documents (e.g. the huge HTML5 spec - Content-Length:
> 2032139) the bottlenecks are more evenly distributed. For the HTML5 spec
> validated on my (old) computer:
>
> 46.42s HTML/Encoding.pm

Huh, that much in HTML::Encoding!

I took a look at what H::E does, and from a brief read of the code I got the 
impression that for large documents that do not have </head> (such as the 
HTML 5 spec), which are the pathological case, encoding_from_meta_element 
processes the whole document in 8kB chunks, appends each chunk to a big 
buffer as it arrives, and then decodes the entire buffer, not just the latest 
chunk, after every read (8kB on the first pass, 16kB on the next, and so 
on).  On top of that, encoding_from_meta_element might be called several 
times from a loop in encoding_from_html_document (6 times if the list of 
encodings is not passed in)...
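
To make the cost concrete, here is a rough Perl sketch of that pattern.  The 
sub name and the regexes are made up for illustration; this is not the actual 
HTML::Encoding code, just the shape of the work described above.  Re-decoding 
the whole buffer on every read makes the total work quadratic in the document 
size:

  use Encode qw(decode);

  # Illustrative only: accumulate chunks and re-decode the whole growing
  # buffer after every read.
  sub find_meta_encoding_slow {
      my ($fh, $encoding) = @_;
      my $buffer = '';
      while (read($fh, my $chunk, 8192)) {
          $buffer .= $chunk;                      # buffer keeps growing
          my $text = decode($encoding, $buffer);  # decodes *everything* read so far
          return $1 if $text =~ /<meta[^>]+charset=["']?([-\w]+)/i;
          last if $text =~ m{</head>}i;           # never reached for the HTML 5 spec
      }
      return undef;
  }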

I may not be aware of all the quirks that need to be taken into account, so 
the patches below could introduce some bugs, but here are some numbers taken 
on my box, running encoding_from_meta_element on the HTML 5 spec:

Vanilla 0.56: about 20 seconds.

0.56 patched to decode only the most recently read 8kB chunk, without 
maintaining the big buffer at all: about 0.4 seconds.  See attached 
decode-last-chunk-only.patch.
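
Just to make the idea behind that patch concrete, the loop sketched above 
becomes roughly the following.  This is a sketch of the approach, not the 
attached patch itself, and note that as written it could miss a meta element 
or a multi-byte sequence that happens to be split across a chunk boundary:

  use Encode qw(decode);

  # Decode only the chunk that was just read; no growing buffer.
  sub find_meta_encoding_last_chunk {
      my ($fh, $encoding) = @_;
      while (read($fh, my $chunk, 8192)) {
          my $text = decode($encoding, $chunk);
          return $1 if $text =~ /<meta[^>]+charset=["']?([-\w]+)/i;
          last if $text =~ m{</head>}i;
      }
      return undef;
  }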

0.56 patched to use HTML::HeadParser: about 0.13 seconds.  See attached 
use-headparser.patch.
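
For reference, the HTML::HeadParser approach looks roughly like this; the sub 
name and the charset regex are mine, and the attached patch may well do 
things differently:

  use HTML::HeadParser;

  sub charset_from_head {
      my ($fh) = @_;
      my $p = HTML::HeadParser->new;
      while (read($fh, my $chunk, 8192)) {
          # parse() returns false as soon as it sees something that cannot
          # be part of <head> any more, so the large document body is never
          # scanned even when there is no explicit </head>.
          last unless $p->parse($chunk);
      }
      my $ct = $p->header('Content-Type') or return undef;  # from <meta http-equiv>
      return $ct =~ /charset=["']?([^\s"';]+)/i ? $1 : undef;
  }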

0.56 patched with both decode-last-chunk-only.patch and use-headparser.patch: 
no measurable difference from the use-headparser.patch-only case.
