profiling the Markup Validator with Devel::NYTProf

Hello,

I spent a bit of time today playing with the Devel::NYTProf profiler on
the markup validator code, trying to find bottlenecks.

The Devel::NYTProf profiler really is a nice piece of software:
simple, efficient, and clear in its results. I am rather smitten.

http://open.blogs.nytimes.com/2008/03/05/the-new-york-times-perl-profiler/


Running it on the Markup Validator

% perl -T -d:NYTProf check uri=http://www.w3.org/TR/html5
% nytprofhtml
% open profiler/index.html
% open profiler/check.html

... showed that the most time-consuming parts were...

* for small documents (e.g. http://qa-dev.w3.org/ - Content-Length:
3345) the bottleneck seems to be HTML::Template. We do cache the
templates, but this is still the slowest part of the process
(although its cost remains reasonable).

* for much larger documents (e.g. the huge HTML5 spec - Content-Length:
2032139) the bottlenecks are more evenly distributed. For the HTML5
spec validated on my (old) computer:

46.42s HTML/Encoding.pm
05.19s HTTP/Message.pm
06.09s LWP/Protocol/http.pm
16.27s Encode.pm
36.22s Encode/Encoding.pm
42.15s check

154.9s total execution time (!)

Interesting to see that the most time-consuming processes are not so
much validation as encoding detection and decoding... That aside,
looking at check I found something very surprising. There is one line
responsible for 25 seconds of processing, and that is (current) line
2549:
if ($self->{am_in_heading}==1){
... in sub W3C::Validator::SAXHandler::data()
Of course the line itself is not time-consuming, but the fact that it is
called 1.2 million times (once per character) is really heavy.

I'm wondering if it would be possible to make that one line faster.
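
One possible direction, as a rough sketch only: if the parser dispatches
character data only to handlers that actually define data() (an assumption
I have not checked against our parser), we could pick a handler class
without data() whenever the outline was not requested. Everything below
(Sketch::Handler, Sketch::OutlineHandler, feed_chars, run_sketch) is made
up for illustration and is not the validator's real code:

```perl
use strict;
use warnings;

# Hypothetical stand-in for the SAX handler; illustrative names only.
package Sketch::Handler;

sub new {
    my ($class) = @_;
    return bless { am_in_heading => 0, heading_text => '', data_calls => 0 }, $class;
}
sub start_heading { $_[0]{am_in_heading} = 1; return }
sub end_heading   { $_[0]{am_in_heading} = 0; return }

# Hypothetical subclass used only when the outline is requested.
package Sketch::OutlineHandler;
our @ISA = ('Sketch::Handler');

# Only this subclass pays for the per-character callback.
sub data {
    my ($self, $chunk) = @_;
    $self->{data_calls}++;
    return unless $self->{am_in_heading};   # cheap early exit outside headings
    $self->{heading_text} .= $chunk;
    return;
}

package main;

# Simulated parser (assumption): one data event per character,
# dispatched only if the handler implements data().
sub feed_chars {
    my ($handler, $text) = @_;
    my $cb = $handler->can('data') or return 0;
    my $n = 0;
    for my $ch (split //, $text) {
        $handler->$cb($ch);
        $n++;
    }
    return $n;
}

# Feed a tiny fake document: some text, one heading, more text.
sub run_sketch {
    my ($want_outline) = @_;
    my $class = $want_outline ? 'Sketch::OutlineHandler' : 'Sketch::Handler';
    my $h = $class->new;
    feed_chars($h, 'intro ');
    $h->start_heading;
    feed_chars($h, 'Title');
    $h->end_heading;
    feed_chars($h, ' body text');
    return $h;
}

my $plain   = run_sketch(0);
my $outline = run_sketch(1);
printf "without outline: %d data() calls\n", $plain->{data_calls};
printf "with outline:    %d data() calls, heading '%s'\n",
    $outline->{data_calls}, $outline->{heading_text};
```

This would not make the outline case any cheaper, but it would keep the
~98% of validations that do not ask for an outline from paying for it.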

But if not, I think we need to reconsider the benefit of the "show  
outline" feature. That feature is the only reason why we have sub  
W3C::Validator::SAXHandler::data() at this point.
Pros:
* when the feature was broken, some people complained.
* it is used for ~2% of validations
Cons:
* 2% usage is not much
* the feature is not essential to validation
* the metadata extractor is much more useful and powerful
   (although not necessarily more efficient, being xslt based)

Any thoughts? Devel::NYTProf is installed on qa-dev, BTW, so have fun
with it. I'll look at checklink too, early next week.

-- 
olivier

Received on Friday, 4 April 2008 20:47:35 UTC