Re: SPO, start_element and performance (Was: release SPO 1.0 from CVS, push to cpan?)

Hi Bjoern,

You wrote:
> Investigating them, I had a brief look at the current `check` code. As
> I understand it, the current SGML::Parser::OpenSP handler always has a
> start_element handler (and others as well) declared. This makes the
> code extremely slow, if you make a handler
>
>  sub start_element {
>    require Data::Dumper;
>    print Data::Dumper::Dumper(\@_);
>  }
>
> it should immediately become clear why, it spends almost all its time
> creating huge data structures and converting strings from UTF-32 as
> they are provided by OpenSP into UTF-8 encoded Perl strings. Overall
> it should be even slower than calling `onsgmls` and going with regular
> expressions over the output as `check` did before.

I see, thanks a lot for pointing it out.

I guess one could say that with all the preparsing and other features,
the markup validator has long traded performance for
user-friendliness.

But more to the point - at the moment SPO's start_element handler is  
used for 1) the outline feature and 2) to check xmlns presence and  
value in a number of document types.

For the latter, I guess we could move that code over to a handler of
the XML parser: so far we're using XML::LibXML for XML well-formedness
checking, and I've looked into using the SAX version of that module
instead, to plug the Appendix C checker into (without success yet,
though: XML::LibXML::SAX::Parser remains elusively ill-documented...).
But I suppose that would be tantamount to moving the performance issue
to another module...

For the former, would you suggest using different SPO handlers, one
without start_element() and one with, depending on the options and
needs?
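
To make the question concrete, here is a rough sketch of what I have in
mind. The package names, the option names and pick_handler() are all
made up for illustration, and I'm assuming the element hash passed to
start_element carries a Name key. The idea is only that the "light"
class does not define start_element at all, so the parser would have
nothing to call for each element:

```perl
use strict;
use warnings;

# Hypothetical "light" handler: collects errors only and deliberately
# defines no start_element method.
package My::Handler::Light;

sub new {
    my ($class) = @_;
    return bless { errors => [], outline => [] }, $class;
}

sub error {
    my ($self, $err) = @_;
    push @{ $self->{errors} }, $err;
}

# Hypothetical "full" handler: inherits from the light one and adds
# start_element, for runs that need the outline or the xmlns check.
package My::Handler::Full;
our @ISA = ('My::Handler::Light');

sub start_element {
    my ($self, $elem) = @_;
    # Assumption: the element hash has a Name key.
    push @{ $self->{outline} }, $elem->{Name};
}

package main;

# Choose the handler class from the (made-up) options of the request.
sub pick_handler {
    my (%opt) = @_;
    my $class = ($opt{outline} || $opt{check_xmlns})
        ? 'My::Handler::Full'
        : 'My::Handler::Light';
    return $class->new;
}

my $handler = pick_handler(outline => 0);
print ref($handler), " defines start_element: ",
    ($handler->can('start_element') ? "yes" : "no"), "\n";
```

The chosen object would then be passed to the parser via
$p->handler($handler) before calling $p->parse(...), as we do now.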

> If performance is still some sort of concern, I would recommend to
> pass a handler that has no start_element callback defined unless you
> really have to.

Performance is always an issue as we're having tons of traffic, but our
recent server upgrades and indeed the move to SPO (even with the
start_element handler, even with three parsing rounds for some
documents - preparse, xml-wf and validation proper) have made the
situation very bearable for now...

Thanks,
-- 
olivier

Received on Wednesday, 2 January 2008 02:55:53 UTC