- From: olivier Thereaux <ot@zoy.org>
- Date: Wed, 2 Jan 2008 11:55:40 +0900
- To: Bjoern Hoehrmann <derhoermi@gmx.net>
- Cc: spo-devel@lists.sf.net, Tools dev list <public-qa-dev@w3.org>
Hi Bjoern,
You wrote:
> Investigating them, I had a brief look at the current `check` code. As
> I understand it, the current SGML::Parser::OpenSP handler always has a
> start_element handler (and others as well) declared. This makes the
> code extremely slow; if you make a handler
>
> sub start_element {
>     require Data::Dumper;
>     print Data::Dumper::Dumper(\@_);
> }
>
> it should immediately become clear why: it spends almost all its time
> creating huge data structures and converting strings from UTF-32 as
> they are provided by OpenSP into UTF-8 encoded Perl strings. Overall
> it should be even slower than calling `onsgmls` and going with regular
> expressions over the output as `check` did before.
I see, thanks a lot for pointing it out.
I guess one could say that with all the preparsing and other features,
the markup validator has long compromised performance for
user-friendliness.
But more to the point - at the moment SPO's start_element handler is
used for 1) the outline feature and 2) to check xmlns presence and
value in a number of document types.
For the latter, I guess we could move that code over to a handler of
the XML parser: so far we're using XML::LibXML for XML-well-formedness
and I've looked into using the SAX version of that module instead, to
plug the Appendix C checker into. (without success yet though -
XML::LibXML::SAX::Parser remains elusively ill-documented...) But I
suppose that would be tantamount to moving the performance issue to
another module...
For the former, would you suggest using different SPO handlers, one
without start_element() and one with, depending on the options and
needs?
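To illustrate, here is a minimal sketch of that idea. The package names,
the pick_handler() helper and the option names (outline, check_xmlns) are
hypothetical and not taken from the actual `check` code; the sketch also
assumes, as described above, that SPO only pays the cost of building
element data when the handler defines start_element.

```perl
use strict;
use warnings;

# Hypothetical handler WITHOUT start_element: it only collects errors,
# so no per-element data structures should need to be built for it.
package My::Handler::Basic;
sub new   { my ($class) = @_; return bless { errors => [] }, $class }
sub error { my ($self, $e) = @_; push @{ $self->{errors} }, $e }

# Hypothetical handler WITH start_element, for the outline feature
# and the xmlns checks.
package My::Handler::WithElements;
our @ISA = ('My::Handler::Basic');
sub start_element {
    my ($self, $elem) = @_;
    # Assumption: the event hash carries the element name under 'Name'.
    push @{ $self->{elements} }, $elem->{Name};
}

package main;

# Choose the cheaper handler unless an option needs element events.
sub pick_handler {
    my (%opt) = @_;
    my $class = ($opt{outline} || $opt{check_xmlns})
        ? 'My::Handler::WithElements'
        : 'My::Handler::Basic';
    return $class->new;
}

my $handler = pick_handler(outline => 0, check_xmlns => 0);

# With SGML::Parser::OpenSP installed, the handler would be used as:
#   my $p = SGML::Parser::OpenSP->new;
#   $p->handler($handler);
#   $p->parse($file);
print ref($handler), "\n";
```

The choice is made once per request, so a plain validation run would never
carry the start_element cost, while outline or xmlns requests still get
the element events they need.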
> If performance is still some sort of concern, I would recommend to
> pass a handler that has no start_element callback defined unless you
> really have to.
Performance is always an issue, as we're having tons of traffic, but our
recent server upgrades and indeed the move to SPO (even with
start_element handler, even with three parsing rounds for some
documents - preparse, xml-wf and validation proper) have made the
situation very bearable for now...
Thanks,
--
olivier
Received on Wednesday, 2 January 2008 02:55:53 UTC