Re: Validator timeout and XML-LibXML bug

On Wednesday 09 June 2010, Dominique Hazael-Massieux wrote:
> Hi Ville,

Hi,

> While investigating the source of the report of validator timeout (à la
> [1]), I found your bug report to the XML-libxml maintainers:
> https://rt.cpan.org/Public/Bug/Display.html?id=56671
> which I gather hasn't seen much progress in the past two months.

You're unfortunately right; not even an ack from the maintainer has been 
received.

> * is there an open bug matching this problem in our own bugzilla? I had
> a quick look and didn't find one, but it might be hidden into another
> bug report; if you think there is none, I'll create one

I don't remember if there's a bug report about this in Bugzilla.  But there is
at least this:
http://lists.w3.org/Archives/Public/www-validator/2010Mar/0019.html

Caution: the following is based on what I remember of the issue.
Unfortunately it's been a while since I've had time to work on the validator
or really even follow the list, so it might not be completely accurate.

Whenever there's a parse error, XML::LibXML gives us a chain of errors.  The
error object we receive initially points at the last one in the chain, which
often does not convey much at all about the actual problem.  We need to walk
the chain backwards using $error->_prev() to get to its start, which is
usually where the actual error that caused the rest of the chained ones is.
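
To illustrate the chain walk, here is a minimal, language-neutral sketch in
Python.  It is not the validator's actual Perl code: the ParseError class and
its "prev" attribute are hypothetical stand-ins for the XML::LibXML error
object and $error->_prev(), and the step limit is an extra precaution of mine.

```python
# Hypothetical stand-in for a chained parser error; "prev" plays the role
# of $error->_prev() and is None at the start of the chain.
class ParseError:
    def __init__(self, message, prev=None):
        self.message = message
        self.prev = prev


def first_error(error, max_steps=10000):
    """Walk back from the tail of the chain to the error that started it."""
    steps = 0
    while error.prev is not None:
        steps += 1
        if steps > max_steps:
            # Stop on absurdly long (or cyclic) chains and report the
            # error reached so far instead of looping forever.
            break
        error = error.prev
    return error


# The parser hands us the tail; the informative error is at the start.
root = ParseError("Entity 'nbsp' not defined")
middle = ParseError("Opening and ending tag mismatch", prev=root)
tail = ParseError("Premature end of data", prev=middle)
print(first_error(tail).message)
```

With the three-element chain above this prints the message of the first
error rather than that of the tail.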

Now, version 1.69 of XML::LibXML fails to provide the entire chain (I don't 
remember if it's always or only in some cases) and we get only the "tail" of 
it which leads to very confusing error messages like in the above mailing list 
message.

Version 1.70 on the other hand does provide the chain, but there are some 
cases that trigger extreme slowness (I gather) at the time it internally 
constructs the chain.

A lot of these errors in practical validator use are due to undefined
entities, because we don't let XML::LibXML fetch external entities.  We
don't let it do that because doing so would cause a lot of entity/DTD
fetching, and a potential security issue.  We could tell it to use XML
catalogs [0] to get around the first problem; that works and works around the
slowness issue in the most usual cases, but after that there's still the
security issue to tackle: XML::LibXML does not have an easy-to-use option that
we could use to "jail" it into a specific dir or set of dirs, which means it
could be tricked into loading things it shouldn't as external entities [1] [2].
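
For clarity, the kind of "jail" check I have in mind looks roughly like the
Python sketch below.  It is only an illustration of the path test itself;
is_inside_jail and the example directory are hypothetical names, and how to
hook such a check into XML::LibXML's entity loading is exactly the part that
remains unsolved.

```python
import os


def is_inside_jail(path, allowed_dirs):
    """Return True if path, after resolving symlinks and "..", lies under
    one of allowed_dirs."""
    real = os.path.realpath(path)
    for directory in allowed_dirs:
        root = os.path.realpath(directory)
        try:
            # commonpath compares whole path components, so a sibling
            # such as ".../dtds-evil" does not count as inside ".../dtds".
            if os.path.commonpath([real, root]) == root:
                return True
        except ValueError:
            # Raised e.g. for paths on different drives; treat as outside.
            continue
    return False


# Hypothetical directory holding the locally installed DTDs and entities.
jail = ["/usr/share/xml/validator-dtds"]
print(is_inside_jail("/usr/share/xml/validator-dtds/xhtml1-strict.dtd", jail))
print(is_inside_jail("/usr/share/xml/validator-dtds/../../../../etc/passwd", jail))
```

Resolving the path before comparing is the important part: a plain string
prefix test would accept the second path even though it points outside the
allowed directory.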

I've tried to use the things XML::LibXML provides for this purpose, and tried 
various ways to avoid entity expansion and/or to get errors resulting from it 
to be ignored, but my experiments have failed.  Unfortunately the exact 
details about the reasons for the failures escape me at the moment and I don't 
seem to have any code from these experiments hanging around that I could check 
:(

> * is this a bug new in 1.70?

I hope the above clarifies this part.

> if so, do you know if reverting to 1.69 is
> an option (i.e. do we rely on APIs that are specific to 1.70?)

If I remember correctly, version 0.8.6 would work with 1.69 (with the above 
caveats about 1.69).  Changes made to the CVS version after the 0.8.6 release 
(validator-0_8_6-release tag in CVS) however require version 1.70.

I don't have a good solution to this problem to offer right now.  Reverting to
1.69 should fix the slowness, but then again it would cause the other problems
outlined above.  Depending on how severe they are, an alternative to consider
could be to just disable the XML well-formedness checks altogether for now.
There were also XML::LibXML "developer" versions 1.69_1 and 1.69_2 between
1.69 and 1.70, but I don't remember trying those out myself so I can't say
much at all about them.

I'll eventually get to revisiting the entity expansion/jailing things with 
XML::LibXML and perhaps even trying out another XML parser, but unfortunately 
I cannot at the moment promise when that would be.

[0] I did a quick hack that generates one from the SGML open catalog we 
currently have, see misc/soc2xml.pl in CVS.
[1] http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.htm
[2] http://www.securiteam.com/securitynews/6D0100A5PU.html

Received on Wednesday, 9 June 2010 18:50:15 UTC