Re: Validator timeout and XML-LibXML bug

From: Dominique Hazael-Massieux <dom@w3.org>
Date: Thu, 10 Jun 2010 11:52:53 +0200
To: Ville Skyttä <ville.skytta@iki.fi>
Cc: public-qa-dev@w3.org
Message-ID: <1276163573.2081.24.camel@localhost>
On Wednesday 09 June 2010 at 21:49 +0300, Ville Skyttä wrote:
> > * is there an open bug matching this problem in our own bugzilla? I had
> > a quick look and didn't find one, but it might be hidden into another
> > bug report; if you think there is none, I'll create one
> 
> I don't remember if there's a bug report about this in Bugzilla.  But there is 
> at least this:
> http://lists.w3.org/Archives/Public/www-validator/2010Mar/0019.html

Thanks; I've created a bug in bugzilla to document the situation:
http://www.w3.org/Bugs/Public/show_bug.cgi?id=9899

At the very least, it can be used as a pointer for people asking what's
going on in www-validator.

> Whenever there's a parse error, XML::LibXML gives us a chain of errors.  We 
> are initially handed the last error in the chain, which often does not 
> convey much at all about the actual problem.  We need to iterate the chain 
> using $error->_prev() to get to the start of the chain, which is usually 
> where the actual error causing the rest of the chained ones is.
> 
> Now, version 1.69 of XML::LibXML fails to provide the entire chain (I don't 
> remember if it's always or only in some cases) and we get only the "tail" of 
> it which leads to very confusing error messages like in the above mailing list 
> message.
> 
> Version 1.70 on the other hand does provide the chain, but there are some 
> cases that trigger extreme slowness (I gather) at the time it internally 
> constructs the chain.

I hadn't managed to analyse it in these terms, but that does indeed seem
to match what I see when running the Perl debugger on those pages.
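
For reference, the chain walk described in the quoted text above could
look roughly like the following sketch. It is only an illustration:
MockError is a hypothetical stand-in for the error objects XML::LibXML
returns (only message() and _prev() are modelled), and
first_error_in_chain is a made-up helper name, not validator code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an XML::LibXML error object: only message()
# and _prev() are modelled, which is all the chain walk needs.
package MockError;

sub new {
    my ($class, %args) = @_;
    return bless { message => $args{message}, prev => $args{prev} }, $class;
}
sub message { return $_[0]{message} }
sub _prev   { return $_[0]{prev} }

package main;

# Follow the _prev() links from the error we were handed (the tail of the
# chain) back to the first error, which is usually the real cause.
sub first_error_in_chain {
    my ($error) = @_;
    while (defined(my $prev = $error->_prev)) {
        $error = $prev;
    }
    return $error;
}

# Build a three-element chain by hand; the parser would hand us $tail.
my $root = MockError->new(message => 'Entity "nbsp" not defined');
my $mid  = MockError->new(message => 'follow-up error', prev => $root);
my $tail = MockError->new(message => 'another follow-up error', prev => $mid);

print first_error_in_chain($tail)->message, "\n";
```

With 1.69 the walk would stop early whenever the parser does not supply
the whole chain, which is consistent with the confusing messages
reported on www-validator.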

> A lot of these errors in practical validator use are due to undefined 
> entities, because we don't let XML::LibXML fetch external entities.  We 
> don't let it do that because letting it do so would cause a lot of entity/DTD 
> fetching, and a potential security issue.  We could tell it to use XML 
> catalogs [0] to get around the first problem; that works and works around the 
> slowness issue in the most usual cases, but after that there's still the 
> security issue to tackle: XML::LibXML does not have an easy-to-use option that 
> we could use to "jail" it into a specific dir or set of dirs, which means it 
> could be tricked into loading things it shouldn't as external entities [1] [2].

Wouldn't the XML::LibXML parser option "ext_ent_handler" be a way to do
that jailing?
http://search.cpan.org/dist/XML-LibXML/lib/XML/LibXML/Parser.pod#ext_ent_handler
The code example there seems to suggest just that.
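
To make the idea concrete, here is a minimal sketch of such a jailing
handler in plain Perl. It assumes, as the linked documentation
describes, that the handler is called with the system ID and the public
ID and that whatever it returns is parsed as the entity content. The
names make_jailed_handler and is_inside_jail and the jail directory are
hypothetical, and the wiring into the parser is shown only as a comment
because it is untested against XML::LibXML 1.69 or 1.70.

```perl
use strict;
use warnings;
use Cwd qw(realpath);

# True if $path, once symlinks and ".." segments are resolved, lies
# strictly inside the directory $jail.
sub is_inside_jail {
    my ($path, $jail) = @_;
    my $real_path = eval { realpath($path) };
    my $real_jail = eval { realpath($jail) };
    return 0 unless defined $real_path && defined $real_jail;
    return index($real_path, "$real_jail/") == 0 ? 1 : 0;
}

# Build a handler that only serves entities stored under $jail and dies
# for everything else (other local files, http:// URLs, ...).
sub make_jailed_handler {
    my ($jail) = @_;
    return sub {
        my ($system_id, $public_id) = @_;
        (my $path = $system_id) =~ s{^file://}{};
        die "external entity outside jail refused: $system_id\n"
            unless is_inside_jail($path, $jail);
        open my $fh, '<', $path or die "cannot read $path: $!\n";
        local $/;               # slurp the whole file
        my $content = <$fh>;
        close $fh;
        return $content;
    };
}

# Assumed wiring (untested; the directory is a placeholder):
# my $parser = XML::LibXML->new(
#     expand_entities => 1,
#     ext_ent_handler => make_jailed_handler('/path/to/local/dtds'),
# );
```

Resolving the path with realpath before comparing is what keeps "../"
tricks and symlinks from escaping the directory; a plain string prefix
check on the system ID would not be enough.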

Thanks a lot for all your insights on this problem, and for taking the
time to document it so well here!

Dom
Received on Thursday, 10 June 2010 09:53:07 GMT
