- From: Ville Skyttä <ville.skytta@iki.fi>
- Date: Wed, 9 Jun 2010 21:49:35 +0300
- To: "Dominique Hazael-Massieux" <dom@w3.org>
- Cc: public-qa-dev@w3.org
On Wednesday 09 June 2010, Dominique Hazael-Massieux wrote:
> Hi Ville,

Hi,

> While investigating the source of the report of validator timeout (à la
> [1]), I found your bug report to the XML-libxml maintainers:
> https://rt.cpan.org/Public/Bug/Display.html?id=56671
> which I gather hasn't seen much progress in the past two months.

You're unfortunately right; not even an ack from the maintainer has been
received.

> * is there an open bug matching this problem in our own bugzilla? I had
> a quick look and didn't find one, but it might be hidden into another
> bug report; if you think there is none, I'll create one

I don't remember if there's a bug report about this in Bugzilla. But
there is at least this:
http://lists.w3.org/Archives/Public/www-validator/2010Mar/0019.html

Caution: the following is based on what I remember of the issue;
unfortunately it's been a while since I've had time to work on the
validator or really even follow the list, so it might not be completely
accurate.

Whenever there's a parse error, XML::LibXML gives us a chain of errors.
The error we are handed initially is the last one in the chain, which
often does not convey much at all about the actual problem. We need to
iterate the chain using $error->_prev() to get to the start of the chain,
where the actual error causing the rest of the chained ones usually is.

Now, version 1.69 of XML::LibXML fails to provide the entire chain (I
don't remember if it's always or only in some cases) and we get only the
"tail" of it, which leads to very confusing error messages like in the
above mailing list message. Version 1.70 on the other hand does provide
the chain, but there are some cases that trigger extreme slowness (I
gather) at the time it internally constructs the chain.

A lot of these errors in practical validator use are due to undefined
entities, because we don't let XML::LibXML fetch external entities.
We don't let it do that because letting it do so would cause a lot of
entity/DTD fetching, and a potential security issue. We could tell it to
use XML catalogs [0] to get around the first problem; that works and
works around the slowness issue in the most usual cases, but after that
there's still the security issue to tackle: XML::LibXML does not have an
easy-to-use option that we could use to "jail" it into a specific dir or
set of dirs, which means it could be tricked into loading things it
shouldn't as external entities [1] [2].

I've tried to use the things XML::LibXML provides for this purpose, and
tried various ways to avoid entity expansion and/or to get errors
resulting from it to be ignored, but my experiments have failed.
Unfortunately the exact details about the reasons for the failures escape
me at the moment and I don't seem to have any code from these experiments
hanging around that I could check :(

> * is this a bug new in 1.70?

I hope the above clarifies this part.

> if so, do you know if reverting to 1.69 is
> an option (i.e. do we rely on APIs that are specific to 1.70?)

If I remember correctly, version 0.8.6 would work with 1.69 (with the
above caveats about 1.69). Changes made to the CVS version after the
0.8.6 release (validator-0_8_6-release tag in CVS) however require
version 1.70.

I don't have a good solution to this problem to offer right now.
Reverting to 1.69 should fix the slowness, but then again it would cause
other problems as outlined above; depending on how severe they are, an
alternative to consider could be to just disable the XML well-formedness
checks altogether for now.

There were also XML::LibXML "developer" versions 1.69_1 and 1.69_2
between 1.69 and 1.70, but I don't remember trying those out myself so I
can't say much at all about them.

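For illustration only, the chain walk described earlier (start at the
tail, follow $error->_prev() back to the first error) can be modelled in
a few lines. This is a minimal sketch in Python, not XML::LibXML's Perl
API and not the validator's code: ParseError, its prev field,
root_cause() and the sample messages are all hypothetical names invented
here, with prev standing in for $error->_prev().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParseError:
    """Hypothetical stand-in for one link in a parser error chain."""
    message: str
    # Assumption: plays the role of $error->_prev(); None at the chain start.
    prev: Optional["ParseError"] = None


def root_cause(error: ParseError) -> ParseError:
    """Follow the prev links from the tail back to the first error."""
    while error.prev is not None:
        error = error.prev
    return error


# Illustrative chain: the parser hands back only the tail.
first = ParseError("undefined entity (illustrative message)")
second = ParseError("follow-on error 1 (illustrative)", prev=first)
tail = ParseError("follow-on error 2 (illustrative)", prev=second)

print(root_cause(tail).message)
```

If the chain is truncated, as described for 1.69, the same walk stops at
whatever link happens to be the earliest one still available, which is
why the reported message can end up unrelated to the real cause.
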
I'll eventually get to revisiting the entity expansion/jailing things
with XML::LibXML and perhaps even trying out another XML parser, but
unfortunately I cannot at the moment promise when that would be.

[0] I did a quick hack that generates one from the SGML open catalog we
    currently have, see misc/soc2xml.pl in CVS.
[1] http://searchsecuritychannel.techtarget.com/generic/0,295582,sid97_gci1304703,00.htm
[2] http://www.securiteam.com/securitynews/6D0100A5PU.html
Received on Wednesday, 9 June 2010 18:50:15 UTC