Using SAX filters within the markup validator

In http://lists.w3.org/Archives/Public/www-archive/2005Sep/0001 I
demonstrated a (rough) Perl SAX filter that produces the outline of a
document. Earlier discussions and reading made me think that this
could be a good way to create the outline with the 0.8+ version of the
markup validator, running S::P::O.

However, unless I missed an option or misunderstood how to use the
SAX filter, this method chokes very easily on tag soup. That is a
real problem, since the validator's input is seldom even well-formed.
It apparently also chokes (far too easily) on comments, although this
may well be a mistake in how I coded the filter. In addition, the
content needs to be transcoded to UTF-8 before it is sent through the
SAX pipe.

Could it be that we will have to give up on the idea of using SAX
filters, since our input is so loose? (I suppose we could use
HTML::Parser, or a subclass of it, instead.) Or are you aware of ideas
or ways to reconcile our input with the strictness of SAX processing?
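
To make the contrast concrete, here is a minimal sketch in Python
rather than Perl. It is not the validator's code or my filter:
Python's xml.sax stands in for the strict SAX pipeline, html.parser
stands in for a lenient parser in the spirit of HTML::Parser, and all
class and function names are made up for illustration.

```python
import xml.sax
from html.parser import HTMLParser


def heading_level(tag):
    """Return 1..6 for h1..h6 (any case), or None for other tags."""
    tag = tag.lower()
    if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
        return int(tag[1])
    return None


class OutlineCollector:
    """Shared bookkeeping: records (level, text) for each heading."""

    def reset_outline(self):
        self.outline = []
        self._level = None
        self._text = []

    def open_tag(self, tag):
        level = heading_level(tag)
        if level is not None:
            self._level, self._text = level, []

    def text(self, data):
        if self._level is not None:
            self._text.append(data)

    def close_tag(self, tag):
        if self._level is not None and heading_level(tag) == self._level:
            self.outline.append((self._level, "".join(self._text).strip()))
            self._level = None


class SaxOutline(xml.sax.ContentHandler, OutlineCollector):
    """Strict variant: only works on well-formed input."""

    def __init__(self):
        super().__init__()
        self.reset_outline()

    def startElement(self, name, attrs):
        self.open_tag(name)

    def characters(self, content):
        self.text(content)

    def endElement(self, name):
        self.close_tag(name)


class SoupOutline(HTMLParser, OutlineCollector):
    """Lenient variant: tolerates unclosed and mismatched tags."""

    def __init__(self):
        super().__init__()
        self.reset_outline()

    def handle_starttag(self, tag, attrs):
        self.open_tag(tag)

    def handle_data(self, data):
        self.text(data)

    def handle_endtag(self, tag):
        self.close_tag(tag)


def outline_strict(markup):
    """Raises xml.sax.SAXParseException if markup is not well-formed."""
    handler = SaxOutline()
    # The content is handed to the SAX parser as UTF-8 bytes.
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.outline


def outline_lenient(markup):
    parser = SoupOutline()
    parser.feed(markup)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    soup = "<html><body><H1>Title</H1><p>unclosed<br><h2>Section</h2></body>"
    print(outline_lenient(soup))
    try:
        outline_strict(soup)
    except xml.sax.SAXParseException as err:
        print("strict parser gave up:", err)
```

The point of the sketch is only that the outline logic itself is the
same in both cases; what differs is whether the parser in front of it
tolerates input that is not well-formed.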

-- 
olivier 

Received on Monday, 24 October 2005 05:35:45 UTC