- From: Nick Kew <nick@webthing.com>
- Date: Mon, 24 Oct 2005 10:53:57 +0100
- To: QA-dev Dev <public-qa-dev@w3.org>
On Monday 24 October 2005 06:35, you wrote:

> in http://lists.w3.org/Archives/Public/www-archive/2005Sep/0001

Hmmm, I don't recollect that.

> I
> demo'd a (rough) perl SAX filter to produce the outline of a
> document. Prior discussions and reading made me think that this could
> be a good way to create the outline with the 0.8+ version of the
> markup validator, running S::P::O.

Hmmm. OpenSP is a SAX parser; libxml2 provides a SAX filter used in
many of my tools (including AccessValet). Both work fairly well to
generate document outlines. Or am I missing something?

> However, unless I missed some option or misunderstood the way to use
> the SAX filter, this method seems to choke very easily on tag soup,
> which seems to be rather problematic, since the input of the
> validator is rather seldom even well formed.

If you tried it with pure-XML SAX then of course it'll fall over on
most of the web. I find libxml2's HTMLparser the easiest to use
for HTML. Except in the context of _validating_ SGML/HTML, where
of course OpenSP is the only show in town.

> It even apparently
> chokes (way too) easily on comments, although this may well be a
> mistake in how I coded the filter. The content also needs to be
> transcoded to utf-8 before sending it through the SAX pipe.

Not if the SAX filter accepts other encodings and will transcode
them internally. Both OpenSP and libxml2 can do that.

> Could it be that we will have to give up on the idea of using sax
> filters, since our input is so loose? (I suppose we could use
> HTML::Parser or subclasses thereof, instead) Or are you aware of
> ideas or ways to reconcile our input and the strictness of SAX
> processing?

Perhaps if I had a clearer idea what you're aiming for?

-- 
Nick Kew
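
Illustration only, not part of the original message: the idea discussed above, building a heading outline from event callbacks with a parser that tolerates tag soup, can be sketched roughly as follows. This uses Python's standard-library `html.parser` as a stand-in. It is not the Perl filter, OpenSP, or libxml2's HTMLparser mentioned in the thread, and the names `OutlineParser` and `outline` are hypothetical.

```python
from html.parser import HTMLParser


class OutlineParser(HTMLParser):
    """Collect (level, title) pairs for h1..h6 from possibly malformed HTML."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.outline = []      # list of (level, title) tuples
        self._level = None     # level of the heading currently open, if any
        self._text = []        # text fragments seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            # A new heading implicitly ends any heading left unclosed.
            self._flush()
            self._level = int(tag[1])
            self._text = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self._flush()

    def handle_data(self, data):
        # Only text inside an open heading contributes to the outline.
        if self._level is not None:
            self._text.append(data)

    def _flush(self):
        if self._level is not None:
            title = " ".join("".join(self._text).split())
            self.outline.append((self._level, title))
            self._level = None

    def close(self):
        super().close()
        self._flush()  # record a heading still open at end of input


def outline(html):
    """Return the heading outline of an HTML string."""
    parser = OutlineParser()
    parser.feed(html)
    parser.close()
    return parser.outline


if __name__ == "__main__":
    soup = ("<html><body><h1>Title<!-- a comment --></h1>"
            "<p>unclosed <b>bold<h2>Sub <i>section</h2><p>text")
    for level, title in outline(soup):
        print("  " * (level - 1) + title)
```

Comments are simply skipped here because the default comment handler does nothing, and unclosed inline elements do not stop the parse. This sketch takes an already-decoded string, so it does not address the transcoding question raised above.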
Received on Monday, 24 October 2005 09:53:03 UTC