Re: using sax filters within the markup validator

From: Nick Kew <nick@webthing.com>
Date: Mon, 24 Oct 2005 10:53:57 +0100
To: QA-dev Dev <public-qa-dev@w3.org>
Message-Id: <200510241054.00346.nick@webthing.com>

On Monday 24 October 2005 06:35, you wrote:
> in http://lists.w3.org/Archives/Public/www-archive/2005Sep/0001

Hmmm, I don't recollect that.

> I demo'd a (rough) Perl SAX filter to produce the outline of a
> document. Prior discussions and reading made me think that this could
> be a good way to create the outline with the 0.8+ version of the
> markup validator, running S::P::O.

Hmmm.  OpenSP is a SAX parser; libxml2 provides a SAX filter used in
many of my tools (including AccessValet).  Both work fairly well to
generate document outlines.  Or am I missing something?

> However, unless I missed some option or misunderstood the way to use
> the SAX filter, this method seems to choke very easily on tag soup,
> which is rather problematic, since the input of the validator is
> seldom even well-formed.

If you tried it with pure-XML SAX then of course it'll fall over on most of
the web.  I find libxml2's HTMLparser the easiest to use for HTML.  Except
in the context of _validating_ SGML/HTML, where of course OpenSP is the
only show in town.
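
A minimal sketch of that difference, assuming Python and its standard
library as stand-ins (xml.sax for a pure-XML SAX parser, html.parser for a
tolerant event-driven HTML parser). It is not the validator's Perl code and
does not use OpenSP or libxml2; the names OutlineParser, build_outline,
xml_sax_accepts and TAG_SOUP are hypothetical, chosen for illustration only.

```python
import xml.sax
from html.parser import HTMLParser

# Deliberately malformed input: unclosed <p> and <b>, no root element,
# and a final heading that is never closed.
TAG_SOUP = "<h1>Title</h1><p>unclosed para<h2>Sub <b>bold</h2><!-- note --><h3>Deep"


class OutlineParser(HTMLParser):
    """Collect (level, text) pairs for h1..h6 from event callbacks."""

    HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.outline = []
        self._level = None   # level of the heading currently open, if any
        self._text = []      # text chunks seen inside that heading

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADINGS:
            self.flush()     # a new heading implicitly ends the previous one
            self._level = int(tag[1])
            self._text = []

    def handle_endtag(self, tag):
        if tag in self.HEADINGS:
            self.flush()

    def handle_data(self, data):
        if self._level is not None:
            self._text.append(data)

    def flush(self):
        """Record the open heading, if any, with whitespace normalised."""
        if self._level is not None:
            text = " ".join("".join(self._text).split())
            self.outline.append((self._level, text))
            self._level = None


def build_outline(markup):
    parser = OutlineParser()
    parser.feed(markup)
    parser.close()
    parser.flush()           # pick up a heading left open at end of input
    return parser.outline


def xml_sax_accepts(markup):
    """Return True if a strict XML SAX parse gets through without a fatal error."""
    try:
        xml.sax.parseString(markup.encode("utf-8"), xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False
```

With this input the strict XML parse is rejected, while the tolerant parser
still yields a three-entry outline. Neither parser validates anything, so
this says nothing about the validation case.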

> It even apparently chokes (way too) easily on comments, although this
> may well be a mistake in how I coded the filter. The content also needs
> to be transcoded to UTF-8 before sending it through the SAX pipe.

Not if the SAX filter accepts other encodings and will transcode them
internally.  Both OpenSP and libxml2 can do that.
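
As a rough illustration of a parser transcoding internally, here is a sketch
that again assumes Python's expat-based xml.sax as a stand-in rather than
OpenSP or libxml2. The names TextCollector, sax_text and LATIN1_DOC are
hypothetical. The input bytes are ISO-8859-1 and are never converted by the
caller; the parser reads the encoding declaration and hands decoded text to
the handler.

```python
import xml.sax


class TextCollector(xml.sax.ContentHandler):
    """Accumulate the character data reported by the parser."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        # content arrives as already-decoded text, whatever the input encoding
        self.chunks.append(content)


def sax_text(raw_bytes):
    """Parse raw bytes and return the concatenated character data."""
    handler = TextCollector()
    xml.sax.parseString(raw_bytes, handler)
    return "".join(handler.chunks)


# ISO-8859-1 bytes with a matching encoding declaration; no UTF-8 step.
LATIN1_DOC = '<?xml version="1.0" encoding="iso-8859-1"?><h1>café</h1>'.encode("iso-8859-1")
```

Calling sax_text(LATIN1_DOC) returns the heading text as an ordinary string.
This only works when the declared (or otherwise detected) encoding is one the
parser supports.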

> Could it be that we will have to give up on the idea of using SAX
> filters, since our input is so loose? (I suppose we could use
> HTML::Parser or subclasses thereof instead.) Or are you aware of
> ideas or ways to reconcile our input with the strictness of SAX
> processing?

Perhaps if I had a clearer idea what you're aiming for?

-- 
Nick Kew
Received on Monday, 24 October 2005 09:53:03 GMT