
Re: using sax filters within the markup validator

From: Bjoern Hoehrmann <derhoermi@gmx.net>
Date: Mon, 24 Oct 2005 19:43:03 +0200
To: olivier Thereaux <ot@w3.org>
Cc: QA-dev Dev <public-qa-dev@w3.org>
Message-ID: <en5ql1dsfhp1pqfep4p6s93btgc6c1eh2n@hive.bjoern.hoehrmann.de>

* olivier Thereaux wrote:
>However, unless I missed some option or misunderstood the way to use  
>the SAX filter, this method seems to choke very easily on tag soup,  
>which seems to be rather problematic, since the input of the  
>validator is rather seldom even well formed. It even apparently  
>chokes (way too) easily on comments, although this may well be a  
>mistake in how I coded the filter. The content also needs to be  
>transcoded to utf-8 before sending it through the SAX pipe.

Well, if you have <h1>a<span>b</h1>c</span>d</h1> what is the heading?
If you have a well-formed event stream it's easy, all characters between
the h1 start_element and h1 end_element. If it is not well-formed, you
would have to sort such issues out in the filter, which is a waste of
resources. So the question would become whether to use an XML processor
that turns the markup above into a well-formed event stream or one that
stops processing when encountering the first </h1>. All processors can
do the latter, few can do the former. I think OpenSP's event stream is
well-formed even for ill-formed input, libxml2 and maybe xerces should
be able to do something similar. What the Validator does is not really
relevant to the filter design though, as we would not use a processor
that does not guarantee a well-formed event stream.
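
To illustrate the "easy" case, here is a minimal sketch of a filter that
relies on a well-formed event stream. It uses Python's xml.sax (backed by
expat) purely as a stand-in, not the processor or filter API discussed in
this thread; the names HeadingCollector and collect_headings are
hypothetical. expat is a processor of the second kind: it stops at the
first mismatched </h1> instead of repairing the stream.

```python
import xml.sax


class HeadingCollector(xml.sax.ContentHandler):
    """Collect the text of each h1, assuming a well-formed event stream."""

    def __init__(self):
        super().__init__()
        self.headings = []       # text of every completed h1
        self._in_heading = False
        self._parts = []         # character chunks of the current h1

    def startElement(self, name, attrs):
        if name == "h1":
            self._in_heading = True
            self._parts = []

    def characters(self, content):
        # Everything between the h1 start and end events belongs to the
        # heading, including text inside child elements such as span.
        if self._in_heading:
            self._parts.append(content)

    def endElement(self, name):
        if name == "h1":
            self._in_heading = False
            self.headings.append("".join(self._parts))


def collect_headings(markup):
    """Hypothetical helper: parse markup and return the list of h1 texts."""
    handler = HeadingCollector()
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.headings


if __name__ == "__main__":
    # Well-formed input: the filter needs no error recovery of its own.
    print(collect_headings("<body><h1>a<span>b</span>c</h1>d</body>"))

    # Ill-formed input: this processor gives up at the first </h1>.
    try:
        collect_headings("<h1>a<span>b</h1>c</span>d</h1>")
    except xml.sax.SAXParseException as err:
        print("parser stopped:", err.getMessage())
```

The filter itself contains no recovery logic at all; whether the second
input yields a heading depends entirely on the processor in front of it.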

Validity problems might be more relevant, in <h1>a<h2>b</h2>c</h1> what
are the headings? Or we might ask whether it makes sense to run checks
that go beyond DTD-validity if the document is not DTD-valid. The
more error recovery in the process, the more confusing the result...
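
A rough sketch of why the nested case is ambiguous, again using Python's
xml.sax only as a stand-in and with hypothetical names
(NestedHeadingCollector, headings, include_nested): the same well-formed
but invalid input gives different heading texts depending on whether a
heading is taken to include the text of headings nested inside it.

```python
import xml.sax

HEADING_NAMES = {"h1", "h2", "h3", "h4", "h5", "h6"}


class NestedHeadingCollector(xml.sax.ContentHandler):
    """Collect heading text under one of two possible interpretations."""

    def __init__(self, include_nested):
        super().__init__()
        self.include_nested = include_nested
        self.results = []   # (element name, text), in closing order
        self._open = []     # stack of (name, text parts) for open headings

    def startElement(self, name, attrs):
        if name in HEADING_NAMES:
            self._open.append((name, []))

    def characters(self, content):
        if not self._open:
            return
        # Either every open heading receives the text, or only the
        # innermost one does.
        targets = self._open if self.include_nested else self._open[-1:]
        for _, parts in targets:
            parts.append(content)

    def endElement(self, name):
        if name in HEADING_NAMES:
            open_name, parts = self._open.pop()
            self.results.append((open_name, "".join(parts)))


def headings(markup, include_nested):
    """Hypothetical helper returning (name, text) pairs for all headings."""
    handler = NestedHeadingCollector(include_nested)
    xml.sax.parseString(markup.encode("utf-8"), handler)
    return handler.results


if __name__ == "__main__":
    sample = "<h1>a<h2>b</h2>c</h1>"
    print(headings(sample, include_nested=True))
    print(headings(sample, include_nested=False))
```

Neither reading is obviously the right one, which is the point: any check
built on top of it has to pick an interpretation for invalid input.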
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/ 
Received on Monday, 24 October 2005 17:43:08 GMT
