- From: Bjoern Hoehrmann <derhoermi@gmx.net>
- Date: Mon, 24 Oct 2005 19:43:03 +0200
- To: olivier Thereaux <ot@w3.org>
- Cc: QA-dev Dev <public-qa-dev@w3.org>
* olivier Thereaux wrote:
>However, unless I missed some option or misunderstood the way to use
>the SAX filter, this method seems to choke very easily on tag soup,
>which seems to be rather problematic, since the input of the
>validator is rather seldom even well formed. It even apparently
>chokes (way too) easily on comments, although this may well be a
>mistake in how I coded the filter. The content also needs to be
>transcoded to utf-8 before sending it through the SAX pipe.

Well, if you have <h1>a<span>b</h1>c</span>d</h1> what is the heading?
If you have a well-formed event stream it's easy: all characters
between the h1 start_element and the h1 end_element. If it is not
well-formed, you would have to sort such issues out in the filter,
and that's a waste of resources.

So the question would become whether to use an XML processor that
turns the markup above into a well-formed event stream or one that
stops processing when encountering the first </h1>. All processors can
do the latter, few can do the former. I think OpenSP's event stream is
well-formed even for ill-formed input, and libxml2 and maybe xerces
should be able to do something similar.

What the Validator does is not really relevant to the filter design
though, as we would not use a processor that does not guarantee a
well-formed event stream. Validity problems might be more relevant: in
<h1>a<h2>b</h2>c</h1> what are the headings? Or we might ask whether it
makes sense to run checks that go beyond DTD-validity if the document
is not DTD-valid. The more error recovery in the process, the more
confusing the result...
-- 
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
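
For illustration, a minimal sketch of the "all characters between the
h1 start_element and the h1 end_element" approach, written here against
Python's xml.sax rather than whatever SAX implementation the Validator's
filter actually uses; the handler name and the sample input are
assumptions, and it only works because the parser is fed well-formed
markup:

    import xml.sax

    class HeadingFilter(xml.sax.ContentHandler):
        """Collect the character data of each <h1> from a well-formed
        event stream."""
        def __init__(self):
            super().__init__()
            self.in_h1 = 0       # nesting depth of open <h1> elements
            self.buffer = []     # characters seen inside the current <h1>
            self.headings = []   # finished headings

        def startElement(self, name, attrs):
            if name == "h1":
                self.in_h1 += 1

        def characters(self, content):
            if self.in_h1:
                self.buffer.append(content)

        def endElement(self, name):
            if name == "h1":
                self.in_h1 -= 1
                if self.in_h1 == 0:
                    self.headings.append("".join(self.buffer))
                    self.buffer = []

    # Well-formed input: the filter never has to repair anything.
    handler = HeadingFilter()
    xml.sax.parseString(b"<body><h1>a<span>b</span>c</h1>d</body>", handler)
    print(handler.headings)   # ['abc']

Feed the tag-soup example above to the same parser and it simply stops
at the mismatched </h1> with a parse error, which is exactly the choice
between the two kinds of processors discussed: one that repairs the
stream into well-formed events, or one that gives up at the first
error.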
Received on Monday, 24 October 2005 17:43:08 UTC