- From: John Cowan <cowan@ccil.org>
- Date: Thu, 2 Nov 2006 11:29:24 -0500
- To: Norman Walsh <Norman.Walsh@Sun.COM>
- Cc: www-tag@w3.org
Norman Walsh scripsit:

> | "TagSoup also includes a command-line processor
> | that reads HTML files and can generate either
> | clean HTML or well-formed XML that is a close
> | approximation to XHTML."
> |
> | 1.) Why a "generate ... a close approximation XHTML?" Doesn't it need to
> | to "generate XHTML?"
>
> I wonder if John reads this list. John?

I haven't been, but I have now joined (at least for the duration) and I've read this thread from the archives.

> My guess is that it has to do
> with rules that XHTML imposes but that aren't easy to deduce from a
> random stream of tags, but I could be wrong.

There are a variety of reasons why TagSoup output is not necessarily valid XHTML.

For one thing, I do not attempt to supply default contents for required attributes such as image/@alt. As Bjoern points out, there isn't any obvious way to do so that adds actual value rather than merely mechanical validity. Ditto for required elements (head, title, and body), although if you use a head-only element I will supply a head parent, and ditto for body-only elements. Nor do I filter out unknown attributes or elements. (TagSoup doesn't understand XML-style namespace declarations and uses a private hack instead, but that's mere implementation laziness.)

Secondly, TagSoup's schema language does not model occurrence or ordering constraints (it's not clear that doing the latter is even feasible in a simple streaming parser). All content models are of the forms "(foo|bar|baz|...)" or "(#PCDATA|foo|bar|...)" or "EMPTY". If you feed it HTML with multiple body elements, they will remain present in the output. Exclusion rules like "no a element within another a element" are likewise not modeled.

Thirdly, TagSoup does not include an encoding guesser; it assumes that the encoding is correctly supplied by the environment. That's a simple implementation restriction, and there is a hook provided for such a thing; I don't package the Mozilla charset detector because it is about ten times as big as TagSoup itself.

The main purpose of TagSoup is to bring every HTML document (modulo encoding issues) into the realm of well-formed XML, so that the powerful tools of the XML environment can be applied to it. I provide TSaxon, a minor repackaging of Saxon 6 with an option allowing HTML input; others have used TagSoup with XQuery, with XOM, and doubtless with many other XML libraries and applications. Secondary to that, it can be used to change messy HTML into clean HTML (it supports both HTML and XML output modes on the same lines as XSLT).

> | 2.) Secondly (and you may no know this and maybe I shouldn't even be
> | asking on the list, but...) how do I use TagSoup on a Windows machine?

For the record, I develop TagSoup on Cygwin and run it on Windows routinely, using the Sun JVM. Note: Java 5.0's buggy packaged XSLT makes it impossible to build TagSoup from source without installing a corrected version; I use Java 1.4 for builds instead. The bug does not cause a problem at runtime, as XSLT support is not required.

The home page is http://tagsoup.info .

-- 
John Cowan  cowan@ccil.org  http://ccil.org/~cowan
The penguin geeks is happy / As under the waves they lark
The closed-source geeks ain't happy / They sad cause they in the dark
But geeks in the dark is lucky / They in for a worser treat
One day when the Borg go belly-up / Guess who wind up on the street.
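
The point in the message about content models of the form "(foo|bar|baz|...)" can be illustrated with a small sketch. This is not TagSoup's schema code: the class name ContentModelSketch, the element table, and the methods allows and accepts are all hypothetical, chosen only to show why a membership-only model cannot reject a second body element or an a element nested in another a element.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch; not TagSoup's actual schema classes or data.
final class ContentModelSketch {
    // Stand-in name for #PCDATA in a content model.
    static final String PCDATA = "#PCDATA";

    // Each element maps to the set of children it may contain.
    // An empty set plays the role of EMPTY. The table is a toy
    // assumption, not the real HTML content models.
    static final Map<String, Set<String>> MODEL = Map.of(
        "html", Set.of("head", "body"),
        "head", Set.of("title"),
        "title", Set.of(PCDATA),
        "body", Set.of(PCDATA, "a", "img"),
        // "a" appears in its own model because no exclusion rule removes it.
        "a", Set.of(PCDATA, "a", "img"),
        "img", Set.of()
    );

    // True if child is a member of parent's content model.
    static boolean allows(String parent, String child) {
        return MODEL.getOrDefault(parent, Set.of()).contains(child);
    }

    // True if every child is individually allowed; order and
    // number of occurrences are never consulted.
    static boolean accepts(String parent, List<String> children) {
        return children.stream().allMatch(c -> allows(parent, c));
    }

    public static void main(String[] args) {
        // Two body elements, in the wrong order: still accepted.
        System.out.println(accepts("html", List.of("body", "head", "body"))); // true
        // An a element inside another a element: still accepted.
        System.out.println(allows("a", "a"));                                 // true
        // An EMPTY element accepts nothing, not even text.
        System.out.println(allows("img", PCDATA));                            // false
    }
}
```

Because each child is checked on its own against a set, the check needs no lookahead and no memory of earlier siblings, which fits a simple streaming parser. Occurrence, ordering, and exclusion constraints would all need extra state that this form of model does not carry.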
Received on Thursday, 2 November 2006 16:35:26 UTC