Re: New TAG issue: TagSoupIntegration-54 from John Cowan on 2006-11-02 (www-tag@w3.org from November 2006)

From: John Cowan <cowan@ccil.org>
Date: Thu, 2 Nov 2006 11:29:24 -0500
To: Norman Walsh <Norman.Walsh@Sun.COM>
Cc: www-tag@w3.org
Message-ID: <20061102162924.GC22033@ccil.org>

Norman Walsh scripsit:

> | 	"TagSoup also includes a command-line processor 
> | 	that reads HTML files and can generate either 
> | 	clean HTML or well-formed XML that is a close 
> | 	approximation to XHTML."
> |
> | 1.) Why "generate ... a close approximation to XHTML"?  Doesn't it need
> | to "generate XHTML"?
> 
> I wonder if John reads this list. John? 

I haven't been, but I have now joined (at least for the duration) and
I've read this thread from the archives.

> My guess is that it has to do
> with rules that XHTML imposes but that aren't easy to deduce from a
> random stream of tags, but I could be wrong.

There are a variety of reasons why TagSoup output is not necessarily
valid XHTML.  For one thing, I do not attempt to supply default contents
for required attributes such as image/@alt.  As Bjoern points out, there
isn't any obvious way to do so that adds actual value rather than merely
mechanical validity.  Likewise, I do not supply missing required elements
(head, title, and body), although if the input uses a head-only element I
will supply a head parent, and similarly for body-only elements.  Nor do I
filter out unknown attributes
or elements.  (TagSoup doesn't understand XML-style namespace declarations
and uses a private hack instead, but that's mere implementation laziness.)

Secondly, TagSoup's schema language does not model occurrence or
ordering constraints (it's not clear that doing the latter is even
feasible in a simple streaming parser).  All content models are of the
forms "(foo|bar|baz|...)" or "(#PCDATA|foo|bar|...)" or "EMPTY".  If you
feed it HTML with multiple body elements, they will remain present in
the output.  Exclusion rules like "no a element within another a element"
are likewise not modeled.
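
To illustrate why a membership-only content model lets such documents
through, here is a minimal sketch in Python.  It is not TagSoup's code or
its schema format; the table, the element subset, and the function name
violations() are all hypothetical.  Each element maps to the set of
children it may contain, and nothing records how often or in what order
they occur, or what the ancestors are.

```python
# Hypothetical membership-only content models (not TagSoup's real schema).
PCDATA = "#PCDATA"

# Each element maps to the set of permitted children; an empty set plays
# the role of EMPTY.  No counts, no ordering, no ancestor exclusions.
CONTENT_MODELS = {
    "html": {"head", "body"},
    "head": {"title"},
    "title": {PCDATA},
    "body": {PCDATA, "a", "b", "br"},
    "a": {PCDATA, "b", "br"},
    "b": {PCDATA, "a", "b", "br"},
    "br": set(),
}


def violations(node):
    """Return (parent, child) pairs that the membership check rejects.

    A node is a (name, children) tuple; a child is either another node
    or a plain string standing for character data.
    """
    name, children = node
    found = []
    for child in children:
        is_text = isinstance(child, str)
        child_name = PCDATA if is_text else child[0]
        if child_name not in CONTENT_MODELS.get(name, set()):
            found.append((name, child_name))
        if not is_text:
            found.extend(violations(child))
    return found


# Two body elements: each one is an allowed child of html, so nothing is flagged.
two_bodies = ("html", [("body", []), ("body", [])])

# An a element inside a b inside another a: every parent/child pair is
# allowed, so the "no a within a" exclusion goes unnoticed.
nested_anchor = ("body", [("a", [("b", [("a", ["link"])])])])

# Text inside an EMPTY element is the kind of error this model does catch.
text_in_br = ("body", [("br", ["oops"])])

print(violations(two_bodies))
print(violations(nested_anchor))
print(violations(text_in_br))
```

Catching the first two cases would require tracking occurrence counts
and the open-element stack, which is exactly what this kind of model
leaves out.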

Thirdly, TagSoup does not include an encoding guesser; it assumes that
the encoding is correctly supplied by the environment.  That's a simple
implementation restriction, and there is a hook provided for such a
thing; I don't package the Mozilla charset detector because it is about
ten times as big as TagSoup itself.
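
The general shape of such a hook can be sketched as follows, in Python
for brevity.  This is an assumption-laden illustration, not TagSoup's
actual interface: decode_html(), its parameters, and bom_detector() are
hypothetical names.  The caller-supplied encoding is trusted unless an
optional detector offers a guess.

```python
import codecs
import locale
from typing import Callable, Optional

# A detector takes the raw bytes and returns an encoding name, or None
# if it has no opinion.  (Hypothetical interface, for illustration only.)
Detector = Callable[[bytes], Optional[str]]


def decode_html(raw: bytes,
                declared: Optional[str] = None,
                detector: Optional[Detector] = None) -> str:
    """Decode raw HTML bytes to text.

    Without a detector, the encoding supplied by the environment is
    assumed to be correct: either the declared one or the locale default.
    """
    guess = detector(raw) if detector is not None else None
    encoding = guess or declared or locale.getpreferredencoding(False)
    return raw.decode(encoding, errors="replace")


def bom_detector(raw: bytes) -> Optional[str]:
    """A deliberately tiny stand-in for a real guesser: byte-order marks only."""
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # this codec strips the BOM while decoding
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"  # this codec reads the BOM to pick the byte order
    return None


latin1_page = "caf\u00e9".encode("latin-1")
utf8_bom_page = codecs.BOM_UTF8 + "caf\u00e9".encode("utf-8")

# No detector: the declared encoding is simply trusted.
print(decode_html(latin1_page, declared="latin-1"))
# With a detector: its guess overrides a wrong declaration.
print(decode_html(utf8_bom_page, declared="latin-1", detector=bom_detector))
```

A full charset detector would plug in at the same point as
bom_detector() here, which keeps its size out of the core parser.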

The main purpose of TagSoup is to bring every HTML document (modulo
encoding issues) into the realm of well-formed XML, so that the powerful
tools of the XML environment can be applied to it.  I provide TSaxon, a
minor repackaging of Saxon 6 with an option allowing HTML input; others
have used TagSoup with XQuery, with XOM, and doubtless with many other
XML libraries and applications.

Secondary to that, it can be used to change messy HTML into clean HTML (it
supports both HTML and XML output modes, along the same lines as the
output methods of XSLT).

> | 2.) Secondly (and you may not know this and maybe I shouldn't even be
> | asking on the list, but...) how do I use TagSoup on a Windows machine?

For the record, I develop TagSoup on Cygwin and run it on Windows
routinely, using the Sun JVM.  Note: Java 5.0's buggy packaged XSLT
makes it impossible to build TagSoup from source without installing a
corrected version; I use Java 1.4 for builds instead.  The bug does not
cause a problem at runtime, as XSLT support is not required.

The home page is http://tagsoup.info .

-- 
John Cowan  cowan@ccil.org  http://ccil.org/~cowan
The penguin geeks is happy / As under the waves they lark
The closed-source geeks ain't happy / They sad cause they in the dark
But geeks in the dark is lucky / They in for a worser treat
One day when the Borg go belly-up / Guess who wind up on the street.
Received on Thursday, 2 November 2006 16:35:26 UTC