Re: XHTML character entity support

Ian Hickson scripsit:

> TagSoup could be made more compatible with existing deployed content, 
> then. It might be compatible enough for most purposes already, but there 
> are pages on the Web that depend on the <head> element being always 
> present. Also, the <ul> element should certainly not be implied.

TagSoup is not intended for deployment in browsers.  Rather, it generates
SAX events based on HTML input, permitting fairly arbitrary HTML to
be processed using XML tools such as XSLT.  It guarantees, therefore,
that the output is well-formed XML (except for encoding issues) rather
than that it conforms to any specific schema.  If you don't like what
TagSoup outputs, you can always transform the output further until the
result is more like what you expect.
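
As a rough illustration of transforming the output further, here is a
minimal sketch of a SAX filter that drops one wrapper element (say, an
implied <ul>) while keeping its children.  It is written in Python and
uses the standard library's XML parser as a stand-in for TagSoup, which
is a Java library; the names UnwrapFilter and unwrap are made up for
this example and are not part of TagSoup.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class UnwrapFilter(XMLFilterBase):
    """Hypothetical filter: swallow the start/end events of one element
    name and pass every other event through unchanged."""

    def __init__(self, parent, name):
        super().__init__(parent)
        self._name = name

    def startElement(self, name, attrs):
        # Forward the event unless it belongs to the unwanted wrapper.
        if name != self._name:
            super().startElement(name, attrs)

    def endElement(self, name):
        if name != self._name:
            super().endElement(name)


def unwrap(xml_text, name):
    """Parse well-formed XML (standing in for TagSoup's output) and
    re-serialize it without the element called `name`."""
    out = io.StringIO()
    reader = UnwrapFilter(xml.sax.make_parser(), name)
    reader.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return out.getvalue()


if __name__ == "__main__":
    print(unwrap("<html><body><ul><li>a</li></ul></body></html>", "ul"))
```

The same shape of filter could add or rename elements instead; the
point is only that the events can be massaged after the fact.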

In particular, there are absolutely no guarantees that CSS paths or
JavaScript DOM references that work on the HTML will continue to work
on the XML; they probably won't.

In principle it would be possible to use an implementation of the HTML5
algorithm to construct a DOM and then use a simple DOM walker to read
out SAX events, but this would be much more heavyweight in time and
space than TagSoup is, so I imagine TagSoup will continue to be used.
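
The DOM-walker approach can be sketched as follows, assuming Python's
xml.dom.minidom as a stand-in for a DOM built by an HTML5 parser (no
HTML5 implementation is used here, and the function names are made up
for the example).  Note that the whole tree has to exist in memory
before the first event is emitted, which is where the extra cost comes
from.

```python
import io
from xml.dom import Node, minidom
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl


def walk(node, handler):
    """Recursively emit SAX events for a DOM node and its descendants."""
    if node.nodeType == Node.ELEMENT_NODE:
        attrs = AttributesImpl(dict(node.attributes.items()))
        handler.startElement(node.tagName, attrs)
        for child in node.childNodes:
            walk(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType in (Node.TEXT_NODE, Node.CDATA_SECTION_NODE):
        handler.characters(node.data)
    # Comments and processing instructions are ignored in this sketch.


def dom_to_sax(document, handler):
    """Read a finished DOM document out as a stream of SAX events."""
    handler.startDocument()
    walk(document.documentElement, handler)
    handler.endDocument()


if __name__ == "__main__":
    doc = minidom.parseString('<html><body class="x"><p>hi</p></body></html>')
    out = io.StringIO()
    dom_to_sax(doc, XMLGenerator(out))
    print(out.getvalue())
```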

> Also, on another note, TagSoup is not compliant with HTML4 if it doesn't 
> output a HEAD element without an explicit <HEAD> tag, since <HEAD> is an 
> optional tag in HTML4. :-)

True; see above.

-- 
Híggledy-pìggledy / XML programmers            John Cowan
Try to escape those / I-eighteen-N woes;        http://www.ccil.org/~cowan
Incontrovertibly / What we need more of is      cowan@ccil.org
Unicode weenies and / François Yergeaus.

Received on Tuesday, 24 November 2009 20:15:27 UTC