Re: XHTML character entity support

On Tue, 24 Nov 2009 21:14:54 +0100, John Cowan <cowan@ccil.org> wrote:

> Ian Hickson scripsit:
>
>> TagSoup could be made more compatible with existing deployed content,
>> then. It might be compatible enough for most purposes already, but there
>> are pages on the Web that depend on the <head> element being always
>> present. Also, the <ul> element should certainly not be implied.
>
> TagSoup is not intended for deployment in browsers.  Rather, it generates
> SAX events based on HTML input, permitting fairly arbitrary HTML to
> be processed using XML tools such as XSLT.  It guarantees, therefore,
> that the output is well-formed XML (except for encoding issues) rather
> than that it conforms to any specific schema.  If you don't like what
> TagSoup outputs, you can always transform the output further until the
> result is more like what you expect.
>
> In particular, there are absolutely no guarantees that CSS paths or
> JavaScript DOM references that work on the HTML will continue to work
> on the XML; they probably won't.
>
> In principle it would be possible to use an implementation of the HTML5
> algorithm to construct a DOM and then use a simple DOM walker to read
> out SAX events, but this would be much more heavyweight in time and
> space than TagSoup is, so I imagine it will continue to be used.
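
For illustration, the DOM-then-walker approach described above could look
roughly like the following. This is only a sketch under stated assumptions:
xml.dom.minidom parsing well-formed input stands in for an HTML5 tree
builder, and dom_to_sax, walk and Recorder are hypothetical names, not part
of TagSoup or any HTML5 parser.

```python
import xml.dom.minidom
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


def walk(node, handler):
    """Depth-first walk of a DOM node, firing SAX callbacks as it goes."""
    if node.nodeType == node.ELEMENT_NODE:
        # NamedNodeMap.items() yields (name, value) pairs.
        attrs = AttributesImpl(dict(node.attributes.items()))
        handler.startElement(node.tagName, attrs)
        for child in node.childNodes:
            walk(child, handler)
        handler.endElement(node.tagName)
    elif node.nodeType == node.TEXT_NODE:
        handler.characters(node.data)
    # Comments, processing instructions etc. are skipped in this sketch.


def dom_to_sax(document, handler):
    """Read SAX events out of an already-built DOM document."""
    handler.startDocument()
    walk(document.documentElement, handler)
    handler.endDocument()


class Recorder(ContentHandler):
    """Toy handler that records the events it receives."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


# Assumption: a real pipeline would build this DOM with an HTML5 parser.
doc = xml.dom.minidom.parseString(
    '<html><body><p class="x">hi</p></body></html>')
rec = Recorder()
dom_to_sax(doc, rec)
```

The whole tree exists in memory before the first event is fired, which is
where the extra time and space cost comes from.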

The Validator.nu HTML parser can be run in SAX streaming mode, which
doesn't construct a DOM in between.

However, HTML5 requires things like attributes on stray <html> tags to be
added to the root element, whose start event has already been emitted by
then. A streaming parser therefore sometimes has to either abort, emit
non-SAX events, or violate HTML5.
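
To make the three options concrete, here is a toy model of the problem. It
is a sketch only: StreamingEmitter, StreamabilityError and the policy names
are made up for illustration and are not the Validator.nu API.

```python
class StreamabilityError(Exception):
    """Raised when the input cannot be handled in a streaming fashion."""


class StreamingEmitter:
    """Toy streaming emitter that only models <html> start tags."""

    def __init__(self, policy="abort"):
        self.policy = policy      # "abort", "ignore" or "signal" (hypothetical)
        self.events = []          # events already handed to the consumer
        self.root_attrs = None    # attributes sent with the root startElement

    def html_start_tag(self, attrs):
        if self.root_attrs is None:
            # First <html>: emit the root element's start event right away.
            self.root_attrs = dict(attrs)
            self.events.append(("startElement", "html", dict(attrs)))
            return
        # Stray <html>: HTML5 says attributes not already present on the
        # root element get added to it. With a DOM that is easy; here the
        # startElement event is already out of our hands.
        late = {k: v for k, v in attrs.items() if k not in self.root_attrs}
        if not late:
            return  # nothing new, so nothing to reconcile
        if self.policy == "abort":
            raise StreamabilityError("late attributes on root: %r" % late)
        if self.policy == "signal":
            # Not a SAX event; the consumer has to know about it.
            self.events.append(("x-late-root-attributes", late))
        # policy == "ignore": drop the attributes, which violates HTML5.


lenient = StreamingEmitter(policy="ignore")
lenient.html_start_tag({"lang": "en"})
lenient.html_start_tag({"lang": "fr", "class": "late"})  # silently dropped

signalling = StreamingEmitter(policy="signal")
signalling.html_start_tag({"lang": "en"})
signalling.html_start_tag({"class": "late"})
```

With policy="abort" the second call raises StreamabilityError instead.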


>> Also, on another note, TagSoup is not compliant with HTML4 if it doesn't
>> output a HEAD element without an explicit <HEAD> tag, since <HEAD> is an
>> optional tag in HTML4. :-)
>
> True; see above.
>


-- 
Simon Pieters
Opera Software

Received on Wednesday, 25 November 2009 12:37:58 UTC