Re: XHTML character entity support from Philip Taylor on 2009-11-25 (www-archive@w3.org from November 2009)

From: Philip Taylor <pjt47@cam.ac.uk>
Date: Wed, 25 Nov 2009 16:30:14 +0000
To: John Cowan <cowan@ccil.org>
CC: www-archive@w3.org
Message-ID: <4B0D5B96.2060200@cam.ac.uk>

John Cowan wrote:
> Ian Hickson scripsit:
> 
>> TagSoup could be made more compatible with existing deployed content, 
>> then. It might be compatible enough for most purposes already, but there 
>> are pages on the Web that depend on the <head> element being always 
>> present. Also, the <ul> element should certainly not be implied.
> 
> TagSoup is not intended for deployment in browsers.  Rather, it generates
> SAX events based on HTML input, permitting fairly arbitrary HTML to
> be processed using XML tools such as XSLT.  It guarantees, therefore,
> that the output is well-formed XML (except for encoding issues) rather
> than that it conforms to any specific schema.  If you don't like what
> TagSoup outputs, you can always transform the output further until the
> result is more like what you expect.

Here's one unrelated case where TagSoup could be made more compatible 
with existing deployed content. It affects users who are not writing web 
browsers, and it can't be fixed by transforming the output:

http://animaldiversity.ummz.umich.edu/site/accounts/information/Picoides_scalaris.html 
includes the markup:

   <a href="http://animaldiversity.ummz.umich.edu/site/teach/index.html" 
  href="">Teaching</a>

TagSoup 1.2 with --nodefaults outputs:

   <a href="">Teaching</a>

The HTML5 algorithm (along with html5lib and the Validator.nu parser, in 
both its streaming and buffered modes) gives output equivalent to:

   <a href="http://animaldiversity.ummz.umich.edu/site/teach/index.html">Teaching</a>

All browsers (as far as I'm aware) use the first attribute whenever 
there are duplicates, matching what HTML5 defines (except for 'style', 
where some browsers do crazy things).

People writing web crawlers using TagSoup would (I expect) prefer to see 
the same links that a browser user would see and that the page author 
tested by clicking on the links in their browser. So TagSoup could be 
slightly more compatible with content if it adopted HTML5's rule for 
attribute precedence.
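
To illustrate the first-attribute rule, here is a minimal Python sketch of 
a link extractor that keeps the first occurrence of a duplicated 
attribute. It uses the standard library's html.parser (not TagSoup, and 
not a conforming HTML5 parser), which reports duplicate attributes in 
source order. The names FirstAttributeWins and extract_links are made up 
for this example.

```python
from html.parser import HTMLParser


class FirstAttributeWins(HTMLParser):
    """Collect <a href> values, keeping the first of any duplicated attribute."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        first = {}
        for name, value in attrs:
            # setdefault only stores a name the first time it is seen,
            # so later duplicates are ignored.
            first.setdefault(name, value)
        if "href" in first:
            self.links.append(first["href"])


def extract_links(markup):
    """Return the href of every <a> start tag in the given markup."""
    parser = FirstAttributeWins()
    parser.feed(markup)
    parser.close()
    return parser.links
```

Fed the markup quoted above, this should report the teach/index.html URL 
rather than the empty href.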

There aren't that many cases where a streaming parser can't conform to 
HTML5 without aborting (there's "<html a><html b>", "<table>text", 
"<i><p></i>", "</html><p>", and some similar things; see e.g. 
<http://lists.w3.org/Archives/Public/public-html/2009May/0582.html>) -- 
in general it seems like following the HTML5 algorithm more closely 
would help compatibility with content. (But it would probably only help 
a little bit, and I guess it would break compatibility with people who 
currently use TagSoup and make assumptions about its output, which might 
be a bigger problem.)
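
As a rough illustration of why "<html a><html b>" is awkward for a 
streaming parser: HTML5 adds the attributes of a later <html> start tag 
to the root element when they are not already present. A tree builder 
still holds the root node and can do that. A SAX-style parser has already 
handed the start event to its consumer. The sketch below is only a toy 
model in Python, with hypothetical names, not code from any of the 
parsers mentioned.

```python
emitted = []  # events already delivered to the consumer


def start_element(name, attrs):
    # A streaming parser hands over a snapshot of the attributes here;
    # it cannot revise that snapshot afterwards.
    emitted.append(("start", name, tuple(sorted(attrs.items()))))


def merge_late_html_attributes(root_attrs, late_attrs):
    """Add attributes from a later <html> tag, keeping existing values."""
    for name, value in late_attrs:
        root_attrs.setdefault(name, value)
    return root_attrs


root = {"a": ""}
start_element("html", root)                    # first <html a>
merge_late_html_attributes(root, [("b", "")])  # second <html b>

# root now carries both attributes (what a tree builder would end up
# with), but the event in emitted[0] still lists only "a".
```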

-- 
Philip Taylor
pjt47@cam.ac.uk

Received on Wednesday, 25 November 2009 16:30:43 UTC