- From: Philip Taylor <pjt47@cam.ac.uk>
- Date: Wed, 25 Nov 2009 16:30:14 +0000
- To: John Cowan <cowan@ccil.org>
- CC: www-archive@w3.org
John Cowan wrote:
> Ian Hickson scripsit:
>
>> TagSoup is could be made more compatible with existing deployed content,
>> then. It might be compatible enough for most purposes already, but there
>> are pages on the Web that depend on the <head> element being always
>> present. Also, the <ul> element should certainly not be implied.
>
> TagSoup is not intended for deployment in browsers. Rather, it generates
> SAX events based on HTML input, permitting fairly arbitrary HTML to
> be processed using XML tools such as XSLT. It guarantees, therefore,
> that the output is well-formed XML (except for encoding issues) rather
> than that it conforms to any specific schema. If you don't like what
> TagSoup outputs, you can always transform the output further until the
> result is more like what you expect.

Here's one unrelated case where TagSoup could be made more compatible
with existing deployed content, for users who are not writing web
browsers, and which can't be fixed by transforming the output:

http://animaldiversity.ummz.umich.edu/site/accounts/information/Picoides_scalaris.html
includes the markup:

  <a href="http://animaldiversity.ummz.umich.edu/site/teach/index.html" href="">Teaching</a>

TagSoup 1.2 with --nodefaults outputs:

  <a href="">Teaching</a>

The HTML5 algorithm (and html5lib, and the Validator.nu parser (in
streaming and buffered modes)) gives output equivalent to:

  <a href="http://animaldiversity.ummz.umich.edu/site/teach/index.html">Teaching</a>

All browsers (as far as I'm aware) use the first attribute whenever
there are duplicates, matching what HTML5 defines. (...except for
'style' where some browsers do crazy things)

People writing web crawlers using TagSoup would (I expect) prefer to see
the same links that a browser user would see and that the page author
tested by clicking on the links in their browser. So TagSoup could be
slightly more compatible with content if it adopted HTML5's rule for
attribute precedence.
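
To make the precedence rule concrete, here is a minimal sketch in Python of a link collector that keeps the first occurrence of a duplicated attribute. It uses the standard library's html.parser, which hands duplicates through as a list of (name, value) pairs. It is not TagSoup or an HTML5 parser, and the names first_wins and LinkCollector are made up for illustration.

```python
from html.parser import HTMLParser


def first_wins(attrs):
    """Collapse duplicate attributes, keeping the first occurrence of each name.

    Hypothetical helper: mirrors the HTML5 rule described above.
    """
    merged = {}
    for name, value in attrs:
        # setdefault only stores the value if the name has not been seen yet.
        merged.setdefault(name, value)
    return merged


class LinkCollector(HTMLParser):
    """Collects href values from <a> elements using first-wins precedence."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; duplicates are preserved.
        if tag == "a":
            href = first_wins(attrs).get("href")
            if href is not None:
                self.links.append(href)


markup = (
    '<a href="http://animaldiversity.ummz.umich.edu/site/teach/index.html" '
    'href="">Teaching</a>'
)

collector = LinkCollector()
collector.feed(markup)
collector.close()
print(collector.links)
```

Note that the obvious shortcut, dict(attrs), keeps the last value for a repeated key, so on this markup it would yield the empty href that TagSoup 1.2 outputs rather than the link a browser user would follow.
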
There aren't that many cases where a streaming parser can't conform to
HTML5 without aborting (there's "<html a><html b>", "<table>text",
"<i><p></i>", "</html><!-- --><p>", and some similar things; see e.g.
<http://lists.w3.org/Archives/Public/public-html/2009May/0582.html>) --
in general it seems like following the HTML5 algorithm more closely
would help compatibility with content. (But it would probably only help
a little bit, and I guess it would break compatibility with people who
currently use TagSoup and make assumptions about its output, which might
be a bigger problem.)

--
Philip Taylor
pjt47@cam.ac.uk
Received on Wednesday, 25 November 2009 16:30:43 UTC