[whatwg] Distinguishing XML and HTML by content sniffing from Michael Day on 2007-03-04 (public-whatwg-archive@w3.org from March 2007)

From: Michael Day <mikeday@yeslogic.com>
Date: Sun, 04 Mar 2007 19:48:32 +1100
Message-ID: <45EA87E0.7010609@yeslogic.com>

Hi Bjoern,

> Well, the article would be more interesting if you had explained why you
> took this particular approach instead of, say, parsing the first 8K with
> an XML parser and if that succeeds it's XML and HTML otherwise, and what
> the implementation would consider your article.

I think that approach could easily misidentify valid HTML documents as
being XML. An XML parser would get through the first 8 KB of many HTML
documents without complaint, as unclosed tags like <link> and <meta>
would not trigger any well-formedness errors unless you parsed all the
way to the end of the document -- not just the first 8 KB -- and found
that they were never closed.

On a more pragmatic level, I think it would also be slightly more
difficult to implement this approach with libxml2, as you would have to
carefully feed the parser only 8 KB (or some other amount) and then stop
it before it hits the end of the buffer and complains about all the
unclosed tags. However, the misidentification problem is the more
serious issue with this approach.
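
To make the misidentification concrete, here is a minimal sketch. It
uses Python's bundled expat binding rather than libxml2, purely as a
stand-in incremental XML parser. The sample document and the function
name parses_as_xml are made up for illustration. The sample omits
</head>, </p>, </body> and </html>, which HTML allows.

```python
from xml.parsers import expat

# Hypothetical HTML sample: <meta> and <link> are never closed, and the
# optional end tags are left out, so nothing here is a mismatched tag.
HTML_SAMPLE = (
    b"<html><head><title>Example</title>"
    b'<meta name="author" content="someone">'
    b'<link rel="stylesheet" href="style.css">'
    b"<body><p>Hello, world"
)


def parses_as_xml(data: bytes, final: bool) -> bool:
    """Feed data to a fresh expat parser and report whether it raised.

    final=False means "more input may follow", which is what sniffing a
    fixed-size prefix amounts to: open elements are not yet an error.
    final=True means "this is the whole document".
    """
    parser = expat.ParserCreate()
    try:
        parser.Parse(data, final)
    except expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    # Sniffing only a prefix: the parser has seen nothing wrong yet.
    print("prefix accepted as XML:", parses_as_xml(HTML_SAMPLE, False))
    # Declaring end of input: the unclosed elements become an error.
    print("whole document accepted as XML:", parses_as_xml(HTML_SAMPLE, True))
```

Under these assumptions the prefix check accepts the sample, and the
error only appears once the parser is told the input has ended. An HTML
document that does close a parent, such as <p>a<br></p>, would be
rejected as soon as the mismatched end tag is seen, so the check catches
some HTML but not all of it.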

Best regards,

Michael

-- 
Print XML with Prince!
http://www.princexml.com

Received on Sunday, 4 March 2007 00:48:32 UTC