[whatwg] Distinguishing XML and HTML by content sniffing from Julian Reschke on 2007-03-04 (public-whatwg-archive@w3.org from March 2007)

From: Julian Reschke <julian.reschke@gmx.de>
Date: Sun, 04 Mar 2007 11:47:35 +0100
Message-ID: <45EAA3C7.1060008@gmx.de>

Michael Day wrote:
> ...
> I think that approach could easily misidentify valid HTML documents as 
> being XML. It would be easy to parse the first 8Kb of many HTML 
> documents with an XML parser, as unclosed tags like <link> and <meta> 
> would not trigger any well-formedness errors unless you parsed all the 
> way to the end of the document -- not just the first 8Kb -- and found 
> that they were never closed.
> 
> On a more pragmatic level, I think it would also be slightly more 
> difficult to implement this approach with libxml2, as you would have to 
> carefully feed the parser only 8Kb (or some other amount) and then stop 
> it before it hits the end of the buffer and complains about all the 
> unclosed tags. However, the misidentification problem is a more serious 
> issue affecting this approach.
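
To make the quoted concern concrete, here is a minimal sketch using Python's standard-library expat binding rather than libxml2 (that substitution, the helper name prefix_parses_as_xml, and the sample document are assumptions for illustration only). It feeds the parser a prefix and never signals end of input, so elements that are still open go unreported.

```python
import xml.parsers.expat


def prefix_parses_as_xml(data: bytes, limit: int = 8192) -> bool:
    """Return True if the first `limit` bytes raise no well-formedness error."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # isfinal=False: the parser is never told the input has ended,
        # so elements that are still open are not reported as errors.
        parser.Parse(data[:limit], False)
    except xml.parsers.expat.ExpatError:
        return False
    return True


# Hypothetical HTML sample with unclosed <link> and <meta> elements.
HTML_DOC = (
    b'<html><head><link rel="stylesheet" href="a.css">'
    b'<meta name="a" content="b"><title>t</title>'
    b'</head><body><p>hi</p></body></html>'
)

# Cut the sample just before </head>, where the first mismatch would surface.
CUT = HTML_DOC.index(b"</head>")
```

With this sample, the prefix ending before </head> is accepted, while a prefix that includes </head> is rejected, because the end tag does not match the still-open <meta>. Whether a sniffer misidentifies a document therefore depends on where the cut happens to fall.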

Hm.

What, except efficiency, prevents you from parsing the whole file with 
an XML parser? If it parses, it is XML. Otherwise it isn't.
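
As a rough sketch of that whole-document check, assuming Python's standard-library expat binding (the function name is_well_formed_xml is made up for illustration, and reading the entire input into memory is a simplification):

```python
import xml.parsers.expat


def is_well_formed_xml(data: bytes) -> bool:
    """Return True only if the complete input is well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # isfinal=True: end of input is signalled, so unclosed elements
        # and a missing root element are reported as errors.
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True
```

For example, is_well_formed_xml(b"<html><head><title>t</title></head><body/></html>") is accepted, while the same document with an unclosed <link> inside <head> is rejected. The cost is the one mentioned above: the entire document has to be read and parsed before the answer is known.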

Best regards, Julian

Received on Sunday, 4 March 2007 02:47:35 UTC