Re: HTML/XML TF Report glosses over Polyglot Markup

From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
Date: Tue, 4 Dec 2012 13:06:02 +0100
To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
Cc: Noah Mendelsohn <nrm@arcanedomain.com>, "Eric J. Bowman" <eric@bisonsystems.net>, Robin Berjon <robin@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, www-tag@w3.org
Message-ID: <20121204130602072798.fb61f820@xn--mlform-iua.no>
"Martin J. Dürst", Tue, 04 Dec 2012 14:11:35 +0900:
> On 2012/12/04 14:02, Noah Mendelsohn wrote:
>> Robin Berjon wrote:
>> 
>>> If
>>> you want to process HTML using an XML toolchain, put an HTML parser
>>> in front of it.
>> 
>> 
>> On 12/3/2012 6:36 PM, Eric J. Bowman wrote:
>>> I used to do it that way,
>>> with Tidy and TagSoup, but have found it's simpler to just use an XSLT
>>> engine capable of reading raw HTML,
>> 
>> A question because I'm honestly curious: those XSLT engines don't use an
>> HTML parser to do that? I would have thought most did. Maybe I'm
>> guessing wrong.
> 
> It looks indeed more like a question of "external HTML parser vs. 
> built-in HTML parser" rather than "HTML parser or not".

It is also a question of using a *compatible* HTML parser. For 
instance, if the HTML parser in libxml2 counts as built-in, then it 
appears not to be fully text/html-compatible: it seems to assume XHTML 
rules, e.g. with regard to character encoding detection, something 
which apparently affects validator.w3.org. [1] For an already polyglot 
document this does not matter, however, since the Polyglot Markup 
specification limits the permitted character encoding to UTF-8.
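
To make the pipeline concrete, here is a minimal sketch using lxml (a 
Python wrapper around libxml2); the file names are invented. Decoding 
the bytes as UTF-8 up front keeps the result independent of the 
parser's own encoding detection, which is safe for polyglot documents 
precisely because they must be UTF-8:

    from lxml import etree

    # Decode up front as UTF-8 -- polyglot markup permits no other
    # encoding -- so nothing depends on libxml2's own detection.
    with open("page.html", "rb") as f:
        text = f.read().decode("utf-8")

    # libxml2's HTML parser, here playing the role of the "built-in"
    # HTML parser put in front of an XML toolchain.
    doc = etree.fromstring(text, etree.HTMLParser())

    # Hand the resulting tree to the XML side: an XSLT 1.0 transform.
    transform = etree.XSLT(etree.parse("sheet.xsl"))
    print(str(transform(doc)))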

[1] http://lists.w3.org/Archives/Public/www-validator/2012Nov/0032

-- 
leif halvard silli