Re: HTML/XML TF Report glosses over Polyglot Markup

"Martin J. Dürst", Tue, 04 Dec 2012 14:11:35 +0900:
> On 2012/12/04 14:02, Noah Mendelsohn wrote:
>> Robin Berjon wrote:
>> 
>>> If
>>> you want to process HTML using an XML toolchain, put an HTML parser
>>> in front of it.
>> 
>> 
>> On 12/3/2012 6:36 PM, Eric J. Bowman wrote:
>>> I used to do it that way,
>>> with Tidy and TagSoup, but have found it's simpler to just use an XSLT
>>> engine capable of reading raw HTML,
>> 
>> A question because I'm honestly curious: those XSLT engines don't use an
>> HTML parser to do that? I would have thought most did. Maybe I'm
>> guessing wrong.
> 
> It looks indeed more like a question of "external HTML parser vs. 
> built-in HTML parser" rather than "HTML parser or not".

It is also a question of using a *compatible* HTML parser. E.g. if the 
HTML parser in libxml2 counts as a built-in one, then it appears not to 
be fully text/html-compatible: it seems to assume XHTML rules - e.g. 
with regard to detecting the character encoding - something which 
appears to affect validator.w3.org. [1] For a document that is already 
polyglot this does not matter, however, since the Polyglot Markup 
specification limits the permitted character encoding to UTF-8.

[1] http://lists.w3.org/Archives/Public/www-validator/2012Nov/0032
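For what it's worth, a minimal sketch of the "built-in HTML parser" 
route Eric describes, using Python's lxml bindings to libxml2 (the 
file names "page.html" and "extract.xsl" are made up for 
illustration): the input is read with libxml2's HTML parser rather 
than its XML parser, and the resulting tree is handed straight to the 
XSLT engine. Whether that parser's encoding detection matches the 
text/html rules is exactly the question raised above.

  # Sketch only: HTML parser in front of an XML/XSLT toolchain, via lxml.
  # File names are hypothetical placeholders.
  from lxml import etree, html

  # Parse tag-soup HTML with libxml2's built-in HTML parser
  # (this is the step where encoding detection happens).
  doc = html.parse("page.html")

  # Compile and apply an ordinary XSLT 1.0 stylesheet to the parsed tree.
  transform = etree.XSLT(etree.parse("extract.xsl"))
  result = transform(doc)

  print(str(result))
  # Inspect which encoding the parser actually settled on.
  print("parser saw encoding:", doc.docinfo.encoding)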

-- 
leif halvard silli
