- From: Leif Halvard Silli <xn--mlform-iua@xn--mlform-iua.no>
- Date: Tue, 4 Dec 2012 13:06:02 +0100
- To: Martin J. Dürst <duerst@it.aoyama.ac.jp>
- Cc: Noah Mendelsohn <nrm@arcanedomain.com>, "Eric J. Bowman" <eric@bisonsystems.net>, Robin Berjon <robin@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, Henri Sivonen <hsivonen@iki.fi>, public-html WG <public-html@w3.org>, www-tag@w3.org
"Martin J. Dürst", Tue, 04 Dec 2012 14:11:35 +0900: > On 2012/12/04 14:02, Noah Mendelsohn wrote: >> Robin Berjon wrote: >> >>> If >>> you want to process HTML using an XML toolchain, put an HTML parser >>> in front of it. >> >> >> On 12/3/2012 6:36 PM, Eric J. Bowman wrote: >>> I used to do it that way, >>> with Tidy and TagSoup, but have found it's simpler to just use an XSLT >>> engine capable of reading raw HTML, >> >> A question because I'm honestly curious: those XSLT engines don't use an >> HTML parser to do that? I would have thought most did. Maybe I'm >> guessing wrong. > > It looks indeed more like a question of "external HTML parser vs. > built-in HTML parser" rather than "HTML parser or not". It is also a question of using an *compatible* HTML parser. E.g. if the html parser in libxml2 counts as internal, then it appears to not be fully text/html-compatible as it appears to assume xhtml rules - e.g. with regard to detecting character encoding, something which e.g. seems to affect validator.w3.org. [1] For an already polyglot document, then this does not matter, however, as the Polyglot Markup specification limits the legal character set to UTF-8. [1] http://lists.w3.org/Archives/Public/www-validator/2012Nov/0032 -- leif halvard silli
Received on Tuesday, 4 December 2012 12:07:47 UTC