- From: Michael[tm] Smith <mike@w3.org>
- Date: Tue, 22 Jan 2013 17:53:31 +0900
- To: Julian Reschke <julian.reschke@gmx.de>
- Cc: public-html WG <public-html@w3.org>, "www-tag@w3.org List" <www-tag@w3.org>
Julian Reschke <julian.reschke@gmx.de>, 2013-01-21 16:00 +0100:

> On 2013-01-21 15:24, Michael[tm] Smith wrote:
> > ...
> > The reason EPUB requires XHTML is that the EPUB working group made an
> > explicit choice to require it. They could have chosen to allow text/html
> > EPUB books but they chose not to. And I think some of the people who
> > advocated for requiring XHTML didn't understand that existing XML-based
> > toolchains could be made to handle text/html content just by putting an
> > HTML parser in front of them.
> > ...
>
> Is there a web page listing HTML5 parsers that can be used as "drop in"
> replacement for an XML parser?

There isn't, no, because there aren't enough at this point to justify
making a page to list them. I wasn't implying that a lot of them exist.

> I'm aware of Henri's Java parser, but what else is out there & supported?

There may be some others but I'm not aware of any. But in part maybe that's
an indication that for non-Java environments there aren't many people who
actually want to process text/html documents using XML toolchains; they
have other non-XML toolchains that work fine for their needs.

Actually, even in the case of Henri's parser, to me at least its specific
utility for a validator is that it can hook into Jing. Beyond that, its
general utility is not so much that it can be used in an XML toolchain per
se as that it does SAX. And because SAX is commonly used in Java across a
lot of other tools (which in practice I guess happen to be mostly designed
for use with XML), if you're going to implement a validator in Java, it's
more of a natural fit to use a SAX-based parser for it -- especially if
another goal is to have stream-based processing rather than needing to make
an in-memory representation of the document in order to process it.

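
To illustrate what "putting an HTML parser in front of" a SAX-based
consumer could look like, here is a minimal sketch in Python. It is not
Henri's parser and makes no claim about how that parser works. It assumes
Python's standard-library html.parser as the front end, which is only a
tokenizer and does not implement the tree-construction or error-recovery
rules of the parsing algorithm in the HTML spec, so unbalanced markup would
produce unbalanced SAX events. The names HtmlToSax and html_to_xml are
hypothetical, chosen only for this example.

```python
import io
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLGenerator
from xml.sax.xmlreader import AttributesImpl

# Elements that never take an end tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class HtmlToSax(HTMLParser):
    """Hypothetical adapter: forwards HTML tokens to a SAX ContentHandler."""

    def __init__(self, handler: ContentHandler):
        super().__init__(convert_charrefs=True)
        self.handler = handler

    def handle_starttag(self, tag, attrs):
        # html.parser reports valueless attributes as None; SAX expects strings.
        sax_attrs = AttributesImpl({name: value or "" for name, value in attrs})
        self.handler.startElement(tag, sax_attrs)
        if tag in VOID_ELEMENTS:
            # Close void elements immediately so the event stream stays balanced.
            self.handler.endElement(tag)

    def handle_endtag(self, tag):
        # Void elements were already closed in handle_starttag.
        if tag not in VOID_ELEMENTS:
            self.handler.endElement(tag)

    def handle_data(self, data):
        self.handler.characters(data)


def html_to_xml(html_text: str) -> str:
    """Stream text/html through the adapter into a stock SAX consumer."""
    out = io.StringIO()
    generator = XMLGenerator(out, encoding="utf-8")  # any ContentHandler works
    generator.startDocument()
    parser = HtmlToSax(generator)
    parser.feed(html_text)
    parser.close()
    generator.endDocument()
    return out.getvalue()
```

Calling html_to_xml('<p class=x>Hi<br>there &amp; more</p>') streams the
markup through the adapter without building an in-memory tree. XMLGenerator
stands in here for whatever SAX-consuming tool sits behind the parser.
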
But I guess there are a lot of other classes of HTML-processing
applications for which stream-based document processing is not as important
a requirement, so developers build those applications without needing SAX,
and so without wanting a SAX-based HTML parser.

I think in other non-Java programming environments maybe SAX is not nearly
as widely used anyway, and it's more common to use non-streaming APIs that
aren't tied closely to XML, and so there's less need to have an HTML parser
as a drop-in replacement for XML toolchains in those environments.

In other words, I'm not sure how many developers in other environments
would actually want or use a drop-into-existing-XML-toolchain HTML parser
even if somebody took time to implement one for them. I think what's likely
a lot more useful to them is to have their applications be able to make use
of a good HTML parser, in any form at all, that's robust in the sense that
it can actually reliably parse HTML documents the same way that browsers do
(which is to say, in a way that conforms to the parsing algorithm in the
HTML spec).

  --Mike

-- 
Michael[tm] Smith http://people.w3.org/mike
Received on Tuesday, 22 January 2013 08:53:46 UTC