Re: The non-polyglot elephant in the room

Julian Reschke <>, 2013-01-21 16:00 +0100:

> On 2013-01-21 15:24, Michael[tm] Smith wrote:
> > ...
> > The reason EPUB requires XHTML is that the EPUB working group made an
> > explicit choice to require it. They could have chosen to allow text/html
> > EPUB books but they chose not to. And I think some of the people who
> > advocated for requiring XHTML didn't understand that existing XML-based
> > toolchains could be made to handle text/html content just by putting an
> > HTML parser in front of them.
> > ...
> Is there a web page listing HTML5 parsers that can be used as "drop in"
> replacement for an XML parser?

There isn't, no. Because there aren't enough at this point to justify
making a page to list them. I wasn't implying that a lot of them exist.

> I'm aware of Henri's Java parser, but what else is out there & supported?

There may be some others but I'm not aware of any.

But in part maybe that's an indication that for non-Java environments there
aren't many people that actually want to do processing of text/html
documents using XML toolchains; they have other non-XML toolchains that
work fine for their needs.

Actually, even in the case of Henri's parser, to me at least the specific
utility of it for a validator is that it can hook into Jing. Beyond that,
its general utility is not so much that it can be used in an XML toolchain
per se as that it does SAX.

And then because SAX is commonly used in Java across a lot of other tools
(which in practice I guess happen to be mostly designed for use with XML),
if you're going to implement a validator in Java, it's more of a natural
fit to use a SAX-based parser for it -- especially if another goal is to
have stream-based processing rather than needing to make an in-memory
representation of the document in order to process it.
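
To make the "HTML parser in front of a SAX consumer" idea concrete, here is
a rough sketch. It is in Python rather than Java, only because the standard
library there happens to ship both a streaming HTML tokenizer (html.parser)
and a SAX ContentHandler interface (xml.sax). The names HtmlToSax,
LinkCollector and collect_links are made up for illustration. Note that
html.parser does not implement the tree-construction part of the HTML
parsing algorithm, so this is not a spec-conformant parser and it is not how
Henri's parser works internally; it only shows the shape of the plumbing.

```python
from html.parser import HTMLParser
from xml.sax.handler import ContentHandler
from xml.sax.xmlreader import AttributesImpl


class HtmlToSax(HTMLParser):
    """Hypothetical adapter: forwards html.parser events to a SAX handler."""

    def __init__(self, handler):
        super().__init__(convert_charrefs=True)
        self.handler = handler

    def handle_starttag(self, tag, attrs):
        # Valueless attributes arrive as None; SAX consumers expect strings.
        sax_attrs = AttributesImpl({k: v or "" for k, v in attrs})
        self.handler.startElement(tag, sax_attrs)

    def handle_endtag(self, tag):
        self.handler.endElement(tag)

    def handle_data(self, data):
        self.handler.characters(data)


class LinkCollector(ContentHandler):
    """Stand-in for an existing SAX consumer; it never sees HTML syntax."""

    def __init__(self):
        super().__init__()
        self.links = []

    def startElement(self, name, attrs):
        if name == "a" and "href" in attrs:
            self.links.append(attrs["href"])


def collect_links(chunks):
    """Feed the document piece by piece; no in-memory tree is ever built."""
    collector = LinkCollector()
    parser = HtmlToSax(collector)
    collector.startDocument()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    collector.endDocument()
    return collector.links
```

The consumer only ever sees SAX callbacks, so the same LinkCollector could
be driven by an XML parser instead. Because nothing repairs the markup along
the way, a void element such as <br> produces a startElement with no
matching endElement, which is one reason a real drop-in replacement needs
the full parsing algorithm.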

But I guess there are a lot of other classes of HTML-processing
applications for which stream-based document processing is not as much of
an important requirement, so developers make those applications without
needing SAX, and so without wanting a SAX-based HTML parser.

I think in other non-Java programming environments maybe SAX is not nearly
as widely used anyway, and it's more common to use non-streaming APIs that
aren't tied closely to XML, and so there's less need to have an HTML parser
as a drop-in replacement for XML toolchains in those environments.

In other words, I'm not sure how many developers in other environments
would actually want or use a drop-into-existing-XML-toolchain HTML parser
even if somebody took time to implement one for them. 

I think what's likely a lot more useful to them is to have their
applications be able to make use of a good HTML parser, in any form at all,
that's robust in the sense that it can actually reliably parse HTML
documents the same way that browsers do (which is to say, in a way that
conforms to the parsing algorithm in the HTML spec).


Michael[tm] Smith

Received on Tuesday, 22 January 2013 08:53:47 UTC