EPUB and XML [was: The non-polyglot elephant in the room]

Hi Bill,

> @2013-01-23 13:25 -0800:
...
> Whether it was a good idea to base EPUB on XHTML 1.1 back in the
> Dark Ages of Year 2000 is kind of a moot point since so much has changed
> since then including the HTML roadmap which at that time was quite
> XHTML-centric.

Yeah, in that context it was an understandable decision back then.

> But there is no fundamental requirement that EPUB version x+1 content be
> compatible with reading systems for EPUB version x, and if W3C continues to
> move farther away from XML-based encodings that should IMO be taken into
> consideration in the development of future versions of EPUB. It is a goal
> of IDPF to increase the alignment of EPUB with other W3C specifications

It's great to hear all that.

> & I see EPUB as simply the publication (portable document) packaging of
> the Open Web Platform.

That seems like a great way to describe it. And clearly the Open Web
Platform is not restricted to well-formed XML, so ideally EPUB should not
be restricted to XML if a goal is for it to be the portable-document
packaging format for the Web Platform.

> And it's true that supporting "tag soup" HTML in EPUB would have some
> benefits especially when the same content is used by both websites and
> publications.

I think that's another excellent point. If somebody wants to take their
existing Web content and package it up as a portable document for viewing
in an EPUB reading system, I think they ideally should not be required to
transform it into well-formed XML to do that.

> That said I do think there are benefits to EPUB having only one
> serialization for content,

I think we could say the same about there being benefits to HTML itself
having only one serialization. But the reality is that it has two, and
Web user agents that want to consume and process actual Web content
properly need to support content in both serializations.

> which is well formed

It's true that conforming HTML parsers are currently not as ubiquitous as
XML parsers. But I think in the case of EPUB reading systems there would
only be an actual advantage to enforcing XML well-formedness if the reading
systems didn't already have HTML parsers.

But as far as I understand it the case with current EPUB reading systems is
that they're all using browser engines to parse and render EPUB HTML
content, and those browser engines all already have HTML parsers.
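
To make the difference concrete, here is a minimal sketch using only the
Python stdlib (the names `soup` and `TagCollector` are just illustrative):
an HTML parser recovers from unclosed tags and keeps going, while an XML
parser refuses the same input outright.

```python
# HTML error recovery vs. XML's draconian error handling, stdlib only.
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

soup = "<p>unclosed paragraph<p>another<br>"

class TagCollector(HTMLParser):
    """Record every start tag the parser reports."""
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(soup)      # no exception: the parser simply recovers
print(collector.tags)     # ['p', 'p', 'br']

try:
    ET.fromstring(soup)   # the XML parser fails hard on the same input
except ET.ParseError as e:
    print("XML parse error:", e)
```

The HTML parser hands back a usable event stream for the "tag soup" input;
the XML parser gives you nothing at all.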

> and validatable:

The text/html serialization of HTML is no less validatable than the XML
serialization; we have a modern validator engine -- the validator.nu engine
-- that's fully capable of validating current text/html documents:

  http://html5.validator.nu/

> the algorithm for "tag soup" conversion may now be well defined in HTML5
> but are not necessarily going to be valid against any schema.

Not sure what you mean by that. The validator.nu engine includes a RELAX NG
schema for HTML5 and I think the current EPUB3 validator is actually using
a variant of that schema. The validator.nu engine also includes schemas
for SVG, MathML, and ITS. In general all of the base schemas are agnostic
about anything related to how the documents are serialized.

It's true there are some things in the HTML spec that most current schema
languages on their own are not capable of representing. For example, the
HTML spec defines any attribute with a prefix of "data-" as being valid.
But I don't know of any way to create a RELAX NG schema to express that (in
the case of the validator.nu engine, it uses a SAX filter to drop the
"data-" attributes before the document is exposed to RELAX NG validation).
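
Here's a rough sketch of that filtering idea in Python (the actual
validator.nu code is Java; `DropDataAttributes` is just an illustrative
name): a SAX filter that strips `data-*` attributes before the events
reach a downstream consumer such as a RELAX NG validator.

```python
# SAX filter that drops data-* attributes before downstream processing.
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator
from xml.sax.xmlreader import AttributesImpl

class DropDataAttributes(XMLFilterBase):
    def startElement(self, name, attrs):
        # Forward the element with every data-* attribute removed.
        kept = {k: v for k, v in attrs.items() if not k.startswith("data-")}
        super().startElement(name, AttributesImpl(kept))

out = io.StringIO()
flt = DropDataAttributes(xml.sax.make_parser())
flt.setContentHandler(XMLGenerator(out))  # stand-in for a validator
flt.parse(io.BytesIO(b'<div id="a" data-x="1"><span data-y="2"/></div>'))
print(out.getvalue())  # the data-* attributes are gone
```

The downstream handler never sees the `data-*` attributes, so a schema
that has no way to express them never gets a chance to reject them.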

But that said, the HTML spec says that "data-" attributes are valid both in
the text/html serialization and in the XML serialization. So supporting
only the XML serialization in a validator would not save you from needing
to have the validator still treat "data-" attributes as valid and not
report errors for them.

> And, EPUB publications - like websites - are not solely made up of HTML
> content. SVG and MathML are first-class citizens as well for example, and
> AFAIK they are defined as XML-based markup languages, lacking an algorithm
> like HTML5 for processing "tag soup" variants.

The HTML parsing algorithm in the HTML spec itself actually already fully
supports parsing of SVG and MathML in text/html, and all major browser
engines have already implemented and shipped with that support.

  http://www.w3.org/html/wg/drafts/html/master/syntax.html#tree-construction
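You can see this with the third-party html5lib package, which implements
the HTML parsing algorithm: inline SVG in a text/html document lands in
the proper SVG namespace, with no XML serialization involved.

```python
# Parse inline SVG embedded in "tag soup" text/html with html5lib.
import html5lib

doc = html5lib.parse("<p>text<svg><circle r='1'/></svg>")
svg = doc.find(".//{http://www.w3.org/2000/svg}svg")
print(svg.tag)  # the svg element is in the SVG namespace
```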

> Is W3C is going to move away from XML altogether and define "tag soup"
> parsing for every specification that's part of the Open Web Platform?

I don't think the W3C has yet moved away from XML altogether for any markup
language -- not even for Web markup languages. The only existing Web
languages we've run into so far that required "tag soup" parsing behavior
to be defined are SVG and MathML.

But other than SVG and MathML, pretty much all other markup-language
specifications for the Open Web Platform were never bound to XML to begin
with. And for at least one current proposal for adding new markup to the
Web platform -- as part of what's being called Web Components -- the
discussions we've been having are centered so far on just how to handle it
in text/html ("tag soup") parsers. Some people in those discussions have
gone so far as to say we shouldn't attempt to define how to handle that
markup for the XML-serialization case at all, and that it can just be a
feature that authors need to use the text/html serialization for if they
want to use it.

So I guess it would be fair to say that the target Web developers for some
of those new features want them for the text/html serialization, and the
UA implementors who are looking at implementing them are focused on coming
up with something that works for the text/html case.

In other words I guess you could say it's safe to bet that all new features
that get added to the Web Platform will work in the text/html serialization
at least -- but it's not clear that they are all absolutely going to be
available in the XML serialization.

> If not then it seems that HTML more
> than EPUB could be considered the special case,

HTML is *the* case for the Web Platform. So to the degree that you want
EPUB to be the portable-document packaging of the Web Platform, EPUB is
right now the special case, in that it requires XML well-formedness while
the Web Platform does not.

> and that being due to HTML's own backwards-compatibility reality.

I don't think backwards compatibility is the sole reason, or even the main
reason, why the Web Platform and HTML have evolved with parsing behavior
that includes error recovery (as HTML does but XML currently doesn't).

The reason the Web has parsers capable of error recovery, and the reason
most Web authors and developers choose to target their content to them
instead of to XML parsers, is that it's a better fit for the realities of
the Web -- or really, a better fit for document publishing and sharing in
general.

If nobody thought it was a better fit, we probably wouldn't have some
really smart people spending their time specifying a new version of XML
(MicroXML) that drops XML1's "catch fire and fail" draconian error-handling
requirement, and another (XML-ER) that defines error-recovery behavior
very much like the error-recovery behavior of "tag soup" text/html (in
fact the XML-ER error-recovery behavior is modeled on the behavior in
text/html).

  https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html
  https://dvcs.w3.org/hg/xml-er/raw-file/tip/Overview.html

> I'm not suggesting we go back to the days when we tried to ignore this
> reality and pursued the quixotic goal of pushing everything to XHTML. But
> EPUB is about structured, packaged content -  data that's generated,
> consumed and manipulated by a variety of tools, not only rendered in a
> browser,

None of those tooling needs makes XML an absolute requirement, especially
over the long term. It's true that a lot of people have spent a lot of the
last 10+ years building up tooling environments around XML parsers, and we
have XML parsers in a lot more places right now than we do good conforming
HTML parsers. But I think the way to deal with that is not to keep sinking
money and time exclusively into toolchains that require well-formed XML1 as
input, but instead to improve those toolchains so that they can consume and
process all the same real-world Web content that browsers are capable of
handling, not just the fraction of Web content that is well-formed XML.

> and doesn't have the same backwards compatibility issues as web pages but
> in fact the opposite due to EPUB's XML beginnings. So I'm not sure I
> personally see a strong reason that this should change. But again I think
> this ultimately will likely depend more on what W3C does around XML in
> general, rather than anything else.

I think there are actually some very strong reasons that EPUB reading
systems and processing tools should change to being able to handle
text/html content, as I hope I've made clear in my comments above.

  --Mike

-- 
Michael[tm] Smith http://people.w3.org/mike

Received on Saturday, 26 January 2013 12:30:07 UTC