Re: The non-polyglot elephant in the room

Regarding EPUB: one issue not mentioned in the thread, and a primary
consideration in EPUB 3.0 requiring the XML serialization of HTML (XHTML),
was backwards compatibility with the widely adopted previous version of
EPUB, which was based on XHTML 1.1. Allowing "tag soup" HTML would have
eliminated the possibility of creating EPUB 3 files that can "fall back",
i.e. be gracefully handled (minus new capabilities, of course) on EPUB 2
Reading Systems. Whether it was a good idea to base EPUB on XHTML 1.1 back
in the
Dark Ages of Year 2000 is kind of a moot point since so much has changed
since then including the HTML roadmap which at that time was quite
XHTML-centric.

I will also agree with the statements that EPUB content creation is not
best accomplished by raw human authoring of source markup but rather with
some assistance from tooling. Most publications/documents are created with
authoring tools and other prevalent formats range from rather baroque
(.docx) to downright opaque (.pdf). It was a design goal for EPUB to be as
simple as possible, but interoperability of tool-based workflows definitely
trumped hand-coding-friendliness. Certain bits of EPUB 3 markup (e.g.
canonical fragment identifiers) are clearly not intended for humans to
author.

But there is no fundamental requirement that EPUB version x+1 content be
compatible with reading systems for EPUB version x, and if W3C continues to
move farther away from XML-based encodings that should IMO be taken into
consideration in the development of future versions of EPUB. It is a goal
of IDPF to increase the alignment of EPUB with other W3C specifications & I
see EPUB as simply the publication (portable document) packaging of the
Open Web Platform. And it's true that supporting "tag soup" HTML in EPUB
would have some benefits especially when the same content is used by both
websites and publications.

That said I do think there are benefits to EPUB having only one
serialization for content, which is well formed and validatable: the
algorithm for "tag soup" conversion may now be well defined in HTML5, but
such documents are not necessarily going to be valid against any schema.
And using a serialization (XML) that's widely supported, with tools built
in to essentially every software development environment and runtime
platform in
existence makes things simpler for those developing tools and conversion
workflows. I'm not aware of every implementation of HTML to XHTML
conversion but the C-based HTML2XHTML library contains dozens of modules
comprising over 600KB of source code. That's a pretty hefty add-on to any
workflow, and it's not clear whether versions of this conversion exist
for every development environment, or what their level of quality and
robustness is. Whereas XML parsing comes for free on every
platform.
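
To make the "comes for free" point concrete, here is a minimal sketch in
Python that uses only the standard library (xml.etree.ElementTree). The
helper name is_well_formed and the two sample strings are made up for
illustration, and the check covers well-formedness only, not validity
against the EPUB schemas.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the standard-library XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Hypothetical XHTML content document: every element is closed.
xhtml = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Chapter 1</title></head>"
    "<body><p>First paragraph.<br/>Second line.</p></body>"
    "</html>"
)

# Hypothetical "tag soup": <p> and <br> are never closed, which an HTML5
# parser would tolerate but an XML parser rejects.
tag_soup = "<html><body><p>First paragraph.<br>Second line.</body></html>"

if __name__ == "__main__":
    print(is_well_formed(xhtml))     # True
    print(is_well_formed(tag_soup))  # False
```

Handling the second string would need an HTML5-aware parser or a
conversion step, which is the extra dependency I'm talking about.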

And, EPUB publications - like websites - are not solely made up of HTML
content. SVG and MathML are first-class citizens as well for example, and
AFAIK they are defined as XML-based markup languages, lacking an algorithm
like HTML5's for processing "tag soup" variants. Is W3C going to move away
from XML altogether and define "tag soup" parsing for every specification
that's part of the Open Web Platform? If not then it seems that HTML more
than EPUB could be considered the special case, and that being due to
HTML's own backwards-compatibility reality. I'm not suggesting we go back
to the days when we tried to ignore this reality and pursued the quixotic
goal of pushing everything to XHTML. But EPUB is about structured, packaged
content: data that's generated, consumed and manipulated by a variety of
tools, not only rendered in a browser. It doesn't have the same backwards
compatibility issues as web pages; in fact it has the opposite, due to
EPUB's XML beginnings. So I'm not sure I personally see a strong reason that this
should change. But again I think this ultimately will likely depend more on
what W3C does around XML in general, rather than anything else.

--Bill

Received on Thursday, 24 January 2013 22:21:06 UTC