Re: EPUB and XML [was: The non-polyglot elephant in the room] from Bill McCoy on 2013-01-26 (public-html@w3.org from January 2013)

From: Bill McCoy <whmccoy@gmail.com>
Date: Sat, 26 Jan 2013 09:44:03 -0800
To: "Michael[tm] Smith" <mike@w3.org>
Cc: public-html@w3.org
Message-ID: <CAJ0DDbA3LZk8gG1zdT+sTVg3aG3gQiZTsj+d4CqyXfB5p4OwmQ@mail.gmail.com>
Hi Michael,

Thanks for the note and it seems you and I share much the same
perspective. I fully realize that HTML is the core of the Open Web
Platform and at the 50,000-foot level it's a pretty compelling argument
that "clearly the Open Web Platform is not restricted to well-formed
XML, so ideally EPUB should not be restricted to XML if a goal is for
it to be the portable-document packaging format for the Web
Platform." (of course EPUB is not literally restricted to XML right
now - e.g. CSS is not XML - but in this context we are talking about
HTML content).

My point about "tag soup" content not being valid against a schema
wasn't clearly stated. I wasn't referring to non-validatable things
like "data-" attributes, although that's another (minor) issue as well.
I was referring to the practical consideration that if documents can
contain arbitrary "tag soup" HTML then they will in practice be much
less likely to be valid. This is not a major issue for EPUB reading
systems that as you noted will soon all be based on browser engines,
but it is an issue for intermediate processing toolchains, especially
if the parser technology for handling "tag soup" HTML isn't widely
available across development environments & operating systems as a
standalone module that such toolchains can use as easily as they can
XML parsing and validation. But I suspect that this
situation will evolve over time. There's an increasing number of CMS
systems that are based on HTML as content rather than custom XML
formats like DITA or DocBook. If 2 years from now these systems
prevalently support "tag soup" for articles and other content
fragments then I think the answer will be clear. If 2 years from now
these systems prevalently store XHTML because it has led to other
benefits, that might be another story.
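
To make the toolchain concern concrete, here is a minimal Python sketch using only the standard library. It shows a strict XML parser rejecting a "tag soup" fragment that a tolerant HTML tokenizer accepts. The sample markup and the names TAG_SOUP, parses_as_xml, TagCollector and start_tags are illustrative assumptions, and html.parser is a lenient tokenizer, not an implementation of the HTML5 tree-construction algorithm.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample: an unclosed <p>, an unclosed <br>, and mis-nested <b>/<i>.
TAG_SOUP = "<p>Unclosed paragraph<br><b>bold <i>mis-nested</b></i>"


def parses_as_xml(markup: str) -> bool:
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


class TagCollector(HTMLParser):
    """Record every start tag the tolerant tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def start_tags(markup: str) -> list:
    """Tokenize the markup leniently and return the start tags seen."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

A toolchain built only on the first function stops at the first well-formedness error, while the second keeps going, which is the gap between XML tooling and "tag soup" tooling described above.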

And as I see it we have some time here for this to settle out. This
year is a critical period of transition from EPUB 2 to EPUB 3 and the
latter, despite some acknowledged infelicities, is infinitely more
aligned with the modern Web. And that transition is substantially
eased by the backwards compatibility of EPUB 3 with EPUB 2, for which
having EPUB 3 stick with the XML serialization of HTML is critical.
Current tools for creating, manipulating, and validating EPUB all
assume XHTML so I don't see this as an acute issue in practice right
now. And as you point out HTML can be easily transformed to XHTML so
that's always an option en route to creating EPUB just as much as it
is an option downstream.
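
As a rough illustration of that HTML-to-XHTML step, the following Python sketch re-serializes a lenient HTML fragment as well-formed XML using only the standard library. It is a simplification under stated assumptions, not the HTML5 parsing algorithm: it drops comments and doctypes, ignores stray end tags, does not check that attribute names are legal XML names, and assumes the fragment has a single root element. The names VOID, XhtmlSerializer and to_xhtml are hypothetical.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr

# Elements that never have content, so they are written as <br/> and the like.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class XhtmlSerializer(HTMLParser):
    """Re-emit tolerant HTML tokens as well-formed XML markup."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []    # serialized output pieces
        self.open = []   # stack of currently open element names

    def handle_starttag(self, tag, attrs):
        seen = {}
        for name, value in attrs:
            # Keep the first occurrence of a duplicated attribute and
            # expand valueless attributes (e.g. "hidden") to name="name".
            seen.setdefault(name, name if value is None else value)
        rendered = "".join(f" {n}={quoteattr(v)}" for n, v in seen.items())
        if tag in VOID:
            self.out.append(f"<{tag}{rendered}/>")
        else:
            self.out.append(f"<{tag}{rendered}>")
            self.open.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open:
            return  # stray end tag: ignore it
        # Close everything opened after the matching start tag, then the tag.
        while self.open:
            current = self.open.pop()
            self.out.append(f"</{current}>")
            if current == tag:
                break

    def handle_data(self, data):
        self.out.append(escape(data))


def to_xhtml(markup: str) -> str:
    """Serialize an HTML fragment as well-formed XML (simplified)."""
    serializer = XhtmlSerializer()
    serializer.feed(markup)
    serializer.close()
    while serializer.open:  # close anything still open at end of input
        serializer.out.append(f"</{serializer.open.pop()}>")
    return "".join(serializer.out)
```

The result can then be handed to an ordinary XML parser or validator; a production pipeline would more likely rely on a conforming HTML5 parser with an XML serializer than on a hand-rolled converter like this one.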

And, we have a lot of other work to do to better align EPUB with the
overall Open Web Platform - CSS is one key area IMO. And packaging may
be another area to consider further alignment, particularly if W3C's
new System Application WG is successful in establishing an adopted
standard for native-class applications built on Web technologies. They
have fewer backwards compatibility issues to consider, especially
since there's rampant fragmentation in current approaches to packaging
Web content into native applications. So I suspect they will likely
choose more modern approaches to things like manifests, perhaps
JSON-based. And I suspect they will likely not choose to restrict HTML
content to XHTML. That would all become input into what I could
imagine as an EPUB 4 that is more closely aligned overall - with
HTML5+, with CSS (inc. a unified approach to paginated displays for
high-design publications), with system applications, etc.

I'd personally like to see IDPF & W3C forge a way to move
expeditiously towards this increased alignment while still giving
publishers and publishing industry stakeholders confidence that
there's an independent focus on ensuring that their needs will be
addressed... since publishers can't be confident that their
requirements will get timely attention given that W3C has to serve the
needs of the entire IT industry. I'm sure that we'll be discussing
this more in the NYC workshop on Feb 11-12 that W3C is hosting with
involvement of IDPF and BISG (
http://www.w3.org/2012/08/electronic-books/ ).

--Bill McCoy

P.S. I should note that in this thread I'm expressing my personal
opinion, not speaking on behalf of IDPF, other than stating that it
"...is a goal of IDPF to increase the alignment of EPUB with other W3C
specifications" as that's something we've agreed at the Board level.


On Sat, Jan 26, 2013 at 4:29 AM, Michael[tm] Smith <mike@w3.org> wrote:
> Hi Bill,
>
>> @2013-01-23 13:25 -0800:
> ...
>> Whether it was a good idea to base EPUB on XHTML 1.1 back in the
>> Dark Ages of Year 2000 is kind of a moot point since so much has changed
>> since then including the HTML roadmap which at that time was quite
>> XHTML-centric.
>
> Yeah, in that context it was an understandable decision back then.
>
>> But there is no fundamental requirement that EPUB version x+1 content be
>> compatible with reading systems for EPUB version x, and if W3C continues to
>> move farther away from XML-based encodings that should IMO be taken into
>> consideration in the development of future versions of EPUB. It is a goal
>> of IDPF to increase the alignment of EPUB with other W3C specifications
>
> It's great to hear all that.
>
>> & I see EPUB as simply the publication (portable document) packaging of
>> the Open Web Platform.
>
> That seems like a great way to describe it. And clearly the Open Web
> Platform is not restricted to well-formed XML, so ideally EPUB should not
> be restricted to XML if a goal is for it to be the portable-document
> packaging format for the Web Platform.
>
>> And it's true that supporting "tag soup" HTML in EPUB would have some
>> benefits especially when the same content is used by both websites and
>> publications.
>
> I think that's another excellent point. If somebody wants to take their
> existing Web content and package it up as a portable document for viewing
> in an EPUB reading system, I think they ideally should not be required to
> transform it into well-formed XML to do that.
>
>> That said I do think there are benefits to EPUB having only one
>> serialization for content,
>
> I think we could say the same about there being benefits to having HTML
> itself have only one serialization. But the reality is that it has two, and
> Web user agents that want to consume and process actual Web content
> properly need to have support for content in both serializations.
>
>> which is well formed
>
> It's true that conforming HTML parsers are currently not as ubiquitous as
> XML parsers. But I think in the case of HTML reading systems there would
> only be an actual advantage to enforcing XML well-formedness if the reading
> systems didn't already have HTML parsers.
>
> But as far as I understand it the case with current EPUB reading systems is
> that they're all using browser engines to parse and render EPUB HTML
> content, and those browser engines all already have HTML parsers.
>
>> and validatable:
>
> The text/html serialization of HTML is no less validatable than the XML
> serialization; we have a modern validator engine -- the validator.nu engine
> --  that's fully capable of validating current text/html documents:
>
>   http://html5.validator.nu/
>
>> the algorithm for "tag soup" conversion may now be well defined in HTML5
>> but are not necessarily going to be valid against any schema.
>
> Not sure what you mean by that. The validator.nu engine includes a RELAX NG
> schema for HTML5 and I think the current EPUB3 validator is actually using
> a variant of that schema. The validator.nu engine also includes schemas
> for SVG, MathML, and ITS. In general all of the base schemas are agnostic
> about anything related to how the documents are serialized.
>
> It's true there are some things in the HTML spec that most current schema
> languages on their own are not capable of representing. For example, the
> HTML spec defines any attribute with a prefix of "data-" as being valid.
> But I don't know of any way to create a RELAX NG schema to express that (in
> the case of the validator.nu engine, it uses a SAX filter to drop the
> "data-" attributes before the document is exposed to RELAX NG validation).
>
> But that said, the HTML spec states that "data-" attributes are valid both in
> the text/html serialization and in the XML serialization. So supporting
> only the XML serialization in a validator would not prevent you from
> needing to have the validator still consider "data-" attributes as valid
> and not report errors for them.
>
>> And, EPUB publications - like websites - are not solely made up of HTML
>> content. SVG and MathML are first-class citizens as well for example, and
>> AFAIK they are defined as XML-based markup languages, lacking an algorithm
>> like HTML5 for processing "tag soup" variants.
>
> The HTML parsing algorithm in the HTML spec itself actually already fully
> supports parsing of SVG and MathML in text/html, and all major browser
> engines have already implemented and shipped with that support.
>
>   http://www.w3.org/html/wg/drafts/html/master/syntax.html#tree-construction
>
>> Is W3C going to move away from XML altogether and define "tag soup"
>> parsing for every specification that's part of the Open Web Platform?
>
> I don't think the W3C has yet moved away from XML altogether for any markup
> language -- not even for Web markup languages. The only existing Web
> languages we've run into so far that required parsing changes to be defined
> for "tag soup" parsing are SVG and MathML.
>
> But other than SVG and MathML pretty much all other markup-language
> specifications for the Open Web Platform were never bound to XML to begin
> with. And for one of the current proposals I can see for addition of new
> markup for the Web platform -- as part of what's being called Web
> Components -- the discussions we've been having are centered so far on just
> how to handle it in text/html ("tag soup") parsers. And some people in
> those discussions have gone so far as to say we shouldn't attempt to define
> how to handle that markup for the XML-serialization case, and that it can
> just be a feature that authors need to use the text/html serialization for
> if they want to use it at all.
>
> So I guess it would be fair to say that at least among the target Web
> developers for some of those new features, they want them for the text/html
> serialization, and the UA implementors who are looking at implementing them
> are focused on coming up with something that works for the text/html case.
>
> In other words I guess you could say it's safe to bet that all new features
> that get added to the Web Platform will work in the text/html serialization
> at least -- but it's not clear that they are all absolutely going to be
> available in the XML serialization.
>
>> If not then it seems that HTML more
>> than EPUB could be considered the special case,
>
> HTML is *the* case for the Web Platform. So to the degree that you want
> EPUB to be the portable-document packaging of the Web Platform, EPUB is
> right now the special case in terms of it requiring XML well-formedness
> while the Web Platform does not.
>
>> and that being due to HTML's own backwards-compatibility reality.
>
> I don't think backwards-compatibility is the sole reason or even the main
> reason why the Web Platform and HTML have evolved with parsing behavior
> that includes error recovery (as HTML does but XML currently doesn't).
>
> The reason the Web has parsers capable of error recovery and that most Web
> authors and developers choose to target content to them instead of to XML
> parsers is that it's a better fit for the realities of the Web -- or
> really, a better fit for document publishing and sharing in general.
>
> If nobody thought it was a better fit, we probably wouldn't have some
> really smart people spending their time planning to specify a new version
> of XML (MicroXML) that doesn't have XML1's "catch fire and fail" draconian
> error-handling requirement, or even a new version of XML (XML-ER) that
> defines error-recovery behavior that's very much like the error-recovery
> behavior that "tag soup" text/html has (in fact the specification for
> XML-ER error-recovery behavior is modeled on the behavior in text/html).
>
>   https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html
>   https://dvcs.w3.org/hg/xml-er/raw-file/tip/Overview.html
>
>> I'm not suggesting we go back to the days when we tried to ignore this
>> reality and pursued the quixotic goal of pushing everything to XHTML. But
>> EPUB is about structured, packaged content -  data that's generated,
>> consumed and manipulated by a variety of tools, not only rendered in a
>> browser,
>
> None of those tool needs makes XML an absolute requirement, especially over the
> long term. It's true that a lot of people have spent a lot of the last 10+
> years building up tooling environments around XML parsers, and we have XML
> parsers in a lot more places right now than we do good conforming HTML
> parsers. But I think the way to deal with that is not to keep sinking money
> and time exclusively into toolchains that require well-formed XML1 as an
> input, but instead to improve those toolchains so that they can consume and
> process all the same real-world Web content that browsers are capable of
> handling, not just the fraction of Web content that is well-formed XML.
>
>> and doesn't have the same backwards compatibility issues as web pages but
>> in fact the opposite due to EPUB's XML beginnings. So I'm not sure I
>> personally see a strong reason that this should change. But again I think
>> this ultimately will likely depend more on what W3C does around XML in
>> general, rather than anything else.
>
> I think there are actually some very strong reasons that EPUB reading
> systems and processing tools should change to being able to handle
> text/html content, as I hope I've made clear in my comments above.
>
>   --Mike
>
> --
> Michael[tm] Smith http://people.w3.org/mike
Received on Saturday, 26 January 2013 17:44:31 UTC