- From: Michael[tm] Smith <mike@w3.org>
- Date: Sat, 26 Jan 2013 21:29:53 +0900
- To: Bill McCoy <whmccoy@gmail.com>
- Cc: public-html@w3.org
Hi Bill,

> @2013-01-23 13:25 -0800: ...
> Whether it was a good idea to base EPUB on XHTML 1.1 back in the
> Dark Ages of Year 2000 is kind of a moot point since so much has changed
> since then including the HTML roadmap which at that time was quite
> XHTML-centric.

Yeah, in that context it was an understandable decision back then.

> But there is no fundamental requirement that EPUB version x+1 content be
> compatible with reading systems for EPUB version x, and if W3C continues to
> move farther away from XML-based encodings that should IMO be taken into
> consideration in the development of future versions of EPUB. It is a goal
> of IDPF to increase the alignment of EPUB with other W3C specifications

It's great to hear all that.

> & I see EPUB as simply the publication (portable document) packaging of
> the Open Web Platform.

That seems like a great way to describe it. And clearly the Open Web
Platform is not restricted to well-formed XML, so ideally EPUB should not
be restricted to XML either, if a goal is for it to be the portable-document
packaging format for the Web Platform.

> And it's true that supporting "tag soup" HTML in EPUB would have some
> benefits especially when the same content is used by both websites and
> publications.

I think that's another excellent point. If somebody wants to take their
existing Web content and package it up as a portable document for viewing
in an EPUB reading system, they ideally should not be required to transform
it into well-formed XML to do that.

> That said I do think there are benefits to EPUB having only one
> serialization for content,

I think we could say the same about there being benefits to HTML itself
having only one serialization. But the reality is that it has two, and Web
user agents that want to consume and process actual Web content properly
need to support content in both serializations.
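As a concrete illustration of the difference between the two parsing models
(my own sketch, using only the Python standard library; note that Python's
html.parser is an event-based parser and far simpler than a full HTML5 tree
builder, but the error-recovery contrast holds):

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Typical real-world "tag soup": unclosed <p> elements, a void <br>
# with no XML-style "/>" -- valid text/html, but not well-formed XML.
snippet = "<p>First paragraph<br><p>Second paragraph"

# An XML parser refuses the content outright ("draconian" error handling).
try:
    ET.fromstring(snippet)
    xml_parsed = True
except ET.ParseError:
    xml_parsed = False

# An HTML parser recovers and still delivers events for all the content.
class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(snippet)

print(xml_parsed)      # False: the XML parser raised ParseError
print(collector.tags)  # ['p', 'br', 'p']
```

A user agent that supports only the XML path simply cannot consume the
snippet above, even though every browser engine handles it fine.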
> which is well formed

It's true that conforming HTML parsers are currently not as ubiquitous as
XML parsers. But in the case of HTML reading systems, there would only be
an actual advantage to enforcing XML well-formedness if the reading systems
didn't already have HTML parsers. As far as I understand it, current EPUB
reading systems are all using browser engines to parse and render EPUB HTML
content, and those browser engines all already have HTML parsers.

> and validatable:

The text/html serialization of HTML is no less validatable than the XML
serialization; we have a modern validator engine -- the validator.nu
engine -- that's fully capable of validating current text/html documents:

  http://html5.validator.nu/

> the algorithm for "tag soup" conversion may now be well defined in HTML5
> but are not necessarily going to be valid against any schema.

Not sure what you mean by that. The validator.nu engine includes a RELAX NG
schema for HTML5, and I think the current EPUB3 validator is actually using
a variant of that schema. The validator.nu engine also includes schemas for
SVG, MathML, and ITS. In general, all of the base schemas are agnostic
about anything related to how the documents are serialized.

It's true there are some things in the HTML spec that most current schema
languages on their own are not capable of representing. For example, the
HTML spec defines any attribute with a prefix of "data-" as being valid,
but I don't know of any way to create a RELAX NG schema to express that (in
the case of the validator.nu engine, it uses a SAX filter to drop the
"data-" attributes before the document is exposed to RELAX NG validation).
But that said, the HTML spec says that "data-" attributes are valid both in
the text/html serialization and in the XML serialization.
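To make the SAX-filter idea concrete, here's a rough Python sketch of the
same technique (my own illustration, not the validator.nu code, which is
written in Java; the DataAttributeFilter and AttrCollector names are made
up for the example). The filter strips "data-*" attributes from the event
stream before a downstream handler -- in a real setup, a schema validator
-- ever sees them:

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler
from xml.sax.saxutils import XMLFilterBase
from xml.sax.xmlreader import AttributesImpl

class DataAttributeFilter(XMLFilterBase):
    """Drop data-* attributes before events reach the downstream handler."""
    def startElement(self, name, attrs):
        kept = {k: attrs.getValue(k)
                for k in attrs.getNames()
                if not k.startswith("data-")}
        super().startElement(name, AttributesImpl(kept))

class AttrCollector(ContentHandler):
    """Stand-in for a validator: records the attribute names it is handed."""
    def __init__(self):
        super().__init__()
        self.seen = []

    def startElement(self, name, attrs):
        self.seen.extend(attrs.getNames())

doc = b'<p class="note" data-foo="1">Hello</p>'
filt = DataAttributeFilter(xml.sax.make_parser())
collector = AttrCollector()
filt.setContentHandler(collector)
filt.parse(io.BytesIO(doc))

print(collector.seen)  # ['class'] -- data-foo never reached the "validator"
```

The point being: the filtering happens at the event-stream level, so the
schema itself never has to express "any data-* attribute is valid."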
So supporting only the XML serialization in a validator would not prevent
you from needing to have the validator still consider "data-" attributes
as valid and not report errors for them.

> And, EPUB publications - like websites - are not solely made up of HTML
> content. SVG and MathML are first-class citizens as well for example, and
> AFAIK they are defined as XML-based markup languages, lacking an algorithm
> like HTML5 for processing "tag soup" variants.

The HTML parsing algorithm in the HTML spec itself actually already fully
supports parsing of SVG and MathML in text/html, and all major browser
engines have already implemented and shipped with that support.

  http://www.w3.org/html/wg/drafts/html/master/syntax.html#tree-construction

> Is W3C going to move away from XML altogether and define "tag soup"
> parsing for every specification that's part of the Open Web Platform?

I don't think the W3C has yet moved away from XML altogether for any markup
language -- not even for Web markup languages. The only existing Web
languages we've run into so far that required parsing changes to be defined
for "tag soup" parsing are SVG and MathML. But other than SVG and MathML,
pretty much all other markup-language specifications for the Open Web
Platform were never bound to XML to begin with.

And for one of the current proposals I can see for addition of new markup
to the Web platform -- as part of what's being called Web Components -- the
discussions we've been having are centered so far on just how to handle it
in text/html ("tag soup") parsers. Some people in those discussions have
gone so far as to say we shouldn't attempt to define how to handle that
markup for the XML-serialization case, and that it can just be a feature
that authors need to use the text/html serialization for if they want to
use it at all.
So I guess it would be fair to say that at least among the target Web
developers for some of those new features, they want them for the text/html
serialization, and the UA implementors who are looking at implementing them
are focused on coming up with something that works for the text/html case.
In other words, it's probably safe to bet that all new features that get
added to the Web Platform will work in the text/html serialization at
least -- but it's not clear that they are all going to be available in the
XML serialization.

> If not then it seems that HTML more
> than EPUB could be considered the special case,

HTML is *the* case for the Web Platform. So to the degree that you want
EPUB to be the portable-document packaging of the Web Platform, EPUB is
right now the special case, in that it requires XML well-formedness while
the Web Platform does not.

> and that being due to HTML's own backwards-compatibility reality.

I don't think backwards compatibility is the sole reason, or even the main
reason, why the Web Platform and HTML have evolved with parsing behavior
that includes error recovery (as HTML does but XML currently doesn't). The
reason the Web has parsers capable of error recovery, and the reason most
Web authors and developers choose to target content to them instead of to
XML parsers, is that it's a better fit for the realities of the Web -- or
really, a better fit for document publishing and sharing in general.

If nobody thought it was a better fit, we probably wouldn't have some
really smart people spending their time planning to specify a new version
of XML (MicroXML) that doesn't have XML1's "catch fire and fail" draconian
error-handling requirement, or even a new version of XML (XML-ER) that
defines error-recovery behavior very much like the error-recovery behavior
that "tag soup" text/html has (in fact the specification for XML-ER
error-recovery behavior is modeled on the behavior in text/html).
  https://dvcs.w3.org/hg/microxml/raw-file/tip/spec/microxml.html
  https://dvcs.w3.org/hg/xml-er/raw-file/tip/Overview.html

> I'm not suggesting we go back to the days when we tried to ignore this
> reality and pursued the quixotic goal of pushing everything to XHTML. But
> EPUB is about structured, packaged content - data that's generated,
> consumed and manipulated by a variety of tools, not only rendered in a
> browser,

None of that tooling needs to make XML an absolute requirement, especially
over the long term. It's true that a lot of people have spent a lot of the
last 10+ years building up tooling environments around XML parsers, and we
have XML parsers in a lot more places right now than we have good
conforming HTML parsers. But I think the way to deal with that is not to
keep sinking money and time exclusively into toolchains that require
well-formed XML1 as input, but instead to improve those toolchains so that
they can consume and process all the same real-world Web content that
browsers are capable of handling, not just the fraction of Web content that
is well-formed XML.

> and doesn't have the same backwards compatibility issues as web pages but
> in fact the opposite due to EPUB's XML beginnings. So I'm not sure I
> personally see a strong reason that this should change. But again I think
> this ultimately will likely depend more on what W3C does around XML in
> general, rather than anything else.

I think there are actually some very strong reasons that EPUB reading
systems and processing tools should change to being able to handle
text/html content, as I hope I've made clear in my comments above.

  --Mike

--
Michael[tm] Smith
http://people.w3.org/mike
Received on Saturday, 26 January 2013 12:30:07 UTC