Use cases from Norman Walsh on 2010-12-30 (public-html-xml@w3.org from December 2010)

From: Norman Walsh <ndw@nwalsh.com>
Date: Thu, 30 Dec 2010 16:19:23 -0500
To: public-html-xml@w3.org
Message-ID: <m2r5czm010.fsf@nwalsh.com>
Hello world,

I think we'll have trouble making any progress unless we begin by
understanding the problem we're tasked with addressing and finding a
way to articulate it clearly and precisely.

Near as I can tell, there are five possible use cases for HTML+XML:

1. I have an XML toolchain and I want to consume HTML5 because I'd
   like to process HTML5 using XML tools.

In more constrained environments, it may be possible to arrange for
all the HTML5 content to be authored in XHTML5 and then there's no
parsing problem.

The wild and woolly HTML5 of the open internet is what it is. The only
way to get that stuff is going to be to use an HTML5 parser. HTML5
parsers that produce a stream of well-formed events suitable for
constructing XML already exist, so this looks like a mostly solved
parsing problem.

The semantics of the HTML5 elements are described by the HTML5
specification. It may be necessary/useful/convenient to shuffle
namespaces a bit in the parsed content, for example to put SVG and
MathML back in their respective namespaces so that your existing XML
tools will do the right thing.
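
As an illustration of the namespace shuffling described above, here is a
minimal sketch in Python. It assumes the HTML5 parser has already handed
over an ElementTree in which the svg and math subtrees carry no namespace;
that starting point, and the helper name restore_namespaces, are
assumptions made for the example and not something any particular parser
promises. It also pushes everything under svg or math into that
namespace, so HTML nested inside an SVG foreignObject would need extra
handling.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"
MATHML_NS = "http://www.w3.org/1998/Math/MathML"
# Element names that open a foreign-content subtree.
FOREIGN_ROOTS = {"svg": SVG_NS, "math": MATHML_NS}


def restore_namespaces(elem, ns=None):
    """Move un-namespaced svg/math subtrees into their namespaces, in place."""
    if isinstance(elem.tag, str) and not elem.tag.startswith("{"):
        # An un-namespaced svg or math element sets the namespace
        # that is then inherited by its descendants.
        ns = FOREIGN_ROOTS.get(elem.tag, ns)
        if ns is not None:
            elem.tag = "{%s}%s" % (ns, elem.tag)
    for child in elem:
        restore_namespaces(child, ns)
    return elem


doc = ET.fromstring(
    "<html><body><p>x</p><svg><circle r='1'/></svg></body></html>"
)
restore_namespaces(doc)
print([e.tag for e in doc.iter()])
```

After the call, namespace-aware XML tools can select the SVG content by
its qualified name while the HTML elements are left untouched.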

2. I have an HTML5 toolchain and I want to consume XML because I'd
   like to process XML using HTML5 tools.

The HTML5 parser will be able to parse the XML so there's no parsing
problem. It'll build the DOM that the HTML5 spec says such a document
represents. There may be some namespace issues, but this should mostly
"just work".

A simpler subset of XML might be created to make life easier for the
cases that would be covered by such a subset.

3. I have an XML document and I want to embed islands of human prose
   marked up with HTML5 in it because I want to be able to extract
   those sections for use in, for example, documentation.

If you expect the document to remain well-formed XML, you'll have to
author with XHTML5 and then there won't be any parsing problems.

The same semantic questions that arise in point 1 still apply.
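
For use case 3, extracting the islands is ordinary namespace-aware XML
processing once the prose is authored as XHTML5. The sketch below assumes
the islands are marked with the XHTML namespace; the sample document, the
"api" vocabulary, and the helper name xhtml_islands are all hypothetical.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def xhtml_islands(elem):
    """Yield the outermost elements in the XHTML namespace under elem."""
    if isinstance(elem.tag, str) and elem.tag.startswith("{%s}" % XHTML_NS):
        yield elem  # do not descend: nested XHTML belongs to this island
        return
    for child in elem:
        yield from xhtml_islands(child)


# Hypothetical host document with one island of XHTML prose.
SAMPLE = """<api xmlns:h="http://www.w3.org/1999/xhtml">
  <function name="frob">
    <doc><h:p>Frobs the <h:code>widget</h:code>.</h:p></doc>
  </function>
</api>"""

islands = list(xhtml_islands(ET.fromstring(SAMPLE)))
texts = ["".join(island.itertext()) for island in islands]
print(texts)
```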

4. I have an HTML5 document and I want to embed islands of XML in it
   because I want to be able to write JavaScript and CSS to manipulate
   those elements, for example, in the browser.

On the surface, this would seem to be a perfectly straightforward
proposition. The XML content will be, by definition, well formed. The
HTML5 parser might treat namespace declarations as simple attributes,
but one expects a tree with at least an isomorphic shape in terms of
elements and other nodes.

It turns out that this isn't the case. The HTML5 parsing rules
explicitly flatten parts of the XML content if any of a wide variety
of element names occur inside the fragment. (Including, but I do not
assert limited to, "b", "big", "blockquote", "body", "br", "center",
"code", "dd", "div", "dl", "dt", "em", "embed", "h1", "h2", "h3",
"h4", "h5", "h6", "head", "hr", "i", "img", "li", "listing", "menu",
"meta", "nobr", "ol", "p", "pre", "ruby", "s", "small", "span",
"strong", "strike", "sub", "sup", "table", "tt", "u", "ul", "var", and
"font" if certain attributes are present.)
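
One way to find out ahead of time whether an XML fragment is likely to
hit these rules is to scan it for the names listed above. The following
is a rough sketch, not a reimplementation of the HTML5 parsing
algorithm. The name set is copied from the list in this message, which
does not claim to be complete. The helper name risky_names is
hypothetical. Because the message does not say which attributes matter
for "font", the sketch conservatively flags any font element that has
attributes. It also compares local names after prefix resolution, so it
may over-report for prefixed elements.

```python
import xml.etree.ElementTree as ET

# Names copied from the list in the message; not asserted to be complete.
FLATTENING_NAMES = frozenset("""
b big blockquote body br center code dd div dl dt em embed h1 h2 h3 h4 h5 h6
head hr i img li listing menu meta nobr ol p pre ruby s small span strong
strike sub sup table tt u ul var
""".split())


def risky_names(xml_text):
    """Return the sorted, lowercased local names in xml_text found in the list."""
    found = set()
    for elem in ET.fromstring(xml_text).iter():
        if not isinstance(elem.tag, str):
            continue  # skip comments and processing instructions
        # Drop any {namespace} prefix; HTML parsers lowercase tag names.
        local = elem.tag.rsplit("}", 1)[-1].lower()
        if local in FLATTENING_NAMES:
            found.add(local)
        elif local == "font" and elem.attrib:
            # Conservative: flags "font" with ANY attribute present.
            found.add(local)
    return sorted(found)


fragment = "<recipe><title>Soup</title><p>Boil.</p><step><b>Stir</b></step></recipe>"
print(risky_names(fragment))  # ['b', 'p']
```

An empty result only means none of the listed names occur; it does not
guarantee that the HTML5 parser will preserve the fragment's shape.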

There are a number of other rules in this area relating to how MathML
and SVG are parsed and various conditions under which parsing modes
shift in ways that I don't fully comprehend.

5. I have a deeper nesting, XML containing HTML5 containing XML or
   HTML5 containing XML containing HTML5 because I'm reusing content
   that independently arose through use cases 3 or 4.

I think the answer to this use case falls naturally out of whatever
resolution arises for cases 3 and 4, but it might be worth considering
explicitly along the way.

What other use cases are there?

If the five I've outlined pretty much cover the space in question (and
I make no such assertion, though it seems so to me) then I think the
two most obvious problems that might be amenable to a technical
solution are (a) how to simplify XML so that there's a shorter
cognitive distance from HTML5 to XML and (b) how to make it possible
to embed arbitrary XML fragments in HTML5 such that the resulting DOM
has a tree structure at least broadly isomorphic to what an XML parser
would produce.
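
To make "broadly isomorphic" testable, one possible reading is that two
trees have the same shape when their element local names nest in the
same way, ignoring namespaces, attributes, and text. That reading is an
assumption made for this sketch, as are the helper names shape and
same_shape. The "flattened" tree below is a hand-written stand-in and
not actual HTML5 parser output.

```python
import xml.etree.ElementTree as ET


def shape(elem):
    """Return (local name, child shapes), ignoring text and attributes."""
    local = elem.tag.rsplit("}", 1)[-1]  # drop any {namespace} prefix
    return (local, tuple(shape(child) for child in elem))


def same_shape(a, b):
    """True if the two element trees nest the same local names the same way."""
    return shape(a) == shape(b)


as_authored = ET.fromstring("<data><item><p>one</p></item></data>")
# Same nesting, different namespace: counts as the same shape here.
namespaced = ET.fromstring(
    "<data xmlns='urn:example'><item><p>one</p></item></data>"
)
# Hand-written stand-in for a flattened result (hypothetical).
flattened = ET.fromstring("<data><item/><p>one</p></data>")

print(same_shape(as_authored, namespaced))
print(same_shape(as_authored, flattened))
```

A check along these lines could compare the output of an XML parser with
a DOM produced by an HTML5 parser, once the latter has been converted to
the same tree representation.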

Have I gone totally off the rails somewhere?

                                        Be seeing you,
                                          norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
www.marklogic.com
Received on Thursday, 30 December 2010 21:20:01 UTC