- From: Norman Walsh <ndw@nwalsh.com>
- Date: Thu, 30 Dec 2010 16:19:23 -0500
- To: public-html-xml@w3.org
- Message-ID: <m2r5czm010.fsf@nwalsh.com>
Hello world,

I think that we'll have trouble making progress unless we begin by understanding the problem we're tasked to address and finding a way to articulate it clearly and precisely.

Near as I can tell, there are five possible use cases for HTML+XML:

1. I have an XML toolchain and I want to consume HTML5 because I'd like to process HTML5 using XML tools.

In more constrained environments, it may be possible to arrange for all the HTML5 content to be authored in XHTML5, and then there's no parsing problem.

The wild and woolly HTML5 of the open internet is what it is. The only way to get that stuff is going to be to use an HTML5 parser. HTML5 parsers that produce a stream of well-formed events suitable for constructing XML already exist, so this looks like a mostly solved parsing problem.

The semantics of the HTML5 elements are described by the HTML5 specification. It may be necessary/useful/convenient to shuffle namespaces a bit in the parsed content, for example to put SVG and MathML back in their respective namespaces so that your existing XML tools will do the right thing.

2. I have an HTML5 toolchain and I want to consume XML because I'd like to process XML using HTML5 tools.

The HTML5 parser will be able to parse the XML, so there's no parsing problem. It'll build the DOM that the HTML5 spec says such a document represents. There may be some namespace issues, but this should mostly "just work".

A simpler subset of XML might be created to make life easier for the cases that would be covered by such a subset.

3. I have an XML document and I want to embed islands of human prose marked up with HTML5 in it because I want to be able to extract those sections for use in, for example, documentation.

If you expect the document to remain well-formed XML, you'll have to author with XHTML5, and then there won't be any parsing problems. The same semantic questions that arise in point 1 still apply.

4. 
I have an HTML5 document and I want to embed islands of XML in it because I want to be able to write JavaScript and CSS to manipulate those elements, for example, in the browser.

On the surface, this would seem to be a perfectly straightforward proposition. The XML content will be, by definition, well formed. The HTML5 parser might treat namespace declarations as simple attributes, but one expects a tree with at least an isomorphic shape in terms of elements and other nodes.

It turns out that this isn't the case. The HTML5 parsing rules explicitly flatten parts of the XML content if any of a wide variety of element names occur inside the fragment. (Including, but I do not assert limited to, "b", "big", "blockquote", "body", "br", "center", "code", "dd", "div", "dl", "dt", "em", "embed", "h1", "h2", "h3", "h4", "h5", "h6", "head", "hr", "i", "img", "li", "listing", "menu", "meta", "nobr", "ol", "p", "pre", "ruby", "s", "small", "span", "strong", "strike", "sub", "sup", "table", "tt", "u", "ul", "var", and "font" if certain attributes are present.)

There are a number of other rules in this area relating to how MathML and SVG are parsed, and various conditions under which parsing modes shift in ways that I don't fully comprehend.

5. I have a deeper nesting (XML containing HTML5 containing XML, or HTML5 containing XML containing HTML5) because I'm reusing content that independently arose through use cases 3 or 4.

I think the answer to this use case falls naturally out of whatever resolution arises for cases 3 and 4, but it might be worth considering explicitly along the way.

What other use cases are there?
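
To make use case 4 concrete, the element names listed there can be turned into a rough pre-flight check for an XML fragment before it is embedded in HTML5. The sketch below is Python using only the standard library; it is not an implementation of the HTML5 parsing algorithm. The function name `risky_elements` is made up for illustration, and the `font` attribute names (`color`, `face`, `size`) are an assumption taken from the HTML5 spec rather than from this message. It compares raw qualified names, on the assumption that an HTML5 tokenizer lowercases names and does not resolve namespace prefixes.

```python
import xml.parsers.expat

# Element names quoted in the message as disrupting embedded XML.
BREAKOUT_NAMES = frozenset("""
    b big blockquote body br center code dd div dl dt em embed
    h1 h2 h3 h4 h5 h6 head hr i img li listing menu meta nobr ol
    p pre ruby s small span strong strike sub sup table tt u ul var
""".split())

# Assumption (not from the message): the "certain attributes" for "font".
FONT_ATTRIBUTES = frozenset({"color", "face", "size"})


def risky_elements(xml_text):
    """Return, in document order, the element names that match the list above.

    Raises xml.parsers.expat.ExpatError if xml_text is not well-formed.
    """
    found = []

    def on_start(name, attrs):
        # No namespace processing is enabled, so 'name' is the raw
        # qualified name as written in the source, e.g. "x:table".
        lowered = name.lower()
        if lowered in BREAKOUT_NAMES:
            found.append(name)
        elif lowered == "font" and FONT_ATTRIBUTES & {a.lower() for a in attrs}:
            found.append(name)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(xml_text, True)
    return found


if __name__ == "__main__":
    fragment = (
        '<recipe><title>Soup</title><p>Stir.</p>'
        '<x:table xmlns:x="urn:x"/><FONT size="2"/><font/></recipe>'
    )
    print(risky_elements(fragment))
```

An empty result does not guarantee an isomorphic tree, since the list in the message is explicitly not claimed to be complete; a non-empty result only flags names worth a closer look.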

If the five I've outlined pretty much cover the space in question (and I make no such assertion, though it seems so to me), then I think the two most obvious problems that might be amenable to a technical solution are (a) how to simplify XML so that there's a shorter cognitive distance from HTML5 to XML and (b) how to make it possible to embed arbitrary XML fragments in HTML5 such that the resulting DOM has a tree structure at least broadly isomorphic to what an XML parser would produce.

Have I gone totally off the rails somewhere?

Be seeing you,
  norm

-- 
Norman Walsh
Lead Engineer
MarkLogic Corporation
www.marklogic.com
Received on Thursday, 30 December 2010 21:20:01 UTC