- From: Ian Hickson <ian@hixie.ch>
- Date: Wed, 17 Jun 2009 18:55:11 +0000 (UTC)
- To: Jonathan Rees <jar@creativecommons.org>
- Cc: Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, public-html@w3.org
On Wed, 17 Jun 2009, Jonathan Rees wrote:
>
> This question sounds so stupid that I didn't want to ask it in public.

It is not at all a stupid question.

> Many web-related languages that have idiosyncratic syntax also provide
> an XML surface syntax. Examples are Turtle (RDF/XML), xquery, OWL 2
> (OWL/XML). To ensure that HTML5 can participate in XML pipelines in a
> standard way, wouldn't it be a good idea to have a standard XML surface
> syntax for HTML5, with semantics preserved over round trips? Perhaps
> this even could be done using a set of extensions to XHTML.

HTML5 defines a couple of ways to do this.

One is that, like the examples you give, HTML5 defines an XML syntax
("XHTML5") that can directly partake in the XML ecosystem.

While there are some limitations in XML that prevent it from supporting
all the features of the text/html syntax, and similarly some limitations
of the text/html syntax that prevent it from supporting all the features
of the XML syntax, it is possible to use HTML5 in such a way that
documents can be round-tripped from HTML to XML and back again. If you
are ok with some loss of fidelity in the original conversion (e.g. the
way that HTML Tidy loses some of the original document's precise state
in the conversion to XHTML), then the simplest solution is to convert
from HTML5 to XHTML5 and back again. Once you have a document that is
expressible as XHTML5 and HTML5, it will round-trip safely.

The second way for HTML5 documents to take part in the XML ecosystem is
by using an HTML parser that is compatible with the XML pipeline. The
HTML5 spec in fact defines a set of transformations that will take any
text/html document and make it Infoset-compatible with XML tools.
For example, it defines how you can take a comment that contains two
consecutive "-" characters in its data (which is non-conforming in HTML
but can nonetheless be parsed) and transform it such that it will be
compatible with XML pipelines that enforce XML's rules on comment data
(where the string "--" isn't possible in comment data).

Now if the desire is to take any arbitrary text/html stream, including
scripted streams, and merely package it as XML in the same way as, say,
a binary executable might be packaged, and then to return it to
text/html without having changed its semantics, the easiest solution is
to base64-encode the source document's bytes and put the result into an
XML document directly.

One could also imagine some solutions that try to preserve the structure
of the original document as much as possible, but if the goal is to
round-trip the actual bytes, doing so would be prohibitively difficult.
(For example you'd have to have ways to encode unpaired surrogates in
UTF-16 source documents, non-UTF-8 bytes in UTF-8 documents, unmatched
tag pairs, etc.)

It isn't clear that the latter is actually useful, though. The "XHTML5"
solution seems to be the most practical solution for most purposes.

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'
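
To make the base64 packaging option above concrete, here is a minimal
sketch in Python. It is only an illustration under stated assumptions:
the wrapper element name "html-payload", its "encoding" attribute, and
the two function names are hypothetical and are not defined by HTML5 or
any other spec.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_html_bytes(html_bytes: bytes) -> bytes:
    """Package raw text/html bytes, untouched, inside a small XML document."""
    # "html-payload" and "encoding" are made-up names for this sketch.
    root = ET.Element("html-payload", {"encoding": "base64"})
    # Base64 output is plain ASCII, so it is always legal XML character data.
    root.text = base64.b64encode(html_bytes).decode("ascii")
    return ET.tostring(root, encoding="utf-8")


def unwrap_html_bytes(xml_bytes: bytes) -> bytes:
    """Recover the original bytes from a document made by wrap_html_bytes."""
    root = ET.fromstring(xml_bytes)
    # An empty payload serializes with no text node, hence the fallback.
    return base64.b64decode(root.text or "")


# A source that XML could not carry directly: a stray non-UTF-8 byte,
# a comment containing "--", and an unmatched tag.
source = b"<p>caf\xe9 <!-- a -- b --><b>unclosed"
packaged = wrap_html_bytes(source)
restored = unwrap_html_bytes(packaged)
```

The bytes come back exactly as they went in, but the XML tools in the
pipeline see only an opaque blob, which is the trade-off described
above.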
Received on Wednesday, 17 June 2009 18:55:45 UTC