Re: question about XML and HTML5

On Wed, 17 Jun 2009, Jonathan Rees wrote:
>
> This question sounds so stupid that I didn't want to ask it in public.

It is not at all a stupid question.


> Many web-related languages that have idiosyncratic syntax also provide 
> an XML surface syntax. Examples are Turtle (RDF/XML), xquery, OWL 2 
> (OWL/XML). To ensure that HTML5 can participate in XML pipelines in a 
> standard way, wouldn't it be a good idea to have a standard XML surface 
> syntax for HTML5, with semantics preserved over round trips? Perhaps 
> this even could be done using a set of extensions to XHTML.

HTML5 defines a couple of ways to do this.

One is that, like the examples you give, HTML5 defines an XML syntax 
("XHTML5") that can directly partake in the XML ecosystem. While there are 
some limitations in XML that prevent it from supporting all the features 
of the text/html syntax, and similarly some limitations of the text/html 
syntax that prevent it from supporting all the features of the XML syntax, 
it is possible to use HTML5 in such a way that documents can be 
round-tripped from HTML to XML and back again.
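
As a rough illustration of what "partaking in the XML ecosystem" means 
here, the sketch below feeds a small XHTML5 document to a generic XML 
parser (Python's standard xml.etree.ElementTree). The sample document and 
the variable names are made up for the example; the only thing taken from 
the XML syntax itself is the XHTML namespace.

```python
import xml.etree.ElementTree as ET

# The XHTML namespace; elements in the XML syntax live in it.
XHTML_NS = "http://www.w3.org/1999/xhtml"

# A made-up XHTML5 document, written so that it is well-formed XML.
doc = """<html xmlns="http://www.w3.org/1999/xhtml">
  <head><title>Example</title></head>
  <body>
    <p>First<br/>line</p>
    <p>Second</p>
  </body>
</html>"""

# Any namespace-aware XML tool can now process the document.
root = ET.fromstring(doc)
paragraphs = root.findall(f".//{{{XHTML_NS}}}p")
title = root.find(f".//{{{XHTML_NS}}}title").text

print(title, len(paragraphs))
```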

If you are OK with some loss of fidelity in the original conversion (e.g. 
the way that HTML Tidy loses some of the original document's precise state 
in the conversion to XHTML), then the simplest solution is to convert from 
HTML5 to XHTML5 and back again. Once you have a document that is 
expressible as XHTML5 and HTML5, it will round-trip safely.

The second way for HTML5 documents to take part in the XML ecosystem is by 
using an HTML parser that is compatible with the XML pipeline. The HTML5 
spec in fact defines a set of transformations that will take any text/html 
document and make it Infoset-compatible with XML tools. For example, it 
defines how you can take a comment that contains two consecutive "-" 
characters in its data (which is non-conforming in HTML but can 
nonetheless be parsed) and transform it such that it will be compatible 
with XML pipelines that enforce XML's rules on comment data (where the 
string "--" isn't possible in comment data).
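
A minimal sketch of that comment transformation, assuming the approach of 
inserting a space between consecutive hyphens and after a trailing hyphen. 
The function name is hypothetical, and a real tool might pick a different 
but equally valid repair.

```python
def coerce_comment_data(data: str) -> str:
    """Rewrite comment data so it satisfies XML's rules for comments.

    Assumed strategy: break up every "--" with a space, and pad a
    trailing "-" so the comment cannot end in "--->".
    """
    # Repeat until no two hyphens are adjacent (handles runs like "---").
    while "--" in data:
        data = data.replace("--", "- -")
    # XML comment data must not end with a hyphen either.
    if data.endswith("-"):
        data += " "
    return data


print(coerce_comment_data("a -- b"))
```

Note that this is lossy: once the space has been inserted, the original 
comment data cannot be recovered from the XML side.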

Now if the desire is to take any arbitrary text/html stream, including 
scripted streams, and merely package it as XML in the same way as, say, a 
binary executable might be packaged, and then to return it to text/html 
without having changed its semantics, the easiest solution is to 
base64-encode the source document's bytes, and put that into an XML 
document directly. One could also imagine some solutions that try to 
preserve the structure of the original document as much as possible, but 
if the goal is to round-trip the actual bytes, doing so would be 
prohibitively difficult. (For example you'd have to have ways to encode 
unpaired surrogates in UTF-16 source documents, non-UTF-8 bytes in UTF-8 
documents, unmatched tag pairs, etc.)
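
The base64 approach could look something like the sketch below, using only 
Python's standard library. The wrapper element name "packaged-html" and 
both function names are invented for the example; nothing here is defined 
by HTML5.

```python
import base64
import xml.etree.ElementTree as ET


def wrap_html_bytes(source: bytes) -> bytes:
    """Package arbitrary text/html bytes as an opaque XML payload."""
    # "packaged-html" is a hypothetical element name.
    root = ET.Element("packaged-html")
    root.text = base64.b64encode(source).decode("ascii")
    return ET.tostring(root, encoding="utf-8")


def unwrap_html_bytes(xml_doc: bytes) -> bytes:
    """Recover the exact original bytes from the XML wrapper."""
    root = ET.fromstring(xml_doc)
    return base64.b64decode(root.text or "")


# Bytes that are neither well-formed markup nor valid UTF-8.
original = b"<p>unclosed <b>tag\xff\xfe"
assert unwrap_html_bytes(wrap_html_bytes(original)) == original
```

The bytes survive exactly, but the XML side sees only an opaque blob, so 
no XML tool can do anything useful with the document's structure.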

It isn't clear that the latter is actually useful, though. The "XHTML5" 
solution seems to be the most practical solution for most purposes.

HTH,
-- 
Ian Hickson               U+1047E                )\._.,--....,'``.    fL
http://ln.hixie.ch/       U+263A                /,   _.. \   _\  ;`._ ,.
Things that are impossible just take longer.   `._.-(,_..'--(,_..'`-.;.'

Received on Wednesday, 17 June 2009 18:55:45 UTC