Re: question about XML and HTML5 from Henri Sivonen on 2009-06-18 (www-archive@w3.org from June 2009)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Thu, 18 Jun 2009 10:46:48 +0300
To: Jonathan Rees <jar@creativecommons.org>
Cc: Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
Message-Id: <5996C126-511A-45D4-BF12-912B97FC30C9@iki.fi>

On Jun 17, 2009, at 14:47, Jonathan Rees wrote:

> I don't see how your answer or the linked documents bear on my
> question, so let me amplify.

Anne's answer seems entirely relevant to me.

> The ideal situation:  you can take any HTML5 document, convert it to
> some XML-based language designed for the purpose (not necessarily
> XHTML), convert it back, and get a semantically equivalent HTML5
> document.

The only HTML5 to XML conversion we have defined is conversion to
XHTML5, which is not fully reversible in some edge cases.

The edge cases are all arbitrary restrictions that XML places on what  
characters may appear where. For example, the conversion of an HTML5  
document that has a form feed somewhere in element content is lossy,  
because XML doesn't allow form feed in element content. Likewise, the  
conversion is lossy when the source document has local names that are  
not NCNames. Also, the conversion is lossy for documents that have  
Unicode non-characters (e.g. U+FFFF) in element content.
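
To make those three edge cases concrete, here is a rough Python sketch.
It is only an illustration, not code from the Validator.nu HTML Parser.
The helper names (is_xml_char, is_ncname, parses_as_xml) are made up for
this example, and the ranges are my transcription of the Char production
from XML 1.0 and the NCName definition from Namespaces in XML 1.0, so
check them against the specs before relying on them.

```python
import xml.etree.ElementTree as ET


def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


# NameStartChar ranges from XML 1.0 (Fifth Edition), minus ":" because
# an NCName must not contain a colon.
_NC_START = [
    (0x41, 0x5A), (0x5F, 0x5F), (0x61, 0x7A), (0xC0, 0xD6), (0xD8, 0xF6),
    (0xF8, 0x2FF), (0x370, 0x37D), (0x37F, 0x1FFF), (0x200C, 0x200D),
    (0x2070, 0x218F), (0x2C00, 0x2FEF), (0x3001, 0xD7FF), (0xF900, 0xFDCF),
    (0xFDF0, 0xFFFD), (0x10000, 0xEFFFF),
]
# Additional characters allowed after the first one: "-", ".", digits,
# U+00B7, combining marks, and U+203F..U+2040.
_NC_EXTRA = [(0x2D, 0x2E), (0x30, 0x39), (0xB7, 0xB7),
             (0x300, 0x36F), (0x203F, 0x2040)]


def _in(cp: int, ranges) -> bool:
    return any(lo <= cp <= hi for lo, hi in ranges)


def is_ncname(name: str) -> bool:
    """True if name is usable as an XML local name (an NCName)."""
    if not name:
        return False
    first, rest = ord(name[0]), name[1:]
    return _in(first, _NC_START) and all(
        _in(ord(c), _NC_START) or _in(ord(c), _NC_EXTRA) for c in rest
    )


def parses_as_xml(markup: str) -> bool:
    """True if the standard-library XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

With these helpers, is_xml_char(0x0C) and is_xml_char(0xFFFF) are both
false, is_ncname("foo:bar") is false because of the colon, and
parses_as_xml("<p>a\x0cb</p>") is false because the XML parser rejects
the form feed in element content.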

However, for *conforming* HTML5 documents, the only lossiness is
form feed and the loss of semantically void talisman attributes
(attributes in no namespace that have "xml:lang" or "xmlns" as the
local name). Note that "xml:lang" in no namespace means nothing in
text/html, and conformance requires it to be accompanied by "lang" in
no namespace with the same value, which does carry meaning. To the
extent the semantics of a form feed in text/html are the same as the
semantics of a space and the semantics of non-characters are the same
as the semantics of U+FFFD, semantics are round-tripped for
conforming documents.
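
The equivalences named above can be sketched as a small normalization
step in Python. This is a hypothetical illustration of the mapping, not
a description of what the Validator.nu HTML Parser does internally. The
function names and the dict-of-attributes representation (keys standing
for local names of attributes in no namespace) are assumptions made for
the example, as is the choice to map every other character outside the
XML Char production to U+FFFD.

```python
def is_xml_char(cp: int) -> bool:
    """True if code point cp matches the XML 1.0 'Char' production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def to_xml_safe_text(text: str) -> str:
    """Map form feed to space and other non-XML characters to U+FFFD."""
    out = []
    for ch in text:
        if ch == "\f":
            out.append(" ")        # form feed treated as a space
        elif not is_xml_char(ord(ch)):
            out.append("\ufffd")   # e.g. U+FFFF becomes U+FFFD
        else:
            out.append(ch)
    return "".join(out)


# Attributes in no namespace whose local name is one of these carry no
# meaning in text/html and are dropped on conversion.
TALISMAN_NAMES = {"xml:lang", "xmlns"}


def drop_talisman_attributes(attrs: dict) -> dict:
    """Return a copy of attrs without the talisman attributes."""
    return {k: v for k, v in attrs.items() if k not in TALISMAN_NAMES}
```

For example, to_xml_safe_text("a\fb") gives "a b", and dropping the
talismans from {"lang": "en", "xml:lang": "en"} leaves {"lang": "en"},
which is the attribute that actually carries the language in text/html.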

So I think it's fair to say that for conforming HTML5 documents,
HTML5->XHTML5->HTML5 round trips semantics. (Note, however, that the
conversion from XHTML5 to HTML5 is lossless if the XHTML5 document was  
a result of an HTML5 to XHTML5 conversion but it isn't lossless for  
arbitrary XHTML5 documents.)

> The problem I'm worried about is the lack of interoperability between
> HTML5 and XML processors. (It has nothing to do with browsers.) Other
> specs such as OWL 2 and XQuery have addressed this problem by
> providing XML syntax as an alternative. But this only achieves the
> intended effect if semantics-preserving round trips work.

The Validator.nu HTML Parser works as a drop-in replacement for an XML  
parser in apps that have been programmed to consume XHTML using the  
DOM, SAX or XOM APIs. That is, the Validator.nu HTML Parser appears to  
the application as if it were an XML parser parsing XHTML5.

> For comparison, 'tidy' provides conversion from HTML4 to XHTML (I
> think), and the resulting XHTML is in a subset (I think) of HTML4, so
> the round trip property holds.

The Validator.nu HTML Parser comes with a sample application called  
HTML2XML. When the input is a conforming HTML5 document, the output is  
the semantically equivalent XHTML5 document. HTML2XML doesn't repair  
non-conforming documents.

You can obtain the Java version from http://about.validator.nu/htmlparser/

Sam Ruby is working on a version that doesn't require the JVM
invocation overhead:
http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C

If your pipeline is in Java, you don't need HTML2XML; you should
just use the Validator.nu HTML Parser directly, which optimizes away
the steps of serializing as XML and reparsing it.

> I assume this approach doesn't work for
> HTML5, which is why I do not necessarily have XHTML in mind as the
> representation.

In my opinion, it would be bad if XHTML5 weren't the XML  
representation for HTML5 you can use in this case.

Our draft Design Principles contain the DOM Consistency design  
principle that is intended to keep the design of HTML5 such that  
XHTML5 is that representation. ("DOM" is rather browser-oriented. It  
helps to read it as "Infoset Consistency".)
http://www.w3.org/TR/html-design-principles/#dom-consistency

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/

Received on Thursday, 18 June 2009 07:47:30 UTC