- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Thu, 18 Jun 2009 10:46:48 +0300
- To: Jonathan Rees <jar@creativecommons.org>
- Cc: Anne van Kesteren <annevk@opera.com>, Dan Connolly <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
On Jun 17, 2009, at 14:47, Jonathan Rees wrote:

> I don't see how your answer or the linked documents bear on my
> question, so let me amplify.

Anne's answer seems entirely relevant to me.

> The ideal situation: you can take any HTML5 document, convert it to
> some XML-based language designed for the purpose (not necessarily
> XHTML), convert it back, and get a semantically equivalent HTML5
> document.

The only HTML5 to XML conversion we have defined is conversion to XHTML5, which is not a 100% reversible conversion for some edge cases. The edge cases are all arbitrary restrictions that XML places on what characters may appear where.

For example, the conversion of an HTML5 document that has a form feed somewhere in element content is lossy, because XML doesn't allow form feed in element content. Likewise, the conversion is lossy when the source document has local names that are not NCNames. Also, the conversion is lossy for documents that have Unicode non-characters (e.g. U+FFFF) in element content.

However, for *conforming* HTML5 documents, the only lossiness is form feed and the loss of semantically void talisman attributes (attributes in no namespace that have "xml:lang" or "xmlns" as the local name). Note that "xml:lang" in no namespace means nothing in text/html, and conformance requires it to be accompanied by "lang" in no namespace with the same value, which does carry meaning.

To the extent the semantics of a form feed in text/html are the same as the semantics of a space and the semantics of non-characters are the same as the semantics of U+FFFD, for conforming documents, semantics are round-tripped.

So I think it's fair to say that for conforming HTML5 documents, HTML5->XHTML5->HTML5 round trips semantics.

(Note, however, that the conversion from XHTML5 to HTML5 is lossless if the XHTML5 document was a result of an HTML5 to XHTML5 conversion but it isn't lossless for arbitrary XHTML5 documents.)
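To make the character restrictions concrete, here is a minimal sketch in Python (standard library only). It is an illustration, not the Validator.nu HTML Parser's actual code: the helper names `is_xml10_char`, `make_xml_safe` and `is_well_formed` are hypothetical, and mapping every other disallowed character to U+FFFD is an assumption that goes beyond the two cases discussed above.

```python
import xml.etree.ElementTree as ET


def is_xml10_char(ch: str) -> bool:
    """True if ch matches the XML 1.0 'Char' production."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def make_xml_safe(text: str) -> str:
    """Lossy mapping of text content into something XML 1.0 can carry.

    Form feed becomes a space and U+FFFE/U+FFFF become U+FFFD, as
    discussed above. Any other disallowed character is also mapped
    to U+FFFD (an assumption made for this sketch).
    """
    out = []
    for ch in text:
        if is_xml10_char(ch):
            out.append(ch)
        elif ch == "\f":
            out.append(" ")
        else:
            out.append("\uFFFD")
    return "".join(out)


def is_well_formed(xml: str) -> bool:
    """True if the stdlib XML parser accepts the string."""
    try:
        ET.fromstring(xml)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    raw = "a\fb\uFFFFc"
    print(is_well_formed("<p>a\fb</p>"))                      # form feed in element content
    print(is_well_formed("<p>" + make_xml_safe(raw) + "</p>"))
```

The mapping is one-way: once a form feed has become a space, the XML side can no longer tell the two apart, which is the lossiness described above.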
> The problem I'm worried about is the lack of interoperability between
> HTML5 and XML processors. (It has nothing to do with browsers.) Other
> specs such as OWL 2 and XQuery have addressed this problem by
> providing XML syntax as an alternative. But this only achieves the
> intended effect if semantics-preserving round trips work.

The Validator.nu HTML Parser works as a drop-in replacement for an XML parser in apps that have been programmed to consume XHTML using the DOM, SAX or XOM APIs. That is, the Validator.nu HTML Parser appears to the application as if it were an XML parser parsing XHTML5.

> For comparison, 'tidy' provides conversion from HTML4 to XHTML (I
> think), and the resulting XHTML is in a subset (I think) of HTML4, so
> the round trip property holds.

The Validator.nu HTML Parser comes with a sample application called HTML2XML. When the input is a conforming HTML5 document, the output is the semantically equivalent XHTML5 document. HTML2XML doesn't repair non-conforming documents.

You can obtain the Java version from
http://about.validator.nu/htmlparser/

Sam Ruby is working on a version that doesn't require the JVM invocation overhead:
http://intertwingly.net/blog/2009/06/15/Invoking-HtmlParser-from-C

If your pipeline is in Java, you don't need HTML2XML but you should just use the Validator.nu HTML Parser directly, which optimizes away the steps of serializing as XML and reparsing it.

> I assume this approach doesn't work for
> HTML5, which is why I do not necessarily have XHTML in mind as the
> representation.

In my opinion, it would be bad if XHTML5 weren't the XML representation for HTML5 you can use in this case. Our draft Design Principles contain the DOM Consistency design principle that is intended to keep the design of HTML5 such that XHTML5 is that representation. ("DOM" is rather browser-oriented. It helps to read it as "Infoset Consistency".)
http://www.w3.org/TR/html-design-principles/#dom-consistency

-- 
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
Received on Thursday, 18 June 2009 07:47:30 UTC