- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Wed, 16 Apr 2008 13:16:16 +0300
On Apr 16, 2008, at 12:58, Paul Libbrecht wrote:

>> In fact, the reason why the proportion of Web pages that get parsed
>> as XML is negligible is that the XML approach totally failed to
>> plug into the existing text/html network effects[...]
>
> My hypothesis here is that this problem is mostly a parsing problem
> and not a model problem.

HTML5 mixes the two. For backwards compatibility in scripted browser
environments, the HTML DOM can't behave exactly like the XHTML5 DOM.
For non-scripted non-browser environments, using an XML data model
(XML DOM, XOM, JDOM, dom4j, SAX, ElementTree, lxml, etc., etc.) works
fine.

> There are tools that convert quite a lot of text/html pages (whose
> compliance is user-defined to be "it works in my browser") to an
> XML stream today. NekoHTML is one of them. The goal would be to
> formalize this parsing, and just this parsing.

Like NekoHTML and TagSoup, the Validator.nu HTML parser turns
text/html input into Java XML models. The difference is that the
Validator.nu HTML parser implements the HTML5 algorithm instead of
something the authors of NekoHTML and TagSoup figured out on their
own.

So if you are asking for a NekoHTML-like product for HTML5, it
already exists and supports three popular Java XML APIs (SAX, DOM and
XOM). Not XNI, though, at the moment. (It doesn't support the recent
MathML addition *yet*.) A minimal usage sketch is in the P.S. below.

http://about.validator.nu/htmlparser/

>>> Currently HTML5 defines at the same time parsing and the model and
>>> this is what can cause us to expect that XML is getting weaker. I
>>> believe that the whole model-definition work of XML is rich, has
>>> many libraries, has empowered a lot of great developments and it
>>> is a bad idea to drop it instead of enriching it.
>>
>> The dominant design of non-browser HTML5 parsing libraries is
>> exposing the document tree using an XML parser API. The non-browser
>> HTML5 libraries, therefore, plug into the network of XML libraries.
>> For example, Validator.nu's internals operate on SAX events that
>> look like SAX events for an XHTML5 document. This allows
>> Validator.nu to use libraries written for XML, such as oNVDL and
>> Saxon.
>
> So, except for needing yet another XHTML version to accommodate all
> wishes, I think it would be much saner that browsers'
> implementations and related specifications rely on an XML-based
> model of HTML (as the DOM is) instead of a coupled
> parsing-and-modelling specification which has different
> interpretations at different places.

HTML5 already specifies parsing in terms of DOM output. However, when
the DOM is in the HTML mode, it has to be slightly different.

--
Henri Sivonen
hsivonen@iki.fi
http://hsivonen.iki.fi/
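P.S. For concreteness, here is a minimal sketch of plugging the parser
into an ordinary JAXP pipeline as a SAX XMLReader. It assumes the SAX
entry point is nu.validator.htmlparser.sax.HtmlParser taking an
XmlViolationPolicy in its constructor; the exact package and class
names may differ between snapshots, so treat this as illustrative
rather than as the definitive API.

  import javax.xml.transform.Transformer;
  import javax.xml.transform.TransformerFactory;
  import javax.xml.transform.sax.SAXSource;
  import javax.xml.transform.stream.StreamResult;

  import org.xml.sax.InputSource;

  import nu.validator.htmlparser.common.XmlViolationPolicy;
  import nu.validator.htmlparser.sax.HtmlParser;

  public class Html2Xhtml {
      public static void main(String[] args) throws Exception {
          // The HTML5 parsing algorithm exposed as an org.xml.sax.XMLReader.
          // ALTER_INFOSET coerces constructs that would be ill-formed as XML,
          // so downstream XML tools always see a well-formed infoset.
          HtmlParser parser = new HtmlParser(XmlViolationPolicy.ALTER_INFOSET);

          // Any SAX-consuming XML tool can sit downstream; here a plain JAXP
          // identity transform reserializes the text/html input as XML.
          Transformer identity =
                  TransformerFactory.newInstance().newTransformer();
          identity.transform(
                  new SAXSource(parser, new InputSource(args[0])),
                  new StreamResult(System.out));
      }
  }

The DOM and XOM routes are analogous: the document comes back as a
plain org.w3c.dom.Document or nu.xom.Document, so existing DOM- and
XOM-based code doesn't need to care that the bytes on the wire were
text/html.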
Received on Wednesday, 16 April 2008 03:16:16 UTC