- From: Henri Sivonen <hsivonen@iki.fi>
- Date: Sun, 19 Feb 2006 22:38:43 +0200

On Feb 16, 2006, at 00:56, Dan Brickley wrote:

> Discussing some related work (GRDDL) in the W3C SemWeb CG, I was
> wondering whether there is any way your parser spec could be
> specified as input for a GRDDL transform. GRDDL provides techniques
> for transforming XML-based languages (including XHTML) into an RDF
> representation; typically by reference to an XSLT. If the WHATWG
> parser spec defined itself in terms of some XML-shaped output, the two
> should chain nicely together. Have you considered defining the parser
> behaviour in terms of XML concepts?

HTML5 parsing for browsers that support scripting needs to be defined in such a way that a legacy-compatible HTML DOM is produced.

However, there are apps other than browsers (e.g. CMSs, conformance checkers and search engines) that will, in my opinion, be better off if they don't run their code against the HTML DOM but instead convert HTML documents into equivalent XHTML documents as early as possible and then work with XHTML internally. I guess whatever apps use GRDDL or XSLT are likely to be in the class of apps that are better off working with XHTML internally. (In the conversion from HTML to XHTML, the XHTML serialization can be optimized away and does not have to exist in memory at any stage. With HTML 4.01 and Java, TagSoup would be appropriate for the job.)

To this end, I think it would be beneficial if for every conforming HTML5 document there was an unambiguous equivalent representation in canonicalized (per XML C14N) XHTML. I have not reviewed the spec lately to see if this is already the case, but I expect it to be. (Obviously, this cannot be the case for non-conforming documents since the output DOM of the parsing algorithm can have e.g. attribute names that are forbidden in XML 1.0.)

Off the top of my head, the changes from the HTML parsing output involve (besides lowercasing names and putting elements in the XHTML 1.x namespace) getting rid of the meta element conveying character encoding information, mapping the lang attribute to xml:lang, copying the name of boolean attributes into the value and perhaps some issues with line breaks in attribute values.

Whether the spec needs to say any of this is another matter altogether. For interop, speccing what browsers need to do is the most important task.

--
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/
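
As a rough illustration of the changes listed in the message above (lowercasing names, the XHTML namespace, dropping the encoding meta element, lang to xml:lang, boolean attributes), here is a minimal Python sketch. It is not taken from any spec or from TagSoup. It assumes the HTML parse output is available as a namespace-less `xml.etree.ElementTree` tree, that the encoding declaration is a meta element with `http-equiv="Content-Type"`, and that `BOOLEAN_ATTRS` (a hypothetical, incomplete list) names the boolean attributes. The function names are made up for this example, and line breaks in attribute values are not handled.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Assumption: an illustrative, incomplete set of boolean attribute names.
BOOLEAN_ATTRS = {"checked", "disabled", "selected", "readonly", "multiple"}


def is_encoding_meta(tag, attrs):
    """True for a meta element that declares the character encoding.

    Assumption: only the http-equiv="Content-Type" form is recognized.
    """
    if tag != "meta":
        return False
    return attrs.get("http-equiv", "").lower() == "content-type"


def to_xhtml(node):
    """Return an XHTML copy of a namespace-less element, or None if dropped."""
    tag = node.tag.lower()
    attrs = {name.lower(): value for name, value in node.attrib.items()}
    if is_encoding_meta(tag, attrs):
        # Note: the tail text of a dropped element is discarded as well.
        return None
    out = ET.Element("{%s}%s" % (XHTML_NS, tag))
    for name, value in attrs.items():
        if name == "lang":
            out.set(XML_LANG, value)   # lang becomes xml:lang
        elif name in BOOLEAN_ATTRS:
            out.set(name, name)        # copy the attribute name into the value
        else:
            out.set(name, value)
    out.text, out.tail = node.text, node.tail
    for child in node:
        converted = to_xhtml(child)
        if converted is not None:
            out.append(converted)
    return out


# Hand-built stand-in for the output of an HTML parser.
html = ET.Element("HTML", {"LANG": "fi"})
head = ET.SubElement(html, "HEAD")
ET.SubElement(head, "META", {"HTTP-EQUIV": "Content-Type",
                             "CONTENT": "text/html; charset=utf-8"})
body = ET.SubElement(html, "BODY")
ET.SubElement(body, "INPUT", {"DISABLED": ""})

xhtml = to_xhtml(html)
```

The conversion builds a new tree instead of mutating the input, so the original parse output stays available for comparison. Canonicalization per XML C14N would be a separate step applied to the result.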
Received on Sunday, 19 February 2006 12:38:43 UTC