[whatwg] HTML5 Parsing spec first draft ready from Henri Sivonen on 2006-02-19 (public-whatwg-archive@w3.org from February 2006)

From: Henri Sivonen <hsivonen@iki.fi>
Date: Sun, 19 Feb 2006 22:38:43 +0200
Message-ID: <A24BD2AA-5C4A-4374-9347-23C3D6774741@iki.fi>

On Feb 16, 2006, at 00:56, Dan Brickley wrote:

> Discussing some related work (GRDDL) in the W3C SemWeb CG, I was
> wondering whether there is any way your parser spec could be
> specified as input for a GRDDL transform. GRDDL provides techniques for
> transforming XML-based languages (including XHTML) into an RDF
> representation; typically by reference to an XSLT. If the WHATWG
> parser spec defined itself in terms of some XML-shaped output, the two
> should chain nicely together. Have you considered defining the parser
> behaviour in terms of XML concepts?

HTML5 parsing for browsers that support scripting needs to be defined  
in such a way that a legacy-compatible HTML DOM is produced. However,  
there are apps other than browsers (e.g. CMSs, conformance checkers
and search engines) that will, in my opinion, be better off if they  
don't run their code against the HTML DOM but instead convert HTML  
documents into equivalent XHTML documents as early as possible and  
then work with XHTML internally. I guess whatever apps use GRDDL or  
XSLT are likely to be in the class of apps that are better off  
working with XHTML internally.

(In the conversion from HTML to XHTML, the XHTML serialization can be  
optimized away and does not have to exist in memory at any stage.  
With HTML 4.01 and Java, TagSoup would be appropriate for the job.)

To this end, I think it would be beneficial if for every conforming  
HTML5 document there was an unambiguous equivalent representation in  
canonicalized (per XML C14N) XHTML. I have not reviewed the spec  
lately to see if this is already the case, but I expect it to be.  
(Obviously, this cannot be the case for non-conforming documents,
since the output DOM of the parsing algorithm can have e.g. attribute
names that are forbidden in XML 1.0.)

Off the top of my head, the changes from the HTML parsing output
involve (besides lowercasing names and putting elements in the XHTML
1.x namespace) getting rid of the meta element conveying character
encoding information, mapping the lang attribute to xml:lang, copying
the names of boolean attributes into their values and perhaps dealing
with some issues with line breaks in attribute values.
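
As a rough illustration of the changes listed above, here is a minimal
Python sketch. It is not from the spec: the (name, attrs, children)
tuple format for the parser output, the function names and the short
BOOLEAN_ATTRS list are all assumptions made up for this example, and
the line-break issue in attribute values is not handled.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
XML_NS = "http://www.w3.org/XML/1998/namespace"

# Hypothetical, incomplete list of boolean attributes (illustration only).
BOOLEAN_ATTRS = {"checked", "disabled", "selected", "readonly", "multiple"}


def is_encoding_meta(name, attrs):
    """True for a meta element that only conveys the character encoding.

    Expects an already lowercased element name and attribute names.
    """
    if name != "meta":
        return False
    if "charset" in attrs:
        return True
    return attrs.get("http-equiv", "").lower() == "content-type"


def to_xhtml(node):
    """Convert a (name, attrs, children) tuple into an ElementTree element.

    Children are either nested tuples or plain text strings. This input
    shape is an assumption of the sketch, not a real parser API.
    """
    name, attrs, children = node
    # Lowercase the element name and put it in the XHTML namespace.
    elem = ET.Element("{%s}%s" % (XHTML_NS, name.lower()))
    for key, value in attrs.items():
        key = key.lower()
        if key == "lang":
            # Map lang to xml:lang.
            elem.set("{%s}lang" % XML_NS, value)
        elif key in BOOLEAN_ATTRS:
            # Copy the name of a boolean attribute into its value.
            elem.set(key, key)
        else:
            elem.set(key, value)
    last = None
    for child in children:
        if isinstance(child, str):
            # Text goes into .text before the first child, .tail afterwards.
            if last is None:
                elem.text = (elem.text or "") + child
            else:
                last.tail = (last.tail or "") + child
            continue
        child_name, child_attrs, _ = child
        lowered = {k.lower(): v for k, v in child_attrs.items()}
        if is_encoding_meta(child_name.lower(), lowered):
            # Drop the meta element that conveys encoding information.
            continue
        last = to_xhtml(child)
        elem.append(last)
    return elem
```

Calling to_xhtml() on the root tuple and passing the result to
ET.tostring() gives a namespaced XML serialization. Canonicalizing it
per XML C14N would be a separate step that this sketch does not attempt.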

Whether the spec needs to say any of this is another matter  
altogether. For interop, speccing what browsers need to do is the  
most important task.

-- 
Henri Sivonen
hsivonen at iki.fi
http://hsivonen.iki.fi/

Received on Sunday, 19 February 2006 12:38:43 UTC