Re: Comments on HTML WG face to face meetings in France Oct 08 from Anne van Kesteren on 2009-01-24 (www-tag@w3.org from January 2009)

From: Anne van Kesteren <annevk@opera.com>
Date: Sat, 24 Jan 2009 12:07:23 +0100
To: "David Orchard" <orchard@pacificspirit.com>, "Henri Sivonen" <hsivonen@iki.fi>
Cc: www-tag@w3.org
Message-ID: <op.un9c6l2b64w2qv@annevk-t60.oslo.opera.com>

On Sat, 24 Jan 2009 05:17:32 +0100, David Orchard  
<orchard@pacificspirit.com> wrote:
> That would be very interesting if we could actually create an XML5  
> parser,

I've done it (quite some time ago):

   http://code.google.com/p/xml5/

> and I'm highly in favour of such a thing IFF it was used to allow XML  
> in HTML5.

Parsing XML 1.0 documents to the correct infoset as well as parsing HTML  
to the infoset required by Web pages is impossible in the same parser.

I suppose I should present proof for this though. Since I cannot think of  
a good way to put it, let's go through some examples.

   Stream:
   <table><input>

   Tree:
   html
    head
    body
     input
     table

   Stream:
   <table><input type="hidden">

   Tree:
   html
    head
    body
     table
      input type="hidden"

   (<input type="hidden"> is a special case)

   Stream:
   <div><x></div><p>

   Tree:
   html
    head
    body
     div
      x
     p

   Stream:
   <div><button></div><p>

   Tree:
   html
    head
    body
     div
      button
       p

   (<button> is scoping)

   Stream:
   </br>

   Tree:
   html
    head
    body
     br

   Stream:
   <image/>

   Tree:
   html
    head
    body
     img

   Stream:
   x</p>x

   Tree:
   html
    head
    body
     "x"
     p
     "x"
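
For contrast, here is a minimal sketch of what the XML 1.0 side does with
the same streams. It assumes Python's standard xml.etree.ElementTree as a
stand-in for any conforming XML 1.0 parser; the helper name xml_root_name
is made up for illustration and is not part of any of the parsers
mentioned in this thread.

```python
import xml.etree.ElementTree as ET

# The seven streams from the examples above.
STREAMS = [
    "<table><input>",
    '<table><input type="hidden">',
    "<div><x></div><p>",
    "<div><button></div><p>",
    "</br>",
    "<image/>",
    "x</p>x",
]


def xml_root_name(stream):
    """Return the root element name if the stream is well-formed XML, else None.

    Hypothetical helper for illustration only.
    """
    try:
        return ET.fromstring(stream).tag
    except ET.ParseError:
        # A conforming XML 1.0 parser must treat this as a fatal error.
        return None


results = {stream: xml_root_name(stream) for stream in STREAMS}

for stream, name in results.items():
    print(repr(stream), "->", name or "not well-formed")
```

Only <image/> is well-formed, and there the XML parser reports an element
named image where the HTML tree above has img. The other six streams are
fatal errors in XML, yet each has a defined tree in HTML.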

Hope that helps. HTML is a crazy format.

You can try this out for yourself here:

   http://livedom.validator.nu/
   http://james.html5.org/parsetree.html

(Two independent implementations of the HTML5 parsing algorithm by the  
way. The first uses Java and the second Python.)

> Absent such a thing, somebody would be forced to use an HTML5
> browser and then an API to extract the XML 1.0 infoset.  It's slightly  
> more palatable with the HTML5 language spec being separate from all the  
> rest of the browser functions, but not as ideal as XML5.

Organization of the specification has nothing to do with this. Since HTML  
syntax and language are intertwined, you will never get the XML 1.0 infoset  
that the document actually represents. (It is also not clear to me why you  
would need an HTML5 browser; an HTML5 parser alone should suffice.)

-- 
Anne van Kesteren
<http://annevankesteren.nl/>
<http://www.opera.com/>

Received on Saturday, 24 January 2009 11:08:25 UTC