Re: question about XML and HTML5 from Anne van Kesteren on 2009-06-17 (www-archive@w3.org from June 2009)

From: Anne van Kesteren <annevk@opera.com>
Date: Wed, 17 Jun 2009 15:05:29 +0200
To: "Jonathan Rees" <jar@creativecommons.org>
Cc: "Dan Connolly" <connolly@w3.org>, "Henry S. Thompson" <ht@inf.ed.ac.uk>, www-archive@w3.org
Message-ID: <op.uvn6nfc564w2qv@annevk-t60>

On Wed, 17 Jun 2009 14:54:54 +0200, Jonathan Rees <jar@creativecommons.org> wrote:
> On Wed, Jun 17, 2009 at 7:51 AM, Anne van Kesteren<annevk@opera.com>  
> wrote:
>> On Wed, 17 Jun 2009 13:47:12 +0200, Jonathan Rees  
>> <jar@creativecommons.org> wrote:
>>> I don't see how your answer or the linked documents bear on my
>>> question, so let me amplify.
>>>
>>> The ideal situation:  you can take any HTML5 document, convert it to
>>> some XML-based language designed for the purpose (not necessarily
>>> XHTML), convert it back, and get a semantically equivalent HTML5
>>> document.
>>
>> The parser of the HTML syntax is Turing-complete so that will not work.  
>> (You can inject characters into the tokenizer.)
>
> COBOL is also Turing-complete, so I guess I could use that.

That does not give you XML though :-) On IRC it was suggested you could wrap the HTML5 document in one big CDATA section, which would theoretically do what you want but would probably not be very useful.
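
A minimal sketch of that CDATA idea, in Python, purely for illustration. The element name `html-source` and both function names are hypothetical, not part of any specification. It assumes the HTML text contains only characters that XML 1.0 allows and uses LF line endings, since an XML parser normalizes CR and CRLF and rejects most control characters.

```python
import xml.dom.minidom


def wrap_html_in_xml(html_text: str) -> str:
    """Wrap HTML source text in a single XML element as CDATA (hypothetical format)."""
    # "]]>" cannot appear inside a CDATA section, so split each occurrence
    # across two adjacent CDATA sections.
    escaped = html_text.replace("]]>", "]]]]><![CDATA[>")
    return "<html-source><![CDATA[" + escaped + "]]></html-source>"


def unwrap_html_from_xml(xml_text: str) -> str:
    """Recover the original HTML source text from the wrapper document."""
    doc = xml.dom.minidom.parseString(xml_text)
    # The root element holds only character data; concatenate every piece.
    return "".join(node.data for node in doc.documentElement.childNodes)
```

The round trip preserves the markup exactly, but the XML side sees only an opaque string, which is why this is not very useful in practice.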

If you ignore document.write(), doing what is suggested in the links I provided earlier (especially how to map an HTML byte stream to an XML DOM) will get you quite close. I suppose you could also ignore script execution altogether; combined with creating an infoset out of an HTML byte stream, that might get you pretty far too, but I haven't thought about that in detail.
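
As a rough illustration of turning an HTML byte stream into an XML tree without executing any script, here is a sketch using only the Python standard library. Note that `html.parser` is a lenient tokenizer, not the HTML5 parsing algorithm, so its error recovery differs from what the specification defines. The function name `html_bytes_to_xml`, the `root` wrapper element, and the UTF-8 default are assumptions made for this example; real HTML5 encoding detection is more involved, and attribute or element names that are not valid XML names are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have content or an end tag in HTML.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
                 "link", "meta", "source", "track", "wbr"}


class _TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")  # hypothetical wrapper element
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        # Valueless attributes (e.g. "disabled") become empty strings.
        attrib = {name: (value or "") for name, value in attrs}
        element = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID_ELEMENTS:
            self.stack.append(element)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, implicitly closing
        # anything opened inside it; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                return

    def handle_data(self, data):
        # Script content arrives here as plain text and is never executed.
        parent = self.stack[-1]
        if len(parent):
            parent[-1].tail = (parent[-1].tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_bytes_to_xml(data: bytes, encoding: str = "utf-8") -> str:
    """Parse HTML bytes leniently and serialize the result as well-formed XML."""
    builder = _TreeBuilder()
    builder.feed(data.decode(encoding, errors="replace"))
    builder.close()
    return ET.tostring(builder.root, encoding="unicode")
```

Because nothing is executed, a `document.write()` call is kept as inert text inside the `script` element rather than feeding characters back into the tokenizer, which is exactly the part of the round trip this approach gives up.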

>> If 'tidy' is good enough and you consider it working I do not see why  
>> that would not work for HTML5.
>
> Because HTML5 is so different from HTML4, I have no reason to think it
> would work. I'm not even sure tidy works for HTML4. And it is not as
> well specified as OWL/XML or XQuery/XML as far as I know.

I thought 'tidy' dealt with "tag soup" input and tried to make something out of it. In that respect I would not classify HTML5 as "so different" from HTML4 :-)

I do agree that 'tidy' is not well specified, but HTML5 is, and it has a way to get to XML and back. (And this is implemented as well.)

> The spirit of my question was not combative, but rather a request to
> some people I trust to supply me with reliable information. I think
> they understand the background of my question and will probably
> understand where I am going with this.
>
> The www-archive list is described as follows: "Miscellaneous.
> Mail-to-web gateway."  I was using it in the latter capacity, as I
> have seen others do. Sorry if my message was construed otherwise. If
> you are interested in pursuing this I think the discussion should be
> moved elsewhere.

I was just interested in trying to help you out. Due to lack of context I probably misunderstood what you wanted (or maybe not :-).

-- 
Anne van Kesteren
http://annevankesteren.nl/

Received on Wednesday, 17 June 2009 13:06:12 UTC