Re: Adding a datatype for HTML literals to RDF (ISSUE-63) from Ivan Herman on 2012-05-03 (public-rdf-wg@w3.org from May 2012)

From: Ivan Herman <ivan@w3.org>
Date: Thu, 3 May 2012 10:57:35 +0200
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <E0D1CACC-4D7E-430C-89C8-A75D45DA731D@w3.org>
On May 3, 2012, at 10:51 , Richard Cyganiak wrote:

> On 3 May 2012, at 09:39, Ivan Herman wrote:
>>> 8.2.7 Coercing an HTML DOM into an infoset
>>> http://www.w3.org/TR/html5/the-end.html#coercing-an-html-dom-into-an-infoset
>> 
>> I was looking at this section yesterday afternoon, and I am not convinced, unfortunately, that this is the right tool. At least it is not clear.
>> 
>> What the section says is:
>> 
>> "When an application uses an HTML parser in conjunction with an XML pipeline, it is possible that the constructed DOM is not compatible with the XML tool chain in certain subtle ways."
> 
> And then it says: “This section specifies some rules for handling these issues.”
> 
> The section says how to turn the DOM produced by an HTML5 parser into an XML-compatible DOM. And I think we can take for granted that there's a 1:1 correspondence between XML DOM trees and XML infosets. (Well, we should check that anyways, but that should Just Work, no?)
> 

I would expect/hope so:-)

I will ask Liam Quin. If there is anybody who knows, he should be the one:-)

Ivan


> Best,
> Richard
> 
> 
>> 
>> and it then goes on, in the section, to describe the possible differences. Formally, it is _not_ a definition of the HTML5 infoset:-(
>> 
>> Ivan
>> 
>> 
>>> 
>>> So, no need to make the DOM -> XHTML5 -> XML Infoset detour.
>>> 
>>> Regarding innerHTML implementations, see below.
>>> 
>>> On 3 May 2012, at 08:36, Ivan Herman wrote:
>>>> Note that if we follow the official HTML5 algorithm in defining a value space, then what would happen is that we issue an HTML5 literal that is different in lexical space but identical in value space. Which is, sort of, all right.
>>> 
>>> Yeah, it's actually the same situation as with our efforts to relax the lexical space of rdf:XMLLiteral.
>>> 
>>>>> (Or is this nonsense and the parser could always just do myDOMElement.innerHTML to get the original HTML?)
>>>> 
>>>> I am not sure whether this is available in all HTML5 parsers. I do not see it in the Python html5lib parser that I use, for example (but I may have missed it).
>>> 
>>> Looking through the various HTML specs, it looks like:
>>> 
>>> • HTML5 defines parsing into an HTML DOM
>>> • HTML DOM implementations MUST support innerHTML and outerHTML
>>> • Both of these are defined in a separate spec [1]
>>> • That spec invokes the serialization algorithm defined in the HTML spec
>>> 
>>> So the result is that innerHTML and outerHTML should always be available, but will *not* produce the original HTML string, so it doesn't make much of a difference. Still, this should be good enough.
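
A minimal sketch of the round trip described above: parse, serialize, and parse again. It assumes Python's standard-library html.parser as a stand-in for a real HTML5 parser (it does not implement the HTML5 tree-construction algorithm), and the tuple-based tree, the VOID subset, and the parse/serialize helpers are hypothetical names for illustration only. The serialized string differs from the input, yet re-parsing it gives back the same tree.

```python
from html import escape
from html.parser import HTMLParser

# Assumption: a small subset of HTML void elements, enough for the sketch.
VOID = {"br", "hr", "img", "input", "meta", "link"}


class TreeBuilder(HTMLParser):
    """Builds a nested (tag, attrs, children) tree from an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ("#fragment", (), [])
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = (tag, tuple(attrs), [])
        self.stack[-1][2].append(node)
        if tag not in VOID:          # void elements never get children
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Only close the element that is currently open; ignore stray end tags.
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()

    def handle_data(self, data):
        self.stack[-1][2].append(data)


def parse(fragment):
    builder = TreeBuilder()
    builder.feed(fragment)
    builder.close()
    return builder.root


def serialize(node):
    """Toy serializer, loosely in the spirit of innerHTML (not the spec algorithm)."""
    if isinstance(node, str):
        return escape(node, quote=False)
    tag, attrs, children = node
    inner = "".join(serialize(child) for child in children)
    if tag == "#fragment":
        return inner
    attr_text = "".join(
        ' {}="{}"'.format(name, escape(value or "", quote=True))
        for name, value in attrs
    )
    if tag in VOID:
        return "<{}{}>".format(tag, attr_text)
    return "<{}{}>{}</{}>".format(tag, attr_text, inner, tag)


original = "<P CLASS=note>a &amp; b<BR/>c</P>"
tree = parse(original)
round_tripped = serialize(tree)
print(round_tripped != original)      # lexical forms differ
print(parse(round_tripped) == tree)   # parsed trees are equal
```

A conforming HTML5 parser and the serialization algorithm from the HTML spec would replace both helpers in practice; the sketch only shows the shape of the argument.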
>>> 
>>> (There's a funny mess here where the W3C version of HTML5 normatively links to a WHATWG document [1] which then normatively references back into the WHATWG version of HTML5.)
>>> 
>>> Best,
>>> Richard
>>> 
>>> 
>>> [1] http://html5.org/specs/dom-parsing.html
>>> 
>>>> 
>>>> 
>>>>> Anyways, the advantage of having a value space that is isomorphic to the DOM is that you can parse and re-serialize the HTML and still get the same value.
>>>>> 
>>>> 
>>>> Yes, see above.
>>>> 
>>>> 
>>>>>> (Not all RDF systems have access to info set support code now that we are standardising Turtle and N-triples.)
>>>>> 
>>>>> Yeah and that's why we're trying to change rdf:XMLLiteral to make it optional and to relax its lexical space.
>>>>> 
>>>>> I imagine that rdf:HTMLLiteral would be optional too, and the lexical space should certainly be as unrestrictive as possible.
>>>>> 
>>>>> Only those who want to compare HTML literals, or those who *need* to parse and re-serialize HTML literals, need to care what the value space is. (And yeah, if we can't come up with evidence that some systems need to do one of those, then there's little point in defining anything more complicated than a 1:1 L2V mapping.)
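
To make the two options concrete, here is a hedged sketch that compares literals under an identity L2V mapping and under a parse-based one. It assumes Python's standard-library html.parser as a rough approximation of an HTML parser (not the HTML5 algorithm), and l2v_identity, l2v_parsed, and same_value are hypothetical names, not anything defined by the WG.

```python
from html.parser import HTMLParser


class EventCollector(HTMLParser):
    """Records a flat sequence of parse events for an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, tuple(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("text", data))


def l2v_identity(lexical):
    """1:1 mapping: the value is the lexical form itself."""
    return lexical


def l2v_parsed(lexical):
    """Parse-based mapping: the value is the sequence of parse events."""
    collector = EventCollector()
    collector.feed(lexical)
    collector.close()
    return tuple(collector.events)


def same_value(a, b, l2v):
    """Two literals denote the same value if their lexical forms map to equal values."""
    return l2v(a) == l2v(b)


a = "<em>Hello</em>"
b = "<EM>Hello</EM>"
print(same_value(a, b, l2v_identity))  # False: the strings differ
print(same_value(a, b, l2v_parsed))    # True: tag names are case-insensitive in HTML
```

Under the identity mapping nobody has to parse anything, which is the point about systems that never compare or re-serialize HTML literals; the parse-based mapping only pays off for those that do.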
>>>>> 
>>>>> Best,
>>>>> Richard
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> 	Andy
>>>>>> 
>>>>>>> 
>>>>>>> Ivan
>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Richard
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> And I guess in theory, DOMs and XML Infosets should be isomorphic, no?
>>>>>>>>> 
>>>>>>>>> In theory:-) To be checked. There may be corner cases.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Between all these transformations, there should be something that works for us. The devil is in the details of course.
>>>>>>>>> 
>>>>>>>>> Exactly...
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Or we could just avoid all of that trouble and simply define the value space of the HTML datatype as identical to the lexical space.
>>>>>>>>> 
>>>>>>>>> And then we are back to the same issue as we had with XML Literals. Except that... there is no such thing as a formal canonical HTML5.
>>>>>>>>> 
>>>>>>>>> Ivan
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best,
>>>>>>>>>> Richard
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Just some food for thought...
>>>>>>>>>>> 
>>>>>>>>>>> Ivan
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 1, 2012, at 18:41 , Gavin Carothers wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 1, 2012 at 6:46 AM, Richard Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>>>> All,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The 2004 WG worked under the assumption that the future of HTML was XHTML, and that the use case of shipping HTML markup fragments as RDF payloads would be addressed by rdf:XMLLiteral. But in 2012, shipping HTML fragments really means HTML5. Is rdf:XMLLiteral still adequate for this task? Is a new datatype with a lexical space consisting of HTML5 fragments needed? This question is ISSUE-63.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it would be useful to have a straw poll sometime soon on this question:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PROPOSAL: RDF-WG will work on an HTML datatype that would be defined in RDF Concepts.
>>>>>>>>>>>> 
>>>>>>>>>>>> +1, and for internationalization it should be a required datatype. It might
>>>>>>>>>>>> also have a simple syntax in Turtle (though that would likely require a new
>>>>>>>>>>>> last call, but a Web format that doesn't understand HTML doesn't
>>>>>>>>>>>> seem like much of a web format)
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If there is general support for this, then we could start work on the details of the datatype definition (lexical space, value space, L2V mapping and so on).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All the best,
>>>>>>>>>>>>> Richard
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----
>>>>>>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>>>>>> mobile: +31-641044153
>>>>>>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>> 
>>> 
>> 
>> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Thursday, 3 May 2012 08:55:01 UTC