Re: Adding a datatype for HTML literals to RDF (ISSUE-63) from Ivan Herman on 2012-05-09 (public-rdf-wg@w3.org from May 2012)

From: Ivan Herman <ivan@w3.org>
Date: Wed, 9 May 2012 12:54:12 +0200
To: Steve Harris <steve.harris@garlik.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, Richard Cyganiak <richard@cyganiak.de>, public-rdf-wg@w3.org
Message-Id: <4A12F978-FE18-403C-82F2-D5C32A8CF716@w3.org>
As Richard emphasized in his mail, XML Literal and, if approved, HTML5 Literals are optional. If implementations do not want to implement equality checking on these literals, that is fine. However, if they _do_ want to do that, then we should define what equality means. That is where the value space issue comes into the picture.

I think the real issue we have to solve, however, is to keep the lexical space as unconstrained as possible. The current XML Literal definition seemed to be very ambiguous in this respect, and it was never 100% clear whether an RDF file in, say, Turtle, should include canonical XML for the literal or not. This led to long discussions among, e.g., RDFa implementers at the time on what _exactly_ should be generated by an RDFa processor. If we say that the lexical space is very lax, and we have a clear definition of equality (whether we define equality via Infosets or DOM tree equality is a detail in this respect), then the situation becomes clear and, because these datatypes are not required, there is no issue with conformance either.
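
To make the distinction concrete, here is a minimal sketch of a lax lexical space combined with tree-based equality. It is an illustration only, not anything the WG has defined: the names html_value, lexical_equal and value_equal are hypothetical, and Python's standard html.parser is used as a stand-in for a real HTML5 parser (it does not implement the HTML5 parsing algorithm, so corner cases will differ).

```python
from html.parser import HTMLParser


class _TreeEvents(HTMLParser):
    """Collect a normalized event stream as a rough stand-in for a DOM tree."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; sorting by name
        # makes the attribute order of the lexical form irrelevant.
        ordered = tuple(sorted(attrs, key=lambda pair: pair[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so chunking does not affect the value.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def html_value(lexical_form):
    """Hypothetical L2V mapping: any string in, a normalized tree out."""
    parser = _TreeEvents()
    parser.feed(lexical_form)
    parser.close()
    return tuple(parser.events)


def lexical_equal(a, b):
    # The "1:1" option: two literals are equal only if the strings match.
    return a == b


def value_equal(a, b):
    # The tree-based option: compare what the markup parses to.
    return html_value(a) == html_value(b)
```

With this sketch, '<b class="x" id="y">Hi</b>' and "<B id='y' class='x'>Hi</B>" are different lexical forms but compare equal by value, which is the behaviour an implementation would get only if it opts into equality checking.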

Ivan



On May 8, 2012, at 19:17 , Steve Harris wrote:

> +1, my guess is that it would mean there are not very many conforming implementations, and an HTML datatype is useful without equality, for store and display, e.g. in a CMS.
> 
> - Steve
> 
> On 3 May 2012, at 02:27, Andy Seaborne wrote:
> 
>> 
>> 
>> On 03/05/12 09:19, Richard Cyganiak wrote:
>>> Hi Andy,
>>> 
>>> It sounds like you'd rather prefer an HTML datatype with a simple 1:1
>>> correspondence between lexical space and value space.
>> 
>> I think that's a viable approach, yes.
>> 
>>> Your objection seems to be that something more complex isn't really
>>> needed. Which might be true, but do you think that something more
>>> complex would actually do any harm, and would be worse?
>> 
>> I'm not objecting.
>> 
>> I'm simply putting forward a case because I felt that the conversation was heading to infoset-value without much consideration of usage.
>> 
>> The primary UC is passing around display fragments.  Better dc:title.
>> 
>> One (implementation) argument is that some systems only have DOM access.
>> Another is that other systems don't have an HTML5 parser at all.
>> 
>> Given experiences of rdf:XMLLiterals, not just the fact they are hard-wired into RDF, it is not obvious, to me at least, that a complex scheme is a good idea.
>> 
>>> And is this preference for a simpler scheme from an implementer's
>>> point of view, or is it from a WG resources/spec complexity point of
>>> view, or something else?
>> 
>> Yes (implementation generally).
>> 
>> If people in the WG want to spend time on infoset-value, that's fine.
>> 
>> 	Andy
>> 
>>> 
>>> Thanks, Richard
>>> 
>>> 
>>> On 2 May 2012, at 21:47, Andy Seaborne wrote:
>>>> On 02/05/12 20:29, Richard Cyganiak wrote:
>>>>> On 2 May 2012, at 19:15, Andy Seaborne wrote:
>>>>>> I think I'm saying, start simple, prove a need for more
>>>>>> complicated.
>>>>>> 
>>>>>> We can define a value space that is all character sequences
>>>>>> (and is disjoint from xsd:string).  Do we need to be more
>>>>>> complicated? What's the use case?
>>>>> 
>>>>> One use case might be RDFa parsers with HTML literal support.
>>>>> 
>>>>> Let's say you have @datatype="rdf:HTMLLiteral" on some element,
>>>>> and the element contains text with markup, and the desire is that
>>>>> the resulting HTML literal contains the text with markup intact.
>>>>> 
>>>>> Now the RDFa parser may not have access to the actual HTML
>>>>> string, but only to a representation that has already been parsed
>>>>> into a DOM tree.
>>>>> 
>>>>> So the parser may have to serialize the DOM into a string, which
>>>>> would probably be different from the original string.
>>>> 
>>>> Certainly something to consider.
>>>> 
>>>> Thought: if the original string isn't available, does it matter?
>>>> Will it be available to anyone else?
>>>> 
>>>>> 
>>>>> (Or is this nonsense and the parser could always just do
>>>>> myDOMElement.innerHTML to get the original HTML?)
>>>> 
>>>> I'm insufficiently up with the tool space to know.  (gavin?)
>>>> 
>>>>> 
>>>>> Anyways, the advantage of having a value space that is isomorphic
>>>>> to the DOM is that you can parse and re-serialize the HTML and
>>>>> still get the same value.
>>>>> 
>>>>>> (Not all RDF systems have access to info set support code now
>>>>>> that we are standardising Turtle and N-triples.)
>>>>> 
>>>>> Yeah and that's why we're trying to change rdf:XMLLiteral to make
>>>>> it optional and to relax its lexical space.
>>>>> 
>>>>> I imagine that rdf:HTMLLiteral would be optional too, and the
>>>>> lexical space should certainly be as unrestrictive as possible.
>>>>> 
>>>>> Only those who want to compare HTML literals, or those who *need*
>>>>> to parse and re-serialize HTML literals, need to care what the
>>>>> value space is. (And yeah, if we can't come up with evidence that
>>>>> some systems need to do one of those, then there's little point
>>>>> in defining anything more complicated than a 1:1 L2V mapping.)
>>>> 
>>>> Comparison may be done in another system - these literals are
>>>> published and ingested by another system that might be asked if two
>>>> literals are the same.  e.g. a reasoner or a SPARQL engine.
>>>> Whether the ability to value-equals two literals with different
>>>> lexical forms is sufficiently important, I can't say.
>>>> 
>>>> I feel that this isn't that likely - HTML5 literals are display
>>>> material to be passed about.  For that, equality processing is
>>>> unlikely, and the fragments go in and come out on some generated
>>>> HTML.
>>>> 
>>>> Andy
>>>> 
>>>> 
>>>>> 
>>>>> Best, Richard
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Andy
>>>>>> 
>>>>>>> 
>>>>>>> Ivan
>>>>>>> 
>>>>>>>> Best, Richard
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> And I guess in theory, DOMs and XML Infosets should be
>>>>>>>>>> isomorphic, no?
>>>>>>>>> 
>>>>>>>>> In theory:-) To be checked. There may be corner cases.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Between all these transformations, there should be
>>>>>>>>>> something that works for us. The devil is in the
>>>>>>>>>> details of course.
>>>>>>>>> 
>>>>>>>>> Exactly...
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Or we could just avoid all of that trouble and simply
>>>>>>>>>> define the value space of the HTML datatype as
>>>>>>>>>> identical to the lexical space.
>>>>>>>>> 
>>>>>>>>> And then we are back to the same issue as we had with
>>>>>>>>> XML Literals. Except that... there is no such thing as a
>>>>>>>>> formal canonical HTML5
>>>>>>>>> 
>>>>>>>>> Ivan
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best, Richard
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Just some food for thought...
>>>>>>>>>>> 
>>>>>>>>>>> Ivan
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 1, 2012, at 18:41 , Gavin Carothers wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 1, 2012 at 6:46 AM, Richard
>>>>>>>>>>>> Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>>>> All,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The 2004 WG worked under the assumption that the
>>>>>>>>>>>>> future of HTML was XHTML, and that the use case
>>>>>>>>>>>>> of shipping HTML markup fragments as RDF payloads
>>>>>>>>>>>>> would be addressed by rdf:XMLLiteral. But in
>>>>>>>>>>>>> 2012, shipping HTML fragments really means HTML5.
>>>>>>>>>>>>> Is rdf:XMLLiteral still adequate for this task?
>>>>>>>>>>>>> Is a new datatype with a lexical space consisting
>>>>>>>>>>>>> of HTML5 fragments needed? This question is
>>>>>>>>>>>>> ISSUE-63.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it would be useful to have a straw poll
>>>>>>>>>>>>> sometime soon on this question:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PROPOSAL: RDF-WG will work on an HTML datatype
>>>>>>>>>>>>> that would be defined in RDF Concepts.
>>>>>>>>>>>> 
>>>>>>>>>>>> +1, and for internationalization should be a
>>>>>>>>>>>> required datatype, might also have a simple syntax
>>>>>>>>>>>> in Turtle (though would likely require a new last
>>>>>>>>>>>> call but a Web format that doesn't understand
>>>>>>>>>>>> HTML doesn't seem like much of a web format)
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If there is general support for this, then we
>>>>>>>>>>>>> could start work on the details of the datatype
>>>>>>>>>>>>> definition (lexical space, value space, L2V
>>>>>>>>>>>>> mapping and so on).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All the best, Richard
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ----
>>>>>>>>>>> Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>>>>>> Home: http://www.w3.org/People/Ivan/
>>>>>>>>>>> mobile: +31-641044153
>>>>>>>>>>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> -- 
> Steve Harris, CTO
> Garlik, a part of Experian 
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 653331 VAT # 887 1335 93
> Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
> 
> 


----
Ivan Herman, W3C Semantic Web Activity Lead
Home: http://www.w3.org/People/Ivan/
mobile: +31-641044153
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Wednesday, 9 May 2012 10:51:23 UTC