Re: Adding a datatype for HTML literals to RDF (ISSUE-63) from Richard Cyganiak on 2012-05-08 (public-rdf-wg@w3.org from May 2012)

From: Richard Cyganiak <richard@cyganiak.de>
Date: Tue, 8 May 2012 23:12:54 +0100
To: Steve Harris <steve.harris@garlik.com>
Cc: Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <C4617D70-205C-4197-9F43-0B7B16D69CE1@cyganiak.de>
On 8 May 2012, at 18:17, Steve Harris wrote:
> +1, my guess is that it would mean there are not very many conforming implementations, and an HTML datatype is useful without equality, for store and display, e.g. in a CMS.

There is no proposal on the table to make this a required datatype.

As long as the datatype is optional, implementations may decide not to implement it. Such implementations would not be able to compare HTML literals for equality, but would still be able to query, store and display them.
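To illustrate the point above, here is a minimal sketch of an implementation that does not recognise the datatype. It is an illustration only, not a description of any real system: the `Literal` class, `term_equal` and the datatype IRI are assumptions (the IRI is just the "rdf:HTMLLiteral" name used in this thread expanded into the RDF namespace; no name has been agreed). Such an implementation can store the literals and hand them back for display, but can only compare them term by term.

```python
from dataclasses import dataclass

# Placeholder IRI based on the "rdf:HTMLLiteral" name used in this thread;
# the actual name of the datatype is an assumption.
HTML_DT = "http://www.w3.org/1999/02/22-rdf-syntax-ns#HTMLLiteral"


@dataclass(frozen=True)
class Literal:
    """An RDF literal as an implementation without datatype support sees it."""
    lexical: str
    datatype: str


def term_equal(a: Literal, b: Literal) -> bool:
    # Without knowledge of the value space, the only comparison available
    # is on the lexical form and the datatype IRI.
    return a.datatype == b.datatype and a.lexical == b.lexical


# Storing, querying by datatype, and displaying all work without any HTML parsing.
a = Literal("<b>Hello</b>", HTML_DT)
b = Literal("<B>Hello</B>", HTML_DT)
store = {a, b}
to_display = [lit.lexical for lit in store if lit.datatype == HTML_DT]
```

Here `a` and `b` stay two distinct terms in the store; whether they denote the same value is a question this implementation simply cannot answer.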

The main benefit of defining a complex value space is that it would help HTML5+RDFa parsers that only have access to the DOM by giving them license to re-serialize the DOM to a different HTML snippet.

The “simpler” proposal (requiring the lexical form to remain exactly as-is throughout all processing) is actually stricter.
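As a rough sketch of the difference between the two proposals, the following compares two snippets first by lexical form and then by a parsed representation. Everything here is an assumption for illustration: `html_value` is a hypothetical helper, and Python's `html.parser` is only a tokenizer standing in for a real HTML5 parser and DOM, so the "value" below is a simplified event sequence rather than the value space anyone has proposed.

```python
from html.parser import HTMLParser


class _Events(HTMLParser):
    """Collects a normalised sequence of parse events for a fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # html.parser lowercases tag and attribute names; sorting by name
        # makes attribute order irrelevant.
        ordered = tuple(sorted(attrs, key=lambda kv: kv[0]))
        self.events.append(("start", tag, ordered))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        # Merge adjacent text chunks so chunking does not affect the result.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + data)
        else:
            self.events.append(("text", data))


def html_value(lexical: str) -> tuple:
    """Hypothetical stand-in for a lexical-to-value mapping."""
    parser = _Events()
    parser.feed(lexical)
    parser.close()
    return tuple(parser.events)


# What the author wrote, and what a DOM-only RDFa parser might re-serialize.
original = '<p class="x" id="y">Hi <B>there</B></p>'
reserialized = "<p id=y class=x>Hi <b>there</b></p>"

same_lexical_form = original == reserialized
same_value = html_value(original) == html_value(reserialized)
```

Under the 1:1 proposal only `same_lexical_form` counts, so the DOM-only parser in this sketch would have produced a different literal. Under a DOM-like value space the two snippets denote the same value, which is the license to re-serialize mentioned above.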

Best,
Richard




> 
> - Steve
> 
> On 3 May 2012, at 02:27, Andy Seaborne wrote:
> 
>> 
>> 
>> On 03/05/12 09:19, Richard Cyganiak wrote:
>>> Hi Andy,
>>> 
>>> It sounds like you'd rather prefer an HTML datatype with a simple 1:1
>>> correspondence between lexical space and value space.
>> 
>> I think that's a viable approach, yes.
>> 
>>> Your objection seems to be that something more complex isn't really
>>> needed. Which might be true, but do you think that something more
>>> complex would actually do any harm, and would be worse?
>> 
>> I'm not objecting.
>> 
>> I'm simply putting forward a case because I felt that the conversation was heading to infoset-value without much consideration of usage.
>> 
>> The primary UC is passing around display fragments.  Better dc:title.
>> 
>> One (implementation) argument is that some systems only have DOM access.
>> Another is that other systems don't have an HTML5 parser at all.
>> 
>> Given experiences of rdf:XMLLiterals, not just the fact they are hard-wired into RDF, it is not obvious, to me at least, that a complex scheme is a good idea.
>> 
>>> And is this preference for a simpler scheme from an implementer's
>>> point of view, or is it from a WG resources/spec complexity point of
>>> view, or something else?
>> 
>> Yes (implementation generally).
>> 
>> If people in the WG want to spend time on infoset-value, that's fine.
>> 
>> 	Andy
>> 
>>> 
>>> Thanks, Richard
>>> 
>>> 
>>> On 2 May 2012, at 21:47, Andy Seaborne wrote:
>>>> On 02/05/12 20:29, Richard Cyganiak wrote:
>>>>> On 2 May 2012, at 19:15, Andy Seaborne wrote:
>>>>>> I think I'm saying, start simple, prove a need for more
>>>>>> complicated.
>>>>>> 
>>>>>> We can define a value space that is all character sequences
>>>>>> (and is disjoint from xsd:string).  Do we need to be more
>>>>>> complicated? What's the use case?
>>>>> 
>>>>> One use case might be RDFa parsers with HTML literal support.
>>>>> 
>>>>> Let's say you have @datatype="rdf:HTMLLiteral" on some element,
>>>>> and the element contains text with markup, and the desire is that
>>>>> the resulting HTML literal contains the text with markup intact.
>>>>> 
>>>>> Now the RDFa parser may not have access to the actual HTML
>>>>> string, but only to a representation that has already been parsed
>>>>> into a DOM tree.
>>>>> 
>>>>> So the parser may have to serialize the DOM into a string, which
>>>>> would probably be different from the original string.
>>>> 
>>>> Certainly something to consider.
>>>> 
>>>> Thought: if the original string isn't available, does it matter?
>>>> Will it be available to anyone else?
>>>> 
>>>>> 
>>>>> (Or is this nonsense and the parser could always just do
>>>>> myDOMElement.innerHTML to get the original HTML?)
>>>> 
>>>> I'm insufficiently up with the tool space to know.  (Gavin?)
>>>> 
>>>>> 
>>>>> Anyways, the advantage of having a value space that is isomorphic
>>>>> to the DOM is that you can parse and re-serialize the HTML and
>>>>> still get the same value.
>>>>> 
>>>>>> (Not all RDF systems have access to info set support code now
>>>>>> that we are standardising Turtle and N-triples.)
>>>>> 
>>>>> Yeah and that's why we're trying to change rdf:XMLLiteral to make
>>>>> it optional and to relax its lexical space.
>>>>> 
>>>>> I imagine that rdf:HTMLLiteral would be optional too, and the
>>>>> lexical space should certainly be as unrestrictive as possible.
>>>>> 
>>>>> Only those who want to compare HTML literals, or those who *need*
>>>>> to parse and re-serialize HTML literals, need to care what the
>>>>> value space is. (And yeah, if we can't come up with evidence that
>>>>> some systems need to do one of those, then there's little point
>>>>> in defining anything more complicated than a 1:1 L2V mapping.)
>>>> 
>>>> Comparison may be done in another system - these literals are
>>>> published and ingested by another system that might be asked if two
>>>> literals are the same.  e.g. a reasoner or a SPARQL engine.
>>>> Whether the ability to value-equals two literals with different
>>>> lexical forms is sufficiently important, I can't say.
>>>> 
>>>> I feel that this isn't that likely - HTML5 literals are display
>>>> material to be passed about.  For that, equality processing is
>>>> unlikely, and the fragments go in and come out on some generated
>>>> HTML.
>>>> 
>>>> Andy
>>>> 
>>>> 
>>>>> 
>>>>> Best, Richard
>>>>> 
>>>>> 
>>>>> 
>>>>>> 
>>>>>> Andy
>>>>>> 
>>>>>>> 
>>>>>>> Ivan
>>>>>>> 
>>>>>>>> Best, Richard
>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>>>>> And I guess in theory, DOMs and XML Infosets should be
>>>>>>>>>> isomorphic, no?
>>>>>>>>> 
>>>>>>>>> In theory:-) To be checked. There may be corner cases.
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Between all these transformations, there should be
>>>>>>>>>> something that works for us. The devil is in the
>>>>>>>>>> details of course.
>>>>>>>>> 
>>>>>>>>> Exactly...
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Or we could just avoid all of that trouble and simply
>>>>>>>>>> define the value space of the HTML datatype as
>>>>>>>>>> identical to the lexical space.
>>>>>>>>> 
>>>>>>>>> And then we are back to the same issue as we had with
>>>>>>>>> XML Literals. Except that... there is no such thing as a
>>>>>>>>> formal canonical HTML5.
>>>>>>>>> 
>>>>>>>>> Ivan
>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> Best, Richard
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> Just some food for thought...
>>>>>>>>>>> 
>>>>>>>>>>> Ivan
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> On May 1, 2012, at 18:41 , Gavin Carothers wrote:
>>>>>>>>>>> 
>>>>>>>>>>>> On Tue, May 1, 2012 at 6:46 AM, Richard
>>>>>>>>>>>> Cyganiak <richard@cyganiak.de> wrote:
>>>>>>>>>>>>> All,
>>>>>>>>>>>>> 
>>>>>>>>>>>>> The 2004 WG worked under the assumption that the
>>>>>>>>>>>>> future of HTML was XHTML, and that the use case
>>>>>>>>>>>>> of shipping HTML markup fragments as RDF payloads
>>>>>>>>>>>>> would be addressed by rdf:XMLLiteral. But in
>>>>>>>>>>>>> 2012, shipping HTML fragments really means HTML5.
>>>>>>>>>>>>> Is rdf:XMLLiteral still adequate for this task?
>>>>>>>>>>>>> Is a new datatype with a lexical space consisting
>>>>>>>>>>>>> of HTML5 fragments needed? This question is
>>>>>>>>>>>>> ISSUE-63.
>>>>>>>>>>>>> 
>>>>>>>>>>>>> I think it would be useful to have a straw poll
>>>>>>>>>>>>> sometime soon on this question:
>>>>>>>>>>>>> 
>>>>>>>>>>>>> PROPOSAL: RDF-WG will work on an HTML datatype
>>>>>>>>>>>>> that would be defined in RDF Concepts.
>>>>>>>>>>>> 
>>>>>>>>>>>> +1, and for internationalization it should be a
>>>>>>>>>>>> required datatype. It might also have a simple syntax
>>>>>>>>>>>> in Turtle (though that would likely require a new last
>>>>>>>>>>>> call, but a Web format that doesn't understand
>>>>>>>>>>>> HTML doesn't seem like much of a web format).
>>>>>>>>>>>> 
>>>>>>>>>>>>> 
>>>>>>>>>>>>> If there is general support for this, then we
>>>>>>>>>>>>> could start work on the details of the datatype
>>>>>>>>>>>>> definition (lexical space, value space, L2V
>>>>>>>>>>>>> mapping and so on).
>>>>>>>>>>>>> 
>>>>>>>>>>>>> All the best, Richard
>>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> ---- Ivan Herman, W3C Semantic Web Activity Lead
>>>>>>>>>>> Home: http://www.w3.org/People/Ivan/ mobile:
>>>>>>>>>>> +31-641044153 FOAF:
>>>>>>>>>>> http://www.ivan-herman.net/foaf.rdf
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
> -- 
> Steve Harris, CTO
> Garlik, a part of Experian 
> 1-3 Halford Road, Richmond, TW10 6AW, UK
> +44 20 8439 8203  http://www.garlik.com/
> Registered in England and Wales 653331 VAT # 887 1335 93
> Registered office: Landmark House, Experian Way, NG2 Business Park, Nottingham, Nottinghamshire, England NG80 1ZZ
> 
>
Received on Tuesday, 8 May 2012 22:13:26 UTC