Re: [RDFa] Default datatype should be a string from Ivan Herman on 2007-02-05 (public-rdf-in-xhtml-tf@w3.org from February 2007)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 05 Feb 2007 10:01:40 +0100
To: mark.birbeck@x-port.net
CC: public-rdf-in-xhtml-tf@w3.org
Message-ID: <45C6F274.80204@w3.org>
Mark,

I think there is one consideration really missing from your argument
(and it is what made me a proponent of the plain literal solution): RDF
information retrieved from RDFa should be easy to mash up with RDF from
other sources. And that is what becomes a problem. The example I give in

http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2006Nov/0000

is real: the current setup with XMLLiteral made it impossible to mash up
the FOAF statements from an RDFa-marked page with a FOAF file simply
edited by hand. (I.e., I had to change my demonstration back then to a
more complicated solution!) The default in the case of RDF/XML and
Turtle is plain literal; we should not depart from that.
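
A minimal sketch of the problem, in Python. It does not use a real RDF
library: the tuple encoding of literals, the join_on_object helper and
the example.org URIs are all assumptions made for illustration. The
point is only that a plain literal and an XMLLiteral with the same
characters are different RDF terms, so a join on the object finds
nothing.

```python
# Toy model, not a real RDF library: a literal is (lexical form, datatype).
# A datatype of None stands for a plain literal.
RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

# What a hand-edited FOAF file (RDF/XML or Turtle) would typically yield.
hand_edited = [
    ("http://example.org/foaf.rdf#me", FOAF_NAME, ("Ivan Herman", None)),
]

# What an RDFa page yields when XMLLiteral is the default datatype.
from_rdfa = [
    ("http://example.org/page.html#me", FOAF_NAME, ("Ivan Herman", RDF_XMLLITERAL)),
]


def join_on_object(graph_a, graph_b, predicate):
    """Return subject pairs whose `predicate` objects are the same RDF term."""
    return [
        (s_a, s_b)
        for (s_a, p_a, o_a) in graph_a
        for (s_b, p_b, o_b) in graph_b
        if p_a == predicate and p_b == predicate and o_a == o_b
    ]


# With the XMLLiteral default the two graphs do not join.
matches_xml_default = join_on_object(hand_edited, from_rdfa, FOAF_NAME)

# The same RDFa page, re-read with plain literal as the default, does join.
from_rdfa_plain = [(s, p, (lex, None)) for (s, p, (lex, _dt)) in from_rdfa]
matches_plain_default = join_on_object(hand_edited, from_rdfa_plain, FOAF_NAME)
```

Here matches_xml_default is empty, while matches_plain_default contains
the one expected pair of subjects.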

I played with the idea of saying "let it be plain literal if no HTML
tags are present, and XMLLiteral otherwise". I.e., your RDFa example
would indeed lead to <> dc:title "RDFa Primer", whereas your Einstein
example would lead to an XMLLiteral. But this seems quite error-prone
to me.
Imagine that one makes up a page with

<div about="">
 <span property="bla">this is boring</span>
</div>

which would lead to

    <> bla "this is boring"

but then, *later*, changes the text by saying

<div about="">
 <span property="bla">this <i>is</i> boring</span>
</div>

which then suddenly leads to

    <> bla "this <i>is</i> boring"^^XMLLiteral

Although the intention was clearly to change the text from a
presentation point of view only (that is why I used <i> and not <em>),
the triple becomes very different, and previously working queries might
suddenly fail.
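
The flip can be sketched in Python as follows. This is only an
illustration of the hybrid rule described above, not a real RDFa
parser: the hybrid_object function and the tuple encoding of literals
are assumptions, and the serialization is plain ElementTree output, not
the canonical XML that rdf:XMLLiteral formally requires.

```python
import xml.etree.ElementTree as ET

XML_LITERAL = "rdf:XMLLiteral"


def hybrid_object(element_xml):
    """Apply the hybrid rule to one element carrying @property.

    Returns (lexical form, datatype); a datatype of None means plain literal.
    """
    elem = ET.fromstring(element_xml)
    if len(elem) == 0:
        # No child elements: text only, so emit a plain literal.
        return (elem.text or "", None)
    # Child elements present: serialize the inner content as XML.
    # ET.tostring() on a child also includes the text that follows it.
    inner = (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )
    return (inner, XML_LITERAL)


before = hybrid_object('<span property="bla">this is boring</span>')
after = hybrid_object('<span property="bla">this <i>is</i> boring</span>')
```

The first call gives a plain literal and the second an XMLLiteral, so a
query written against the first object no longer matches after a purely
presentational edit.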

As far as I am concerned, the practical issues raised by the XMLLiteral
approach clearly outweigh other considerations...

My two pence...

Ivan


Mark Birbeck wrote:
> 
> Hello all,
> 
> I don't think the issue has been understood correctly, so I'll
> re-construct the thought processes I went through when working through
> the use of rdf:XMLLiteral, way, way back, in an early draft. I'm not
> at all suggesting that my solution is beyond dispute :), but if people
> want to change it, I think we all need to understand what the problems
> were that we were originally trying to address.
> 
> The issue is not about which of "XML mark-up" or "strings" is the most
> common situation; I take it as given that 'strings' will be more
> common. :) The issue is essentially whether there is any need to
> distinguish between them, and if there isn't, whether we can use that
> fact to our advantage to make RDFa easier to author. I think it's
> important to state the problem this way round, since if there *is* a
> problem with always using XMLLiteral, then of course we can't do what
> I originally proposed!
> 
> 
> PLAIN VERSUS TYPED LITERALS
> 
> To set the context, the first thing to remember is that in RDF,
> xsd:string is *not* the default datatype. In RDF, plain literals do
> not have *any* datatype. The word 'string' is being used very loosely
> in this discussion--from the subject line of the thread to comments
> added--and we need to be clear on what is being proposed.
> 
> At first sight the lack of any type seems fine; after all, why should
> we worry that this:
> 
>  <div about="">
>    <h1 property="dc:title">RDFa Primer</h1>
>  </div>
> 
> can produce this:
> 
>  <> dc:title "RDFa Primer" .
> 
> But unfortunately we _do_ need to worry. If we take, for example,
> Einstein's famous 1946 article on nuclear weapons, we would obviously
> mark it up as follows:
> 
>  <div about="">
>    <h1 property="dc:title">
>      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>    </h1>
>  </div>
> 
> We have to ask what would we *like* this mark-up to generate, and I
> think it's clear we'd want this:
> 
>  <>
>    dc:title
>    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
>    .
> 
> But of course this is the crux of the problem; our preference for the
> first example was a plain literal, but our preference for the second
> was an XML literal, so we must now ask what it is that could 'trigger'
> this difference in parsing behaviour.
> 
> 
> PROPOSAL 1: ALL TEXT IS PLAIN LITERAL
> 
> The first option is to say that actually there is no trigger, and that
> _all_ text should be treated as a plain literal unless the author says
> otherwise. So our example would produce this:
> 
>  <>
>    dc:title
>    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
>    .
> 
> To create our original triples, the author would make use of
> @datatype, and write this:
> 
>  <div about="">
>    <h1 property="dc:title" datatype="rdf:XMLLiteral">
>      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>    </h1>
>  </div>
> 
> At the time I was working on this I rejected this as probably the
> worst solution. :) My reasoning was simply that in examples such as
> this, the title is _already_ mark-up, since it originates from an
> XHTML document. The author clearly knows what they are doing, and so
> for them to have to repeat the fact that the title is mark-up is
> counter-intuitive, and breaks with the idea that we are 'decorating'
> XHTML, rather than fundamentally modifying it.
> 
> 
> PROPOSAL 2: ALL TEXT IS XSD:STRING
> 
> The second option is also to say there is no trigger, but that instead
> of using plain literals, the data is automatically typed as an
> xsd:string:
> 
>  <>
>    dc:title
>    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
>    .
> 
> Although this solves some use cases, as I'll discuss at the end it
> doesn't solve all, and I think we should be very careful with this.
> 
> 
> PROPOSAL 3: ALL TEXT IS XML LITERAL
> 
> The third option--as we know, the one I actually went with--is to flip
> things round, and ask whether the ordinary string (or plain literal)
> couldn't be represented by an rdf:XMLLiteral? So this:
> 
>  <div about="">
>    <h1 property="dc:title">RDFa Primer</h1>
>  </div>
> 
> parses as this:
> 
>  <> dc:title "RDFa Primer"^^rdf:XMLLiteral .
> 
> In other words, the 'trigger' to create an rdf:XMLLiteral is any use
> of @property where the object of the statement appears in *mark-up*.
> There is a strong logic to this.
> 
> First, the object _really has_ appeared in mark-up. But second, at the
> level of XML itself, it is not a problem that we don't have any 'tags'
> surrounding our text, since (as XSLT makes great use of), "RDFa
> Primer" is XML as much as "<div>42</div>" is. For those not familiar
> with this idea, I'll explain.
> 
> Most people are probably familiar with XSLT, so we'll use that to
> illustrate. When XSLT 'outputs' XML, it creates 'external general
> parsed entities', which are defined as:
> 
>  [78] extParsedEnt ::= TextDecl? content
> 
> The key definition for us here is that of 'content', appearing after
> the optional TextDecl:
> 
>  [43] content ::= CharData? ((element | Reference | CDSect | PI
>                   | Comment) CharData?)*
> 
> This covers all the 'atoms' of XML, such as elements, character data,
> comments, processing instructions, and so on. In other words, the
> output of an XSLT process does not have to be a full XML document,
> with only one root node, etc. It could be a string, a comment, a
> processing instruction, an element, a list of elements, an element
> followed by a comment followed by an element...you get the picture.
> 
> I've used XSLT to illustrate the concept, since that is probably what
> many are familiar with, but much closer to home the RDF Concepts
> document talks of rdf:XMLLiteral in *exactly* this way. The document
> links to production 43--the production I quoted above--which means
> that the definition of XML literals in RDF is _already_ that it is not
> just an XML element, but that it can be any of the 'atoms' of
> XML--strings, comments, PIs, nodelists, etc.
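
The claim that bare text is as much XML 'content' as an element can be
checked roughly in Python. This is a sketch under a stated assumption:
wrapping the fragment in a dummy element and parsing the result only
approximates production [43], and the is_xml_content name and the
wrapper element are invented for the example.

```python
import xml.etree.ElementTree as ET


def is_xml_content(fragment):
    """Rough check that `fragment` matches the XML 'content' production.

    Assumption: a fragment is acceptable if it parses once wrapped in a
    dummy root element. This is an approximation, not a grammar check.
    """
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
        return True
    except ET.ParseError:
        return False


plain_text_ok = is_xml_content("RDFa Primer")
element_ok = is_xml_content("<div>42</div>")
mixed_ok = is_xml_content("<!-- note --><p/>some text<q/>")
unclosed_ok = is_xml_content("<b>unclosed")
```

The first three fragments are accepted, including the bare string; the
unclosed element is rejected.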
> 
> More significantly for our discussion, the RDF Concepts document has
> this note:
> 
>  Note: RDF applications may use additional equivalence relations, such as
>  that which relates an xsd:string with an rdf:XMLLiteral corresponding to a
>  single text node of the same string.
> 
> (See the end of section 5.1.)
> 
> What I had in mind was that some server storing the data as triples
> would somehow 'augment' the rdf:XMLLiteral data type to include
> something more specific; at least xsd:string, but perhaps also
> xsd:date, xsd:integer, and so on.
> 
> I'll come back to this 'casting' or post-processing in a moment, but
> the main point is that there is a strong argument for saying that any
> data that originates from an XHTML document is *by definition* an
> EGPE, and therefore at the very least cannot be a plain literal (and
> so #1 is out).
> 
> I'd also argue that we should be wary of making the default xsd:string
> since once done it can't be 'undone'. I don't have time to develop
> this point now, but at root is the fact that in XML Schemas, an
> xsd:integer is *not* derived from xsd:string. (A vote against #2.)
> 
> 
> NOTE: Just to tie up all loose ends, for the author who _wants_ plain
> literals--i.e., no datatype at all--the original proposal contained
> the idea that @content should provide 'non-typed' literals:
> 
>  <meta property="dc:title" content="RDFa Primer" />
> 
>  <> dc:title "RDFa Primer" .
> 
> The rationale was that attributes can't contain mark-up anyway, so
> @content could never contain an XMLLiteral.
> 
> 
> FINALLY...SPARQL
> 
> So, now we've looked at the question from the point of view of the
> mark-up, we should look at the problem raised by Ivan concerning
> SPARQL. The main point made is that by using rdf:XMLLiteral queries
> don't always match correctly. However, I don't think that choosing
> plain literals or xsd:strings over rdf:XMLLiterals will necessarily
> solve the problem Ivan is seeing, and I would suggest that in
> situations where you are querying data that you have no control over,
> the str() function should generally be used. (I'd also be interested
> to double-check whether the behaviour seen is correct in relation to
> SPARQL itself, but I'll have to look at that later.)
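
The effect of the str() suggestion can be sketched in Python. The toy
model below is an assumption, not a SPARQL engine: literals are encoded
as (lexical form, datatype) tuples, and toy_str stands in for SPARQL's
str(), which returns the lexical form of a literal. The doc1..doc3
subjects are made up.

```python
# Toy model: a literal is (lexical form, datatype); None means plain literal.
titles = [
    ("doc1", ("RDFa Primer", None)),
    ("doc2", ("RDFa Primer", "xsd:string")),
    ("doc3", ("RDFa Primer", "rdf:XMLLiteral")),
]


def toy_str(term):
    """Stand-in for SPARQL's str(): keep the lexical form, drop the datatype."""
    lexical, _datatype = term
    return lexical


wanted = ("RDFa Primer", None)

# Matching on the exact term, as a basic graph pattern would:
exact = [subject for subject, obj in titles if obj == wanted]

# Matching on the lexical form, roughly FILTER (str(?o) = "RDFa Primer"):
via_str = [subject for subject, obj in titles if toy_str(obj) == toy_str(wanted)]
```

Exact matching finds only doc1, whereas matching through toy_str finds
all three, whichever default the publisher of the data happened to use.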
> 
> 
> CONCLUSION
> 
> The ideal solution, in my view, is that we stick to rdf:XMLLiterals,
> but at some stage in the processing some level of augmentation takes
> place, and data that is identifiable as an XML Schema simple type is
> typed as such. This step could be carried out on the server that is
> storing the data into a triple store, but it might be possible to
> define the necessary regular expressions to incorporate this step into
> the RDFa specification.
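
A rough Python sketch of what such a regular-expression-based
augmentation step might look like. Everything here is hypothetical: the
augment function, the choice of datatypes, and the patterns, which are
much simpler than the real XML Schema lexical spaces (the date pattern,
for instance, does not check month or day ranges).

```python
import re

# Simplified, hypothetical patterns, tried in order.
PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?[0-9]+")),
    ("xsd:date", re.compile(r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}(Z|[+-][0-9]{2}:[0-9]{2})?")),
]


def augment(lexical):
    """Guess a more specific datatype for an rdf:XMLLiteral lexical form."""
    if "<" in lexical or "&" in lexical:
        # Looks like real mark-up (or entity references): leave it alone.
        return "rdf:XMLLiteral"
    text = lexical.strip()
    for datatype, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return datatype
    # A single text node with no recognized shape: fall back to xsd:string.
    return "xsd:string"


guessed = [augment(v) for v in ["42", "2007-02-05", "RDFa Primer", "E = mc<sup>2</sup>"]]
```

Note that a guess of this kind can be wrong: a title that happens to be
"1984" would be typed as an integer.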
> 
> Regards,
> 
> Mark
> 
> 
> On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
> 
>>
>> Just wanted to chime in on the following, if it's not too late:
>>
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
>>
>> > I am inclined to agree with you on the default datatype: it should just
>> > be a string, except if you really want some XML. What do others think?
>> >
>> > -Ben
>>
>> We (our Semantic Web group here at IBM Cambridge) agree that it should
>> be a string, since this is almost certainly going to be the common
>> case. In our use of RDFa, we always want strings. XMLLiterals should be
>> specified with the datatype attribute.
>>
>> Wing Yung
>> Internet Technology
>> wingyung@us.ibm.com
>> 617.693.3763
>>
>>
>>
>>
> 
> 

-- 

Ivan Herman, W3C Semantic Web Activity Lead
URL: http://www.w3.org/People/Ivan/
PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 5 February 2007 09:01:47 UTC