Re: [RDFa] Default datatype should be a string from Ivan Herman on 2007-02-05 (public-rdf-in-xhtml-tf@w3.org from February 2007)

From: Ivan Herman <ivan@w3.org>
Date: Mon, 05 Feb 2007 18:02:46 +0100
To: mark.birbeck@x-port.net
CC: public-rdf-in-xhtml-tf@w3.org
Message-ID: <45C76336.2090502@w3.org>
Mark,

I stand corrected: I indeed missed the comment at the end of your mail
replying to my comment. And yes, your argument below about language
tags, literals, etc., is correct.
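
To check that I read the argument correctly, here is a rough sketch in
Python. It is only an illustration: the Literal class and both helper
functions are made up for this mail and belong to no RDF library, and
the comparison is simplified (for instance, I do not normalise the case
of language tags). It contrasts the strict term equality of RDF Concepts
with a comparison on the lexical form only, which is roughly what str()
gives you in SPARQL.

```python
from typing import NamedTuple, Optional

RDF_XMLLITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


class Literal(NamedTuple):
    """Hypothetical stand-in for an RDF literal (not a real library type)."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def term_equal(a: Literal, b: Literal) -> bool:
    """Strict equality: lexical form, language and datatype must all agree."""
    return (a.lexical, a.lang, a.datatype) == (b.lexical, b.lang, b.datatype)


def lexical_equal(a: Literal, b: Literal) -> bool:
    """Compare lexical forms only, ignoring language and datatype."""
    return a.lexical == b.lexical


plain = Literal("Ivan Herman")                          # "Ivan Herman"
tagged = Literal("Ivan Herman", lang="en")              # "Ivan Herman"@en
xml = Literal("Ivan Herman", datatype=RDF_XMLLITERAL)   # ^^rdf:XMLLiteral

# The strict comparison separates all three; the lexical one does not.
print(term_equal(plain, tagged), term_equal(plain, xml))
print(lexical_equal(plain, tagged), lexical_equal(tagged, xml))
```

So a query pattern written with the plain literal only matches the first
of the three, whereas a filter on the lexical form matches all of them.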

Having said that, I am still in favour of plain literals, although I must
admit you shot down my main argument:-). Well... mostly. Although I
cannot prove it, I still believe that the most probable situation *is*
that users would expect the same literals on both sides, ie, on the RDFa
and the RDF/XML sides, if for no other reason than because plain
literals are the default in RDF/XML...

(By the way: how do I express if I *want* the output to be plain
literal? Ie, not xsd:string, which is an explicit datatype, but simply a
plain literal...)

I have another argument, a bit vague. To come back to my previous
example: if I use something like:

<div about=""><span property="bla">This <span class="red">is</span>
boring</span></div>

then the internal <span> is put into the text for formatting reasons and
not necessarily for content (I know, the boundary between content and
formatting may be blurred, but I presume you understand what I mean). Ie,
I would not expect the internal <span> to appear in my RDF output; it
only makes sense for the visual display of the HTML page. I know, I can
use datatype, but nevertheless....
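
To make the difference concrete, here is a small sketch in Python using
only the standard library. It is not how any RDFa parser works; the two
function names are invented for this mail, and collapsing whitespace in
the plain version is my own assumption rather than anything RDFa
prescribes. It takes the markup above and produces the value I would
expect (the text only) next to the value that keeps the markup.

```python
import xml.etree.ElementTree as ET

# The example markup from above; "bla" is just a placeholder property.
XHTML = (
    '<div about=""><span property="bla">This <span class="red">is</span>\n'
    'boring</span></div>'
)


def plain_literal(element):
    """Join all text nodes below the element, dropping the tags.

    Runs of whitespace are collapsed to one space (an assumption)."""
    return " ".join("".join(element.itertext()).split())


def xml_literal(element):
    """Serialise the element's content with the inner markup kept."""
    parts = [element.text or ""]
    for child in element:
        # ET.tostring() includes the text that follows the child (its tail).
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts)


root = ET.fromstring(XHTML)
span = root.find("span[@property='bla']")

print(plain_literal(span))  # text only, inner <span> gone
print(xml_literal(span))    # inner <span class="red"> still present
```

The first value is what I, as a page author, would expect to see in the
triple; the second is what keeping the markup gives.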

I.

Mark Birbeck wrote:
> Hi Ivan,
> 
> With respect, you seem to have missed some of the key points of my
> mail, or at least you haven't replied to them. That mash-ups of data
> are desirable is not controversial--that's the whole raison d'etre of
> RDFa. (And that I might be less than pragmatic in my approach to
> trying to solve RDFa problems as it attempts to straddle the less than
> perfect worlds of RDF and XHTML, I'll leave for another day.)
> 
> Anyway, I won't rehearse all the arguments again, but instead I'll
> focus on one important issue; there are a number of situations where
> SPARQL will not match something that you might expect it to match, and
> those situations have nothing to do with RDFa. We'll put
> rdf:XMLLiteral and RDFa to one side for the moment, and look at the
> general issues with using SPARQL queries.
> 
> Using your FOAF files as a source, try the following query:
> 
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> SELECT ?s
> FROM <http://www.ivan-herman.net/foaf.rdf>
> WHERE
> {
>  ?s foaf:name "Ivan Herman" .
> }
> 
> You'll see that it returns nothing! That's not my fault; the
> 'problem'--insofar as there is one--is that we have queried for the
> plain literal "Ivan Herman" when the RDF in
> <http://www.ivan-herman.net/foaf.rdf> is:
> 
>  <foaf:name xml:lang="en">Ivan Herman</foaf:name>
> 
> which is:
> 
>  "Ivan Herman"@en
> 
> The definition of equality in RDF Concepts says that not only must the
> datatype match (as you point out) but also the language (and obviously
> the string itself).
> 
> So, just to be clear, we *already* have a problem querying your
> standard-issue RDF file, which obviously means that any issues we now
> need to resolve are not necessarily to do with RDFa.
> 
> It's interesting to note that there is nothing that you can do to
> address this 'issue' at the level of your RDF file. For example, say
> you removed the language tag from your RDF; there is still nothing to
> stop someone else from using a language tag on your name in _their_
> RDF, and so the query we just used will still fail to find some data,
> somewhere.
> 
> Similarly, if you were to change your query so that you actually tried
> to match on the language:
> 
> .
> .
> .
> WHERE
> {
>  ?s foaf:name "Ivan Herman"@en .
> }
> 
> there would still be no way to ensure that everyone else used exactly
> the same language tag when they placed your name in their RDF, and as
> a consequence, there would still be some data that you couldn't find.
> 
> Whichever way we twist and turn, unless we have complete control over
> both the data and the queries, one cannot write SPARQL queries in such
> a way that they ignore the narrow 'RDF Concepts' definition of
> equality. (I've used 'language' to make the point here, but you'll
> find the same issue comes up with datatypes; since you cannot control
> whether I use "42" or "42"^^xsd:integer, you cannot write simple
> queries that find both.)
> 
> However, if we 'normalise' the values in foaf:name by using the str()
> function, things are much easier and predictable:
> 
> .
> .
> .
> WHERE
> {
>  ?s foaf:name ?name .
> 
>  FILTER (str(?name) = "Ivan Herman")
> }
> 
> Now our query is saying that we want to match against the *lexical*
> value of foaf:name, and therefore we're no longer 'hampered' by
> languages and datatypes. With this query, you'll see that the value in
> your RDF file correctly appears.
> 
> Note also that if we modify the query above to include a reference to
> your RDFa data:
> 
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> SELECT ?s ?name
> FROM <http://torrez.us/services/rdfa/http://www.w3.org/People/Ivan/>
> FROM <http://www.ivan-herman.net/foaf.rdf>
> .
> .
> .
> 
> the foaf:name from _that_ document also now appears in our search
> results. In other words, we've got the correct data regardless of
> language and datatypes.
> 
> To recap, we've gone from getting no results, to getting two; the one
> from your RDF file that has a language:
> 
>  <foaf:name xml:lang="en">Ivan Herman</foaf:name>
> 
> and the one from your RDFa file which has an rdf:XMLLiteral datatype:
> 
>  <foaf:name
>    rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
>  >
>    Ivan Herman
>  </foaf:name>
> 
> We've created a query that returns the expected results, and it does
> so regardless of how the data was originally entered.
> 
> Given this, let's look at what needs to be done to your original query
> to make it work:
> 
> SELECT ?thisIsWhatIWant
> WHERE {
>  ?something foaf:name ?me.
>  ?otherResource contact:fullName ?me.
>  ?otherResource something:here ?thisIsWhatIWant.
> }
> 
> That this gives unexpected results has nothing to do with RDFa;
> rather, the query relies on there being a narrow 'RDF Concepts' notion
> of equivalence between foaf:name and contact:fullName. As we saw above,
> if we don't control both the data and the queries, then there is no
> guarantee that there will be such equivalence; one value might have a
> language of "en" and the other might have no language, one might have
> a datatype of something that we've never heard of, that is derived
> from an xsd:string, and the other might not, and so on. In other words
> there are plenty of reasons why the two strings won't match, when
> using the very rigid notion of equality used in RDF Concepts, the
> least of them being RDFa's use of rdf:XMLLiteral.
> 
> Therefore, the 'defensive programming' approach to writing this query
> is simply to ensure that the string comparison takes place on the
> lexical version of the strings, obtained using str(). To illustrate,
> we'll get the date-of-birth of anyone who has two FOAF entries that
> use the same foaf:name, as follows:
> 
> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
> SELECT ?bDOB
> FROM <http://torrez.us/services/rdfa/http://www.w3.org/People/Ivan/>
> FROM <http://www.ivan-herman.net/foaf.rdf>
> WHERE
> {
>  ?a foaf:name ?aName .
>  ?b foaf:name ?bName .
>  ?b foaf:dateOfBirth ?bDOB .
> 
>  FILTER
>  (
>    (str(?aName) = str(?bName))
>    &&
>    (?a != ?b)
>  )
> }
> 
> (This is similar to your original example, but not quite the same
> since I couldn't find a reference to contact:fullName in the two files
> I looked at, so I just reused foaf:name.)
> 
> This is not a hack or a workaround, but is actually a more correct
> SPARQL query, based on the notion of equality defined in RDF Concepts.
> You'd also need to do this if you were searching for dates, numbers
> and so on.
> 
> 
> CONCLUSION
> 
> The main point of this email is to clarify that SPARQL by its
> nature--because it is querying RDF--requires care when crafting
> queries, since there are a number of different ways of arriving at the
> same lexical value. That SPARQL has this aspect independent of RDFa is
> therefore not an argument for removing the use of rdf:XMLLiteral, and
> in turn flags up that simply using plain literals or xsd:string
> instead of rdf:XMLLiteral does *not* absolve the user from needing to
> be careful when constructing queries.
> 
> None of this is to say that we shouldn't review our use of
> rdf:XMLLiteral if there are other criteria, and I certainly look
> forward to seeing some replies to my comments on the specifics of the
> appropriateness of alternative data types. But the fact that we have
> to take care when crafting SPARQL queries is not an argument for
> changing from rdf:XMLLiteral.
> 
> Best regards,
> 
> Mark
> 
> 
> On 05/02/07, Ivan Herman <ivan@w3.org> wrote:
> 
>> Mark,
>>
>> I think there is one consideration really missing from your argumentation
>> (and that is what made me become the proponent of the plain literal
>> solution): RDF information retrieved from RDFa should be easily mashed
>> up with RDF from other sources. And that is what becomes a problem. The
>> example I give in
>>
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2006Nov/0000
>>
>> is real: the current setup with XMLLiteral made it impossible to mash up
>> the FOAF statements from an RDFa-marked page with a foaf file simply
>> edited by hand. (ie, I had to change my demonstration back then for a
>> more complicated solution!) The default in the case of RDF/XML and
>> Turtle is plain literal; we should not depart from that.
>>
>> I played with the idea of saying "let it be plain literal if no HTML
>> tags are present, and XMLLiteral otherwise". Ie, your RDFa example would
>> indeed lead to <> dc:title "RDFa Primer", whereas your einstein example
>> would lead to an XMLLiteral. But this is quite error prone to me.
>> Imagine that one makes up a page with
>>
>> <div about="">
>>         <span property="bla">this is boring</span>
>> </div>
>>
>> which would lead to
>>
>>     <> bla "this is boring"
>>
>> but then, *later*, changes the text by saying
>>
>> <div about="">
>>         <span property="bla">this <i>is</i> boring</span>
>> </div>
>>
>> which then suddenly leads to
>>
>>    <> bla "this <i>is</i> boring"^^XMLLiteral
>>
>> Although the intention was clearly to change the text from a
>> presentation point of view only (that is why I used <i> and not <em>),
>> the triple becomes very different, and previously working queries might
>> suddenly fail.
>>
>> As far as I am concerned, the practical issues raised by the XMLLiteral
>> approach clearly outweigh other considerations...
>>
>> My two pence...
>>
>> Ivan
>>
>>
>> Mark Birbeck wrote:
>> >
>> > Hello all,
>> >
>> > I don't think the issue has been understood correctly, so I'll
>> > re-construct the thought processes I went through when working through
>> > the use of rdf:XMLLiteral, way, way back, in an early draft. I'm not
>> > at all suggesting that my solution is beyond dispute :), but if people
>> > want to change it, I think we all need to understand what the problems
>> > were that we were originally trying to address.
>> >
>> > The issue is not about which of "XML mark-up" or "strings" is the most
>> > common situation; I take it as given that 'strings' will be more
>> > common. :) The issue is essentially whether there is any need to
>> > distinguish between them, and if there isn't, whether we can use that
>> > fact to our advantage to make RDFa easier to author. I think it's
>> > important to state the problem this way round, since if there *is* a
>> > problem with always using XMLLiteral, then of course we can't do what
>> > I originally proposed!
>> >
>> >
>> > PLAIN VERSUS TYPED LITERALS
>> >
>> > To set the context, the first thing to remember is that in RDF,
>> > xsd:string is *not* the default datatype. In RDF, plain literals do
>> > not have *any* datatype. The word 'string' is being used very loosely
>> > in this discussion--from the subject line of the thread to comments
>> > added--and we need to be clear on what is being proposed.
>> >
>> > At first sight the lack of any type seems fine; after all, why should
>> > we worry that this:
>> >
>> >  <div about="">
>> >    <h1 property="dc:title">RDFa Primer</h1>
>> >  </div>
>> >
>> > can produce this:
>> >
>> >  <> dc:title "RDFa Primer" .
>> >
>> > But unfortunately we _do_ need to worry. If we take, for example,
>> > Einstein's famous 1946 article on nuclear weapons, we would obviously
>> > mark it up as follows:
>> >
>> >  <div about="">
>> >    <h1 property="dc:title">
>> >      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>> >    </h1>
>> >  </div>
>> >
>> > We have to ask what would we *like* this mark-up to generate, and I
>> > think it's clear we'd want this:
>> >
>> >  <>
>> >    dc:title
>> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
>> >    .
>> >
>> > But of course this is the crux of the problem; our preference for the
>> > first example was a plain literal, but our preference for the second
>> > was an XML literal, so we must now ask what it is that could 'trigger'
>> > this difference in parsing behaviour.
>> >
>> >
>> > PROPOSAL 1: ALL TEXT IS PLAIN LITERAL
>> >
>> > The first option is to say that actually there is no trigger, and that
>> > _all_ text should be treated as a plain literal unless the author says
>> > otherwise. So our example would produce this:
>> >
>> >  <>
>> >    dc:title
>> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
>> >    .
>> >
>> > To create our original triples, the author would make use of
>> > @datatype, and write this:
>> >
>> >  <div about="">
>> >    <h1 property="dc:title" datatype="rdf:XMLLiteral">
>> >      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>> >    </h1>
>> >  </div>
>> >
>> > At the time I was working on this I rejected this as probably the
>> > worst solution. :) My reasoning was simply that in examples such as
>> > this, the title is _already_ mark-up, since it originates from an
>> > XHTML document. The author clearly knows what they are doing, and so
>> > for them to have to repeat the fact that the title is mark-up is
>> > counter-intuitive, and breaks with the idea that we are 'decorating'
>> > XHTML, rather than fundamentally modifying it.
>> >
>> >
>> > PROPOSAL 2: ALL TEXT IS XSD:STRING
>> >
>> > The second option is also to say there is no trigger, but that instead
>> > of using plain literals, the data is automatically typed as an
>> > xsd:string:
>> >
>> >  <>
>> >    dc:title
>> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
>> >    .
>> >
>> > Although this solves some use cases, as I'll discuss at the end it
>> > doesn't solve all, and I think we should be very careful with this.
>> >
>> >
>> > PROPOSAL 3: ALL TEXT IS XML LITERAL
>> >
>> > The third option--as we know, the one I actually went with--is to flip
>> > things round, and ask whether the ordinary string (or plain literal)
>> > couldn't be represented by an rdf:XMLLiteral? So this:
>> >
>> >  <div about="">
>> >    <h1 property="dc:title">RDFa Primer</h1>
>> >  </div>
>> >
>> > parses as this:
>> >
>> >  <> dc:title "RDFa Primer"^^rdf:XMLLiteral .
>> >
>> > In other words, the 'trigger' to create an rdf:XMLLiteral is any use
>> > of @property where the object of the statement appears in *mark-up*.
>> > There is a strong logic to this.
>> >
>> > First, the object _really has_ appeared in mark-up. But second, at the
>> > level of XML itself, it is not a problem that we don't have any 'tags'
>> > surrounding our text, since (as XSLT makes great use of), "RDFa
>> > Primer" is XML as much as "<div>42</div>" is. For those not familiar
>> > with this idea, I'll explain.
>> >
>> > Most people are probably familiar with XSLT, so we'll use that to
>> > illustrate. When XSLT 'outputs' XML, it creates 'external general
>> > parsed entities', which are defined as:
>> >
>> >  [78] extParsedEnt ::= TextDecl? content
>> >
>> > The key definition for us here is that of 'content', appearing after
>> > the optional TextDecl:
>> >
>> >  [43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
>> >
>> > This covers all the 'atoms' of XML, such as elements, character data,
>> > comments, processing instructions, and so on. In other words, the
>> > output of an XSLT process does not have to be a full XML document,
>> > with only one root node, etc. It could be a string, a comment, a
>> > processing instruction, an element, a list of elements, an element
>> > followed by a comment followed by an element...you get the picture.
>> >
>> > I've used XSLT to illustrate the concept, since that is probably what
>> > many are familiar with, but much closer to home the RDF Concepts
>> > document talks of rdf:XMLLiteral in *exactly* this way. The document
>> > links to production 43--the production I quoted above--which means
>> > that the definition of XML literals in RDF is _already_ that it is not
>> > just an XML element, but that it can be any of the 'atoms' of
>> > XML--strings, comments, PIs, nodelists, etc.
>> >
>> > More significantly for our discussion, the RDF Concepts document has
>> > this note:
>> >
>> >  Note: RDF applications may use additional equivalence relations, such as
>> >  that which relates an xsd:string with an rdf:XMLLiteral corresponding to a
>> >  single text node of the same string.
>> >
>> > (See the end of section 5.1.)
>> >
>> > What I had in mind was that some server storing the data as triples
>> > would somehow 'augment' the rdf:XMLLiteral data type to include
>> > something more specific; at least xsd:string, but perhaps also
>> > xsd:date, xsd:integer, and so on.
>> >
>> > I'll come back to this 'casting' or post-processing in a moment, but
>> > the main point is that there is a strong argument for saying that any
>> > data that originates from an XHTML document is *by definition* an
>> > EGPE, and therefore at the very least cannot be a plain literal (and
>> > so #1 is out).
>> >
>> > I'd also argue that we should be wary of making the default xsd:string
>> > since once done it can't be 'undone'. I don't have time to develop
>> > this point now, but at root is the fact that in XML Schemas, an
>> > xsd:integer is *not* derived from xsd:string. (A vote against #2.)
>> >
>> >
>> > NOTE: Just to tie up all loose ends, for the author who _wants_ plain
>> > literals--i.e., no datatype at all--the original proposal contained
>> > the idea that @content should provide 'non-typed' literals:
>> >
>> >  <meta property="dc:title" content="RDFa Primer" />
>> >
>> >  <> dc:title "RDFa Primer" .
>> >
>> > The rationale was that attributes can't contain mark-up anyway, so
>> > @content could never contain an XMLLiteral.
>> >
>> >
>> > FINALLY...SPARQL
>> >
>> > So, now we've looked at the question from the point of view of the
>> > mark-up, we should look at the problem raised by Ivan concerning
>> > SPARQL. The main point made is that by using rdf:XMLLiteral queries
>> > don't always match correctly. However, I don't think that choosing
>> > plain literals or xsd:strings over rdf:XMLLiterals will necessarily
>> > solve the problem Ivan is seeing, and I would suggest that in
>> > situations where you are querying data that you have no control over,
>> > the str() function should generally be used. (I'd also be interested
>> > to double-check whether the behaviour seen is correct in relation to
>> > SPARQL itself, but I'll have to look at that later.)
>> >
>> >
>> > CONCLUSION
>> >
>> > The ideal solution in my view, is that we stick to rdf:XMLLiterals,
>> > but at some stage in the processing some level of augmentation takes
>> > place, and data that is identifiable as an XML Schema simple type is
>> > typed as such. This step could be carried out on the server that is
>> > storing the data into a triple store, but it might be possible to
>> > define the necessary regular expressions to incorporate this step into
>> > the RDFa specification.
>> >
>> > Regards,
>> >
>> > Mark
>> >
>> >
>> > On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
>> >
>> >>
>> >> Just wanted to chime in on the following, if it's not too late:
>> >>
>> >>
>> >> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
>> >>
>> >> > I am inclined to agree with you on the default datatype: it should just
>> >> > be a string, except if you really want some XML. What do others think?
>> >> >
>> >> > -Ben
>> >>
>> >> We (our Semantic Web group here at IBM Cambridge) agree that it should
>> >> be a string since this is almost certainly going to be the common case.
>> >> In our use of RDFa, we always want strings. XMLLiterals should be
>> >> specified with the datatype attribute.
>> >>
>> >> Wing Yung
>> >> Internet Technology
>> >> wingyung@us.ibm.com
>> >> 617.693.3763
>> >>
>> >>
>> >>
>> >>
>> >
>> >
>>
>> -- 
>>
>> Ivan Herman, W3C Semantic Web Activity Lead
>> URL: http://www.w3.org/People/Ivan/
>> PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
>> FOAF: http://www.ivan-herman.net/foaf.rdf
>>
>>
>>
> 
> 

-- 

Ivan Herman, W3C Semantic Web Activity Lead
URL: http://www.w3.org/People/Ivan/
PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 5 February 2007 17:03:25 UTC