Re: [RDFa] Default datatype should be a string

From: Mark Birbeck <mark.birbeck@x-port.net>
Date: Mon, 5 Feb 2007 15:40:01 +0000
Message-ID: <640dd5060702050740q53bac030j631fe15374f26e62@mail.gmail.com>
To: "Ivan Herman" <ivan@w3.org>
Cc: public-rdf-in-xhtml-tf@w3.org

Hi Ivan,

With respect, you seem to have missed some of the key points of my
mail, or at least you haven't replied to them. That mash-ups of data
are desirable is not controversial--that's the whole raison d'etre of
RDFa. (And that I might be less than pragmatic in my approach to
trying to solve RDFa problems as it attempts to straddle the less than
perfect worlds of RDF and XHTML, I'll leave for another day.)

Anyway, I won't rehearse all the arguments again, but instead I'll
focus on one important issue; there are a number of situations where
SPARQL will not match something that you might expect it to match, and
those situations have nothing to do with RDFa. We'll put
rdf:XMLLiteral and RDFa to one side for the moment, and look at the
general issues with using SPARQL queries.

Using your FOAF files as a source, try the following query:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s
FROM <http://www.ivan-herman.net/foaf.rdf>
WHERE
{
  ?s foaf:name "Ivan Herman" .
}

You'll see that it returns nothing! That's not my fault; the
'problem'--insofar as there is one--is that we have queried for the
plain literal "Ivan Herman" when the RDF in
<http://www.ivan-herman.net/foaf.rdf> is:

  <foaf:name xml:lang="en">Ivan Herman</foaf:name>

which is:

  "Ivan Herman"@en

The definition of equality in RDF Concepts says that not only must the
datatype match (as you point out) but also the language (and obviously
the string itself).
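
As a rough sketch of that rule (in Python rather than SPARQL, purely for
illustration; the Literal class and its field names are made up here and
are not taken from any RDF library), two literals are only the same
term when all three parts agree:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal (not a real library class)."""
    lexical: str                    # the string itself
    lang: Optional[str] = None      # language tag, e.g. "en"
    datatype: Optional[str] = None  # datatype URI, if any


def same_literal(a: Literal, b: Literal) -> bool:
    """Equal only if string, language and datatype all agree."""
    return (a.lexical, a.lang, a.datatype) == (b.lexical, b.lang, b.datatype)


queried = Literal("Ivan Herman")             # what the query asks for
stored = Literal("Ivan Herman", lang="en")   # what the RDF file contains

print(same_literal(queried, stored))                            # False
print(same_literal(stored, Literal("Ivan Herman", lang="en")))  # True
```

The first comparison fails only because of the language tag, which is
the situation the query above runs into.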

So, just to be clear, we *already* have a problem querying your
standard-issue RDF file, which obviously means that any issues we now
need to resolve are not necessarily to do with RDFa.

It's interesting to note that there is nothing that you can do to
address this 'issue' at the level of your RDF file. For example, say
you removed the language tag from your RDF; there is still nothing to
stop someone else from using a language tag on your name in _their_
RDF, and so the query we just used will still fail to find some data,
somewhere.

Similarly, if you were to change your query so that you actually tried
to match on the language:

.
.
.
WHERE
{
  ?s foaf:name "Ivan Herman"@en .
}

there would still be no way to ensure that everyone else used exactly
the same language tag when they placed your name in their RDF, and as
a consequence, there would still be some data that you couldn't find.

Whichever way we twist and turn, unless we have complete control over
both the data and the queries, one cannot write SPARQL queries in such
a way that they ignore the narrow 'RDF Concepts' definition of
equality. (I've used 'language' to make the point here, but you'll
find the same issue comes up with datatypes; since you cannot control
whether I use "42" or "42"^^xsd:integer, you cannot write simple
queries that find both.)

However, if we 'normalise' the values in foaf:name by using the str()
function, things become much easier and more predictable:

.
.
.
WHERE
{
  ?s foaf:name ?name .

  FILTER (str(?name) = "Ivan Herman")
}

Now our query is saying that we want to match against the *lexical*
value of foaf:name, and therefore we're no longer 'hampered' with
languages and datatypes. With this query, you'll see that the value in
your RDF file correctly appears.

Note also that if we modify the query above to include a reference to
your RDFa data:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?s ?name
FROM <http://torrez.us/services/rdfa/http://www.w3.org/People/Ivan/>
FROM <http://www.ivan-herman.net/foaf.rdf>
.
.
.

the foaf:name from _that_ document also now appears in our search
results. In other words, we've got the correct data regardless of
language and datatypes.

To recap, we've gone from getting no results, to getting two; the one
from your RDF file that has a language:

  <foaf:name xml:lang="en">Ivan Herman</foaf:name>

and the one from your RDFa file which has an rdf:XMLLiteral datatype:

  <foaf:name
    rdf:datatype="http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"
  >
    Ivan Herman
  </foaf:name>

We've created a query that returns the expected results, and it does
so regardless of how the data was originally entered.
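
The effect of str() can be sketched in a few lines of Python. This is
only an illustration under stated assumptions: Literal and sparql_str
are hypothetical names (not a real API), the sample values are typed in
by hand rather than fetched from the two documents, and the whitespace
around the XMLLiteral value is ignored for simplicity.

```python
from dataclasses import dataclass
from typing import Optional

XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


@dataclass(frozen=True)
class Literal:
    """Hypothetical stand-in for an RDF literal (not a real library class)."""
    lexical: str
    lang: Optional[str] = None
    datatype: Optional[str] = None


def sparql_str(value: Literal) -> str:
    """Mimic str() on a literal: keep the lexical form, drop language and datatype."""
    return value.lexical


names = [
    Literal("Ivan Herman", lang="en"),             # as in the RDF file
    Literal("Ivan Herman", datatype=XML_LITERAL),  # as in the RDFa output
    Literal("Somebody Else"),                      # unrelated, made-up value
]

# Exact term matching against a plain literal finds nothing...
exact = [n for n in names if n == Literal("Ivan Herman")]
# ...whereas comparing lexical forms finds both variants.
by_str = [n for n in names if sparql_str(n) == "Ivan Herman"]

print(len(exact), len(by_str))  # 0 2
```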

Given this, let's look at what needs to be done to your original query
to make it work:

SELECT ?thisIsWhatIWant
WHERE {
  ?something foaf:name ?me.
  ?otherResource contact:fullName ?me.
  ?otherResource something:here ?thisIsWhatIWant.
}

The reason this gives unexpected results has nothing to do with RDFa;
rather, the query relies on there being a narrow 'RDF Concepts' notion
of equivalence between foaf:name and contact:fullName. As we saw above,
if we don't control both the data and the queries, then there is no
guarantee that there will be such equivalence; one value might have a
language of "en" and the other might have no language, one might have
a datatype of something that we've never heard of, that is derived
from an xsd:string, and the other might not, and so on. In other words
there are plenty of reasons why the two strings won't match, when
using the very rigid notion of equality used in RDF Concepts, the
least of them being RDFa's use of rdf:XMLLiteral.

Therefore, the 'defensive programming' approach to writing this query
is simply to ensure that the string comparison takes place on the
lexical version of the strings, obtained using str(). To illustrate,
we'll get the date-of-birth of anyone who has two FOAF entries that
use the same foaf:name, as follows:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?bDOB
FROM <http://torrez.us/services/rdfa/http://www.w3.org/People/Ivan/>
FROM <http://www.ivan-herman.net/foaf.rdf>
WHERE
{
  ?a foaf:name ?aName .
  ?b foaf:name ?bName .
  ?b foaf:dateOfBirth ?bDOB .

  FILTER
  (
    (str(?aName) = str(?bName))
    &&
    (?a != ?b)
  )
}

(This is similar to your original example, but not quite the same
since I couldn't find a reference to contact:fullName in the two files
I looked at, so I just reused foaf:name.)

This is not a hack or a workaround, but is actually a more correct
SPARQL query, based on the notion of equality defined in RDF Concepts.
You'd also need to do this if you were searching for dates, numbers
and so on.
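
For anyone who finds it easier to read as ordinary code, the join that
the FILTER performs can be sketched in Python as follows. Everything
here is an assumption made for illustration: the subject names and the
date of birth are invented placeholders, and the tuple layout is not
how any real triple store represents its data.

```python
# (subject, property, lexical form, language tag, datatype)
# All values below are invented placeholders, not data from the real files.
triples = [
    ("doc1#me", "foaf:name", "Ivan Herman", "en", None),
    ("doc2#me", "foaf:name", "Ivan Herman", None, "rdf:XMLLiteral"),
    ("doc2#me", "foaf:dateOfBirth", "1900-01-01", None, None),
]


def values(prop):
    """Return (subject, lexical form) pairs, ignoring language and datatype as str() would."""
    return [(s, lex) for (s, p, lex, _lang, _dt) in triples if p == prop]


def dob_of_same_named():
    """Dates of birth of any ?b that shares a lexical name with a different ?a."""
    names = values("foaf:name")
    dobs = values("foaf:dateOfBirth")
    return [
        dob
        for a, a_name in names
        for b, b_name in names
        for b_subject, dob in dobs
        if a != b and a_name == b_name and b_subject == b
    ]


print(dob_of_same_named())  # ['1900-01-01']
```

Had the comparison been on the full (lexical, language, datatype)
triple rather than on the lexical form alone, the two names would not
have joined and the list would be empty.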


CONCLUSION

The main point of this email is to clarify that SPARQL by its
nature--because it is querying RDF--requires care when crafting
queries, since there are a number of different ways of arriving at the
same lexical value. Since SPARQL has this aspect independently of RDFa,
it is not an argument for removing the use of rdf:XMLLiteral; it also
shows that simply using plain literals or xsd:string instead of
rdf:XMLLiteral does *not* absolve the user from needing to be careful
when constructing queries.

None of this is to say that we shouldn't review our use of
rdf:XMLLiteral if there are other criteria, and I certainly look
forward to seeing some replies to my comments on the specifics of the
appropriateness of alternative data types. But the fact that we have
to take care when crafting SPARQL queries is not an argument for
changing from rdf:XMLLiteral.

Best regards,

Mark


On 05/02/07, Ivan Herman <ivan@w3.org> wrote:
> Mark,
>
> I think there is one consideration really missing from your argumentation
> (and that is what made me become the proponent of the plain literal
> solution): RDF information retrieved from RDFa should be easily mashed
> up with RDF from other sources. And that is what becomes a problem. The
> example I give in
>
> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2006Nov/0000
>
> is real: the current setup with XMLLiteral made it impossible to mash up
> the FOAF statements from an RDFa marked page with a foaf file simply
> edited by hand. (ie, I had to change my demonstration back then for a
> more complicated solution!) The default in the case of RDF/XML and
> Turtle is plain literal; we should not depart from that.
>
> I played with the idea of saying "let it be plain literal if no HTML
> tags are present, and XMLLiteral otherwise". Ie, your RDFa example would
> indeed lead to <> dc:title "RDFa Primer", whereas your einstein example
> would lead to an XMLLiteral. But this is quite error prone to me.
> Imagine that one makes up a page with
>
> <div about="">
>         <span property="bla">this is boring</span>
> </div>
>
> which would lead to
>
>     <> bla "this is boring"
>
> but then, *later*, changes the text by saying
>
> <div about="">
>         <span property="bla">this <i>is</i> boring</span>
> </div>
>
> which then suddenly leads to
>
>    <> bla "this <i>is</i> boring"^^XMLLiteral
>
> Although the intention was clearly to change the text from a
> presentation point of view only (that is why I used <i> and not <em>),
> the triplet becomes very different, and previously working queries might
> suddenly fail.
>
> As far as I am concerned, the practical issues raised by the XMLLiteral
> approach clearly outweigh other considerations...
>
> My two pence...
>
> Ivan
>
>
> Mark Birbeck wrote:
> >
> > Hello all,
> >
> > I don't think the issue has been understood correctly, so I'll
> > re-construct the thought processes I went through when working through
> > the use of rdf:XMLLiteral, way, way back, in an early draft. I'm not
> > at all suggesting that my solution is beyond dispute :), but if people
> > want to change it, I think we all need to understand what the problems
> > were that we were originally trying to address.
> >
> > The issue is not about which of "XML mark-up" or "strings" is the most
> > common situation; I take it as given that 'strings' will be more
> > common. :) The issue is essentially whether there is any need to
> > distinguish between them, and if there isn't, whether we can use that
> > fact to our advantage to make RDFa easier to author. I think it's
> > important to state the problem this way round, since if there *is* a
> > problem with always using XMLLiteral, then of course we can't do what
> > I originally proposed!
> >
> >
> > PLAIN VERSUS TYPED LITERALS
> >
> > To set the context, the first thing to remember is that in RDF,
> > xsd:string is *not* the default datatype. In RDF, plain literals do
> > not have *any* datatype. The word 'string' is being used very loosely
> > in this discussion--from the subject line of the thread to comments
> > added--and we need to be clear on what is being proposed.
> >
> > At first sight the lack of any type seems fine; after all, why should
> > we worry that this:
> >
> >  <div about="">
> >    <h1 property="dc:title">RDFa Primer</h1>
> >  </div>
> >
> > can produce this:
> >
> >  <> dc:title "RDFa Primer" .
> >
> > But unfortunately we _do_ need to worry. If we take, for example,
> > Einstein's famous 1946 article on nuclear weapons, we would obviously
> > mark it up as follows:
> >
> >  <div about="">
> >    <h1 property="dc:title">
> >      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
> >    </h1>
> >  </div>
> >
> > We have to ask what would we *like* this mark-up to generate, and I
> > think it's clear we'd want this:
> >
> >  <>
> >    dc:title
> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
> >    .
> >
> > But of course this is the crux of the problem; our preference for the
> > first example was a plain literal, but our preference for the second
> > was an XML literal, so we must now ask what it is that could 'trigger'
> > this difference in parsing behaviour.
> >
> >
> > PROPOSAL 1: ALL TEXT IS PLAIN LITERAL
> >
> > The first option is to say that actually there is no trigger, and that
> > _all_ text should be treated as a plain literal unless the author says
> > otherwise. So our example would produce this:
> >
> >  <>
> >    dc:title
> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
> >    .
> >
> > To create our original triples, the author would make use of
> > @datatype, and write this:
> >
> >  <div about="">
> >    <h1 property="dc:title" datatype="rdf:XMLLiteral">
> >      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
> >    </h1>
> >  </div>
> >
> > At the time I was working on this I rejected this as probably the
> > worst solution. :) My reasoning was simply that in examples such as
> > this, the title is _already_ mark-up, since it originates from an
> > XHTML document. The author clearly knows what they are doing, and so
> > for them to have to repeat the fact that the title is mark-up is
> > counter-intuitive, and breaks with the idea that we are 'decorating'
> > XHTML, rather than fundamentally modifying it.
> >
> >
> > PROPOSAL 2: ALL TEXT IS XSD:STRING
> >
> > The second option is also to say there is no trigger, but that instead
> > of using plain literals, the data is automatically typed as an
> > xsd:string:
> >
> >  <>
> >    dc:title
> >    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
> >    .
> >
> > Although this solves some use cases, as I'll discuss at the end it
> > doesn't solve all, and I think we should be very careful with this.
> >
> >
> > PROPOSAL 3: ALL TEXT IS XML LITERAL
> >
> > The third option--as we know, the one I actually went with--is to flip
> > things round, and ask whether the ordinary string (or plain literal)
> > couldn't be represented by an rdf:XMLLiteral? So this:
> >
> >  <div about="">
> >    <h1 property="dc:title">RDFa Primer</h1>
> >  </div>
> >
> > parses as this:
> >
> >  <> dc:title "RDFa Primer"^^rdf:XMLLiteral .
> >
> > In other words, the 'trigger' to create an rdf:XMLLiteral is any use
> > of @property where the object of the statement appears in *mark-up*.
> > There is a strong logic to this.
> >
> > First, the object _really has_ appeared in mark-up. But second, at the
> > level of XML itself, it is not a problem that we don't have any 'tags'
> > surrounding our text, since (as XSLT makes great use of), "RDFa
> > Primer" is XML as much as "<div>42</div>" is. For those not familiar
> > with this idea, I'll explain.
> >
> > Most people are probably familiar with XSLT, so we'll use that to
> > illustrate. When XSLT 'outputs' XML, it creates 'external general
> > parsed entities', which are defined as:
> >
> >  [78] extParsedEnt ::= TextDecl? content
> >
> > The key definition for us here is that of 'content', appearing after
> > the optional TextDecl:
> >
> >  [43] content ::= CharData? ((element | Reference | CDSect | PI |
> > Comment) CharData?)*
> >
> > This covers all the 'atoms' of XML, such as elements, character data,
> > comments, processing instructions, and so on. In other words, the
> > output of an XSLT process does not have to be a full XML document,
> > with only one root node, etc. It could be a string, a comment, a
> > processing instruction, an element, a list of elements, an element
> > followed by a comment followed by an element...you get the picture.
> >
> > I've used XSLT to illustrate the concept, since that is probably what
> > many are familiar with, but much closer to home the RDF Concepts
> > document talks of rdf:XMLLiteral in *exactly* this way. The document
> > links to production 43--the production I quoted above--which means
> > that the definition of XML literals in RDF is _already_ that it is not
> > just an XML element, but that it can be any of the 'atoms' of
> > XML--strings, comments, PIs, nodelists, etc.
> >
> > More significantly for our discussion, the RDF Concepts document has
> > this note:
> >
> >  Note: RDF applications may use additional equivalence relations, such as
> >  that which relates an xsd:string with an rdf:XMLLiteral corresponding to a
> >  single text node of the same string.
> >
> > (See the end of section 5.1.)
> >
> > What I had in mind was that some server storing the data as triples
> > would somehow 'augment' the rdf:XMLLiteral data type to include
> > something more specific; at least xsd:string, but perhaps also
> > xsd:date, xsd:integer, and so on.
> >
> > I'll come back to this 'casting' or post-processing in a moment, but
> > the main point is that there is a strong argument for saying that any
> > data that originates from an XHTML document is *by definition* an
> > EGPE, and therefore at the very least cannot be a plain literal (and
> > so #1 is out).
> >
> > I'd also argue that we should be wary of making the default xsd:string
> > since once done it can't be 'undone'. I don't have time to develop
> > this point now, but at root is the fact that in XML Schemas, an
> > xsd:integer is *not* derived from xsd:string. (A vote against #2.)
> >
> >
> > NOTE: Just to tie up all loose ends, for the author who _wants_ plain
> > literals--i.e., no datatype at all--the original proposal contained
> > the idea that @content should provide 'non-typed' literals:
> >
> >  <meta property="dc:title" content="RDFa Primer" />
> >
> >  <> dc:title "RDFa Primer" .
> >
> > The rationale was that attributes can't contain mark-up anyway, so
> > @content could never contain an XMLLiteral.
> >
> >
> > FINALLY...SPARQL
> >
> > So, now we've looked at the question from the point of view of the
> > mark-up, we should look at the problem raised by Ivan concerning
> > SPARQL. The main point made is that by using rdf:XMLLiteral queries
> > don't always match correctly. However, I don't think that choosing
> > plain literals or xsd:strings over rdf:XMLLiterals will necessarily
> > solve the problem Ivan is seeing, and I would suggest that in
> > situations where you are querying data that you have no control over,
> > the str() function should generally be used. (I'd also be interested
> > to double-check whether the behaviour seen is correct in relation to
> > SPARQL itself, but I'll have to look at that later.)
> >
> >
> > CONCLUSION
> >
> > The ideal solution in my view, is that we stick to rdf:XMLLiterals,
> > but at some stage in the processing some level of augmentation takes
> > place, and data that is identifiable as an XML Schema simple type is
> > typed as such. This step could be carried out on the server that is
> > storing the data into a triple store, but it might be possible to
> > define the necessary regular expressions to incorporate this step into
> > the RDFa specification.
> >
> > Regards,
> >
> > Mark
> >
> >
> > On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
> >
> >>
> >> Just wanted to chime in on the following, if it's not too late:
> >>
> >> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
> >>
> >> > I am inclined to agree with you on the default datatype: it should just
> >> > be a string, except if you really want some XML. What do others think?
> >> >
> >> > -Ben
> >>
> >> We (our Semantic Web group here at IBM Cambridge) agree that it should
> >> be a string since this is almost certainly going to be the common case.
> >> In our use of RDFa, we always want strings. XMLLiterals should be
> >> specified with the datatype attribute.
> >>
> >> Wing Yung
> >> Internet Technology
> >> wingyung@us.ibm.com
> >> 617.693.3763
> >>
> >>
> >>
> >>
> >
> >
>
> --
>
> Ivan Herman, W3C Semantic Web Activity Lead
> URL: http://www.w3.org/People/Ivan/
> PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
> FOAF: http://www.ivan-herman.net/foaf.rdf
>
>
>


-- 
Mark Birbeck
CEO
x-port.net Ltd.

e: Mark.Birbeck@x-port.net
t: +44 (0) 20 7689 9232
w: http://www.formsPlayer.com/
b: http://internet-apps.blogspot.com/

Download our XForms processor from
http://www.formsPlayer.com/
Received on Monday, 5 February 2007 15:40:15 GMT
