Re: [RDFa] Default datatype should be a string from Mark Birbeck on 2007-02-04 (public-rdf-in-xhtml-tf@w3.org from February 2007)

From: Mark Birbeck <mark.birbeck@x-port.net>
Date: Sun, 4 Feb 2007 22:41:55 +0000
To: public-rdf-in-xhtml-tf@w3.org
Message-ID: <640dd5060702041441k174f961cl989c570cc057877f@mail.gmail.com>
Hello all,

I don't think the issue has been understood correctly, so I'll
re-construct the thought processes I went through when working through
the use of rdf:XMLLiteral, way, way back, in an early draft. I'm not
at all suggesting that my solution is beyond dispute :), but if people
want to change it, I think we all need to understand what the problems
were that we were originally trying to address.

The issue is not about which of "XML mark-up" or "strings" is the most
common situation; I take it as given that 'strings' will be more
common. :) The issue is essentially whether there is any need to
distinguish between them, and if there isn't, whether we can use that
fact to our advantage to make RDFa easier to author. I think it's
important to state the problem this way round, since if there *is* a
problem with always using XMLLiteral, then of course we can't do what
I originally proposed!


PLAIN VERSUS TYPED LITERALS

To set the context, the first thing to remember is that in RDF,
xsd:string is *not* the default datatype. In RDF, plain literals do
not have *any* datatype. The word 'string' is being used very loosely
in this discussion--from the subject line of the thread to comments
added--and we need to be clear on what is being proposed.

At first sight the lack of any type seems fine; after all, why should
we worry that this:

  <div about="">
    <h1 property="dc:title">RDFa Primer</h1>
  </div>

can produce this:

  <> dc:title "RDFa Primer" .

But unfortunately we _do_ need to worry. If we take, for example,
Einstein's famous 1946 article on nuclear weapons, we would obviously
mark it up as follows:

  <div about="">
    <h1 property="dc:title">
      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
    </h1>
  </div>

We have to ask what we would *like* this mark-up to generate, and I
think it's clear we'd want this:

  <>
    dc:title
    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
    .

But of course this is the crux of the problem; our preference for the
first example was a plain literal, but our preference for the second
was an XML literal, so we must now ask what it is that could 'trigger'
this difference in parsing behaviour.


PROPOSAL 1: ALL TEXT IS PLAIN LITERAL

The first option is to say that actually there is no trigger, and that
_all_ text should be treated as a plain literal unless the author says
otherwise. So our example would produce this:

  <>
    dc:title
    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
    .

To create our original triples, the author would make use of
@datatype, and write this:

  <div about="">
    <h1 property="dc:title" datatype="rdf:XMLLiteral">
      E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
    </h1>
  </div>

At the time I was working on this I rejected this as probably the
worst solution. :) My reasoning was simply that in examples such as
this, the title is _already_ mark-up, since it originates from an
XHTML document. The author clearly knows what they are doing, and so
for them to have to repeat the fact that the title is mark-up is
counter-intuitive, and breaks with the idea that we are 'decorating'
XHTML, rather than fundamentally modifying it.


PROPOSAL 2: ALL TEXT IS XSD:STRING

The second option is also to say there is no trigger, but that instead
of using plain literals, the data is automatically typed as an
xsd:string:

  <>
    dc:title
    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
    .

Although this solves some use cases, as I'll discuss at the end it
doesn't solve all, and I think we should be very careful with this.


PROPOSAL 3: ALL TEXT IS XML LITERAL

The third option--as we know, the one I actually went with--is to flip
things round, and ask whether the ordinary string (or plain literal)
couldn't be represented by an rdf:XMLLiteral? So this:

  <div about="">
    <h1 property="dc:title">RDFa Primer</h1>
  </div>

parses as this:

  <> dc:title "RDFa Primer"^^rdf:XMLLiteral .

In other words, the 'trigger' to create an rdf:XMLLiteral is any use
of @property where the object of the statement appears in *mark-up*.
There is a strong logic to this.

First, the object _really has_ appeared in mark-up. But second, at the
level of XML itself, it is not a problem that we don't have any 'tags'
surrounding our text, since "RDFa Primer" is XML just as much as
"<div>42</div>" is (a fact that XSLT makes great use of). For those
not familiar with this idea, I'll explain.

Most people are probably familiar with XSLT, so we'll use that to
illustrate. When XSLT 'outputs' XML, it creates 'external general
parsed entities', which are defined as:

  [78] extParsedEnt ::= TextDecl? content

The key definition for us here is that of 'content', appearing after
the optional TextDecl:

  [43] content ::= CharData? ((element | Reference | CDSect | PI |
                   Comment) CharData?)*

This covers all the 'atoms' of XML, such as elements, character data,
comments, processing instructions, and so on. In other words, the
output of an XSLT process does not have to be a full XML document,
with only one root node, etc. It could be a string, a comment, a
processing instruction, an element, a list of elements, an element
followed by a comment followed by an element...you get the picture.

I've used XSLT to illustrate the concept, since that is probably what
many are familiar with, but much closer to home the RDF Concepts
document talks of rdf:XMLLiteral in *exactly* this way. The document
links to production 43--the production I quoted above--which means
that the definition of XML literals in RDF is _already_ that it is not
just an XML element, but that it can be any of the 'atoms' of
XML--strings, comments, PIs, nodelists, etc.
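
To make the point about production 43 concrete, here is a rough Python
sketch. It is only an approximation of the 'content' production: it
wraps a fragment in a dummy element and asks a standard XML parser
whether the result is well-formed. The function name is_xml_content and
the wrapper element name are my own inventions for the illustration,
not anything from the XML or RDF specifications.

```python
import xml.etree.ElementTree as ET

def is_xml_content(fragment):
    """Approximate test for production [43]: does the fragment parse
    as the content of an element? (Assumption: wrapping in a dummy
    element is a good enough stand-in for the real production.)"""
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        return False
    return True

# Plain character data, a single element, and a mixed list of 'atoms'
# are all acceptable content; unbalanced mark-up is not.
for fragment in [
    "RDFa Primer",
    "<div>42</div>",
    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time",
    "<!-- a comment --><?target data?><a/><b/>",
    "<div>42",
]:
    print(repr(fragment), is_xml_content(fragment))
```

The first four fragments are accepted and the last is rejected, which
is the sense in which a bare string is 'XML' just as much as an
element is.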

More significantly for our discussion, the RDF Concepts document has this note:

  Note: RDF applications may use additional equivalence relations, such as
  that which relates an xsd:string with an rdf:XMLLiteral corresponding to a
  single text node of the same string.

(See the end of section 5.1.)

What I had in mind was that some server storing the data as triples
would somehow 'augment' the rdf:XMLLiteral data type to include
something more specific; at least xsd:string, but perhaps also
xsd:date, xsd:integer, and so on.

I'll come back to this 'casting' or post-processing in a moment, but
the main point is that there is a strong argument for saying that any
data that originates from an XHTML document is *by definition* an
external general parsed entity (EGPE), and therefore at the very least
cannot be a plain literal (and so #1 is out).

I'd also argue that we should be wary of making the default xsd:string,
since once done it can't be 'undone'. I don't have time to develop
this point now, but at root is the fact that in XML Schema,
xsd:integer is *not* derived from xsd:string. (A vote against #2.)


NOTE: Just to tie up all loose ends, for the author who _wants_ plain
literals--i.e., no datatype at all--the original proposal contained
the idea that @content should provide 'non-typed' literals:

  <meta property="dc:title" content="RDFa Primer" />

  <> dc:title "RDFa Primer" .

The rationale was that attributes can't contain mark-up anyway, so
@content could never contain an XMLLiteral.
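
For anyone who wants to compare the three proposals side by side, the
following is a minimal Python sketch, not an RDFa parser. The policy
names ("plain", "xsd-string", "xml-literal") and the helper functions
are hypothetical, made up for this illustration. It also assumes that
@content always yields a plain literal, that an explicit @datatype
overrides the default, and that leading and trailing whitespace is
trimmed; none of those details is settled by this message.

```python
import xml.etree.ElementTree as ET

# Hypothetical policy names, one per proposal in this message.
DEFAULT_DATATYPE = {
    "plain": None,                    # Proposal 1: plain literal
    "xsd-string": "xsd:string",       # Proposal 2
    "xml-literal": "rdf:XMLLiteral",  # Proposal 3
}

def inner_xml(elem):
    """Serialise an element's content (text plus child mark-up),
    without the element's own start and end tags."""
    parts = [elem.text or ""]
    for child in elem:
        # tostring() also emits the child's tail text.
        parts.append(ET.tostring(child, encoding="unicode"))
    return "".join(parts).strip()

def literal_for(elem, policy):
    """Return (lexical form, datatype or None) for an element that
    carries @property, under the given default-datatype policy."""
    if "content" in elem.attrib:
        # Assumption: @content always produces an untyped literal.
        return (elem.attrib["content"], None)
    datatype = elem.attrib.get("datatype", DEFAULT_DATATYPE[policy])
    return (inner_xml(elem), datatype)

einstein = ET.fromstring(
    '<h1 property="dc:title">'
    "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
    "</h1>"
)
for policy in DEFAULT_DATATYPE:
    print(policy, literal_for(einstein, policy))
```

The lexical form is the same in all three cases; only the datatype
attached to it changes, which is exactly the choice under discussion.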


FINALLY...SPARQL

So, now we've looked at the question from the point of view of the
mark-up, we should look at the problem raised by Ivan concerning
SPARQL. The main point made is that, when rdf:XMLLiteral is used, queries
don't always match correctly. However, I don't think that choosing
plain literals or xsd:strings over rdf:XMLLiterals will necessarily
solve the problem Ivan is seeing, and I would suggest that in
situations where you are querying data that you have no control over,
the str() function should generally be used. (I'd also be interested
to double-check whether the behaviour seen is correct in relation to
SPARQL itself, but I'll have to look at that later.)
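
To illustrate why str() helps, here is a toy Python model of literal
matching. It is not a SPARQL engine: the Literal tuple and the
sparql_str helper are stand-ins I have made up, on the assumption that
a literal written in a triple pattern only matches a stored term with
the same lexical form *and* the same datatype, whereas a filter such as
FILTER (str(?title) = "RDFa Primer") compares lexical forms only.

```python
from typing import NamedTuple, Optional

class Literal(NamedTuple):
    """Toy RDF literal: a lexical form plus an optional datatype."""
    lexical: str
    datatype: Optional[str] = None

def sparql_str(term):
    """Stand-in for SPARQL's str(): the lexical form, datatype dropped."""
    return term.lexical

# The same title as it might be stored under each default.
stored = [
    Literal("RDFa Primer"),
    Literal("RDFa Primer", "xsd:string"),
    Literal("RDFa Primer", "rdf:XMLLiteral"),
]

wanted = Literal("RDFa Primer")  # a plain literal in the query pattern

exact_matches = [t for t in stored if t == wanted]
str_matches = [t for t in stored if sparql_str(t) == "RDFa Primer"]

print(len(exact_matches), "exact match(es)")
print(len(str_matches), "match(es) via str()")
```

In this model the exact comparison finds only the plain literal, while
the str() comparison finds all three, whichever default the publisher
of the data happened to choose.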


CONCLUSION

The ideal solution, in my view, is that we stick to rdf:XMLLiterals,
but at some stage in the processing some level of augmentation takes
place, and data that is identifiable as an XML Schema simple type is
typed as such. This step could be carried out on the server that is
storing the data into a triple store, but it might be possible to
define the necessary regular expressions to incorporate this step into
the RDFa specification.
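
As a rough idea of what such an augmentation step might look like,
here is a Python sketch. The function name augment, the two regular
expressions, and the fallback to xsd:string for any other single text
node are all assumptions of mine for the sake of illustration; the
patterns check only the general shape of a value (the date pattern
would accept month 13, for instance) and are not the XML Schema
definitions.

```python
import re

# Illustrative patterns only; they approximate the lexical shape of
# two XML Schema simple types and do not validate the values.
PATTERNS = [
    ("xsd:integer", re.compile(r"[+-]?[0-9]+")),
    ("xsd:date", re.compile(
        r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}(Z|[+-][0-9]{2}:[0-9]{2})?")),
]

def augment(lexical, datatype):
    """Return a more specific datatype for an rdf:XMLLiteral that is
    just a single text node; leave everything else untouched."""
    if datatype != "rdf:XMLLiteral":
        return datatype
    if "<" in lexical or "&" in lexical:
        # Contains mark-up, comments, PIs or references: keep as XML.
        return datatype
    for name, pattern in PATTERNS:
        if pattern.fullmatch(lexical.strip()):
            return name
    # Assumption: any other single text node can be treated as a string.
    return "xsd:string"

for lexical in ["42", "2007-02-04", "RDFa Primer", "E = mc<sup>2</sup>"]:
    print(repr(lexical), "->", augment(lexical, "rdf:XMLLiteral"))
```

Note that the parser's output is unchanged by this step; the original
rdf:XMLLiteral is still what RDFa generates, and the more specific
type is added afterwards.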

Regards,

Mark


On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
>
> Just wanted to chime in on the following, if it's not too late:
>
> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
>
> > I am inclined to agree with you on the default datatype: it should just
> > be a string, except if you really want some XML. What do others think?
> >
> > -Ben
>
> We (our Semantic Web group here at IBM Cambridge) agree that it should be a
> string since this is almost certainly going to be the common case. In our use
> of RDFa, we always want strings. XMLLiterals should be specified with the
> datatype attribute.
>
> Wing Yung
> Internet Technology
> wingyung@us.ibm.com
> 617.693.3763


-- 
Mark Birbeck
CEO
x-port.net Ltd.

e: Mark.Birbeck@x-port.net
t: +44 (0) 20 7689 9232
w: http://www.formsPlayer.com/
b: http://internet-apps.blogspot.com/

Download our XForms processor from
http://www.formsPlayer.com/
Received on Sunday, 4 February 2007 22:42:12 UTC