- From: Ivan Herman <ivan@w3.org>
- Date: Mon, 05 Feb 2007 10:01:40 +0100
- To: mark.birbeck@x-port.net
- CC: public-rdf-in-xhtml-tf@w3.org
- Message-ID: <45C6F274.80204@w3.org>
Mark,

I think there is one consideration really missing from your
argumentation (and that is what made me become the proponent of the
plain literal solution): RDF information retrieved from RDFa should be
easily mashed up with RDF from other sources. And that is what becomes
a problem. The example I give in

http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2006Nov/0000

is real: the current setup with XMLLiteral made it impossible to mash
up the FOAF statements from an RDFa-marked page with a FOAF file simply
edited by hand. (I.e., I had to change my demonstration back then to a
more complicated solution!) The default in the case of RDF/XML and
Turtle is plain literal; we should not depart from that.

I played with the idea of saying "let it be plain literal if no HTML
tags are present, and XMLLiteral otherwise". I.e., your RDFa example
would indeed lead to

  <> dc:title "RDFa Primer"

whereas your Einstein example would lead to an XMLLiteral. But this
seems quite error prone to me. Imagine that one makes up a page with

  <div about="">
    <span property="bla">this is boring</span>
  </div>

which would lead to

  <> bla "this is boring"

but then, *later*, changes the text by saying

  <div about="">
    <span property="bla">this <i>is</i> boring</span>
  </div>

which then suddenly leads to

  <> bla "this <i>is</i> boring"^^XMLLiteral

Although the intention was clearly to change the text from a
presentation point of view only (that is why I used <i> and not <em>),
the triple becomes very different, and previously working queries
might suddenly fail.

As far as I am concerned, the practical issues raised by the XMLLiteral
approach clearly outweigh other considerations...

My two pence...

Ivan

Mark Birbeck wrote:
>
> Hello all,
>
> I don't think the issue has been understood correctly, so I'll
> re-construct the thought processes I went through when working through
> the use of rdf:XMLLiteral, way, way back, in an early draft.
> I'm not at all suggesting that my solution is beyond dispute :), but
> if people want to change it, I think we all need to understand what
> the problems were that we were originally trying to address.
>
> The issue is not about which of "XML mark-up" or "strings" is the most
> common situation; I take it as given that 'strings' will be more
> common. :) The issue is essentially whether there is any need to
> distinguish between them, and if there isn't, whether we can use that
> fact to our advantage to make RDFa easier to author. I think it's
> important to state the problem this way round, since if there *is* a
> problem with always using XMLLiteral, then of course we can't do what
> I originally proposed!
>
>
> PLAIN VERSUS TYPED LITERALS
>
> To set the context, the first thing to remember is that in RDF,
> xsd:string is *not* the default datatype. In RDF, plain literals do
> not have *any* datatype. The word 'string' is being used very loosely
> in this discussion--from the subject line of the thread to comments
> added--and we need to be clear on what is being proposed.
>
> At first sight the lack of any type seems fine; after all, why should
> we worry that this:
>
>   <div about="">
>     <h1 property="dc:title">RDFa Primer</h1>
>   </div>
>
> can produce this:
>
>   <> dc:title "RDFa Primer" .
>
> But unfortunately we _do_ need to worry. If we take, for example,
> Einstein's famous 1946 article on nuclear weapons, we would obviously
> mark it up as follows:
>
>   <div about="">
>     <h1 property="dc:title">
>       E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>     </h1>
>   </div>
>
> We have to ask what would we *like* this mark-up to generate, and I
> think it's clear we'd want this:
>
>   <>
>     dc:title
>       "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
>   .
>
> But of course this is the crux of the problem; our preference for the
> first example was a plain literal, but our preference for the second
> was an XML literal, so we must now ask what it is that could 'trigger'
> this difference in parsing behaviour.
>
>
> PROPOSAL 1: ALL TEXT IS PLAIN LITERAL
>
> The first option is to say that actually there is no trigger, and that
> _all_ text should be treated as a plain literal unless the author says
> otherwise. So our example would produce this:
>
>   <>
>     dc:title
>       "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
>   .
>
> To create our original triples, the author would make use of
> @datatype, and write this:
>
>   <div about="">
>     <h1 property="dc:title" datatype="rdf:XMLLiteral">
>       E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>     </h1>
>   </div>
>
> At the time I was working on this I rejected this as probably the
> worst solution. :) My reasoning was simply that in examples such as
> this, the title is _already_ mark-up, since it originates from an
> XHTML document. The author clearly knows what they are doing, and so
> for them to have to repeat the fact that the title is mark-up is
> counter-intuitive, and breaks with the idea that we are 'decorating'
> XHTML, rather than fundamentally modifying it.
>
>
> PROPOSAL 2: ALL TEXT IS XSD:STRING
>
> The second option is also to say there is no trigger, but that instead
> of using plain literals, the data is automatically typed as an
> xsd:string:
>
>   <>
>     dc:title
>       "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
>   .
>
> Although this solves some use cases, as I'll discuss at the end it
> doesn't solve all, and I think we should be very careful with this.
>
>
> PROPOSAL 3: ALL TEXT IS XML LITERAL
>
> The third option--as we know, the one I actually went with--is to flip
> things round, and ask whether the ordinary string (or plain literal)
> couldn't be represented by an rdf:XMLLiteral?
> So this:
>
>   <div about="">
>     <h1 property="dc:title">RDFa Primer</h1>
>   </div>
>
> parses as this:
>
>   <> dc:title "RDFa Primer"^^rdf:XMLLiteral .
>
> In other words, the 'trigger' to create an rdf:XMLLiteral is any use
> of @property where the object of the statement appears in *mark-up*.
> There is a strong logic to this.
>
> First, the object _really has_ appeared in mark-up. But second, at the
> level of XML itself, it is not a problem that we don't have any 'tags'
> surrounding our text, since (as XSLT makes great use of), "RDFa
> Primer" is XML as much as "<div>42</div>" is. For those not familiar
> with this idea, I'll explain.
>
> Most people are probably familiar with XSLT, so we'll use that to
> illustrate. When XSLT 'outputs' XML, it creates 'external general
> parsed entities', which are defined as:
>
>   [78] extParsedEnt ::= TextDecl? content
>
> The key definition for us here is that of 'content', appearing after
> the optional TextDecl:
>
>   [43] content ::= CharData? ((element | Reference | CDSect | PI |
>                    Comment) CharData?)*
>
> This covers all the 'atoms' of XML, such as elements, character data,
> comments, processing instructions, and so on. In other words, the
> output of an XSLT process does not have to be a full XML document,
> with only one root node, etc. It could be a string, a comment, a
> processing instruction, an element, a list of elements, an element
> followed by a comment followed by an element...you get the picture.
>
> I've used XSLT to illustrate the concept, since that is probably what
> many are familiar with, but much closer to home the RDF Concepts
> document talks of rdf:XMLLiteral in *exactly* this way. The document
> links to production 43--the production I quoted above--which means
> that the definition of XML literals in RDF is _already_ that it is not
> just an XML element, but that it can be any of the 'atoms' of
> XML--strings, comments, PIs, nodelists, etc.
>
> More significantly for our discussion, the RDF Concepts document has
> this note:
>
>   Note: RDF applications may use additional equivalence relations,
>   such as that which relates an xsd:string with an rdf:XMLLiteral
>   corresponding to a single text node of the same string.
>
> (See the end of section 5.1.)
>
> What I had in mind was that some server storing the data as triples
> would somehow 'augment' the rdf:XMLLiteral data type to include
> something more specific; at least xsd:string, but perhaps also
> xsd:date, xsd:integer, and so on.
>
> I'll come back to this 'casting' or post-processing in a moment, but
> the main point is that there is a strong argument for saying that any
> data that originates from an XHTML document is *by definition* an
> EGPE, and therefore at the very least cannot be a plain literal (and
> so #1 is out).
>
> I'd also argue that we should be wary of making the default xsd:string
> since once done it can't be 'undone'. I don't have time to develop
> this point now, but at root is the fact that in XML Schemas, an
> xsd:integer is *not* derived from xsd:string. (A vote against #2.)
>
>
> NOTE: Just to tie up all loose ends, for the author who _wants_ plain
> literals--i.e., no datatype at all--the original proposal contained
> the idea that @content should provide 'non-typed' literals:
>
>   <meta property="dc:title" content="RDFa Primer" />
>
>   <> dc:title "RDFa Primer" .
>
> The rationale was that attributes can't contain mark-up anyway, so
> @content could never contain an XMLLiteral.
>
>
> FINALLY...SPARQL
>
> So, now we've looked at the question from the point of view of the
> mark-up, we should look at the problem raised by Ivan concerning
> SPARQL. The main point made is that by using rdf:XMLLiteral queries
> don't always match correctly.
> However, I don't think that choosing plain literals or xsd:strings
> over rdf:XMLLiterals will necessarily solve the problem Ivan is
> seeing, and I would suggest that in situations where you are querying
> data that you have no control over, the str() function should
> generally be used. (I'd also be interested to double-check whether the
> behaviour seen is correct in relation to SPARQL itself, but I'll have
> to look at that later.)
>
>
> CONCLUSION
>
> The ideal solution in my view, is that we stick to rdf:XMLLiterals,
> but at some stage in the processing some level of augmentation takes
> place, and data that is identifiable as an XML Schema simple type is
> typed as such. This step could be carried out on the server that is
> storing the data into a triple store, but it might be possible to
> define the necessary regular expressions to incorporate this step into
> the RDFa specification.
>
> Regards,
>
> Mark
>
>
> On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
>
>>
>> Just wanted to chime in on the following, if it's not too late:
>>
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
>>
>> > I am inclined to agree with you on the default datatype: it should just
>> > be a string, except if you really want some XML. What do others think?
>> >
>> > -Ben
>>
>> We (our Semantic Web group here at IBM Cambridge) agree that it should
>> be a string since this is almost certainly going to be the common
>> case. In our use of RDFa, we always want strings. XMLLiterals should
>> be specified with the datatype attribute.
>>
>> Wing Yung
>> Internet Technology
>> wingyung@us.ibm.com
>> 617.693.3763
>>
>

--
Ivan Herman, W3C Semantic Web Activity Lead
URL: http://www.w3.org/People/Ivan/
PGP Key: http://www.cwi.nl/%7Eivan/AboutMe/pgpkey.html
FOAF: http://www.ivan-herman.net/foaf.rdf
Received on Monday, 5 February 2007 09:01:47 UTC
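
The matching problem discussed in the message above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not an RDFa parser or any real RDF library: the `Literal` class and the helpers `heuristic_literal` and `lexical_form` are hypothetical names introduced here for illustration. It assumes that two literals are the same RDF term only when both the lexical form and the datatype agree, and that SPARQL's str() can be approximated by taking the lexical form.

```python
import re
from dataclasses import dataclass
from typing import Optional

XML_LITERAL = "http://www.w3.org/1999/02/22-rdf-syntax-ns#XMLLiteral"


@dataclass(frozen=True)
class Literal:
    """Toy RDF literal (hypothetical): lexical form plus optional datatype IRI."""
    lexical: str
    datatype: Optional[str] = None  # None models a plain literal


def heuristic_literal(inner_xhtml: str) -> Literal:
    """The 'plain literal unless tags are present' rule weighed in the message."""
    if re.search(r"<[A-Za-z/!?]", inner_xhtml):
        return Literal(inner_xhtml, XML_LITERAL)
    return Literal(inner_xhtml)


def lexical_form(lit: Literal) -> str:
    """Rough stand-in for SPARQL's str(): ignore the datatype, keep the text."""
    return lit.lexical


# A hand-edited file would typically carry a plain literal.
hand_written = Literal("this is boring")

# Typing all element content as XMLLiteral yields a different term.
always_xml = Literal("this is boring", XML_LITERAL)
print(hand_written == always_xml)                              # False
print(lexical_form(hand_written) == lexical_form(always_xml))  # True

# Under the heuristic, adding <i> silently changes the datatype.
before = heuristic_literal("this is boring")
after = heuristic_literal("this <i>is</i> boring")
print(before == hand_written)         # True
print(after.datatype == XML_LITERAL)  # True
```

The first comparison mirrors the mash-up failure (same text, different terms), the second mirrors the str() workaround, and the last two show how a purely presentational edit flips the datatype under the tag-detection rule.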