Re: [RDFa] Default datatype should be a string

Mark,

Good reply. There is possibly one more approach, though it is a
little less attractive in some respects: let the presence of child
mark-up decide the type. So

<h1 property="dc:title">E = mc<sup>2</sup>: woh!</h1>

gives an XMLLiteral, and

<h1 property="dc:title">My life in the bush of ghosts</h1>

is a plain literal. The latter is then equivalent (in the triples produced) to

<h1 property="dc:title" content="My life in the bush of ghosts"/>
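
As a rough sketch of that rule (not a real RDFa parser; the function
name literal_for is made up for illustration, and a proper
implementation would also canonicalize the XML literal and handle
namespaces), in Python it might look like this:

```python
import xml.etree.ElementTree as ET

RDF_XMLLITERAL = "rdf:XMLLiteral"


def literal_for(xhtml: str):
    """Return (lexical form, datatype) for one element carrying @property.

    Hypothetical helper: datatype None stands for a plain literal.
    """
    elem = ET.fromstring(xhtml)

    # @content always yields a plain literal; attributes cannot hold mark-up.
    if "content" in elem.attrib:
        return elem.attrib["content"], None

    # No child elements: the text alone is a plain literal.
    if len(elem) == 0:
        return elem.text or "", None

    # Child elements present: serialize the inner content as an XML literal.
    # ET.tostring() on a child also emits the text that follows it (its tail).
    inner = (elem.text or "") + "".join(
        ET.tostring(child, encoding="unicode") for child in elem
    )
    return inner, RDF_XMLLITERAL
```

The three examples above then give one XMLLiteral and two identical
plain literals.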

Steven

On Sun, 04 Feb 2007 23:41:55 +0100, Mark Birbeck <mark.birbeck@x-port.net>  
wrote:

>
> Hello all,
>
> I don't think the issue has been understood correctly, so I'll
> re-construct the thought processes I went through when working through
> the use of rdf:XMLLiteral, way, way back, in an early draft. I'm not
> at all suggesting that my solution is beyond dispute :), but if people
> want to change it, I think we all need to understand what the problems
> were that we were originally trying to address.
>
> The issue is not about which of "XML mark-up" or "strings" is the most
> common situation; I take it as given that 'strings' will be more
> common. :) The issue is essentially whether there is any need to
> distinguish between them, and if there isn't, whether we can use that
> fact to our advantage to make RDFa easier to author. I think it's
> important to state the problem this way round, since if there *is* a
> problem with always using XMLLiteral, then of course we can't do what
> I originally proposed!
>
>
> PLAIN VERSUS TYPED LITERALS
>
> To set the context, the first thing to remember is that in RDF,
> xsd:string is *not* the default datatype. In RDF, plain literals do
> not have *any* datatype. The word 'string' is being used very loosely
> in this discussion--from the subject line of the thread to comments
> added--and we need to be clear on what is being proposed.
>
> At first sight the lack of any type seems fine; after all, why should
> we worry that this:
>
>   <div about="">
>     <h1 property="dc:title">RDFa Primer</h1>
>   </div>
>
> can produce this:
>
>   <> dc:title "RDFa Primer" .
>
> But unfortunately we _do_ need to worry. If we take, for example,
> Einstein's famous 1946 article on nuclear weapons, we would obviously
> mark it up as follows:
>
>   <div about="">
>     <h1 property="dc:title">
>       E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>     </h1>
>   </div>
>
> We have to ask what we would *like* this mark-up to generate, and I
> think it's clear we'd want this:
>
>   <>
>     dc:title
>     "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^rdf:XMLLiteral
>     .
>
> But of course this is the crux of the problem; our preference for the
> first example was a plain literal, but our preference for the second
> was an XML literal, so we must now ask what it is that could 'trigger'
> this difference in parsing behaviour.
>
>
> PROPOSAL 1: ALL TEXT IS PLAIN LITERAL
>
> The first option is to say that actually there is no trigger, and that
> _all_ text should be treated as a plain literal unless the author says
> otherwise. So our example would produce this:
>
>   <>
>     dc:title
>     "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"
>     .
>
> To create our original triples, the author would make use of
> @datatype, and write this:
>
>   <div about="">
>     <h1 property="dc:title" datatype="rdf:XMLLiteral">
>       E = mc<sup>2</sup>: The Most Urgent Problem of Our Time
>     </h1>
>   </div>
>
> At the time I was working on this I rejected this as probably the
> worst solution. :) My reasoning was simply that in examples such as
> this, the title is _already_ mark-up, since it originates from an
> XHTML document. The author clearly knows what they are doing, and so
> for them to have to repeat the fact that the title is mark-up is
> counter-intuitive, and breaks with the idea that we are 'decorating'
> XHTML, rather than fundamentally modifying it.
>
>
> PROPOSAL 2: ALL TEXT IS XSD:STRING
>
> The second option is also to say there is no trigger, but that instead
> of using plain literals, the data is automatically typed as an
> xsd:string:
>
>   <>
>     dc:title
>     "E = mc<sup>2</sup>: The Most Urgent Problem of Our Time"^^xsd:string
>     .
>
> Although this solves some use cases, as I'll discuss at the end it
> doesn't solve all, and I think we should be very careful with this.
>
>
> PROPOSAL 3: ALL TEXT IS XML LITERAL
>
> The third option--as we know, the one I actually went with--is to flip
> things round, and ask whether the ordinary string (or plain literal)
> couldn't be represented by an rdf:XMLLiteral? So this:
>
>   <div about="">
>     <h1 property="dc:title">RDFa Primer</h1>
>   </div>
>
> parses as this:
>
>   <> dc:title "RDFa Primer"^^rdf:XMLLiteral .
>
> In other words, the 'trigger' to create an rdf:XMLLiteral is any use
> of @property where the object of the statement appears in *mark-up*.
> There is a strong logic to this.
>
> First, the object _really has_ appeared in mark-up. Second, at the
> level of XML itself, it is not a problem that we don't have any 'tags'
> surrounding our text, since "RDFa Primer" is XML just as much as
> "<div>42</div>" is (a fact that XSLT makes great use of). For those
> not familiar with this idea, I'll explain.
>
> Most people are probably familiar with XSLT, so we'll use that to
> illustrate. When XSLT 'outputs' XML, it creates 'external general
> parsed entities', which are defined as:
>
>   [78] extParsedEnt ::= TextDecl? content
>
> The key definition for us here is that of 'content', appearing after
> the optional TextDecl:
>
>   [43] content ::= CharData? ((element | Reference | CDSect | PI | Comment)
>                    CharData?)*
>
> This covers all the 'atoms' of XML, such as elements, character data,
> comments, processing instructions, and so on. In other words, the
> output of an XSLT process does not have to be a full XML document,
> with only one root node, etc. It could be a string, a comment, a
> processing instruction, an element, a list of elements, an element
> followed by a comment followed by an element...you get the picture.
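
One way to get a feel for the 'content' production is to test
fragments against it. The sketch below is only an approximation: it
wraps the fragment in a throwaway root element (the name "wrapper" and
the function is_xml_content are assumptions made for this example) and
asks Python's standard XML parser whether the result is well-formed.
It ignores the optional TextDecl.

```python
import xml.etree.ElementTree as ET


def is_xml_content(fragment: str) -> bool:
    """Approximate check that `fragment` matches XML's 'content' production.

    Hypothetical helper: the fragment is placed inside a dummy root element,
    so bare text, comments, PIs and element lists are all acceptable.
    """
    try:
        ET.fromstring("<wrapper>" + fragment + "</wrapper>")
    except ET.ParseError:
        return False
    return True


# Bare text is 'content' just as much as an element is.
print(is_xml_content("RDFa Primer"))
print(is_xml_content("<div>42</div>"))
# An unclosed element is not.
print(is_xml_content("<div>42"))
```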
>
> I've used XSLT to illustrate the concept, since that is probably what
> many are familiar with, but much closer to home the RDF Concepts
> document talks of rdf:XMLLiteral in *exactly* this way. The document
> links to production 43--the production I quoted above--which means
> that the definition of XML literals in RDF is _already_ that it is not
> just an XML element, but that it can be any of the 'atoms' of
> XML--strings, comments, PIs, nodelists, etc.
>
> More significantly for our discussion, the RDF Concepts document has
> this note:
>
>   Note: RDF applications may use additional equivalence relations, such as
>   that which relates an xsd:string with an rdf:XMLLiteral corresponding to a
>   single text node of the same string.
>
> (See the end of section 5.1.)
>
> What I had in mind was that some server storing the data as triples
> would somehow 'augment' the rdf:XMLLiteral data type to include
> something more specific; at least xsd:string, but perhaps also
> xsd:date, xsd:integer, and so on.
>
> I'll come back to this 'casting' or post-processing in a moment, but
> the main point is that there is a strong argument for saying that any
> data that originates from an XHTML document is *by definition* an
> EGPE, and therefore at the very least cannot be a plain literal (and
> so #1 is out).
>
> I'd also argue that we should be wary of making the default xsd:string,
> since once done it can't be 'undone'. I don't have time to develop
> this point now, but at root is the fact that in XML Schema, an
> xsd:integer is *not* derived from xsd:string. (A vote against #2.)
>
>
> NOTE: Just to tie up all loose ends, for the author who _wants_ plain
> literals--i.e., no datatype at all--the original proposal contained
> the idea that @content should provide 'non-typed' literals:
>
>   <meta property="dc:title" content="RDFa Primer" />
>
>   <> dc:title "RDFa Primer" .
>
> The rationale was that attributes can't contain mark-up anyway, so
> @content could never contain an XMLLiteral.
>
>
> FINALLY...SPARQL
>
> So, now we've looked at the question from the point of view of the
> mark-up, we should look at the problem raised by Ivan concerning
> SPARQL. The main point made is that, when rdf:XMLLiteral is used,
> queries don't always match correctly. However, I don't think that choosing
> plain literals or xsd:strings over rdf:XMLLiterals will necessarily
> solve the problem Ivan is seeing, and I would suggest that in
> situations where you are querying data that you have no control over,
> the str() function should generally be used. (I'd also be interested
> to double-check whether the behaviour seen is correct in relation to
> SPARQL itself, but I'll have to look at that later.)
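
To illustrate the mismatch, here is a toy model in Python. It is not a
SPARQL engine, and the Literal class and both helper functions are
invented for this example. It only shows that term equality takes the
datatype into account, whereas comparing the lexical forms (roughly
what wrapping a term in str() does in a FILTER) does not.

```python
from typing import NamedTuple, Optional

RDF_XMLLITERAL = "rdf:XMLLiteral"


class Literal(NamedTuple):
    """Toy RDF literal: a lexical form plus an optional datatype."""
    lexical: str
    datatype: Optional[str] = None  # None models a plain literal


def term_equal(a: Literal, b: Literal) -> bool:
    # Term equality: lexical form AND datatype must both agree.
    return a == b


def str_equal(a: Literal, b: Literal) -> bool:
    # Rough analogue of FILTER(str(?x) = "..."): compare lexical forms only.
    return a.lexical == b.lexical


stored = Literal("RDFa Primer", RDF_XMLLITERAL)  # what the parser produced
wanted = Literal("RDFa Primer")                  # plain literal in the query
```

Here term_equal(stored, wanted) is False while str_equal(stored,
wanted) is True, which is the behaviour that motivates reaching for
str() when you do not control how the data was typed.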
>
>
> CONCLUSION
>
> The ideal solution, in my view, is that we stick to rdf:XMLLiterals,
> but at some stage in the processing some level of augmentation takes
> place, and data that is identifiable as an XML Schema simple type is
> typed as such. This step could be carried out on the server that is
> storing the data in a triple store, but it might be possible to
> define the necessary regular expressions to incorporate this step into
> the RDFa specification.
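
A minimal sketch of what such an augmentation step could look like,
in Python. The patterns are deliberately simplified assumptions (the
date pattern does not check month or day ranges, for instance), the
function name guess_datatype is invented, and the caller is assumed to
have already established that the literal is a single text node.

```python
import re

# Simplified lexical patterns; the real XML Schema definitions are stricter.
_INTEGER = re.compile(r"[+-]?[0-9]+")
_DATE = re.compile(r"-?[0-9]{4,}-[0-9]{2}-[0-9]{2}(Z|[+-][0-9]{2}:[0-9]{2})?")


def guess_datatype(lexical: str) -> str:
    """Guess a more specific type for a single-text-node literal.

    Hypothetical helper: falls back to xsd:string when nothing else matches.
    """
    text = lexical.strip()
    if _INTEGER.fullmatch(text):
        return "xsd:integer"
    if _DATE.fullmatch(text):
        return "xsd:date"
    return "xsd:string"


print(guess_datatype("42"))
print(guess_datatype("2007-02-05"))
print(guess_datatype("RDFa Primer"))
```

Because the stored value stays an rdf:XMLLiteral and the guess is
added on top, a wrong guess can be discarded later, which is not the
case if everything is typed as xsd:string at parse time.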
>
> Regards,
>
> Mark
>
>
> On 02/02/07, Wing C Yung <wingyung@us.ibm.com> wrote:
>>
>> Just wanted to chime in on the following, if it's not too late:
>>
>> http://lists.w3.org/Archives/Public/public-rdf-in-xhtml-tf/2007Jan/0017
>>
>> > I am inclined to agree with you on the default datatype: it should just
>> > be a string, except if you really want some XML. What do others think?
>> >
>> > -Ben
>>
>> We (our Semantic Web group here at IBM Cambridge) agree that it should be a
>> string since this is almost certainly going to be the common case. In our use
>> of RDFa, we always want strings. XMLLiterals should be specified with the
>> datatype attribute.
>>
>> Wing Yung
>> Internet Technology
>> wingyung@us.ibm.com
>> 617.693.3763
>>
>>
>>
>>
>
>

Received on Monday, 5 February 2007 13:25:34 UTC