Re: literals, again. from Graham Klyne on 2002-06-27 (w3c-rdfcore-wg@w3.org from June 2002)

From: Graham Klyne <Graham.Klyne@MIMEsweeper.com>
Date: Thu, 27 Jun 2002 20:52:32 +0100
To: pat hayes <phayes@ai.uwf.edu>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <5.1.0.14.2.20020627202909.03ce80b0@joy.songbird.com>

I guess a number of questions are raised by Pat's message:
(a) has enough new information been provided to reopen the issue?
(b) do we agree that literals-as-tuples is ugly?
(c) do we agree that extending the graph syntax to allow literals as 
subjects is desirable?

Regarding (a), I think the answer is procedurally based rather than 
technically based, and I don't have a view.  My remaining comments are 
predicated on the group and/or chair deciding the answer is yes.

Regarding (b), I tend to agree, but I've yet to understand fully why this 
is as serious an issue as Pat suggests.

Regarding (c), I think that allowing literals as subjects in the graph 
syntax would be a natural and useful thing, and in the absence of 
procedural considerations would wholeheartedly support such a move.  In 
saying this, I presume it's understood that not all well-formed graphs can 
be represented in the RDF/XML syntax, but that's nothing new -- we already 
have that situation.

If (c) is accepted, then I think it reasonable that some other issues be 
reconsidered in light of that.

#g
--

At 01:23 PM 6/27/02 -0500, pat hayes wrote:
>Apologies to all for reopening this business, but I've been descending into 
>a black hole on this issue since the F2F. I really, really do not like the 
>situation we seem to be in here regarding literals, for a whole lot of 
>reasons. This message tries to enumerate them and to suggest a possible 
>solution. I know this has been done before and apologize in advance for 
>missing the obvious objections that have probably already been raised.
>
>Summary: literals ought to be strings, not 3-tuples. To achieve this, 
>literals should be allowed in subject position. If we do this, a number of 
>issues are clarified and the overall design of the language is 
>rationalized and simplified, at almost no cost. Also, data structuring 
>suddenly gets easier.
>
>--------
>
>First let me summarize my understanding of the current state of play.
>
>1. Literals are not strings.
>2. Literals are 3-tuples, consisting of a bit and two strings, the second 
>of which is an XML lang tag.
>
>There are some constraints on these three parts of a literal.
>
>3. If the bit is set (I'm not sure if that is a one or a zero) then the 
>second string is required to be well-formed XML.
>4. If the bit is set, then the lang tag is understood to be the lang tag 
>of the second string.
>5. If the bit is not set, then the lang tag is understood to indicate that 
>the second string is an expression of that language.
>
>This has some curious consequences.
>
>First, (3) has the consequence that any RDF engine must include an XML 
>parser, even if it is not expected to parse RDF/XML. The graph syntax has 
>all of XML syntax incorporated into it. This seems to me to be the most 
>serious problem. Under these circumstances it hardly seems worth having 
>the graph syntax: it would be better to just identify RDF with RDF/XML and 
>stop pretending.
>
>Second, (4) and (5) interact in odd ways. Notice that the meaning of the 
>lang tag changes according to the bit setting. Suppose that the second 
>string is, in fact, well-formed XML and the lang tag is "FR". If the bit 
>is set, then tout est bien; but if the bit is not set, then this 
>combination is a disaster, since that well-formed XML is a mere string, 
>its resemblance to XML a mere accident, and then the lang tag is claiming 
>that it is a French text, which of course (being well-formed XML) it is 
>not, in fact. I am not sure what an RDF engine is supposed to do at this 
>point: would it be expected to reject this as an incoherent literal?
>
>Third, since we have decided that literals denote themselves, this 
>construction means that these ad-hoc 3-tuples must be in the semantic 
>domain of any RDF interpretation, and hence of any interpretation of any 
>language which extends RDF. This is at best an unattractive consequence, 
>and it might be disastrous if other languages expect to handle literals 
>differently (as I am sure they will). It seems particularly weird to 
>incorporate XML *syntax* into the semantic domain of all web ontology 
>languages. Notice that we do not have the option of making these 3-tuples 
>'optional' once they are in the semantic domain.
>
>Further to the previous point, datatyping simply does not work. All the 
>datatyping proposals so far considered - ALL of them - have been 
>predicated on the assumption that literals are members of the lexical 
>space of XML datatypes. But 3-tuples consisting of a bit and two strings 
>are not in the lexical space of any XML datatype. So we simply cannot do 
>datatyping in RDF at present, seems to me, with literals the way they are.
>
>Finally, there is a purely aesthetic reason which one might reject, but it 
>is worth mentioning: there is an obvious analogy between the lang tag and 
>a datatype, and it would be nice if the overall RDF scheme of things 
>preserved this analogy.
>
>For all these reasons, I would suggest that the decision to make literals 
>(in the graph syntax) into things other than strings was a very bad one, 
>and should be reconsidered. (I would have said this a long time ago if I 
>had realized that this was in fact the decision that had been taken, and 
>I'm sorry that I missed that in the weeks following the Cannes meeting.)
>
>------
>I gather that the motivation for this decision was twofold: the bit is 
>there to record the parsetype being XML, to allow accurate XML 
>round-tripping; and the lang tag is needed by some RDF users. (I find 
>myself sceptical that the bit is actually needed: would it really be a 
>disaster if a string which just happened to be legal XML, but hadn't been 
>parsed from XML input, was accidentally misreported as being true XML? 
>This seems to me like worrying about the possibility that monkeys might 
>accidentally type out some Milton sonnets. But never mind.)
>
>Now, it seems to me that both of these are examples of information *about* 
>a literal string which needs to be recorded in an unambiguous form. Since 
>RDF is itself a language for asserting things about things, the obvious 
>way to record information in an RDF graph is to use triples to make such 
>assertions; but the obvious problem in doing that is that the literal in 
>question would naturally be the subject of the triple, and literals cannot 
>be subjects. Damn.
>
>We have been here before. Datatyping would be more natural if literals 
>could be subjects. Tim has asked us to consider the possibility that 
>literals could be subjects, and to recommend this change to the next WG if 
>we feel unable to do it. It seems to me that everyone who has ever 
>considered this matter has agreed that the restriction against literals in 
>subject position is irrational. So, let's consider what would happen if we 
>lifted it. In what follows, therefore, I will assume that literal nodes in 
>the graph are labelled with strings, and that these nodes can be in the 
>subject position of a triple.
>
>The information recorded in the literals at present as extra syntactic 
>constructs could be represented naturally by introducing special 
>properties, eg:
>
>"10"  rdf:sourceDeclaration  "xml version="1.0"" .
>"10"  rdf:xmlLang  "FR"  .
>
>(where this is supposed to be a graph with three nodes and two arcs, by 
>the way.) Obviously, the names can be changed to protect the innocent. 
>Both of these mention xml deliberately, since this stuff is entirely to do 
>with xml round-tripping and preserving xml-specific data in an RDF graph.
>
>Having literal subjects also allows datatyping constructions which are all 
>based on triples of the form
>
><literal> <datatype mapping> <bnode> .
>
>where the datatype mapping goes, as one would expect, from the lexical 
>form to the value. For example, we could write
>
>Jenny ex:age "10"  .
>"10" xsd:number _:x .
>Jim ex:age _:x .
>
>(4 nodes total) which means that Jenny's age is a string and Jim's age is 
>the number ten. Depending which end of the datatype triple you use as your 
>object node, you get either the lexical thingie or the value thingie. The 
>same datatype triple asserts that the literal is in the lexical space and 
>that the value is the right value. Simple, regular, conforms to the XML 
>description, and easy to understand.
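
A minimal Python sketch of the four-node example above, assuming literal nodes are plain strings that may appear as subjects. The `Literal` and `BNode` classes, the `denotation` and `age` helpers, and the use of `int` as the mapping for `xsd:number` are illustrative assumptions, not part of the proposal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Literal:
    """A literal node labelled with a plain string (assumed representation)."""
    text: str


@dataclass(frozen=True)
class BNode:
    """A blank node; the id is only a local handle."""
    id: str


# Triples are (subject, property, object). Literal("10") is a subject here.
graph = {
    ("Jenny", "ex:age", Literal("10")),
    (Literal("10"), "xsd:number", BNode("x")),
    ("Jim", "ex:age", BNode("x")),
}

# Assumption: each datatype property maps a lexical form to a value.
datatypes = {"xsd:number": int}


def nodes(g):
    """Every distinct subject or object in the graph."""
    return {s for s, _, _ in g} | {o for _, _, o in g}


def denotation(node, g, mappings):
    """A literal stands for its own string; a bnode that is the object of a
    datatype triple stands for the mapped value of the literal subject."""
    if isinstance(node, Literal):
        return node.text
    for s, p, o in g:
        if o == node and isinstance(s, Literal) and p in mappings:
            return mappings[p](s.text)
    return None


def age(person, g, mappings):
    """Look up ex:age for a person and return what the object node denotes."""
    for s, p, o in g:
        if s == person and p == "ex:age":
            return denotation(o, g, mappings)
    return None
```

Under these assumptions, `age("Jenny", graph, datatypes)` gives the string form and `age("Jim", graph, datatypes)` gives the number, depending on which end of the datatype triple is used as the object.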
>
>Also, range datatyping now can be done using a matched pair of closure rules:
>
>PPP  rdfd:drange DDD .
>AAA PPP LLL .
>-->
>LLL DDD _:x .
>
>PPP  rdfd:drange DDD .
>AAA PPP BBB .
>BBB rdfd:lex LLL .
>-->
>LLL DDD BBB .
>
>for literals LLL and non-literals BBB, which again is easy to follow, and 
>kind of fits under a single idiom, where the datatype mapping has the same 
>relationship to the datatype class that rdf:type has to a particular 
>rdfs:class.
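
The two closure rules can be sketched in Python as a single pass over a list of triples. This is only an illustration: terms are plain strings, a term is taken to be a literal if it is wrapped in double quotes, and the fresh bnode names `_:g0`, `_:g1`, ... are assumed not to clash with names already in the graph. The function name `drange_closure` is hypothetical.

```python
from itertools import count


def is_literal(term):
    # Assumption: literals are written with surrounding double quotes.
    return len(term) >= 2 and term.startswith('"') and term.endswith('"')


def drange_closure(triples):
    """Apply both range-datatyping rules once and return the inferred triples."""
    fresh = (f"_:g{i}" for i in count())  # supply of new bnode names
    ranges = [(s, o) for s, p, o in triples if p == "rdfd:drange"]
    inferred = []
    for prop, dtype in ranges:
        for _, p, obj in triples:
            if p != prop:
                continue
            if is_literal(obj):
                # Rule 1: PPP rdfd:drange DDD . AAA PPP LLL . --> LLL DDD _:x .
                inferred.append((obj, dtype, next(fresh)))
            else:
                # Rule 2: ... AAA PPP BBB . BBB rdfd:lex LLL . --> LLL DDD BBB .
                for b, q, lex in triples:
                    if b == obj and q == "rdfd:lex" and is_literal(lex):
                        inferred.append((lex, dtype, obj))
    return inferred


example = [
    ("ex:age", "rdfd:drange", "xsd:number"),
    ("Jenny", "ex:age", '"10"'),
    ("Jim", "ex:age", "_:b"),
    ("_:b", "rdfd:lex", '"12"'),
]
```

Calling `drange_closure(example)` yields one datatype triple per rule: one with a fresh bnode as the value of `"10"`, and one that ties `"12"` to the existing node `_:b`.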
>
>By the way, amazingly enough, these rules work both for the current 
>'stake' datatyping AND for the alternative; the difference is in how you 
>interpret what a datatype triple asserts, exactly. The current version 
>says that the literal denotes itself and the property is the datatype 
>mapping; the alternative would say that the literal denotes the value and 
>the property is identity. (View in fixed-width font:)
>
>Current:
>
>literal subject ---<datatype property>---> object
>    |                                        |        (graph syntax)
>  denotes                                 denotes   -----------------------
>    |                                        |        (semantic interpretation)
>    \/                                       \/
>lexical item ----- datatype-mapping -----> value
>
>Alternative:
>
>literal subject ---<datatype property>---> object
>    |                                        |        (graph syntax)
>  denotes via datatype mapping            denotes   -----------------------
>    |                                        |        (semantic interpretation)
>    \/                                       \/
>value        -----     equality     -----> value
>
>
>The only difference here is that the label of the bottom arc in the first 
>square has been moved to the right-hand arc in the second one. The top and 
>right-hand sides are the same, and both squares commute, ie you get to the 
>same place by any path.
>
>-----
>
>The cost would be that we would have to allow untidy literals. But I am 
>increasingly thinking that our insistence on (syntactically) tidy literals 
>is irrational. Bear in mind that this is an issue ONLY in the graph 
>syntax. In any lexicalisation of an RDF graph (RDF/XML or Ntriples, but 
>also any other lexicalization that could be sent as a character string), 
>the graph-ids themselves cannot possibly be syntactically tidy.
>
>The interesting thing is that we can have (a kind of) semantic tidiness 
>even when the graph syntax isn't tidy. The way to do this is to say that 
>what a literal node denotes is a particular *occurrence* of a character 
>string; the one on that node, in fact. That is, literal *nodes* denote 
>themselves. There can be several character strings "10" going around, and 
>one of them might be in French and another of them might be labelled as 
>being XML, and so on. But any properties of them which depend only on the 
>actual characters in them apply to all of them or to none of them 
>together. The situation is just like properties in RDF: there could be two 
>properties with the same property extension (each is a subPropertyOf the 
>other) but themselves having different properties. Similarly for classes. 
>We would be putting literals into the same kind of 'intensional' category: 
>there could be two different literals with the same string. Identity for 
>literals is indicated by the actual node in the graph syntax. Notice that 
>all the examples so far have been tidy, in fact.
>
>As with any other untidy literals proposal, this would require a slight 
>extension to the N-triples notation to provide a way to indicate which 
>literals in which triples were supposed to be on the same node; we could 
>do this by the same device that we currently use for bnodes, so that the 
>above examples might look like this:
>
>_:x "10"  rdf:sourceDeclaration  "xml version="1.0"" .
>_:x "10"  rdf:xmlLang  "FR"  .
>
>Jenny ex:age _:y "10" .
>_:y "10" xsd:number _:x .
>Jim ex:age _:x .
>
>where the first one uses the additional (handy) convention that if no node 
>ID is indicated for a literal, then the literal is unique to its node.
>
>What about the Cannes entailment? Well, it depends now on the details of 
>the graph. If there is only one literal *node* which is both the age of 
>Jenny and the title of some movie, then indeed you can conclude that there 
>is a single thing which is both an age and a title. The inference rule can 
>be succinctly stated as follows: given any node in any graph, it is OK to 
>erase its label. That is, you can replace a uriref with a bnode, and you 
>can rub out any literal label of a node (leaving a bnode). On the other 
>hand, if the original graph has got two literal nodes, both labelled "10", 
>and one is the title and the other is the age, then no, you can't infer 
>that there is a single thing that is both. But then you couldn't do that 
>even if the nodes were bnodes, in this case, so you shouldn't *expect* to 
>be able to make this inference in this case, seems to me. (I'm sure that 
>Dan's response will be that he doesn't want this to be a legal graph, for 
>just this reason.)
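
A small Python sketch of the point about node identity, under the assumption that a node is an object compared by identity and that its label (a string, or `None` for a bnode) is just an attribute. The `Node` class and the `shared_objects` and `erase_label` helpers are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # eq=False: nodes compare by identity, not by label
class Node:
    label: Optional[str] = None  # None means the node is a bnode


def shared_objects(triples, prop_a, prop_b):
    """Nodes that are the object of both prop_a and prop_b."""
    objs_b = [o for _, p, o in triples if p == prop_b]
    return [o for _, p, o in triples
            if p == prop_a and any(o is other for other in objs_b)]


def erase_label(triples, node):
    """The inference rule: replace one node by a bnode everywhere it occurs."""
    blank = Node()

    def swap(term):
        return blank if term is node else term

    return [(swap(s), p, swap(o)) for s, p, o in triples]


jenny, movie = Node("Jenny"), Node("Movie")

# One literal node "10" used twice: a single thing is both an age and a title.
ten = Node('"10"')
one_node = [(jenny, "ex:age", ten), (movie, "ex:title", ten)]

# Two distinct literal nodes that happen to carry the same string.
ten_a, ten_b = Node('"10"'), Node('"10"')
two_nodes = [(jenny, "ex:age", ten_a), (movie, "ex:title", ten_b)]
```

With one shared node, `shared_objects(one_node, "ex:age", "ex:title")` finds it, and it is still found after `erase_label(one_node, ten)`. With two nodes carrying the same string, nothing is shared, just as with two separate bnodes.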
>
>----
>
>Anyway, I offer this idea for consideration by the WG. I think that (apart 
>from the above-mentioned change to Ntriples) allowing literal subjects 
>will only simplify the rest of the documentation. The graph syntax will be 
>easier to specify, and the MT rules will be simplified. Datatyping will be 
>easier to specify and easier to understand, and various exceptions and 
>restrictions can just be forgotten about.
>
>One final rationalization is that we could have an RDFS closure rule
>
>LLL rdf:type rdfs:Literal .
>
>for any literal LLL, which obviously makes very good sense and is 
>conspicuously missing at the moment.
>
>Pat
>
>
>
>
>
>--
>
>
>---------------------------------------------------------------------
>IHMC                                    (850)434 8903   home
>40 South Alcaniz St.                    (850)202 4416   office
>Pensacola,  FL 32501                    (850)202 4440   fax
>phayes@ai.uwf.edu 
>http://www.coginst.uwf.edu/~phayes

-------------------
Graham Klyne
<GK@NineByNine.org>
Received on Thursday, 27 June 2002 15:39:20 UTC