literals, again. from pat hayes on 2002-06-27 (w3c-rdfcore-wg@w3.org from June 2002)

From: pat hayes <phayes@ai.uwf.edu>
Date: Thu, 27 Jun 2002 13:23:27 -0500
To: w3c-rdfcore-wg@w3.org
Message-Id: <p05111b1ab93eb59e47b5@[65.217.30.113]>
Apologies to all for reopening this business, but Ive been descending 
into a black hole on this issue since the F2F. I really, really do 
not like the situation we seem to be in here regarding literals, for 
a whole lot of reasons. This message tries to enumerate them and to 
suggest a possible solution. I know this has been done before and 
apologize in advance for missing the obvious objections that have 
probably already been raised.

Summary: literals ought to be strings, not 3-tuples. To achieve this, 
literals should be allowed in subject position. If we do this, a 
number of issues are clarified and the overall design of the language 
is rationalized and simplified, at almost no cost. Also, 
datastructuring suddenly gets easier.

--------

First let me summarize my understanding of the current state of play.

1. Literals are not strings.
2. Literals are 3-tuples, consisting of a bit and two strings, the 
second of which is an XML lang tag.

There are some constraints on these three parts of a literal.

3. If the bit is set (Im not sure if that is a one or a zero) then 
the second string is required to be well-formed XML. 
4. If the bit is set, then the lang tag is understood to be the lang 
tag of the second string.
5. If the bit is not set, then the lang tag is understood to indicate 
that the second string is an expression of that language.

This has some curious consequences.

First, (3) has the consequence that any RDF engine must include an 
XML parser, even if it is not expected to parse RDF/XML. The graph 
syntax has all of XML syntax incorporated into it. This seems to me 
to be the most serious problem. Under these circumstances it hardly 
seems worth having the graph syntax: it would be better to just 
identify RDF with RDF/XML and stop pretending.

Second, (4) and (5) interact in odd ways. Notice that the meaning of 
the lang tag changes according to the bit setting. Suppose that the 
second string is, in fact, well-formed XML and the lang tag is "FR". 
If the bit is set, then tout est bien; but if the bit is not set, 
then this combination is a disaster, since that well-formed XML is a 
mere string, its resemblance to XML a mere accident, and then the 
lang tag is claiming that it is a French text, which of course (being 
well-formed XML) it is not, in fact. I am not sure what an RDF engine 
is supposed to do at this point: would it be expected to reject this 
as an incoherent literal?

Third, since we have decided that literals denote themselves, this 
construction means that these ad-hoc 3-tuples must be in the semantic 
domain of any RDF interpretation, and hence of any interpretation of 
any language which extends RDF. This is at best an unattractive 
consequence, and it might be disastrous if other languages expect to 
handle literals differently (as I am sure they will). It seems 
particularly weird to incorporate XML *syntax* into the semantic 
domain of all web ontology languages. Notice that we do not have the 
option of making these 3-tuples 'optional' once they are in the 
semantic domain.

Further to the previous point, datatyping simply does not work. All 
the datatyping proposals so far considered - ALL of them - have been 
predicated on the assumption that literals are members of the lexical 
space of XML datatypes. But 3-tuples consisting of a bit and two 
strings are not in the lexical space of any XML datatype. So we 
simply cannot do datatyping in RDF at present, seems to me, with 
literals the way they are.

Finally, there is a purely aesthetic reason which one might reject, 
but it is worth mentioning: there is an obvious analogy between the 
lang tag and a datatype, and it would be nice if the overall RDF 
scheme of things preserved this analogy.

For all these reasons, I would suggest that the decision to make 
literals (in the graph syntax) into things other than strings was a 
very bad one, and should be reconsidered. (I would have said this a 
long time ago if I had realized that this was in fact the decision 
that had been taken, and Im sorry that I missed that in the weeks 
following the Cannes meeting.)

------

I gather that the motivation for this decision was twofold: the bit 
is there to record the parsetype being XML, to allow accurate XML 
round-tripping; and the lang tag is needed by some RDF users. (I find 
myself sceptical that the bit is actually needed: would it really be 
a disaster if a string which just happened to be legal XML, but 
hadn't been parsed from XML input, was accidentally misreported as 
being true XML? This seems to me like worrying about the possibility 
that monkeys might accidentally type out some Milton sonnets. But 
never mind.)

Now, it seems to me that both of these are examples of information 
*about* a literal string which needs to be recorded in an unambiguous 
form. Since RDF is itself a language for asserting things about 
things, the obvious way to record information in an RDF graph is to 
use triples to make such assertions; but the obvious problem in doing 
that is that the literal in question would naturally be the subject 
of the triple, and literals cannot be subjects. Damn.

We have been here before. Datatyping would be more natural if 
literals could be subjects. Tim has asked us to consider the 
possibility that literals could be subjects, and to recommend this 
change to the next WG if we feel unable to do it. It seems to me that 
everyone who has ever considered this matter has agreed that the 
restriction against literals in subject position is irrational. So, 
lets consider what would happen if we lifted it. In what follows, 
therefore, I will assume that literal nodes in the graph are labelled 
with strings, and that these nodes can be in the subject position of 
a triple.

The information recorded in the literals at present as extra 
syntactic constructs could be represented naturally by introducing 
special properties, eg:

"10"  rdf:sourceDeclaration  "xml version="1.0"" .
"10"  rdf:xmlLang  "FR"  .

(where this is supposed to be a graph with three nodes and two arcs, 
by the way.) Obviously, the names can be changed to protect the 
innocent. Both of these mention xml deliberately, since this stuff is 
entirely to do with xml round-tripping and preserving xml-specific 
data in an RDF graph.

Having literal subjects also allows datatyping constructions which 
are all based on triples of the form

<literal> <datatype mapping> <bnode> .

where the datatype mapping goes, as one would expect, from the 
lexical form to the value. For example, we could write

Jenny ex:age "10"  .
"10" xsd:number _:x .
Jim ex:age _:x .

(4  nodes total) which means that Jenny's age is a string and Jims 
age is the number ten. Depending which end of the datatype triple you 
use as your object node, you get either the lexical thingie or the 
value thingie. The same datatype triple asserts that the literal is 
in the lexical space and that the value is the right value. Simple, 
regular, conforms to the XML description, and easy to understand.

Also, range datatyping now can be done using a matched pair of closure rules:

PPP  rdfd:drange DDD .
AAA PPP LLL .
-->
LLL DDD _:x .

PPP  rdfd:drange DDD .
AAA PPP BBB .
BBB rdfd:lex LLL .
-->
LLL DDD BBB .

for literals LLL and non-literals BBB., which again is easy to 
follow, and kind of fits under a single idiom, where the datatype 
mapping has the same relationship to the datatype class that rdf:type 
has to a particular rdfs:class.

By the way, amazingly enough, these rules work both for the current 
'stake' datatyping AND for the alternative; the difference is in how 
you interpret what a datatype triple asserts, exactly. The current 
version says that the literal denotes itself and the property is the 
datatype mapping; the alternative would say that the literal denotes 
the value and the property is identity. (View in fixed-width font:)
Current:

literal subject ---<datatype property> --> object
    |                                         |                   (graph syntax)
  denotes                                 denotes     -----------------------
    |                                         |        (semantic interpretation)
    \/                                        \/
lexical item ----- datatype-mapping ----> value

Alternative:


literal subject ---<datatype property> --> object
    |                                         |                   (graph syntax)
  denotes via datatype mapping             denotes     -----------------------
    |                                         |        (semantic interpretation)
    \/                                        \/
value     -----       equality       ----> value

The only difference here is that the label of the bottom arc in the 
first square has been moved to the right-hand arc in the second one. 
The top and right-hand sides are the same, and both squares commute, 
ie you get to the same place by any path.

-----

The cost would be that we would have to allow untidy literals. But I 
am increasingly thinking that our insistence on (syntactically) tidy 
literals is irrational. Bear in mind that this is an issue ONLY in 
the graph syntax. In any lexicalisation of an RDF graph (RDF/XML or 
Ntriples, but also any other lexicalization that could be sent as a 
character string), the graph-ids themselves cannot possibly be 
syntactically tidy.

The interesting thing is that we can have (a kind of) semantic 
tidyness even when the graph syntax isn't tidy. The way to do this is 
to say that what a literal node denotes is a particular *occurrence* 
of a character string; the one on that node, in fact. That is, 
literal *nodes* denote themselves. There can be several character 
strings "10" going around, and one of them might be in French and 
another of them might be labelled as being XML, and so on. But any 
properties of them which depend only on the actual characters in them 
apply to all of them or to none of them together. The situation is 
just like properties in RDF: there could be two properties with the 
same property extension (each is a subPropertyOf the other) but 
themselves having different properties. Similarly for classes. We 
would be putting literals into the same kind of 'intensional' 
category: there could be two different literals with the same string. 
Identity for literals is indicated by the actual node in the graph 
syntax. Notice that all the examples so far have been tidy, in fact.

As with any other untidy literals proposal, this would require a 
slight extension to the N-triples notation to provide a way to 
indicate which literals in which triples were supposed to be on the 
same node; we could do this by the same device that we currently use 
for bnodes, so that the above examples might look like this:

_:x "10"  rdf:sourceDeclaration  "xml version="1.0"" .
_:x "10"  rdf:xmlLang  "FR"  .

Jenny ex:age _:y "10" .
_:y "10" xsd:number _:x .
Jim ex:age _:x .

where the first one uses the additional (handy) convention that if no 
node ID is indicated for a literal, then the literal is unique to its 
node.

What about the Cannes entailment? Well, it depends now on the details 
of the graph. If there is only one literal *node* which is both the 
age of Jenny and the title of some movie, then indeed you can 
conclude that there is a single thing which is both an age and a 
title. The inference rule can be succinctly stated as follows: given 
any node in any graph, it is OK to erase its label. That is, you can 
replace a uriref with a bnode, and you can rub out any literal label 
of a node (leaving a bnode). On the other hand, if the original graph 
has got two literal nodes, both labelled "10", and one is the title 
and the other is the age, then no, you can't infer that there is a 
single thing that is both. But then you couldn't do that even if the 
nodes were bnodes, in this case, so you shouldn't *expect* to be able 
to make this inference in this case, seems to me. (Im sure that Dan's 
response will be that he doesn't want this to be a legal graph, for 
just this reason.)

----

Anyway, I offer this idea for consideration by the WG. I think that 
(apart from the above-mentioned change to Ntriples) allowing literal 
subjects will only simplify the rest of the documentation. The graph 
syntax will be easier to specify, and the MT rules will be 
simplified. Datatyping will be easier to specify and easier to 
understand, and various exceptions and restrictions can just be 
forgotten about.

One final rationalization is that we could have an RDFS closure rule

LLL rdf:type rdfs:Literal .

for any literal LLL, which obviously makes very good sense and is 
conspicuously missing at the moment.

Pat



-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 27 June 2002 14:28:06 UTC