- From: pat hayes <phayes@ai.uwf.edu>
- Date: Thu, 27 Jun 2002 13:23:27 -0500
- To: w3c-rdfcore-wg@w3.org
- Message-Id: <p05111b1ab93eb59e47b5@[65.217.30.113]>
Apologies to all for reopening this business, but I've been descending
into a black hole on this issue since the F2F. I really, really do
not like the situation we seem to be in here regarding literals, for
a whole lot of reasons. This message tries to enumerate them and to
suggest a possible solution. I know this has been done before and
apologize in advance for missing the obvious objections that have
probably already been raised.
Summary: literals ought to be strings, not 3-tuples. To achieve this,
literals should be allowed in subject position. If we do this, a
number of issues are clarified and the overall design of the language
is rationalized and simplified, at almost no cost. Also,
datastructuring suddenly gets easier.
--------
First let me summarize my understanding of the current state of play.
1. Literals are not strings.
2. Literals are 3-tuples, consisting of a bit and two strings, the
second of which is an XML lang tag.
There are some constraints on these three parts of a literal.
3. If the bit is set (I'm not sure if that is a one or a zero) then
the second string is required to be well-formed XML.
4. If the bit is set, then the lang tag is understood to be the lang
tag of the second string.
5. If the bit is not set, then the lang tag is understood to indicate
that the second string is an expression of that language.
This has some curious consequences.
First, (3) has the consequence that any RDF engine must include an
XML parser, even if it is not expected to parse RDF/XML. The graph
syntax has all of XML syntax incorporated into it. This seems to me
to be the most serious problem. Under these circumstances it hardly
seems worth having the graph syntax: it would be better to just
identify RDF with RDF/XML and stop pretending.
Second, (4) and (5) interact in odd ways. Notice that the meaning of
the lang tag changes according to the bit setting. Suppose that the
second string is, in fact, well-formed XML and the lang tag is "FR".
If the bit is set, then tout est bien; but if the bit is not set,
then this combination is a disaster, since that well-formed XML is a
mere string, its resemblance to XML a mere accident, and then the
lang tag is claiming that it is a French text, which of course (being
well-formed XML) it is not, in fact. I am not sure what an RDF engine
is supposed to do at this point: would it be expected to reject this
as an incoherent literal?
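A rough Python sketch may make the interaction between (4) and (5) concrete. The class name TupleLiteral, its field names, and the helper lang_reading are hypothetical, invented only for illustration; they are not taken from any RDF specification or library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TupleLiteral:
    """Hypothetical model of the 3-tuple literal: a bit and two strings."""
    is_xml: bool  # the "bit": True means the string is to be read as XML
    text: str     # the literal string itself
    lang: str     # the XML lang tag


def lang_reading(lit: TupleLiteral) -> str:
    """Describe what the lang tag is taken to mean; it depends on the bit."""
    if lit.is_xml:
        return f"lang tag {lit.lang!r} is the lang tag of the XML content"
    return f"the string is claimed to be an expression of language {lit.lang!r}"


markup = "<b>10</b>"
as_xml = TupleLiteral(True, markup, "FR")      # bit set: all is well
as_string = TupleLiteral(False, markup, "FR")  # bit not set: "French" markup

print(lang_reading(as_xml))
print(lang_reading(as_string))
```

The two values carry exactly the same characters and the same tag, yet they are different literals with different readings of "FR", which is the oddity described above.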
Third, since we have decided that literals denote themselves, this
construction means that these ad-hoc 3-tuples must be in the semantic
domain of any RDF interpretation, and hence of any interpretation of
any language which extends RDF. This is at best an unattractive
consequence, and it might be disastrous if other languages expect to
handle literals differently (as I am sure they will). It seems
particularly weird to incorporate XML *syntax* into the semantic
domain of all web ontology languages. Notice that we do not have the
option of making these 3-tuples 'optional' once they are in the
semantic domain.
Further to the previous point, datatyping simply does not work. All
the datatyping proposals so far considered - ALL of them - have been
predicated on the assumption that literals are members of the lexical
space of XML datatypes. But 3-tuples consisting of a bit and two
strings are not in the lexical space of any XML datatype. So we
simply cannot do datatyping in RDF at present, seems to me, with
literals the way they are.
Finally, there is a purely aesthetic reason which one might reject,
but it is worth mentioning: there is an obvious analogy between the
lang tag and a datatype, and it would be nice if the overall RDF
scheme of things preserved this analogy.
For all these reasons, I would suggest that the decision to make
literals (in the graph syntax) into things other than strings was a
very bad one, and should be reconsidered. (I would have said this a
long time ago if I had realized that this was in fact the decision
that had been taken, and I'm sorry that I missed that in the weeks
following the Cannes meeting.)
------
I gather that the motivation for this decision was twofold: the bit
is there to record the parsetype being XML, to allow accurate XML
round-tripping; and the lang tag is needed by some RDF users. (I find
myself sceptical that the bit is actually needed: would it really be
a disaster if a string which just happened to be legal XML, but
hadn't been parsed from XML input, was accidentally misreported as
being true XML? This seems to me like worrying about the possibility
that monkeys might accidentally type out some Milton sonnets. But
never mind.)
Now, it seems to me that both of these are examples of information
*about* a literal string which needs to be recorded in an unambiguous
form. Since RDF is itself a language for asserting things about
things, the obvious way to record information in an RDF graph is to
use triples to make such assertions; but the obvious problem in doing
that is that the literal in question would naturally be the subject
of the triple, and literals cannot be subjects. Damn.
We have been here before. Datatyping would be more natural if
literals could be subjects. Tim has asked us to consider the
possibility that literals could be subjects, and to recommend this
change to the next WG if we feel unable to do it. It seems to me that
everyone who has ever considered this matter has agreed that the
restriction against literals in subject position is irrational. So,
let's consider what would happen if we lifted it. In what follows,
therefore, I will assume that literal nodes in the graph are labelled
with strings, and that these nodes can be in the subject position of
a triple.
The information recorded in the literals at present as extra
syntactic constructs could be represented naturally by introducing
special properties, eg:
"10" rdf:sourceDeclaration "xml version="1.0"" .
"10" rdf:xmlLang "FR" .
(where this is supposed to be a graph with three nodes and two arcs,
by the way.) Obviously, the names can be changed to protect the
innocent. Both of these mention xml deliberately, since this stuff is
entirely to do with xml round-tripping and preserving xml-specific
data in an RDF graph.
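As a minimal sketch, the two triples above can be held in an ordinary set of tuples once literals are allowed as subjects. The Lit wrapper and the helpers nodes and about are hypothetical names used only here; the property names are the ones suggested above and the literal strings are copied as written. This sketch treats equal strings as the same node (tidy literals).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    """A literal node labelled with a plain string (hypothetical wrapper)."""
    text: str


# Two arcs leaving the same literal subject "10".
graph = {
    (Lit("10"), "rdf:sourceDeclaration", Lit('xml version="1.0"')),
    (Lit("10"), "rdf:xmlLang", Lit("FR")),
}


def nodes(g):
    """All subject and object nodes of the graph."""
    return {s for s, _, _ in g} | {o for _, _, o in g}


def about(g, subject):
    """Everything the graph asserts about one subject, keyed by property."""
    return {p: o for s, p, o in g if s == subject}


print(len(nodes(graph)), "nodes,", len(graph), "arcs")
print(about(graph, Lit("10")))
```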
Having literal subjects also allows datatyping constructions which
are all based on triples of the form
<literal> <datatype mapping> <bnode> .
where the datatype mapping goes, as one would expect, from the
lexical form to the value. For example, we could write
Jenny ex:age "10" .
"10" xsd:number _:x .
Jim ex:age _:x .
(4 nodes total) which means that Jenny's age is a string and Jim's
age is the number ten. Depending which end of the datatype triple you
use as your object node, you get either the lexical thingie or the
value thingie. The same datatype triple asserts that the literal is
in the lexical space and that the value is the right value. Simple,
regular, conforms to the XML description, and easy to understand.
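The Jenny and Jim example can be sketched in Python as follows. This is an illustration under stated assumptions: Lit, BNode, DATATYPES, bnode_values and age are hypothetical names, and Python's int is only a stand-in for the lexical-to-value mapping that the message calls xsd:number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    text: str


@dataclass(frozen=True)
class BNode:
    name: str


# Assumption: int stands in for the datatype mapping named xsd:number above.
DATATYPES = {"xsd:number": int}

graph = {
    ("Jenny", "ex:age", Lit("10")),
    (Lit("10"), "xsd:number", BNode("x")),
    ("Jim", "ex:age", BNode("x")),
}


def bnode_values(g):
    """Read each datatype triple <literal> <mapping> <bnode> as fixing the bnode's value."""
    values = {}
    for s, p, o in g:
        if isinstance(s, Lit) and p in DATATYPES and isinstance(o, BNode):
            values[o] = DATATYPES[p](s.text)
    return values


def age(g, who):
    """Return the object of ex:age: the literal itself, or the value behind a bnode."""
    values = bnode_values(g)
    for s, p, o in g:
        if s == who and p == "ex:age":
            return values.get(o, o)
    return None


print(age(graph, "Jenny"))  # the lexical end of the datatype triple
print(age(graph, "Jim"))    # the value end of the datatype triple
```

Which end of the datatype triple a property points at decides whether it gets the string or the number; nothing else in the graph changes.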
Also, range datatyping now can be done using a matched pair of closure rules:
PPP rdfd:drange DDD .
AAA PPP LLL .
-->
LLL DDD _:x .

PPP rdfd:drange DDD .
AAA PPP BBB .
BBB rdfd:lex LLL .
-->
LLL DDD BBB .
for literals LLL and non-literals BBB, which again is easy to
follow, and kind of fits under a single idiom, where the datatype
mapping has the same relationship to the datatype class that rdf:type
has to a particular rdfs:class.
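The matched pair of rules can be sketched as one pass over a set of triples. This is an illustration, not a definitive implementation: Lit, BNode and close_drange are hypothetical names, the rules are applied once rather than to a fixpoint, each non-literal is assumed to have at most one rdfd:lex triple, and the generated bnode names (g0, g1, ...) are assumed not to clash with names already in the graph.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Lit:
    text: str


@dataclass(frozen=True)
class BNode:
    name: str


def close_drange(graph):
    """Apply both range-datatyping rules once; return the enlarged graph."""
    fresh = count()
    ranges = [(s, o) for s, p, o in graph if p == "rdfd:drange"]
    lex = {s: o for s, p, o in graph if p == "rdfd:lex"}
    out = set(graph)
    for prop, dtype in ranges:
        # Sorting only makes the generated bnode names repeatable.
        for s, p, o in sorted(graph, key=repr):
            if p != prop:
                continue
            if isinstance(o, Lit):
                # Rule 1: the literal object maps to some (unnamed) value.
                out.add((o, dtype, BNode(f"g{next(fresh)}")))
            elif o in lex:
                # Rule 2: the non-literal object is the value of its lexical form.
                out.add((lex[o], dtype, o))
    return out


example = {
    ("ex:age", "rdfd:drange", "xsd:number"),
    ("Jenny", "ex:age", Lit("10")),
    ("Jim", "ex:age", BNode("y")),
    (BNode("y"), "rdfd:lex", Lit("12")),
}
closed = close_drange(example)
for triple in sorted(closed - example, key=repr):
    print(triple)
```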
By the way, amazingly enough, these rules work both for the current
'stake' datatyping AND for the alternative; the difference is in how
you interpret what a datatype triple asserts, exactly. The current
version says that the literal denotes itself and the property is the
datatype mapping; the alternative would say that the literal denotes
the value and the property is identity. (View in fixed-width font:)
Current:

literal subject ---<datatype property> --> object
       |                                     |      (graph syntax)
    denotes                               denotes   -----------------------
       |                                     |      (semantic interpretation)
       \/                                    \/
lexical item ----- datatype-mapping -----> value

Alternative:

literal subject ---<datatype property> --> object
       |                                     |      (graph syntax)
denotes via datatype mapping              denotes   -----------------------
       |                                     |      (semantic interpretation)
       \/                                    \/
     value ---------- equality ----------> value
The only difference here is that the label of the bottom arc in the
first square has been moved to the left-hand arc in the second one.
The top and right-hand sides are the same, and both squares commute,
ie you get to the same place by any path.
-----
The cost would be that we would have to allow untidy literals. But I
am increasingly thinking that our insistence on (syntactically) tidy
literals is irrational. Bear in mind that this is an issue ONLY in
the graph syntax. In any lexicalization of an RDF graph (RDF/XML or
N-triples, but also any other lexicalization that could be sent as a
character string), the graph-ids themselves cannot possibly be
syntactically tidy.
The interesting thing is that we can have (a kind of) semantic
tidiness even when the graph syntax isn't tidy. The way to do this is
to say that what a literal node denotes is a particular *occurrence*
of a character string; the one on that node, in fact. That is,
literal *nodes* denote themselves. There can be several character
strings "10" going around, and one of them might be in French and
another of them might be labelled as being XML, and so on. But any
properties of them which depend only on the actual characters in them
apply to all of them or to none of them together. The situation is
just like properties in RDF: there could be two properties with the
same property extension (each is a subPropertyOf the other) but
themselves having different properties. Similarly for classes. We
would be putting literals into the same kind of 'intensional'
category: there could be two different literals with the same string.
Identity for literals is indicated by the actual node in the graph
syntax. Notice that all the examples so far have been tidy, in fact.
As with any other untidy literals proposal, this would require a
slight extension to the N-triples notation to provide a way to
indicate which literals in which triples were supposed to be on the
same node; we could do this by the same device that we currently use
for bnodes, so that the above examples might look like this:
_:x "10" rdf:sourceDeclaration "xml version="1.0"" .
_:x "10" rdf:xmlLang "FR" .

Jenny ex:age _:y "10" .
_:y "10" xsd:number _:x .
Jim ex:age _:x .
where the first one uses the additional (handy) convention that if no
node ID is indicated for a literal, then the literal is unique to its
node.
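The idea that literal nodes denote themselves, so that two nodes may carry the same string, can be sketched in Python by comparing nodes by identity rather than by label. The names LitNode, same_string and properties_of are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by label
class LitNode:
    text: str


a = LitNode("10")  # one occurrence of the string, tagged as French below
b = LitNode("10")  # another occurrence of the same characters

graph = [
    (a, "rdf:xmlLang", LitNode("FR")),
    ("Jenny", "ex:age", b),
]


def same_string(n1, n2):
    """Properties that depend only on the characters hold of both nodes or of neither."""
    return n1.text == n2.text


def properties_of(g, node):
    """Arcs leaving one particular node (identity, not label, decides)."""
    return [(p, o) for s, p, o in g if s is node]


print(same_string(a, b))            # same characters
print(a is b)                       # different nodes
print(len(properties_of(graph, a)), len(properties_of(graph, b)))
```

The node ID prefix in the extended notation plays the role that the Python object identity plays here: it says which occurrences of a literal sit on the same node.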
What about the Cannes entailment? Well, it depends now on the details
of the graph. If there is only one literal *node* which is both the
age of Jenny and the title of some movie, then indeed you can
conclude that there is a single thing which is both an age and a
title. The inference rule can be succinctly stated as follows: given
any node in any graph, it is OK to erase its label. That is, you can
replace a uriref with a bnode, and you can rub out any literal label
of a node (leaving a bnode). On the other hand, if the original graph
has got two literal nodes, both labelled "10", and one is the title
and the other is the age, then no, you can't infer that there is a
single thing that is both. But then you couldn't do that even if the
nodes were bnodes, in this case, so you shouldn't *expect* to be able
to make this inference in this case, seems to me. (I'm sure that Dan's
response will be that he doesn't want this to be a legal graph, for
just this reason.)
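The label-erasing rule and its effect on the Cannes entailment can be sketched as follows. Node, erase_label and shared_object are hypothetical names, and ex:film and ex:title are invented for the movie example; the sketch only checks whether one node is the object of both properties.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # nodes are compared by identity
class Node:
    label: Optional[str] = None  # None means a blank node


def erase_label(graph, node):
    """Return a copy of the graph with `node` replaced by a fresh blank node."""
    blank = Node()

    def swap(n):
        return blank if n is node else n

    return [(swap(s), p, swap(o)) for s, p, o in graph], blank


def shared_object(graph, p1, p2):
    """Is there a single node that is the object of both p1 and p2?"""
    objs1 = [o for _, p, o in graph if p == p1]
    return any(o is x for _, p, o in graph if p == p2 for x in objs1)


jenny, film = Node("Jenny"), Node("ex:film")

ten = Node('"10"')
one_node = [(jenny, "ex:age", ten), (film, "ex:title", ten)]
two_nodes = [(jenny, "ex:age", Node('"10"')), (film, "ex:title", Node('"10"'))]

erased, blank = erase_label(one_node, ten)
print(shared_object(erased, "ex:age", "ex:title"))     # one node: entailment holds
print(shared_object(two_nodes, "ex:age", "ex:title"))  # two nodes: it does not
```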
----
Anyway, I offer this idea for consideration by the WG. I think that
(apart from the above-mentioned change to N-triples) allowing literal
subjects will only simplify the rest of the documentation. The graph
syntax will be easier to specify, and the MT rules will be
simplified. Datatyping will be easier to specify and easier to
understand, and various exceptions and restrictions can just be
forgotten about.
One final rationalization is that we could have an RDFS closure rule
LLL rdf:type rdfs:Literal .
for any literal LLL, which obviously makes very good sense and is
conspicuously missing at the moment.
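A minimal sketch of that closure rule over a set of triples, assuming literal subjects are allowed. Lit and literal_type_closure are hypothetical names used only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lit:
    text: str


def literal_type_closure(graph):
    """Add LLL rdf:type rdfs:Literal for every literal LLL that occurs in the graph."""
    out = set(graph)
    for s, _, o in graph:
        for node in (s, o):
            if isinstance(node, Lit):
                out.add((node, "rdf:type", "rdfs:Literal"))
    return out


example = {
    ("Jenny", "ex:age", Lit("10")),
    (Lit("10"), "rdf:xmlLang", Lit("FR")),
}
closed = literal_type_closure(example)
for triple in sorted(closed - example, key=repr):
    print(triple)
```

The conclusion of the rule has a literal in subject position, which is why it cannot even be written down under the current restriction.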
Pat
--
---------------------------------------------------------------------
IHMC (850)434 8903 home
40 South Alcaniz St. (850)202 4416 office
Pensacola, FL 32501 (850)202 4440 fax
phayes@ai.uwf.edu
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 27 June 2002 14:28:06 UTC