- From: Graham Klyne <Graham.Klyne@MIMEsweeper.com>
- Date: Thu, 27 Jun 2002 20:52:32 +0100
- To: pat hayes <phayes@ai.uwf.edu>
- Cc: w3c-rdfcore-wg@w3.org
I guess a number of questions are raised by Pat's message: (a) has enough new information been provided to reopen the issue? (b) do we agree that literals-as-tuples is ugly? (c) do we agree that extending the graph syntax to allow literals as subjects is desirable? Regarding (a), I think the answer is procedurally based rather than technically based, and don't have a view. My remaining comments are predicated on the group and/or chair deciding the answer is yes. Regarding (b), I tend to agree, but I've yet to understand fully why this is such a serious issue as Pat suggests. Regarding (c), I think that allowing literals as subjects in the graph syntax would be a natural and useful thing, and in the absence of procedural considerations would wholeheartedly support such a move. In saying this, I presume it's understood that not all well-formed graphs can be represented in the RDF/XML syntax, but that's nothing new -- we already have that situation. If (c) is accepted, then I think it reasonable that some other issues be reconsidered in light of that. #g -- At 01:23 PM 6/27/02 -0500, pat hayes wrote: >Apologies to all for reopening this business, but Ive been descending into >a black hole on this issue since the F2F. I really, really do not like the >situation we seem to be in here regarding literals, for a whole lot of >reasons. This message tries to enumerate them and to suggest a possible >solution. I know this has been done before and apologize in advance for >missing the obvious objections that have probably already been raised. > >Summary: literals ought to be strings, not 3-tuples. To achieve this, >literals should be allowed in subject position. If we do this, a number of >issues are clarified and the overall design of the language is >rationalized and simplified, at almost no cost. Also, datastructuring >suddenly gets easier. > >-------- > >First let me summarize my understanding of the current state of play. > >1. Literals are not strings. >2. Literals are 3-tuples, consisting of a bit and two strings, the second >of which is an XML lang tag. > >There are some constraints on these three parts of a literal. > >3. If the bit is set (Im not sure if that is a one or a zero) then the >second string is required to be well-formed XML. >4. If the bit is set, then the lang tag is understood to be the lang tag >of the second string. >5. If the bit is not set, then the lang tag is understood to indicate that >the second string is an expression of that language. > >This has some curious consequences. > >First, (3) has the consequence that any RDF engine must include an XML >parser, even if it is not expected to parse RDF/XML. The graph syntax has >all of XML syntax incorporated into it. This seems to me to be the most >serious problem. Under these circumstances it hardly seems worth having >the graph syntax: it would be better to just identify RDF with RDF/XML and >stop pretending. >Second, (4) and (5) interact in odd ways. Notice that the meaning of the >lang tag changes according to the bit setting. Suppose that the second >string is, in fact, well-formed XML and the lang tag is "FR". If the bit >is set, then tout est bien; but if the bit is not set, then this >combination is a disaster, since that well-formed XML is a mere string, >its resemblance to XML a mere accident, and then the lang tag is claiming >that it is a French text, which of course (being well-formed XML) it is >not, in fact. I am not sure what an RDF engine is supposed to do at this >point: would it be expected to reject this as an incoherent literal? > >Third, since we have decided that literals denote themselves, this >construction means that these ad-hoc 3-tuples must be in the semantic >domain of any RDF interpretation, and hence of any interpretation of any >language which extends RDF. This is at best an unattractive consequence, >and it might be disastrous if other languages expect to handle literals >differently (as I am sure they will). It seems particularly weird to >incorporate XML *syntax* into the semantic domain of all web ontology >languages. Notice that we do not have the option of making these 3-tuples >'optional' once they are in the semantic domain. > >Further to the previous point, datatyping simply does not work. All the >datatyping proposals so far considered - ALL of them - have been >predicated on the assumption that literals are members of the lexical >space of XML datatypes. But 3-tuples consisting of a bit and two strings >are not in the lexical space of any XML datatype. So we simply cannot do >datatyping in RDF at present, seems to me, with literals the way they are. > >Finally, there is a purely aesthetic reason which one might reject, but it >is worth mentioning: there is an obvious analogy between the lang tag and >a datatype, and it would be nice if the overall RDF scheme of things >preserved this analogy. > >For all these reasons, I would suggest that the decision to make literals >(in the graph syntax) into things other than strings was a very bad one, >and should be reconsidered. (I would have said this a long time ago if I >had realized that this was in fact the decision that had been taken, and >Im sorry that I missed that in the weeks following the Cannes meeting.) > >------ >I gather that the motivation for this decision was twofold: the bit is >there to record the parsetype being XML, to allow accurate XML >round-tripping; and the lang tag is needed by some RDF users. (I find >myself sceptical that the bit is actually needed: would it really be a >disaster if a string which just happened to be legal XML, but hadn't been >parsed from XML input, was accidentally misreported as being true XML? >This seems to me like worrying about the possibility that monkeys might >accidentally type out some Milton sonnets. But never mind.) > >Now, it seems to me that both of these are examples of information *about* >a literal string which needs to be recorded in an unambiguous form. Since >RDF is itself a language for asserting things about things, the obvious >way to record information in an RDF graph is to use triples to make such >assertions; but the obvious problem in doing that is that the literal in >question would naturally be the subject of the triple, and literals cannot >be subjects. Damn. > >We have been here before. Datatyping would be more natural if literals >could be subjects. Tim has asked us to consider the possibility that >literals could be subjects, and to recommend this change to the next WG if >we feel unable to do it. It seems to me that everyone who has ever >considered this matter has agreed that the restriction against literals in >subject position is irrational. So, lets consider what would happen if we >lifted it. In what follows, therefore, I will assume that literal nodes in >the graph are labelled with strings, and that these nodes can be in the >subject position of a triple. >The information recorded in the literals at present as extra syntactic >constructs could be represented naturally by introducing special >properties, eg: > >"10" rdf:sourceDeclaration "xml version="1.0"" . >"10" rdf:xmlLang "FR" . > >(where this is supposed to be a graph with three nodes and two arcs, by >the way.) Obviously, the names can be changed to protect the innocent. >Both of these mention xml deliberately, since this stuff is entirely to do >with xml round-tripping and preserving xml-specific data in an RDF graph. > >Having literal subjects also allows datatyping constructions which are all >based on triples of the form > ><literal> <datatype mapping> <bnode> . > >where the datatype mapping goes, as one would expect, from the lexical >form to the value. For example, we could write > >Jenny ex:age "10" . >"10" xsd:number _:x . >Jim ex:age _:x . > >(4 nodes total) which means that Jenny's age is a string and Jims age is >the number ten. Depending which end of the datatype triple you use as your >object node, you get either the lexical thingie or the value thingie. The >same datatype triple asserts that the literal is in the lexical space and >that the value is the right value. Simple, regular, conforms to the XML >description, and easy to understand. > >Also, range datatyping now can be done using a matched pair of closure rules: > >PPP rdfd:drange DDD . >AAA PPP LLL . >--> >LLL DDD _:x . > >PPP rdfd:drange DDD . >AAA PPP BBB . >BBB rdfd:lex LLL . >--> >LLL DDD BBB . > >for literals LLL and non-literals BBB., which again is easy to follow, and >kind of fits under a single idiom, where the datatype mapping has the same >relationship to the datatype class that rdf:type has to a particular >rdfs:class. > >By the way, amazingly enough, these rules work both for the current >'stake' datatyping AND for the alternative; the difference is in how you >interpret what a datatype triple asserts, exactly. The current version >says that the literal denotes itself and the property is the datatype >mapping; the alternative would say that the literal denotes the value and >the property is identity. (View in fixed-width font:) >Current: > >literal subject ---<datatype property> --> object > | | (graph > syntax) > denotes denotes ----------------------- > | | (semantic > interpretation) > \/ \/ >lexical item ----- datatype-mapping ----> value > >Alternative: > > >literal subject ---<datatype property> --> object > | | (graph > syntax) > denotes via datatype mapping denotes ----------------------- > | | (semantic > interpretation) > \/ \/ >value ----- equality ----> value > >The only difference here is that the label of the bottom arc in the first >square has been moved to the right-hand arc in the second one. The top and >right-hand sides are the same, and both squares commute, ie you get to the >same place by any path. > >----- > >The cost would be that we would have to allow untidy literals. But I am >increasingly thinking that our insistence on (syntactically) tidy literals >is irrational. Bear in mind that this is an issue ONLY in the graph >syntax. In any lexicalisation of an RDF graph (RDF/XML or Ntriples, but >also any other lexicalization that could be sent as a character string), >the graph-ids themselves cannot possibly be syntactically tidy. > >The interesting thing is that we can have (a kind of) semantic tidyness >even when the graph syntax isn't tidy. The way to do this is to say that >what a literal node denotes is a particular *occurrence* of a character >string; the one on that node, in fact. That is, literal *nodes* denote >themselves. There can be several character strings "10" going around, and >one of them might be in French and another of them might be labelled as >being XML, and so on. But any properties of them which depend only on the >actual characters in them apply to all of them or to none of them >together. The situation is just like properties in RDF: there could be two >properties with the same property extension (each is a subPropertyOf the >other) but themselves having different properties. Similarly for classes. >We would be putting literals into the same kind of 'intensional' category: >there could be two different literals with the same string. Identity for >literals is indicated by the actual node in the graph syntax. Notice that >all the examples so far have been tidy, in fact. > >As with any other untidy literals proposal, this would require a slight >extension to the N-triples notation to provide a way to indicate which >literals in which triples were supposed to be on the same node; we could >do this by the same device that we currently use for bnodes, so that the >above examples might look like this: > >_:x "10" rdf:sourceDeclaration "xml version="1.0"" . >_:x "10" rdf:xmlLang "FR" . > >Jenny ex:age _:y "10" . >_:y "10" xsd:number _:x . >Jim ex:age _:x . > >where the first one uses the additional (handy) convention that if no node >ID is indicated for a literal, then the literal is unique to its node. > >What about the Cannes entailment? Well, it depends now on the details of >the graph. If there is only one literal *node* which is both the age of >Jenny and the title of some movie, then indeed you can conclude that there >is a single thing which is both an age and a title. The inference rule can >be succinctly stated as follows: given any node in any graph, it is OK to >erase its label. That is, you can replace a uriref with a bnode, and you >can rub out any literal label of a node (leaving a bnode). On the other >hand, if the original graph has got two literal nodes, both labelled "10", >and one is the title and the other is the age, then no, you can't infer >that there is a single thing that is both. But then you couldn't do that >even if the nodes were bnodes, in this case, so you shouldn't *expect* to >be able to make this inference in this case, seems to me. (Im sure that >Dan's response will be that he doesn't want this to be a legal graph, for >just this reason.) > >---- > >Anyway, I offer this idea for consideration by the WG. I think that (apart >from the above-mentioned change to Ntriples) allowing literal subjects >will only simplify the rest of the documentation. The graph syntax will be >easier to specify, and the MT rules will be simplified. Datatyping will be >easier to specify and easier to understand, and various exceptions and >restrictions can just be forgotten about. > >One final rationalization is that we could have an RDFS closure rule > >LLL rdf:type rdfs:Literal . > >for any literal LLL, which obviously makes very good sense and is >conspicuously missing at the moment. > >Pat > > > > > >-- > > >--------------------------------------------------------------------- >IHMC (850)434 8903 home >40 South Alcaniz St. (850)202 4416 office >Pensacola, FL 32501 (850)202 4440 fax >phayes@ai.uwf.edu >http://www.coginst.uwf.edu/~phayes ------------------- Graham Klyne <GK@NineByNine.org>
Received on Thursday, 27 June 2002 15:39:20 UTC