Re: literals, again. from Patrick Stickler on 2002-06-28 (w3c-rdfcore-wg@w3.org from June 2002)

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Fri, 28 Jun 2002 11:43:09 +0300
To: Pat Hayes <phayes@ai.uwf.edu>, RDF Core <w3c-rdfcore-wg@w3.org>
Message-ID: <B941FC4D.1791C%patrick.stickler@nokia.com>
Pat,

I support the "gist" of essentially all that you are saying
below, but would offer an even more simplified proposal.

Rather than

> Jenny ex:age _:y "10" .
> _:y "10" xsd:number _:x .
> Jim ex:age _:x .

simply

Jenny ex:age _:y"10" .
_:y rdf:type xsd:integer .
Jim ex:age _:y .

so that datatyping is no different than "normal" RDF
typing, and the above works perfectly in conjunction
with rdfs:range. E.g.

   ex:age rdfs:range xsd:integer .
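
As a rough illustration of how the range statement and the typed literal
node fit together, here is a minimal Python sketch. It assumes a
hypothetical in-memory model that is not part of any proposal: a triple is
a plain tuple, and a labelled literal node such as _:y"10" is a
(node_id, lexical_form) pair. The function name apply_range is likewise
made up for the example.

```python
# Hypothetical model: triples are (subject, predicate, object) tuples and a
# labelled literal node is a (node_id, lexical_form) pair.
RDF_TYPE = "rdf:type"
RDFS_RANGE = "rdfs:range"


def apply_range(triples):
    """Return the triples plus every `O rdf:type D` implied by rdfs:range."""
    ranges = {(s, o) for s, p, o in triples if p == RDFS_RANGE}
    inferred = set(triples)
    for prop, datatype in ranges:
        for s, p, o in triples:
            if p == prop:
                # Ordinary RDFS range typing, applied to a literal node.
                inferred.add((o, RDF_TYPE, datatype))
    return inferred


y = ("_:y", "10")  # the single literal node shared by Jenny and Jim
graph = {
    ("Jenny", "ex:age", y),
    ("Jim", "ex:age", y),
    ("ex:age", RDFS_RANGE, "xsd:integer"),
}
closed = apply_range(graph)
```

Under these assumptions the range statement alone yields
`_:y rdf:type xsd:integer`, so the explicit typing triple and the range
declaration agree.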

Likewise, rather than

> _:x"10"  rdf:sourceDeclaration  "xml version="1.0"" .
> _:x"10"  rdf:xmlLang  "FR"  .

simply

_:x"10" rdf:type rdf:XML .
_:x"10" xml:lang  _:z"fr"  .

Thus, the datatype of the literal is well-formed XML, such
that the lexical space of rdf:XML is the set of all possible
well-formed canonical XML serializations and the value space
is the set of infosets which those serializations map to.

Specific complex datatypes (e.g. xhtml:html) can simply
be subclasses of the "generic" rdf:XML datatype.

Thus, when round-tripping, one can simply test whether a given
literal is of type rdf:XML and if so, serialize it as XML,
otherwise, serialize it as a literal.
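
The round-tripping test could look roughly like the following Python
sketch. It is only an illustration under stated assumptions: literal nodes
are modelled as (node_id, lexical_form) pairs, subclassing is expressed
with rdfs:subClassOf triples, and the helper names (superclasses,
is_xml_literal, serialize) are invented for the example. Escaping uses the
standard-library xml.sax.saxutils.escape.

```python
from xml.sax.saxutils import escape


def superclasses(cls, triples):
    """Return cls together with everything reachable via rdfs:subClassOf."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        stack.extend(o for s, p, o in triples
                     if s == current and p == "rdfs:subClassOf")
    return seen


def is_xml_literal(node, triples):
    """True if the node is typed as rdf:XML or as one of its subclasses."""
    return any("rdf:XML" in superclasses(o, triples)
               for s, p, o in triples
               if s == node and p == "rdf:type")


def serialize(node, triples):
    """Emit XML literals verbatim; escape everything else as plain text."""
    _node_id, lexical = node
    return lexical if is_xml_literal(node, triples) else escape(lexical)


x = ("_:x", "<b>10</b>")      # typed via a subclass of rdf:XML
plain = ("_:p", "<b>10</b>")  # same characters, but no rdf:XML typing
triples = {
    (x, "rdf:type", "xhtml:html"),
    ("xhtml:html", "rdfs:subClassOf", "rdf:XML"),
}
```

Here the two nodes carry the same characters, yet only the one typed
(indirectly) as rdf:XML is written back out as markup.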

And since xml:lang is what is being used in the serialization,
that should be the name of the property qualifying the literal
node. The datatype for the xml:lang property can then be
presumed to be xsd:lang:

  xml:lang rdfs:range xsd:lang .

Regards,

Patrick




On 2002-06-27 21:23, "ext pat hayes" <phayes@ai.uwf.edu> wrote:

> Apologies to all for reopening this business, but I've been descending
> into a black hole on this issue since the F2F. I really, really do
> not like the situation we seem to be in here regarding literals, for
> a whole lot of reasons. This message tries to enumerate them and to
> suggest a possible solution. I know this has been done before and
> apologize in advance for missing the obvious objections that have
> probably already been raised.
> 
> Summary: literals ought to be strings, not 3-tuples. To achieve this,
> literals should be allowed in subject position. If we do this, a
> number of issues are clarified and the overall design of the language
> is rationalized and simplified, at almost no cost. Also,
> datastructuring suddenly gets easier.
> 
> --------
> 
> First let me summarize my understanding of the current state of play.
> 
> 1. Literals are not strings.
> 2. Literals are 3-tuples, consisting of a bit and two strings, the
> second of which is an XML lang tag.
> 
> There are some constraints on these three parts of a literal.
> 
> 3. If the bit is set (I'm not sure if that is a one or a zero) then
> the second string is required to be well-formed XML.
> 4. If the bit is set, then the lang tag is understood to be the lang
> tag of the second string.
> 5. If the bit is not set, then the lang tag is understood to indicate
> that the second string is an expression of that language.
> 
> This has some curious consequences.
> 
> First, (3) has the consequence that any RDF engine must include an
> XML parser, even if it is not expected to parse RDF/XML. The graph
> syntax has all of XML syntax incorporated into it. This seems to me
> to be the most serious problem. Under these circumstances it hardly
> seems worth having the graph syntax: it would be better to just
> identify RDF with RDF/XML and stop pretending.
> 
> Second, (4) and (5) interact in odd ways. Notice that the meaning of
> the lang tag changes according to the bit setting. Suppose that the
> second string is, in fact, well-formed XML and the lang tag is "FR".
> If the bit is set, then tout est bien; but if the bit is not set,
> then this combination is a disaster, since that well-formed XML is a
> mere string, its resemblance to XML a mere accident, and then the
> lang tag is claiming that it is a French text, which of course (being
> well-formed XML) it is not, in fact. I am not sure what an RDF engine
> is supposed to do at this point: would it be expected to reject this
> as an incoherent literal?
> 
> Third, since we have decided that literals denote themselves, this
> construction means that these ad-hoc 3-tuples must be in the semantic
> domain of any RDF interpretation, and hence of any interpretation of
> any language which extends RDF. This is at best an unattractive
> consequence, and it might be disastrous if other languages expect to
> handle literals differently (as I am sure they will). It seems
> particularly weird to incorporate XML *syntax* into the semantic
> domain of all web ontology languages. Notice that we do not have the
> option of making these 3-tuples 'optional' once they are in the
> semantic domain.
> 
> Further to the previous point, datatyping simply does not work. All
> the datatyping proposals so far considered - ALL of them - have been
> predicated on the assumption that literals are members of the lexical
> space of XML datatypes. But 3-tuples consisting of a bit and two
> strings are not in the lexical space of any XML datatype. So we
> simply cannot do datatyping in RDF at present, seems to me, with
> literals the way they are.
> 
> Finally, there is a purely aesthetic reason which one might reject,
> but it is worth mentioning: there is an obvious analogy between the
> lang tag and a datatype, and it would be nice if the overall RDF
> scheme of things preserved this analogy.
> 
> For all these reasons, I would suggest that the decision to make
> literals (in the graph syntax) into things other than strings was a
> very bad one, and should be reconsidered. (I would have said this a
> long time ago if I had realized that this was in fact the decision
> that had been taken, and I'm sorry that I missed that in the weeks
> following the Cannes meeting.)
> 
> ------
> 
> I gather that the motivation for this decision was twofold: the bit
> is there to record the parsetype being XML, to allow accurate XML
> round-tripping; and the lang tag is needed by some RDF users. (I find
> myself sceptical that the bit is actually needed: would it really be
> a disaster if a string which just happened to be legal XML, but
> hadn't been parsed from XML input, was accidentally misreported as
> being true XML? This seems to me like worrying about the possibility
> that monkeys might accidentally type out some Milton sonnets. But
> never mind.)
> 
> Now, it seems to me that both of these are examples of information
> *about* a literal string which needs to be recorded in an unambiguous
> form. Since RDF is itself a language for asserting things about
> things, the obvious way to record information in an RDF graph is to
> use triples to make such assertions; but the obvious problem in doing
> that is that the literal in question would naturally be the subject
> of the triple, and literals cannot be subjects. Damn.
> 
> We have been here before. Datatyping would be more natural if
> literals could be subjects. Tim has asked us to consider the
> possibility that literals could be subjects, and to recommend this
> change to the next WG if we feel unable to do it. It seems to me that
> everyone who has ever considered this matter has agreed that the
> restriction against literals in subject position is irrational. So,
> let's consider what would happen if we lifted it. In what follows,
> therefore, I will assume that literal nodes in the graph are labelled
> with strings, and that these nodes can be in the subject position of
> a triple.
> 
> The information recorded in the literals at present as extra
> syntactic constructs could be represented naturally by introducing
> special properties, eg:
> 
> "10"  rdf:sourceDeclaration  "xml version="1.0"" .
> "10"  rdf:xmlLang  "FR"  .
> 
> (where this is supposed to be a graph with three nodes and two arcs,
> by the way.) Obviously, the names can be changed to protect the
> innocent. Both of these mention xml deliberately, since this stuff is
> entirely to do with xml round-tripping and preserving xml-specific
> data in an RDF graph.
> 
> Having literal subjects also allows datatyping constructions which
> are all based on triples of the form
> 
> <literal> <datatype mapping> <bnode> .
> 
> where the datatype mapping goes, as one would expect, from the
> lexical form to the value. For example, we could write
> 
> Jenny ex:age "10"  .
> "10" xsd:number _:x .
> Jim ex:age _:x .
> 
> (4 nodes total) which means that Jenny's age is a string and Jim's
> age is the number ten. Depending which end of the datatype triple you
> use as your object node, you get either the lexical thingie or the
> value thingie. The same datatype triple asserts that the literal is
> in the lexical space and that the value is the right value. Simple,
> regular, conforms to the XML description, and easy to understand.
> 
> Also, range datatyping now can be done using a matched pair of closure rules:
> 
> PPP  rdfd:drange DDD .
> AAA PPP LLL .
> -->
> LLL DDD _:x .
> 
> PPP  rdfd:drange DDD .
> AAA PPP BBB .
> BBB rdfd:lex LLL .
> -->
> LLL DDD BBB .
> 
> for literals LLL and non-literals BBB, which again is easy to
> follow, and kind of fits under a single idiom, where the datatype
> mapping has the same relationship to the datatype class that rdf:type
> has to a particular rdfs:class.
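
[Illustration] The pair of closure rules quoted above can be sketched in
Python roughly as follows. This is a single-pass sketch under assumptions
that are not in the original message: literals are wrapped in a
hypothetical Literal class (so they compare by string, i.e. tidily),
non-literals are plain strings, and fresh bnodes are named _:g0, _:g1, ...
The function name drange_closure is made up for the example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class Literal:
    """Hypothetical wrapper marking a node as a literal."""
    lexical: str


def drange_closure(triples):
    """Apply both rdfd:drange rules once and return the enlarged graph."""
    fresh = (f"_:g{i}" for i in count())  # supply of new bnode names
    out = set(triples)
    dranges = [(s, o) for s, p, o in triples if p == "rdfd:drange"]
    for prop, dtype in dranges:
        for _subj, pred, obj in triples:
            if pred != prop:
                continue
            if isinstance(obj, Literal):
                # Rule 1: AAA PPP LLL  -->  LLL DDD _:x
                out.add((obj, dtype, next(fresh)))
            else:
                # Rule 2: AAA PPP BBB, BBB rdfd:lex LLL  -->  LLL DDD BBB
                for b, q, lit in triples:
                    if b == obj and q == "rdfd:lex" and isinstance(lit, Literal):
                        out.add((lit, dtype, obj))
    return out


ten = Literal("10")
graph = {
    ("ex:age", "rdfd:drange", "xsd:number"),
    ("Jenny", "ex:age", ten),
    ("Jim", "ex:age", "_:x"),
    ("_:x", "rdfd:lex", ten),
}
closed = drange_closure(graph)
```

In this sketch the first rule introduces a new bnode for the value of the
literal, while the second reuses the existing non-literal node as the
value.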
> 
> By the way, amazingly enough, these rules work both for the current
> 'stake' datatyping AND for the alternative; the difference is in how
> you interpret what a datatype triple asserts, exactly. The current
> version says that the literal denotes itself and the property is the
> datatype mapping; the alternative would say that the literal denotes
> the value and the property is identity. (View in fixed-width font:)
> Current:
> 
> literal subject ---<datatype property> --> object
>   |                                         |                   (graph syntax)
> denotes                                 denotes     -----------------------
>   |                                         |        (semantic interpretation)
>   \/                                        \/
> lexical item ----- datatype-mapping ----> value
> 
> Alternative:
> 
> 
> literal subject ---<datatype property> --> object
>   |                                         |                   (graph syntax)
> denotes via datatype mapping             denotes     -----------------------
>   |                                         |        (semantic interpretation)
>   \/                                        \/
> value     -----       equality       ----> value
> 
> The only difference here is that the label of the bottom arc in the
> first square has been moved to the right-hand arc in the second one.
> The top and right-hand sides are the same, and both squares commute,
> ie you get to the same place by any path.
> 
> -----
> 
> The cost would be that we would have to allow untidy literals. But I
> am increasingly thinking that our insistence on (syntactically) tidy
> literals is irrational. Bear in mind that this is an issue ONLY in
> the graph syntax. In any lexicalisation of an RDF graph (RDF/XML or
> Ntriples, but also any other lexicalization that could be sent as a
> character string), the graph-ids themselves cannot possibly be
> syntactically tidy.
> 
> The interesting thing is that we can have (a kind of) semantic
> tidiness even when the graph syntax isn't tidy. The way to do this is
> to say that what a literal node denotes is a particular *occurrence*
> of a character string; the one on that node, in fact. That is,
> literal *nodes* denote themselves. There can be several character
> strings "10" going around, and one of them might be in French and
> another of them might be labelled as being XML, and so on. But any
> properties of them which depend only on the actual characters in them
> apply to all of them or to none of them together. The situation is
> just like properties in RDF: there could be two properties with the
> same property extension (each is a subPropertyOf the other) but
> themselves having different properties. Similarly for classes. We
> would be putting literals into the same kind of 'intensional'
> category: there could be two different literals with the same string.
> Identity for literals is indicated by the actual node in the graph
> syntax. Notice that all the examples so far have been tidy, in fact.
> 
> As with any other untidy literals proposal, this would require a
> slight extension to the N-triples notation to provide a way to
> indicate which literals in which triples were supposed to be on the
> same node; we could do this by the same device that we currently use
> for bnodes, so that the above examples might look like this:
> 
> _:x "10"  rdf:sourceDeclaration  "xml version="1.0"" .
> _:x "10"  rdf:xmlLang  "FR"  .
> 
> Jenny ex:age _:y "10" .
> _:y "10" xsd:number _:x .
> Jim ex:age _:x .
> 
> where the first one uses the additional (handy) convention that if no
> node ID is indicated for a literal, then the literal is unique to its
> node.
> 
> What about the Cannes entailment? Well, it depends now on the details
> of the graph. If there is only one literal *node* which is both the
> age of Jenny and the title of some movie, then indeed you can
> conclude that there is a single thing which is both an age and a
> title. The inference rule can be succinctly stated as follows: given
> any node in any graph, it is OK to erase its label. That is, you can
> replace a uriref with a bnode, and you can rub out any literal label
> of a node (leaving a bnode). On the other hand, if the original graph
> has got two literal nodes, both labelled "10", and one is the title
> and the other is the age, then no, you can't infer that there is a
> single thing that is both. But then you couldn't do that even if the
> nodes were bnodes, in this case, so you shouldn't *expect* to be able
> to make this inference in this case, seems to me. (I'm sure that Dan's
> response will be that he doesn't want this to be a legal graph, for
> just this reason.)
> 
> ----
> 
> Anyway, I offer this idea for consideration by the WG. I think that
> (apart from the above-mentioned change to Ntriples) allowing literal
> subjects will only simplify the rest of the documentation. The graph
> syntax will be easier to specify, and the MT rules will be
> simplified. Datatyping will be easier to specify and easier to
> understand, and various exceptions and restrictions can just be
> forgotten about.
> 
> One final rationalization is that we could have an RDFS closure rule
> 
> LLL rdf:type rdfs:Literal .
> 
> for any literal LLL, which obviously makes very good sense and is
> conspicuously missing at the moment.
> 
> Pat
> 
> 

--
               
Patrick Stickler              Phone: +358 50 483 9453
Senior Research Scientist     Fax:   +358 7180 35409
Nokia Research Center         Email: patrick.stickler@nokia.com
Received on Friday, 28 June 2002 04:38:38 UTC