RE: Untidy literals from Patrick.Stickler@nokia.com on 2002-08-29 (www-rdf-comments@w3.org from July to September 2002)

From: <Patrick.Stickler@nokia.com>
Date: Thu, 29 Aug 2002 12:27:50 +0300
To: <seth@robustai.net>, <www-rdf-comments@w3.org>
Message-ID: <A03E60B17132A84F9B4BB5EEDE57957B5FBAAE@trebe006.europe.nokia.com>
> -----Original Message-----
> From: ext Seth Russell [mailto:seth@robustai.net]
> Sent: 28 August, 2002 20:26
> To: www-rdf-comments@w3.org
> Subject: Untidy literals
> 
> 
> 
> re: 
> http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Aug/0247.html
> 
> Where Patrick.Stickler@nokia.com says:
> 
> [[ The present situation, as I see it, is that
> 4. The community clearly favors untidy literals ]]
>  
> Well I was there and I certainly don't remember being asked if I
> favored untidy literals or not.  I do remember being asked to choose
> between mutually distasteful options.

Fair enough ;-)

> ... that being said ...
> 
> As an implementer I'm not necessarily against untidy literals, I just
> simply do not understand how literals being untidy in the MT will
> affect my implementation, if at all.
> 
>     How will (should) untidy literals in the MT affect an
> implementation of an RDF application ??
> 
> ... that being asked ....
> 
> Let me see if my application view of untidy literals matches with
> the WG :

Well, I won't speak for the WG, but I'll offer some comments
in terms of what my understanding of tidy versus untidy literals 
encompasses.

> I think of a literal as a fixed sequence of binary digits .. for
> example '1001100110011001' that is presented to my application as a
> sequence of Unicode characters or some other such thing depending on
> the middleware I'm using.  My application can store that sequence of
> characters in dozens of places in memory ... in that sense I would be
> dealing with that literal as untidy .. just like I deal with bNodes.

Well, there is the issue of syntactic untidiness (multiple
occurrences of the same literal string repeated in memory)
and, more importantly, semantic untidiness (each occurrence
of the same string-equal literal may denote a different
datatype value).

What you are talking about here is syntactic untidiness, which
one would expect to avoid in an actual implementation, so long
as it can be done without losing semantic untidiness.

I.e., compressing multiple occurrences of the same string-equal
literal into a single memory location is fine, so long as
that doesn't preclude assigning different interpretations to
the occurrences themselves, based on the context of the literal
occurrence.
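
A minimal sketch of that separation in Python, under stated assumptions:
the class name LiteralOccurrence and the helper make_occurrence are
hypothetical, invented here for illustration. The string is stored once
(syntactically tidy), while each occurrence keeps its own node identity
and datatype context (semantically untidy).

```python
import sys
from dataclasses import dataclass
from typing import Optional


@dataclass(eq=False)  # eq=False: occurrences compare by identity, not by string
class LiteralOccurrence:
    """One occurrence of a literal in a graph (hypothetical class)."""
    lexical_form: str               # shared, interned string
    datatype: Optional[str] = None  # context; may differ per occurrence


def make_occurrence(text: str, datatype: Optional[str] = None) -> LiteralOccurrence:
    # sys.intern returns one canonical copy per distinct string,
    # so equal strings share storage.
    return LiteralOccurrence(sys.intern(text), datatype)


age = make_occurrence("10", "xsd:integer")
# Build the second string at run time so it starts life as a separate object.
payday = make_occurrence("".join(["1", "0"]), "xsd:gDay")

same_storage = age.lexical_form is payday.lexical_form  # one string in memory
same_node = age is payday                               # two distinct occurrences
```

The point of the design is only that sharing the string does not force
the two occurrences to share an interpretation.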

> To be efficient (because there are a lot of these strings and some of
> them are extremely long), my application contrives to store that string
> just once and points to it wherever it is used.  In that sense, may I
> assume that it is dealing with the literal itself as tidy?

Syntactically tidy, yes.

> Now I can contrive that nobody from outside of my application can
> tell whether I am doing that or not .. this I can do by dealing with
> the pointers to the literals in an untidy manner.  But must I build
> in this extra level of untidiness in my application?  I simply do not
> know based upon the discussions I have heard.
> 
> Philosophically speaking, are literals actually untidy?  

Insofar as literals may constitute lexical forms and the
interpretation of a lexical form is contextual according
to the datatype in question, yes.

Much, if not most, use of inline literals presumes a datatype
akin to xsd:string, but a lot of inline literals are meant
to be interpreted according to other datatypes, which to
date has simply been left unspecified at the RDF layer and
relegated to application-specific semantics.

CC/PP is a good example of this, where e.g. BytesPerPixel
takes an inline literal, a lexical form, which is interpreted
as denoting an integer value. The true value of the
BytesPerPixel property is not a string, it's an integer,
so this should be explicit at the RDF layer, not the
application layer (IMO, others may disagree).

> I mean every time you encounter '1001100110011001' do you encounter
> the *same* '1001100110011001' or is it a different one?

It depends on whether you are talking about the lexical form
(string) or what that lexical form denotes. You may 
repeatedly encounter the same lexical form but not 
necessarily encounter the same value as denoted by
that lexical form. In fact, every single occurrence of
that lexical form may denote a completely different value,
a completely different thing in the universe.

Literals are just local names, and local names are ambiguous.

That's why we have constructs such as URIs, so that we have
a means to attach names to things which have globally consistent
meaning and are never ambiguous.

> Certainly you encounter it in a different context, ..... yes ... but
> is it a different thing every time you encounter it ?   Well,
> *outside of the context of the encounter* , can you distinguish one
> of the '1001100110011001' from another one of the '1001100110011001' ?
> 
>      I think not.

Well, given the lack of machinery in RDF at present, I agree
that it is difficult to distinguish between different contextual
interpretations of the same lexical form (at least in a standardized
manner). But that is what the untidy datatyping approach is meant
to rectify (the alternative tidy approach simply formalizes this
inability to express the contextual meaning of inline literals at
the RDF layer).

Let's take a simple example. Given the lexical representation
"10", does that always mean the same thing? Does that always denote
the same value? Consider the following literals-in-context:

  (xsd:integer, "10")
  (xsd:gDay, "10")
  (xsd:string, "10")

Now, in the first case, "10" denotes the integer value 'ten'. In
the second case, "10" denotes the tenth day of the month. And in
the third case, "10" denotes the Unicode string '10'. Thus, the semantics
of the lexical form "10" is contextual and untidy -- it does not act
as a global constant as does a URIref or bnode ID. It does not always
mean the same thing. The integer ten is not equal to the tenth day
of the month is not equal to the string '10' even if they all have
identical lexical representations.
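
The three cases can be sketched in Python. This is only an illustration
under assumptions: the LEXICAL_TO_VALUE table and value_of function are
hypothetical names, the values are tagged with a kind so that the integer
ten and the tenth day stay distinguishable, and the simplified lexical
form "10" is used for xsd:gDay as in the example above (XML Schema itself
writes that value as "---10").

```python
from typing import Callable, Dict, Hashable, Tuple

Value = Tuple[str, Hashable]  # (kind of value, the value itself)

# Hypothetical lexical-to-value mappings, one per datatype.
LEXICAL_TO_VALUE: Dict[str, Callable[[str], Value]] = {
    "xsd:integer": lambda s: ("integer", int(s)),
    "xsd:gDay": lambda s: ("day-of-month", int(s)),  # simplified lexical form
    "xsd:string": lambda s: ("string", s),
}


def value_of(datatype: str, lexical_form: str) -> Value:
    """Interpret a lexical form in the context of a datatype."""
    return LEXICAL_TO_VALUE[datatype](lexical_form)


values = [value_of(dt, "10") for dt in ("xsd:integer", "xsd:gDay", "xsd:string")]
all_distinct = len(set(values)) == len(values)  # same string, three different values
```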

In this sense, a literal is similar to an XML local name. And a
datatype context is similar to a namespace. The local name 'foo' may
mean different things in different namespaces, just as a given
lexical form such as "10" may mean different things for different
datatypes.

Now, we could in fact decide that at the RDF layer, we won't capture
the contextual untidy semantics of literals, but just say that all
we are dealing with are lexical forms (strings) and applications are
free to impose contextualized interpretations on those strings as
they choose. This would be the tidy option. Fair enough (technically, 
at least).

But that means that RDF reasoners which base their inferences
on the RDF MT alone will never be able to capture the fact that
two different values are in fact meant (at some level) by the
same lexical form. They have no choice but to treat all string-equal
literals as equivalent in meaning (which technically they would be,
given a tidy MT), and this could lead to entailments which would
arguably be non-intuitive to users and contrary to the intended
meaning. E.g., with a tidy MT, the following entailment would hold:

   Jenny age "10" .
   Fred payday "10" .
   Movie title "10" .
 
entails

   Jenny age _:x .
   Fred payday _:x .
   Movie title _:x .

I.e., the precise technical meaning of the above entailment is that
the lexical form for Jenny's age is the same as the lexical form for
Fred's payday, which is the same as the lexical form for the movie's
title. That is technically correct. But the meaning of the entailment,
per the likely intended and/or perceived meaning of the above statements
(and in terms of what applications are likely to interpret them as
meaning), is that Jenny's age is the same as Fred's payday is the same
as the movie's title, i.e. that an integer is the same as a day of the
month is the same as a string, which clearly is false insofar as the
real world is concerned (at least the one I live in ;-)

Now, if we took untidy literal semantics (and abstract syntax), then the
above entailment does not hold. The RDF MT would not be able to assert
any equality between the lexical representations apart from their
context, and the determination of value equality, other than in the single
case of both identical datatype and lexical form, would be relegated
to an extra-RDF application that groks the datatypes in question. I.e.

   Jenny age _:a"10" .
   Fred payday _:b"10" .
   Movie title _:c"10" .

does not entail

   Jenny age _:x .
   Fred payday _:x .
   Movie title _:x .

For all we know, the above literals *could* have an equivalent
meaning, but we can't know that given the information provided
above.
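
The difference between the two entailments can be sketched in Python,
under assumptions: graphs are plain sets of triples, the function
shares_one_object is a hypothetical helper, and it checks only the one
question at issue here, namely whether a single blank node _:x can stand
for the object of all three statements. It is not a general entailment
procedure. In the tidy graph the object node is the string itself; in the
untidy graph each object is a separate node carrying its own label.

```python
from typing import Hashable, Iterable, Optional, Set, Tuple

Triple = Tuple[str, str, Hashable]


def shares_one_object(graph: Set[Triple], pairs: Iterable[Tuple[str, str]]) -> bool:
    """Can one blank node stand for the object of every (subject, property) pair?"""
    candidates: Optional[Set[Hashable]] = None
    for subject, prop in pairs:
        # All object nodes this graph gives for the pair.
        objects = {o for (s, p, o) in graph if s == subject and p == prop}
        candidates = objects if candidates is None else candidates & objects
    return bool(candidates)


pairs = [("Jenny", "age"), ("Fred", "payday"), ("Movie", "title")]

# Tidy: string-equal literals are one and the same node.
tidy_graph: Set[Triple] = {
    ("Jenny", "age", "10"),
    ("Fred", "payday", "10"),
    ("Movie", "title", "10"),
}

# Untidy: each occurrence is its own node, written here as (label, lexical form).
untidy_graph: Set[Triple] = {
    ("Jenny", "age", ("_:a", "10")),
    ("Fred", "payday", ("_:b", "10")),
    ("Movie", "title", ("_:c", "10")),
}

tidy_entails = shares_one_object(tidy_graph, pairs)
untidy_entails = shares_one_object(untidy_graph, pairs)
```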

However, we may make the datatyping assertions which
are implicit in the property names explicit in the RDF, thus:

   age rdfs:range xsd:integer .
   payday rdfs:range xsd:gDay .
   title rdfs:range xsd:string .

where, knowing the semantics of the above datatypes, it becomes
crystal clear that we are talking about an integer value, a day of
the month value, and a string value which, simply by coincidence,
happen to have the same lexical representation.

We could also make this distinction explicit for each occurrence,
in each statement, by specifying the datatype for each of the 
literals:

  Jenny age xsd:integer"10" .
  Fred payday xsd:gDay"10" .
  Movie title xsd:string"10" .

etc.

In the case of the implicit, inline literal _:a"10" the
systemID '_:a' is taken to denote "some" datatype, which is simply not
specified for the individual occurrence, but is provided
by a global range assertion on the property. Thus given

   age rdfs:range xsd:integer .
   Jenny age _:a"10" .

then the MT gives us

   I(_:a"10") = I(xsd:integer"10")

I.e., the node _:a"10" denotes the integer value ten.
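
A minimal sketch of that resolution step in Python, assuming a literal
node is modelled as a (datatype label, lexical form) pair and range
assertions as a dictionary from property to datatype. The function name
resolve_datatype and the property "height" are hypothetical, introduced
only for this illustration.

```python
from typing import Dict, Tuple

Literal = Tuple[str, str]  # (datatype label, lexical form)


def resolve_datatype(prop: str, literal: Literal, ranges: Dict[str, str]) -> Literal:
    """Replace an unspecified datatype label with the property's range, if known."""
    label, lexical_form = literal
    if label.startswith("_:") and prop in ranges:
        return (ranges[prop], lexical_form)
    # Either the datatype is already explicit or no range assertion exists.
    return literal


ranges = {"age": "xsd:integer", "payday": "xsd:gDay", "title": "xsd:string"}

resolved = resolve_datatype("age", ("_:a", "10"), ranges)
unresolved = resolve_datatype("height", ("_:d", "180"), ranges)  # no range given
explicit = resolve_datatype("age", ("xsd:integer", "10"), ranges)
```

Where no range assertion is available, the node simply stays unresolved,
which matches the earlier remark that we cannot know the meaning from the
statements alone.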

> In fact, when you say a literal is untidy, I believe you are confusing
> the mark with the use of the mark.  Isn't that distinction very much
> like the distinction that Frege introduced by distinguishing between
> the sense and denotation of a name ?    I think the sense of a literal
> must be untidy, but the literal itself (which sits in the model in the
> domain of discourse as that thing denoted) must be fixed and tidy.
> 
>    ... or am I confused as usual .... ?

Well, with untidy literal semantics, the sense would be untidy,
and contextual, but one could allow the denotation, the mark, to 
be syntactically tidy, as an issue of graph compression and 
memory efficiency, so long as one would not infer semantic tidiness
from the syntactic tidiness.
  
In fact, I would not expect any triple store worth its salt to
mirror an untidy abstract graph syntax religiously, even if it must
maintain the untidy semantics reflected in that abstract syntax.
There are many ways to optimize the internal representation of
the abstract graph with untidy literal nodes while preserving
the untidy semantics.

I hope the above was at least a little bit helpful.

Cheers,

Patrick
Received on Thursday, 29 August 2002 05:27:53 UTC