Re: Any use cases for untidy literals except long range datatyping? from Sergey Melnik on 2002-08-26 (w3c-rdfcore-wg@w3.org from August 2002)

From: Sergey Melnik <melnik@DB.Stanford.EDU>
Date: Mon, 26 Aug 2002 06:50:20 -0700 (PDT)
To: pat hayes <phayes@ai.uwf.edu>
cc: Sergey Melnik <melnik@DB.Stanford.EDU>, w3c-rdfcore-wg@w3.org
Message-ID: <Pine.GSO.3.94.1020826053121.29217B-100000@Hake.Stanford.EDU>
On Tue, 20 Aug 2002, pat hayes wrote:

> 
> >I'd like to restate the questions, which Jan raised recently, more explicitly.
> >
> >Much of the ongoing discussion about tidy/untidy literals amounts to 
> >arguing about different readings of a given piece of RDF/XML or 
> >NTriples syntax. From what I can tell, both tidy and untidy literals 
> >are implementable, so we have to pick one and wrap up.
> >
> >To my knowledge, untidy literals were first suggested in the 
> >context of long range datatyping (aka implicit/global idiom). 
> >Specifically, untidy literals provide a shortcut for using a bNode 
> >with a property (two triples are essentially merged into one).
> 
> I think it is rather more fundamental than this.

Can you provide a technical argument for that? What makes an untidy
literal "more" than a shortcut?

> First, bear in mind 
> that this entire issue only arises in the context of RDF graph 
> syntax; all normal lexicalized notations, including XML, are 
> inherently untidy. 

RDF makes an explicit distinction between an abstract syntax and concrete
syntaxes. Concrete syntaxes may be inherently untidy. RDF abstract syntax
is, however, a data model.

> I think that databases are usually untidy as well 
> (eg consider a RDB table where the second column is all integers and 
> the third column is all strings: we could have ten in one and "10" 
> in the other, and that would not be considered an implementation 
> issue, right?)

There is a difference between defining a column type as integer and
treating it as a social security number in applications. For a database
management system, the values of this column are integers, not more and
not less, with whatever semantics integers have. The relational model is
damn clear about that.

A couple of days ago I listened to a talk by Chris Date at VLDB'02 in Hong
Kong (to a large extent, it's due to Chris that the relational model took
off two decades ago). He talked about the foundations of the relational
model, and about datatyping as well. Guess what: he presented datatyping
just as suggested in the latest proposal. The gist is that the
relational model is agnostic about the nature and internal structure of
datatype values, which may just as well be XML documents and what not.

Database systems have been doing perfectly well w/o untidy
literals in the data model.

> Second, the basic point is that people will, whether 
> we like it or not, tend to use things like numerals as names of 
> numbers. They are so used almost universally throughout human 
> discourse and all programming languages.

It's fine to use all kinds of implicit notations in concrete syntaxes. RDF
abstract syntax is a modeling language; it's about being explicit, or you
are in trouble. If you have a numeral that is a name of a number, why
not model this explicitly?

> However, we are committed to 
> RDF incorporating XML datatypes, which means that we cannot build 
> this assumption into the language, since the meaning of a numeral 
> string depends on the datatype applied to it (it *could* be a string. 
> ) So there is a 'natural' default that we are prevented from 
> assuming. Our options are to provide a different default (literals 
> are strings unless stated otherwise) which will make some 
> implementors happy but is unlikely to be found congenial by the rest 
> of the world; or, to provide a mechanism which treats 'bare' literals 
> as having an incomplete meaning and allows the datatyping information 
> to be supplied from elsewhere (range datatyping or some external 
> assumption). But the second requires that we allow one occurrence of 
> a bare literal to be associated with a different datatype than 
> another occurrence of the same literal.

I agree that untidy literals are a neat way of doing long range
datatyping. What I'm failing to see is that either a) this use case alone
is worth the trouble of introducing untidiness, or b) there is another,
more compelling use case.

As you illustrate above, using untidy literals forces us to worry about
incomplete information and defaults, and, even more importantly, it
couples the RDF data model (abstract syntax) with a schema language that
implementors might not want to support.

In contrast, tidy stuff is orders of magnitude simpler (most of the
database folks I know would have a hard time following your
explanation). Keep in mind that 99% or more of all datatype use is covered
by the XSD primitive types. Oracle, Microsoft, and IBM support (or are
just about to support) XSD in their database products. The remaining one
percent of non-trivial datatyping can be done easily using
resources and bNodes. In fact, in RDF you'd probably not stretch
datatyping to Addresses and Employees, but would use a schema language for
defining such entity types anyway.
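
To make the contrast concrete, here is a minimal sketch of the two readings
of the same two triples. It is only an illustration, not anybody's proposed
implementation: the class UntidyLiteral, the helper functions, the ex:
property names, and the RANGES table standing in for schema range
declarations are all made up for this example.

```python
import itertools

_ids = itertools.count()


class UntidyLiteral:
    """Untidy reading: one node per occurrence, distinguished by identity."""

    def __init__(self, label):
        self.label = label
        self.node_id = next(_ids)


def tidy_nodes(triples):
    """Tidy reading: a literal node is identified by its label alone."""
    return {obj for _, _, obj in triples}


def untidy_nodes(triples):
    """Untidy reading: every occurrence of a label becomes its own node."""
    return [UntidyLiteral(obj) for _, _, obj in triples]


# Two triples whose objects carry the same label "10".
triples = [("ex:a", "ex:age", "10"), ("ex:b", "ex:title", "10")]

# Hypothetical range declarations (an assumption, standing in for a schema).
RANGES = {"ex:age": int, "ex:title": str}

tidy = tidy_nodes(triples)
untidy = untidy_nodes(triples)

# Under the untidy reading each occurrence can be interpreted separately,
# but only by consulting the schema for the property it hangs off.
values = [RANGES[prop](obj) for _, prop, obj in triples]
```

In the tidy reading the graph contains a single literal node for "10". In
the untidy reading there are two nodes, and telling the integer from the
string requires the RANGES lookup, which is the coupling to a schema
language mentioned above.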

> >Is this shortcut so fundamental that there is value of making it 
> >part of the spec?
> 
> I think so, cf. above. In other words, it's not just a shortcut.

The technical argument is coming, I hope....

> But 
> even if it were, we have Mike Dean and Patrick S. insisting that 
> saving one triple per entry is critical for their applications (on 
> palm pilots and cell phones respectively, I note :-)

Most apps will happily work with the primitive types. Most
industrially deployed databases do; why won't cell phones?

> BTW, isn't range datatyping exactly like having a datatype associated 
> with a table column in a database, rather than having to rewrite it 
> in every separate entry? And isn't that normal DB practice?

Aha! In databases, as soon as you say insert data, a schema lookup is done
and the values of the correct types are generated. It's just like creating
the properly typed literals when parsing RDF/XML into an RDF graph (in the
recent proposal).
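
A rough sketch of what I mean, assuming a toy schema that maps property
names to datatypes. The names SCHEMA, PARSERS, TypedLiteral, make_literal,
and load are hypothetical and chosen only for this example; a real parser
would take the datatype information from wherever the proposal says it
lives.

```python
from dataclasses import dataclass
from decimal import Decimal

# Hypothetical schema: property name -> datatype name (assumed for illustration).
SCHEMA = {"age": "integer", "name": "string", "price": "decimal"}

# Lexical-to-value mappings for the few datatypes used here.
PARSERS = {"integer": int, "string": str, "decimal": Decimal}


@dataclass(frozen=True)
class TypedLiteral:
    lexical: str    # the characters as written in the input
    datatype: str   # the datatype found by the schema lookup
    value: object   # the value the lexical form maps to


def make_literal(prop, lexical):
    """Look up the property's datatype once, at load time, and build the value."""
    datatype = SCHEMA[prop]
    return TypedLiteral(lexical, datatype, PARSERS[datatype](lexical))


def load(rows):
    """Turn (subject, property, lexical form) rows into typed triples."""
    return [(s, p, make_literal(p, o)) for s, p, o in rows]


triples = load([("ex:jenny", "age", "10"), ("ex:jenny", "name", "10")])
```

After loading, the two objects are different literals (the integer 10 and
the string "10") even though they were written with the same characters,
and nothing downstream has to consult the schema again.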

> >Is there an appealing use case for untidy literals that is not long 
> >range datatyping (aka implicit/global idiom)?
>
> >Are we closing off any important extensibility paths if we go for 
> >tidy literals?
> 
> With regards to this last point, yes. DAML and OIL and probably OWL 
> will need the flexibility of allowing (semantically) untidy literals,

DAML and OIL folks must be using untidy literals for *something*, right?
What is their use case?

Sergey


> and if we forbid them then the DAML spec will need to be rewritten 
> and OWL will probably no longer base itself on RDF (or, an 
> alternative scenario, the Webont WG will split apart into two rival 
> groups which will produce incompatible standards. It is perilously 
> close to this already.)
> 
> Pat
Received on Monday, 26 August 2002 09:52:32 UTC