Re: Datatyping Summary V4 from Sergey Melnik on 2002-02-05 (w3c-rdfcore-wg@w3.org from February 2002)

From: Sergey Melnik <melnik@db.stanford.edu>
Date: Mon, 04 Feb 2002 19:29:36 -0800
To: Brian McBride <bwm@hplb.hpl.hp.com>
CC: RDF Core <w3c-rdfcore-wg@w3.org>
Message-ID: <3C5F51A0.F9F2104@db.stanford.edu>

Brian McBride wrote:
> 
> An updated summary of the datatyping issues, as I currently understand them.
> 
> Changes:
> 
>    B1  now disputed
>    B7  status changed to agreed
>    B9  withdrawn
>    B10 added "say what you mean"
> 
> Issue B1:
> =========
> 
> status: disputed by Sergey.  Sergey you owe us an explanation of why.
> 
> In S, if one wants to use both idiom A and idiom B, e.g.
> 
> <mary> <age> "10" .
> <age> <rdfs:range> <xsd:integer.lex> .
> 
> and
> 
> <mary> <ageD> _:a .
> _:a <xsd:integer.map> "10" .
> 
> two properties have to be used, <age> and <ageD>, in this example.
> 
> I believe there is agreement that this is a difference between the
> two proposals. Indeed, it may be said that the main aim of TDL is
> to avoid requiring different properties for these different idioms.
> 
> Can't Live With: PatrickS

If the schema designers (e.g. of DublinCore) want to ensure that all
three idioms S-A, S-B, and S-P are usable with a given property (e.g.
dc:Date), they can simply define the range of the property as a UNION of
xsd:date.val, xsd:date.lex and xsd:date.map. These three sets are
disjoint, so no clash can occur.

Moreover, the schema designers have fine-grained control with respect to
the lexical representations that each compliant DublinCore application
*must* support. For example, imagine that there is another datatype for
date, say uml:date, that shares the value space of xsd:date, but uses a
different (disjoint) lexical representation. To enforce that each
DublinCore application can handle both lexical forms we can make the
range of dc:Date a union of xsd:date.val (=uml:date.val), xsd:date.lex,
xsd:date.map, uml:date.lex, and uml:date.map. If, in contrast,
uml:date.lex and xsd:date.lex clash in some incompatible way, the range
of dc:Date could comprise just xsd:date.map and uml:date.map, or a union
of xsd:date.val, xsd:date.lex, xsd:date.map, and uml:date.map.

No "second" property is needed in the above examples.
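
A minimal Python sketch of the union-range idea, for illustration only.
It assumes a plain YYYY-MM-DD lexical form for xsd:date and models the
datatype mapping as a set of (lexical form, value) pairs; the helper
names (parse, in_val, in_lex, in_map, in_range) are hypothetical and
not part of any proposal.

```python
import re
from datetime import date

# Assumed lexical form for xsd:date in this sketch: YYYY-MM-DD, no time zone.
DATE_LEX = re.compile(r"\d{4}-\d{2}-\d{2}")

def parse(lex):
    """The datatype mapping: lexical form -> value."""
    return date.fromisoformat(lex)

def in_val(node):
    # xsd:date.val, the value space
    return isinstance(node, date)

def in_lex(node):
    # xsd:date.lex, the lexical space
    if not (isinstance(node, str) and DATE_LEX.fullmatch(node)):
        return False
    try:
        parse(node)
    except ValueError:  # e.g. "2002-13-45" has the right shape but is no date
        return False
    return True

def in_map(node):
    # xsd:date.map, modeled here as pairs (lexical form, value)
    return (isinstance(node, tuple) and len(node) == 2
            and in_lex(node[0]) and node[1] == parse(node[0]))

def in_range(node):
    """Range of dc:Date declared as the union of the three sets."""
    return in_val(node) or in_lex(node) or in_map(node)
```

Because a date, a string, and a pair are different kinds of things, a
node satisfies at most one of the three predicates, so a single
property with the union range can carry any of the three idioms.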

Remark:

Note that a schema is like a contract. Imagine we are in the position
of DublinCore, i.e. we have to design a schema that ensures maximum
interoperability between compliant applications. If, for example, we
decide to enforce a specific lexical representation of a certain
datatype, we could use S-P. On the other hand, if the schema needs
maximum flexibility, we could take S-A to allow lexical representations
to evolve with time. In that case, the "contract" merely states that
a certain value space is under consideration, and puts forth no further
requirement on the lexical encoding. Both variants, with "decoupled"
and with "coupled" lexical representations, are useful.

 
> Issue B2: Multiple Lexical Representations of a data value
> ==========================================================
> 
> status: agreed that S-A allows this and TDL does not.
> 
> S, idiom A, permits multiple lexical representations of a data value:
> 
> _:i <xsd:double> "10.1" .
> _:i <xsd:double.de> "10,1" .
> 
> Issue B3: the self entailment issue
> ===================================
> status: Withdrawn in favour of B4:
> 
> From:
> 
> http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0410.html
> 
> [[I accept the reasoning above; it doesn't address my objection;
> it just shows that my example wasn't very good. Sergey's
> example makes the point better:]]
> 
> B9 also added in response to Graham's request.
> 
> Issue B4 - TDL breaks existing code
> ===================================
> 
> status: facts agreed; significance disputed.
> 
> This is similar to B3. I've changed the example slightly from Sergey's.
> 
> Under TDL, consider the graph:
> 
> _:f <rdf:type> <film> .
> _:f <dc:Title> (_, "10") .
> <mary> <age> (_, "10").
> 
> Does not entail:
> 
> _:x <dc:Title> _:y .
> _:z <age> _:y .
> 
> Can't Live With: DanC
>
> Issue B5: Storage Requirements
> ===============================
> 
> status: disputed.
> 
> TDL requires significantly more storage to implement.

In the most recent suggestions, there is a way of indicating (by means
of syntax) which literals are to be treated as untidy. As long as not
*all* literals are required to be untidy, I withdraw the storage issue.
 
> Issue B6: S requires 4 URIs be registered for each data type
> ============================================================
> S requires that for each datatype 4 URIs be registered
> datatype
> datatype.lex
> datatype.val
> datatype.map
> 
> Sergey: Do you agree this is the case? If not, how many URIs are required
> to implement ALL the idioms of S and have them coexist in the same model?

nope ;)

Surprise: only one URI is required.
Price:    special vocabulary is needed to identify lexical spaces,
          value spaces, and datatype mappings for a given datatype.

Here is how it works. In the simplest scenario, we define three
additional properties (in total, not for each datatype), say
rdfdt:isValueSpaceOf, rdfdt:isLexicalSpaceOf, and
rdfdt:isDatatypeMappingOf. Then we write, e.g.,

dc:Date rdfs:range _:1
_:1 rdfdt:isValueSpaceOf xsd:date

Voila! Defining the semantics of the above three rdfdt: properties is
straightforward. Additionally, we can reuse xsd: URIs without concern.
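
As a rough illustration, here is a small Python sketch of how an
application might read such a range declaration. Triples are plain
tuples of strings, the range property is written rdfs:range, and
describe_range is a hypothetical helper; nothing here is a defined API.

```python
# Triples as (subject, predicate, object); names are plain strings.
triples = [
    ("dc:Date", "rdfs:range", "_:1"),
    ("_:1", "rdfdt:isValueSpaceOf", "xsd:date"),
]

# The three properties are defined once, not once per datatype.
SPACE_PROPERTIES = {
    "rdfdt:isValueSpaceOf": "value space",
    "rdfdt:isLexicalSpaceOf": "lexical space",
    "rdfdt:isDatatypeMappingOf": "datatype mapping",
}

def describe_range(prop, graph):
    """Return (kind, datatype URI) pairs describing the range of `prop`."""
    ranges = [o for s, p, o in graph if s == prop and p == "rdfs:range"]
    return [(SPACE_PROPERTIES[p], o)
            for r in ranges
            for s, p, o in graph
            if s == r and p in SPACE_PROPERTIES]

print(describe_range("dc:Date", triples))
# prints [('value space', 'xsd:date')]
```

Only the single URI xsd:date appears; which of its three spaces is
meant is carried by the rdfdt: property.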

> Issue B7: Complexity
> ====================
> 
> status: agreed
> 
> S has several ways of expressing the same thing. An RDF processor has to be
> aware of them all.

If by RDF processor you mean a general-purpose API and/or parser, I
disagree. Applications do need to deal with the diverse
representations, but even then only conditionally. Using different datatyping
idioms in schemas amounts to establishing distinct "contracts" among
applications. As explained above, it's in the hands of schema designers
to make it easier to comply with the schema (using a less flexible
representation), or harder (using a more general representation). We
only provide the tools (= datatyping idioms).

The burden of translating between different datatyping idioms would hit
an application if it needs to interoperate with another, independently
developed(!), but related application. In such a case, the schemas used
in both applications are typically heterogeneous, i.e., they use different
properties, classes etc. Thus, we have a standard problem of
data/application integration, where having different styles of
datatyping is the easiest issue by far and can even be fully automated
(in contrast, general schema mapping cannot be done fully
automatically).

> Supported by Jeremy's error cases message
> 
> http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Jan/0397.html
> 
> and a message from Andy Seaborne to rdf comments:
> 
> http://lists.w3.org/Archives/Public/www-rdf-comments/2002JanMar/0058.html
> 
> Issue B8: S-B encourages logically (sic) errors in the
> application type processing.
> =======================================================
> 
> status: ?

Sergey agrees with the significance of B8 (but can live with it).
 
> Given:
> 
> _:f <rdf:type> <film> .
> _:f <dc:Title> "10" .
> <mary> <age> "10" .
> 
> an application 'knows' that the range of <age> is an integer so it 'knows'
> that mary has <age> 10. Under S-B, running a query:
> 
> ?x <dc:Title> ?y .
> ?z <age> ?y .
> 
> will return ?x = _:f and ?z = <mary>, and knowing that the age of <mary> is
> 10, may conclude that the title of the film is also 10.
> 
> Can't Live With: Jeremy
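
A small Python sketch of the query above, assuming tidy literals are
modeled as plain strings so that the two occurrences of "10" are the
same node. The tuple encoding and the query helper are assumptions made
for illustration only.

```python
# With tidy literals, both "10" strings are one and the same node.
graph = [
    ("_:f", "rdf:type", "film"),
    ("_:f", "dc:Title", "10"),
    ("mary", "age", "10"),
]

def query(graph):
    """Answer  ?x <dc:Title> ?y . ?z <age> ?y .  by joining on ?y."""
    return [(x, z, y)
            for x, p1, y in graph if p1 == "dc:Title"
            for z, p2, y2 in graph if p2 == "age" and y2 == y]

print(query(graph))
# prints [('_:f', 'mary', '10')]
```

The join succeeds purely on the shared string, which is what invites
the mistaken conclusion about the film title.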
> 
> Issue B9: In TDL a document does not entail itself
> ==================================================
> 
> status: Withdrawn.
> 
> Under TDL, does:
> 
> <foo> <dc:Title> "W3C" .
> 
> entail
> 
> <foo> <dc:Title> "W3C" .
> 
> yes.
> 
> Issue B10: Say what you mean
> ============================
> 
> status: ?
> 
> The concern here is that in TDL, a literal denotes a pair consisting of a
> value and a lexical representation of that value.  The problem is then that
> the German representation of a floating point number, e.g. "10,5", is
> different from the English representation, e.g. "10.5".
> 
> Thus under TDL a German 10 and a half is a different thing from an English
> 10 and a half.
> 
> More formally, under TDL:
> 
>    <foo>      <eg:size>   _:s1 .
>    _:s1       <rdf:value> "10,5" .
>    _:s1       <rdf:type>  <xsd:double-de> .
> 
>    <bar>      <eg:size>   _:s2 .
>    _:s2       <rdf:value> "10.5" .
>    _:s2       <rdf:type>  <xsd:double> .
> 
> does not entail:
> 
>    <foo> <eg:size> _:s .
>    <bar> <eg:size> _:s .
> 
> Does anyone dispute the facts, or that this is a significant issue?

I believe the above issue is closely related to B1 and B2... I'd like to
raise another issue:

Issue B11: Misuse of datatypes
==============================

Given untidy graphs, it is possible to create a "datatype" for persons
and another one for names, so that the literal "Martyn" may represent a
person if it occurs in one context, and a person's name if it occurs in
another. Thus, untidy graphs facilitate ambiguous modeling techniques.
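
To make the concern concrete, a minimal Python sketch follows. The two
"datatypes" eg:person and eg:personName, and the denotation helper, are
hypothetical names invented for this example.

```python
# Two hypothetical "datatypes" that share the lexical form "Martyn".
INTERPRETATIONS = {
    "eg:person": lambda lex: ("person", lex),      # denotes the person
    "eg:personName": lambda lex: ("name", lex),    # denotes the name string
}

def denotation(literal, datatype):
    """In an untidy graph, what a literal denotes depends on its context."""
    return INTERPRETATIONS[datatype](literal)

as_person = denotation("Martyn", "eg:person")
as_name = denotation("Martyn", "eg:personName")
```

The same literal yields two different denotations, and nothing in the
literal itself tells a reader which one was intended.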

Sergey

-- 
E-Mail:      melnik@db.stanford.edu (Sergey Melnik)
WWW:         http://www-db.stanford.edu/~melnik
Tel:         OFFICE: 1-650-725-4312 (USA)
Address:     Room 438, Gates, Stanford University, CA 94305, USA
Received on Monday, 4 February 2002 22:12:39 UTC