Re: suggestions for datatyping (long) from Pat Hayes on 2001-10-25 (w3c-rdfcore-wg@w3.org from October 2001)

From: Pat Hayes <phayes@ai.uwf.edu>
Date: Thu, 25 Oct 2001 11:02:31 -0500
To: Sergey Melnik <melnik@db.stanford.edu>
Cc: w3c-rdfcore-wg@w3.org
Message-Id: <p0510108bb7fde0cb1017@[205.160.76.193]>
>
>b) typing information can either be represented in an instance graph
>    only, in a schema graph only, or both.

Can you clarify this distinction? I wasn't aware that we had such a 
distinction in RDF (?)

>The suggestion that I'd like you to think about concerns the
>dimension (a).
>
><SUG2>: to focus on representing typing info in the triple structure
>         and keep literals atomic.

I'm not quite clear exactly what is meant by 'in the triple 
structure'. (It makes sense that one should be able to tell from the 
triples that a particular literal token is supposed to be, say, an 
XSD: integer. I don't think it makes sense to require that the entire 
content of what that means, ie the entire XSD spec, should be 
represented or encoded in the triples.) If you mean the first, OK.

>
>3 DATATYPES AND DATATYPING
>--------------------------
>
>3.1 Value spaces and lexical spaces in [XSD]
>--------------------------------------------
>
>A nice conceptual intro to the datatyping issue is provided in the
>[XSD] document. According to [XSD], each datatype is characterized by
>a *value space* and a *lexical space*. For example, take the type
>"decimal". Its value space is the set of all arbitrary-precision decimal
>numbers, whereas its lexical space includes all character strings that
>match a certain pattern. In [XSD], a datatype definition specifies a
>mapping between the value space of the datatype and its lexical
>space. Notice that in general more than one lexical token may map to
>the same data value.
>
>My working understanding of the [XSD] document in terms of the current
>model theory draft is that the elements of a lexical space are literal
>values.

That is not mine. I would characterize literals as the lexical space 
and literal values as the value space. That is the working assumption 
behind the pfps/ph datatyping extension to the MT.
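
A minimal sketch of that reading, in Python rather than in the MT: 
literals are strings in a lexical space, and a datatype maps each of 
them to a literal value in the value space. The function name 
decimal_value and the regular expression are assumptions made for 
illustration; the real XSD pattern for "decimal" has more detail.

```python
import re
from decimal import Decimal

# Assumed, simplified pattern for the lexical space of "decimal".
DECIMAL_LEXICAL = re.compile(r"[+-]?(\d+(\.\d*)?|\.\d+)")


def decimal_value(literal: str) -> Decimal:
    """Map a literal (lexical space) to its literal value (value space)."""
    if not DECIMAL_LEXICAL.fullmatch(literal):
        raise ValueError(f"{literal!r} is not in the lexical space")
    return Decimal(literal)


# Four distinct literals that all map to one literal value.
literals = ["570", "0570", "570.0", "+570.00"]
values = {decimal_value(s) for s in literals}
```

The set collapses to a single element, which is the many-to-one 
behaviour described in the quoted section 3.1.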

>
>3.2 Datatypes as classifiers in [UML,CWM]
>-----------------------------------------
>
>[UML] and [CWM] treat datatypes as some kind of classes (or
>classifiers in UML terminology). In other words, datatypes have
>"instances" which are called *data values*. Here are the relevant
>quotes:
>
>[UML], Sec 2.5.2.14: "Datatype"
>
>     A data type is a type whose values have no identity; that is, they
>     are pure values. Data types include primitive built-in types (such
>     as integer and string) as well as definable enumeration types
>     (such as the predefined enumeration type boolean whose literals
>     are false and true).
>
>[CWM], Sec 7.6.1.1: "DataValue"
>
>     A data value is an instance with no identity. In the metamodel,
>     DataValue is a child of Instance that cannot change its state,
>     i.e. all operations that are applicable to it are pure functions
>     or queries that do not cause any side effects. DataValues are
>     typically used as attribute values.  Since it is not possible to
>     differentiate between two data values that appear to be the same,
>     it becomes more of a philosophical issue whether there are several
>     data values representing the same value or just one for each
>     value. In addition, a data value cannot change its data type and
>     it does not have contained instances.
>
>[UML], Sec 2.5.2.34: "Primitive"
>
>     A Primitive defines a predefined DataType, without any relevant
>     UML substructure; that is, it has no UML parts. A primitive
>     datatype may have an algebra and operations defined outside of UML
>     (for example, mathematically). Primitive datatypes used in UML
>     itself include Integer, UnlimitedInteger, and String.  The
>     run-time instances of a Primitive datatype are DataValues. The
>     values are in many-to-one correspondence to mathematical elements
>     defined outside of UML (for example, the various integers).
>
>[UML], Sec 2.5.4.10: "Miscellaneous"
>
>     ... A data type is a special kind of classifier, similar to a class,
>     but whose instances are primitive values (not objects). For
>     example, the integers and strings are usually treated as primitive
>     values. A primitive value does not have an identity, so two
>     occurrences of the same value cannot be differentiated. Usually,
>     it is used for specification of the type of an attribute. An
>     enumeration type is a user-definable type comprising a finite
>     number of values. ...
>
>Translated into RDF terms, a data value corresponds to a bNode in a
>graph.

I disagree. That begs several important questions, but in any case a 
bNode can denote any kind of value. Why would we want to say that 
bNodes *are* values?

>3.2 Datatyping: classes or mappings?
>------------------------------------
>
>As pointed out above, reading [XSD] gives an impression that
>datatyping is a kind of mapping that establishes a relationship
>between data values and literal values. In contrast, [UML] talks
>merely about the value spaces of datatypes and does not say anything
>about their lexical spaces. As a consequence, [UML] does not establish
>any mappings between value spaces and lexical spaces of the primitive
>datatypes. Still, [UML] does define the "features" of value spaces
>that include ordering, operations etc.
>
>To sum up, the specs [XSD,UML,CWM] utilize two abstract concepts:
>
>- datatype as a class(ifier)
>- datatyping as a mapping between a value space and a lexical space
>
>My feeling is that both views may be useful for representing typed
>data (just as wave-particle dualism is helpful for explaining
>different phenomena in physics ;). On the one hand, if data values do
>not have fixed URI identifiers, we need a *mapping* that allows us to
>identify resources as data values using their lexical representations.
>On the other hand, for defining and restricting datatypes, the class
>view is superior (although it looks like the class view is in
>principle dispensable).

I think we can have both. We have a class/property distinction at the 
basis of RDFS, and it seems natural to map this entire discussion 
into that vocabulary. Data type mappings are rather like (the 
extensions of) properties assigning data values to lexical strings, 
and the ranges of these properties are the classifiers whose class 
extensions are the sets of data values themselves.
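
As a rough illustration only, here is a finite fragment of such a 
mapping written out in Python. The names LEXICAL_STRINGS, 
INT_MAPPING_EXT and INT_CLASS_EXT are hypothetical; the point is just 
that the mapping behaves like a property extension (a set of pairs) 
and that its range is the class extension of the datatype.

```python
# Hypothetical finite fragment of the lexical space of an integer datatype.
LEXICAL_STRINGS = ["5", "05", "570", "0570"]

# Property-like extension: pairs <lexical string, data value>.
INT_MAPPING_EXT = {(s, int(s)) for s in LEXICAL_STRINGS}

# The range of the mapping is the class extension of the datatype.
INT_CLASS_EXT = {value for _, value in INT_MAPPING_EXT}


def is_instance(value) -> bool:
    """Class view: membership in the set of data values."""
    return value in INT_CLASS_EXT
```

Both views come from the same set of pairs: the mapping view keeps 
the pairs, the class view keeps only their second components.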

>
>3.3 Literal properties as "datatyping mappings"?
>------------------------------------------------
>
>One final point that I'd like to make before turning to examples is
>that properties with literal values possess a high resemblance to
>datatyping mappings.

Right, exactly. They differ only in their special relationship to the 
RDF syntax.

>Assume that the interpretation of each literal
>symbol is fixed and is determined by its textual contents.

No, do not make that assumption! That begs the central question. That 
is the entire point of datatyping, that this assumption breaks down 
for literals, so datatyping is required.

>Then, since
>each literal symbol denotes just a lexical token,

Why does it *denote* a lexical token? It *is* a lexical item.

>it presumably does
>not make sense to use it as object for properties like "age", "size",
>"price", "weight", etc. In fact, such use would suggest that e.g. the
>weight of a thing is a lexical token; typically, we'd like it to
>denote some abstract entity that corresponds to say 5 pounds.

No, no. If I USE a literal as a value, I am not MENTIONING a lexical 
token; I am using the literal to indicate a literal value. So for 
example by writing

phayes weightAtAge50inPounds "165" .

I am saying that my weight was 165 pounds, not that it was a lexical item.
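
To make the point concrete, here is a small sketch in which the value 
indicated by a literal depends on the datatype mapping in force. The 
table DATATYPE_MAPPINGS and its three entries are assumptions chosen 
for illustration, not part of any spec.

```python
# Hypothetical datatype mappings, each a function from literals to values.
DATATYPE_MAPPINGS = {
    "integer": int,
    "string": str,
    "octal": lambda s: int(s, 8),
}


def literal_value(literal: str, datatype: str):
    """Return the value that a literal indicates under a given datatype."""
    return DATATYPE_MAPPINGS[datatype](literal)


weight = literal_value("165", "integer")   # the number 165
as_text = literal_value("165", "string")   # a three-character string
as_octal = literal_value("165", "octal")   # the number 117
```

The same literal yields three different values, so the textual 
contents alone do not settle what is being said.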

>
>In other words, for most meaningful representations, we can think of a
>property whose objects are literals as a mapping that associates a
>value space with some lexical space.

No, that is what the datatyping mapping does, not the property. It is 
LIKE a property, but it is not itself an RDF property. If we assume 
that, then we are begging the question, since we have simply 
described the datatyping in RDF; and then there is no datatyping as 
such.

>In yet other words, each
>literal-valued property may be thought of (by convention) as a
>"datatyping property" (also referred to as "interpretation property"
>by TimBL).
>
>If <SUG2> turns out to be acceptable, the next thing I would suggest
>to nail down is the nature of literals. A further proposal from my
>side would therefore be
>
><SUG3>: the interpretation of each literal symbol is fixed
>         and is determined by its textual contents.

If we adopt this convention then there is no need to invoke any 
special treatment of datatyping in RDF itself, since all the 
datatyping is purely a lexical matter. (?) Seems to me that this 
trivialises the discussion.

>
>4 EXAMPLES
>----------
>
>Alright, enough babble is enough. Below, I've picked some examples
>that illustrate several use cases of datatyping.
>
>Example 1: numbers
>------------------
>
>This first example illustrates the use of value spaces and "datatyping
>properties":
>
>_x1 xsd:int       "570"
>_x2 xsd:float     "0570"
>
>_x3 rdf:type      uml:Integer
>_x3 inHundreds   "5.7"
>
>_y1 rdf:type      cwm:java:double
>_y1 rdf:value     "5.7"
>
>_y2 rdf:type      cwm:sql:REAL
>_y2 rdf:value     "570E-3"
>_y2 nist:float-de "5,7"
>
>_c1 realPart      _x1
>_c1 imaginaryPart _y2
>
>A possible interpretation for the above set of statements is shown in
>the figure below (see Appendix). In the figure, the red and green dots
>represent entities in the domain of discourse (resources and literal
>values). The arcs connecting the dots represent elements of the
>corresponding property extensions. For instance, the arc labeled with
>I(xsd:int) in the figure represents pair <I(_x1),I("570")> contained
>in IEXT(I(xsd:int)). The subdivision into lexical spaces is omitted
>for clarity (recall that each lexical space is a set of literal
>values).
>
>For argument's sake, assume that xsd:int and xsd:float are properties
>that map between value spaces and lexical spaces (it need not be that
>way). The domains of these properties are the value spaces of 64-bit
>integers and 64-bit floats, as defined in [XSD]. In contrast,
>uml:Integer, cwm:java:double, and cwm:sql:REAL denote "classes", i.e.
>the value spaces themselves.
>
>In the interpretation shown in the figure, bNodes _x1,_x2,_x3 map to
>the same resource in the domain of discourse, as do _y1 and _y2.
>
>Property "inHundreds" illustrates a datatyping property that
>associates the value spaces of UML Integers with a lexical
>representation in which the number is represented by a different
>lexical token (this token, by the way, looks more like a proper real
>number!). If the domain of inHundreds is defined to be an integer
>value space, then the statement (_x3, rdf:type, uml:Integer) seems
>like an overspecification. However, such overspecification may still
>be of some limited use for applications that just understand
>uml:Integer and know nothing about the property inHundreds.
>
>Notice that the property rdf:value is used to map literal value "5.7"
>to a double-precision number _y1 instead of using some special
>"datatyping property". Such an approach seems possible when we assume
>that there exists some "default" datatyping mapping for a certain
>datatype. Such a "default" mapping may be defined by the value-space
>class (like cwm:java:double) in a schema. In the above example the
>"default" datatyping mapping for cwm:java:double may correspond to the
>mapping induced by say xsd:double. In turn, the property/mapping
>xsd:double is defined pseudo-axiomatically in [XSD].
>
>The complex number _c1 is represented using two real numbers.
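
The resolution strategy described for Example 1 could be sketched 
roughly as follows. This assumes, as the quoted text does for 
argument's sake, that datatyping properties are functions from 
literals to values, and that a schema names a default mapping per 
class. The tables DATATYPING and DEFAULTS and the function data_value 
are hypothetical, and Python's float stands in for both xsd:float and 
xsd:double.

```python
from decimal import Decimal

# Some of the triples from Example 1, as (subject, property, object).
TRIPLES = [
    ("_x1", "xsd:int", "570"),
    ("_x2", "xsd:float", "0570"),
    ("_x3", "rdf:type", "uml:Integer"),
    ("_x3", "inHundreds", "5.7"),
    ("_y1", "rdf:type", "cwm:java:double"),
    ("_y1", "rdf:value", "5.7"),
]

# Assumption: each datatyping property denotes a literal-to-value function.
DATATYPING = {
    "xsd:int": int,
    "xsd:float": float,
    "xsd:double": float,
    "inHundreds": lambda s: int(Decimal(s) * 100),
}

# Assumption: a schema supplies a default datatyping property per class.
DEFAULTS = {"cwm:java:double": "xsd:double"}


def data_value(node, triples=TRIPLES):
    """Return the data value denoted by a bNode, or raise LookupError."""
    classes = [o for s, p, o in triples if s == node and p == "rdf:type"]
    for s, p, o in triples:
        if s != node:
            continue
        if p in DATATYPING:
            # An explicit datatyping property decides the value.
            return DATATYPING[p](o)
        if p == "rdf:value":
            # Fall back to the default mapping of a declared class.
            for cls in classes:
                if cls in DEFAULTS:
                    return DATATYPING[DEFAULTS[cls]](o)
    raise LookupError(f"no datatyping information for {node}")
```

Under these assumptions _x1, _x2 and _x3 all resolve to the same 
number, which matches the interpretation described for the figure.
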
>
>
>Example 2: xml:lang and base64
>------------------------------
>
>_w1 urn:lang:en-us "flashlight"
>
>_w2 rdf:type       nist:lang-de/Word
>_w2 rdf:value      _l1
>_l1 base64         "VGFzY2hlbmxhbXBl"
>
>_w3 urn:lang:de    "Taschenlampe"
>
>A possible interpretation for the above statements is shown in the
>figure. Notice that similarly to the distinction between datatypes and
>datatyping, language tagging can be realized using properties or using
>classes. Analogously, rdf:value is used in the example using the
>assumption that some schema describes what the default type mapping is
>for the class nist:lang-de/Word (alternatively, this information can
>be considered "built-in").
>
>Furthermore, note that in this example the literal value
>"Taschenlampe" appears as a sort of a "subject of an interpreted
>statement" even if it is not used as a subject node in a graph or at a
>subject position in the Ntriple notation (this supports Pat's point
>that allowing literal symbols in subjects is fine).
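
For what it is worth, the base64 property in Example 2 can be read as 
a datatyping mapping like any other. A small sketch, assuming UTF-8 
as the character encoding (the example does not say) and using the 
hypothetical function name base64_value:

```python
import base64


def base64_value(literal: str) -> str:
    """Map a base64 literal to the string it encodes (UTF-8 assumed)."""
    return base64.b64decode(literal, validate=True).decode("utf-8")


# Encode the German word, then recover it through the mapping.
encoded = base64.b64encode("Taschenlampe".encode("utf-8")).decode("ascii")
decoded = base64_value(encoded)
```

Here the value obtained through the base64 mapping is itself a 
string, the same one that appears as the object of urn:lang:de.
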
>
>Example 3: units
>----------------
>
>Well, once we have a way to represent numbers, representing units may
>be less of a pain. Again, several alternatives are possible, for
>example:
>
>_x1 weight   _x2
>_x2 rdf:type WeightInPounds
>_x2 numeric  _x3
>_x3 xsd:int  "5"
>
>OR
>
>_x1 weightInPounds _x2
>_x2 numeric  _x3
>_x3 rdf:type uml:Integer
>_x3 rdf:value "5"
>
>etc. The property "numeric" is used to relate the concept of "5
>pounds" to a numeric data value.
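
A rough sketch of how an application might read both alternatives of 
Example 3 and arrive at the same quantity. The Quantity type and the 
helpers objects, number and weight_of are hypothetical, as is the 
choice to treat uml:Integer with rdf:value as a plain integer.

```python
from typing import NamedTuple


class Quantity(NamedTuple):
    magnitude: int
    unit: str


FIRST = [
    ("_x1", "weight", "_x2"),
    ("_x2", "rdf:type", "WeightInPounds"),
    ("_x2", "numeric", "_x3"),
    ("_x3", "xsd:int", "5"),
]
SECOND = [
    ("_x1", "weightInPounds", "_x2"),
    ("_x2", "numeric", "_x3"),
    ("_x3", "rdf:type", "uml:Integer"),
    ("_x3", "rdf:value", "5"),
]


def objects(triples, subject, prop):
    """All objects of triples with the given subject and property."""
    return [o for s, p, o in triples if s == subject and p == prop]


def number(triples, quantity):
    """Follow 'numeric' and resolve the literal to an integer."""
    (n,) = objects(triples, quantity, "numeric")
    for literal in objects(triples, n, "xsd:int"):
        return int(literal)
    if "uml:Integer" in objects(triples, n, "rdf:type"):
        (literal,) = objects(triples, n, "rdf:value")
        return int(literal)
    raise LookupError(f"no numeric value for {quantity}")


def weight_of(triples, thing):
    # Alternative 1: the unit is carried by the class of the quantity node.
    for q in objects(triples, thing, "weight"):
        if "WeightInPounds" in objects(triples, q, "rdf:type"):
            return Quantity(number(triples, q), "pound")
    # Alternative 2: the unit is carried by the property itself.
    for q in objects(triples, thing, "weightInPounds"):
        return Quantity(number(triples, q), "pound")
    raise LookupError(f"no weight for {thing}")
```

Either graph gives Quantity(5, "pound") for _x1; the two encodings 
differ only in where the unit and the datatype information live.
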
>
>Example 4: constants
>--------------------
>
>Some datatypes define fixed constants. For example, [UML] defines a
>concept of an infinitely large integer. Instantiation (rdf:type) could
>be used for this purpose as in
>
>uml-dt:*  rdf:type  uml-dt:UnlimitedInteger
>
>(as appears in
>http://www-db.stanford.edu/~melnik/rdf/uml/uml-datatypes-20000507.rdf)
>
>
>CONCLUSION
>----------
>
>The purpose of the examples above is to illustrate how datatyping may
>be introduced and interpreted semantically by utilizing value spaces,
>lexical spaces and mappings between them, and to support the
>suggestions <SUG2>, <SUG1>, and <SUG3>.
>
>I do recognize that datatyping remains a hard issue. Specifically, it
>is not clear how structured or derived datatypes defined say by means
>of [XSD] can be used meaningfully in RDF. Another stumbling block in
>datatyping is parameterized datatypes, or templates, like CHAR(20) in
>SQL. A basic problem is that enumerating all character datatypes
>CHAR(N) is impractical. After reading [UML,CWM] it is still unclear to
>me how templates can be interpreted semantically.

I agree this kind of thing is tricky to get right. Syntactic 
parameters make the semantics much more complicated.
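
One way to avoid enumerating every CHAR(N), offered only as a sketch 
and not as a semantic account, is to treat the template as a function 
from the parameter to a datatype mapping. The function char below is 
hypothetical, and the blank-padding rule is an assumption modelled on 
common SQL behaviour.

```python
from typing import Callable


def char(n: int) -> Callable[[str], str]:
    """Return a hypothetical datatype mapping for CHAR(n)."""
    def mapping(literal: str) -> str:
        if len(literal) > n:
            raise ValueError(f"{literal!r} is longer than {n} characters")
        # Assumption: values are padded with blanks to exactly n characters.
        return literal.ljust(n)
    return mapping


char20 = char(20)
padded = char20("Pensacola")
```

The parameter is then an ordinary argument, but the question of what 
the template itself denotes is still open.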

>Regarding the model theory for datatyping, I don't think that
>datatyping needs some special treatment if we go along with <SUG2> and
><SUG3>.

SUG3 makes it kind of trivial, seems to me, if I follow it.

>Of course, it would be fine to introduce a "shortcut" notation
>for datatyping in the model theory if necessary, just like ICEXT is
>defined as a shortcut in the MT draft.
>
>For now, that's all folks!
>
>Sergey
>Attachment converted: Macintosh HD:rdf-literals3.gif (GIFf/prvw) (00028E3B)


-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 25 October 2001 12:02:45 UTC