The X Datatype Proposal from Patrick.Stickler@nokia.com on 2001-11-12 (w3c-rdfcore-wg@w3.org from November 2001)

From: <Patrick.Stickler@nokia.com>
Date: Mon, 12 Nov 2001 17:25:01 +0200
To: w3c-rdfcore-wg@w3.org
Message-ID: <2BF0AD29BC31FE46B78877321144043162173C@trebe003.NOE.Nokia.com>
          Definition of X Proposal, with examples

This is the definition of my X Proposal, as it has been named,
expressed (unfortunately) in non-mathematical terms, to the best
of my ability.

Although I am currently digesting the present MT and the other
proposals being offered, I have tried to avoid including any
direct discussion of those other proposals as I felt it would
lengthen an already long document and possibly add confusion as
to the boundaries between the different proposals. Certainly, this
proposal has many points of intersection with the others, and those
hopefully will be obvious, but this document is expressed independently
of the other proposals.

I have organized the content in the following (non-conventional)
manner: I first provide a glossary of terms. This is so that you
know right off how I am using terms that are recognizable to you
but may not match identically your expected meaning. It also provides
a brief introduction to new terms that I use in this description,
as a preview of things to come.  I then provide a list of assumptions,
assertions, and my summary of the problem and the solution offered
by this proposal, so that they do not get lost in the subsequent
discussion and missed. This is then immediately followed by a
detailed discussion of the problem space and my proposed solution,
including a discussion of where URVs fit into all this.

I provide examples throughout which, I hope, clearly illustrate
the concepts outlined in this proposal.

It may be that this proposal is too radical for the
present scope of our charter, and its adoption may correspond
to a new version (either major or minor) of RDF. If this is so,
then I assert that (a) the data typing problem cannot be properly
solved within the constraints of our present charter, in line
with common interpretation of the RDF and RDFS specs, and (b)
we must commence the definition of such a new version of RDF as
soon as possible.  I hope that this will become evident from the
discussion below.

======================================================================

GLOSSARY OF TERMS

value space

        An abstract set of entities sharing common properties
        (very loose definition)

value

        A member of a value space

representation space

        A set of concrete representations mapping to values in a
        value space which facilitate automated operations
        in terms of those values -- e.g. the reification of
        a value space within a computer system

representation

        Within a representation space, a concrete representation
        of a value in the corresponding value space

canonical representation space

        A representation space where each value in the value space
        has only one possible representation in the representation
        space (the internal representation space of a computer system
        is a canonical representation space)

lexical space

        A set of concrete lexical representations (strings) which
        represent values in a specific value space, defined in
        terms of a lexical grammar

lexical form

        Within a lexical space, a concrete lexical representation
        (string) of a value in the corresponding value space, which
        is valid according to the defined lexical grammar

canonical lexical space

        A lexical space where each value in the value space
        has only one possible representation in the lexical space

data type

        An explicit lexical space whose members map to
        values in an explicit value space

(RDF) literal

        A string

typed (RDF) literal

        A lexical form

local type

        A data type associated directly with an occurrence of a
        value serving as the object of a statement

global type

        A data type associated globally with all occurrences of a
        value serving as the object of a statement having a particular
        predicate (i.e. via an rdfs:range definition)

descriptive range

        A range definition for a particular predicate defining a global
        type for all values of that predicate

prescriptive range

        A range constraint for a particular predicate defining a global
        type which all local types for all values must be equivalent to
        (either identical to, or a subclass of, the defined range class)

node

        The basic construct of an RDF graph, per this proposal

node facet

        A primitive property of a graph node serving as the 
        label of an arc

arc

        A named relation between two nodes, from the
        perspective of one node (source node) towards the
        other (target node), corresponding to a facet

LNode

        A node representing a resource labeled by an RDF Literal

UNode

        A node representing a resource labeled by a URI Reference

SNode

        A node representing an RDF Statement

BNode

        A node representing an anonymous resource with no label

qualifying statement

        A statement where the subject is represented by an SNode

statement qualification

        A limitation on the applicability of a statement for
        certain processes; such as scope, source, authority,
        or authentication

literal match

        The binding of a statement to a query where the statement and
        query are expressed in the same vocabulary and in terms of the
        same data typing scheme

inferred match

        The binding of a statement to a query where the statement and
        query are not expressed in the same vocabulary and/or in terms
        of the same data typing scheme but which are deemed equivalent
        according to rdfs:subClassOf or rdfs:subPropertyOf relations
        between those vocabularies

level 0 graph

        A maximal representation of an X Proposal graph
        where every node from every statement is distinct, and
        having no compression whatsoever

level 1 merge

        A transformation on a level 0 graph such that all UNodes with
        identical uriref labels, and all SNodes whose subject, predicate,
        and object nodes are all UNodes with respectively identical
        uriref labels, are merged

level 1 graph

        A graph which is derived from a level 0 graph by means
        of a level 1 merge, either virtually or destructively

======================================================================

PROPOSAL IN A NUTSHELL

Assumptions and Assertions:

The representation and interpretation of data types should be:
a. consistent
b. explicitly defined by the RDF specification
c. as neutral as possible with regards to data type scheme
d. compatible with XML Schema data types

The solution adopted must:
a. not deviate significantly from the present specification, either
   with regards to XML serialization or graph representation
b. be sufficiently future proof to allow for extension to address
   known or future issues with minimal impact to existing systems

No interpretation of data types will be provided by RDF. Any
interpretation of RDF encoded knowledge based on a defined correlation
between an RDF node and a particular data type is application
specific and beyond the scope of RDF.  RDF will only concern itself
with the specification of relationships between nodes and types,
and the preservation of such information for interpretation in
contexts outside the scope of RDF, not the interpretation itself.

Typed literals constitute lexical forms within a given lexical
space and which map to values in a given value space.

The proper interpretation of a typed literal requires both the
lexical form and the identity of the lexical and value space for
which the lexical form is expressed.

Separation of a lexical form from either the lexical space or
value space for which it was originally expressed renders it
uninterpretable in a reliable manner.

The rdfs:range property may function as either prescriptive
or descriptive, depending on the presence or absence of a local
type for the object of a statement.

In order for rdfs:range to function prescriptively, there must
be both:
a. a range value defined for the property of a statement
b. a local type defined for the object of the statement

In the absence of a local type, and in the presence of a range
definition for a given property, the type of the object of a statement
is taken to be that defined as the range of the property.
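
The rule above can be sketched as a small decision function. This is
only an illustration, not part of the proposal: the class name
RangeRoleSketch, the Role enum, and the use of plain strings for type
names are all assumptions made for the example.

```java
import java.util.Optional;

final class RangeRoleSketch {
    /** The role rdfs:range plays for the object of one statement. */
    enum Role { PRESCRIPTIVE, DESCRIPTIVE, NONE }

    /** Prescriptive needs both a range and a local type; a range alone is descriptive. */
    static Role roleOf(Optional<String> localType, Optional<String> range) {
        if (!range.isPresent()) {
            return Role.NONE;
        }
        return localType.isPresent() ? Role.PRESCRIPTIVE : Role.DESCRIPTIVE;
    }

    /** The type used to read the lexical form: the local type if present, else the range. */
    static Optional<String> typeForInterpretation(Optional<String> localType,
                                                  Optional<String> range) {
        return localType.isPresent() ? localType : range;
    }

    public static void main(String[] args) {
        Optional<String> noLocalType = Optional.empty();
        Optional<String> range = Optional.of("xsd:integer");
        System.out.println(roleOf(noLocalType, range));
        System.out.println(typeForInterpretation(noLocalType, range));
    }
}
```

Whether a prescriptive check then succeeds (local type identical to, or
a subclass of, the range) is a separate step that this sketch leaves out.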

Query processes, while not explicitly defined by the RDF specification,
should be taken into account with regards to the representation and
interpretation of RDF encoded knowledge.

Query processes which employ inference based on rdfs:subPropertyOf
relations may bind objects to predicates which are superordinate to
the predicate of the original statement.

Query processes which employ inference based on rdfs:subClassOf
relations may bind literals to types which are superordinate to
the type originally defined for the literals.

Query processes which bind a non-locally typed literal to a superordinate
predicate different from that of the original statement and which
may have a range defined which differs from the range defined
for the original predicate effectively separate the lexical form
embodied in that literal from the lexical space for which it was
originally expressed, rendering it uninterpretable in a reliable
manner.

Query processes which bind a locally typed literal to a superordinate
type different from that originally defined for the literal effectively
separate the lexical form embodied in that literal from the lexical
space for which it was originally expressed, rendering it uninterpretable
in a reliable manner.

----------------------------------------------------------------------

Conclusions:

In the absence of a local type, range may be descriptive.

In the absence of a local type, range cannot be prescriptive.

In the presence of a local type, range may be prescriptive.

We MUST impose the requirement that all data type classes
define a value space that is a proper subset of the value
space of all superordinate data type classes.

We CANNOT impose the requirement that all data type classes
define a lexical space that is a proper subset of the lexical
space of all superordinate data type classes.

The reliable interpretation of non-locally typed literals
by rdfs:range definitions requires the absolute persistent
preservation of the binding between predicate and object per the
original statement.

The reliable interpretation of locally typed literals
requires the absolute persistent preservation of the binding
between object and type per the original statement.

----------------------------------------------------------------------

Proposed Solution:

The basis for the graph representation, and all operations and
interpretations, should be the explicit reification of the
statement. An RDF graph should represent the statements which
constitute knowledge, and the present RDF graph model should be
seen as a higher level resource-centric view or interpretation
of that underlying statement-centric graph.

Thus, rather than the present graph representation:

   [urn:foo] --- urn:someProperty ---> "bar"

we should have instead, for every statement, a canonical
underlying representation as follows:

      [ ]
       |
       ---- ID ----------> 1
       |
       ---- type --------> SNode
       |
       ---- subject -----> [ ]
       |                    |
       |                    ------ ID ------> 2
       |                    |
       |                    ------ type ----> UNode
       |                    |
       |                    ------ label ---> <urn:foo>
       |
       ---- predicate ---> [ ]
       |                    |
       |                    ------ ID ------> 3
       |                    |
       |                    ------ type ----> UNode
       |                    |
       |                    ------ label ---> <urn:someProperty>
       |
       -----object ------> [ ]
                            |
                            ------ ID ------> 4
                            |
                            ------ type ----> LNode
                            |
                            ------ label ---> "bar"

which can be more concisely represented as:

      [1,S]
        |
        ---- subject -----> [2,U,urn:foo]
        |
        ---- predicate ---> [3,U,urn:someProperty]
        |
        -----object ------> [4,L,bar]

or minimally represented as

      [1,S,2,3,4]
      [2,U,urn:foo]
      [3,U,urn:someProperty]
      [4,L,bar]

This model and its graph notation, along with two possible implementational
representations (in Java and Relational Tables) are described
in detail below.  

Again, the current RDF graph representation is merely a resource-centric
logical view or interpretation of the latter representation, and
the latter statement-centric representation is the key to the
data type solution and is the heart of this proposal.

The statement-centric graph representation provides the key constructs
necessary for preserving the relationships between predicate and
non-locally typed value, and between local type and value, such that meaningful
constraints on query operations and interpretation of query results
can be defined.  It also provides the key construct necessary
for addressing the needs of statement qualification, such as source,
authority, scope, and authentication; as well as for the differentiation
between general statements, asserted statements, and inferred
statements.

A query on an RDF graph always matches and returns complete statements,
not object values or other partial knowledge, and if a statement is
matched by inference, then either the original statement is returned
as-is (such that all original knowledge is preserved and available
for reliable interpretation) or the query engine is responsible
for deriving and returning an entirely new statement from the
original statement, expressed in terms of the query ontology and
data type scheme, taking into account all issues relating to mapping
and conversion of literals to conform to the lexical and value
space of the query ontology and data type scheme (and since it
has the original statement to work with, it has all the information
needed for reliable interpretation).

Thus, statements are not just first-class constructs in the graph, they
are the *primary* constructs of the graph and the basis for interpretation
and interaction with graph encoded knowledge.

All of this is discussed in detail below.

======================================================================

DISCUSSION

Descriptive vs. Prescriptive role of rdfs:range

Given the present RDF graph model, and "standard" behavior
of inference derived binding based on triples with non-locally
typed literal objects, the rdfs:range property may only be
safely descriptive of a literal value's data type iff RDF
requires that any data type that is a rdfs:subClassOf any
other data type constitute a perfect subset of both the value
space and lexical space of the superordinate data type,
and that any property that is an rdfs:subPropertyOf another
property have a range defined that is a data type which
is either equivalent to or a descendant of the range type
for the superordinate property.

This ensures that if a non-locally typed literal value
is bound by inference to a property superordinate to the one
for which it was originally defined, any application
which determines the type of that literal via the defined
range for the superordinate property will be able to
interpret its lexical form reliably to obtain the properly
corresponding value.

If the above constraints cannot be enforced, and we continue
with the present graph model where inference may separate
a value from the predicate of the original statement, then
rdfs:range can only serve a prescriptive purpose, to ensure
that locally typed literal values correspond to the specified
data type.

Furthermore, it means that non-locally typed literals
may not have a reliable interpretation in all inference
derived contexts and therefore, it should be strongly
advisable to always specify the type of literals when
they are defined.

It should be noted, that the XML Schema simple data types
do *not* conform to the above constraints, even if RDF were
to impose them. Furthermore, there are likely to be numerous
data type schemes which also do not conform to such tight
constraints, and thus it would be imprudent and impractical
to propose the adoption of such constraints.

HOWEVER, the above is only true for the present graph model...

By basing the graph model on the reification of the statement,
and defining the behavior of query processes such that original
statements are returned in their original state, and defining
inference processes such that inferred matches return the original
statement unchanged (only the match being inferred) or generate
new statements derived from the original statements (including
dealing with all lexical issues for the interpretation of lexical
forms embodied in literals), we can ensure that even if a given
statement is matched by inference, the entire original statement
is returned -- providing also the original predicate by which,
via its range definition, the type and lexical form of a non-locally
typed literal can be properly interpreted, either by the receiving
client or by the query engine itself for the purpose of deriving
an inferred statement accordingly.

Thus, data type integrity, the reliable interpretation of untyped
literals by property range, *and* the qualification of statements
for scope, source, authority, authentication, etc. are all
addressed by the following proposed graph model, which has
as its foundation the reification of the statement itself.

======================================================================

PROPOSED CANONICAL GRAPH REPRESENTATION

The following is a graph representation which is based on the
reified statement as its foundational construct.

The current RDF graph model may be seen as a logical view or interpretation
of this proposed model, and thereby, this model does not conflict
with, nor replace the current graph model, but rather serves as
a new foundational layer below it, as a basis for the MT interpretation
of RDF encoded knowledge.  

Types (classes) of graph nodes:

   SNode  Statement Node
   UNode  URIRef Labeled Node
   LNode  Literal Labeled Node
   BNode  Blank Node

The distinction between the types of nodes is relevant both for
the allowed/required facets as well as for merge operations performed
on graphs of different levels of representation (explained below).

Facets (properties) of graph nodes:

   ID          SysID
   type        (SNode|UNode|LNode|BNode)
   label       for UNode, URI Reference
               for LNode, RDF Literal
   subject     for SNode, SysID
   predicate   for SNode, SysID
   object      for SNode, SysID

NOTE: Although facets constitute "properties" of graph nodes, they
      are not represented by RDF Statements, but are primitives of
      the underlying graph representation.

A node is required to have one and only one facet value for the
properties ID and type, and may have at most one facet value for
the property label.

An SNode is required to have one and only one facet value for each
of the properties subject, predicate, and object.

The value of a label for a UNode must be a URI Reference.

The value of a label for an LNode must be an RDF Literal.

Thus, an RDF Statement is reified by an SNode and that reification
is the basis for this revised graph model and its interpretation.

----------------------------------------------------------------------

Graph notation:

A node is represented by a comma separated sequence of ID, type,
and (if present) label which is surrounded by square brackets. The
type is represented by an uppercase character S, U, L, or B denoting
an SNode, UNode, LNode, or BNode respectively. I.e.

     '[' ID ',' [SULB] ( ',' label )? ']'

E.g.

      [1,S]
      [3,U,urn:someProperty]
      [4,L,bar]
      [9,B]

Subject, predicate, and object facets may be represented by arcs
with the facet name serving as the arc label and the arc represented
by an arrow terminating in the value of the facet. E.g.

      [1,S] ---- subject -----> [2,U,urn:foo]

In cases where the graph is too large to explicitly make the connection,
the node ID value can be shown instead. I.e.

      [1,S] ---- subject -----> 2

An absolute minimal representation can be provided as a list of
node definitions such that for SNodes, the values of the subject,
predicate, and object facets are listed by node ID in that order,
and the arcs are implicit. E.g.

      [1,S,2,3,4]
      [2,U,urn:foo]
      [3,U,urn:someProperty]
      [4,L,bar]
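
As an illustration of the notation just described, here is a minimal
sketch of a reader for single node definitions, covering both the
bracketed form and the minimal SNode form. The class NotationSketch and
its Node holder are hypothetical names; the sketch also assumes that IDs
contain no commas and that anything after the second comma of a UNode or
LNode belongs to the label.

```java
final class NotationSketch {
    /** One parsed node definition; facets that are absent are null. */
    static final class Node {
        final String id;
        final char type;                          // 'S', 'U', 'L' or 'B'
        final String label;                       // UNode / LNode only
        final String subject, predicate, object;  // SNode only (node IDs)

        Node(String id, char type, String label,
             String subject, String predicate, String object) {
            this.id = id;
            this.type = type;
            this.label = label;
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    static Node parse(String text) {
        String s = text.trim();
        if (s.length() < 2 || s.charAt(0) != '[' || s.charAt(s.length() - 1) != ']') {
            throw new IllegalArgumentException("not a node definition: " + text);
        }
        // Split into ID, type and an optional remainder (a label may contain commas).
        String[] parts = s.substring(1, s.length() - 1).split(",", 3);
        if (parts.length < 2 || parts[1].length() != 1 || "SULB".indexOf(parts[1]) < 0) {
            throw new IllegalArgumentException("bad ID or type: " + text);
        }
        char type = parts[1].charAt(0);
        String rest = parts.length == 3 ? parts[2] : null;
        if (type == 'S' && rest != null) {
            // Minimal form: subject, predicate and object IDs, in that order.
            String[] spo = rest.split(",");
            if (spo.length != 3) {
                throw new IllegalArgumentException("an SNode lists three IDs: " + text);
            }
            return new Node(parts[0], type, null, spo[0], spo[1], spo[2]);
        }
        if (type == 'B' && rest != null) {
            throw new IllegalArgumentException("a BNode has no label: " + text);
        }
        return new Node(parts[0], type, rest, null, null, null);
    }

    public static void main(String[] args) {
        Node n = parse("[1,S,2,3,4]");
        System.out.println(n.id + " -> " + n.subject + " " + n.predicate + " " + n.object);
    }
}
```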

NOTE: If UUID values (or similar) are employed as system identifiers
      for the values of ID facets, then knowledge encoded in this
      graph representation would be fully portable without modification
      across disparate systems and applications.

----------------------------------------------------------------------

Asserted and Inferred Statements:

At this level of representation, an SNode does not necessarily
represent an asserted statement nor an explicitly defined statement
(e.g. loaded from some serialized instance). Assertion and nature
of definition are qualifications of the statement (statements
about the statement) and such qualifications are defined in terms
of RDF Statements and not in terms of graph primitives (facets
of SNodes). A statement is just a statement. Its significance,
status, role, purpose, relevance, etc. in a given context must be
inferred from its qualifications. This is outlined in more detail
immediately below.  

----------------------------------------------------------------------

Qualification of Statements

Although the issues relating to the reification and general qualification
of statements have been deferred to future working groups, this
proposal includes as a component a treatment of these issues as
the mechanism by which statements are differentiated for assertion
and inference on the basis of this reification. This same treatment
also serves to address other types of statement qualification
such as scope, source, authority, and authentication.

NOTE: If this treatment of statement qualification and reification
      is deemed acceptable to the WG, we may choose to readdress
      some of the recently deferred issues and outline their solution
      in terms of this proposed treatment.

It must be stressed that, according to this proposal, the key
to solving the data type problem -- namely, the reified statement
construct -- is also the key to solving the statement qualification
problem, thus this proposal essentially kills both birds with
one stone.  

For a given process or operation, relevant statements can be identified
by specifying qualifications as either inclusive (only statements
matching those qualifications) or exclusive (no statements matching
those qualifications), and of course a combination of inclusive
and exclusive qualifications can be defined.
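
One way to picture inclusive and exclusive qualification is as a filter
over statements. The sketch below is only an assumption about how that
might look: QualificationFilterSketch, Qualification, and select are
hypothetical names, statements are identified by plain string IDs, and
"matching" is read as "carries every inclusive qualification and none of
the exclusive ones".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

final class QualificationFilterSketch {
    /** A qualification of a statement, e.g. (rdfq:assertedBy, #John). */
    static final class Qualification {
        final String property;
        final String value;

        Qualification(String property, String value) {
            this.property = property;
            this.value = value;
        }

        @Override
        public boolean equals(Object other) {
            if (!(other instanceof Qualification)) {
                return false;
            }
            Qualification q = (Qualification) other;
            return property.equals(q.property) && value.equals(q.value);
        }

        @Override
        public int hashCode() {
            return Objects.hash(property, value);
        }
    }

    /**
     * Returns the IDs of statements that carry every inclusive qualification
     * and none of the exclusive ones, in the iteration order of the map.
     */
    static List<String> select(Map<String, Set<Qualification>> qualificationsByStatement,
                               Set<Qualification> inclusive,
                               Set<Qualification> exclusive) {
        List<String> kept = new ArrayList<>();
        for (Map.Entry<String, Set<Qualification>> e : qualificationsByStatement.entrySet()) {
            Set<Qualification> carried = e.getValue();
            boolean hasAllInclusive = carried.containsAll(inclusive);
            boolean hasNoExclusive = Collections.disjoint(carried, exclusive);
            if (hasAllInclusive && hasNoExclusive) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }
}
```

An empty inclusive set keeps everything that is not excluded, and an
empty exclusive set excludes nothing, so either kind can be used alone.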

Several examples are provided below illustrating statement qualification
based on the following treatment and the proposed statement-centric
graph representation.

Ontology for Statement Qualification:

   rdfq:scope              domain = rdf:Statement, range = {URI Ref}
      rdfq:source          domain = rdf:Statement, range = {URI Ref}
      rdfq:authentication  domain = rdf:Statement, range = {URI Ref}
      rdfq:attributedTo    domain = rdf:Statement, range = {URI Ref}
      rdfq:assertedBy      domain = rdf:Statement, range = {URI Ref}

The latter four qualification properties are sub-properties of
rdfq:scope which constitutes a generic qualification property.

Note that the concept of an 'inferred statement' is defined in
terms of the authority which asserts the statement or to which
the statement is attributed; where that authority may be the
system itself or an inference agent employed by the system.

Whether a given operation wishes to include only asserted statements
from trusted authorities, or also include statements attributed
to trusted authorities (hearsay) or include all statements is up
to the particular application.

It should be stressed that because qualifications are statements,
and statements are always reified in this graph model, the qualifications
themselves may be qualified. 

NOTE: In the examples that follow, for convenience I use qualified
      names having xsd:, rdf:, rdfs: and rdfq: prefixes for vocabulary
      terms corresponding to XML Schema data types, RDF, RDFS,
      and the above qualification ontology. Such qnames are enclosed
      in curly brackets. The curly brackets are not part of the notation
      (which does not understand namespaces). These qnames are not to be
      confused with URN or URV encodings such as urn:foo or xsd:lang:en
      which are complete URIs. I trust this distinction will be
      clear in all examples.  I also employ local URI refs without
      expansion (e.g. #green) though in practice, all URI refs should
      have their fully expanded representation.


Example 1: "John says that Mary says that Bob says the sky is green":

            -----------> [1,S]
            |              |
            |              ---- subject ------> [2,U,#Sky]
            |              |
            |              ---- predicate ----> [3,U,#is]
            |              |
            |              ---- object -------> [4,U,#green]
            |
      -----------------> [5,S]
      |     |              |
      |     --- subject ----
      |                    |
      |                    ---- predicate ----> [6,U,{rdfq:attributedTo}]
      |                    |
      |                    ---- object -------> [7,U,#Bob]
      |      
  ---------------------> [8,S]
  |   |                    |
  |   --------- subject ----
  |                        |
  |                        ---- predicate ----> [9,U,{rdfq:attributedTo}]
  |                        |
  |                        ---- object -------> [10,U,#Mary]
  |          
  |                      [11,S]
  |                        |
  ------------- subject ----
                           |
                           ---- predicate ----> [12,U,{rdfq:assertedBy}]
                           |
                           ---- object -------> [13,U,#John]
                                            

Or in maximally condensed form:

   [1,S,2,3,4]
   [2,U,#Sky]
   [3,U,#is]
   [4,U,#green]
   [5,S,1,6,7]
   [6,U,{rdfq:attributedTo}]
   [7,U,#Bob]
   [8,S,5,9,10]
   [9,U,{rdfq:attributedTo}]
   [10,U,#Mary]
   [11,S,8,12,13]
   [12,U,{rdfq:assertedBy}]
   [13,U,#John]
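
To show how the nested qualifications in Example 1 can be followed
mechanically, the sketch below loads the condensed form into two maps
and walks outward from statement 1. The class AttributionChainSketch and
the method qualifiersOf are hypothetical; the sketch assumes each
statement has at most one qualifying statement and that there are no
cycles, which holds for this example but not in general.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class AttributionChainSketch {
    // SNode ID -> {subject ID, predicate ID, object ID}, from the condensed form.
    static final Map<Integer, int[]> STATEMENTS = new HashMap<>();
    // UNode ID -> label, from the condensed form.
    static final Map<Integer, String> LABELS = new HashMap<>();

    static {
        STATEMENTS.put(1, new int[] {2, 3, 4});
        STATEMENTS.put(5, new int[] {1, 6, 7});
        STATEMENTS.put(8, new int[] {5, 9, 10});
        STATEMENTS.put(11, new int[] {8, 12, 13});
        LABELS.put(2, "#Sky");
        LABELS.put(3, "#is");
        LABELS.put(4, "#green");
        LABELS.put(6, "rdfq:attributedTo");
        LABELS.put(7, "#Bob");
        LABELS.put(9, "rdfq:attributedTo");
        LABELS.put(10, "#Mary");
        LABELS.put(12, "rdfq:assertedBy");
        LABELS.put(13, "#John");
    }

    /** Lists the qualifications stacked on a statement, innermost first. */
    static List<String> qualifiersOf(int statementId) {
        List<String> chain = new ArrayList<>();
        Integer current = statementId;
        while (current != null) {
            Integer next = null;
            for (Map.Entry<Integer, int[]> e : STATEMENTS.entrySet()) {
                int[] spo = e.getValue();
                if (spo[0] == current) {          // a qualifying statement about 'current'
                    chain.add(LABELS.get(spo[1]) + " " + LABELS.get(spo[2]));
                    next = e.getKey();
                    break;
                }
            }
            current = next;
        }
        return chain;
    }

    public static void main(String[] args) {
        for (String qualifier : qualifiersOf(1)) {
            System.out.println(qualifier);
        }
    }
}
```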


Example 2: Typed, Scoped Values

Here is an example of how this same treatment provides for general
qualification of statements, including statements defining scoping
and data type association (here's how this proposal addresses
the core problem of data typing): 

            -----------> [1,S]
            |              |
            |              ---- subject ------> [2,U,#status]
            |              |
            |              ---- predicate ----> [3,U,{rdf:label}]
            |              |
            |              ---- object -------> [4,L,Status]
            |
            |            [5,S]
            |              |
            --- subject ----
                           |
                           ---- predicate ----> [6,U,{rdfq:scope}]
                           |
                           ---- object -------> [7,L,en]
                                                  ^ ^
            --------------------------------------| |
            |                                       |
            |            [8,S]                      |
            |              |                        |
            |              ---- subject -------------
            |              |
            |              ---- predicate ----> [9,U,{rdf:type}]
            |              |
            |              ---- object -------> [10,U,{xsd:lang}]
            |
      -----------------> [11,S]
      |     |              |                        
      |     --- subject ----
      |                    ---- predicate ----> [12,U,{rdf:label}]
      |                    |
      |                    ---- object -------> [13,L,English]
      |
      |                  [14,S]
      |                    |
      --------- subject ----
                           |
                           ---- predicate ----> [15,U,{rdfq:scope}]
                           |
                           ---- object -------> [16,L,en]
                                                  ^ ^
                                          ... ----| |
                                                    |
            ... ------>  [17,S]                     |
                           |                        |
                           ---- subject -------------
                           |
                           ---- predicate ----> [18,U,{rdf:type}]
                           |
                           ---- object -------> [19,U,{xsd:lang}]

or

   [1,S,2,3,4]
   [2,U,#status]
   [3,U,{rdf:label}]
   [4,L,Status]
   [5,S,1,6,7]
   [6,U,{rdfq:scope}]
   [7,L,en]
   [8,S,7,9,10] 
   [9,U,{rdf:type}]
   [10,U,{xsd:lang}]
   [11,S,7,12,13]
   [12,U,{rdf:label}]
   [13,L,English]
   [14,S,11,15,16]
   [15,U,{rdfq:scope}]
   [16,L,en]
   [17,S,16,18,19]
   [18,U,{rdf:type}]
   [19,U,{xsd:lang}]
   ...

NOTE: Notice the infinite recursion required for labeling and typing
      of locally typed literals. See below for examples of how
      URVs alleviate this problem without recourse to rdfs:range
      definitions or application specific knowledge. Furthermore,
      this potentially infinite body of knowledge must be defined
      for *every* instance of such qualification, resulting in
      a gross proliferation of needlessly redundant statements.
      See the URV example below for a better way to encode such
      knowledge.  

----------------------------------------------------------------------

Levels of Graph Compression/Distillation

The proposed graph model includes the definition of two levels
of graph representation:

Level 0: Maximal Representation
   Every node from every statement is distinct. No
   compression whatsoever.

Level 1: URI Ref Equivalence
   UNodes with identical uriref labels, and SNodes
   whose subject, predicate, and object nodes are all
   UNodes with respectively identical uriref labels,
   are merged

There may be additional levels of graph compression, based
on inference or other criteria, but those are undefined by
this graph model.
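
A level 1 merge, as defined above, might be computed along the following
lines. This is a sketch under stated assumptions rather than a defined
algorithm: Level1MergeSketch and its Node type are hypothetical, node
IDs are plain integers, and the first node seen with a given label (or
triple) is arbitrarily chosen as the representative. LNodes, BNodes, and
SNodes that refer to anything other than three UNodes are left distinct.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class Level1MergeSketch {
    /** A level 0 node; s, p and o are node IDs and are meaningful for SNodes only. */
    static final class Node {
        final int id;
        final char type;      // 'S', 'U', 'L' or 'B'
        final String label;   // UNode / LNode only, else null
        final int s, p, o;

        Node(int id, char type, String label, int s, int p, int o) {
            this.id = id;
            this.type = type;
            this.label = label;
            this.s = s;
            this.p = p;
            this.o = o;
        }
    }

    /** Maps every level 0 node ID to the ID of the node representing it at level 1. */
    static Map<Integer, Integer> merge(List<Node> level0) {
        Map<Integer, Node> byId = new HashMap<>();
        for (Node n : level0) {
            byId.put(n.id, n);
        }

        // Step 1: UNodes with identical uriref labels share the first such node.
        Map<Integer, Integer> rep = new HashMap<>();
        Map<String, Integer> firstByLabel = new HashMap<>();
        for (Node n : level0) {
            Integer first = (n.type == 'U') ? firstByLabel.putIfAbsent(n.label, n.id) : null;
            rep.put(n.id, first == null ? n.id : first);
        }

        // Step 2: SNodes whose subject, predicate and object are all UNodes with
        // respectively identical labels share the first such SNode.
        Map<String, Integer> firstByTriple = new HashMap<>();
        for (Node n : level0) {
            if (n.type != 'S' || !isU(byId, n.s) || !isU(byId, n.p) || !isU(byId, n.o)) {
                continue;
            }
            String key = rep.get(n.s) + " " + rep.get(n.p) + " " + rep.get(n.o);
            Integer first = firstByTriple.putIfAbsent(key, n.id);
            if (first != null) {
                rep.put(n.id, first);
            }
        }
        return rep;
    }

    private static boolean isU(Map<Integer, Node> byId, int id) {
        Node n = byId.get(id);
        return n != null && n.type == 'U';
    }
}
```

Returning a mapping rather than rewriting the nodes corresponds to a
"virtual" merge; a destructive merge would apply the mapping and drop
the nodes that are no longer their own representative.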

An API can provide access to statements at any of the defined
levels (presuming all are maintained) and the upper levels can
be simulated at run time as needed. A given system, however, may
choose to maintain knowledge only at a higher level (e.g. level 1)
performing merge operations on insertion of statements into the
system, for the sake of storage efficiency, as this can reduce
a graph's size considerably and the utility of a level 0
representation is limited.

The direct benefits of level 1 compression are discussed further
below (though they should be immediately evident).

----------------------------------------------------------------------

Constraints on Query and Inference Behavior

A query applied to an RDF graph, based on this proposed representation,
must return only SNodes, not LNodes, UNodes, or BNodes.

Statements can be filtered as needed/desired either during or
after execution of a query according to any specified qualifications,
such as excluding non-asserted statements or statements not
having a particular scope or trusted authority.

Statements which are returned by a query are returned as originally
defined.

Any statements matched by inference rather than by literal match
must either be returned in their original form or be mapped by
the query API to new statements using the query vocabulary and
data type schemes of the query properties, without change to the
original statements. Whether the API interns the new statements
in the knowledge base or treats them only as transient statements,
to be discarded after returning the query results, is up to the
specific implementation or process.

Queries may differentiate between non-asserted statements, asserted
statements, and inferred statements as needed/desired, as this
distinction is just like any other statement qualification.
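
A minimal sketch of such filtering, assuming a qualification is
simply a statement whose subject is the qualified statement, could
look like this. The names ScopeFilter, Statement, and query are
hypothetical, and prefixed names such as rdfq:scope are treated as
plain strings for brevity.

```java
import java.util.List;
import java.util.stream.Collectors;

final class ScopeFilter {
    /** An SNode. Subject and object hold a uriref String or another Statement. */
    static final class Statement {
        final Object subject;
        final String predicate;
        final Object object;

        Statement(Object subject, String predicate, Object object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Statements using the given predicate that are qualified by rdfq:scope = scope. */
    static List<Statement> query(List<Statement> graph, String predicate, String scope) {
        return graph.stream()
                .filter(s -> s.predicate.equals(predicate))
                .filter(s -> graph.stream().anyMatch(q ->
                        q.subject == s                        // q qualifies s itself
                        && q.predicate.equals("rdfq:scope")
                        && scope.equals(q.object)))
                .collect(Collectors.toList());
    }
}
```

The result holds the original Statement objects, not copies, which
matches the requirement that statements be returned as originally
defined.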

======================================================================

POSSIBLE IMPLEMENTATIONS

Representation 1: Relational Table Model

Table Schema 1: Node

Field 1: ID(UUID)
Field 2: Type('UNode'|'LNode'|'SNode'|'BNode')
Field 3: Label(URIREF|LITERAL|'nil')
Field 4: Subject(UUID|nil)
Field 5: Predicate(UUID|nil)
Field 6: Object(UUID|nil)
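
One possible reading of this schema is that Label is nil exactly
for SNodes and BNodes, and that Subject, Predicate, and Object are
non-nil exactly for SNodes. That reading is an assumption, not
something the schema states. Under it, a row could be mirrored in
memory as below, where the class name NodeRow is hypothetical and
Java null stands in for nil.

```java
import java.util.UUID;

/** Hypothetical in-memory image of one row of the Node table. */
final class NodeRow {
    final UUID id = UUID.randomUUID();
    final String type;      // "UNode", "LNode", "SNode" or "BNode"
    final String label;     // uriref or literal text; null stands for nil
    final UUID subject;     // null stands for nil
    final UUID predicate;   // null stands for nil
    final UUID object;      // null stands for nil

    NodeRow(String type, String label, UUID subject, UUID predicate, UUID object) {
        boolean labeled = type.equals("UNode") || type.equals("LNode");
        boolean statement = type.equals("SNode");
        if (!labeled && !statement && !type.equals("BNode")) {
            throw new IllegalArgumentException("unknown node type: " + type);
        }
        // Assumed rule: only UNodes and LNodes carry a label.
        if (labeled == (label == null)) {
            throw new IllegalArgumentException(type + " has wrong label field");
        }
        // Assumed rule: only SNodes fill Subject, Predicate and Object.
        boolean allSet = subject != null && predicate != null && object != null;
        boolean noneSet = subject == null && predicate == null && object == null;
        if (statement ? !allSet : !noneSet) {
            throw new IllegalArgumentException(type + " has wrong triple fields");
        }
        this.type = type;
        this.label = label;
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}
```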

----------------------------------------------------------------------

Representation 2: Linked Object Model (skeletal)

abstract public class Node
{
   protected UUID id;
}

public class SNode extends Node
{
   protected UUID subject;
   protected UUID predicate;
   protected UUID object;
}

public class UNode extends Node
{
   protected URIREF label;
}

public class LNode extends Node
{
   protected LITERAL label;
}

public class BNode extends Node
{
}

It is presumed that the above object model is combined with a
dictionary, map, hash table or other similar mechanism by which
individual nodes can be located by either label or node ID.  
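
Such a lookup mechanism might be sketched as follows. NodeIndex and
its nested Entry class are hypothetical stand-ins for the classes
above, with labels held as plain Strings. A label maps to a list of
nodes because literal nodes with equal labels remain distinct in
this model.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Hypothetical index locating labeled nodes by node ID or by label. */
final class NodeIndex {
    /** Minimal stand-in for a UNode or LNode. */
    static final class Entry {
        final UUID id = UUID.randomUUID();
        final String type;   // "UNode" or "LNode"
        final String label;

        Entry(String type, String label) {
            this.type = type;
            this.label = label;
        }
    }

    private final Map<UUID, Entry> byId = new HashMap<>();
    private final Map<String, List<Entry>> byLabel = new HashMap<>();

    /** Creates a node and registers it under both its ID and its label. */
    Entry add(String type, String label) {
        Entry e = new Entry(type, label);
        byId.put(e.id, e);
        byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(e);
        return e;
    }

    /** Returns the node with this ID, or null if there is none. */
    Entry findById(UUID id) {
        return byId.get(id);
    }

    /** Returns every node carrying this label, or an empty list. */
    List<Entry> findByLabel(String label) {
        return byLabel.getOrDefault(label, List.of());
    }
}
```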

======================================================================

RELATION OF GRAPH MODEL TO URV ENCODING

The use of a URI Ref to identify a resource is an implicit agreement
or contract with all others making statements that everyone using
that URI Ref is talking about the same 'thing'.  Thus, there is
the expectation that all statements relating to such a 'thing'
would combine upon syndication to provide for a consolidated body
of knowledge about that 'thing'.

The benefit of a "destructive" (non-virtual) level 1 merge is
to substantially reduce graph real-estate where all UNodes are
combined. Thus, the more UNodes that are combined, the greater
the compression. In fact, it is conceivable that a level 1 merge
would be applied on all input automatically.

By encoding typed data literals in URVs, all such values are able
to be merged in a level 1 merge, rather than remain as locally
qualified LNodes, thus achieving substantial reduction in graph
real-estate.

E.g. in a large knowledge base about people where individuals'
ages are defined as nonNegativeInteger values, rather than have
one age value node for each person, along with a complete statement
qualifying that node for type, a URV encoding allows for a single
UNode to be shared for each equivalent age value, for all persons
having the same age. Thus, in a context of millions of persons,
one could achieve substantial compression in the graph with regards
to knowledge about age, without any loss of information whatsoever.
Furthermore, one can more efficiently locate persons of a particular
age by simply extracting all age statements with that URV as the
object, thus increasing search efficiency. A query API can, if so
desired, hide the details of the URV encoding and the resultant
level 1 merge compression by always expanding the value out to a
normalized LNode with an associated rdf:type qualification statement.
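
The age example can be sketched roughly as below. The URV spelling
"xsd:nonNegativeInteger:30" is an assumption modeled on the
xsd:lang:en form used in this document, and the class and method
names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical index in which every person of a given age shares one age node. */
final class AgeIndex {
    // One entry per distinct age URV, however many persons have that age.
    private final Map<String, List<String>> personsByAgeUrv = new HashMap<>();

    /** Builds the assumed URV for a nonNegativeInteger age value. */
    static String ageUrv(int age) {
        if (age < 0) {
            throw new IllegalArgumentException("age must be non-negative: " + age);
        }
        return "xsd:nonNegativeInteger:" + age;
    }

    /** Records an age statement whose object is the shared URV node. */
    void addAge(String person, int age) {
        personsByAgeUrv.computeIfAbsent(ageUrv(age), k -> new ArrayList<>()).add(person);
    }

    /** Number of age value nodes actually needed after the merge. */
    int distinctAgeNodes() {
        return personsByAgeUrv.size();
    }

    /** All persons whose age statement has this URV as its object. */
    List<String> personsAged(int age) {
        return personsByAgeUrv.getOrDefault(ageUrv(age), List.of());
    }
}
```

With locally typed literals, each person would instead need a
separate LNode plus a statement qualifying its type, so the node
count would grow with the number of persons rather than with the
number of distinct ages.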

The verbose, potentially infinite example shown above can be
redefined using URVs in a more concise, finite form as follows
(showing a level 1 merge compression):

         -----------> [1,S]
         |              |
         |              ---- subject ------> [2,U,#status]
         |              |
         |              ---- predicate ----> [3,U,{rdf:label}]
         |              |
         |              ---- object -------> [4,L,Status]
         |
         |            [5,S]
         |              |
         --- subject ----
                        |
                        ---- predicate ----> [6,U,{rdfq:scope}]
                        |
                        |
                        |
                        ---- object -------> [7,U,xsd:lang:en] <--
                                               ^ ^          ^    |
 . . . . . . . . . . . . . . . . . . . . . . . |.| . . . . .|. . | . 
                                               | |          |    |
         --------------------------------------| |          |    |
         |                                       |          |    |
         |            [8,S]                      |          |    |
         |              |                        |          |    |
         |              ---- subject -------------          |    |
         |              |                                   |    |
         |              ---- predicate --> [9,U,{rdf:type}] |    |
         |              |                                   |    |
         |              ---- object -------------------------    |
         |                                                       |
   -----------------> [10,S]                                     |
   |     |              |                                        |
   |     --- subject ----                                        |
   |                    |                                        |
   |                    ---- predicate ----> [11,U,{rdf:label}]  |
   |                    |                                        |
   |                    ---- object -------> [12,L,English]      |
   |                                                             |
   |                  [13,S]                                     |
   |                    |                                        | 
   --------- subject ----                                        |
                        |                                        |
                        ---- predicate ----> [14,U,{rdfq:scope}] |
                        |                                        |
                        ---- object ------------------------------

Note that the knowledge below the dotted line (. . . .) is defined
globally only once for the resource xsd:lang:en even if that
resource is used millions of times to qualify a statement. Without
a means such as URV encoding to define first class resources
(with URI identity) this knowledge would have had to be duplicated
those millions of times.

Hopefully the practical benefits of URV encoding, and of this proposed
graph representation and interpretation, are clear from this example.

======================================================================

SERIALIZATION AND MAPPING TO GRAPH REPRESENTATION

Statement qualification properties can be defined as attribute
values on certain RDF/XML elements, with interpretations as
follows:

  rdf:RDF

     Qualifications apply to all statements in instance scope

  rdf:Description

     Qualifications apply to all statements in description scope

  (property element)

     Qualifications apply only to specific statement

All qualification property attributes may take multiple whitespace
separated values, which are expanded into individual qualification
statements.
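
The expansion of a multi-valued attribute could be sketched like
this. The class QualifierExpansion and its expand method are
hypothetical. Each resulting qualification statement is shown as a
three-element array holding the ID of the qualified statement, the
qualification property, and one value.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper that turns one attribute value into qualification statements. */
final class QualifierExpansion {
    /** Returns one {statementId, property, value} array per whitespace separated value. */
    static List<String[]> expand(String statementId, String property, String attributeValue) {
        List<String[]> statements = new ArrayList<>();
        for (String value : attributeValue.trim().split("\\s+")) {
            // Splitting an empty attribute yields one empty token; skip it.
            if (!value.isEmpty()) {
                statements.add(new String[] { statementId, property, value });
            }
        }
        return statements;
    }
}
```

For instance, an attribute value holding two urirefs separated by
whitespace expands into two arrays with the same statement ID and
property, one per uriref.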

Example 1: Instance Level

  <rdf:RDF rdfq:scope="urn:bas">
     ...
  </rdf:RDF>

Defines the following qualifying statement for all statements in
the RDF instance:

  [#A,S]
     |
     ---- subject ----> [...]
     |
     ---- predicate --> [#B,U,{rdfq:scope}]
     |
     ---- object -----> [#C,U,urn:bas]

where #A, #B, and #C are instantiated to node IDs for each
qualifying statement, and the subject [...] stands for the
qualified statement, carrying that statement's node ID, type and
(if present) label.

This level of definition is especially useful for defining
qualifications for source, authentication, and authority
which typically are shared for all statements in a given
instance.

Example 2: Description Level

  <rdf:Description rdf:about="urn:boo" rdfq:scope="urn:bas">
     <x:property1 rdf:resource="urn:foo"/>
     <x:property2 rdf:resource="urn:bar"/>
  </rdf:Description>

Defines the same qualifying statement as in example 1 above for
both statements, one each for x:property1 and x:property2. 

Example 3: Property Level

  <rdf:Description rdf:about="urn:boo">
     <x:property1 rdf:resource="urn:foo" rdfq:scope="urn:bas"/>
     <x:property2 rdf:resource="urn:bar"/>
  </rdf:Description>

Defines the same qualifying statement as in example 1 above, but only
for the x:property1 statement.

Example 4: Equivalence between Description and Explicit Reification

The following two serializations have identical representation
in the graph, according to this proposal:

  Serialization 1:

     <rdf:Description rdf:about="urn:boo">
        <x:property rdf:resource="urn:foo" rdfq:scope="urn:bas"/>
     </rdf:Description>

  Serialization 2:

     <rdf:Statement rdf:ID="X">
        <rdf:subject   rdf:resource="urn:boo"/>
        <rdf:predicate rdf:resource="{x:property}"/>
        <rdf:object    rdf:resource="urn:foo"/>
     </rdf:Statement>
   
     <rdf:Description rdf:about="#X">
        <rdfq:scope rdf:resource="urn:bas"/>
     </rdf:Description>

  Graph Representation:

                    ----> [1,S]
                    |       |
                    |       ---- subject ----> [2,U,urn:boo]
                    |       |
                    |       ---- predicate --> [3,U,{x:property}]
                    |       |
                    |       ---- object -----> [4,U,urn:foo]
                    |
   [5,S]            |
     |              |
     ---- subject ---
     |
     ---- predicate --> [6,U,{rdfq:scope}]
     |
     ---- object -----> [7,U,urn:bas]

---

That's all folks...  ;-)

Patrick

--
               
Patrick Stickler              Phone: +358 50 483 9453
Senior Research Scientist     Fax:   +358 7180 35409
Nokia Research Center         Email: patrick.stickler@nokia.com
Received on Monday, 12 November 2001 10:25:12 UTC