An outline of RDFn -- RDF with (auto- and custom-) names

As the group tries to decide on options, the following outline of a revised version of RDFn may be useful for discussions.

Core concepts and ideas in RDFn:

  1. An RDFn statement is uniquely identified using the tuple <s, p, o, g, n>, where the component n is the "name" of the statement. (The components s, p, and o represent the subject, predicate, and object, respectively. The component g, representing the graph name, is non-NULL only for quads and will not be used in the examples below.)
Example 1: An RDFn statement, with ex:jSm as its name, representing the tuple <ex:john, ex:spouseOf, ex:mary, null, ex:jSm>:
--> ex:john ex:spouseOf ex:mary | ex:jSm .
  2. Based on how its name was created, a statement belongs to one of two types:
     * auto-named: The name n of an auto-named statement <s, p, o, g, n> is computed as rdfnAuto:foo(s, p, o, g), where
        * rdfnAuto is an exclusive namespace reserved for the names of auto-named statements, and
        * foo is an implementation-specific function that generates a unique string from the <s, p, o, g> portion of the statement.
     * custom-named: The name of a custom-named statement is an IRI that is supplied by the data creator. (The IRI cannot have rdfnAuto as its namespace prefix.)
  3. The name of a statement may be used as subject or object of other statements as long as there is no direct or indirect self-recursion involving the name (e.g., <n, p, o, g, n> is not allowed because n has to be computed using n).
Example 2: Adding statements about an auto-named statement (using a placeholder for the auto-generated name):
--> ex:Cleveland ex:servedAs ex:POTUS | rdfnAuto:term1 .
--> rdfnAuto:term1 ex:startYear 1885 ; ex:endYear 1889 .
Example 3: Adding statements about a custom-named statement:
--> ex:Cleveland ex:servedAs ex:POTUS | ex:term2 .
--> ex:term2 ex:startYear 1893 ; ex:endYear 1897 .
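
The naming rules above can be sketched in Python. This is only an illustration under stated assumptions: the outline leaves foo implementation-specific, so the SHA-256 based auto_name below is one hypothetical choice, and the names Statement, auto_name, and make_statement are invented for this sketch. The recursion check covers only the direct case (a custom name used as subject or object of its own statement); detecting indirect recursion would need a walk over the whole dataset.

```python
import hashlib
from typing import NamedTuple, Optional

RDFN_AUTO = "rdfnAuto:"  # namespace reserved for auto-generated names


class Statement(NamedTuple):
    s: str            # subject
    p: str            # predicate
    o: str            # object
    g: Optional[str]  # graph name, None for triples
    n: str            # statement name


def auto_name(s, p, o, g=None):
    """Hypothetical 'foo': derive a name from the <s, p, o, g> portion only."""
    key = repr((s, p, o, g)).encode("utf-8")
    return RDFN_AUTO + hashlib.sha256(key).hexdigest()[:16]


def make_statement(s, p, o, g=None, name=None):
    """Build an auto-named statement (name=None) or a custom-named one."""
    if name is None:
        return Statement(s, p, o, g, auto_name(s, p, o, g))
    if name.startswith(RDFN_AUTO):
        raise ValueError("custom names must not use the rdfnAuto namespace")
    if name in (s, o):
        raise ValueError("a statement must not refer to its own name")
    return Statement(s, p, o, g, name)


# Example 2: an auto-named statement and a statement about it.
term1 = make_statement("ex:Cleveland", "ex:servedAs", "ex:POTUS")
term1_start = make_statement(term1.n, "ex:startYear", "1885")

# Example 3: a custom-named statement with the same <s, p, o, g>.
term2 = make_statement("ex:Cleveland", "ex:servedAs", "ex:POTUS", name="ex:term2")
term2_start = make_statement(term2.n, "ex:startYear", "1893")
```

Because auto_name depends only on <s, p, o, g>, re-adding the same auto-named statement always yields the same name, while any number of custom-named statements can share that <s, p, o, g>.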

Core concepts and ideas in SPARQLn:

  1. A new filter isAuto(<name>) is introduced to allow distinguishing between auto-named and custom-named statements. If this filter is not used, all statements will qualify, regardless of whether they are auto-named or custom-named, provided they match regular SPARQL criteria.
Example 4: The following query returns ?cnt = 2 if the data about both of President Cleveland's terms (from Example 2 and Example 3 above) are present in the RDF dataset:
--> SELECT (count(*) as ?cnt) { ?s ex:servedAs ex:POTUS }
Example 5: The following query returns ?cnt = 1 due to the presence of the isAuto() filter:
--> SELECT (count(*) as ?cnt) { ?s ex:servedAs ex:POTUS | ?n . FILTER ( isAuto(?n) ) }
Example 6: The following query returns ?minStartYr = 1885, ?maxEndYr = 1897:
--> SELECT (min(?startYr) as ?minStartYr) (max(?endYr) as ?maxEndYr)
        { ?s ex:servedAs ex:POTUS | ?n .
           ?n ex:startYear ?startYr ; ex:endYear ?endYr }
  2. A custom-named statement is considered unasserted unless an auto-named statement exists with the same <s, p, o, g>. This has implications for SPARQL query processing. A new triple-pattern format that uses the << ... >> enclosure is introduced in SPARQL to indicate whether matching with unasserted statements is allowed.
Example 7: Consider the following data, which consists of just a single custom-named statement. Since no auto-named statement with <s, p, o, g> equal to <ex:bob, ex:fatherOf, ex:john, null> is present, the custom-named statement is considered unasserted. The first query below looks for matches with asserted statements only and hence returns no results. The second query, on the other hand, also considers unasserted statements (due to the use of the << ... >> enclosure for the triple-pattern) and returns the result: ?dad = ex:bob, ?kid = ex:john.
DATA:
--> ex:bob ex:fatherOf ex:john | ex:cname1 .
QUERY 1:
--> SELECT ?dad ?kid { ?dad ex:fatherOf ?kid }
QUERY 2:
--> SELECT ?dad ?kid { << ?dad ex:fatherOf ?kid >> }
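
The matching behaviour in Examples 4 to 7 can be mimicked with a small in-memory model. This is a sketch of one reading of the outline, not a SPARQLn implementation: the helper names (is_auto, is_asserted, match) are invented here, the names rdfnAuto:y1 to rdfnAuto:y4 are placeholders for the auto-names of the year statements, and a plain pattern is assumed to match every statement whose <s, p, o, g> is asserted.

```python
from typing import NamedTuple, Optional

AUTO_PREFIX = "rdfnAuto:"


class Statement(NamedTuple):
    s: str
    p: str
    o: object
    g: Optional[str]
    n: str


def is_auto(name):
    """Model of the isAuto() filter: true for names in the rdfnAuto namespace."""
    return name.startswith(AUTO_PREFIX)


def is_asserted(stmt, data):
    """Asserted iff an auto-named statement with the same <s, p, o, g> exists."""
    return any(is_auto(d.n) and d[:4] == stmt[:4] for d in data)


def match(data, p, o=None, allow_unasserted=False):
    """Plain pattern by default; allow_unasserted=True models << ... >>."""
    return [st for st in data
            if st.p == p and (o is None or st.o == o)
            and (allow_unasserted or is_asserted(st, data))]


# Data from Examples 2 and 3.
cleveland = [
    Statement("ex:Cleveland", "ex:servedAs", "ex:POTUS", None, "rdfnAuto:term1"),
    Statement("ex:Cleveland", "ex:servedAs", "ex:POTUS", None, "ex:term2"),
    Statement("rdfnAuto:term1", "ex:startYear", 1885, None, "rdfnAuto:y1"),
    Statement("rdfnAuto:term1", "ex:endYear", 1889, None, "rdfnAuto:y2"),
    Statement("ex:term2", "ex:startYear", 1893, None, "rdfnAuto:y3"),
    Statement("ex:term2", "ex:endYear", 1897, None, "rdfnAuto:y4"),
]

served = match(cleveland, "ex:servedAs", "ex:POTUS")
cnt_all = len(served)                                   # Example 4
cnt_auto = len([st for st in served if is_auto(st.n)])  # Example 5

# Example 6: join on the statement name ?n.
names = {st.n for st in served}
min_start = min(st.o for st in match(cleveland, "ex:startYear") if st.s in names)
max_end = max(st.o for st in match(cleveland, "ex:endYear") if st.s in names)

# Example 7: a lone custom-named statement is unasserted.
family = [Statement("ex:bob", "ex:fatherOf", "ex:john", None, "ex:cname1")]
query1 = match(family, "ex:fatherOf")
query2 = match(family, "ex:fatherOf", allow_unasserted=True)
```

Under these assumptions the model yields cnt_all = 2, cnt_auto = 1, min_start = 1885, and max_end = 1897, and only query2 returns the ex:bob / ex:john row.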

A few other relevant points:

  1. For cross-system sharing of query results, include a list containing <s, p, o, g, n> for each auto-generated name n that is (directly or indirectly) included in the result. This is necessary because triplestores have full autonomy in implementing the function foo used for generating auto-names; given the same <s, p, o, g>, two different triplestores could therefore generate two different auto-names. Hence, the recipient needs to know the <s, p, o, g> corresponding to each auto-name returned (or indirectly involved) in the result in order to generate the appropriate auto-name for its local use.
  2. Statement-Set: A statement-set can be formed by having multiple distinct <s, p, o, g> share the same custom name. The advantage over named graphs is that statements from distinct graphs (or the default graph) can form a group; a disadvantage is that auto-named statements cannot be part of a (non-singleton) statement-set.
  3. Referential Transparency vs. Opacity: The current idea of "opaque by default and transparent in case TEPs are involved" would work fine for RDFn too.
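
The first point above (cross-system sharing) can be sketched as follows. This is a hedged illustration only: the outline does not define a wire format, so the shipped list of <s, p, o, g, n> tuples, the row dictionaries, and the functions local_auto_name, build_rename_map, and rewrite_rows are all assumptions made for this example. The recipient's foo is again modelled with a hash, and sender auto-names appearing as subject or object are translated first so that indirectly involved names are handled.

```python
import hashlib

AUTO_PREFIX = "rdfnAuto:"


def local_auto_name(s, p, o, g=None):
    """Stand-in for the recipient's own, implementation-specific 'foo'."""
    key = repr((s, p, o, g)).encode("utf-8")
    return AUTO_PREFIX + hashlib.sha256(key).hexdigest()[:12]


def build_rename_map(shipped):
    """Map each sender auto-name to the auto-name the recipient would compute."""
    by_name = {n: (s, p, o, g) for (s, p, o, g, n) in shipped}
    mapping = {}

    def resolve(term):
        if term in mapping:
            return mapping[term]
        if term not in by_name:
            return term  # ordinary IRI, literal, or custom name: keep as-is
        s, p, o, g = by_name[term]
        # Translate nested auto-names before computing the local name.
        mapping[term] = local_auto_name(resolve(s), p, resolve(o), g)
        return mapping[term]

    for name in by_name:
        resolve(name)
    return mapping


def rewrite_rows(rows, mapping):
    """Replace sender auto-names in result rows with local auto-names."""
    return [{var: mapping.get(val, val) for var, val in row.items()}
            for row in rows]


# Sender's result plus the accompanying <s, p, o, g, n> list.
shipped = [
    ("ex:Cleveland", "ex:servedAs", "ex:POTUS", None, "rdfnAuto:term1"),
    ("rdfnAuto:term1", "ex:startYear", "1885", None, "rdfnAuto:x9"),
]
rows = [{"n": "rdfnAuto:term1", "yr": "1885"}]

mapping = build_rename_map(shipped)
local_rows = rewrite_rows(rows, mapping)
```

The recursion in resolve terminates because RDFn disallows direct or indirect self-recursion involving a name.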

Based on the above outline, I'd argue that using RDFn to support the desired extensions to RDF would also satisfy some of the practical constraints that are critical for enterprise adoption, specifically:

  * full backward-compatibility for RDF 1.1 data (each RDF 1.1 statement becomes an auto-named (asserted) statement in RDFn)
  * continued validity of pre-existing SPARQL 1.1 queries even as data evolves to include more expressive content by taking advantage of new capabilities to include statements about statements and multi-edges
  * minimization of the custom naming burden on the user because custom names are needed only for those cases where multi-edges or (non-singleton) statement-sets are involved

Thanks,
Souri.

Received on Thursday, 16 November 2023 02:39:13 UTC