proposal: "box datasets" (sandro's dataset spec, v0.1) from Sandro Hawke on 2013-09-15 (public-rdf-wg@w3.org from September 2013)

From: Sandro Hawke <sandro@w3.org>
Date: Sun, 15 Sep 2013 12:48:35 -0400
To: RDF WG <public-rdf-wg@w3.org>
Message-ID: <5235E4E3.1050900@w3.org>
Here's what I think we need to define to make Jeremy and many other 
people happy.   Obviously this is not the final draft of a spec, but 
hopefully it conveys the idea clearly enough.    If you read this, 
please say whether you see any serious technical problems with it 
and/or would be happy with it going out as a WG Note.   Actually, the 
idea is so simple and so well-known, even if not formalized or named 
before, that maybe it's not out of the question to put it on the Rec 
Track -- but obviously not if it endangers anything else.

        -- Sandro

== Introduction

A "box dataset" is a kind of RDF Dataset which adheres to certain 
semantic conditions.    These conditions are likely to be intuitive for 
a large set of RDF users, but they are not universally held, so some RDF 
Datasets are not box datasets.    Some readers may find this document 
challenging because they have never seriously considered the possibility 
of any other kind of dataset, so the properties of box datasets will 
seem utterly obvious.  The fact that a dataset is a box dataset may be 
conveyed using the rdf:BoxDataset class name or via some non-standard 
and/or out-of-band mechanism.

A box dataset is defined to be any RDF Dataset in which the graph names 
each denote some resource (sometimes called a "g-box") which "contains" 
exactly those triples which comprise the RDF Graph which is paired with 
that name in that dataset.    That is, this dataset:

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX : <http://example.org/>
   <> a rdf:BoxDataset.
   GRAPH :g1 { :a :b :c }

tells us that the resource denoted by <http://example.org/g1> contains 
exactly one RDF triple and what that triple is.

It contradicts this dataset:

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX : <http://example.org/>
   <> a rdf:BoxDataset.
   GRAPH :g1 { :a :b :d }

since they disagree about what the contained triple is.

These two datasets also contradict each other (given the same PREFIX 
declarations as above):

   <> a rdf:BoxDataset.
   GRAPH :g1 { :a :b 1.0 }

and

   <> a rdf:BoxDataset.
   GRAPH :g1 { :a :b 1.00 }

Even though "1.0"^^xs:decimal and "1.00"^^xs:decimal denote the same 
value, they are not the same RDF term, so the triple { :a :b 1.0 } is 
not the same triple as { :a :b 1.00 }.  Since they are not the same 
triple, two datasets that each claim to give exactly what :g1 contains 
cannot both be true.   (See "Literal Term Equality" in RDF 1.1 Concepts.)
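
The contradiction test can be sketched in a few lines of Python.  This 
is only an illustrative model, not part of the proposal: a literal is 
modelled as a (lexical form, datatype) pair, a dataset as a dict from 
graph name to a set of triples, and the helper names `lit` and 
`contradicts` are made up for this sketch.

```python
from decimal import Decimal

# In Turtle/TriG syntax, a bare 1.0 is shorthand for an xsd:decimal literal.
XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"

def lit(lexical, datatype=XSD_DECIMAL):
    """Model an RDF literal as a (lexical form, datatype IRI) pair."""
    return (lexical, datatype)

def contradicts(ds1, ds2):
    """Two box datasets contradict each other if they share a graph name
    but pair it with different sets of triples."""
    shared = ds1.keys() & ds2.keys()
    return any(ds1[name] != ds2[name] for name in shared)

G1 = "http://example.org/g1"
A, B = "http://example.org/a", "http://example.org/b"

ds_a = {G1: frozenset({(A, B, lit("1.0"))})}
ds_b = {G1: frozenset({(A, B, lit("1.00"))})}

same_value = Decimal("1.0") == Decimal("1.00")  # value equality
same_term = lit("1.0") == lit("1.00")           # term equality
conflict = contradicts(ds_a, ds_b)
```

Here the two literals have the same value but are different terms, so 
the two datasets are reported as contradicting each other.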

== Contains

This notion of "contains" is not formally defined but is reflected in 
the documentation of properties and classes used with Box Datasets.  It 
is essentially the same notion as people use when they say a web page 
"contains" some statements or a file "contains" some graphic image.    
More broadly, the web can be thought of as "content" which is 
"contained" in web pages.

Given this pre-existing notion of "contains", it follows that 
pre-existing properties and classes can be used with Box Datasets with 
reasonable confidence they will be correctly understood.

For example, given this dataset:

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX : <http://example.org/>
    PREFIX dc: <http://purl.org/dc/terms/>
    PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>

   <> a rdf:BoxDataset.
   GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
   :g1 dc:creator :inspector1204;
         dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.

if we read the documentation for dc:creator and dc:date, and if 
necessary consult the long history of how these terms have been used 
with web pages and computer files which "contain" various statements, it 
becomes clear that this dataset is telling us the given statement using 
the :toxicityLevel property was made by the given entity 
("inspector1204") at the given time.   If we did not know this was a box 
dataset, we would not have any defined connection between :g1 and 
the :toxicityLevel triple.   We would know something was created by that 
inspector at that time, but its association with that triple would be 
undefined.

== Dereference

While it would be out of scope for this specification to constrain or 
formally characterize what HTTP URIs denote, existing practice with 
metadata on web pages strongly suggests that when dereferencing a URL 
returns RDF triples, it is reasonable to think of that URL as denoting 
something which contains those triples.   This means this dataset:

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX : <http://example.org/>
   <> a rdf:BoxDataset.
   GRAPH :g1 { :a :b :c }

can reasonably be assumed to be saying that dereferencing the URL 
"http://example.org/g1" provides the RDF triple { <http://example.org/a> 
<http://example.org/b> <http://example.org/c> }.    It can further be 
assumed that no other RDF triples are returned.   There is no 
implication about whether other (non-RDF) content might be returned.

Of course, web content can vary over time and per-client, and the 
content isn't always available, due to access control, network failures, 
etc.   The idea here is that those circumstances where the semantic 
constraints of the dataset are met are the same circumstances under 
which that URL would provide the given RDF content, if one were able to 
access it.    That is, the dataset is only "true" if and when that URL 
is backed by that RDF content.  If the dataset is always true everywhere 
(which is the somewhat-naive standard reading of RDF) then that URL 
always has that RDF content. More nuanced notions of context, including 
change over time and different perspectives for different users remain 
as future work.
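
One way to picture the "true if and when" reading is a check that 
compares each named graph with what its name currently dereferences 
to.  The sketch below assumes all graph names are dereferenceable IRIs 
and uses a hypothetical `fetch_triples` function, backed here by an 
in-memory dict standing in for HTTP GET plus RDF parsing.  It ignores 
access control, network failures and per-client variation.

```python
EX = "http://example.org/"

def holds_for(dataset, fetch_triples):
    """True if dereferencing every graph name yields exactly the triples
    paired with that name in the dataset (no more, no fewer)."""
    return all(fetch_triples(name) == triples
               for name, triples in dataset.items())

# The box dataset from the example: graph name -> set of triples.
dataset = {EX + "g1": frozenset({(EX + "a", EX + "b", EX + "c")})}

# Stand-in for the web: URL -> RDF triples currently served there.
fake_web = {EX + "g1": frozenset({(EX + "a", EX + "b", EX + "c")})}

def fetch_triples(url):
    """Hypothetical dereference step; a real one would do HTTP and parsing."""
    return fake_web.get(url, frozenset())

true_now = holds_for(dataset, fetch_triples)

# The content behind the URL changes, so the dataset stops being true.
fake_web[EX + "g1"] = frozenset({(EX + "a", EX + "b", EX + "d")})
true_later = holds_for(dataset, fetch_triples)
```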

== Web Crawler Example

As a more complete example, consider the case of a system which crawls 
the web looking for RDF content.  It might store everything it has 
gathered during its repeated crawling in a box dataset.  It might also 
do some canonicalization (think of 1.0 and 1.00 in the introduction) and 
some inference, and  store the output of that processing in the 
dataset.   Then it can make the entire dataset available to SPARQL 
Queries, since SPARQL is defined as querying an RDF Dataset, and it can 
make it available for download in one or more dataset syntaxes like 
TriG, JSON-LD, and N-Quads.

For this example, we'll assume the crawler is only looking at one site, 
http://stocks.example.com, and that site publishes RDF with stock 
closing prices each day at URLs like http://stocks.example.com/data/orcl 
(for Oracle Corporation, whose ticker symbol is "orcl").   Oracle was 
selected at random for this example from among the publicly traded 
companies actively participating in the RDF Working Group, namely 
Oracle, IBM, and Google.

We'll use the following PREFIXes:

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
   PREFIX feed: <http://stocks.example.com/data/>       # data feeds from stocks.example.com
   PREFIX stock: <http://stocks.example.com/vocab#>     # stock terminology, and IRIs for public companies
   PREFIX crawl: <http://crawl.example.org/ns/>         # crawler terminology
   PREFIX snap: <http://crawl.example.org/snapshots/>   # where the crawler publishes individual snapshots
   PREFIX dc: <http://purl.org/dc/terms/>
   PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>

=== Latest Content

The latest content might be stored in named graphs with the name being 
the dereference URL, like this:

   GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
   feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
                  crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
   GRAPH feed:goog { ... }
   feed:goog crawl:fetchedAt ....
   GRAPH feed:ibm { ... }
   feed:ibm crawl:fetchedAt ...
   ...

In this example, stocks.example.com has chosen to make the daily 
information available at one URL (http://stocks.example.com/data/orcl) 
while the stable, long-term information about every company is 
available at another (http://stocks.example.com/vocab).    When the 
crawler visits that second document, it will add this to the dataset:

   GRAPH <http://stocks.example.com/vocab> {
     stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
         rdfs:label "Oracle Corporation";
         stock:ticker "orcl".
      ...
      stock:ticker a rdf:Property;
          rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
   }
   <http://stocks.example.com/vocab> crawl:fetchedAt "2013-09-15T16:00:02Z"^^xs:dateTimeStamp.

=== Older Content

The older content will need to be stored with different graph names to 
avoid colliding with the latest content.  Here it would be reasonable to 
use blank nodes as the graph names, if the crawler does not want to 
serve linked data, like this:

   GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
   _:orcl_20130912 crawl:fetchedFrom feed:orcl;
                   crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
                   crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
   GRAPH _:goog_20130912 { ... }
    _:goog_20130912 crawl:fetchedFrom feed:goog ....
    ...

Alternatively, if the crawler is willing to provide linked data, it can 
create URLs for the snapshots it will be re-publishing:

   GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
   snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
                      crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
                      crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
   GRAPH snap:goog_20130912 { ... }
    snap:goog_20130912 crawl:fetchedFrom feed:goog ....
    ...
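
As a rough sketch of the bookkeeping involved, here is how a crawler 
might move the previous "latest" graph to a snapshot name before 
storing a new fetch.  Everything here is an assumption made for 
illustration: the store layout (a dict of named graphs plus a set of 
default-graph triples), the function names, the use of plain strings 
for literals, and the snapshot naming scheme (ticker plus fetch date, 
which would collide if the same URL were fetched twice in one day).

```python
CRAWL = "http://crawl.example.org/ns/"
SNAP = "http://crawl.example.org/snapshots/"
FEED = "http://stocks.example.com/data/"
STOCK = "http://stocks.example.com/vocab#"

def new_store():
    """A dataset: named graphs plus metadata triples in the default graph."""
    return {"graphs": {}, "default": set()}

def record_fetch(store, url, triples, fetched_at):
    """Store freshly fetched triples under `url`, archiving any older graph."""
    graphs, default = store["graphs"], store["default"]
    if url in graphs:
        # Metadata currently attached to the graph that is about to be replaced.
        old_meta = {(s, p, o) for (s, p, o) in default if s == url}
        old_time = next(o for (_, p, o) in old_meta if p == CRAWL + "fetchedAt")
        # e.g. ".../snapshots/orcl_20130912"
        snap_name = (SNAP + url.rsplit("/", 1)[-1] + "_"
                     + old_time[:10].replace("-", ""))
        graphs[snap_name] = graphs[url]
        # Re-attach the old metadata to the snapshot name.
        default -= old_meta
        default |= {(snap_name, p, o) for (_, p, o) in old_meta}
        default.add((snap_name, CRAWL + "fetchedFrom", url))
    graphs[url] = frozenset(triples)
    default.add((url, CRAWL + "fetchedAt", fetched_at))

store = new_store()
record_fetch(store, FEED + "orcl",
             {(STOCK + "orcl", STOCK + "closing", "32.79")},
             "2013-09-12T23:51:02Z")
record_fetch(store, FEED + "orcl",
             {(STOCK + "orcl", STOCK + "closing", "32.46")},
             "2013-09-15T14:57:02Z")
```

After the second call, the graph named by the feed URL holds the newer 
price and the older one lives under a snapshot name with its own 
crawl:fetchedFrom and crawl:fetchedAt triples.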

Following best practice with linked data, the crawler should only use 
snapshot URLs like this if there is a web server answering at those URLs 
with suitable content.   Because the crawler is using a box dataset, the 
suitable content would have to be the RDF graph associated with that URL 
in this dataset.   Note that the metadata (like the crawl:fetchedAt 
information) MUST NOT be embedded in that content since it's not inside 
the named graph in the dataset above. Instead, if the metadata were to 
be offered, it would have to be offered via another resource.  The HTTP 
Link header can be used to provide a link to it, like this:

 > GET /snapshots/orcl_20130912 HTTP/1.1
 > Host: crawl.example.org
 > Accept: text/turtle; charset=utf-8

< HTTP/1.1 200 OK
< Server: nginx/1.2.1
< Date: Sun, 15 Sep 2013 15:28:38 GMT
< Content-Type: text/turtle; charset=utf-8
< Link: </snapshots/orcl_20130912_meta>; rel="meta"
( ... prefixes ... )
stock:orcl stock:closing 32.79; stock:volume  16250100.

and

 > GET /snapshots/orcl_20130912_meta HTTP/1.1
 > Host: crawl.example.org
 > Accept: text/turtle; charset=utf-8

< HTTP/1.1 200 OK
< Server: nginx/1.2.1
< Date: Sun, 15 Sep 2013 15:28:38 GMT
< Content-Type: text/turtle; charset=utf-8
( ... prefixes ... )
snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
                   crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
                   crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.

=== Derived Content

It may be useful to have the crawler do some processing on the RDF 
content it fetches and then share the results of that processing. For 
example, it might gather all the ontologies linked from fetched content, 
do some RDFS or OWL reasoning with the results, and then include 
some/all of the resulting entailments in additional graphs in the dataset.

For example, perhaps stock:closing used to be called 
stock:closingSharePrice.  To enable older clients to still read the 
data, stocks.example.com might include in the stock: ontology the triple 
{ stock:closing owl:equivalentProperty stock:closingSharePrice }.   
(This would require older clients to be doing some OWL reasoning, of 
course, which might or might not be a realistic assumption depending on 
their user base.)

On seeing this equivalentProperty declaration, and doing some inference, 
the crawler might add this to the dataset:

   GRAPH snap:orcl_20130912_inferred { stock:orcl stock:closingSharePrice 32.79 }
   snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
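
The inference step itself could be as small as the following sketch.  
It handles only owl:equivalentProperty, applies it in both directions 
for a single step (no transitive closure), and returns just the newly 
derived triples so they can go in their own graph.  The function name 
and the tuple representation of triples are assumptions made for this 
illustration.

```python
OWL_EQUIVALENT = "http://www.w3.org/2002/07/owl#equivalentProperty"
STOCK = "http://stocks.example.com/vocab#"

def infer_equivalents(triples, ontology):
    """Return triples derivable from `triples` by swapping a property for
    one declared equivalent in `ontology`, excluding triples already there."""
    pairs = set()
    for s, p, o in ontology:
        if p == OWL_EQUIVALENT:
            pairs.add((s, o))
            pairs.add((o, s))  # equivalence is symmetric
    derived = {(s, q, o)
               for (s, p, o) in triples
               for (p2, q) in pairs
               if p == p2}
    return frozenset(derived) - frozenset(triples)

snapshot = {
    (STOCK + "orcl", STOCK + "closing", "32.79"),
    (STOCK + "orcl", STOCK + "volume", "16250100"),
}
ontology = {
    (STOCK + "closing", OWL_EQUIVALENT, STOCK + "closingSharePrice"),
}

inferred = infer_equivalents(snapshot, ontology)
```

For the Oracle snapshot this yields the single closingSharePrice 
triple, which is what goes into the _inferred graph.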

Alternatively, the crawler might use the PROV ontology to be more 
explicit about how the inference was made.

As a related kind of derived content, the crawler might produce a 
variation on the fetched graph where the non-canonical literals (like 
1.00) are replaced with their canonical equivalents (like 1.0).   It's 
not clear how valuable this would be, however, since many downstream 
systems (like all [?most?] SPARQL systems) will mask this difference.
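
For what it's worth, the literal rewriting could look something like 
this sketch for decimals.  It assumes the input is already a valid 
decimal lexical form (no validation is done) and it targets the form 
used in this message, which keeps a decimal point with at least one 
digit on each side ("1.0").  Other canonical conventions exist, so 
treat the output format as an assumption.

```python
def canonical_decimal(lexical):
    """Rewrite a decimal lexical form such as '1.00' as '1.0'.

    Strips a leading '+', leading zeros in the whole part and trailing
    zeros in the fraction, keeping at least one digit on each side."""
    text = lexical.strip()
    negative = text.startswith("-")
    text = text.lstrip("+-")
    whole, _, frac = text.partition(".")
    whole = whole.lstrip("0") or "0"
    frac = frac.rstrip("0") or "0"
    # Avoid producing "-0.0" for negative zero.
    if negative and (whole != "0" or frac != "0"):
        whole = "-" + whole
    return f"{whole}.{frac}"

forms = [canonical_decimal(x) for x in ("1.0", "1.00", "01.50", "5", "-0.0")]
```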

===============

That's it for now.    Awaiting feedback.

         -- Sandro
Received on Sunday, 15 September 2013 16:48:43 UTC