- From: Sandro Hawke <sandro@w3.org>
- Date: Sun, 15 Sep 2013 12:48:35 -0400
- To: RDF WG <public-rdf-wg@w3.org>
Here's what I think we need to define to make Jeremy and many other
people happy. Obviously this is not the final draft of a spec, but
hopefully it conveys the idea clearly enough. If you read this,
please say whether you see any serious technical problems with it
and/or would be happy with it going out as a WG Note. Actually, the
idea is so simple and so well-known, even if not formalized or named
before, that maybe it's not out of the question to put it on the Rec
Track -- but obviously not if it endangers anything else.
-- Sandro
== Introduction
A "box dataset" is a kind of RDF Dataset which adheres to certain
semantic conditions. These conditions are likely to be intuitive for
a large set of RDF users, but they are not universally held, so some RDF
Datasets are not box datasets. Some readers may find this document
challenging because they have never seriously considered the possibility
of any other kind of dataset, so the properties of box datasets will
seem utterly obvious. The fact that a dataset is a box dataset may be
conveyed using the rdf:BoxDataset class name or via some non-standard
and/or out-of-band mechanism.
A box dataset is defined to be any RDF Dataset in which the graph names
each denote some resource (sometimes called a "g-box") which "contains"
exactly those triples which comprise the RDF Graph which is paired with
that name in that dataset. That is, this dataset:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://example.org/>
<> a rdf:BoxDataset.
GRAPH :g1 { :a :b :c }
tells us that the resource denoted by <http://example.org/g1> contains
exactly one RDF triple and what that triple is.
It contradicts this dataset:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://example.org/>
<> a rdf:BoxDataset.
GRAPH :g1 { :a :b :d }
since they disagree about what the contained triple is.
These two datasets also contradict each other (given the same PREFIX
declarations as above):
<> a rdf:BoxDataset.
GRAPH :g1 { :a :b 1.0 }
and
<> a rdf:BoxDataset.
GRAPH :g1 { :a :b 1.00 }
Even though "1.0"^^xs:decimal and "1.00"^^xs:decimal denote the same thing,
they are not the same RDF term, so the triple { :a :b 1.0 } is not the
same triple as { :a :b 1.00 }. Since they are not the same triple, the
datasets which say they are each what is contained by :g1 cannot both be
true. (See "Literal Term Equality" in RDF 1.1 Concepts.)
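The contradiction test described above can be sketched in a few lines of Python. This is only an illustration, not part of the proposal: it assumes a box dataset is modelled as a dict from graph name to a set of triples, with terms kept as plain strings so that a literal's lexical form is preserved. The function name `contradicts` is hypothetical.

```python
def contradicts(ds1, ds2):
    """Return the graph names on which two box datasets disagree.

    Each dataset maps a graph name to the set of triples its g-box
    contains. Terms are plain strings, so "1.0" and "1.00" are
    different terms even though they denote the same number.
    """
    shared = ds1.keys() & ds2.keys()
    return sorted(name for name in shared if ds1[name] != ds2[name])


G1 = "http://example.org/g1"
first = {G1: {(":a", ":b", "1.0")}}
second = {G1: {(":a", ":b", "1.00")}}

print(contradicts(first, second))  # ['http://example.org/g1']
print(contradicts(first, first))   # []
```

Comparing terms rather than values is what makes the 1.0 and 1.00 datasets clash; a value-based comparison would hide the disagreement.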
== Contains
This notion of "contains" is not formally defined but is reflected in
the documentation of properties and classes used with Box Datasets. It
is essentially the same notion as people use when they say a web page
"contains" some statements or a file "contains" some graphic image.
More broadly, the web can be thought of as "content" which is
"contained" in web pages.
Given this pre-existing notion of "contains", it follows that
pre-existing properties and classes can be used with Box Datasets with
reasonable confidence they will be correctly understood.
For example, given this dataset:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://example.org/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
<> a rdf:BoxDataset.
GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
:g1 dc:creator :inspector1204;
dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.
if we read the documentation for dc:creator and dc:date, and if
necessary consult the long history of how these terms have been used
with web pages and computer files which "contain" various statements, it
becomes clear that this dataset is telling us the given statement using
the :toxicityLevel property was made by the given entity
("inspector1204") at the given time. If we did not know this was a box
dataset, we would not have any defined connection between :g1 and the
:toxicityLevel triple. We would know something was created by that
inspector at that time, but its association with that triple would be
undefined.
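As a rough sketch of why the box reading matters to a consumer, here is how one might collect the statements attributed to a given creator. It assumes the same toy model as before (a set of default-graph triples plus a dict of named graphs, with terms as strings); `statements_by` is a hypothetical helper, not something defined by any specification.

```python
DC_CREATOR = "http://purl.org/dc/terms/creator"
EX = "http://example.org/"


def statements_by(creator, default_graph, named_graphs):
    """Collect the triples contained in every g-box credited to `creator`."""
    found = set()
    for s, p, o in default_graph:
        if p == DC_CREATOR and o == creator:
            # Box reading: the name s denotes something that contains
            # exactly the triples paired with s in this dataset.
            found |= named_graphs.get(s, set())
    return found


named_graphs = {EX + "g1": {(EX + "site17", EX + "toxicityLevel", "0.0034")}}
default_graph = {(EX + "g1", DC_CREATOR, EX + "inspector1204")}

print(statements_by(EX + "inspector1204", default_graph, named_graphs))
```

Without the box dataset assumption, the lookup in `named_graphs` would not be justified, since nothing would connect the name to the triples.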
== Dereference
While it would be out of scope for this specification to constrain or
formally characterize what HTTP URIs denote, existing practice with
metadata on web pages strongly suggests that when dereferencing a URL
returns RDF triples, it is reasonable to think of that URL as denoting
something which contains those triples. This means this dataset:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX : <http://example.org/>
<> a rdf:BoxDataset.
GRAPH :g1 { :a :b :c }
can reasonably be assumed to be saying that dereferencing the URL
"http://example.org/g1" provides the RDF triple { <http://example.org/a>
<http://example.org/b> <http://example.org/c> }. It can further be
assumed that no other RDF triples are returned. There is no
implication about whether other (non-RDF) content might be returned.
Of course, web content can vary over time and per-client, and the
content isn't always available, due to access control, network failures,
etc. The idea here is that those circumstances where the semantic
constraints of the dataset are met are the same circumstances under
which that URL would provide the given RDF content, if one were able to
access it. That is, the dataset is only "true" if and when that URL
is backed by that RDF content. If the dataset is always true everywhere
(which is the somewhat-naive standard reading of RDF) then that URL
always has that RDF content. More nuanced notions of context, including
change over time and different perspectives for different users remain
as future work.
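One way to picture this condition is a check that compares each named graph with what its URL currently provides. The sketch below makes several assumptions: `fetch` is a hypothetical callable standing in for real dereferencing and RDF parsing, it returns None when the content cannot be accessed, and graphs are sets of string triples.

```python
def violated_names(named_graphs, fetch):
    """Return graph names whose retrievable content differs from the dataset.

    `fetch` maps a URL to a set of triples, or to None when the content
    cannot be accessed (which tells us nothing either way).
    """
    bad = []
    for name, triples in named_graphs.items():
        got = fetch(name)
        if got is not None and got != triples:
            bad.append(name)
    return sorted(bad)


# A stand-in for the web: URL -> triples currently served there.
fake_web = {"http://example.org/g1": {(":a", ":b", ":c")}}

ok = {"http://example.org/g1": {(":a", ":b", ":c")}}
stale = {"http://example.org/g1": {(":a", ":b", ":d")}}
unreachable = {"http://example.org/g2": {(":a", ":b", ":c")}}

print(violated_names(ok, fake_web.get))           # []
print(violated_names(stale, fake_web.get))        # ['http://example.org/g1']
print(violated_names(unreachable, fake_web.get))  # []
```

Treating inaccessible content as "no evidence" rather than as a violation follows the reading above, where the dataset is true under the circumstances in which the URL would provide that content.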
== Web Crawler Example
As a more complete example, consider the case of a system which crawls
the web looking for RDF content. It might store everything it has
gathered during its repeated crawling in a box dataset. It might also
do some canonicalization (think of 1.0 and 1.00 in the introduction) and
some inference, and store the output of that processing in the
dataset. Then it can make the entire dataset available to SPARQL
Queries, since SPARQL is defined as querying an RDF Dataset, and it can
make it available for download in one or more dataset syntaxes like
TriG, JSON-LD, and N-Quads.
For this example, we'll assume the crawler is only looking at one site,
http://stocks.example.com, and that site publishes RDF with stock
closing prices each day at URLs like http://stocks.example.com/data/orcl
(for Oracle Corporation, whose ticker symbol is "orcl"). Oracle was
selected at random for this example from among the publicly traded
companies actively participating in the RDF Working Group, namely
Oracle, IBM, and Google.
We'll use the following PREFIXes:
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# data feeds from stocks.example.com
PREFIX feed: <http://stocks.example.com/data/>
# stock terminology, and IRIs for public companies
PREFIX stock: <http://stocks.example.com/vocab#>
# crawler terminology
PREFIX crawl: <http://crawl.example.org/ns/>
# where the crawler publishes individual snapshots
PREFIX snap: <http://crawl.example.org/snapshots/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
=== Latest Content
The latest content might be stored in named graphs with the name being
the dereference URL, like this:
GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
crawl:lastModified
"2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
GRAPH feed:goog { ... }
feed:goog crawl:fetchedAt ....
GRAPH feed:ibm { ... }
feed:ibm crawl:fetchedAt ...
...
In this example, stocks.example.com has chosen to make the daily
information available at one URL (http://stocks.example.com/data/orcl)
while the stable, long-term information about every company is
available at another (http://stocks.example.com/vocab). When the
crawler visits that second document, it will add this to the dataset:
GRAPH <http://stocks.example.com/vocab> {
stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
rdfs:label "Oracle Corporation";
stock:ticker "orcl".
...
stock:ticker a rdf:Property;
rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
}
<http://stocks.example.com/vocab> crawl:fetchedAt
"2013-09-15T16:00:02Z"^^xs:dateTimeStamp.
=== Older Content
The older content will need to be stored with different graph names to
avoid colliding with the latest content. Here it would be reasonable to
use blank nodes as the graph names, if the crawler does not want to
serve linked data, like this:
GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume
16250100 }
_:orcl_20130912 crawl:fetchedFrom feed:orcl;
crawl:fetchedAt
"2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
crawl:lastModified
"2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
GRAPH _:goog_20130912 { ... }
_:goog_20130912 crawl:fetchedFrom feed:goog ....
...
Alternatively, if the crawler is willing to provide linked data, it can
create URLs for the snapshots it will be re-publishing:
GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79;
stock:volume 16250100 }
snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
crawl:fetchedAt
"2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
crawl:lastModified
"2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
GRAPH snap:goog_20130912 { ... }
snap:goog_20130912 crawl:fetchedFrom feed:goog ....
...
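The bookkeeping a crawler would do on each re-fetch can be sketched as follows. This is a toy model under stated assumptions: the dataset is a dict of named graphs plus a set of default-graph triples, terms are strings, and the snapshot name is built from the last path segment of the feed URL and the date of the earlier fetch, mirroring names like snap:orcl_20130912 above. The function `store_fetch` is hypothetical.

```python
SNAP = "http://crawl.example.org/snapshots/"
FETCHED_FROM = "http://crawl.example.org/ns/fetchedFrom"
FETCHED_AT = "http://crawl.example.org/ns/fetchedAt"


def store_fetch(named, default, url, triples, fetched_at):
    """Store a fresh fetch under `url`, first archiving any previous content."""
    if url in named:
        # Look up when the old content was fetched, to build the snapshot name.
        old_time = [o for s, p, o in default if s == url and p == FETCHED_AT][0]
        day = old_time[:10].replace("-", "")
        snap = SNAP + url.rsplit("/", 1)[-1] + "_" + day
        # Move the old graph and its metadata to the snapshot name.
        named[snap] = named[url]
        default.discard((url, FETCHED_AT, old_time))
        default.add((snap, FETCHED_FROM, url))
        default.add((snap, FETCHED_AT, old_time))
    named[url] = set(triples)
    default.add((url, FETCHED_AT, fetched_at))


FEED = "http://stocks.example.com/data/orcl"
named, default = {}, set()
store_fetch(named, default, FEED,
            {("stock:orcl", "stock:closing", "32.79")}, "2013-09-12T23:51:02Z")
store_fetch(named, default, FEED,
            {("stock:orcl", "stock:closing", "32.46")}, "2013-09-15T14:57:02Z")

print(sorted(named))
```

Renaming before overwriting keeps the dataset consistent with the box reading: at no point does one name stand for two different sets of triples.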
Following best practice with linked data, the crawler should only use
snapshot URLs like this if there is a web server answering at those URLs
with suitable content. Because the crawler is using a box dataset, the
suitable content would have to be the RDF graph associated with that URL
in this dataset. Note that the metadata (like the crawl:fetchedAt
information) MUST NOT be embedded in that content since it's not inside
the named graph in the dataset above. Instead, if the metadata were to
be offered, it would have to be offered via another resource. The HTTP
Link header can be used to provide a link to it, like this:
> GET /snapshots/orcl_20130912 HTTP/1.1
> Host: crawl.example.org
> Accept: text/turtle; charset=utf-8
< HTTP/1.1 200 OK
< Server: nginx/1.2.1
< Date: Sun, 15 Sep 2013 15:28:38 GMT
< Content-Type: text/turtle; charset=utf-8
< Link: </snapshots/orcl_20130912_meta>; rel="meta"
( ... prefixes ... )
stock:orcl stock:closing 32.79; stock:volume 16250100.
and
> GET /snapshots/orcl_20130912_meta HTTP/1.1
> Host: crawl.example.org
> Accept: text/turtle; charset=utf-8
< HTTP/1.1 200 OK
< Server: nginx/1.2.1
< Date: Sun, 15 Sep 2013 15:28:38 GMT
< Content-Type: text/turtle; charset=utf-8
( ... prefixes ... )
snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
crawl:fetchedAt
"2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
crawl:lastModified
"2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
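A client that wants the metadata would need to find the rel="meta" target in the Link header and resolve it against the request URL. Here is a minimal sketch using only the Python standard library. The regular expression is a deliberate simplification that handles headers shaped like the one above; it is not a full Link header parser, and `meta_link` is a hypothetical helper.

```python
import re
from urllib.parse import urljoin


def meta_link(request_url, link_header):
    """Return the absolute URL of the rel="meta" link, or None if absent."""
    # Matches entries of the form: <target>; rel="name"
    pattern = r'<([^>]*)>\s*;\s*rel="?([^";,]+)"?'
    for target, rel in re.findall(pattern, link_header):
        if rel == "meta":
            return urljoin(request_url, target)
    return None


url = "http://crawl.example.org/snapshots/orcl_20130912"
header = '</snapshots/orcl_20130912_meta>; rel="meta"'
print(meta_link(url, header))
# http://crawl.example.org/snapshots/orcl_20130912_meta
```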
=== Derived Content
It may be useful to have the crawler do some processing on the RDF
content it fetches and then share the results of that processing. For
example, it might gather all the ontologies linked from fetched content,
do some RDFS or OWL reasoning with the results, and then include
some/all of the resulting entailments in additional graphs in the dataset.
For example, perhaps stock:closing used to be called
stock:closingSharePrice. To enable older clients to still read the
data, stocks.example.com might include in the stock: ontology the triple
{ stock:closing owl:equivalentProperty stock:closingSharePrice }.
(This would require older clients to be doing some OWL reasoning, of
course, which might or might not be a realistic assumption depending on
their user base.)
On seeing this equivalentProperty declaration, and doing some inference,
the crawler might add this to the dataset:
GRAPH snap:orcl_20130912_inferred { stock:orcl
stock:closingSharePrice 32.79 }
snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
Alternatively, the crawler might use the PROV ontology to be more
explicit about how the inference was made.
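The equivalentProperty step itself is small enough to sketch directly. The code below is an illustration only: it handles just this one OWL construct, represents triples as string tuples, and returns only the newly entailed triples so they can be stored in a separate graph such as snap:orcl_20130912_inferred. The function name `infer_equivalents` is hypothetical.

```python
def infer_equivalents(triples, equivalents):
    """Return triples newly entailed by owl:equivalentProperty pairs.

    `equivalents` is a set of (property, property) pairs; equivalence
    is symmetric, so each pair is applied in both directions.
    """
    inferred = set()
    for s, p, o in triples:
        for a, b in equivalents:
            if p == a:
                inferred.add((s, b, o))
            elif p == b:
                inferred.add((s, a, o))
    # Keep only what was not already stated in the source graph.
    return inferred - set(triples)


snapshot = {
    ("stock:orcl", "stock:closing", "32.79"),
    ("stock:orcl", "stock:volume", "16250100"),
}
equivalents = {("stock:closing", "stock:closingSharePrice")}

print(infer_equivalents(snapshot, equivalents))
# {('stock:orcl', 'stock:closingSharePrice', '32.79')}
```

Keeping the entailments out of the snapshot graph matters under box semantics, since adding them to snap:orcl_20130912 would change what that name is said to contain.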
As a related kind of derived content, the crawler might produce a
variation on the fetched graph where the non-canonical literals (like
1.00) are replaced with their canonical equivalents (like 1.0). It's
not clear how valuable this would be, however, since many downstream
systems (like all [?most?] SPARQL systems) will mask this difference.
===============
That's it for now. Awaiting feedback.
-- Sandro
Received on Sunday, 15 September 2013 16:48:43 UTC