Re: proposal: "box datasets" (sandro's dataset spec, v0.1)

On Sep 15, 2013, at 9:48 AM, Sandro Hawke wrote:

> Here's what I think we need to define to make Jeremy and many other people happy.   Obviously this is not the final draft of a spec, but hopefully it conveys the idea clearly enough.    If you read this, please say whether you see any serious technical problems with it and/or would be happy with it going out as a WG Note.   Actually, the idea is so simple and so well-known, even if not formalized or named before, that maybe it's not out of the question to put it on the Rec Track -- but obviously not if it endangers anything else.

Jeremy himself must be the one to say what makes Jeremy happy, but this is *not* a proposal to have named graphs in datasets be what Jeremy and I (and others) once called named graphs. Which is a pity, in my opinion. This proposal has two parts, which it muddles up with one another, and I would like to keep them more clearly separated. 

One idea is to provide a way to state that graph names in certain datasets do indeed refer to the graph they label. Let me call this the naming idea. 

The other idea is to treat the graphs in a dataset not as graphs, but as graph boxes, each containing a graph as its current state but (presumably) able to be changed by future operations. Let me call this the box idea. 

One can take either of these ideas independently of the other; they have no particular relationship. But the box idea is clearly at odds with the current definition of dataset in RDF and in SPARQL, and so represents a much more drastic change than the naming idea. The box idea seems to me to be highly disruptive to put into a WG note, since it seems to suggest that datasets are labile things with a state, which is exactly what we decided not to have them be. (I know that technically it does not actually do this, but it sure *seems* to on a first reading; in my case, on the first three readings.) And I don't see any reason to introduce this box idea: we don't need it here, since in order for the proposal to make sense, the boxes must be fixed and not allowed to change. 

Other comments in-line below. 

> 
>       -- Sandro
> 
> == Introduction
> 
> A "box dataset" is a kind of RDF Dataset which adheres to certain semantic conditions.    These conditions are likely to be intuitive for a large set of RDF users, but they are not universally held, so some RDF Datasets are not box datasets.    Some readers may find this document challenging because they have never seriously considered the possibility of any other kind of dataset, so the properties of box datasets will seem utterly obvious.  The fact that a dataset is a box dataset may be conveyed using the rdf:BoxDataset class name or via some non-standard and/or out-of-band mechanism.
> 
> A box dataset is defined to be any RDF Dataset in which the graph names each denote some resource (sometimes called a "g-box") which "contains" exactly those triples which comprise the RDF Graph which is paired with that name in that dataset.  

Contains at what time, and under what circumstances? Does the containment refer to the time of publication of the dataset or the time it is read and used? Can this containment change with time? If so, how can users know what is in the g-box when the dataset is accessed? If not, that is if the g-box is 'fixed', what is the point of introducing the g-box into the discussion in the first place? Why not just say that the graph name refers to the graph? 
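
Just to make the naming idea concrete, it could be signalled in exactly the way the proposal signals box datasets, using some other class (a sketch; the class name rdf:NamedGraphDataset is my invention, used here only to mark the convention):

 PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
 PREFIX : <http://example.org/>
 <> a rdf:NamedGraphDataset.    # hypothetical class: graph names denote their graphs
 GRAPH :g1 { :a :b :c }

Here :g1 simply denotes the graph { :a :b :c }, and any metadata attached to :g1 is metadata about that graph. No box, no state, and no question about when the containment holds.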

>   That is, this dataset:
> 
>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>  PREFIX : <http://example.org/>
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :a :b :c }
> 
> tells us that the resource denoted by <http://example.org/g1> contains exactly one RDF triple and what that triple is.
> 
> It contradicts this dataset:

If we are going to use words like "contradict" then we really have to give a semantics for this. Which would not, of course, be hard to do.
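
To sketch the kind of condition I have in mind (my wording, not anything in the current drafts): relative to an interpretation I, say that a box dataset ( DG, (n1, G1), ... , (nk, Gk) ) is true just when DG is true in I and, for each i, I(ni) is something whose contents are exactly the graph Gi (or, on the naming reading, simply is Gi). Two such datasets then "contradict" one another just when no single interpretation makes both true, which is the ordinary sense of the word.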

> 
>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>  PREFIX : <http://example.org/>
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :a :b :d }
> 
> since they disagree about what the contained triple is.

But if :g1 is a box, could they not both be (or have been) true, but at different times? Maybe :g1 started with the first triple but was later changed to contain the second triple instead, eg by a SPARQL update operation.
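
For instance (a sketch in SPARQL 1.1 Update syntax, using the same PREFIX declarations as in the example), this request would take the box from the first state to the second:

 DELETE DATA { GRAPH :g1 { :a :b :c } } ;
 INSERT DATA { GRAPH :g1 { :a :b :d } }

On the box reading it is hard to see why the two datasets could not both have been true, just at different times.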

> 
> These two datasets also contradict each other (given the same PREFIX declarations as above):
> 
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :a :b 1.0 }
> 
> and
> 
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :a :b 1.00 }
> 
> Even though "1.0"^^xs:double "1.00"^^xs:double denote the same thing, they are not the same RDF term, so the triple { :a :b 1.0 } is not the same triple as { :a :b 1.00 }.  Since they are not the same triple, the datasets which say they are each what is contained by :g1 cannot both be true.   (See "Literal Term Equality" in RDF 1.1 Concepts.)
> 
> == Contains
> 
> This notion of "contains" is not formally defined but is reflected in the documentation of properties and classes used with Box Datasets.  It is essentially the same notion as people use when they say a web page "contains" some statements or a file "contains" some graphic image.    More broadly, the web can be thought of as "content" which is "contained" in web pages.

And this common notion implies that pages and files have a state, ie their content can change without their identity changing. Do you want g-boxes to have this labile quality also? 

> 
> Given this pre-existing notion of "contains", it follows that pre-existing properties and classes can be used with Box Datasets with reasonable confidence that they will be correctly understood.

Um, bullshit? Especially if people use RDF to describe them. Utter confusion will reign, and become set in many forms of concrete.

> For example, given this dataset:
> 
>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>   PREFIX : <http://example.org/>
>   PREFIX dc: <http://purl.org/dc/terms/>
>   PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>
> 
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
>  :g1 dc:creator :inspector1204;
>        dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.
> 
> if we read the documentation for dc:creator and dc:date, and if necessary consult the long history of how these terms have been used with web pages and computer files which "contain" various statements, it becomes clear that this dataset is telling us that the given statement using the :toxicityLevel property was made by the given entity ("inspector1204") at the given time.   If we did not know this was a box dataset, we would not have any defined connection between :g1 and the :toxicityLevel triple.   We would know something was created by that inspector at that time, but its association with that triple would be undefined.

Right, but that just needs the naming idea, not the box idea. 

> 
> == Dereference
> 
> While it would be out of scope for this specification to constrain or formally characterize what HTTP URIs denote, existing practice with metadata on web pages strongly suggests that when referencing a URL returns RDF triples, it is reasonable to think of that URL as denoting something which contains those triples.  

I don't think this is at all reasonable, in practice. In fact, the emerging consensus seems to be more that what you get when you dereference a URI is some kind of representation or description of what it is that the URI refers to. Or maybe just "more information about" that thing. But that does not presume that the thing being described is a container of the description. ESPECIALLY when we are dealing with RDF. 

>  This means this dataset:
> 
>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>  PREFIX : <http://example.org/>
>  <> a rdf:BoxDataset.
>  GRAPH :g1 { :a :b :c }
> 
> can reasonably be assumed to be saying that dereferencing the URL "http://example.org/g1" provides the RDF triple { <http://example.org/a> <http://example.org/b> <http://example.org/c> }.  

True, but this would also be the case if :g1 were understood as denoting the actual graph, and what you get when you dereference it were an (AWWW-)representation of the graph, ie some bytes in a recognized RDF surface syntax which parse to that graph. 
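
In concrete terms, the exchange might look like this (a sketch; the headers are illustrative only):

 GET /g1 HTTP/1.1
 Host: example.org
 Accept: text/turtle

 HTTP/1.1 200 OK
 Content-Type: text/turtle; charset=utf-8

 @prefix : <http://example.org/> .
 :a :b :c .

Nothing in those bytes decides whether <http://example.org/g1> denotes a box whose contents are the graph, or the graph itself; either reading is compatible with what comes back over the wire.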

>   It can further be assumed that no other RDF triples are returned.   There is no implication about whether other (non-RDF) content might be returned.
> 
> Of course, web content can vary over time and per-client, and the content isn't always available, due to access control, network failures, etc.   The idea here is that those circumstances where the semantic constraints of the dataset are met are the same circumstances under which that URL would provide the given RDF content, if one were able to access it.

Again, an idealization which is often applied to the Web in general. 

>    That is, the dataset is only "true" if and when that URL is backed by that RDF content.

No, it is true when the names refer correctly. You only KNOW it is true when the Web is working correctly so you can get your hands on the relevant information, but that is a separate issue. If I read a notice which uses a word I don't understand, then my ignorance does not make the notice false. What changes when I discover what the word means is my state of understanding, not the truth of the notice. 

>  If the dataset is always true everywhere (which is the somewhat-naive standard reading of RDF) then that URL always has that RDF content. More nuanced notions of context, including change over time and different perspectives for different users remain as future work.

You won't get away with this. If you insist that these graphs are boxes, and appeal to "normal" meanings, then some people will assume they are labile and their state can change, some people will also assume that they are always about the present, while others will assume that they are really graphs all the time. And all these assumings will be implicit in deployed RDF, adding to the babel of confusion that we already have. 

> 
> == Web Crawler Example
> 
> As a more complete example, consider the case of a system which crawls the web looking for RDF content.  It might store everything it has gathered during its repeated crawling in a box dataset.  It might also do some canonicalization (think of 1.0 and 1.00 in the introduction) and some inference, and  store the output of that processing in the dataset.   Then it can make the entire dataset available to SPARQL Queries, since SPARQL is defined as querying an RDF Dataset, and it can make it available for download in one or more dataset syntaxes like TriG, JSON-LD, and N-Quads.
> 
> For this example, we'll assume the crawler is only looking at one site, http://stocks.example.com, and that site publishes RDF with stock closing prices each day at URLs like http://stocks.example.com/data/orcl (for Oracle Corporation, whose ticker symbol is "orcl").   Oracle was selected at random for this example from among the publicly traded companies actively participating in the RDF Working Group, namely Oracle, IBM, and Google.
> 
> We'll use the following PREFIXes:
> 
>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>  PREFIX feed: <http://stocks.example.com/data/>    # data feeds from stocks.example.com
>  PREFIX stock: <http://stocks.example.com/vocab#>     # stock terminology, and IRIs for public companies
>  PREFIX crawl: <http://crawl.example.org/ns/>          # crawler terminology
>  PREFIX snap: <http://crawl.example.org/snapshots/>   # where the crawler publishes individual snapshots
>  PREFIX dc: <http://purl.org/dc/terms/>
>  PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>
> 
> === Latest Content
> 
> The latest content might be stored in named graphs, with the name being the dereference URL, like this:
> 
>  GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
>  feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>                 crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
>  GRAPH feed:goog { ... }
>  feed:goog crawl:fetchedAt ....
>  GRAPH feed:ibm { ... }
>  feed:ibm crawl:fetchedAt ...
>  ...
> 
> In this example, stocks.example.com has chosen to make the daily information available at one URL (http://stocks.example.com/data/orcl) while the stable, long-term information about every company is available at another (http://stocks.example.com/vocab).    When the crawler visits that second document, it will add this to the dataset:
> 
>  GRAPH <http://stocks.example.com/vocab> {
>    stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
>        rdfs:label "Oracle Corporation";
>        stock:ticker "orcl".
>     ...
>     stock:ticker a rdfs:Property;
>         rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
>  }
>  <http://stocks.example.com/vocab> crawl:fetchedAt "2013-09-15T16:00:02Z"^^xs:dateTimeStamp;
> 
> === Older Content
> 
> The older content will need to be stored with different graph names to avoid colliding with the latest content.

Does the latest content use the same URI now for today's information that it used yesterday for yesterday's information? If so, what kind of entity is this named graph? 

>  Here it would be reasonable to use blank nodes as the graph names, if the crawler does not want to serve linked data, like this:
> 
>  GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume  16250100 }
>  _:orcl_20130912 crawl:fetchedFrom feed:orcl;
>                             crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>                             crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>  GRAPH _:goog_20130912 { ... }
>   _:goog_20130912 crawl:fetchedFrom feed:goog ....
>   ...
> 
> Alternatively, if the crawler is willing to provide linked data, it can create URLs for the snapshots it will be re-publishing:
> 
>  GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
>  snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>                                   crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>                                   crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>  GRAPH snap:goog_20130912 { ... }
>   snap:goog_20130912 crawl:fetchedFrom feed:goog ....
>   ...
> 
> Following best practice with linked data, the crawler should only use snapshot URLs like this if there is a web server answering at those URLs with suitable content.   Because the crawler is using a box dataset, the suitable content would have to be the RDF graph associated with that URL in this dataset.   Note that the metadata (like the crawl:fetchedAt information) MUST NOT be embedded in that content since it's not inside the named graph in the dataset above.

But it could have been, right? So this is a design decision rather than an imperative. (Or am I not following something? I find examples like this more confusing than helpful when I don't know what exactly they are supposed to illustrate.)
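
That is, nothing in the box-dataset machinery itself seems to rule out a dataset like this (same prefixes as above; whether it is good practice is a separate question), in which case the dereferenced content would simply have to include the metadata triples as well:

 GRAPH snap:orcl_20130912 {
    stock:orcl stock:closing 32.79; stock:volume 16250100 .
    snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
        crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
        crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
 }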

> Instead, if the metadata were to be offered, it would have to be offered via another resource.  The HTTP Link header can be used to provide a link to it, like this:
> 
> > GET /snapshots/orcl_20130912 HTTP/1.1
> > Host: crawl.example.org
> > Accept: text/turtle; charset=utf-8
> 
> < HTTP/1.1 200 OK
> < Server: nginx/1.2.1
> < Date: Sun, 15 Sep 2013 15:28:38 GMT
> < Content-Type: text/turtle; charset=utf-8
> < Link: </snapshots/orcl_20130912_meta>; rel="meta"
> ( ... prefixes ... )
> stock:orcl stock:closing 32.79; stock:volume  16250100.
> 
> and
> 
> > GET /snapshots/orcl_20130912_meta HTTP/1.1
> > Host: crawl.example.org
> > Accept: text/turtle; charset=utf-8
> 
> < HTTP/1.1 200 OK
> < Server: nginx/1.2.1
> < Date: Sun, 15 Sep 2013 15:28:38 GMT
> < Content-Type: text/turtle; charset=utf-8
> ( ... prefixes ... )
> snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>                                   crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>                                   crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
> 
> === Derived Content
> 
> It may be useful to have the crawler do some processing on the RDF content it fetches and then share the results of that processing. For example, it might gather all the ontologies linked from fetched content, do some RDFS or OWL reasoning with the results, and then include some/all of the resulting entailments in additional graphs in the dataset.
> 
> For example, perhaps stock:closing used to be called stock:closingSharePrice.  To enable older clients to still read the data, stocks.example.com might include in the stock: ontology the triple { stock:closing owl:equivalentProperty stock:closingSharePrice }.   (This would require older clients to be doing some OWL reasoning, of course, which might or might not be a realistic assumption depending on their user base.)
> 
> On seeing this equivalentProperty declaration, and doing some inference, the crawler might add this to the dataset:
> 
>  GRAPH snap:orcl_20130912_inferred { stock:orcl stock:closingSharePrice 32.79 }
>  snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
> 
> Alternatively, the crawler might use the prov ontology to be more explicit about how the inference was made.
> 
> As a related kind of derived content, the harvester might produce a variation on the fetched graph where the non-canonical literals (like 1.00) are replaced with their canonical equivalents (like 1.0).   It's not clear how valuable this would be, however, since many downstream systems (like all [?most?] SPARQL systems) will mask this difference.

As far as I can tell, and I might have missed something, all of this can be done under the assumption that the graph name is actually the name of the *graph*, not of a box containing the graph. As that way of expressing all this is (1) simpler, (2) more in line with both the history of graph naming and the current normative definition of a dataset, and (3) less liable to be misinterpreted as allowing labile "graphs" in datasets, I would prefer to avoid the "box" terminology and just have something like this which requires graph names to denote the actual graphs, ie the naming convention without the box convention. 
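
To illustrate, the "latest content" part of the crawler example could be recast so that each graph name denotes one fixed snapshot graph, and the mutable feed URL is related to it with an explicit property (a sketch; crawl:latestSnapshot and the snapshot name are my inventions):

 GRAPH snap:orcl_20130915 { stock:orcl stock:closing 32.46; stock:volume 17655400 }
 feed:orcl crawl:latestSnapshot snap:orcl_20130915.
 snap:orcl_20130915 crawl:fetchedFrom feed:orcl;
     crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
     crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.

Everything that changes over time is then said with ordinary triples about the feed, while each graph name denotes exactly one unchanging graph.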

Pat

> 
> ===============
> 
> That's it for now.    Awaiting feedback.
> 
>        -- Sandro
> 
> 
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 home
40 South Alcaniz St.            (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile (preferred)
phayes@ihmc.us       http://www.ihmc.us/users/phayes

Received on Monday, 16 September 2013 05:48:24 UTC