Re: proposal: "box datasets" (sandro's dataset spec, v0.1)

On Sep 16, 2013, at 12:18 PM, Sandro Hawke wrote:

> On 09/16/2013 02:11 PM, Pat Hayes wrote:
>> On Sep 16, 2013, at 4:52 AM, Sandro Hawke wrote:
>> 
>> 
>>> [ I didn't think it appropriate to CC Jeremy, since obviously we don't have consensus yet, and he can't reply to this mailing list unless he joins the WG. ]
>>> 
>>> Key points would be:
>>> 
>>>   - For many years I thought about graphs as N3 does, just using what might be called graph literals (RDF terms which are syntactic expressions which denote RDF graphs or RDF graph patterns).   It's been a long journey for me, letting go of feeling that was the obviously-right way to handle this stuff to today, where I think the box model will work better for folks.
>>> 
>> Graphs are graphs and boxes are boxes (well, resources are resources). We need both of them. I don't see this as a fight between rival models. The only point at issue is, what do the name IRIs in a dataset refer to? It seems to me to be starkly obvious that the only *intuitively* reasonable answer is, they refer to what it is that they name, ie, the graph. 
> 
> So, with your proposed reading, could this dataset be true?
> 
>   GRAPH <http://www.w3.org/> { :a :b :c }

If we assume that this IRI denotes the W3C or the W3C web page, then no, it couldn't be true. 

> 
> What would it mean about the world if it was?

That the W3C was a rather small and boring RDF graph, I guess. 

> 
>> Now, of course, there will be those out there who consistently read "graph" as meaning g-box (or something like that), ie a thing with a state which delivers graphs when poked, and OK, to those readers, this will mean that the dataset is interpreted your way. But by spelling it out, you are forcing people who would have read the dataset spec correctly to buy into this misreading. I think this is a bad pedagogic strategy. 
>> 
> 
> My wild guess is that 99% of the people using SPARQL Update have already bought into the box model and 90% of the people using SPARQL have.
> 
> If I'm wrong about that, and we have another viable option, I'm all ears.

You might be right, but it still doesn't affect my point. 

> 
>>>   - I believe this box model is, in fact, how pretty much all SPARQL users think of this stuff.  I hear them talk a lot about using SPARQL, and they're always talking about putting things (triples) into a particular graph, deleting them from that graph, checking if they're in that graph, ...    That all fits the box model.   Very few have any idea there's a static "dataset" in the model, and even those who know that full well still talk about changing "graphs".
>>> 
>> This may be true, but I don't see that it's an argument for making our specs incoherent.
> 
> I don't think it's fair to call the box model incoherent.

I didn't mean it was. I meant only that having the box model in your account of datasets, alongside the very non-box platonic graph model in the RDF specs where datasets are mentioned, is incoherent when the two are taken together. Perhaps I should have said inconsistent. 

>   In a sufficiently static world, I believe it is perfectly coherent.  And in a not-very-static world, RDF itself is incoherent.   So... glass houses, and all that.   
> 
> Or is there some other incoherence I'm missing? 
> 
> I guess there's the term "named graph" and how the "name" doesn't denote the "graph".   If you know a way out of that one, again, I'm all ears.  As far as I can tell, in real-world usage, the term "Named Graph" means any g-box in a SPARQL Graph Store that got a "Graph Name".    It's linguistically awkward that the class Named Graph isn't a subclass of the class RDF Graph, but it's what we're stuck with.    I notice how some folks (like David Wood) accent the words in a way that suggests to me a "namedgraph" is a thing not necessarily related to a "graph".

In other discussions we seem to have got this sorted out, perhaps that will make this particular debate moot. 

> 
>>  And notice, this is a way of thinking about GRAPHS, not datasets in particular. So if we say that the dataset names refer to the graph, your readers who think this way will get the meaning you intend them to get. 
>> 
>> 
>>>   - Yes, the change-over-time thing is an issue here, but it's absolutely an issue in the rest of RDF, and it's no different here.   So (as I mentioned to danbri) this is something the RDF community will have to address.
>>> 
>> WILL have to, but let's put that off into the future, as this WG touched it and then dropped it as way too hot to handle. 
>> 
> 
> I agree.  I'm not proposing we handle it now.   I'm just observing that it's already a huge problem with RDF, and while box datasets might make it bite us more often, they didn't create the problem or anything.

All true, but I still think it's bad to sound like you are going into that territory when in fact you aren't. 

> 
> 
>>>   Note that Google has this problem now, full force, as the Google Knowledge Graph (which powers more and more of Search, as well as other products) is getting its triples in both the Freebase vocabulary (which models things as you would, as statements which are always true, although of course they can still change) and the Schema.org vocabulary (which models things just as they are right now, since it's trying to match how current natural language web pages say things, and that's how they usually do it).
>>> 
>>> So, basically, I have to challenge you to come up with a counter proposal.
>>> 
>> I already did. Keep separate issues separated, try to keep things clean. All of your extended example can be done just taking datasets to be made up of graphs (as they are defined to be) and having a postmarket flag to say that graph names really do refer to the graphs they name. And that is all.
> 
> I don't see how to make that work.
> 
> Shall we call that postmarket flag rdf:DirectDataset?
> 
> So
> 
>  <> a rdf:DirectDataset.
>  GRAPH :g1 { :a :b :c }
> 
> means that :g1 denotes the RDF Graph that is the set containing the triple :a :b :c.
> 
> So
> 
>   <> a rdf:DirectDataset.
>   GRAPH :g1 { :a :b :c }
>   GRAPH :g2 { :a :b :c }
>   :g1 :p 1.
> entails
>   :g2 :p 1.
> 
> right?
> 
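If that reading is right, the entailment can be sketched as a toy model in Python. This is only an illustration under an assumed simplification, not anything the WG has defined: each graph name denotes an immutable set of triples, so two names paired with identical graphs denote the same thing and share every property. The function names are hypothetical.

```python
def denotations(dataset):
    """Map each graph name to the immutable set of triples it labels."""
    return {name: frozenset(triples) for name, triples in dataset.items()}


def entailed_name_properties(dataset, name_triples):
    """Copy each statement about a graph name to every name of the same graph."""
    denot = denotations(dataset)
    out = set(name_triples)
    for s, p, o in name_triples:
        if s in denot:
            for other, graph in denot.items():
                if graph == denot[s]:
                    out.add((other, p, o))
    return out


# :g1 and :g2 label the same single-triple graph (assumed example data).
dataset = {":g1": [(":a", ":b", ":c")], ":g2": [(":a", ":b", ":c")]}
result = entailed_name_properties(dataset, {(":g1", ":p", 1)})
```

Under the box reading the two names would denote distinct containers, so no such copying would be licensed.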
> Now how do we represent Why_Graphs Use Case 1, storing the latest results of a crawl?
> 
> In the box model I wrote:
>  GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
>  feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>            crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
> 
> 
> How do you do something like that for the direct model, given the above entailment?
> 
> (thinking.)
> 
> Okay, it's not that hard.   You'd have to do something like this:
> 
>   GRAPH g:0034545 { stock:orcl stock:closing 32.46; stock:volume 17655400 }
>   [ a crawl:Retrieval;
>     crawl:resultGraph g:0034545;
>     crawl:timeStamp "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>     crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp ] .
> 
> (As an aside, that's a lot like how one would do it in N3, except without this odd URL g:0034545.  In N3, one would just write:
> 
>   [ a crawl:Retrieval;
>     crawl:resultGraph { stock:orcl stock:closing 32.46; stock:volume 17655400 };
>     crawl:timeStamp "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>     crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp ] .
> 
> ... which I agree makes a whole lot of sense.   But that's not what this WG is doing.)
> 
> So, that would work, but you end up with a whole lot of meaningless single-use graph names, which you can't even make dereferenceable, I think.

Can't we use blank nodes as labels here? Seems like a perfect use case for them.

>   And you're going to have folks doing SPARQL UPDATES to them all the time.  In fact, they HAVE TO, to create this dataset....   right?
> 
>>  And this has the nice property that it clearly indicates why the name/graph link is so tight (cf your inconsistency examples) which is otherwise puzzling. (If these are boxes, why can't they change? Are they stuck?  Why?)
> 
> They can change, just like foaf:name triples can change, when people change their name.   Or their address.  Or their instantaneous lat/lon.  Or their preferred email address.      And just like with foaf:name changing, if you make the naive assumption that all the RDF data ever published (even from trustworthy sources) is equally true, you're likely to have problems.    So if you're integrating RDF data from different points in time, or whatever, you need some extra machinery.  That extra machinery is a subject for research now and hopefully standardization in a few years.
> 
>> 
>> Now, OK, there will be people out there who treat "graph" as things you can change, but at least they will be doing that uniformly relative to the specs: a graph in RDF and a graph in SPARQL and a graph in a dataset are all the same kind of thing. And who knows, maybe some of them will actually read the specs. 
>> 
> 
> Anyone using SPARQL Update is constantly writing stuff like this:
> INSERT DATA
> { GRAPH <http://example/bookStore> { <http://example/book1> ns:price 42 } }
> 
> 
> or
> INSERT
>   { GRAPH <http://example/bookStore2> { ?book ?p ?v } }
> WHERE
>   { GRAPH <http://example/bookStore>
>        { ?book dc:date ?date .
>          FILTER ( ?date > "1970-01-01T00:00:00-02:00"^^xsd:dateTime )
>          ?book ?p ?v
>   } }
> 
> How would you read those, with immutable (aka real, RDF) graphs?    (those are examples from the SPARQL 1.1 Rec)

Sigh. OK, I admit, SPARQL obviously buys into the box model in spades. 
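For what it's worth, the slot reading of INSERT DATA can be sketched in Python as follows. This is a toy model under my own assumptions, not the SPARQL Update semantics verbatim: the class and method names are hypothetical, graphs are immutable sets, and an update re-points a named slot at a new graph rather than altering any graph.

```python
class GraphStore:
    """Hypothetical graph store: each name labels a mutable slot (a g-box)."""

    def __init__(self):
        self.slots = {}

    def insert_data(self, name, triples):
        # The old graph is left untouched; the slot now holds a new graph.
        old = self.slots.get(name, frozenset())
        self.slots[name] = old | frozenset(triples)
        return old, self.slots[name]


store = GraphStore()
before, after = store.insert_data(
    "http://example/bookStore",
    [("http://example/book1", "ns:price", 42)],
)
```

The name stays fixed while the graph it is paired with changes, which is exactly the behaviour a human reader describes as "changing the graph".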

> 
> (I know the SPARQL Update spec talks about slots.   My question is how a human is supposed to read those expressions without a g-box concept.)
> 
>> You might be right that the RDF world needs to standardize something like the box model (I prefer the terminology "graph resource") and get RDF made officially into a temporal context logic (actually something more, which keeps track of real time and has a "now" built into the semantics: AFAIK, no such logic has yet been invented by anyone, so this really is new ground.)  But let's do that later, because that is most definitely not in our charter, and let's do it properly, not sneaked into a note without any fanfare and in any case not really relevant to the topic that the note is supposed to be about. 
>> 
> 
> Yeah, (again) I'm sorry for suggesting the box model might in any way be blessed by the WG as any more than one of several ways to use Datasets.
> 
>>>    I'm not attached to any particular design, as you can probably tell because of how my proposals keep changing.   I just want a design that solves the problems current and future users have in maintaining separate streams of RDF data flowing through systems.    cf 1-4 on http://www.w3.org/2011/rdf-wg/wiki/Why_Graphs
>>>      In the long example below I pretty much showed how to do that.   (I left out UC3 for now.)  I have no idea how you can possibly do that, in a way which is mentally in reach of current and future SPARQL users, using the "naming model."  Please show me how.
>>> 
>> I will in a separate email, later this evening. 
> 
> Looking forward to it.   :-)

And I was too tired to even tackle it. And I still am :-)

Pat

> 
>       - s
> 
>> 
>> Pat
>> 
>> 
>>> Thanks.
>>> 
>>>         -- Sandro
>>> 
>>> 
>>> 
>>> 
>>> On 09/16/2013 01:47 AM, Pat Hayes wrote:
>>> 
>>>> On Sep 15, 2013, at 9:48 AM, Sandro Hawke wrote:
>>>> 
>>>> 
>>>>> Here's what I think we need to define to make Jeremy and many other people happy.   Obviously this is not the final draft of a spec, but hopefully it conveys the idea clearly enough.    If you read this, please say whether you see any seriously technical problems with it and/or would be happy with it going out as a WG Note.   Actually, the idea is so simple and so well-known, even if not formalized or named before, that maybe it's not out of the question to put in on the Rec Track -- but obviously not if it endangers anything else.
>>>>> 
>>>> Jeremy himself must be the one to say what makes Jeremy happy, but this is *not* a proposal to have named graphs in datasets be what Jeremy and I (and others) once called named graphs. Which is a pity, in my opinion. This proposal has two parts, getting them muddled up with one another, and I would like to keep them more separated.
>>>> 
>>>> One idea is to provide a way to state that graph names in certain datasets do indeed refer to the graph they label. Let me call this the naming idea.
>>>> 
>>>> The other idea is to treat the graphs in a dataset not as graphs, but as graph boxes containing a graph as their current state, but (presumably) able to be changed by future operations. Let me call this the box idea.
>>>> 
>>>> One can take either of these ideas independently from the other; they have no particular relationship. But the box idea is clearly at odds with the current definition of dataset in RDF and in SPARQL, so represents a much more drastic change than the naming idea. The box idea seems to me to be highly disruptive to put into a WG note, since it seems to suggest that datasets are labile things with a state, which is exactly what we decided to not have them be. (I know that technically it does not actually do this, but it sure *seems* to on first, in fact in my case on the first three, readings.) And I don't see any reason to introduce this box idea: we don't need it here (since in order for the proposal to make sense, the boxes must be fixed and not allowed to change.)
>>>> 
>>>> Other comments in-line below.
>>>> 
>>>> 
>>>>>       -- Sandro
>>>>> 
>>>>> == Introduction
>>>>> 
>>>>> A "box dataset" is a kind of RDF Dataset which adheres to certain semantic conditions.    These conditions are likely to be intuitive for a large set of RDF users, but they are not universally held, so some RDF Datasets are not box datasets.    Some readers may find this document challenging because they have never seriously considered the possibility of any other kind of dataset, so the properties of box datasets will seem utterly obvious.  The fact that a dataset is a box dataset may be conveyed using the rdf:BoxDataset class name or via some non-standard and/or out-of-band mechanism.
>>>>> 
>>>>> A box dataset is defined to be any RDF Dataset in which the graph names each denote some resource (sometimes called a "g-box") which "contains" exactly those triples which comprise the RDF Graph which is paired with that name in that dataset.
>>>>> 
>>>> Contains at what time, and under what circumstances? Does the containment refer to the time of publication of the dataset or the time it is read and used? Can this containment change with time? If so, how can users know what is the g-box when the dataset  is accessed? If not – if the g-box is 'fixed' – what is the point of introducing the g-box into the discussion in the first place? Why not just say that the graph name refers to the graph?
>>>> 
>>>> 
>>>>>   That is, this dataset:
>>>>> 
>>>>>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>  PREFIX : <http://example.org/>
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :a :b :c }
>>>>> 
>>>>> tells us that the resource denoted by <http://example.org/g1> contains exactly one RDF triple and what that triple is.
>>>>> 
>>>>> It contradicts this dataset:
>>>>> 
>>>> If we are going to use words like "contradict" then we really have to give a semantics for this. Which would not, of course, be hard to do.
>>>> 
>>>> 
>>>>>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>  PREFIX : <http://example.org/>
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :a :b :d }
>>>>> 
>>>>> since they disagree about what the contained triple is.
>>>>> 
>>>> But if :g1 is a box, cannot they both (have been) true but at different times? Maybe :g1 started with the first triple but later got changed to include the second triple instead, eg by a SPARQL update operation.
>>>> 
>>>> 
>>>>> These two datasets also contradict each other (given the same PREFIX declarations as above):
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :a :b 1.0 }
>>>>> 
>>>>> and
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :a :b 1.00 }
>>>>> 
>>>>> Even though "1.0"^^xs:decimal and "1.00"^^xs:decimal denote the same thing, they are not the same RDF term, so the triple { :a :b 1.0 } is not the same triple as { :a :b 1.00 }.  Since they are not the same triple, the datasets which say they are each what is contained by :g1 cannot both be true.   (See "Literal Term Equality" in RDF 1.1 Concepts.)
>>>>> 
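The distinction between term equality and value equality can be sketched in Python. This is a minimal illustration under assumptions of mine: literals are modelled as (lexical form, datatype) pairs typed as xsd:decimal, a box dataset is a mapping from names to sets of triples, and the helper names are hypothetical.

```python
from decimal import Decimal

XSD_DECIMAL = "http://www.w3.org/2001/XMLSchema#decimal"


def literal(lexical, datatype=XSD_DECIMAL):
    """An RDF literal term, identified by its lexical form and datatype."""
    return (lexical, datatype)


def same_value(a, b):
    """Value equality, implemented here for decimal literals only."""
    return Decimal(a[0]) == Decimal(b[0])


def contradict(ds1, ds2):
    """Box datasets contradict if a shared name is paired with different graphs."""
    return any(ds1[name] != ds2[name] for name in ds1.keys() & ds2.keys())


one_0, one_00 = literal("1.0"), literal("1.00")
ds1 = {":g1": frozenset({(":a", ":b", one_0)})}
ds2 = {":g1": frozenset({(":a", ":b", one_00)})}
```

Here `same_value(one_0, one_00)` holds while `one_0 == one_00` does not, so `contradict(ds1, ds2)` holds because the comparison is on terms, not values.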
>>>>> == Contains
>>>>> 
>>>>> This notion of "contains" is not formally defined but is reflected in the documentation of properties and classes used with Box Datasets.  It is essentially the same notion as people use when they say a web page "contains" some statements or a file "contains" some graphic image.    More broadly, the web can be thought of as "content" which is "contained" in web pages.
>>>>> 
>>>> And this common notion implies that pages and files have a state, ie their content can change without their identity changing. Do you want g-boxes to have this labile quality also?
>>>> 
>>>> 
>>>>> Given this pre-existing notion of "contains", it follows that pre-existing properties and classes can be used with Box Datasets with reasonable confidence they will be correctly understood.
>>>>> 
>>>> Um, bullshit? Especially if people use RDF to describe them. Utter confusion will reign, and become set in many forms of concrete.
>>>> 
>>>> 
>>>>> For example, given this dataset:
>>>>> 
>>>>>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>   PREFIX : <http://example.org/>
>>>>>   PREFIX dc: <http://purl.org/dc/terms/>
>>>>>   PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
>>>>> 
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
>>>>>  :g1 dc:creator :inspector1204;
>>>>>        dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.
>>>>> 
>>>>> if we read the documentation for dc:creator and dc:date, and if necessary consult the long history of how these terms have been used with web pages and computer files which "contain" various statements, it becomes clear that this dataset is telling us the given statement using the :toxicityLevel property was made by the given entity ("inspector1204") at the given time.   If we did not know this was a box dataset, we would not have any defined connection between :g1 and the :toxicityLevel triple.   We would know something was created by that inspector at that time, but its association with that triple would be undefined.
>>>>> 
>>>> Right, but that just needs the naming idea, not the box idea.
>>>> 
>>>> 
>>>>> == Dereference
>>>>> 
>>>>> While it would be out of scope for this specification to constrain or formally characterize what HTTP URIs denote, existing practice with metadata on web pages strongly suggests that when dereferencing a URL returns RDF triples, it is reasonable to think of that URL as denoting something which contains those triples.
>>>>> 
>>>> I don't think this is at all reasonable, in practice. In fact, the emerging consensus seems to be more like that what you get when you dereference a URI is some kind of representation or description of what it is that the URI refers to. Or maybe just "more information about" that thing. But that does not presume that the thing being described is a container of the description. ESPECIALLY when we are dealing with RDF.
>>>> 
>>>> 
>>>>>  This means this dataset:
>>>>> 
>>>>>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>  PREFIX : <http://example.org/>
>>>>> 
>>>>>  <> a rdf:BoxDataset.
>>>>>  GRAPH :g1 { :a :b :c }
>>>>> 
>>>>> can reasonably be assumed to be saying that dereferencing the URL "http://example.org/g1" provides the RDF triple { <http://example.org/a> <http://example.org/b> <http://example.org/c> }.
>>>>> 
>>>> True, but this would also be the case if :g1 were understood as denoting the actual graph, and what you got when you dereference it is an (awww-)representation of the graph, ie some bytes in a recognized RDF surface syntax which parse to that graph.
>>>> 
>>>> 
>>>>>   It can further be assumed that no other RDF triples are returned.   There is no implication about whether other (non-RDF) content might be returned.
>>>>> 
>>>>> Of course, web content can vary over time and per-client, and the content isn't always available, due to access control, network failures, etc.   The idea here is that those circumstances where the semantic constraints of the dataset are met are the same circumstances under which that URL would provide the given RDF content, if one were able to access it.
>>>>> 
>>>> Again, an idealization which is often applied to the Web in general.
>>>> 
>>>> 
>>>>>    That is, the dataset is only "true" if and when that URL is backed by that RDF content.
>>>>> 
>>>> No, it is true when the names refer correctly. You only KNOW it is true when the Web is working correctly so you can get your hands on the relevant information, but that is a separate issue. If I read a notice which uses a word I don't understand, then my ignorance does not make the notice false. What changes when I discover what the word means is my state of understanding, not the truth of the notice.
>>>> 
>>>> 
>>>>>  If the dataset is always true everywhere (which is the somewhat-naive standard reading of RDF) then that URL always has that RDF content. More nuanced notions of context, including change over time and different perspectives for different users remain as future work.
>>>>> 
>>>> You won't get away with this. If you insist that these graphs are boxes, and appeal to "normal" meanings, then some people will assume they are labile and their state can change, some people will also assume that they are always about the present, while others will assume that they are really graphs all the time. And all these assumings will be implicit in deployed RDF, adding to the babel of confusion that we already have.
>>>> 
>>>> 
>>>>> == Web Crawler Example
>>>>> 
>>>>> As a more complete example, consider the case of a system which crawls the web looking for RDF content.  It might store everything it has gathered during its repeated crawling in a box dataset.  It might also do some canonicalization (think of 1.0 and 1.00 in the introduction) and some inference, and  store the output of that processing in the dataset.   Then it can make the entire dataset available to SPARQL Queries, since SPARQL is defined as querying an RDF Dataset, and it can make it available for download in one or more dataset syntaxes like TriG, JSON-LD, and N-Quads.
>>>>> 
>>>>> For this example, we'll assume the crawler is only looking at one site, http://stocks.example.com, and that site publishes RDF with stock closing prices each day at URLs like http://stocks.example.com/data/orcl (for Oracle Corporation, whose ticker symbol is "orcl").   Oracle was selected at random for this example from among the publicly traded companies actively participating in the RDF Working Group, namely Oracle, IBM, and Google.
>>>>> 
>>>>> We'll use the following PREFIXes:
>>>>> 
>>>>>  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>>  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>>>  PREFIX feed: <http://stocks.example.com/data/>       # data feeds from stocks.example.com
>>>>>  PREFIX stock: <http://stocks.example.com/vocab#>      # stock terminology, and IRIs for public companies
>>>>>  PREFIX crawl: <http://crawl.example.org/ns/>          # crawler terminology
>>>>>  PREFIX snap: <http://crawl.example.org/snapshots/>    # where the crawler publishes individual snapshots
>>>>>  PREFIX dc: <http://purl.org/dc/terms/>
>>>>>  PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
>>>>> 
>>>>> 
>>>>> === Latest Content
>>>>> 
>>>>> The latest content might be stored in named graphs with the name being the dereference URL, like this:
>>>>> 
>>>>>  GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
>>>>>  feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>>>>>            crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
>>>>>  GRAPH feed:goog { ... }
>>>>>  feed:goog crawl:fetchedAt ....
>>>>>  GRAPH feed:ibm { ... }
>>>>>  feed:ibm crawl:fetchedAt ...
>>>>>  ...
>>>>> 
>>>>> In this example, stocks.example.com has chosen to make the daily information available at one URL (http://stocks.example.com/data/orcl) while the stable, long term information about every company is available at another (http://stocks.example.com/vocab).    When the crawler visits that second document, it will add this to the dataset:
>>>>> 
>>>>>  GRAPH <http://stocks.example.com/vocab> {
>>>>>    stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
>>>>>        rdfs:label "Oracle Corporation";
>>>>>        stock:ticker "orcl".
>>>>>     ...
>>>>>     stock:ticker a rdf:Property;
>>>>>         rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
>>>>>  }
>>>>>  <http://stocks.example.com/vocab> crawl:fetchedAt "2013-09-15T16:00:02Z"^^xs:dateTimeStamp.
>>>>> 
>>>>> === Older Content
>>>>> 
>>>>> The older content will need to be stored with different graph names to avoid colliding with the latest content.
>>>>> 
>>>> Does the latest content use the same URI now for today's information that it used yesterday for yesterday's information? If so, what kind of entity is this named graph?
>>>> 
>>>> 
>>>>>  Here it would be reasonable to use blank nodes as the graph names, if the crawler does not want to serve linked data, like this:
>>>>> 
>>>>>  GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
>>>>>  _:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>>>                  crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>>>                  crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>>>  GRAPH _:goog_20130912 { ... }
>>>>>  _:goog_20130912 crawl:fetchedFrom feed:goog ....
>>>>>  ...
>>>>> 
>>>>> Alternatively, if the crawler is willing to provide linked data, it can create URLs for the snapshots it will be re-publishing:
>>>>> 
>>>>>  GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
>>>>>  snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>>>                     crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>>>                     crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>>>  GRAPH snap:goog_20130912 { ... }
>>>>>  snap:goog_20130912 crawl:fetchedFrom feed:goog ....
>>>>>  ...
>>>>> 
>>>>> Following best practice with linked data, the crawler should only use snapshot URLs like this if there is a web server answering at those URLs with suitable content.   Because the crawler is using a box dataset, the suitable content would have to be the RDF graph associated with that URL in this dataset.   Note that the metadata (like the crawl:fetchedAt information) MUST NOT be embedded in that content since it's not inside the named graph in the dataset above.
>>>>> 
>>>> But it could have been, right? So this is a design decision rather than an imperative. (Or am I not following something? I find examples like this more confusing than helpful when I don't know what exactly they are supposed to illustrate.)
>>>> 
>>>> 
>>>>> Instead, if the metadata were to be offered, it would have to be offered via another resource.  The HTTP Link header can be used to provide a link to it, like this:
>>>>> 
>>>>> 
>>>>> > GET /snapshots/orcl_20130912 HTTP/1.1
>>>>> > Host: crawl.example.org
>>>>> > Accept: text/turtle; charset=utf-8
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Server: nginx/1.2.1
>>>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>>>> < Content-Type: text/turtle; charset=utf-8
>>>>> < Link: </snapshots/orcl_20130912_meta>; rel="meta"
>>>>> ( ... prefixes ... )
>>>>> stock:orcl stock:closing 32.79; stock:volume  16250100.
>>>>> 
>>>>> and
>>>>> 
>>>>> 
>>>>> > GET /snapshots/orcl_20130912_meta HTTP/1.1
>>>>> > Host: crawl.example.org
>>>>> > Accept: text/turtle; charset=utf-8
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Server: nginx/1.2.1
>>>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>>>> < Content-Type: text/turtle; charset=utf-8
>>>>> ( ... prefixes ... )
>>>>> snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>>>                    crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>>>                    crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>>> 
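A client following this pattern would need to pull the metadata URL out of the Link header. A rough Python sketch of that step, using only the standard library: the function name is hypothetical, the headers are taken from the example response above, and splitting on commas is a simplification that would break on link targets containing commas.

```python
import re


def find_link(headers, rel):
    """Return the target of the first Link entry with the given rel, or None."""
    value = headers.get("Link", "")
    for part in value.split(","):
        match = re.match(r'\s*<([^>]*)>\s*;(.*)', part)
        if match and re.search(
            r'rel="?%s"?(\s|;|$)' % re.escape(rel), match.group(2)
        ):
            return match.group(1)
    return None


# Response headers copied from the snapshot example in the quoted draft.
headers = {
    "Content-Type": "text/turtle; charset=utf-8",
    "Link": '</snapshots/orcl_20130912_meta>; rel="meta"',
}
meta_url = find_link(headers, "meta")
```

The returned target is relative, so a real client would still have to resolve it against the request URL before fetching the metadata.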
>>>>> === Derived Content
>>>>> 
>>>>> It may be useful to have the crawler do some processing on the RDF content it fetches and then share the results of that processing. For example, it might gather all the ontologies linked from fetched content, do some RDFS or OWL reasoning with the results, and then include some/all of the resulting entailments in additional graphs in the dataset.
>>>>> 
>>>>> For example, perhaps stock:closing used to be called stock:closingSharePrice.  To enable older clients to still read the data, stocks.example.com might include in the stock: ontology the triple { stock:closing owl:equivalentProperty stock:closingSharePrice }.   (This would require older clients to be doing some OWL reasoning, of course, which might or might not be a realistic assumption depending on their user base.)
>>>>> 
>>>>> On seeing this equivalentProperty declaration, and doing some inference, the crawler might add this to the dataset:
>>>>> 
>>>>>  GRAPH snap:orcl_20130912_inferred { stock:orcl stock:closingSharePrice 32.79 }
>>>>>  snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
>>>>> 
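The inference step described here amounts to rewriting predicates along the declared equivalences. A small Python sketch under assumptions of mine: triples are plain tuples of prefixed-name strings, only owl:equivalentProperty is handled, and the function name is hypothetical.

```python
def infer_equivalents(graph, equivalences):
    """Derive new triples by rewriting predicates along equivalentProperty pairs."""
    # owl:equivalentProperty is symmetric, so apply each pair in both directions.
    pairs = set(equivalences) | {(b, a) for a, b in equivalences}
    derived = set()
    for s, p, o in graph:
        for a, b in pairs:
            if p == a:
                derived.add((s, b, o))
    # Only report triples that were not already in the source graph.
    return derived - set(graph)


snapshot = {
    ("stock:orcl", "stock:closing", "32.79"),
    ("stock:orcl", "stock:volume", "16250100"),
}
inferred = infer_equivalents(
    snapshot, [("stock:closing", "stock:closingSharePrice")]
)
```

Keeping the derived triples separate from the fetched ones mirrors the draft's choice to publish them in their own graph, linked back with crawl:inferredFrom.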
>>>>> Alternatively, the crawler might use the PROV ontology to be more explicit about how the inference was made.
>>>>> 
>>>>> As a related kind of derived content, the harvester might produce a variation on the fetched graph where the non-canonical literals (like 1.00) are replaced with their canonical equivalents (like 1.0).   It's not clear how valuable this would be, however, since many downstream systems (like all [?most?] SPARQL systems) will mask this difference.
>>>>> 
>>>> As far as I can tell, and I might have missed something, all of this can be done under the assumption that the graph name is actually the name of the *graph*, not of a box containing the graph. As that way of expressing all this is (1) simpler (2) more in line with both the history of graph naming and the current normative definition of a dataset and (3) less liable to be misinterpreted as allowing labile "graphs" in datasets, I would prefer to avoid the "box" terminology and just have something like this which requires graph names to denote the actual graphs, ie the naming convention without the box convention.
>>>> 
>>>> Pat
>>>> 
>>>> 
>>>>> ===============
>>>>> 
>>>>> That's it for now.    Awaiting feedback.
>>>>> 
>>>>>        -- Sandro
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> ------------------------------------------------------------
>>>> IHMC                                     (850)434 8903 home
>>>> 40 South Alcaniz St.            (850)202 4416   office
>>>> Pensacola                            (850)202 4440   fax
>>>> FL 32502                              (850)291 0667   mobile (preferred)
>>>> 
>>>> phayes@ihmc.us       http://www.ihmc.us/users/phayes
>>>> 
> 


Received on Wednesday, 18 September 2013 07:46:01 UTC