Re: proposal: "box datasets" (sandro's dataset spec, v0.1)

On 16 September 2013 12:52, Sandro Hawke <sandro@w3.org> wrote:
> [ I didn't think it appropriate to CC Jeremy, since obviously we don't have
> consensus yet, and he can't reply to this mailing list unless he joins the
> WG. ]

I see JJC cc:'d so I'll leave him in place here. Hi Jeremy!

> Key points would be:
>
>    - For many years I thought about graphs as N3 does, just using what might
> be called graph literals (RDF terms which are syntactic expressions which
> denote RDF graphs or RDF graph patterns).   It's been a long journey for me,
> letting go of feeling that was the obviously-right way to handle this stuff
> to today, where I think the box model will work better for folks.

The box model isn't grabbing me, I'm afraid. I tend to see
de-referencing some URL as a lucky dip, and each time you get a
potentially different re-representation of the otherwise unknowable
entity whose URI you're GET'ing.

(aside re N3, that's funny. The box model reminded me of TimBL's old
'log:semantics' idealization, which if I recall correctly, suggested
cartoonishly that there's only ever one set of triples that is 'the
semantics' of some URI. Well maybe - I'm not sure log:semantics was
declared as owl:FunctionalProperty, but that seemed to be the
intention.)

Perhaps it's unfair for a rejoining / absentee participant to say
this, but anyway: I am rather unsettled to find the WG re-treading the
same territory it was passionately discussing when I was last in the
group, and to be doing so without any agreed motivating scenario
around which different formal models might be compared.

I wrote up 'dilbert schematics' a couple years back,
http://danbri.org/words/2011/11/03/753 comparing simple 'hasCubicle'
assertions (which would need time-qualifying) with
'cubicle-occupation' scenarios. I really don't care what example we
use, but suggest that

  GRAPH :g1 { :a :b :c }

... is just too abstract to be a useful focal point for building
consensus. Proposals should plausibly express at least one
real-world-tinged example, even if (like the dilbert one) it is still
a simplification. Other examples to consider might be scholarly or
cultural heritage descriptions (the former might include volatile
citation count data; the latter might include educational events, talks),
TV/movie data (movies have volatile ratings; TV listings data often
gets more precise post-transmission, once last minute guest list
changes are clarified). Change is not something that can be dealt with
later as icing-on-the-cake, it goes to the heart of why people want
more clarity around named graphs and their metadata.

If you give me a standards-track story about these kinds of
(change-riddled) descriptive scenarios I can probably work out whether
'boxes' help with managing RDF; if you give me a standards-track story
about 'a', 'b' and 'c', I'm rather more at a loss.
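
As a rough sketch of the sort of change-riddled example meant here
(every ex: name and every value below is invented for illustration,
not taken from any real feed), a volatile movie rating captured on two
different days might look like:

```trig
# Illustrative only: the ex: names, dates and numbers are made up.
PREFIX ex: <http://example.org/>
PREFIX dc: <http://purl.org/dc/terms/>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>

# One day's view of a movie's rating ...
GRAPH ex:ratings_20130909 { ex:movie1 ex:averageRating 7.9 ; ex:ratingCount 1204 }
ex:ratings_20130909 dc:date "2013-09-09"^^xs:date .

# ... and the next day's, after more votes have come in.
GRAPH ex:ratings_20130910 { ex:movie1 ex:averageRating 8.0 ; ex:ratingCount 1310 }
ex:ratings_20130910 dc:date "2013-09-10"^^xs:date .
```

A proposal then has to say what happens when the publisher keeps a
single graph name (say ex:ratings) and simply updates what is behind
it, which is where the competing models start to differ.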

>    - I believe this box model is, in fact, how pretty much all SPARQL users
> think of this stuff.  I hear them talk a lot about using SPARQL, and they're
> always talking about putting things (triples) into a particular graph,
> deleting them from that graph, checking if they're in that graph, ...
> That all fits the box model.   Very few have any idea there's a static
> "dataset" in the model, and even those who know that full well still talk
> about changing "graphs".

Feels like there's a map/territory distinction going funny here. What
I do in the privacy of my SPARQL database is one matter; what triples
are somehow contained within the things associated with the URIs I use
to name graphs is quite another.

>    - Yes, the change-over-time thing is an issue here, but it's absolutely
> an issue in the rest of RDF, and it's no different here.   So (as I
> mentioned to danbri) this is something the RDF community will have to
> address.   Note that Google has this problem now, full force, as the Google
> Knowledge Graph (which powers more and more of Search, as well as other
> products) is getting its triples in both the Freebase vocabulary (which
> models things as you would, as statements which are always true, although of
> course they can still change) and the Schema.org vocabulary (which models
> things just as they are right now, since it's trying to match how current
> natural language web pages say things, and that's how they usually do it).

(Since you ask, ...)
At schema.org we are interested in modeling things (actions/events)
that have not yet happened, such as potential actions (e.g.
SandroDanDinnerAtISWC2013Event). It may never happen, or it may come
about multiple times, depending on whether Sandro and I are both at
that conference, whether we meet up, etc. This schema.org concern has
got us talking about when to try to squeeze everything into a triples
model (which often forces a kind of lower-case-r-reification), versus
when to stand back and talk about packets of triples aka (named)
graphs. Guha and I have lately been looking at whether an entire graph
could be decorated with - for starters - a temporal range. So 'danbri
age 41' (more realistically, 2, 3 or 4 triples expanding on that
properly)  might be bracketed within a year-long ISO-8601-based
datetime range.
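
Purely as a sketch of that idea in TriG: the ex:validFrom and
ex:validUntil property names are invented here (nothing schema.org
has defined), and the dates are placeholders for whichever year-long
range actually applies.

```trig
# Illustrative only: property names and dates are assumptions.
PREFIX ex: <http://example.org/>
PREFIX xs: <http://www.w3.org/2001/XMLSchema#>

# The packet of triples being qualified.
GRAPH ex:g_age { ex:danbri ex:age 41 }

# The temporal range decorates the whole graph, not any single triple.
ex:g_age ex:validFrom  "2013-01-01T00:00:00Z"^^xs:dateTime ;
         ex:validUntil "2014-01-01T00:00:00Z"^^xs:dateTime .
```

The point of hanging the range off the graph name is that the triples
inside stay plain, with no reification of the 'age' statement itself.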

Dan

> So, basically, I have to challenge you to come up with a counter proposal.
> I'm not attached to any particular design, as you can probably tell because
> of how my proposals keep changing.   I just want a design that solves the
> problems current and future users have in maintaining separate streams of
> RDF data flowing through systems.    cf 1-4 on
> http://www.w3.org/2011/rdf-wg/wiki/Why_Graphs     In the long example below
> I pretty much showed how to do that.   (I left out UC3 for now.)  I have no
> idea how you can possibly do that, in a way which is mentally in reach of
> current and future SPARQL users, using the "naming model."  Please show me
> how.
>
> Thanks.
>
>          -- Sandro
>
>
>
>
>
> On 09/16/2013 01:47 AM, Pat Hayes wrote:
>>
>> On Sep 15, 2013, at 9:48 AM, Sandro Hawke wrote:
>>
>>> Here's what I think we need to define to make Jeremy and many other
>>> people happy.   Obviously this is not the final draft of a spec, but
>>> hopefully it conveys the idea clearly enough.    If you read this, please
>>> say whether you see any seriously technical problems with it and/or would be
>>> happy with it going out as a WG Note.   Actually, the idea is so simple and
>>> so well-known, even if not formalized or named before, that maybe it's not
>>> out of the question to put in on the Rec Track -- but obviously not if it
>>> endangers anything else.
>>
>> Jeremy himself must be the one to say what makes Jeremy happy, but this is
>> *not* a proposal to have named graphs in datasets be what Jeremy and I (and
>> others) once called named graphs. Which is a pity, in my opinion. This
>> proposal has two parts, getting them muddled up with one another, and I
>> would like to keep them more separated.
>>
>> One idea is to provide a way to state that graph names in certain datasets
>> do indeed refer to the graph they label. Let me call this the naming idea.
>>
>> The other idea is to treat the graphs in a dataset not as graphs, but as
>> graph boxes containing a graph as their current state, but (presumably) able
>> to be changed by future operations. Let me call this the box idea.
>>
>> One can take either of these ideas independently from the other; they have
>> no particular relationship. But the box idea is clearly at odds with the
>> current definition of dataset in RDF and in SPARQL, so represents a much
>> more drastic change than the naming idea. The box idea seems to me to be
>> highly disruptive to put into a WG note, since it seems to suggest that
>> datasets are labile things with a state, which is exactly what we decided to
>> not have them be. (I know that technically it does not actually do this, but
>> it sure *seems* to on first, in fact in my case on the first three,
>> readings.) And I don't see any reason to introduce this box idea: we don't
>> need it here (since in order for the proposal to make sense, the boxes must
>> be fixed and not allowed to change.)
>>
>> Other comments in-line below.
>>
>>>        -- Sandro
>>>
>>> == Introduction
>>>
>>> A "box dataset" is a kind of RDF Dataset which adheres to certain
>>> semantic conditions.    These conditions are likely to be intuitive for a
>>> large set of RDF users, but they are not universally held, so some RDF
>>> Datasets are not box datasets.    Some readers may find this document
>>> challenging because they have never seriously considered the possibility of
>>> any other kind of dataset, so the properties of box datasets will seem
>>> utterly obvious.  The fact that a dataset is a box dataset may be conveyed
>>> using the rdf:BoxDataset class name or via some non-standard and/or
>>> out-of-band mechanism.
>>>
>>> A box dataset is defined to be any RDF Dataset in which the graph names
>>> each denote some resource (sometimes called a "g-box") which "contains"
>>> exactly those triples which comprise the RDF Graph which is paired with that
>>> name in that dataset.
>>
>> Contains at what time, and under what circumstances? Does the containment
>> refer to the time of publication of the dataset or the time it is read and
>> used? Can this containment change with time? If so, how can users know what
>> is the g-box when the dataset  is accessed? If not – if the g-box is 'fixed'
>> – what is the point of introducing the g-box into the discussion in the
>> first place? Why not just say that the graph name refers to the graph?
>>
>>>    That is, this dataset:
>>>
>>>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>   PREFIX : <http://example.org/>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :a :b :c }
>>>
>>> tells us that the resource denoted by <http://example.org/g1> contains
>>> exactly one RDF triple and what that triple is.
>>>
>>> It contradicts this dataset:
>>
>> If we are going to use words like "contradict" then we really have to give
>> a semantics for this. Which would not, of course, be hard to do.
>>
>>>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>   PREFIX : <http://example.org/>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :a :b :d }
>>>
>>> since they disagree about what the contained triple is.
>>
>> But if :g1 is a box, cannot they both (have been) true but at different
>> times? Maybe :g1 started with the first triple but later got changed to
>> include the second triple instead, eg by a SPARQL update operation.
>>
>>> These two datasets also contradict each other (given the same PREFIX
>>> declarations as above):
>>>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :a :b 1.0 }
>>>
>>> and
>>>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :a :b 1.00 }
>>>
>>> Even though "1.0"^^xs:double and "1.00"^^xs:double denote the same thing,
>>> they are not the same RDF term, so the triple { :a :b 1.0 } is not the same
>>> triple as { :a :b 1.00 }.  Since they are not the same triple, the datasets
>>> which say they are each what is contained by :g1 cannot both be true.   (See
>>> "Literal Term Equality" in RDF 1.1 Concepts.)
>>>
>>> == Contains
>>>
>>> This notion of "contains" is not formally defined but is reflected in the
>>> documentation of properties and classes used with Box Datasets.  It is
>>> essentially the same notion as people use when they say a web page
>>> "contains" some statements or a file "contains" some graphic image.    More
>>> broadly, the web can be thought of as "content" which is "contained" in web
>>> pages.
>>
>> And this common notion implies that pages and files have a state, ie their
>> content can change without their identity changing. Do you want g-boxes to
>> have this labile quality also?
>>
>>> Given this pre-existing notion of "contains", it follows that
>>> pre-existing properties and classes can be used with Box Datasets with
>>> reasonable confidence they will be correctly understood.
>>
>> Um, bullshit? Especially if people use RDF to describe them. Utter
>> confusion will reign, and become set in many forms of concrete.
>>
>>> For example, given this dataset:
>>>
>>>    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>    PREFIX : <http://example.org/>
>>>    PREFIX dc: <http://purl.org/dc/terms/>
>>>    PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>
>>>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
>>>   :g1 dc:creator :inspector1204;
>>>         dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.
>>>
>>> if we read the documentation for dc:creator and dc:date, and if necessary
>>> consult the long history of how these terms have been used with web pages
>>> and computer files which "contain" various statements, it becomes clear that
>>> this dataset is telling us the given statement using the :toxicityLevel
>>> property was made by the given entity ("inspector1204") at the given time.
>>> If we did not know this was a box dataset, we would not have any defined
>>> connection between :g1 and toxicityLevel triple.   We would know something
>>> was created by that inspector at that time, but its association with that
>>> triple would be undefined.
>>
>> Right, but that just needs the naming idea, not the box idea.
>>
>>> == Dereference
>>>
>>> While it would be out of scope for this specification to constrain or
>>> formally characterize what HTTP URIs denote, existing practice with metadata
>>> on web pages strongly suggests that when referencing a URL returns RDF
>>> triples, it is reasonable to think of that URL as denoting something which
>>> contains those triples.
>>
>> I don't think this is at all reasonable, in practice. In fact, the
>> emerging consensus seems to be more like that what you get when you
>> reference a URI is some kind of representation or description of what it is
>> that the URI refers to. Or maybe just "more information about" that thing.
>> But that does not presume that the thing being described is a container of
>> the description. ESPECIALLY when we are dealing with RDF.
>>
>>>   This means this dataset:
>>>
>>>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>   PREFIX : <http://example.org/>
>>>   <> a rdf:BoxDataset.
>>>   GRAPH :g1 { :a :b :c }
>>>
>>> can reasonably be assumed to be saying that dereferencing the URL
>>> "http://example.org/g1" provides the RDF triple { <http://example.org/a>
>>> <http://example.org/b> <http://example.org/c> }.
>>
>> True, but this would also be the case if :g1 were understood as denoting
>> the actual graph, and what you got when you reference it is a
>> (awww-)representation of the graph, ie some bytes in a recognized RDF
>> surface syntax which parse to that graph.
>>
>>>    It can further be assumed that no other RDF triples are returned.
>>> There is no implication about whether other (non-RDF) content might be
>>> returned.
>>>
>>> Of course, web content can vary over time and per-client, and the content
>>> isn't always available, due to access control, network failures, etc.   The
>>> idea here is that those circumstances where the semantic constraints of the
>>> dataset are met are the same circumstances under which that URL would
>>> provide the given RDF content, if one were able to access it.
>>
>> Again, an idealization which is often applied to the Web in general.
>>
>>>     That is, the dataset is only "true" if and when that URL is backed by
>>> that RDF content.
>>
>> No, it is true when the names refer correctly. You only KNOW it is true
>> when the Web is working correctly so you can get your hands on the relevant
>> information, but that is a separate issue. If I read a notice which uses a
>> word I don't understand, then my ignorance does not make the notice false.
>> What changes when I discover what the word means is my state of
>> understanding, not the truth of the notice.
>>
>>>   If the dataset is always true everywhere (which is the somewhat-naive
>>> standard reading of RDF) then that URL always has that RDF content. More
>>> nuanced notions of context, including change over time and different
>>> perspectives for different users remain as future work.
>>
>> You won't get away with this. If you insist that these graphs are boxes,
>> and appeal to "normal" meanings, then some people will assume they are
>> labile and their state can change, some people will also assume that they
>> are always about the present, while others will assume that they are really
>> graphs all the time. And all these assumings will be implicit in deployed
>> RDF, adding to the babel of confusion that we already have.
>>
>>> == Web Crawler Example
>>>
>>> As a more complete example, consider the case of a system which crawls
>>> the web looking for RDF content.  It might store everything it has gathered
>>> during its repeated crawling in a box dataset.  It might also do some
>>> canonicalization (think of 1.0 and 1.00 in the introduction) and some
>>> inference, and  store the output of that processing in the dataset.   Then
>>> it can make the entire dataset available to SPARQL Queries, since SPARQL is
>>> defined as querying an RDF Dataset, and it can make it available for
>>> download in one or more dataset syntaxes like TriG, JSON-LD, and N-Quads.
>>>
>>> For this example, we'll assume the crawler is only looking at one site,
>>> http://stocks.example.com, and that site publishes RDF with stock closing
>>> prices each day at URLs like http://stocks.example.com/data/orcl (for Oracle
>>> Corporation, whose ticker symbol is "orcl").   Oracle was selected at random
>>> for this example from among the publicly traded companies actively
>>> participating in the RDF Working Group, namely Oracle, IBM, and Google.
>>>
>>> We'll use the following PREFIXes:
>>>
>>>   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>   PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>   PREFIX feed: <http://stocks.example.com/data/>    # data feeds from stocks.example.com
>>>   PREFIX stock: <http://stocks.example.com/vocab#>     # stock terminology, and IRIs for public companies
>>>   PREFIX crawl: <http://crawl.example.org/ns/>          # crawler terminology
>>>   PREFIX snap: <http://crawl.example.org/snapshots/>   # where the crawler publishes individual snapshots
>>>   PREFIX dc: <http://purl.org/dc/terms/>
>>>   PREFIX xs:  <http://www.w3.org/2001/XMLSchema#>
>>>
>>> === Latest Content
>>>
>>> The latest content might be stored in named graphs with the name being the
>>> dereference URL, like this:
>>>
>>>   GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400
>>> }
>>>   feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>>>                  crawl:lastModified
>>> "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
>>>   GRAPH feed:goog { ... }
>>>   feed:goog crawl:fetchedAt ....
>>>   GRAPH feed:ibm { ... }
>>>   feed:ibm crawl:fetchedAt ...
>>>   ...
>>>
>>> In this example, stocks.example.com has chosen to make the daily
>>> information available at one URL (http://stocks.example.com/data/orcl) while
>>> the stable, long term information about every company is available at
>>> another (http://stocks.example.com/vocab).    When the crawler visits that
>>> second document, it will add this to the dataset:
>>>
>>>   GRAPH <http://stocks.example.com/vocab> {
>>>     stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
>>>         rdfs:label "Oracle Corporation";
>>>         stock:ticker "orcl".
>>>      ...
>>>      stock:ticker a rdf:Property;
>>>          rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
>>>   }
>>>   <http://stocks.example.com/vocab> crawl:fetchedAt
>>> "2013-09-15T16:00:02Z"^^xs:dateTimeStamp.
>>>
>>> === Older Content
>>>
>>> The older content will need to be stored with different graph names to
>>> avoid colliding with the latest content.
>>
>> Does the latest content use the same URI now for todays information that
>> it used yesterday for yesterday's information? If so, what kind of entity is
>> this named graph?
>>
>>>   Here it would be reasonable to use blank nodes as the graph names, if
>>> the crawler does not want to serve linked data, like this:
>>>
>>>   GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume
>>> 16250100 }
>>>   _:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                              crawl:fetchedAt
>>> "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                              crawl:lastModified
>>> "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>   GRAPH _:goog_20130912 { ... }
>>>    _:goog_20130912 crawl:fetchedFrom feed:goog ....
>>>    ...
>>>
>>> Alternatively, if the crawler is willing to provide linked data, it can
>>> create URLs for the snapshots it will be re-publishing:
>>>
>>>   GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume
>>> 16250100 }
>>>   snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                                    crawl:fetchedAt
>>> "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                                    crawl:lastModified
>>> "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>   GRAPH snap:goog_20130912 { ... }
>>>    snap:goog_20130912 crawl:fetchedFrom feed:goog ....
>>>    ...
>>>
>>> Following best practice with linked data, the crawler should only use
>>> snapshot URLs like this if there is a web server answering at those URLs with
>>> suitable content.   Because the crawler is using a box dataset, the suitable
>>> content would have to be the RDF graph associated with that URL in this
>>> dataset.   Note that the metadata (like the crawl:fetchedAt information)
>>> MUST NOT be embedded in that content since it's not inside the named graph
>>> in the dataset above.
>>
>> But it could have been, right? So this is a design decision rather than an
>> imperative. (Or am I not following something? I find examples like this more
>> confusing than helpful when I don't know what exactly they are supposed to
>> illustrate.)
>>
>>> Instead, if the metadata were to be offered, it would have to be offered
>>> via another resource.  The HTTP Link header can be used to provide a link to
>>> it, like this:
>>>
>>>> GET /snapshots/orcl_20130912 HTTP/1.1
>>>> Host: crawl.example.org
>>>> Accept: text/turtle; charset=utf-8
>>>
>>> < HTTP/1.1 200 OK
>>> < Server: nginx/1.2.1
>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>> < Content-Type: text/turtle; charset=utf-8
>>> < Link: </snapshots/orcl_20130912_meta>; rel="meta"
>>> ( ... prefixes ... )
>>> stock:orcl stock:closing 32.79; stock:volume  16250100.
>>>
>>> and
>>>
>>>> GET /snapshots/orcl_20130912_meta HTTP/1.1
>>>> Host: crawl.example.org
>>>> Accept: text/turtle; charset=utf-8
>>>
>>> < HTTP/1.1 200 OK
>>> < Server: nginx/1.2.1
>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>> < Content-Type: text/turtle; charset=utf-8
>>> ( ... prefixes ... )
>>> snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                                    crawl:fetchedAt
>>> "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                                    crawl:lastModified
>>> "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>
>>> === Derived Content
>>>
>>> It may be useful to have the crawler do some processing on the RDF
>>> content it fetches and then share the results of that processing. For
>>> example, it might gather all the ontologies linked from fetched content, do
>>> some RDFS or OWL reasoning with the results, and then include some/all of
>>> the resulting entailments in additional graphs in the dataset.
>>>
>>> For example, perhaps stock:closing used to be called
>>> stock:closingSharePrice.  To enable older clients to still read the data,
>>> stocks.example.com might include in the stock: ontology the triple {
>>> stock:closing owl:equivalentProperty stock:closingSharePrice }.   (This
>>> would require older clients to be doing some OWL reasoning, of course, which
>>> might or might not be a realistic assumption depending on their user base.)
>>>
>>> On seeing this equivalentProperty declaration, and doing some inference,
>>> the crawler might add this to the dataset:
>>>
>>>   GRAPH snap:orcl_20130912_inferred { stock:orcl stock:closingSharePrice
>>> 32.79 }
>>>   snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
>>>
>>> Alternatively, the crawler might use the prov ontology to be more
>>> explicit about how the inference was made.
>>>
>>> As a related kind of derived content, the harvester might produce a
>>> variation on the fetched graph where the non-canonical literals (like 1.00)
>>> are replaced with their canonical equivalents (like 1.0).   It's not clear
>>> how valuable this would be, however, since many downstream systems (like all
>>> [?most?] SPARQL systems) will mask this difference.
>>
>> As far as I can tell, and I might have missed something, all of this can
>> be done under the assumption that the graph name is actually the name of the
>> *graph*, not of a box containing the graph. As that way of expressing all
>> this is (1) simpler (2) more in line with both the history of graph naming
>> and the current normative definition of a dataset and (3) less liable to be
>> misinterpreted as allowing labile "graphs" in datasets, I would prefer to
>> avoid the "box" terminology and just have something like this which requires
>> graph names to denote the actual graphs, ie the naming convention without
>> the box convention.
>>
>> Pat
>>
>>> ===============
>>>
>>> That's it for now.    Awaiting feedback.
>>>
>>>         -- Sandro
>>>
>>>
>>>
>> ------------------------------------------------------------
>> IHMC                                     (850)434 8903 home
>> 40 South Alcaniz St.            (850)202 4416   office
>> Pensacola                            (850)202 4440   fax
>> FL 32502                              (850)291 0667   mobile (preferred)
>> phayes@ihmc.us       http://www.ihmc.us/users/phayes

Received on Monday, 16 September 2013 12:24:01 UTC