- From: Dan Brickley <danbri@google.com>
- Date: Mon, 16 Sep 2013 13:23:32 +0100
- To: Sandro Hawke <sandro@w3.org>
- Cc: Pat Hayes <phayes@ihmc.us>, RDF WG <public-rdf-wg@w3.org>, Jeremy Carroll <jjc@syapse.com>
On 16 September 2013 12:52, Sandro Hawke <sandro@w3.org> wrote:

> [ I didn't think it appropriate to CC Jeremy, since obviously we don't have consensus yet, and he can't reply to this mailing list unless he joins the WG. ]

I see JJC cc:'d so I'll leave him in place here. Hi Jeremy!

> Key points would be:
>
> - For many years I thought about graphs as N3 does, just using what might be called graph literals (RDF terms which are syntactic expressions which denote RDF graphs or RDF graph patterns). It's been a long journey for me, letting go of feeling that was the obviously-right way to handle this stuff, to today, where I think the box model will work better for folks.

The box model isn't grabbing me, I'm afraid. I tend to see de-referencing some URL as a lucky dip: each time you get a potentially different representation of the otherwise unknowable entity whose URI you're GET'ing.

(Aside re N3: that's funny. The box model reminded me of TimBL's old 'log:semantics' idealization, which, if I recall correctly, suggested cartoonishly that there's only ever one set of triples that is 'the semantics' of some URI. Well, maybe - I'm not sure log:semantics was declared as an owl:FunctionalProperty, but that seemed to be the intention.)

Perhaps it's unfair for a rejoining / absentee participant to say this, but anyway: I am rather unsettled to find the WG re-treading the same territory it was passionately discussing when I was last in the group, and to be doing so without any agreed motivating scenario around which different formal models might be compared. I wrote up 'Dilbert schematics' a couple of years back (http://danbri.org/words/2011/11/03/753), comparing simple 'hasCubicle' assertions (which would need time-qualifying) with 'cubicle-occupation' scenarios. I really don't care what example we use, but I suggest that

    GRAPH :g1 { :a :b :c }

... is just too abstract to be a useful focal point for building consensus. Proposals should plausibly express at least one real-world-tinged example, even if (like the Dilbert one) it is still a simplification. Other examples to consider might be descriptions of scholarly or cultural heritage material (the former might include volatile citation count data; the latter might include educational events and talks), or TV/movie data (movies have volatile ratings; TV listings data often gets more precise post-transmission, once last-minute guest list changes are clarified).

Change is not something that can be dealt with later as icing on the cake; it goes to the heart of why people want more clarity around named graphs and their metadata. If you give me a standards-track story about these kinds of (change-riddled) descriptive scenarios, I can probably work out whether 'boxes' help with managing RDF; if you give me a standards-track story about 'a', 'b' and 'c', I'm rather more at a loss.
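To make the Dilbert comparison above a little less hand-wavy, here is a rough sketch of the two modelling styles, with entirely made-up terms (the ex: names and the dates are illustration only, not a vocabulary proposal):

    PREFIX ex: <http://example.org/office#>
    PREFIX xs: <http://www.w3.org/2001/XMLSchema#>

    # Style 1: a bare assertion, with nowhere to hang the dates.
    ex:dilbert ex:hasCubicle ex:cubicle42 .

    # Style 2: an explicit cubicle-occupation description (lower-case-r
    # reification), which can carry its own start and end dates.
    ex:occupation1 a ex:CubicleOccupation ;
        ex:occupant ex:dilbert ;
        ex:cubicle  ex:cubicle42 ;
        ex:from  "2011-01-03"^^xs:date ;
        ex:until "2011-11-01"^^xs:date .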
> - I believe this box model is, in fact, how pretty much all SPARQL users think of this stuff. I hear them talk a lot about using SPARQL, and they're always talking about putting things (triples) into a particular graph, deleting them from that graph, checking if they're in that graph, ... That all fits the box model. Very few have any idea there's a static "dataset" in the model, and even those who know that full well still talk about changing "graphs".

Feels like there's a map/territory distinction going funny here. What I do in the privacy of my SPARQL database is one matter; what triples are somehow contained within the things associated with the URIs I use to name graphs is quite another.

> - Yes, the change-over-time thing is an issue here, but it's absolutely an issue in the rest of RDF, and it's no different here. So (as I mentioned to danbri) this is something the RDF community will have to address. Note that Google has this problem now, full force, as the Google Knowledge Graph (which powers more and more of Search, as well as other products) is getting its triples in both the Freebase vocabulary (which models things as you would, as statements which are always true, although of course they can still change) and the Schema.org vocabulary (which models things just as they are right now, since it's trying to match how current natural language web pages say things, and that's how they usually do it).

(Since you ask, ...) At schema.org we are interested in modeling things (actions/events) that have not yet happened, such as potential actions (e.g. SandroDanDinnerAtISWC2013Event). It may never happen, or it may come about multiple times, depending on whether Sandro and I are both at that conference, whether we meet up, etc. This schema.org concern has got us talking about when to try to squeeze everything into a triples model (which often forces a kind of lower-case-r reification), versus when to stand back and talk about packets of triples, aka (named) graphs.

Guha and I have lately been looking at whether an entire graph could be decorated with - for starters - a temporal range. So 'danbri age 41' (more realistically, 2, 3 or 4 triples expanding on that properly) might be bracketed within a year-long ISO-8601-based datetime range. (There is a rough TriG sketch of that kind of bracketing in the PS at the end of this message.)

Dan

> So, basically, I have to challenge you to come up with a counter proposal. I'm not attached to any particular design, as you can probably tell from how my proposals keep changing. I just want a design that solves the problems current and future users have in maintaining separate streams of RDF data flowing through systems; cf. 1-4 on http://www.w3.org/2011/rdf-wg/wiki/Why_Graphs
> In the long example below I pretty much showed how to do that. (I left out UC3 for now.) I have no idea how you can possibly do that, in a way which is mentally in reach of current and future SPARQL users, using the "naming model." Please show me how.
>
> Thanks.
>
> -- Sandro
>
> On 09/16/2013 01:47 AM, Pat Hayes wrote:
>>
>> On Sep 15, 2013, at 9:48 AM, Sandro Hawke wrote:
>>
>>> Here's what I think we need to define to make Jeremy and many other people happy. Obviously this is not the final draft of a spec, but hopefully it conveys the idea clearly enough. If you read this, please say whether you see any serious technical problems with it and/or would be happy with it going out as a WG Note. Actually, the idea is so simple and so well known, even if not formalized or named before, that maybe it's not out of the question to put it on the Rec Track -- but obviously not if it endangers anything else.
>>
>> Jeremy himself must be the one to say what makes Jeremy happy, but this is *not* a proposal to have named graphs in datasets be what Jeremy and I (and others) once called named graphs. Which is a pity, in my opinion. This proposal has two parts, which get muddled up with one another, and I would like to keep them more separated.
>>
>> One idea is to provide a way to state that graph names in certain datasets do indeed refer to the graph they label. Let me call this the naming idea.
>>
>> The other idea is to treat the graphs in a dataset not as graphs, but as graph boxes containing a graph as their current state, but (presumably) able to be changed by future operations. Let me call this the box idea.
>>
>> One can take either of these ideas independently of the other; they have no particular relationship. But the box idea is clearly at odds with the current definition of dataset in RDF and in SPARQL, so it represents a much more drastic change than the naming idea. The box idea seems to me to be highly disruptive to put into a WG Note, since it seems to suggest that datasets are labile things with a state, which is exactly what we decided not to have them be. (I know that technically it does not actually do this, but it sure *seems* to on first, in fact in my case on the first three, readings.) And I don't see any reason to introduce this box idea: we don't need it here (since in order for the proposal to make sense, the boxes must be fixed and not allowed to change).
>>
>> Other comments in-line below.
>>
>>> -- Sandro
>>>
>>> == Introduction
>>>
>>> A "box dataset" is a kind of RDF Dataset which adheres to certain semantic conditions. These conditions are likely to be intuitive for a large set of RDF users, but they are not universally held, so some RDF Datasets are not box datasets. Some readers may find this document challenging because they have never seriously considered the possibility of any other kind of dataset, so the properties of box datasets will seem utterly obvious. The fact that a dataset is a box dataset may be conveyed using the rdf:BoxDataset class name or via some non-standard and/or out-of-band mechanism.
>>>
>>> A box dataset is defined to be any RDF Dataset in which the graph names each denote some resource (sometimes called a "g-box") which "contains" exactly those triples which comprise the RDF Graph that is paired with that name in that dataset.
>>
>> Contains at what time, and under what circumstances? Does the containment refer to the time of publication of the dataset or the time it is read and used? Can this containment change with time? If so, how can users know what is in the g-box when the dataset is accessed? If not – if the g-box is 'fixed' – what is the point of introducing the g-box into the discussion in the first place? Why not just say that the graph name refers to the graph?
>>
>>> That is, this dataset:
>>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX : <http://example.org/>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :a :b :c }
>>>
>>> tells us that the resource denoted by <http://example.org/g1> contains exactly one RDF triple and what that triple is.
>>>
>>> It contradicts this dataset:
>>
>> If we are going to use words like "contradict" then we really have to give a semantics for this. Which would not, of course, be hard to do.
>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX : <http://example.org/>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :a :b :d }
>>>
>>> since they disagree about what the contained triple is.
>>
>> But if :g1 is a box, cannot they both (have been) true but at different times?
>> Maybe :g1 started with the first triple but later got changed to include the second triple instead, eg by a SPARQL update operation.
>>
>>> These two datasets also contradict each other (given the same PREFIX declarations as above):
>>>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :a :b 1.0 }
>>>
>>> and
>>>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :a :b 1.00 }
>>>
>>> Even though "1.0"^^xs:decimal and "1.00"^^xs:decimal denote the same value, they are not the same RDF term, so the triple { :a :b 1.0 } is not the same triple as { :a :b 1.00 }. Since they are not the same triple, the two datasets, each of which says exactly what is contained by :g1, cannot both be true. (See "Literal Term Equality" in RDF 1.1 Concepts.)
>>>
>>> == Contains
>>>
>>> This notion of "contains" is not formally defined but is reflected in the documentation of properties and classes used with Box Datasets. It is essentially the same notion as people use when they say a web page "contains" some statements or a file "contains" some graphic image. More broadly, the web can be thought of as "content" which is "contained" in web pages.
>>
>> And this common notion implies that pages and files have a state, ie their content can change without their identity changing. Do you want g-boxes to have this labile quality also?
>>
>>> Given this pre-existing notion of "contains", it follows that pre-existing properties and classes can be used with Box Datasets with reasonable confidence that they will be correctly understood.
>>
>> Um, bullshit? Especially if people use RDF to describe them. Utter confusion will reign, and become set in many forms of concrete.
>>
>>> For example, given this dataset:
>>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX : <http://example.org/>
>>> PREFIX dc: <http://purl.org/dc/terms/>
>>> PREFIX xs: <http://www.w3.org/2001/XMLSchema#>
>>>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :site17 :toxicityLevel 0.0034 }
>>> :g1 dc:creator :inspector1204;
>>>     dc:date "2013-07-03T09:51:02Z"^^xs:dateTimeStamp.
>>>
>>> if we read the documentation for dc:creator and dc:date, and if necessary consult the long history of how these terms have been used with web pages and computer files which "contain" various statements, it becomes clear that this dataset is telling us that the given statement using the :toxicityLevel property was made by the given entity ("inspector1204") at the given time. If we did not know this was a box dataset, we would not have any defined connection between :g1 and the :toxicityLevel triple. We would know something was created by that inspector at that time, but its association with that triple would be undefined.
>>
>> Right, but that just needs the naming idea, not the box idea.
>>
>>> == Dereference
>>>
>>> While it would be out of scope for this specification to constrain or formally characterize what HTTP URIs denote, existing practice with metadata on web pages strongly suggests that when dereferencing a URL returns RDF triples, it is reasonable to think of that URL as denoting something which contains those triples.
>>
>> I don't think this is at all reasonable, in practice. In fact, the emerging consensus seems to be more that what you get when you reference a URI is some kind of representation or description of what it is that the URI refers to. Or maybe just "more information about" that thing.
>> But that does not presume that the thing being described is a container of the description. ESPECIALLY when we are dealing with RDF.
>>
>>> This means this dataset:
>>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX : <http://example.org/>
>>> <> a rdf:BoxDataset.
>>> GRAPH :g1 { :a :b :c }
>>>
>>> can reasonably be assumed to be saying that dereferencing the URL "http://example.org/g1" provides the RDF triple { <http://example.org/a> <http://example.org/b> <http://example.org/c> }.
>>
>> True, but this would also be the case if :g1 were understood as denoting the actual graph, and what you got when you reference it is an (awww-)representation of the graph, ie some bytes in a recognized RDF surface syntax which parse to that graph.
>>
>>> It can further be assumed that no other RDF triples are returned. There is no implication about whether other (non-RDF) content might be returned.
>>>
>>> Of course, web content can vary over time and per-client, and the content isn't always available, due to access control, network failures, etc. The idea here is that the circumstances in which the semantic constraints of the dataset are met are the same circumstances under which that URL would provide the given RDF content, if one were able to access it.
>>
>> Again, an idealization which is often applied to the Web in general.
>>
>>> That is, the dataset is only "true" if and when that URL is backed by that RDF content.
>>
>> No, it is true when the names refer correctly. You only KNOW it is true when the Web is working correctly, so that you can get your hands on the relevant information, but that is a separate issue. If I read a notice which uses a word I don't understand, then my ignorance does not make the notice false. What changes when I discover what the word means is my state of understanding, not the truth of the notice.
>>
>>> If the dataset is always true everywhere (which is the somewhat-naive standard reading of RDF) then that URL always has that RDF content. More nuanced notions of context, including change over time and different perspectives for different users, remain as future work.
>>
>> You won't get away with this. If you insist that these graphs are boxes, and appeal to "normal" meanings, then some people will assume they are labile and their state can change, some people will also assume that they are always about the present, while others will assume that they are really graphs all the time. And all these assumings will be implicit in deployed RDF, adding to the babel of confusion that we already have.
>>
>>> == Web Crawler Example
>>>
>>> As a more complete example, consider the case of a system which crawls the web looking for RDF content. It might store everything it has gathered during its repeated crawling in a box dataset. It might also do some canonicalization (think of 1.0 and 1.00 in the introduction) and some inference, and store the output of that processing in the dataset. Then it can make the entire dataset available to SPARQL queries, since SPARQL is defined as querying an RDF Dataset, and it can make it available for download in one or more dataset syntaxes like TriG, JSON-LD, and N-Quads.
>>>
>>> For this example, we'll assume the crawler is only looking at one site, http://stocks.example.com, and that site publishes RDF with stock closing prices each day at URLs like http://stocks.example.com/data/orcl (for Oracle Corporation, whose ticker symbol is "orcl"). Oracle was selected at random for this example from among the publicly traded companies actively participating in the RDF Working Group, namely Oracle, IBM, and Google.
>>>
>>> We'll use the following PREFIXes:
>>>
>>> PREFIX rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>> PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
>>> PREFIX feed:  <http://stocks.example.com/data/>       # data feeds from stocks.example.com
>>> PREFIX stock: <http://stocks.example.com/vocab#>      # stock terminology, and IRIs for public companies
>>> PREFIX crawl: <http://crawl.example.org/ns/>          # crawler terminology
>>> PREFIX snap:  <http://crawl.example.org/snapshots/>   # where the crawler publishes individual snapshots
>>> PREFIX dc:    <http://purl.org/dc/terms/>
>>> PREFIX xs:    <http://www.w3.org/2001/XMLSchema#>
>>>
>>> === Latest Content
>>>
>>> The latest content might be stored in named graphs with the name being the dereference URL, like this:
>>>
>>> GRAPH feed:orcl { stock:orcl stock:closing 32.46; stock:volume 17655400 }
>>> feed:orcl crawl:fetchedAt "2013-09-15T14:57:02Z"^^xs:dateTimeStamp;
>>>           crawl:lastModified "2013-09-13T22:01:14Z"^^xs:dateTimeStamp.
>>> GRAPH feed:goog { ... }
>>> feed:goog crawl:fetchedAt ...
>>> GRAPH feed:ibm { ... }
>>> feed:ibm crawl:fetchedAt ...
>>> ...
>>>
>>> In this example, stocks.example.com has chosen to make the daily information available at one URL (http://stocks.example.com/data/orcl) while the stable, long-term information about each company is available at another (http://stocks.example.com/vocab). When the crawler visits that second document, it will add this to the dataset:
>>>
>>> GRAPH <http://stocks.example.com/vocab> {
>>>   stock:orcl a stock:PublicCompany, stock:TechSectorCompany;
>>>     rdfs:label "Oracle Corporation";
>>>     stock:ticker "orcl".
>>>   ...
>>>   stock:ticker a rdf:Property;
>>>     rdfs:comment "The standard ticker symbol (a short string) which unambiguously identifies this company".
>>> }
>>> <http://stocks.example.com/vocab> crawl:fetchedAt "2013-09-15T16:00:02Z"^^xs:dateTimeStamp.
>>>
>>> === Older Content
>>>
>>> The older content will need to be stored with different graph names to avoid colliding with the latest content.
>>
>> Does the latest content use the same URI now for today's information that it used yesterday for yesterday's information? If so, what kind of entity is this named graph?
>>
>>> Here it would be reasonable to use blank nodes as the graph names, if the crawler does not want to serve linked data, like this:
>>>
>>> GRAPH _:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
>>> _:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                 crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                 crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>> GRAPH _:goog_20130912 { ... }
>>> _:goog_20130912 crawl:fetchedFrom feed:goog ...
>>> ...
>>>
>>> Alternatively, if the crawler is willing to provide linked data, it can create URLs for the snapshots it will be re-publishing:
>>>
>>> GRAPH snap:orcl_20130912 { stock:orcl stock:closing 32.79; stock:volume 16250100 }
>>> snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                    crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                    crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>> GRAPH snap:goog_20130912 { ... }
>>> snap:goog_20130912 crawl:fetchedFrom feed:goog ...
>>> ...
>>>
>>> Following best practice with linked data, the crawler should only use snapshot URLs like this if there is a web server answering at those URLs with suitable content. Because the crawler is using a box dataset, the suitable content would have to be the RDF graph associated with that URL in this dataset. Note that the metadata (like the crawl:fetchedAt information) MUST NOT be embedded in that content, since it is not inside the named graph in the dataset above.
>>
>> But it could have been, right? So this is a design decision rather than an imperative. (Or am I not following something? I find examples like this more confusing than helpful when I don't know exactly what they are supposed to illustrate.)
>>
>>> Instead, if the metadata were to be offered, it would have to be offered via another resource. The HTTP Link header can be used to provide a link to it, like this:
>>>
>>>> GET /snapshots/orcl_20130912 HTTP/1.1
>>>> Host: crawl.example.org
>>>> Accept: text/turtle; charset=utf-8
>>>
>>> < HTTP/1.1 200 OK
>>> < Server: nginx/1.2.1
>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>> < Content-Type: text/turtle; charset=utf-8
>>> < Link: </snapshots/orcl_20130912_meta>; rel="meta"
>>>
>>> ( ... prefixes ... )
>>> stock:orcl stock:closing 32.79; stock:volume 16250100.
>>>
>>> and
>>>
>>>> GET /snapshots/orcl_20130912_meta HTTP/1.1
>>>> Host: crawl.example.org
>>>> Accept: text/turtle; charset=utf-8
>>>
>>> < HTTP/1.1 200 OK
>>> < Server: nginx/1.2.1
>>> < Date: Sun, 15 Sep 2013 15:28:38 GMT
>>> < Content-Type: text/turtle; charset=utf-8
>>>
>>> ( ... prefixes ... )
>>> snap:orcl_20130912 crawl:fetchedFrom feed:orcl;
>>>                    crawl:fetchedAt "2013-09-12T23:51:02Z"^^xs:dateTimeStamp;
>>>                    crawl:lastModified "2013-09-12T22:01:16Z"^^xs:dateTimeStamp.
>>>
>>> === Derived Content
>>>
>>> It may be useful to have the crawler do some processing on the RDF content it fetches and then share the results of that processing. For example, it might gather all the ontologies linked from fetched content, do some RDFS or OWL reasoning with the results, and then include some or all of the resulting entailments in additional graphs in the dataset.
>>>
>>> For example, perhaps stock:closing used to be called stock:closingSharePrice. To enable older clients to still read the data, stocks.example.com might include in the stock: ontology the triple { stock:closing owl:equivalentProperty stock:closingSharePrice }. (This would require older clients to be doing some OWL reasoning, of course, which might or might not be a realistic assumption depending on their user base.)
>>>
>>> On seeing this equivalentProperty declaration, and doing some inference, the crawler might add this to the dataset:
>>>
>>> GRAPH snap:orcl_20130912_inferred { stock:orcl stock:closingSharePrice 32.79 }
>>> snap:orcl_20130912_inferred crawl:inferredFrom snap:orcl_20130912.
>>>
>>> Alternatively, the crawler might use the PROV ontology to be more explicit about how the inference was made.
>>>
>>> As a related kind of derived content, the harvester might produce a variation on the fetched graph in which the non-canonical literals (like 1.00) are replaced with their canonical equivalents (like 1.0). It's not clear how valuable this would be, however, since many downstream systems (like all [?most?] SPARQL systems) will mask this difference.
>>
>> As far as I can tell, and I might have missed something, all of this can be done under the assumption that the graph name is actually the name of the *graph*, not of a box containing the graph. As that way of expressing all this is (1) simpler, (2) more in line with both the history of graph naming and the current normative definition of a dataset, and (3) less liable to be misinterpreted as allowing labile "graphs" in datasets, I would prefer to avoid the "box" terminology and just have something like this which requires graph names to denote the actual graphs, ie the naming convention without the box convention.
>>
>> Pat
>>
>>> ===============
>>>
>>> That's it for now. Awaiting feedback.
>>>
>>> -- Sandro
>>
>> ------------------------------------------------------------
>> IHMC                                 (850)434 8903 home
>> 40 South Alcaniz St.                 (850)202 4416 office
>> Pensacola                            (850)202 4440 fax
>> FL 32502                             (850)291 0667 mobile (preferred)
>> phayes@ihmc.us                       http://www.ihmc.us/users/phayes
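PS - the kind of "whole graph bracketed by a datetime range" doodle I mentioned above, as a rough TriG sketch. Every term in it (the ex: properties, ex:danbri, the graph name) is made up purely for illustration; none of this is a concrete schema.org or WG proposal:

    PREFIX ex: <http://example.org/sketch#>
    PREFIX xs: <http://www.w3.org/2001/XMLSchema#>

    # The named graph just holds the plain claim(s)...
    GRAPH ex:danbriAge2013 {
      ex:danbri ex:age 41 .
    }

    # ...and the year-long bracket hangs off the graph name.
    ex:danbriAge2013
      ex:validFrom  "2013-01-01T00:00:00Z"^^xs:dateTimeStamp ;
      ex:validUntil "2013-12-31T23:59:59Z"^^xs:dateTimeStamp .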
Received on Monday, 16 September 2013 12:24:01 UTC