Re: New Proposal (6.1) for GRAPHS from Sandro Hawke on 2012-04-10 (public-rdf-wg@w3.org from April 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Tue, 10 Apr 2012 18:38:01 -0400
To: Arnaud Le Hors <lehors@us.ibm.com>
Cc: public-rdf-wg <public-rdf-wg@w3.org>
Message-ID: <1334097481.2249.70.camel@waldron>
On Thu, 2012-04-05 at 15:24 -0700, Arnaud Le Hors wrote:
> Sandro Hawke <sandro@w3.org> wrote on 04/05/2012 04:31:48 AM:
> 
> > From: Sandro Hawke <sandro@w3.org> 
> > To: Arnaud Le Hors/Cupertino/IBM@IBMUS, 
> > Cc: public-rdf-wg <public-rdf-wg@w3.org> 
> > Date: 04/05/2012 04:33 AM 
> > Subject: Re: New Proposal (6.1) for GRAPHS 
> > 
> > On Wed, 2012-04-04 at 15:05 -0700, Arnaud Le Hors wrote:
> > > Hearing people arguing over whether the statement  <u> {<a> <b>
> <c>}
> > > defines a complete graph or not leads me to wonder whether we
> > > shouldn't recognize that the answer ought to be: "it
> depends". :-) 
> > > 
> > > I think Lee explained very effectively how the same statement can
> be
> > > interpreted differently depending on whether you're doing a GET,
> PUT,
> > > POST or something similar. 
> > > 
> > > I already noted that the response Sandro expects from his question
> > > "According to this query, how many triples are in the graph known
> to
> > > that endpoint as 'http://g1.example.org' ?" is actually based on additional information
> he
> > > is providing in his question, specifically: the query is limited
> to
> > > the graph known to that particular endpoint. 
> > > 
> > > If all you had were the following triples: 
> > > >>>      <a>  <b>  1.
> > > >>>      <a>  <b>  2.
> > > >>>      <a>  <b>  3.
> > > without giving any other information about how or where you got
> them
> > > from and you'd ask: "how many triples are associated with <a>?"  I
> > > think the answer would have to be: "it depends". 
> > 
> > FWIW, I really think you do need to keep the notion of a graph in
> the
> > question, since SPARQL has the keyword "GRAPH".   As in, "How many
> > triples are in the graph associated with <http://g1.example.org>?"
> > 
> > But since we're not really doing this survey anyway, it probably
> doesn't
> > matter. 
> 
> But I don't think keeping the notion of graph in the question is what
> really makes the difference. What matters is the scope which is
> defined by "known to that endpoint". 
> 
> > 
> > > I heard Sandro say that when he dereferences <u> he expects to get
> all
> > > the triples in <u>. I agree but I think that's a Linked Data view
> of
> > > the world and it comes from the meaning of GET rather than what
> you
> > > receive. In another context, retrieved in a different way, what
> you
> > > receive might mean something else. In the case of a SPARQL query
> it
> > > could mean "these are all the triples in <u> that this endpoint
> knows
> > > about". 
> > > 
> > > So, do we really need to choose one way or the other? Can't we
> just
> > > leave it to the application to decide whether it defines a
> complete
> > > graph or not?
> > 
> > I see three ways to do this.   Which are you suggesting?
> > 
> >   1.  have two syntaxes.   eg trig means complete graphs and
> n-quads 
> >       means partial graphs.
> > 
> >   2.  have two constructs in trig, eg: 
> >         <u1> { <a> <b> <c> }         to mean full graphs and
> >         <u2> { <a> <b> <c> ... }     to mean partial graphs
> > 
> >   3.  use the same syntax, but let the consumer decide which was
> meant.
> > 
> > I don't like 3 at all, because it doesn't solve the use cases.   For
> > instance, you couldn't have a shared crawler, if the apps using the
> > crawl happened to need complete graphs. 
> 
> Maybe (most likely ;-) I'm missing something but, in the web crawler
> use case for instance, isn't it the fact that you're getting it via a
> GET of a URL that returns all the data the crawler accumulated that
> would tell you the graph you're getting is complete? Why does that
> need to be embedded in the syntax/data set? 

Crawlers won't necessarily report all the data from each source.  For
instance, they could quite plausibly truncate at 100MB of source text.

With 'complete-graph' semantics, they would have to flag that fact in
the metadata somewhere; with 'partial-graph' semantics, I expect
truncating crawlers wouldn't bother to flag it, since their report would
still be correct.
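
To make that concrete, here is a rough Python sketch of what a client
could conclude from a crawler dump under each reading.  The CrawlReport
type, its "truncated" flag and the triple_count_claim function are
made-up names for illustration only; they are not part of any proposal
or existing tool.

```python
from dataclasses import dataclass

Triple = tuple[str, str, str]


@dataclass(frozen=True)
class CrawlReport:
    """Hypothetical crawler output for one source graph."""
    graph_name: str
    triples: frozenset[Triple]
    truncated: bool = False  # hypothetical metadata flag, set by the crawler


def triple_count_claim(report: CrawlReport, complete_graph_semantics: bool) -> str:
    """Say what a client may conclude about the size of the named graph."""
    n = len(report.triples)
    # Only an unflagged report under complete-graph semantics entitles
    # the client to an exact count.
    if complete_graph_semantics and not report.truncated:
        return f"exactly {n}"
    # Under partial-graph semantics, or when truncation is flagged,
    # the report is only a lower bound.
    return f"{n} or more"
```

Under the partial-graph reading the flag changes nothing, which is why
I'd expect truncating crawlers not to bother setting it.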

> Is it because you don't want to rely on out of band information? 

Indeed.   I think one of our key goals is to do as much as we reasonably
can with just machine-to-machine communication.    In the shared crawler
example, we should posit that there are many off-the-shelf crawlers and
many clients, and we don't want each client administrator to have to
have a long conversation with each server admin.   Instead, we want to
just give each client a list of URLs of crawler-dumps it can use.   If
some of those do truncation and others don't, they should reveal that in
some way a machine can see.

There are many approaches to that, but I like the idea of using true
declarations as much as possible, because I think it scales very well.
So if you're going to use a format like TriG, then define it to have one
meaning rather than leaving it ambiguous, so people can make true
statements and others can tell if they did.   (That is, at least pick
whether it's partial-graph or complete-graph semantics, not either-or
where you can't tell which without talking to a human.)
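
The difference between the two readings shows up as soon as two
documents name the same graph.  A minimal Python sketch, assuming a
dataset is just a mapping from graph name to a set of triples; the
names Dataset, GraphConflict, merge_partial and merge_complete are
invented here for illustration:

```python
Triple = tuple[str, str, str]
Dataset = dict[str, frozenset[Triple]]  # graph name -> triples


class GraphConflict(Exception):
    """Two documents make incompatible claims about one named graph."""


def merge_partial(a: Dataset, b: Dataset) -> Dataset:
    """Partial-graph reading: each document lists some of the triples."""
    merged = dict(a)
    for name, triples in b.items():
        merged[name] = merged.get(name, frozenset()) | triples
    return merged


def merge_complete(a: Dataset, b: Dataset) -> Dataset:
    """Complete-graph reading: each document lists all of the triples."""
    merged = dict(a)
    for name, triples in b.items():
        if name in merged and merged[name] != triples:
            # Both documents claim to give the whole graph, yet they differ.
            raise GraphConflict(name)
        merged[name] = triples
    return merged
```

With one document saying <u1> { <a> <b> <c> } and another saying
<u1> { <a> <b> <d> }, the partial merge yields a two-triple graph and
the complete merge reports a contradiction.  Either behaviour is fine,
as long as a machine can tell which one the publisher meant.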

> If you know that http://example.com/all returns all data that has been
> accumulated, you don't need the dataset to tell you that, do you? 

(But you don't know that, as above.)

> Just like Lee doesn't rely on the dataset to decide whether to replace
> or merge a graph in anzo. In his case, the command line provides the
> out of band information that drives the interpretation one way or the
> other. 
> 
By rough analogy, Lee is talking about program source code and I'm
talking about packages (as in Debian).   Certainly there are times it's
good to let people pick their operations by hand, but I'd like to enable
plugging things together for automatic operation.
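
For reference, the two by-hand operations Lee describes can be modelled
roughly as below.  This is not Anzo's API; GraphStore, import_ and
replace are names invented for this sketch, which only assumes a store
that maps graph names to sets of triples.

```python
Triple = tuple[str, str, str]


class GraphStore:
    """Hypothetical in-memory store: graph name -> set of triples."""

    def __init__(self) -> None:
        self.graphs: dict[str, set[Triple]] = {}

    def import_(self, name: str, triples: set[Triple]) -> None:
        """POST-like: treat the input as a subgraph and add it."""
        self.graphs.setdefault(name, set()).update(triples)

    def replace(self, name: str, triples: set[Triple]) -> None:
        """PUT-like: treat the input as the complete graph and overwrite."""
        self.graphs[name] = set(triples)
```

The same input gives different results depending on which method the
operator picks, so the choice lives outside the data.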

> > The second one is cute, but I think would be very hard to implement;
> it
> > would force every consumer to deal with both complete and incomplete
> > graphs, at least a bit.
> > 
> > Both 1 and 2 raise the issue of how you reflect this difference in
> the
> > dataset, or in SPARQL.   How could you do that?
> > 
> >     -- Sandro
> 
> I would have said: 
> 4. use the same syntax, let the context decide which it is 

If all we're looking for is some commonality between tools, that would
work.   I think we can do more and enable some plugging together of
machines just using URIs.

     -- Sandro

> > > --
> > > Arnaud  Le Hors - Software Standards Architect - IBM Software
> Group
> > > 
> > > 
> > > 
> > > 
> > > From:        Lee Feigenbaum <lee@thefigtrees.net> 
> > > To:        Ivan Herman <ivan@w3.org>, 
> > > Cc:        Arnaud Le Hors/Cupertino/IBM@IBMUS, Sandro Hawke
> > > <sandro@w3.org>, public-rdf-wg <public-rdf-wg@w3.org> 
> > > Date:        04/04/2012 05:16 AM 
> > > Subject:        Re: New Proposal (6.1) for GRAPHS 
> > > Sent by:        Lee Feigenbaum <figtree@gmail.com> 
> > > 
> > >
> ______________________________________________________________________
> > > 
> > > 
> > > 
> > > How do people use TriG in practice today? For us, the choice
> between 
> > > these semantics today is determined externally to a TriG file.
> That
> > > is, 
> > > given foo.trig which contains u1 { a b c }, whether this is all of
> u1
> > > or 
> > > a subgraph of u1 is determined based on which API or which
> > > command-line 
> > > command is used. For example:
> > > 
> > > > anzo import foo.trig
> > > 
> > > ...interprets what's in the trig file as subgraphs that get
> added
> > > to 
> > > any existing contents of the graphs. So "a b c" would be added to
> u1.
> > > 
> > > > anzo replace foo.trig
> > > 
> > > ...interprets what's in the trig file as complete graphs, and sets
> > > the 
> > > contents of the graphs in the repository, overwriting whatever
> might 
> > > already be in the repository as the contents of the graphs. So
> after 
> > > this operation u1 ends up with exactly { a b c } as its contents.
> > > 
> > > (Aside: So for us, "import" is basically like doing a POST of the 
> > > triples in the trig file to the associated graphs via the SPARQL
> > > Graph 
> > > Store Protocol, and "replace" is like doing a PUT.)
> > > 
> > > (Aside 2: there are other operations as well, such as "anzo
> update 
> > > --remove" which uses the subgraph semantics and also means that
> the 
> > > triples in question should be removed from the associated graphs
> in
> > > the 
> > > repository.)
> > > 
> > > All of which is to say, there are plenty of use cases in our
> > > experience 
> > > for both of these semantics. If the standard supported a way to
> make 
> > > these semantics explicit, we would probably support that via some
> > > sort 
> > > of generic command ("anzo process"? who knows), but would still
> let 
> > > these existing command line commands override the semantics. We
> have 
> > > plenty of cases in which we export some bit of trig, and then
> later
> > > on 
> > > either use "anzo import" or "anzo replace" based on the situation
> --
> > > and 
> > > we wouldn't want to have to produce two different trig files for
> that 
> > > situation! (This would be roughly analogous to the way in which
> the 
> > > SPARQL Protocol lets the RDF dataset definition override what's in
> > > the 
> > > query, so that queries can easily be reused in different
> contexts.)
> > > 
> > > Lee
> > > 
> > > On 4/4/2012 3:18 AM, Ivan Herman wrote:
> > > >
> > > > On Apr 4, 2012, at 07:09 , Arnaud Le Hors wrote:
> > > >
> > > >> Hi Sandro,
> > > >> I have to say that my expectation was similar to Charles's. I
> guess
> > > it's a matter of deciding whether<u1>  {<a>  <b>  <c>   } defines
> > > the<u1>  graph in its entirety, as containing one triple, or
> merely
> > > states that the triple<a>  <b>  <c>   is part of graph<u1>.
> > > >>
> > > >> I'm not saying it should be the latter rather than the former,
> just
> > > that it's not obvious.
> > > >> See below for more on that.
> > > >
> > > > So let me give my typical W3C answer, ie, trying to find a
> > > compromise:-)
> > > >
> > > > More seriously. The structure offered by Sandro relies on the
> fact
> > > that the
> > > >
> > > > <u>  {<a>  <b>  <c>  }
> > > >
> > > > syntax gets its more precise meaning through a possible
> > > >
> > > > <u>  rdf:type rdf:SOMECLASSHERE .
> > > >
> > > > Sandro offered two such classes; isn't it possible to have three,
> one
> > > that makes the graph THE graph, the other that makes it PART OF
> the
> > > graph?
> > > >
> > > > We can of course have long discussions on which the default is.
> But
> > > that is a lighter discussion I believe.
> > > >
> > > > Ivan
> > > >
> > > >
> > > >
> > > >>
> > > >> Sandro Hawke<sandro@w3.org>  wrote on 04/02/2012 05:57:13 PM:
> > > >>
> > > >>> From: Sandro Hawke<sandro@w3.org>
> > > >>> To: Charles Greer<cgreer@marklogic.com>,
> > > >>> Cc: Charles Greer<Charles.Greer@marklogic.com>, public-rdf-wg
> > > >>> <public-rdf-wg@w3.org>
> > > >>> Date: 04/02/2012 05:57 PM
> > > >>> Subject: Re: New Proposal (6.1) for GRAPHS
> > > >>>
> > > >>> On Mon, 2012-04-02 at 14:00 -0700, Charles Greer wrote:
> > > >>>> Thanks for responding Sandro.  I think that what I'm finding
> > > difficult,
> > > >>>> or at least a significant departure from RDF as I have
> understood
> > > it in
> > > >>>> the past, is that this TRIG document
> > > >>>>
> > > >>>> <u1>  {<a>  <b>  <c>  .<d>  <e>  <f>  }
> > > >>>>
> > > >>>> is not equivalent to these n-quads:
> > > >>>>
> > > >>>> <a>  <b>  <c>  <u1>.
> > > >>>> <d>  <e>  <f>  <u1>.
> > > >>>>
> > > >>>> Or rather, you now need a document structure around n-quads
> as
> > > well in
> > > >>>> order to provide the context in which rdf knows that these
> > > triples, and
> > > >>>> only these triples, constitute the graph<u1>.
> > > >>>>
> > > >>>> I had previously thought that RDF was a data model that
> didn't
> > > need any
> > > >>>> notion of 'document' to work.  I'm not sure how another
> assertion
> > > that
> > > >>>>
> > > >>>> {<u1>  a rdf:Graph }
> > > >>>>
> > > >>>> can assert the boundaries of<u1>  unless either the { }
> syntax
> > > does more
> > > >>>> than it appears to, or the document is a harder scope
> boundary
> > > than I
> > > >>>> would have expected.  If the document has some relationship
> to
> > > scope, I
> > > >>>> think that should be made explicit.
> > > >>>
> > > >>> Two main points:
> > > >>>
> > > >>> 1.  That rdf:Graph declaration is a different thing.  It changes
> > > how<u1>
> > > >>> relates to the graph, but in a semantic (not syntactic) way.
>  It
> > > can be
> > > >>> in a different document, or deduced by the use of some
> predicates,
> > > or
> > > >>> known a priori by a data consumer.  Knowing it entitles the
> > > consumer to
> > > >>> see that<u1>  actually identifies the graph directly, rather
> than
> > > just
> > > >>> being a label for the graph.     This might matter if we also
> > > know<u1>
> > > >>> dc:licence ...SomeLicensingTerms....   Is it the graph that's
> > > licensed,
> > > >>> or something else?     There are some use cases that suggest
> this
> > > >>> distinction is important, but if it turns out not to be, it's
> not
> > > bad,
> > > >>> people will just not use rdf:Graph declarations much.
> > > >>>
> > > >>> 2.  Whether or not your trig example and your n-quads example
> are
> > > >>> equivalent depends on your reading of n-quads.   This extends
> to
> > > your
> > > >>> reading of SPARQL as well.     My understanding is people are
> > > somewhat
> > > >>> informal about this, but they generally do expect that once
> > > they've seen
> > > >>> the whole trig file, or the whole n-quads file, or searched
> the
> > > whole
> > > >>> SPARQL end point, that they've seen all the triples in the
> graph
> > > with
> > > >>> that name/label.
> > > >>>
> > > >>> As a social test case, we could tell people this SPARQL query
> is
> > > run:
> > > >>>
> > > >>>      SELECT ?s ?p ?o
> > > >>>      WHERE { GRAPH <http://g1.example.org> { ?s ?p ?o } }
> > > >>>
> > > >>> and that we got three result bindings back:
> > > >>>
> > > >>>      ?s  ?p  ?o
> > > >>>      === === ===
> > > >>>      <a>  <b>  1.
> > > >>>      <a>  <b>  2.
> > > >>>      <a>  <b>  3.
> > > >>>
> > > >>> Then we ask them: "According to this query, how many triples
> are
> > > in the
> > > >>> graph known to that endpoint as 'http://g1.example.org' ?"
> > > >>>
> > > >>> What do you think they'll say?
> > > >>>
> > > >>> I think most folks will say, "Three", even if you ask them to
> > > think
> > > >>> again and be pedantically precise.
> > > >>>
> > > >>
> > > >> I agree that's what they would say but primarily because you
> said:
> > > "in the graph known to that endpoint"
> > > >> This is a critical element which isn't apparent in a mere
> statement
> > > like:
> > > >>
> > > >> <u1>  {<a>  <b>  <c>  .<d>  <e>  <f>  }
> > > >>
> > > >> Which doesn't say anything about where it comes from and
> whether
> > > it's complete or not.
> > > >>
> > > >> This being said, I can get used to having it the way you
> suggest.
> > > Especially when the graph name comes first. If we had: {<a>  <b>
>  <c>
> > >  .<d>  <e>  <f>  }<u1>  I would think differently.
> > > >> --
> > > >> Arnaud  Le Hors - Software Standards Architect - IBM Software
> Group
> > > >>
> > > >>
> > > >>> I think that means they're using the complete-graph semantics
> I'm
> > > >>> suggesting.  If they were using partial-graph semantics,
> they'd
> > > have to
> > > >>> say, "Three or more".
> > > >>>
> > > >>> You see what I'm saying?   When we have a complete protocol
> > > interaction,
> > > >>> via SPARQL, or transmitting a trig or n-quads file, I think
> the
> > > usual
> > > >>> assumption is that *all* the triples in the named graph are
> being
> > > sent,
> > > >>> not just some of them.
> > > >>>
> > > >>> I understand sometimes it would be nice to store/transmit just
> > > part of
> > > >>> some named graph.   But, as I discussed in a message a couple
> of
> > > minutes
> > > >>> ago, I think we have to pick one or the other, and I think the
> > > >>> complete-graph approach is better.  It's pretty easy to convey
> > > partial
> > > >>> graphs if we define the complete approach.
> > > >>>
> > > >>> (I suppose if we defined the partial-graph approach we could
> > > transmit
> > > >>> complete graphs by transmitting partial graphs and including a
> > > >>> triple-count as metadata, so you know it's complete.   I guess
> > > that
> > > >>> would work, but it seems to me to be optimizing for the
> > > much-less-common
> > > >>> case.)
> > > >>>
> > > >>> Coming back to:
> > > >>>
> > > >>>> I had previously thought that RDF was a data model that
> didn't
> > > need
> > > >>> any
> > > >>>> notion of 'document' to work.
> > > >>>
> > > >>> Yeah, it depends what you're doing with it.   There's a lot
> you
> > > can do
> > > >>> with RDF without paying any attention to what documents
> particular
> > > bits
> > > >>> of RDF were found in, but I think most of the Graphs use cases
> > > involve
> > > >>> situations where you do need to pay attention to these
> document
> > > >>> boundaries.
> > > >>>
> > > >>>> Thanks for your willingness to understand my points --- I'm
> sure
> > > that my
> > > >>>> formal language will improve over time.
> > > >>>
> > > >>> It's a long process.   :-)    Interestingly, it seems to be
> helped
> > > by
> > > >>> arguing.
> > > >>>
> > > >>>      -- Sandro
> > > >>>
> > > >>>>
> > > >>>> Charles
> > > >>>>
> > > >>>>
> > > >>>>
> > > >>>> On 04/02/2012 08:36 AM, Sandro Hawke wrote:
> > > >>>>> On Thu, 2012-03-29 at 09:25 -0700, Charles Greer wrote:
> > > >>>>>> I really like this solution and it seems to satisfy the use
> > > cases
> > > >>>>>> familiar to me from when I actually worked a lot with RDF
> in
> > > the wild.
> > > >>>>>>
> > > >>>>>> One thing I'm tripping over though --  The scope of a TRIG
> > > document or
> > > >>>>>> RDF dataset in effect 'closes the world.'  Is the idea of
> > > "merge" only
> > > >>>>>> within a TRIG document/dataset?
> > > >>>>>>
> > > >>>>>> I can only see two ways to really assert a graph literal --
> > > either by
> > > >>>>>> sanctifying the boundaries of  a dataset, thereby making
> merges
> > > with
> > > >>>>>> external data problematic, or by signing bytes.  Am I
> missing
> > > something,
> > > >>>>>> as usual?
> > > >>>>> There's some misunderstanding here, yes.   Maybe you can
> talk
> > > through
> > > >>>>> some particular thing you imagine doing, involving merging
> and
> > > TriG, and
> > > >>>>> I'll be able to pick it up.   From what you've written, I'm
> > > confused.
> > > >>>>>
> > > >>>>> Maybe I can clarify by translating this TriG document:
> > > >>>>>
> > > >>>>>           <u1>    {<a>    <b>    <c>   }
> > > >>>>>
> > > >>>>> into this English declaration:
> > > >>>>>
> > > >>>>>           The URI 'u1' denotes something, and that thing has
> > > exactly one
> > > >>>>>           associated RDF Graph.   That associated RDF graph
> > > consists of
> > > >>>>>           one RDF triple, which we can write in turtle as
> "<a>
> > > <b>   <c>".
> > > >>>>>
> > > >>>>> So, perhaps it's more clear, now.  If you merged that with
> > > another TriG
> > > >>>>> document:
> > > >>>>>
> > > >>>>>           <u1>    {<a>    <b>    <d>   }
> > > >>>>>
> > > >>>>> Then, trying to accept both documents at once, you'd be
> saying:
> > > >>>>>
> > > >>>>>           The URI 'u1' denotes something, and that thing has
> > > exactly one
> > > >>>>>           associated RDF graph.  In one document that
> associated
> > > graph is
> > > >>>>>           claimed to be the RDF triple "<a>   <b>   <c>",
> but in
> > > another
> > > >>>>>           document that graph is claimed to be the RDF
> triple
> > > "<a>   <b>
> > > >>>>>           <d>".
> > > >>>>>
> > > >>>>> So, in this case, you can try to merge the documents, but
> when
> > > you do,
> > > >>>>> you find there is a contradiction, since there is only
> allowed
> > > to be one
> > > >>>>> associated graph, but in this case there are two different
> ones.
> > > >>>>>
> > > >>>>>          -- Sandro
> > > >>>>>
> > > >>>>>> Charles
> > > >>>>>>
> > > >>>>>>
> > > >>>>>> On 03/27/2012 07:23 PM, Sandro Hawke wrote:
> > > >>>>>>> I've written up design 6 (originally suggested by Andy) in
> > > more
> > > >>>>>>> detail.  I've called it 6.1 since I've changed/added a few
> > > details that
> > > >>>>>>> Andy might not agree with.  Eric has started writing up
> how
> > > the use
> > > >>>>>>> cases are addressed by this proposal.
> > > >>>>>>>
> > > >>>>>>> This proposal addresses all 15 of our old open issues
> > > concerning graphs.
> > > >>>>>>> (I'm sure it will have its own issues, though.)
> > > >>>>>>>
> > > >>>>>>> The basic idea is to use trig syntax, and to support the
> > > different
> > > >>>>>>> desired relationships between labels and their graphs via
> > > class
> > > >>>>>>> information on the labels.  In particular, according to
> this
> > > proposal,
> > > >>>>>>> in this trig document:
> > > >>>>>>>
> > > >>>>>>>       <u1>    {<a>    <b>    <c>    }
> > > >>>>>>>
> > > >>>>>>> ... we only know that<u1>    is some kind of label for the
> RDF
> > > Graph<a>
> > > >>>>>>> <b>    <c>, like today.  However, in this trig document:
> > > >>>>>>>
> > > >>>>>>>       {<u2>    a rdf:Graph }
> > > >>>>>>>       <u2>    {<a>    <b>    <c>    }
> > > >>>>>>>
> > > >>>>>>> we know that<u2>    is an rdf:Graph and, what's more, we
> know
> > > that<u2>
> > > >>>>>>> actually is the RDF Graph {<a>    <b>    <c>    }.  That
> is,
> > > in
> > > >>> this case, we
> > > >>>>>>> know that URL "u2" is a name we can use in RDF to refer to
> > > that g-snap.
> > > >>>>>>>
> > > >>>>>>> Details are here:
> > > http://www.w3.org/2011/rdf-wg/wiki/Graphs_Design_6.1
> > > >>>>>>>
> > > >>>>>>> That page includes answers to all the current GRAPHS
> issues,
> > > including
> > > >>>>>>> ISSUE-5, ISSUE-14, etc.
> > > >>>>>>>
> > > >>>>>>> Eric has started going through Why Graphs and adding the
> > > examples as
> > > >>>>>>> addressed by Proposal 6.1:
> > > >>>>>>> http://www.w3.org/2011/rdf-wg/wiki/Why_Graphs_6.1
> > > >>>>>>>
> > > >>>>>>>         -- Sandro (with Eric nearby)
> > > >>>>>>>
> > > >>>>>>>
> > > >>>>>>
> > > >>>>>
> > > >>>>
> > > >>>>
> > > >>>
> > > >>>
> > > >>>
> > > >
> > > >
> > > > ----
> > > > Ivan Herman, W3C Semantic Web Activity Lead
> > > > Home: http://www.w3.org/People/Ivan/
> > > > mobile: +31-641044153
> > > > FOAF: http://www.ivan-herman.net/foaf.rdf
> > > >
> > > >
> > > >
> > > >
> > > >
> > > 
> > > 
> > 
> > 
> >
Received on Tuesday, 10 April 2012 22:38:05 UTC