Re: Three solution designs to the first three Graphs use cases from Sandro Hawke on 2012-02-01 (public-rdf-wg@w3.org from February 2012)

From: Sandro Hawke <sandro@w3.org>
Date: Wed, 01 Feb 2012 08:03:54 -0500
To: Steve Harris <steve.harris@garlik.com>
Cc: Ivan Herman <ivan@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-ID: <1328101434.2916.140.camel@waldron>

On Wed, 2012-02-01 at 10:43 +0000, Steve Harris wrote:
> On 2012-02-01, at 00:23, Sandro Hawke wrote:
> > On Fri, 2012-01-27 at 12:27 +0000, Steve Harris wrote:
> >> On 2012-01-27, at 10:35, Ivan Herman wrote:
> >>> On Jan 27, 2012, at 10:33 , Andy Seaborne wrote:
> >>>> On 27/01/12 03:45, Sandro Hawke wrote:
> >>>>> On Thu, 2012-01-05 at 11:09 +0000, Andy Seaborne wrote:
> >>>>>> On 04/01/12 19:23, David Wood wrote:
> >>>>>>> Thanks, Sandro.  That's very helpful.
> >>>>>>> 
> >>>>>>> It might be useful to consider augmenting TriG syntax to support your third solution (explicitly naming relations). I'd be quite happy with that.
> >>>>>> 
> >>>>>> What would the data model be?
> >>>>> 
> >>>>> I think: an RDF graph which can have other RDF graphs as values of its
> >>>>> triples.  All these graphs would be subgraphs of some greater graph, so
> >>>>> they can share b-nodes.
> >>>>> 
> >>>>> (This is what cwm has had implemented since 2001, I think.)
> >>>> 
> >>>> I thought this WG wasn't going there (graph literals).
> >>>> 
> >>>> Personally, I see graph literals as the clean answer but it is RDF 2 (+).  RDF 1.1 is, to me, incremental improvements within the current abstract data model.  Datatyped literals  (e.g. "<s> <p> <o>"^^rdf:graphNTriples) are unwieldy and might block doing graph literals properly in RDF 2+.
> >>>> 
> >>> 
> >>> I am not convinced it is such a huge jump and, if this is the only way to have a clean way forward, we may have to do this. The datatyped literals may be a way forward and, after all, the TriG version of using '{' may be considered as a syntactic sugar for a datatyped literal…
> >> 
> >> This makes me /extremely/ nervous.
> >> 
> >> From the perspective of the indexing/query engine there is an enormous difference, and I'm not aware of any commonly used systems that currently follow this model. So, there's a lack of experience in the community of how to deal with these structures efficiently.
> >> 
> >> I bought this kind of argument with RDF Lists (collections), and accessor functions - storing the lists natively, and also reflecting them into triples. Coming up with an implementation that was both correct and efficient turned out to be so hard that we gave up, and just elected not to use Lists in production.
> > 
> > I'm sad to hear about this experience with lists.  Sometime I'd like to
> > hear more about why that was so hard.   (Have you folks
> > written/presented about it?)
> 
> No, but I think I mentioned it at the last F2F.
> 
> In essence, to make it have anything like decent performance you have to maintain a parallel copy of the list structure in a vector (of some kind), and tracking changes in the triples, and updating the vector appropriately (and vice versa, if you allow useful list manipulation functions) is /extremely/ difficult, and computationally expensive, especially at scale. Quite simply, it's just not worth the effort.
> 
> I believe Andy said something similar too.

Do your systems do inference, e.g. RDFS?   I'm guessing not.  My sense is
that given the machinery for doing fairly-simple inference, this kind of
list handling isn't bad.   If you need to scale such that RDFS is out of
the question, then, yeah, I can imagine there would be a problem with
lists.

I'd like to address this by steering people away from the triple view of
lists, having that be a part that can be turned off for performance.
That's tough in a Standards-Conformant world.   I wonder if there's any
way we can bless that design (making it okay to turn off the triple view
in some situations) that does more good than harm.   Kind of like our
blessing .well-known/genid giving people some license to avoid bnodes.
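
To make the two views concrete, here is a rough sketch (my own
illustration, not code from any store) of the triple view of a list next
to the native vector it mirrors.  The function names and the "_:l"
blank-node prefix are made up for the example, and it assumes a
well-formed list with exactly one rdf:first and one rdf:rest per node.

```python
RDF_FIRST = "rdf:first"
RDF_REST = "rdf:rest"
RDF_NIL = "rdf:nil"


def list_to_triples(items, prefix="_:l"):
    """Expand a native list into its rdf:first/rdf:rest triple view.

    Returns (head_node, triples). The blank-node labels are hypothetical;
    a real store would mint its own.
    """
    if not items:
        return RDF_NIL, []
    nodes = [f"{prefix}{i}" for i in range(len(items))]
    triples = []
    for i, item in enumerate(items):
        rest = nodes[i + 1] if i + 1 < len(items) else RDF_NIL
        triples.append((nodes[i], RDF_FIRST, item))
        triples.append((nodes[i], RDF_REST, rest))
    return nodes[0], triples


def triples_to_list(head, triples):
    """Rebuild the native list by walking rdf:rest from the head node.

    Assumes the list is well formed (no cycles, no missing links).
    """
    first = {s: o for s, p, o in triples if p == RDF_FIRST}
    rest = {s: o for s, p, o in triples if p == RDF_REST}
    items = []
    node = head
    while node != RDF_NIL:
        items.append(first[node])
        node = rest[node]
    return items


head, triples = list_to_triples(["<a>", "<b>", "<c>"])
items = triples_to_list(head, triples)
```

A three-item list costs six triples, and any edit to those triples
invalidates the vector, which is the bookkeeping I'd like to make
optional.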

> >> If we had a critical mass of systems that worked this way I would be enthusiastic about it, but we don't.
> > 
> > I think it's possible to implement graph literals (like in N3, or my
> > third proposed solution) using a quad store, like the ones you already
> > use.  That's how at least one version of cwm did it.   The technique is
> > to map it to TriG/SameAs with minted identifiers:
> > 
> > So, to represent:
> > 
> >  <s> <p> { <a> <b> <c> }
> > 
> > you mint an identifier ( <g1> ) then store these quads:
> > 
> >  <s> <p> <g1> DEFAULT
> >  <a> <b> <c> <g1>
> 
> Sure, it's possible, but it's novel (for scalable systems), and no-one understands the performance implications.
> 
> We use SPARQL-style GRAPHs a lot for holding provenance information, currently it's just a query across the quad to find the provenance identifier, but this would move that to a single column join.
> 
> > In this proposal, such a use of quads is a purely internal decision of
> > the implementer -- what's standard for interchange is the N3-like syntax
> > with the graph literals.  It's just those documents are stored for easy
> > access/manipulation in quads using a SameAs relation.  Elsewhere, people
> > remain free to use quads, internally, however they want.
> > 
> > Wouldn't that solve the implementation burden?
> 
> No.
> 
> In general I have a serious issue with the way this group is chartered. It seems to take no account of the fact that there are FTSE-100 etc. companies spending serious effort and money on deployments of these technologies. IMHO it's far too late to run around messing with the underlying real-world datamodel (quads) when it has this many deployments. 10 years ago, when RDF was mostly just an academic plaything it might have been OK, but quads were already in common use then. I strongly believe this group should have been chartered to standardise what real implementations actually do, not invent random new stuff backed by no significant implementation experience.
> 
> We spend an eyewatering amount of money every year on power, cooling, and hardware to store quads. If RDF 1.1 makes that noticeably less efficient, then frankly we'll just ignore it.
> 
> - Steve [picking up toys and putting them back in the pram :)]

/me tosses pebbles at Steve's window, hoping he'll come back out and
play...

Doesn't the charter say what you want?  It says:
        
        Care should be taken to not jeopardize existing RDF deployment
        efforts and adoption. In case of doubt, the guideline should be
        not to include a feature in the set of additions if doing so
        might raise backward compatibility issues.

I certainly don't see us doing anything to break SPARQL, and (sorry if
I'm being slow) I'm not really seeing why you'd have to change your
internal representation structures, no matter what we decide for
TriG-type stuff.

I guess the problem is: how do we provide interop between people who are
currently doing things in different ways?   It looks to me like the
current designs I'm talking about are all sort of isomorphic,
transformable into each other at the input and/or output stage, so that
systems can do whatever they want internally.   Maybe I just haven't dug
deeply enough.
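
Here is a rough sketch of the kind of transformation I have in mind,
using the mapping quoted above (mint <g1>, store the outer triple in
DEFAULT and the inner one in <g1>).  It is only an illustration under my
own assumptions: a graph literal is modelled as a Python frozenset of
triples, the function names are made up, and any object that matches a
graph name is taken to be a minted identifier standing for a literal.

```python
import itertools

DEFAULT = "DEFAULT"


def flatten(triples, graph=DEFAULT, counter=None):
    """Map triples with nested graph literals onto flat quads.

    A graph literal is a frozenset of triples (an assumption of this
    sketch). Each one gets a minted name <g1>, <g2>, ... and its
    contents are stored as quads in that named graph.
    """
    if counter is None:
        counter = itertools.count(1)
    quads = []
    for s, p, o in triples:
        if isinstance(o, frozenset):
            name = f"<g{next(counter)}>"
            quads.append((s, p, name, graph))
            quads.extend(flatten(o, name, counter))
        else:
            quads.append((s, p, o, graph))
    return quads


def unflatten(quads, graph=DEFAULT):
    """Inverse mapping: rebuild nested graph literals from the quads.

    Assumes no graph refers to itself, directly or indirectly; a cycle
    would make this recurse forever.
    """
    names = {g for _, _, _, g in quads if g != DEFAULT}
    result = set()
    for s, p, o, g in quads:
        if g != graph:
            continue
        if o in names:
            o = frozenset(unflatten(quads, o))
        result.add((s, p, o))
    return result


# <s> <p> { <a> <b> <c> }
doc = [("<s>", "<p>", frozenset({("<a>", "<b>", "<c>")}))]
quads = flatten(doc)
round_trip = unflatten(quads)
```

The round trip only holds if nothing else in the store uses the minted
names as ordinary IRIs, which is exactly the sort of detail I may not
have dug into deeply enough.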
 
    -- Sandro
Received on Wednesday, 1 February 2012 13:04:09 UTC