Re: Three solution designs to the first three Graphs use cases from Steve Harris on 2012-02-01 (public-rdf-wg@w3.org from February 2012)

From: Steve Harris <steve.harris@garlik.com>
Date: Wed, 1 Feb 2012 18:09:27 +0000
To: Sandro Hawke <sandro@w3.org>
Cc: Ivan Herman <ivan@w3.org>, Andy Seaborne <andy.seaborne@epimorphics.com>, public-rdf-wg@w3.org
Message-Id: <821E2E9F-37A9-4662-BA98-6515D5325B01@garlik.com>
On 2012-02-01, at 16:16, Sandro Hawke wrote:
> On Wed, 2012-02-01 at 15:33 +0000, Steve Harris wrote:
>> On 2012-02-01, at 13:03, Sandro Hawke wrote:
>>> On Wed, 2012-02-01 at 10:43 +0000, Steve Harris wrote:
>>>> On 2012-02-01, at 00:23, Sandro Hawke wrote:
>>>>> On Fri, 2012-01-27 at 12:27 +0000, Steve Harris wrote:
>>>>>> On 2012-01-27, at 10:35, Ivan Herman wrote:
>>>>>>> On Jan 27, 2012, at 10:33 , Andy Seaborne wrote:
>>>>>>>> On 27/01/12 03:45, Sandro Hawke wrote:
>>>>>>>>> On Thu, 2012-01-05 at 11:09 +0000, Andy Seaborne wrote:
>>>>>>>>>> On 04/01/12 19:23, David Wood wrote:
>>>>>>>>>>> Thanks, Sandro.  That's very helpful.
>>>>>>>>>>> 
>>>>>>>>>>> It might be useful to consider augmenting TriG syntax to support your third solution (explicitly naming relations). I'd be quite happy with that.
>>>>>>>>>> 
>>>>>>>>>> What would the data model be?
>>>>>>>>> 
>>>>>>>>> I think: an RDF graph which can have other RDF graphs as values of its
>>>>>>>>> triples.  All these graphs would be subgraphs of some greater graph, so
>>>>>>>>> they can share b-nodes.
>>>>>>>>> 
>>>>>>>>> (This is what cwm has had implemented since 2001, I think.)
>>>>>>>> 
>>>>>>>> I thought this WG wasn't going there (graph literals).
>>>>>>>> 
>>>>>>>> Personally, I see graph literals as the clean answer but it is RDF 2 (+).  RDF 1.1 is, to me, incremental improvements within the current abstract data model.  Datatyped literals  (e.g. "<s> <p> <o>"^^rdf:graphNTriples) are unwieldy and might block doing graph literals properly in RDF 2+.
>>>>>>>> 
>>>>>>> 
>>>>>>> I am not convinced it is such a huge jump and, if this is the only way to have a clean way forward, we may have to do this. The datatyped literals may be a way forward and, after all, the TriG use of '{' may be considered syntactic sugar for a datatyped literal…
>>>>>> 
>>>>>> This makes me /extremely/ nervous.
>>>>>> 
>>>>>> From the perspective of the indexing/query engine this is an enormous difference, and I'm not aware of any commonly used systems that currently follow this model. So, there's a lack of experience in the community of how to deal with these structures efficiently.
>>>>>> 
>>>>>> I bought this kind of argument with RDF Lists (collections), and accessor functions - storing the lists natively, and also reflecting them into triples. Coming up with an implementation that was both correct and efficient turned out to be so hard that we gave up, and just elected not to use Lists in production.
>>>>> 
>>>>> I'm sad to hear about this experience with lists.  Sometime I'd like to
>>>>> hear more about why that was so hard.   (Have you folks
>>>>> written/presented about it?)
>>>> 
>>>> No, but I think I mentioned it at the last F2F.
>>>> 
>>>> In essence, to get anything like decent performance you have to maintain a parallel copy of the list structure in a vector (of some kind). Tracking changes in the triples and updating the vector appropriately (and vice versa, if you allow useful list manipulation functions) is /extremely/ difficult and computationally expensive, especially at scale. Quite simply, it's just not worth the effort.
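
A minimal sketch of the "parallel vector" idea, assuming triples are plain (subject, predicate, object) string tuples held in memory. This is not the implementation discussed in the thread; the function name list_to_vector and the tuple representation are hypothetical and only illustrate how a vector is derived from rdf:first/rdf:rest triples.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FIRST, REST, NIL = RDF + "first", RDF + "rest", RDF + "nil"


def list_to_vector(triples, head):
    """Walk the rdf:first/rdf:rest chain starting at `head` and return
    the list members as a Python list (the parallel vector)."""
    first = {s: o for s, p, o in triples if p == FIRST}
    rest = {s: o for s, p, o in triples if p == REST}
    items, seen, node = [], set(), head
    while node != NIL:
        # A cycle or a missing rdf:first/rdf:rest means the list is malformed.
        if node in seen or node not in first or node not in rest:
            raise ValueError(f"malformed list at {node}")
        seen.add(node)
        items.append(first[node])
        node = rest[node]
    return items
```

Any insert or delete that touches an rdf:first or rdf:rest triple can invalidate a vector built this way, so the store has to detect such changes and rebuild or patch the vector. That bookkeeping is the cost described above.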
>>>> 
>>>> I believe Andy said something similar too.
>>> 
>>> Do your systems do inference, eg RDFS?   I'm guessing not.  My sense is
>>> that given the machinery for doing fairly-simple inference, this kind of
>>> list handling isn't bad.   If you need to scale such that RDFS is out of
>>> the question, then, yeah, I can imagine there would be a problem with
>>> lists.
>> 
>> No, and even if we had a use for it, it would be prohibitively computationally expensive.
>> 
>>> I'd like to address this by steering people away from the triple view of
>>> lists, having that be a part that can be turned off for performance.
>>> That's tough in a Standards-Conformant world.   I wonder if there's any
>>> way we can bless that design (making it okay to turn off the triple view
>>> in some situations) that does more good than harm.   Kind of like our
>>> blessing .well-known/genid giving people some license to avoid bnodes.
>> 
>> It's going to make RDF even weirder than it already is.
> 
> Not if we hide (archaify) the triple view, but I guess that's just too
> much change for this industry right now.   Never mind, I think this is
> probably a distraction.
> 
>>>>>> If we had a critical mass of systems that worked this way I would be enthusiastic about it, but we don't.
>>>>> 
>>>>> I think it's possible to implement graph literals (like in N3, or my
>>>>> third proposed solution) using a quad store, like the ones you already
>>>>> use.  That's how at least one version of cwm did it.   The technique is
>>>>> to map it to TriG/SameAs with minted identifiers:
>>>>> 
>>>>> So, to represent:
>>>>> 
>>>>> <s> <p> { <a> <b> <c> }
>>>>> 
>>>>> you mint an identifier ( <g1> ) then store these quads:
>>>>> 
>>>>> <s> <p> <g1> DEFAULT
>>>>> <a> <b> <c> <g1>
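
A minimal sketch of the mapping quoted above, under stated assumptions: a graph literal is modelled as a Python list of triples, every other term is a string, and minted identifiers take the form <g1>, <g2>, and so on. The function name to_quads and this representation are hypothetical; this is not cwm's code, and it does not address b-nodes shared between graphs.

```python
from itertools import count

DEFAULT = "DEFAULT"  # label for the default graph, as in the example above


def to_quads(triples, graph=DEFAULT, ids=None):
    """Flatten triples whose terms may be graph literals into plain quads."""
    ids = ids if ids is not None else count(1)
    quads = []

    def term(t):
        if isinstance(t, list):
            # Graph literal: mint a name, store its triples under that name.
            name = f"<g{next(ids)}>"
            quads.extend(to_quads(t, name, ids))
            return name
        return t

    for s, p, o in triples:
        quads.append((term(s), term(p), term(o), graph))
    return quads


# The example from the message: <s> <p> { <a> <b> <c> }
example = [("<s>", "<p>", [("<a>", "<b>", "<c>")])]
for quad in to_quads(example):
    print(*quad)
```

The order in which the quads come out is an artefact of the sketch; only the resulting set of quads matters.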
>>>> 
>>>> Sure, it's possible, but it's novel (for scalable systems), and no-one understands the performance implications.
>>>> 
>>>> We use SPARQL-style GRAPHs a lot for holding provenance information. Currently it's just a query across the quad to find the provenance identifier, but this would move that to a single-column join.
>>>> 
>>>>> In this proposal, such a use of quads is a purely internal decision of
>>>>> the implementer -- what's standard for interchange is the N3-like syntax
>>>>> with the graph literals.  It's just those documents are stored for easy
>>>>> access/manipulation in quads using a SameAs relation.  Elsewhere, people
>>>>> remain free to use quads, internally, however they want.
>>>>> 
>>>>> Wouldn't that solve the implementation burden?
>>>> 
>>>> No.
>>>> 
>>>> In general I have a serious issue with the way this group is chartered. It seems to take no account of the fact that there are FTSE-100 etc. companies spending serious effort and money on deployments of these technologies. IMHO it's far too late to run around messing with the underlying real-world datamodel (quads) when it has this many deployments. 10 years ago, when RDF was mostly just an academic plaything it might have been OK, but quads were already in common use then. I strongly believe this group should have been chartered to standardise what real implementations actually do, not invent random new stuff backed by no significant implementation experience.
>>>> 
>>>> We spend an eyewatering amount of money every year on power, cooling, and hardware to store quads. If RDF 1.1 makes that noticeably less efficient, then frankly we'll just ignore it.
>>>> 
>>>> - Steve [picking up toys and putting them back in the pram :)]
>>> 
>>> /me tosses pebbles at Steve's window, hoping he'll come back out and
>>> play...
>>> 
>>> Doesn't the charter say what you want?  It says:
>>> 
>>>       Care should be taken to not jeopardize existing RDF deployment
>>>       efforts and adoption. In case of doubt, the guideline should be
>>>       not to include a feature in the set of additions if doing so
>>>       might raise backward compatibility issues.
>>> 
>>> I certainly don't see us doing anything to break SPARQL, and (sorry if
>>> I'm being slow) I'm not really seeing why you'd have to change your
>>> internal representation structures, no matter what we decide for
>>> TriG-type stuff.
>> 
>> I simply can't afford that additional join.
> 
> If your systems had to do serious work through an interface that used
> graph literals as its only way to do graph reference, that would be a
> problem, yes.   I think in practice we're going to have to do some kind
> of hybrid approach, and I'm thinking that might solve your problem here.

It just doesn't sound like a step forwards. More complex, no more capability.

> So, anyway, you want quads, a la SPARQL.  Are you okay with the WG
> mandating that quads always have TriG/state semantics or always have
> TriG/equality or do you see some other way to address the highlighted
> use cases (shared crawler, archiving crawler, endorsement, and keeping
> inferred triples separate)…?

Well, we do most of those things with quads, so yes. The ones that we don't do, other people already do.

>>> I guess the problem is: how do we provide interop between people who are
>>> currently doing things in different ways?   It looks to me like the
>>> current designs I'm talking about are all sort of isomorphic,
>>> transformable into each other at the input and/or output stage, so that
>>> systems can do whatever they want internally.   Maybe I just haven't dug
>>> deeply enough.
>> 
>> Well, there's one approach that is very common (quads) and a bunch that are rare (quints, graph literals, multi triples). In my eyes that doesn't make them equal contenders.
>> 
>> There's a reasonable argument that quads were popularised by SPARQL, but I don't really think that's the case, and the why is irrelevant anyway.
> 
> I'm not opposed to quads or SPARQL at all; I'm just trying to understand
> what's reasonable to add (or even change *shudder*) to provide solutions
> for the use cases.

OK, but we, and many other people round the planet, have implementations that do those things using quads and SPARQL as-is, so I'm not terribly sympathetic to calls to change things.

I'm sure there are areas where things aren't done the cleanest way possible, but sometimes you have to let a bit of scruffy into your life ;)

- Steve

-- 
Steve Harris, CTO, Garlik Limited
1-3 Halford Road, Richmond, TW10 6AW, UK
+44 20 8439 8203  http://www.garlik.com/
Registered in England and Wales 0535 7233 VAT # 849 0517 11
Registered office: Landmark House, Experian Way, Nottingham, Notts, NG80 1ZZ
Received on Wednesday, 1 February 2012 18:09:59 UTC