Re: Reification and Provenance modelling from Bob Ferris on 2011-09-21 (public-rdf-comments@w3.org from September 2011)

From: Bob Ferris <zazi@smiy.org>
Date: Wed, 21 Sep 2011 13:11:25 +0200
To: public-rdf-comments@w3.org
Message-ID: <4E79C65D.9010703@smiy.org>

Hi Richard,

On 9/20/2011 11:30 PM, Richard Cyganiak wrote:
> On 20 Sep 2011, at 19:28, Bob Ferris wrote:
>> (albeit, I get the impression that I cannot really convince you from my proposal ;) )
>
> I'm not so much interested in proposals at this time, but in use cases and requirements.

I think that the important use cases are already covered in [1]. My
specific one is fed by multiple information providers and requires
an access control mechanism. Especially important for that use case is
the ability to push changes back to their origins: if I have a
resource description that is aggregated from the information of
multiple information providers, I need to know which statement comes
from which information provider. Furthermore, if single statements are
spread over multiple graphs (views), I need to be able to handle
changes to these statements as well.

> That's because this group needs to understands what people are trying to achieve. Otherwise we can't effectively compare different proposals.
>
>> On 9/20/2011 5:16 PM, Richard Cyganiak wrote:
>>> I would assume that the default graph contains all triples regardless of their named graph.
>>
>> So far I do not have seen a triple store, which duplicates all statements in its default graph
>
> Most of them do this AFAIK. I'm pretty sure it's the default in Virtuoso, and we're running TDB in that configuration. I'm pretty sure that I've seen it for 4store as well. This scenario is explicitly pointed out as a “useful arrangement” in the SPARQL 1.1 spec:

In a system driven by multiple sources and access control, this is not
really applicable.

>
> http://www.w3.org/TR/sparql11-query/#exampleDatasets
>
>> i.e., this would break a bit the concept of name graphs, e.g., imagine if I have a named graph with all my personal data, I wouldn't be happy if this data is also query-able via the default graph.
>
> Access control is an orthogonal issue.

Yes, but a very important one from my POV. Therefore, separating 
personal data spaces into single graphs is a good approach.

> If you have a way of specifying access control to named graphs, then I would expect the store to exclude them from the default graph if the client is not authorized to see them. In standard SPARQL, if your default graph is public, then so are all your named graphs.
>
>>> Then a statement identifier approach could be queried like this:
>>>
>>> SELECT * WHERE {
>>>     TRIPLE ?t { ?s ?p ?o }
>>>     ...
>>> }
>>
>> I do not think that we would need such a TRIPLE keyword.
>
> How else would you bind a variable to a statement identifier? For example, “give me the statement identifier for the triple {<bob>  a foaf:Person}”?

SELECT ?t WHERE {
	?s ?p ?o ?t }

(please keep in mind, I proposed the usage of the statement identifier 
as an optional position)
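
As a rough illustration of what an optional fourth position could mean
for pattern matching, here is a minimal Python sketch. It is not part of
the proposal in [2]; the names `Statement` and `match` and the sample
data are assumptions made up for this example. A pattern position left
as `None` behaves like a query variable, so the statement identifier can
be bound without a dedicated TRIPLE keyword.

```python
from typing import NamedTuple


class Statement(NamedTuple):
    s: str
    p: str
    o: str
    sid: str  # statement identifier (hypothetical fourth position)


def match(statements, s=None, p=None, o=None, sid=None):
    """Return the statements matching the pattern; None acts as a variable."""
    pattern = (s, p, o, sid)
    return [
        st for st in statements
        if all(want is None or want == got for want, got in zip(pattern, st))
    ]


# Made-up sample data
data = [
    Statement("<bob>", "rdf:type", "foaf:Person", "<#s1>"),
    Statement("<bob>", "foaf:knows", "<alice>", "<#s2>"),
]

# "Give me the statement identifier for the triple {<bob> a foaf:Person}"
ids = [st.sid for st in match(data, s="<bob>", p="rdf:type", o="foaf:Person")]
```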

>
>> My use case of my proposal is reification and how to relate single statements a.k.a. shortcut relations to its reification class instances.
>
> Now we're getting somewhere. Can you explain why this use case of property reification isn't well-addressed by named graphs? An example might help.

I don't want to scramble this information into separate graphs, i.e., 
shortcut relations and reification class instances should be able to 
co-exist in one and the same graph.

>
>>>> To make statements about them somewhere else we usually need an identifier to refer to them, or?
>>>
>>> No, because graphs are literals, so one can repeat the literal to make statements about it.
>>
>> Well, then I have the same disadvantage as in the existing Named Graph proposal, i.e., statements of one named graph do not have any semantically relation to identical statements of another named graphs.
>
> That's not true. The semantic relation between the statements is that they're identical. It's like using the literal number 1 in two different graphs, or the string "Bob". We don't need to assign an identifier to these literals in order to know that they're the same. Literals are self-denoting in RDF.

Okay, you are right. However, graphs can be more complex than a simple
number- or string-typed literal. Furthermore, we would utilise these
graphs for further processing of our model. Usually a literal can be
seen as a kind of leaf in a graph representation, right?

>
> Just to repeat this: If RDF had graph literals or “triple literals”, and the same literal occurred in two different graphs, then the design of RDF literals requires that they'd have to match if you asked a query.
>
>>> Occurrences of the same literal in different graphs are semantically equivalent (unlike, say, blank node identifiers).
>>
>> Do really intend this always?
>
> It's definitely how literals are defined in RDF. I didn't perceive any problem with that so far.

Yes, because we don't have graphs in literals right now. They reflect a
further structure, which would be processed in the same way as the rest
of our RDF model; in contrast, strings would be processed, e.g., by an
NLP tool.

>
>> I don't think so, see my example above. Hence, we have to cover both cases.
>
> Not sure what you mean here. I don't understand the case where you sometimes would want 1 and 1 to be identical and sometimes not.

Quoted from [2]:

"one can also decouple a reused statement by changing its statement
identifier; i.e., the triple of the statement are still the same
but the relation to the original statement might now be another e.g.,
reflected by a provenance statement e.g., <#s20> :original <#s19>"

i.e., if I intend a statement that is used in multiple graphs to belong
semantically together, so that I really refer to that one statement,
then I'll utilise the same statement identifier; otherwise, I'll utilise
a different statement identifier (and if necessary I can still relate
these statements to each other).

Let's imagine the following use case: you are trying to implement an
algorithm that ranks information from multiple information providers.
Before the aggregation and federation task, you would usually store the
information fetched from different information providers separately.
For that, you could utilise Named Graphs and statement identifiers.
Different information providers can provide the same information, i.e.,
the same statements. However, to keep track of their origin, you may
address them by different statement identifiers at the beginning.
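
The ranking scenario can be sketched roughly as follows in Python. This
is only an illustration under assumptions: the provider names, the
sample rows, and the "rank by number of distinct providers" rule are all
made up here and are not taken from [2]. The point is that identical
triples keep separate statement identifiers, so their origin survives
the aggregation step.

```python
from collections import defaultdict

# (subject, predicate, object, statement identifier, provider); made-up data
fetched = [
    ("<#alice>", ":friend", "<#bob>", "<#s1>", "providerA"),
    ("<#alice>", ":friend", "<#bob>", "<#s2>", "providerB"),
    ("<#alice>", ":friend", "<#carol>", "<#s3>", "providerA"),
]


def group_by_triple(rows):
    """Map each triple to the (statement identifier, provider) pairs asserting it."""
    groups = defaultdict(list)
    for s, p, o, sid, provider in rows:
        groups[(s, p, o)].append((sid, provider))
    return dict(groups)


def rank_by_support(rows):
    """Order triples by the number of distinct providers that assert them."""
    groups = group_by_triple(rows)
    return sorted(
        groups,
        key=lambda triple: len({prov for _, prov in groups[triple]}),
        reverse=True,
    )


ranking = rank_by_support(fetched)
```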

>
>>> RDF graphs and named graphs are abstract data models, and implementers are free to store them any way they want internally,
>>
>> Yes, I know. However, why do we talk nowadays about quad stores instead of triple stores.
>
> We talk about “graph stores” and “SPARQL stores” too. That's what they are storing in an abstract sense, considering the interface they present to the world. This doesn't mean that they are internally organized in any particular way. (Some “quad stores” are actually column stores, and some are quint stores etc)
>
>>> I'm still trying to understand what the perceived problem with single-triple named graphs is.
>>
>> Real world knowledge description are then, at the moment with the existing SPARQL specification, not really query-able, if we have many isolated single-triple named graphs.
>
> I don't understand what this means. Can you give me an example of such a knowledge description, and an example query that you cannot express in SPARQL if the data is organized in single-triple named graphs?

Let's take the multiple information providers scenario. If I stored the
federated information in separate graphs to keep track of the
provenance, an information resource would not really be query-able,
because single statements are isolated in separate graphs. (Please keep
the statement duplication proposal aside here.)

However, by utilising statement identifiers I can still track the
provenance, single statements are not scattered over separate graphs,
and I can easily query this information by specifying the graph that
contains all these statements.

(please keep in mind that my proposal suggested to separate the indexing 
of graphs, i.e., you should be able to query a graph as usual, i.e., 
directly define query patterns for the statements of that graph instead 
of "?g :contains ?s" ...)
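
A small Python sketch of this idea, under stated assumptions: the
`graph` list, the `source_of` mapping, and the `describe` helper are
hypothetical names invented for this example. All federated statements
live in one graph, and provenance is attached per statement identifier
instead of per graph, so a resource description can be read in a single
pass.

```python
# One graph holding all federated statements: (s, p, o, statement identifier)
graph = [
    ("<#alice>", "foaf:name", '"Alice"', "<#s1>"),
    ("<#alice>", "foaf:mbox", "<mailto:alice@example.org>", "<#s2>"),
    ("<#bob>", "foaf:name", '"Bob"', "<#s3>"),
]

# Provenance kept per statement identifier rather than per graph (made-up values)
source_of = {"<#s1>": "providerA", "<#s2>": "providerB", "<#s3>": "providerA"}


def describe(resource):
    """Return (predicate, object, provider) for every statement about resource."""
    return [(p, o, source_of[sid]) for s, p, o, sid in graph if s == resource]


alice = describe("<#alice>")
```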

>
> (There should be a law that forbids invoking the “real world” in an argument unless you give a real-world example ;-)
>
>>> Regarding #2, it's probably false because the RDF abstract syntax does not constrain implementations, and I'm unconvinced that an optimized implementation of your scheme would actually be more space-efficient than an optimized implementation of named graphs.
>>
>> Well, the current Named Graphs semantics (as defined by Bizer et al.) say (more or less) that equal statements in separate graphs do not have any relation to each other. As you said above the literal-graph proposal treat equal literals as equal (without any identifier). Both proposals do not really reflect real world need, where would need to be able to represent both options as needed.
>
> How would you represent these two options using statement identifiers?

Here is an example (following the syntax as introduced in [2]):

# a statement that can be identified by statement identifier #s1
<#alice> :friend <#bob> <#s1> .
# a statement that can be identified by statement identifier #s2
<#alice> :friend <#bob> <#s2> .

# a graph that contains the statement #s1
<#g1> rdf:type rdfg:Graph <#s3> .
<#g1> :contains <#s1> <#s4> .

# another graph that contains the statement #s1
<#g2> rdf:type rdfg:Graph <#s5> .
<#g2> :contains <#s1> <#s6> .

# a graph that contains the statement #s2
<#g3> rdf:type rdfg:Graph <#s7> .
<#g3> :contains <#s2> <#s8> .

#g1 and #g2 contain the same statement (#s1)
#g3 contains another statement (#s2)
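
The two options can also be modelled roughly in Python. This is a
sketch only; the dictionaries and the helper names `share_statement`
and `share_triple` are assumptions for illustration. Sharing a
statement identifier expresses the "same statement" case, while equal
triples under different identifiers stay unrelated unless linked
explicitly.

```python
# Statement identifier -> triple (both identifiers carry the same triple)
statements = {
    "<#s1>": ("<#alice>", ":friend", "<#bob>"),
    "<#s2>": ("<#alice>", ":friend", "<#bob>"),
}

# Graph -> set of contained statement identifiers
contains = {
    "<#g1>": {"<#s1>"},
    "<#g2>": {"<#s1>"},
    "<#g3>": {"<#s2>"},
}


def share_statement(g_a, g_b):
    """True if both graphs contain a statement with the same identifier."""
    return bool(contains[g_a] & contains[g_b])


def share_triple(g_a, g_b):
    """True if both graphs contain an equal triple, whatever its identifier."""
    def triples(g):
        return {statements[sid] for sid in contains[g]}
    return bool(triples(g_a) & triples(g_b))
```

With this data, `share_statement("<#g1>", "<#g2>")` holds, whereas
`<#g1>` and `<#g3>` only share an equal triple.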

> AFAICT two statements with different statement identifiers would have the same relationship that two single-triple named graphs have, or is there some difference that I'm missing?

Yep, they are not really semantically related to each other (initially;
of course, one could relate them explicitly, e.g., via a :original
relation etc.).

>
>> Albeit, maybe graphs that contain the same statements are an edge case, however, we have to be able to represent edge cases as well. The power of its expressiveness is so far the success of RDF.
>
> I tend to agree, and my argument is that single-triple named graphs actually work reasonably well as an edge case (while having the advantage that they also work very well for the common case).

Yes, if I utilise the traditional Named Graphs syntax, then such
statements are not really semantically related to each other. Otherwise,
if I utilise the graph literals syntax, then identical graphs are
strongly semantically related to each other.

>
>>>> However, I believe that there is a strong antipathy for single-triple graphs.
>>>
>>> This is not a technical argument.
>>
>> The technical argument is that one of the bad query handling with single-triple graphs (see above).
>
> You mean stores that don't support mirroring the named graphs into the default graph?

I intended to address the query-ability issue, i.e., scrambled
information (caused by "unnecessary" graph isolations) vs. composed
information (produced by the utilisation of statement identifiers and
statements that are de-coupled from their graph enclosure).

> That's not a complaint about the proposal, but a complaint about the state of implementations; and that's something we can't fix by writing something else into the spec.
>
> Best,
> Richard

Cheers,


Bo


[1] http://www.w3.org/2011/rdf-wg/wiki/TF-Graphs-UC
[2] http://lists.w3.org/Archives/Public/public-rdf-comments/2011Jan/0001.html
Received on Wednesday, 21 September 2011 11:12:03 UTC