Re: In RDF what is the best practice to represent data provenance (source)? from Michael Schneider on 2007-01-17 (semantic-web@w3.org from January 2007)

From: Michael Schneider <m_schnei@gmx.de>
Date: Thu, 18 Jan 2007 00:32:26 +0100
To: chris@bizer.de, semantic-web@w3.org
CC: semantic_web@googlegroups.com
Message-ID: <45AEB20A.40803@gmx.de>
[ FROM: semantic_web@googlegroups.com; CC: semantic-web@w3.org ]

On 09.01.2007 09:15, Chris Bizer wrote:
> Hi Michael,
> 
>>> RDF reification doesn't work for practical reasons
>> 
>> Hi, Chris!
>> 
>> Why is this so? I always had some vague feeling that reification
>> does not have many friends within the community, but I never found
>> a real reason for this: neither a technical reason nor a modeling
>> reason.
> 
> Some reasons are listed below, others in the Named Graphs journal
> paper http://www.websemanticsjournal.org/ps/pub/2005-23

Thanks, Chris, for the link, I finally found the time to read it. You
and your co-authors propose an extension to RDF, where it would be
possible to reference complete RDF graphs by name (URI), and so it would
be possible to annotate such a named graph by adding properties to the
URI. I think that this is really a missing feature in RDF, so I
personally appreciate this proposal of named graphs.

To annotate a single RDF triple, you further propose building a
singleton named graph, which contains only this single triple. This
might bring some difficulties with it, because a singleton set is of
course different from the element it contains, but let's forget about
this for now. The issue we are actually discussing in this thread is
whether such a named singleton graph would be an adequate replacement
for RDF Reification. My answer is no, and I will explain why I think so.

My reason is, in fact, given by yourself in section 3.3 of your paper,
where you say that "RDF reification operates at the semantical level,
not the syntactic". You give an example for this observation: There is
some reified statement

   :r a rdf:Statement ;
      rdf:subject :s1 ;
      rdf:predicate :p ;
      rdf:object :o .

and some annotation of ":s1 :p :o"

   :r :someAnnotation :x .

Now, when :s1 happens to be the same as some other resource :s2

   :s1 owl:sameAs :s2 .

then the above ':someAnnotation :x' also holds for ":s2 :p :o".

Now I say: This is not a bug, it is a feature! RDF Reification is a
means to describe /relationships/ in the given domain, it is /not/ a
means to describe syntactical RDF triples which denote such
relationships. For example, if you have the triple

   alice isMarriedWith bob

in your triple store, you might be interested in stating since when they
have been married, in which church the marriage took place, and so on.
And if it turns out that the URI 'bob' denotes the same person as
'robert', then all those annotations should of course also hold for

   alice isMarriedWith robert

So, for such a purpose, you should IMO use RDF Reification!

On the other hand, if you want to say something about the (syntactical)
triple in the graph itself, e.g. that this triple (not the marriage) was
created in the triple store on 2007-01-17, then a named singleton
graph would be the right thing to use. RDF Reification would simply be
the wrong tool, because the 'bob' version of the triple may have been
stored at a different time than the 'robert' version.
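
To make the distinction concrete, here is a minimal Python sketch. It is
only an illustration under my own assumptions: owl:sameAs is reduced to a
plain alias map, and the names canon, relationship and syntactic_triple
as well as the placeholder annotation values are made up for this
example, they are not part of any RDF library.

```python
# Hypothetical sketch: annotating a relationship vs. annotating a triple.

# owl:sameAs facts, reduced here to a simple alias -> canonical-name map.
same_as = {"robert": "bob"}

def canon(term):
    """Map a URI to the canonical name of the individual it denotes."""
    return same_as.get(term, term)

def relationship(s, p, o):
    """Semantic key: identifies the relationship, whatever URIs spell it."""
    return (canon(s), p, canon(o))

def syntactic_triple(s, p, o):
    """Syntactic key: identifies the triple exactly as it was written."""
    return (s, p, o)

# Annotation of the marriage itself (what RDF reification describes).
# The value is a placeholder, not real data.
about_relationship = {
    relationship("alice", "isMarriedWith", "bob"): {"since": "<wedding date>"},
}

# Annotation of the stored triple (what a singleton named graph describes).
about_triple = {
    syntactic_triple("alice", "isMarriedWith", "bob"): {"storedOn": "2007-01-17"},
}

# Look both annotations up through the 'robert' spelling of the triple.
marriage_via_robert = about_relationship.get(
    relationship("alice", "isMarriedWith", "robert"))
stored_via_robert = about_triple.get(
    syntactic_triple("alice", "isMarriedWith", "robert"))

print(marriage_via_robert)  # the marriage annotation carries over
print(stored_via_robert)    # the storage annotation does not
```

The relationship annotation is found under either spelling, while the
storage date stays attached to the one triple that was actually written.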

The problem is that RDF currently offers no means to describe
syntactical triples, so RDF Reification is often /abused/ for this
purpose. With named graphs, this misuse would no longer be necessary.

What I am saying here is that named (singleton) graphs and RDF
reification are completely complementary: they do not overlap in
functionality, so we cannot drop one in favor of the other.

Now, the next point to discuss is the annotation of complete triple
/sets/. Say we have some RDF graph consisting of different facts
about alice and bob. If we want to talk about the RDF graph itself (the
syntactical thing), I would like to use a /named graph/ for this
purpose. But if I had to talk about the /set of relationships/ which
this RDF graph describes, then named graphs would be the wrong tool.
Below, you cite a proposal where reified statements are collected
within an /rdf:Bag/. This is the approach I would prefer for the
latter need.

So to summarize: There are actually four different representational use 
cases, with four different solutions:

   * referencing a (syntactical) triple:
     use a singleton named graph!

   * referencing a (semantical) relationship:
     use RDF reification!

   * referencing a (syntactical) RDF graph:
     use a named graph!

   * referencing a (semantical) set of relationships:
     use a bag of reified statements!


Now, I will come to the more technical issues you were mentioning:

> Snip: RDF reification provides a mechanism for representing
> meta-information about triples but tries to stay inside the bounds of
> a pure triple data model at the same time. This approach has some
> substantial drawbacks:
> 
> Triple bloat. RDF reification increases the number of triples in an 
> graph significantly. An effect which is called “triple bloat”
> [CS04a]. Describing the elements of a triple using the reification
> vocabulary causes an at least threefold increase alone.

Triple bloat is a fact, of course. But I would not regard it as a real
problem unless I have to store a really huge number of reified
statements. As long as there are fewer than, say, a few million
of them - and many triple sets in a future SW will probably be that
small - I do not care about space, when there is even enough space left
for this in my cell phone's internal memory. :)

So what remains are those really big sets of triple annotations. But,
as I already mentioned in my previous mail, there are existing systems
which help me cope with this problem (e.g. Jena). In fact,
a straightforward idea is to just store all those reified statements
physically as quadruples in a relational table, each quadruple component
being an integer referencing the row in the resources table where the
corresponding URI is stored. This would be a pretty space-friendly
approach. When accessing such a reified statement from some triple
store, it would still be seen as a four-triple-per-statement construct,
but that would just be a /view/ of the data.
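
A rough sketch of such a storage layout, using Python's built-in sqlite3
module. The table and function names (resources, reified, resource_id,
store_reified, triple_view) are my own choices for illustration, and for
simplicity every component is treated as a URI string. It is not how Jena
or any other existing store does it.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE resources (id INTEGER PRIMARY KEY, uri TEXT UNIQUE NOT NULL);
    -- One row per reified statement: (statement, subject, predicate, object).
    CREATE TABLE reified (
        r INTEGER NOT NULL REFERENCES resources(id),
        s INTEGER NOT NULL REFERENCES resources(id),
        p INTEGER NOT NULL REFERENCES resources(id),
        o INTEGER NOT NULL REFERENCES resources(id)
    );
""")

def resource_id(uri):
    """Return the integer id for a URI, inserting it on first use."""
    con.execute("INSERT OR IGNORE INTO resources (uri) VALUES (?)", (uri,))
    row = con.execute("SELECT id FROM resources WHERE uri = ?", (uri,)).fetchone()
    return row[0]

def store_reified(r, s, p, o):
    """Store one reified statement as a single quadruple of integer ids."""
    ids = tuple(resource_id(x) for x in (r, s, p, o))
    con.execute("INSERT INTO reified VALUES (?, ?, ?, ?)", ids)

def triple_view():
    """Expand every stored quadruple into the usual four reification triples."""
    rows = con.execute("""
        SELECT rr.uri, rs.uri, rp.uri, ro.uri
        FROM reified q
        JOIN resources rr ON rr.id = q.r
        JOIN resources rs ON rs.id = q.s
        JOIN resources rp ON rp.id = q.p
        JOIN resources ro ON ro.id = q.o
    """)
    for r, s, p, o in rows:
        yield (r, "rdf:type", "rdf:Statement")
        yield (r, "rdf:subject", s)
        yield (r, "rdf:predicate", p)
        yield (r, "rdf:object", o)

store_reified(":r", ":s1", ":p", ":o")
triples = list(triple_view())
for t in triples:
    print(t)
```

Physically there is one row of four integers per reified statement; the
four triples exist only in what triple_view() hands back.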

> Querying reified Statements. It is rather cumbersome to query 
> information which is represented as reified statements using RDF
> query languages such as SPARQL. As a single reified statement is
> represented by multiple triples, queries over reified statements also
> involve multiple triple patterns for a single statement and therefore
> quickly become unreadable and confusing. Figure 4.10 shows a SPARQL
> query to retrieve all information about people who have an email
> address against the reification our example graph.

I did not find this Figure 4.10, but what you probably mean is something
like this:

   PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

   SELECT ?s ?p ?o
   WHERE {
     ?r a rdf:Statement .
     ?r rdf:subject ?s .
     ?r rdf:predicate ?p .
     ?r rdf:object ?o .
   }

This is, of course, somewhat inconvenient, at least when you have to
write it down manually. But, IMHO, this is just a problem with N3/Turtle
syntax, not with reification itself. What I have been missing for a long
time is special support for reification in N3. Reification would deserve
this as a built-in feature of RDF. Note that there is a nice syntactical
trick in RDF/XML-ABBREV for existing triples: just add an attribute
"rdf:ID" to the property tag which should become the predicate of the
reified statement. For N3, well, what about something like this:

   `:s :p :o :r` :hasDate "2007-01-11"^^xsd:date .

Here, I use some kind of "quoting syntax" (backticks) to build a
quadruple, where the fourth entry, ':r', would be the URI of the reified
statement. If you just want to create a blank node reified statement,
where the node ID isn't used anywhere else, you could omit the fourth
component, writing:

   `:s :p :o` :hasDate "2007-01-11"^^xsd:date .

And this syntax would then of course have to be carried over to SPARQL,
which would make the above query more compact.

Just an idea...
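
To show that the backtick notation is pure syntactic sugar, here is a
small Python sketch that expands it back into ordinary reification
triples. Everything in it is hypothetical: the notation itself is only
my proposal, the function name desugar and the generated "_:stmtN" blank
node labels are made up, and the pattern only handles terms without
embedded whitespace.

```python
import itertools
import re

# `s p o` or `s p o r`, where each term contains no whitespace or backtick.
QUOTED = re.compile(
    r"`\s*([^\s`]+)\s+([^\s`]+)\s+([^\s`]+)(?:\s+([^\s`]+))?\s*`")
_blank_ids = itertools.count(1)

def desugar(line):
    """Rewrite one backtick-quoted statement into plain reification triples."""
    match = QUOTED.search(line)
    if match is None:
        return [line]  # nothing quoted, keep the line unchanged
    s, p, o, r = match.groups()
    if r is None:
        # No fourth component given: invent a fresh blank node label.
        r = "_:stmt%d" % next(_blank_ids)
    reification = [
        "%s a rdf:Statement ." % r,
        "%s rdf:subject %s ." % (r, s),
        "%s rdf:predicate %s ." % (r, p),
        "%s rdf:object %s ." % (r, o),
    ]
    # The quoted part is replaced by the node that stands for the statement.
    rest = line[:match.start()] + r + line[match.end():]
    return reification + [rest]

for out in desugar('`:s :p :o :r` :hasDate "2007-01-11"^^xsd:date .'):
    print(out)
```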

> If a query engine is not especially optimized for this kind of
> queries it would answer them slowly, as evaluating multiple triple
> patterns implies a join for each pattern [MK03].

Again, it shouldn't be too hard to do such an optimization. It's at
least easy for a SPARQL parser to match the above pattern of four
triples per reified statement. So, when the writer of such a SELECT
statement learns to always write reified statements in this way (or,
better, in the syntactically sugared way I proposed above), the parser
can produce an internal quadruple representation of the form
"[?s ?p ?o ?r]", and then do single lookups in the quadruple table
which I proposed above. That should be reasonably time-efficient.
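
The rewriting step could look roughly like the following Python sketch.
It assumes the query has already been parsed into a list of
(subject, predicate, object) patterns with "a" expanded to rdf:type; the
function name collapse_reification is invented here, and no existing
SPARQL engine is claimed to work this way.

```python
PARTS = ("rdf:subject", "rdf:predicate", "rdf:object")

def collapse_reification(patterns):
    """Replace each complete four-pattern reification group by one quad.

    patterns: list of (subject, predicate, object) triple patterns.
    Returns (quads, remaining); each quad has the form (s, p, o, r).
    """
    groups = {}
    for triple in patterns:
        node, pred, obj = triple
        is_type = pred == "rdf:type" and obj == "rdf:Statement"
        if pred in PARTS or is_type:
            # Keep the first pattern seen for each reification property.
            groups.setdefault(node, {}).setdefault(pred, triple)

    quads, consumed = [], set()
    for node, found in groups.items():
        if len(found) == 4:  # rdf:type plus all three parts are present
            quads.append(tuple(found[p][2] for p in PARTS) + (node,))
            consumed.update(found.values())

    # Everything that was not folded into a quad is evaluated as before.
    remaining = [t for t in patterns if t not in consumed]
    return quads, remaining

patterns = [
    ("?r", "rdf:type", "rdf:Statement"),
    ("?r", "rdf:subject", "?s"),
    ("?r", "rdf:predicate", "?p"),
    ("?r", "rdf:object", "?o"),
    ("?r", ":hasDate", "?d"),
]
quads, remaining = collapse_reification(patterns)
print(quads)
print(remaining)
```

The four reification patterns become one lookup in the quadruple table,
and only the annotation pattern is left to be joined against it.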

My guess now: If an optimization is reasonably easy to realize, and if
it is important enough (keep in mind that reification is still a core
language feature of RDF), it will probably show up soon in all relevant
triple store implementations (at least soon after it has shown up in the
competitors' triple store implementation :) ).

> Redundant Meta-Information. RDF reification requires metainformation
> to be attached separately to each reified statement. This further
> increases the size of the graph and might lead to inconsistencies
> when meta-information is changed. In order to allow meta-information
> to be expressed at a higher level, Graham Klyne proposes to group
> reified statements together using an rdf:Bag [BG04] and to attach 
> meta-information to this bag instead of having to attach it
> separately to each reified statement [Kly00]. His approach eliminates
> redundant meta-information but leave the other problems of
> reification untouched, as each original triple is still described by
> at least three reification triples plus one extra triple to relate
> the reified statement to the bag.

See my discussion above about using this rdf:Bag approach for
representing sets of relationships.

> Single Level of Granularity. The RDF reification mechanism allows 
> meta-information to be expressed only on a single, fixed level of 
> granularity. Within most information exchange and publication
> scenarios, RDF information is provided as graphs consisting of
> multiple statements. These scenarios therefore do not require
> meta-information about individual statements and it would be more
> suitable to use a mechanism that allows meta-information to be
> expressed at different levels of granularity.
> 
>>> and there are discussions about removing it from the RDF Spec
>>> (see ISWC 2006 Web2.0 panel discussion).
> 
> Tim Berners-Lee said on the panel, that reification sucks and that he
>  would like to have it removed from the spec.

Well, I would really love to see him change his mind some day
and say that reification only /mostly/ sucks. ;-)

> I think there is some video coverage of the panel discussion on the
> ISWC website, if you want to hear his original statements.
> 
>>> A more current approach are Named Graphs. Please refer to the
>>> SPARQL Specification for details: 
>>> http://www.w3.org/TR/rdf-sparql-query/#rdfDataset
>> 
>> I did not know about RDF datasets until now, so I read this section.
>> However, I am not quite sure if I correctly understand what you
>> mean by pointing me to this topic. I can see RDF datasets as
>> collections of some non-named default graph and some additional
>> named graphs. Probably a good thing for query languages like
>> SPARQL. But for RDF, do you mean that reified statements, possibly
>> regarded as "single-statement-graphs", are too restricted, so that
>> reification should be replaced by a generalized mechanism, by
>> which one is able to annotate complete graphs?
> 
> Yes. Examples of using Named Graphs to represent provenance
> information and other meta-information are found in:
> 
> http://sites.wiwiss.fu-berlin.de/suhl/bizer/WIQA/index.htm 
> http://sites.wiwiss.fu-berlin.de/suhl/bizer/WIQA/browser/index.htm 
> http://www.websemanticsjournal.org/ps/pub/2005-23 
> http://www.w3.org/2004/03/trix/
> 
>> I would have a few things to say on this, but before starting to
>> comment on my own hypotheses, I first want to hear what you really
>> mean.
> 
> I'm looking forward to your comments ;-)
> 
> Cheers
> 
> Chris

Best regards,
Michael

>> Michael Schneider wrote on 2006-12-30 in Google Group:Semantic Web:
>> 
>> 
>>>> Hi, Bryan!
>>>> 
>>>> Alan told you, which properties you can use to annotate your
>>>> data.
>>>> 
>>>> Note, however, that in RDF you can only add properties to
>>>> /resources/. Because you want to annotate /triples/, you first
>>>> have to regard these triples as resources. In RDF, this can be
>>>> done by "reification", see:
>>>> 
>>>> http://www.w3.org/TR/rdf-primer/#reification
>>>> 
>>>> Your example would then need to be extended the following way
>>>> (note, that the original definition of ":Texas" must /not/ be
>>>> deleted, otherwise you would just talk about some data triple,
>>>> which does not really exist):
>>>> 
>>>> <rdf:Statement rdf:about="#referenceOfTexasPopulationTriple">
>>>> 
>>>>   <!-- subject, predicate and object of data triple: -->
>>>>   <rdf:subject rdf:resource="#Texas"/>
>>>>   <rdf:predicate
>>>>     rdf:resource="http://www.geography.fake/geo#population"/>
>>>>   <rdf:object>20851820</rdf:object>
>>>> 
>>>>   <!-- annotation of the reified data triple -->
>>>>   <dcterms:references>United States 2000 Census</dcterms:references>
>>>>   <dcterms:issued>2000-04-01</dcterms:issued>
>>>> 
>>>> </rdf:Statement>
>> 
>> BryanJacobson wrote on 2006-12-28 in Google Group:Semantic Web:
>> 
>>>>> I'm very new to the world of RDF and the Semantic Web.
>>>>> 
>>>>> I want to represent a fact/triple: (Texas population
>>>>> 20851820).
>>>>> 
>>>>> I think I would represent it as follows in RDF:
>>>>> 
>>>>> <rdf:RDF
>>>>>   xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>   xmlns:geo="http://www.geography.fake/geo#">
>>>>>   <rdf:Description
>>>>>     rdf:about="http://www.geography.fake/geo/Texas">
>>>>>     <geo:population>20851820</geo:population>
>>>>>   </rdf:Description>
>>>>> </rdf:RDF>
>>>>> 
>>>>> What is the best practice for representing:
>>>>> 
>>>>> * Where this data comes from: (United States 2000 Census).
>>>>> * Given that populations change constantly, the point in time
>>>>>   associated with this population (April 1, 2000).
>>>>> 
>>>>> If appropriate, go ahead and tell me I should be looking at
>>>>> this completely differently.
>>>>> 
>>>>> Many thanks! -- Bryan
>> 
>>
Received on Wednesday, 17 January 2007 23:32:35 UTC