revisiting tracing of statement origin (long)

Many moons ago, there was a discussion [1] on a topic that is dear to
my heart. The closest I can come to an issue on this topic is [2].
Meanwhile, I've been rethinking how tracing of statement origin can be
handled. My thoughts aren't that far along but I thought it's better to
share them than to be overtaken by events.

A primary use-case from my perspective is the often described ability
of RDF to allow aggregation of distinct RDF sources. Two different
examples of this are Mozilla which keeps the various sources that make
up the aggregate store separate and most of the rest of the RDF
implementations which either keep the individual "models" totally
separate or slurp all of the sources into a single model. 

For example, I would like to be able to slurp up a large set of rss1
channels into a single RDF db and then be able to interact with it
either as a single dataset or in a way that distinguishes the original
document context from which the various statements originated.

The tack I was pursuing last year was based on the premise that the
M&S should be taken literally as to reification everywhere. If you
accepted that premise (which is under discussion in rdfcore [3]) then
you would generate a lot more triples (although with many possible
optimizations) which would potentially allow you to trace statements
back to both the document and the rdf:Description element that they
occurred in.

Lately, I've been thinking of a different approach. As before, it
focuses on the document-centric aspects of RDF rather than the
model-centric ones. In the same way that you can distinguish between the
docu-heads and the data-heads in the XML world, I think you can
distinguish between the docu/data-heads and the model/logic-heads in
the RDF world. In both cases, this is a gross generalization. Having
made this generalization, I see myself falling into the former camp. 

On to the approach. The desired result of this or any similar exercise
in my mind is to be able to "join" triples generated from a set of
documents and still be able to distinguish which document each triple
came from. This especially includes triples that have the same
[s,p,o]. 
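
To make the duplicate-[s,p,o] case concrete, here is a minimal Python
sketch. It assumes a store modeled as a plain set of (s, p, o) tuples;
the urn:example names are hypothetical and not taken from any real
vocabulary.

```python
# Hypothetical triple that two different documents both assert.
SHARED = ("urn:example:subject", "urn:example:property", "same value")

doc_a_triples = {SHARED}
doc_b_triples = {SHARED}

# A naive "join" that treats the store as a set of triples.
merged = doc_a_triples | doc_b_triples

# Only one copy survives, so nothing records that two documents said it.
surviving_copies = len(merged)
```

With a store like this there is no place left to hang the answer to
"which document did this come from?" once the join has happened.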

Let's say I have two RSS channels that I aggregate. Both have an item
about today's article about the McVeigh execution at nytimes.com
(http://www.nytimes.com/2001/06/12/national/12MCVE.html). The two
channels have completely different viewpoints on the execution. Let's
call them channelA and channelB. Here are examples of their rss1
document fragments:

===================

http://controversy.com/channelA/2001/06/12.rdf

<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item
rdf:about="http://www.nytimes.com/2001/06/12/national/12MCVE.html">
   <rss1:description>It was a sad day...</rss1:description>
   <contro:theRefs>
     <rdf:Bag>
        <rdf:li rdf:resource="really sad url" />
     </rdf:Bag>
   </contro:theRefs>
</rss1:item>
===================
http://controversy.com/channelB/2001/06/12.rdf

<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item
rdf:about="http://www.nytimes.com/2001/06/12/national/12MCVE.html">
   <rss1:description>It was a happy day...</rss1:description>
   <contro:theRefs>
     <rdf:Bag>
        <rdf:li rdf:resource="really happy url" />
     </rdf:Bag>
   </contro:theRefs>
</rss1:item>
===================

If these are directly slurped into a single rdf store we would get
something like the following when we serialize back to RDF/XML:

===================
<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item
rdf:about="http://www.nytimes.com/2001/06/12/national/12MCVE.html">
  <rss1:description>It was a sad day...</rss1:description>
   <contro:theRefs>
     <rdf:Bag
       rdf:about="http://controversy.com/channelA/2001/06/12.rdf#gen10">
       <rdf:li rdf:resource="really sad url" />
     </rdf:Bag>
   </contro:theRefs>
   <rss1:description>It was a happy day...</rss1:description>
   <contro:theRefs>
     <rdf:Bag
       rdf:about="http://controversy.com/channelB/2001/06/12.rdf#gen11">
       <rdf:li rdf:resource="really happy url" />
     </rdf:Bag>
   </contro:theRefs>
</rss1:item>
===================

Unfortunately, we can't tell one rss1:description from the other in
the joined result. Still, notice that the rdf:Bag elements that were
unlabeled in the source documents have been labeled by the processor.
AFAIK, this is universal behavior on the part of RDF processors when
they encounter RDF/XML resources that are either labeled with an ID or
are unlabeled. I.e., they are labeled with a URIref whose URI is that
of the source RDF/XML document and whose fragment identifier is
implementation dependent.
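
The effect described above can be sketched in a few lines of Python.
This is only an illustration under simplifying assumptions: triples
are plain tuples, property names are abbreviated strings, rdf:type
triples are omitted, and parse_channel and source_of are hypothetical
helpers, not part of any real RDF processor.

```python
ARTICLE = "http://www.nytimes.com/2001/06/12/national/12MCVE.html"
DOC_A = "http://controversy.com/channelA/2001/06/12.rdf"
DOC_B = "http://controversy.com/channelB/2001/06/12.rdf"


def parse_channel(doc_uri, description, ref_url, gen_id):
    """Return the triples a processor might emit for one channel document."""
    # The unlabeled rdf:Bag gets a generated label based on the document URI.
    bag = f"{doc_uri}#{gen_id}"
    return {
        (ARTICLE, "rss1:description", description),
        (ARTICLE, "contro:theRefs", bag),
        (bag, "rdf:_1", ref_url),
    }


merged = (
    parse_channel(DOC_A, "It was a sad day...", "really sad url", "gen10")
    | parse_channel(DOC_B, "It was a happy day...", "really happy url", "gen11")
)


def source_of(subject):
    """Return the source document if the subject URIref encodes one."""
    base, _, fragment = subject.partition("#")
    return base if fragment and base in (DOC_A, DOC_B) else None


descriptions = [t for t in merged if t[1] == "rss1:description"]
bag_members = [t for t in merged if t[1] == "rdf:_1"]

# Subjects labeled via rdf:about carry no trace of their document ...
description_sources = {source_of(s) for s, _, _ in descriptions}
# ... while the generated rdf:Bag labels do.
bag_sources = {source_of(s) for s, _, _ in bag_members}
```

Here description_sources contains only None, while bag_sources
contains both channel document URIs.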

The Approach
============

If you assume that this behavior is correct, then you are already a
major part of the way to having your triples traceable to the source
document without generating any more triples. (Note that Sirpac
doesn't seem to label anonymous nodes with the URI of the document;
instead, it uses "_".)

What's left are the classic RDF/XML resources that are labeled with an
rdf:about. What I am proposing is that these be considered a shorthand
for an anonymous resource that has a new property that has the
rdf:about as its value. In some ways, this is similar to the approach
that was taken by Henrik Frystyk Nielsen at WWW9 [4]. 

Rather than rdf:about being treated as a magical attribute, it becomes
a property of an RDF/XML resource. This potentially allows a
distinction between RDF/XML resources and Web resources although I'll
leave that as a separate exercise.

While we're hacking at the syntax, it would be really interesting to
unify several other aspects of RDF/XML including the distinction
between string-valued properties and XML-valued properties and between
resources and literals. I hint at that with the naming of the new
property that holds the value formerly known as the rdf:about
attribute. The property would be named rdf:aboutURI. The hint is that
you could also have an rdf:aboutLiteral. 

So here's the example above recast using rdf:aboutURI:
===================
http://controversy.com/channelA/2001/06/12.rdf

<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item>
 <rdf:aboutURI rdf:resource="http:.../12MCVE.html"/>
 <rss1:description>It was a sad day...</rss1:description>
 <contro:theRefs>
   <rdf:Bag>
      <rdf:li rdf:resource="really sad url" />
   </rdf:Bag>
 </contro:theRefs>
</rss1:item>
===================
http://controversy.com/channelB/2001/06/12.rdf

<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item>
 <rdf:aboutURI rdf:resource="http:.../12MCVE.html"/>
 <rss1:description>It was a happy day...</rss1:description>
 <contro:theRefs>
   <rdf:Bag>
      <rdf:li rdf:resource="really happy url" />
   </rdf:Bag>
 </contro:theRefs>
</rss1:item>
===================

In the serialization below, the processor has generated URIrefs using
the baseURI of the source documents for all the anonymous nodes (which
now include the formerly explicit rdf:about labeled nodes).

===================
<rdf:RDF ns-decls
xmlns:contro="http://www.controversy.com/controVocab/">
...
<rss1:item rdf:about="http:.../channelA/2001/06/12.rdf#gen01">
 <rdf:aboutURI rdf:resource="http:.../12MCVE.html"/>
 <rss1:description>It was a sad day...</rss1:description>
 <contro:theRefs>
   <rdf:Bag rdf:about="http:.../channelA/2001/06/12.rdf#gen02">
      <rdf:li rdf:resource="really sad url" />
   </rdf:Bag>
 </contro:theRefs>
</rss1:item>
<rss1:item rdf:about="http:.../channelB/2001/06/12.rdf#gen01">
 <rdf:aboutURI rdf:resource="http:.../12MCVE.html"/>
 <rss1:description>It was a happy day...</rss1:description>
 <contro:theRefs>
   <rdf:Bag rdf:about="http:.../channelB/2001/06/12.rdf#gen02">
      <rdf:li rdf:resource="really happy url" />
   </rdf:Bag>
 </contro:theRefs>
</rss1:item>
===================
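
As a rough illustration of what a processor would have to do, here is
a Python sketch of the relabeling step. It is a sketch under stated
assumptions, not a definitive implementation: triples are plain
tuples, relabel is a hypothetical helper, the genNN fragment format
simply mirrors the example above, and objects that point at a labeled
URI are left untouched (the proposal does not settle that question).

```python
from itertools import count

ABOUT_URI = "rdf:aboutURI"  # the property name proposed in this message


def relabel(triples, doc_uri, labeled_subjects):
    """Replace each rdf:about-labeled subject with a document-local URIref.

    labeled_subjects is the set of URIs that appeared in rdf:about
    attributes in the document at doc_uri.
    """
    counter = count(1)
    local = {}  # original rdf:about URI -> generated URIref

    def node_for(uri):
        if uri not in local:
            local[uri] = f"{doc_uri}#gen{next(counter):02d}"
        return local[uri]

    out = []
    for s, p, o in triples:
        if s in labeled_subjects:
            s = node_for(s)  # subject becomes a generated, per-document node
        out.append((s, p, o))
    # One extra triple per relabeled resource records the original URI.
    for uri, node in local.items():
        out.append((node, ABOUT_URI, uri))
    return out


ARTICLE = "http://www.nytimes.com/2001/06/12/national/12MCVE.html"
DOC_A = "http://controversy.com/channelA/2001/06/12.rdf"
DOC_B = "http://controversy.com/channelB/2001/06/12.rdf"

from_a = relabel([(ARTICLE, "rss1:description", "It was a sad day...")],
                 DOC_A, {ARTICLE})
from_b = relabel([(ARTICLE, "rss1:description", "It was a happy day...")],
                 DOC_B, {ARTICLE})
```

After relabeling, from_a and from_b share no subjects, so joining them
in one store keeps each description attached to a node whose URIref
names its source document.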

There are some issues with this approach. One is that you are
generating another triple for each resource labeled using an rdf:about
in the source document. Another is that systems that directly join on
the subject of triples won't work. OTOH, they have a very
straightforward work-around of doing the join on a property rather
than on the subject. In XSLT terms, it would be something like:

<xsl:variable name="joinlist" 
              select="*/rdf:aboutURI[@rdf:resource=$joinURI]/.." />

rather than: 

<xsl:variable name="joinlist" 
              select="*[@rdf:about=$joinURI]" />
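
The same two joins can be sketched against a triple store rather than
an XML serialization. This Python version is only an illustration: the
store is a hypothetical list of tuples, hand-written to resemble the
relabeled serialization shown earlier, and both function names are
made up for this example.

```python
ARTICLE = "http://www.nytimes.com/2001/06/12/national/12MCVE.html"
DOC_A = "http://controversy.com/channelA/2001/06/12.rdf"
DOC_B = "http://controversy.com/channelB/2001/06/12.rdf"

# Hypothetical store holding both channels after relabeling.
store = [
    (DOC_A + "#gen01", "rdf:aboutURI", ARTICLE),
    (DOC_A + "#gen01", "rss1:description", "It was a sad day..."),
    (DOC_B + "#gen01", "rdf:aboutURI", ARTICLE),
    (DOC_B + "#gen01", "rss1:description", "It was a happy day..."),
]


def join_on_subject(triples, join_uri):
    """The old join: select triples whose subject is the URI itself."""
    return [t for t in triples if t[0] == join_uri]


def join_on_about_uri(triples, join_uri):
    """The work-around: find nodes via rdf:aboutURI, then their triples."""
    nodes = {s for s, p, o in triples if p == "rdf:aboutURI" and o == join_uri}
    return [t for t in triples if t[0] in nodes]


old_hits = join_on_subject(store, ARTICLE)
new_hits = join_on_about_uri(store, ARTICLE)

# Each description can now be traced to the document that stated it.
descriptions_by_doc = {
    s.partition("#")[0]: o
    for s, p, o in new_hits
    if p == "rss1:description"
}
```

The subject join finds nothing in the relabeled store, while the
property join returns the triples from both channels, each still
attached to a node that names its source document.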


[1] http://lists.w3.org/Archives/Public/www-rdf-logic/2000Nov/0112.html
[2] http://www.w3.org/2000/03/rdf-tracking/#rdfms-contexts
[3] http://www.w3.org/2000/03/rdf-tracking/#rdfms-reification-required
[4] http://www.ilrt.bris.ac.uk/discovery/2000/08/www9-slides/henrik/

Received on Tuesday, 12 June 2001 01:04:28 UTC