Re: Polyphasic Knowledge Representation, Named graphs, quads, quints, K-arity. was: Re: statements about a graph (Named Graphs, reification) from editor@content-wire.com on 2007-09-07 (semantic-web@w3.org from September 2007)

From: <editor@content-wire.com>
Date: Fri, 7 Sep 2007 16:53:51 +0700
To: "Stephen D. Williams" <sdw@lig.net>, "Michael Schneider" <schneid@fzi.de>
Cc: "Bijan Parsia" <bparsia@cs.man.ac.uk>, "Richard Cyganiak" <richard@cyganiak.de>, "K-fe bom" <u9x3n_15so@hotmail.com>, <semantic-web@w3.org>, "Jenn Sleeman" <jsleeman@redpebble.com>, <linux@anthonynassar.com>, "Jim Hoover" <jhoover@wellslanders.com>, <dminorfugue@gmail.com>, "Bruce Israel" <israel@tux.org>, "Behling, Josef" <jbehling@HPTI.com>, "Chuck Bell" <olias_sunhillow@hotmail.com>, "Michael Gray" <gray@american.edu>, "Ryszard S. Michalski" <michalski@mli.gmu.edu>
Message-ID: <00f301c7f135$0c3ea020$b30a010a@waralak>
Stephen

I came across the idea of quadruples (OK, quads) when reading the draft of Dr Azamat's book, specifically Chap X.
It did sound like a strong argument,
and I am glad to see it discussed below.
Azamat, maybe you could share a relevant snippet with the list?

cheers

Paola Di Maio

  ----- Original Message ----- 
  From: Stephen D. Williams 
  To: Michael Schneider 
  Cc: Bijan Parsia ; Richard Cyganiak ; K-fe bom ; semantic-web@w3.org ; Jenn Sleeman ; linux@anthonynassar.com ; Jim Hoover ; dminorfugue@gmail.com ; Bruce Israel ; Behling, Josef ; Chuck Bell ; Michael Gray ; Ryszard S. Michalski 
  Sent: Friday, September 07, 2007 1:59 AM
  Subject: Polyphasic Knowledge Representation, Named graphs, quads, quints, K-arity. was: Re: statements about a graph (Named Graphs, reification)


  I agree with most or all of your reasoning below.
  Early this year, I was talking to Tim BL in Boston just before the Semantic Web interest group meeting and my main question was: Why Triples and not Quads?  His immediate response was that they are quads, just not explicit in the typical syntaxes, except N3 where you can (re)state a triple as the subject of another triple, thereby meta-referencing it.  (This is still ambiguous, as noted in: [1].)

  In my mind, this type of quad, the idea of named graphs, and the use of an RDF document's URL/URI as the ID of the resulting graph are all the same or overlapping concepts with a little semantic sugar.  Triples are always quads where the statement "handle" is implicit.  More clearly, there are two implicit things about a triple: the identity of the triple (which, traditionally I think, is most clearly represented by the complete value of that triple) and the context of that triple.  In that sense, statements are actually _quints_.  There are many reasons to make statements, or otherwise draw conclusions, based on the identity and context of a triple, yet there is no easy way to do this in many cases and fewer ways to interchange this effectively.

  This has to be fixed, sooner or later.  I understand that it has taken time to absorb and react to the first steps of the knowledge representation capabilities and implications of the Semantic Web / RDF / OWL work.  We now are increasingly bumping into the limitations of simple triples.  Reification, meta-chains of statements, and (worst of all) one-for-one mapping statements can all technically solve parts of the "advanced" problems encountered in the real world, but they are all very clumsy in practice and make search and traversal needlessly complex.

  In some of the work I do, I need to solve problems that RDF/OWL/etc. are seemingly perfect for, except that I need the following:

    a. Statements versioned by time (all versions in the same knowledge base (KB), and the ability to reason over them by time) with both happened-at and known-by timestamps.
    b. Provenance for statements and contexts, including various measures of likelihood, trust, probability.
    c. Security levels, ownership, ACLs, etc.
    d. Dependency: derived-from chains for tracking, explaining, and cleaning up after (i.e. retraction / knowledge maintenance) automated reasoning engines.
    e. Alternate versions of statements / properties from different provenance or even different likelihoods or theories from the same source.
    f. Views of subsets of large KBs of this data, including flat temporal, series temporal, security policy, viewpoint/provenance filtering/merging, etc.
    g. The ability to generate, share, and efficiently make use of a "delta" or stack of "deltas" between a parent document / KB and updates.  Ideally, this or a similar mechanism would allow rapid access to the result of combining many clumps that resulted in a particular view.

  The resulting views are slices through the KB which can be thought of as planar in a "horizontal", point in time, or "vertical", over a period of time, direction through clumps of statements and their versions.  The slices themselves can be simple RDF or something of higher K-arity.  K-arity refers to the degree and type of data beyond K3=RDF triples.

  Minimally, explicit quads would be a huge improvement, while implicit quads would still exist in certain contexts.  A (locally or globally) unique statement ID allows concise triples rather than reification and a handle to indicate any provenance, context/group/URI membership, etc.  Versioning with quads is doable as a new quad could have a statement pointing to the old or alternate versions.  This is somewhat unsatisfying because it would require analysis and maintenance to make changes that should be simple "insert this triple-plus-timestamp" which would, in most cases, logically replace the old version.  One option is to reuse the same statement ID with a different timestamp (or provenance or other K-arity attribute) and different content.  A flat view sees only a single version now or at a particular point in time.
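
The "same statement ID, different timestamp" option and the flat view can be illustrated with a minimal Python sketch. It assumes a single known-at timestamp per version; the `Statement` tuple, the `flat_view` function, and the sample `ex:` data are hypothetical names for illustration and are not taken from any RDF library.

```python
from typing import Dict, Iterable, NamedTuple


class Statement(NamedTuple):
    stmt_id: str    # statement handle (the fourth slot of a quad)
    subject: str
    predicate: str
    obj: str
    known_at: int   # time at which this version was recorded (assumed: a year)


def flat_view(statements: Iterable[Statement], at: int) -> Dict[str, Statement]:
    """Return, per statement ID, the latest version recorded at or before `at`."""
    view: Dict[str, Statement] = {}
    for st in statements:
        if st.known_at > at:
            continue  # this version was not yet known at the requested time
        current = view.get(st.stmt_id)
        if current is None or st.known_at > current.known_at:
            view[st.stmt_id] = st
    return view


# Inserting a new version is a plain append: "s1" is reused with a later timestamp.
kb = [
    Statement("s1", "ex:alice", "ex:worksFor", "ex:acme", 2001),
    Statement("s1", "ex:alice", "ex:worksFor", "ex:initech", 2005),
    Statement("s2", "ex:alice", "foaf:knows", "ex:bob", 1998),
]

view_2003 = flat_view(kb, 2003)
view_2007 = flat_view(kb, 2007)
```

In this sketch no maintenance statements pointing at old versions are needed: the newer "s1" logically replaces the older one in any view taken at or after 2005, while a view at 2003 still sees the original.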

  A full-blown representation might have statements that include, in addition to S-P-O: statement identity, context, both timestamps, provenance id/context, security context, and dependency context:  K10.  Many of these might point to a node that might link to many values and in turn be shared by many statements.  Some part of the time, that may be desirable.  In some applications, however, these sometimes fundamental meta-properties of a statement are used pervasively and are cumbersome if they don't have special status.  Queries and results could be greatly simplified if filtering were done in layered and mostly automatic ways and results were simplified into key statements with most metainformation being more subtly managed and represented.  This can all be done, technically, with triples and reification.  In practice, however, both in memory (during queries, responses, iteration, and other operations) and for interchange, it seems much better for key pervasive metainformation to have standard ontology / slots.  This could possibly be managed as a combination of tuples and context graphs (which commonize the shared metainformation to reduce per-statement K-arity).  I have some SPARQL extensions designed that work well with time for instance, greatly simplifying certain knowledge filtering constraints.
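
One possible reading of the K10 layout, and of context records that commonize shared metainformation, is sketched below in Python. The slot names, the `K10Statement` class, and the `commonize` function are assumptions made for illustration; the message itself does not define a concrete structure.

```python
from dataclasses import dataclass, fields
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class K10Statement:
    # K3: the ordinary RDF triple
    subject: str
    predicate: str
    obj: str
    # The seven further slots named in the text (slot names are hypothetical)
    stmt_id: str
    context: str
    happened_at: Optional[int] = None
    known_at: Optional[int] = None
    provenance: Optional[str] = None
    security: Optional[str] = None
    dependency: Optional[str] = None

    def to_k3(self) -> Tuple[str, str, str]:
        """Project the statement down to a plain triple."""
        return (self.subject, self.predicate, self.obj)


def commonize(statements: List[K10Statement]):
    """Factor shared context/provenance/security/dependency into context records.

    Returns (contexts, slim), where `slim` holds 7-slot tuples that refer to a
    shared context record by ID instead of repeating its four values.
    """
    contexts: Dict[tuple, str] = {}
    slim = []
    for st in statements:
        key = (st.context, st.provenance, st.security, st.dependency)
        ctx_id = contexts.setdefault(key, f"ctx{len(contexts)}")
        slim.append((st.subject, st.predicate, st.obj, st.stmt_id,
                     ctx_id, st.happened_at, st.known_at))
    return contexts, slim


arity = len(fields(K10Statement))

a = K10Statement("me:alice", "foaf:knows", "he:bob", "s1", "ctx:home",
                 known_at=2007, provenance="ch:charly")
b = K10Statement("me:alice", "foaf:name", "Alice", "s2", "ctx:home",
                 known_at=2007, provenance="ch:charly")
contexts, slim = commonize([a, b])
```

Here both statements share one context record, so the per-statement arity drops from 10 to 7, and `to_k3` shows the conversion back to plain triples.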

  I call this set of requirements the "Polyphasic Knowledge Representation Problem" and my partial solutions "Polyphasic Knowledge Representation" (PKR).  (I'm open to a better name if you can summarize better.  "Polyphasic" seems like a good physics analogy where different versions and provenances of overlapping information are available in overlapping "phases" of knowledge.  Some people think it's a little too Trek-kitsch.)  Many of these may seem special-case or "advanced" to many, but I feel this is where things are going.  It is not hard to find direct use in a lot of this availability of data and metadata for various businesses including retail analysis, credit/banking, research, sales tracking and analysis, etc.

  Additionally, I have been active in the area of efficient (both size and processing) XML interchange and representation.  This has been the topic of the Binary XML (now completed) and Efficient XML Interchange [2] (now in progress) working groups.  As I am now defining an efficient RDF interchange and representation, the question of what is actually needed for an "advanced" and efficient solution provides key requirements.  The K-arity PKR effective structure of knowledge, where K={3-10}, seems to cover it.  Is there a good, strong argument against this kind of representation, given that conversion to or through K3 should be possible?

  Additionally, part of my thinking and work, but not the XBC or EXI working group consensus, is the idea of a type of format that is directly and randomly accessible _and_ modifiable in place in a reasonably efficient way, in addition to support for low-level deltas and stable virtual pointers.  Knowledge representation for high performance applications is the application that led to those concepts in the first place.

  Comments and interest are welcome.  I could use suggestions on solution ideas and best venues to publish papers.

  [1] http://www.w3.org/TR/rdf-primer/#reification
  [2] http://www.w3.org/TR/2007/WD-exi-20070716/

  sdw

  Michael Schneider wrote: 
[sorry, this has again become a very long mail]

Hi, Richard and Bijan!

  -----Original Message-----
  From: semantic-web-request@w3.org
  [mailto:semantic-web-request@w3.org] On Behalf Of Bijan Parsia
  Sent: Tuesday, September 04, 2007 6:51 PM
  To: Richard Cyganiak
  Cc: Michael Schneider; K-fe bom; semantic-web@w3.org
  Subject: Re: statements about a graph (Named Graphs, reification)

  On 4 Sep 2007, at 17:30, Richard Cyganiak wrote:

    Michael,

    On 4 Sep 2007, at 15:29, Michael Schneider wrote:

      Ok, then let's discuss more practical issues (leaving this subtle RDF
      semantics stuff to the academic world). Until now, we had the only
      usecase that someone wanted to annotate a complete RDF document,

  Sorry to be jumping in, but do you mean "in this thread"?

Yes. I tried to be at least a little on-topic. ;-)

  Because other use cases are prevalent.

      which already exist somewhere having an URI. This is certainly the
      easiest case to handle in practice.

    Yes. I think it's also by far the most common case.

  I think almost certainly not. Consider EARL:
  http://www.w3.org/TR/EARL10-Schema/

  Or annotation axioms in OWL 1.1.

  Or Swoop Change Sets (which do chunk out, so they are a little
  different).

      But there will probably often be the more demanding situation,
      where I want to make assertions about some ad hoc set of RDF
      triples, which is not yet published as a special RDF document anywhere.

    To be honest, I'm not sure that this case occurs *that* much in
    practice.

  Quite often (or will). I want to record when an axiom in my owl
  ontology has been last modified. Do I have to extract that axiom and
  publish it in a separate document?

I have been pondering a specific scenario for quite a while now,
which I have not yet seen discussed elsewhere. And I would like to know
what you think about it. I will try to present this scenario
in the form of a little story, because this will make things easier to
understand.

Assume there is Alice, who owns a homepage, which is enriched with some
additional RDF. One of the statements within her homepage is

    me:alice foaf:knows he:bob .

by which Alice tries to tell the world that she knows some other person Bob.

Now there is Charly, who is an old friend of both Alice and Bob. He knows
that Alice has known Bob since 1998. Charly also owns an RDF'ed homepage, and so
he would like to make this knowledge explicit by stating something like

    "Alice knows Bob" dc:date 1998 .

Charly does not have access to Alice's homepage, so he cannot just put this
statement into Alice's triple store, or even adjust Alice's
foaf:knows-triple into some n-tuple. But even if he could, he would not
like to do this: It is actually he who asserts this statement, so this
information should really go into his own triple store. But what he wants
to ensure in any case is that this statement is "visible" on the semantic
web. This means that if anyone (or any semantic web crawler) should stumble
over this statement, he/it should, with pretty high confidence, be able to
understand that this is really a statement which annotates Alice's
foaf:knows statement - rather than just being some arbitrary RDF triple.

Last, there is Dave. Dave has recently found Alice's homepage with her
"foaf:knows" statement within. Dave does not know Alice personally, but he
is very interested in social relationships between arbitrary people. And
more, he is interested in what others have to say about such social
relationships. :) So he wonders if there are any additional statements about
Alice's foaf:knows statement anywhere on the Semantic Web. Dave has already
installed a copy of the Semantic Web Client Library [1], so he has at least
a good chance to have access to some larger portions of the SemanticWeb
(let's suppose for a moment that we are already a few years in the future
from now, where there is already satisfying linking between existing data).
Now, what SPARQL query should he execute? He wants to find as many
assertions about Alice's foaf:knows statement as possible, but he also
wants to avoid too many false positives, of course.

So, this example demonstrates the scenario. There are on the one hand
parties (the Alices) which create information on the SemWeb, encoded in
triple form. There are other parties (the Charlies) wanting to create
annotations for these triples in separate stores. These parties are
interested in having their stored annotations encoded in a searchable way.
And there are again other parties (the Daves) which like to search for such
triple annotations.

Now, the above example is a little oversimplified, I admit. But it is not
hard for me to imagine professional mashup services ("Charly 2.0" :)), which
crawl the whole Semantic Web for triple data of a specific kind (e.g. social
relationships), and then enrich this found data by additional annotations.
This will provide quite new views on the original data. For these mashup
services it will be of utmost importance that their triple annotations will
be effectively searchable. And then, there will also be general SemanticWeb
search services (the professional Daves). The value of these search services
will increase greatly for their users if these services also take the triple
annotations of the diverse mashup services into account.

So, there are two questions here, which turn out to be closely related:

  * How should triple annotations be encoded on the public Semantic Web, so
that they can easily be detected, and identified to really be triple
annotations?

  * What should queries for triple annotations look like in the Semantic Web?

First, it is clear that if Charly uses some special custom method to encode
his triple annotations, there will be no realistic prospect that his data
will be found. "Custom reification" methods can be completely reasonable for
use within specific applications, or for closed user groups. But for
a searcher like Dave, who wants to broadly query the whole SemanticWeb for
data created by possibly lots of different, unknown, and unrelated parties,
this is certainly not an option. And even if Dave really is going to
include specialized encoding schemes in his query, these will only be
the published schemes of very important parties. So no hope then for Charly
(and many other normal users or "small players" in the Semantic Web) to get
their data found.

So what will happen in such a situation? If no standard encoding scheme
already exists, there will probably emerge a few encoding schemes, rapidly
introduced by some first-to-market organisations (simply because these orgs
need such a scheme AFAP), and everyone else will then use these few schemes.
And after some years of usage, the W3C would step in making a standard based
on those encoding schemes which have survived until then.

But in the case of RDF, I think that people will rather adopt RDF
reification, for several reasons:

  * It's already there, ready for use, and it's part of the official RDF
standard.

  * It is just more triple data, so it can simply be put into the existing
triple stores. And every RDF aware software out there will be able to handle
this kind of data out of the box.

  * It seems reasonably easy to understand and use for non-expert people (I
have experienced this, when I tried to explain RDF reification to a complete
RDF novice).

  * There is existing tool support (like in Topbraid Composer [2])

  * At least in the beginning, Charly will probably think: "Well, whoever
searches for triple annotations will certainly at least come up with the
idea of searching for rdf:Statements. I don't have any clue what else he
will search for, so I will use RDF reification for my encoding. This will be the
safest path, if any." I would call this line of argument a "maximum likelihood
estimation". :)

  * And Dave will think: "Well, at least I should search for rdf:Statements,
because this will be the first thing people think of when they encode
their triple annotations." Again some maximum likelihood estimation.

And a corresponding SPARQL query is pretty simple:

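      # prefixes rdf:, foaf:, me: and he: are assumed to be declared
      construct { $stmt $p $o }
      where { $stmt a rdf:Statement; rdf:subject me:alice;
                    rdf:predicate foaf:knows; rdf:object he:bob .
              $stmt $p $o . }   # binds $p, $o to every property of $stmt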

Well, not nice, but it works for Dave, and that is the important point.
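
To make the two sides of this concrete, here is a minimal Python sketch of what Charly would publish and what Dave's search does, using plain tuples instead of a real triple store. The `reify` and `annotations` functions and the `ch:` names are hypothetical; only the `rdf:Statement`, `rdf:subject`, `rdf:predicate` and `rdf:object` terms come from the RDF reification vocabulary.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]


def reify(stmt: str, s: str, p: str, o: str) -> List[Triple]:
    """Describe the triple (s, p, o) with the RDF reification vocabulary."""
    return [
        (stmt, "rdf:type", "rdf:Statement"),
        (stmt, "rdf:subject", s),
        (stmt, "rdf:predicate", p),
        (stmt, "rdf:object", o),
    ]


def annotations(store: Iterable[Triple], s: str, p: str, o: str) -> Set[Triple]:
    """Return every triple whose subject is a reification of (s, p, o)."""
    triples = set(store)
    subjects = {t[0] for t in triples}
    # A node counts as a match only if all four reification triples are present.
    nodes = {n for n in subjects
             if all(t in triples for t in reify(n, s, p, o))}
    return {t for t in triples if t[0] in nodes}


# Charly's store: the reified statement, his annotation, and an unrelated triple.
charly_store = reify("ch:stmt1", "me:alice", "foaf:knows", "he:bob") + [
    ("ch:stmt1", "dc:date", "1998"),
    ("ch:charly", "foaf:knows", "me:alice"),
]

found = annotations(charly_store, "me:alice", "foaf:knows", "he:bob")
```

Like the construct query, `annotations` returns the four reification triples together with whatever else is said about the statement node, here the `dc:date` annotation, and ignores the unrelated triple.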

And anticipating one of the most likely objections to my argument: I don't
believe that any of the "ordinary semantic web users" out there, who are
actually interested in putting triple annotations into the SemWeb or
searching for them, will really be interested in debates about
"non-existing" or "broken semantics" of RDF reification. I, personally, like
such debates, but this is in the end just ivory tower bosh. So I won't
bother these people with questions like: "Hey, don't you know that talking
about the insertion date of a triple into an RDF store is something
semantically completely different, than talking about the date since Alice
knows Bob?" These people do not need the academic world to provide them
lessons in philosophy. :) What they really need from the academic community
is a pragmatic tool that serves their needs, so they can start to do their
most important job: Filling the SemWeb with content! And RDF reification
actually provides such a tool, when it is only regarded as a common
vocabulary, which makes it technically possible to associate an URI to some
RDF triple. (Sorry, this paragraph has gone a little flamy, but I really
couldn't resist. ;-))

The third candidate is NamedGraphs. But in order to estimate if this
approach can be used for the above scenario, I need to know more about it.
This was the reason why I asked in my last mail "How do named graph data get
published into the Semantic Web?". If it is (with reasonable effort) possible
for instance to search for the URIs of all NamedGraphs of the form

     :g { me:alice foaf:knows he:bob }

then NamedGraphs work as well as Reification for this purpose,
because I can then, in a second step, query for all those triples in the
SemWeb which have the found NamedGraph's URI as their subject. And
NamedGraphs would bring the big advantage that they can talk
about more than a single triple (though I have difficulty seeing how this
helps me in my usecase above. Perhaps other people will be able to find an
example where searching for annotated "multi-triples" would really make
sense).
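
The two-step lookup can be sketched in Python as follows, assuming a dataset is simply a mapping from graph name to a set of triples. The `graphs_matching` and `statements_about` functions and the `ch:`/`ex:` graph names are hypothetical, and how such named graphs would actually be discovered on the Web is exactly the open question raised above.

```python
from typing import Dict, FrozenSet, Set, Tuple

Triple = Tuple[str, str, str]
Dataset = Dict[str, FrozenSet[Triple]]  # graph name -> triples in that graph


def graphs_matching(dataset: Dataset, pattern: Set[Triple]) -> Set[str]:
    """Step 1: names of graphs whose content is exactly `pattern`."""
    return {name for name, graph in dataset.items() if set(graph) == pattern}


def statements_about(dataset: Dataset, names: Set[str]) -> Set[Triple]:
    """Step 2: triples, in any graph, whose subject is one of `names`."""
    return {t for graph in dataset.values() for t in graph if t[0] in names}


dataset: Dataset = {
    "ch:g1": frozenset({("me:alice", "foaf:knows", "he:bob")}),
    "ch:meta": frozenset({("ch:g1", "dc:date", "1998")}),
    "ex:other": frozenset({("me:alice", "foaf:name", "Alice")}),
}

names = graphs_matching(dataset, {("me:alice", "foaf:knows", "he:bob")})
hits = statements_about(dataset, names)
```

Step 1 uses exact equality because the text asks for graphs "of the form" of the single triple; replacing it with a subset test (`pattern <= set(graph)`) would also match larger graphs that merely contain the triple, which is the "multi-triple" case.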

But we must not conceal that NamedGraphs have a serious disadvantage in
comparison with Reification anyway: NamedGraphs are not a standard. And if
this approach does not get into RDF, or at least into common use, very soon,
it will possibly lose its chance to become a player at least in the above
scenario.

/This/ will of course only be a topic /if/ the above scenario is relevant at
all. Because my whole argumentation pro RDF reification depends on the
estimation, that the above scenario is a really relevant usecase (of course
with mashup and search services instead of Charlies and Daves :)). If this
is not the case, then I won't speak for RDF reification any longer, because
I then see no real use for it anymore. (At least, until another scenario
comes to my mind ;-)).

So what do you think?


Cheers,
Michael

[1] http://sites.wiwiss.fu-berlin.de/suhl/bizer/ng4j/semwebclient/
[2] http://www.topbraidcomposer.com/

--
Dipl.-Inform. Michael Schneider
FZI Forschungszentrum Informatik Karlsruhe
Abtl. Information Process Engineering (IPE)
Tel  : +49-721-9654-726
Fax  : +49-721-9654-727
Email: Michael.Schneider@fzi.de
Web  : http://www.fzi.de/ipe/eng/mitarbeiter.php?id=555

FZI Forschungszentrum Informatik an der Universität Karlsruhe
Haid-und-Neu-Str. 10-14, D-76131 Karlsruhe
Tel.: +49-721-9654-0, Fax: +49-721-9654-959
Stiftung des bürgerlichen Rechts
Az: 14-0563.1 Regierungspräsidium Karlsruhe
Vorstand: Rüdiger Dillmann, Michael Flor, Jivka Ovtcharova, Rudi Studer
Vorsitzender des Kuratoriums: Ministerialdirigent Günther Leßnerkraus
Received on Friday, 7 September 2007 09:51:01 UTC