RE: motivation for bNodes/existentials in RDF; note for parsers from Massimo Marchiori on 2002-04-08 (www-rdf-comments@w3.org from April to June 2002)

From: Massimo Marchiori <massimo@w3.org>
Date: Sun, 7 Apr 2002 21:45:41 -0400
To: massimo@w3.org, phayes@ai.uwf.edu
Cc: bwm@hplb.hpl.hp.com, www-rdf-comments@w3.org
Message-Id: <200204080145.VAA14959@tux.w3.org>
Pat, from your reply I essentially get that I have to be more 
technical and give more details (mea culpa, but usually hard to do 
so when using email, one usually just wants to synthesize and save time ;).
So I'll try to be less "concise" (polite wording for "cryptic" ;) and expand
http://lists.w3.org/Archives/Public/w3c-rdfcore-wg/2002Apr/0045.html
(not to extreme, but with a bit more of detail).
I see you'll be in Amsterdam for WOWG, so chatting will probably be
much faster and clearer, but anyway, here's a first short expansion
on the mailing list for the record.

Motivation: this is essentially something dating back to 1999, when
studying how to formalize the "M" in M&S. The modus operandi has been just
to formally extract the "minimum" that M&S was saying, and nothing more
(so, carefully analyzing M&S *line by line* with mathematical glasses
and trying to extract the "normative minimum").

So, line-by-line analyzing the spec, that's the conclusion one can get.
First, there's big line that can be drawn, as usual, between the 
formalization of the data model, and the formalization of the semantics.
Now, I talk of an "algebraic" data model because I include in the data
model part the functional requirements about containers. Because, this
is the maximum where we can go easily without much arbitrariness, 
because you don't need to define a semantics yet. 
So anyway, the data model is a normal one (essentially like MT, 
although even here some choices have to be made, eg in what kind
of graphs are allowed... but that's another thread ;).
Then you can impose on the data model an equational part, stating
the requirements on constraints.
I'll keep easy here and just show the easiest part, the Bag.
Struggling with ASCII, let's just indicate with 
BAG(t_1,...,t_k) a shorthand to denote a part of
an RDF graph that have a bag with components t_1,...,t_k.
Then, the fact that a bag is an unordered list of resources is
expressed using the following equivalence relation Eq:

B1:  BAG(t_1,...,t_k)  Eq  BAG(t_pi(1),...,t_pi(k))
     (where pi is any permutation in [1,k])

[historical note: old w3c members might recognize the complications
here partly stem from issue 30 of the original RDF M&S issue list, cf
http://www.w3.org/RDF/Group/Syntax/issuesd.html#c30 ,
which was at the time (1998) not accepted and closed...]

Then, every RDF application APP (formally, map from rdf graphs to your
domain of interest) should be congruent to Eq (i.e., 
G1 Eq G2 => APP(G1)==APP(G2) ). Tantamount, we can reason not on the 
simple data model any more, but on the quotient data model
(set of equivalence classes wrt Eq).
Things are slightly more complex than this actually, but that's
a good hint to see why we can go from a data model to an 
"algebraically defined" data model, so to say.
Now, the good thing of having this algebraic data model is that, 
so far, things are pretty smooth, as you don't have to define 
the semantics yet (which is the less formal part of M&S, and so
the most troublesome to distill in a sense). The equations 
express behavioural constraints: every application APP "doing something"
with an instance of the RDF data model must treat equivalent (wrt
the above equations) instances in the same way.


Then, on to the semantics part (not the "needed" one, or the 
most "useful" one, just the bare minimum (good or bad) that M&S expresses). 
Now, I won't stay here to express the whole deriving semantics, possible
choices and various subtelties (as, that would equal to write down 
a huge paper that I/nobody didn't find any time to write in years...),
so just quickly focus on the parts that matter for our discussion 
(in a simplified fashion).
An easy way to define such a semantics is to use matching graph rules:
we give a graph (extended with variables as nodes as well) and every 
time it matches a part of the graph, it triggers and subst it with
something else. The "translation", so to say, eventually produces
facts into a target (HOL) logic. Precise definition of the logic
is inessential to the present discussion (although yes, of course
there'd be much to say here too ;)
Now, here alas we only have text, so struggling with 
ASCII we can use verbatim triples (variables start with the dollar sign, 
and variables matching anonymous nodes only are denoted by $A).

Rule for containers:

  If there's a subgraph of the form:
  $A rdf:type rdf:Bag
  $A rdf:_1 $1
  $A rdf:_2 $2
  ...
  $A rdf:_n $n
  
  then produce
  Bag($1,...,$n)
(where Bag is then a predicate adequally defined, to respect the 
equational data model rules).
And, similarly for Seq and Alt.

Rule for reification:
  If there's a subgraph of the form:
  $1 $2 $A
  $A rdf:subject $S
  $A rdf:object $O
  $A rdf:predicate $P
  $A rdf:type rdf:statement

  then produce
  $2($1,$P($S,$O))

Note generalized formality sloppiness here and above (eg "$2(...)" would be 
something like [[$2]](...) where [[..]] gives the appropriate predicate, 
etc etc). Also note how how the reification rule needs the "$1 $2" part to 
trigger, formalizing the fact that reification alone says nothing.

So, the above rules just do essentially what M&S say: containers 
are lists, and reifications allow someone ($1) to make a statement ($2) 
about another statement ($P($S,$O)).

Now, after all these premises, we are finally to anonymous nodes:
where have they gone?
Line-by-line analyzing the M&S spec, you'll see that anonymous nodes 
are only used to specify containers and reifications.
This doesn't happen by chance, but just because their precise content 
is *not needed*: they disappear in the semantics (see above, no $A's !).


So, that's what I meant when talking in
http://lists.w3.org/Archives/Public/www-rdf-comments/2002AprJun/0012.html
about the extension chain and saying that anonymous nodes are not 
"first class" nodes in M&S:
Anonymous nodes don't "denote" anything, they're just syntactical 
convenience in the data model, *exactly like anonymous variables are, 
for example, in Prolog*.


Now, the whole importance of the above constructions are that, as said,
they (claim ;) constitute a (possible) "minimal interpretation" of M&S: 
are formally grounded in the M&S spec (the "M" part), line by line, and 
one can show that anything in them follows precisely from corresponding 
lines in the spec, and conversely, that any line in the spec has its 
corresponding (and, the correspondance is not stretched, but as faithful
as possible). Of course, with a possible range of choices (that is
clearer the moment you try to do actually the thing, line by line...),
but that do not modify the essential features of the resulting semantics
(in our case, treatment of anonymous nodes).

As instead the current MT differs in essential parts (see eg anonymous nodes),
I'd like to hear explanations on how "minimal" this interpretation is 
(e.g., bare-bones, compare to the previous one wrt anonymous nodes), 
as I think MT is stretching M&S, in some parts even in excellent ways, 
but still, stretching (to this extent, it'd have helped even more to see 
how containers and reification, two of the essential features of RDF, are 
dealt with). It might turn out the current MT is actually the best way 
to go (even with a stretch, eventually who cares? ;), but the analysis 
is far from trivial, and has to be done.

Thanks,
-M
Received on Sunday, 7 April 2002 21:46:45 UTC