Re: model theory for RDF/S from Pat Hayes on 2001-09-28 (www-rdf-logic@w3.org from September 2001)

From: Pat Hayes <phayes@ai.uwf.edu>
Date: Thu, 27 Sep 2001 19:42:28 -0500
To: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Cc: www-rdf-logic@w3.org
Message-Id: <p0510100fb7d954dd1a5a@[205.160.76.184]>
>I thought I might give a stab at cleaning up some of the problems that I
>see with Pat's model theory.  Here are the results for RDF (without
>reification or containers).  Note that this should be considered a
>derivative work from Pat's work, and not a completely separate model
>theory.
>
>I have several themes here.  First, I wanted to be very formal, so no
>misconceptions can arise.  To this end, I have produced a very sparse
>document.  (However, I have had to use some non-standard symbology for the
>set-theoretic relationships.  Make the obvious fixes.)  Second, I wanted
>to stick as close as possible to the ``Formal Model for RDF''.  This drove
>some of the changes from Pat's model theory.  Third, I wanted to annotate
>some of the issues and choices in the model theory.
>
>
>Syntax:
>
>URI is the collection of URI names, i.e., some collection of strings
>
>   [NB:  This ignores all aspects of the structure of URIs.]
>
>L is the collection of literals, i.e., some collection of strings,
>disjoint from URI
>
>   [NB:  This ignores non-string literals, but they wouldn't cause
>	any more problems.]
>
>   [NB: I'm sticking with graphs, even though they have some problems for
>        RDF.  Triples have their own problems with respect to RDF.]
>
>A basic untidy RDF graph,

The major utility of the graph syntax arises from the tidiness 
property, so I would prefer to build it into the definition.

>  R, is a four-tuple (that can be considered to be a
>partially node labeled, totally edge labeled, directed graph)
>		< N, E, LN, LE >
>	where N is the set of nodes in the graph
>	      LN :(partial) N -> URI u L gives labels for nodes
>	      LE :(total) E -> URI gives labels for edges
>	      E <= N' x N is the set of edges in the graph
>		where N' = { n : LN(n) in URI }

  E <= N' x N  does not allow for two arcs with different labels 
between the same two nodes. Better to have E be a set with a mapping 
into N' x N (or mappings into N' and N).

But why bother with all this? The concept of a labelled graph is 
standard and uncontroversial, so this is one place where we can say 
some mathematics in a reasonably intuitive way without sacrificing 
either precision or readability by non-mathematicians.

>   [NB: This accounts for literals not being allowed as labels of edges, nor
>	as the labels of nodes that are heads of edges, but does not
>	account for edge labels being properties.  It would be possible to
>	make the latter distinction here.]
>
>   [NB: I don't think that Pat's definition of RDF graphs has the
>        requirement that heads of edges cannot be labeled with literals.]

Right, that was a slip, as I've already acknowledged. Here is the 
suggested rewording (not yet approved by the WG):

"An RDF graph is a partially labeled directed graph satisfying the 
following conditions: arcs are labelled with URIs; nodes from which 
arcs emerge are either unlabelled, or labelled with URIs; all other 
nodes are either unlabelled, or labelled with URIs or literals; and 
distinct labeled nodes have different labels. "

>An basic untidy RDF graph is ground if LN is a total function on N.
>An basic untidy RDF graph is tidy

an untidy graph is tidy?? Yech.

>(also called a basic RDF graph) if LN is
>injective (not necessarily total).

These are the same notions as those expressed above in English, but 
using more mathematical terminology. We decided on a less 
mathematical style deliberately, in view of the likely audience. If 
re-written for professional mathematical logicians, the entire model 
theory could be described in a single page. (1.5 pages using ASCII)

>Literal Values:
>
>LV is a non-empty set, currently the set of strings. 
>XL : L -> LV, giving meaning to literals
>
>   [NB: This ignores many aspects of the intended meaning of strings,
>        such as whether the mapping is surjective or injective.]
>
>
>Interpretations:
>
>Let V be a subset of URI.
>
>   [NB: Does V have to be non-empty, does it have to be finite?

No. (Though maybe it should indeed always include rdf:type and hence 
be nonempty, as you pointed out.) Not necessarily finite, though, 
which will important later.

>So far
>        there are no such requirements, but we have to be careful that this
>        is OK, and that we don't make unnecessary requirements on V.]
>An interpretation, I, on a vocabulary V, is a four-tuple < IR, IP, IEXT, IS>
>	where IR is a non-empty set, called resources

Subtle terminological point. I didn't want to say that IR is *all* 
the resources, or that the members of IR are called 'resources' by 
definition. This word, used in this context, is a technical term of 
art which as far as I am concerned belongs to the W3C, or maybe to 
Tim Berners-Lee, or someone; but in any case not to me. So my 
position is, I'm not *defining* it here, and moreover I don't really 
know what it means; but, the M&S says that RDF is about these things, 
so the model theory goes along with that and declares that the 
universe shall consist of them, whatever they are. I tried to make 
the point clear in the introductory comments of the model theory 
document.

>	      IP <= IR, called properties
>	      IEXT : IP -> powerset ( IR x (IR u LV) )
>	      IS : V -> IR
>
>   [NB: There is no requirement that IR and LV be disjoint nor is there any
>        requirement that they be subsets of one another.   We have to be
>        careful that this is OK, and that we don't accidentally disturb this
>        (non-)relationship.]

Right, and I believe that the model theory does not disturb it.

>Interpretations of ground basic untidy RDF graphs:
>
>Let R = < N, E, LN, LE > be a ground basic untidy RDF graph.
>
>   [NB: There appears to be no harm in allowing untidy graphs here, but this
>        could easily be restricted to tidy graphs.]
>
>An interpretation I on the vocabulary V with
>	{ LN(n) | n in N } ^ URI <= V
>	{ LN(e) | e in E } <= V
>is extended to R as follows
>
>   [NB: I am being a bit vague here in just what is the domain of I.]
>
>   [NB: It is explicit here that the vocabulary of the interpretation can
>        have ``names'' that do not appear in the graph.  Pat's theory is
>        vague on this point.]

I protest, it is quite clear. An interpretation is defined on a set V 
. Of course V  may contain things that are not in the set of graph 
labels (see section 1..1, third paragraph.) In fact, there can even 
be graph labels that are not in V, and the interpretation will make 
any triples containing them false.

There is a subtle but important aspect of mathematical style here. We 
don't want to *define* an interpretation to only apply to a 
vocabulary associated with a graph, since we may well want to discuss 
whether or not (or the extent to which) one graph is true or false in 
an interpretation of a different graph. Such matters come up 
immediately when proving things about skolemization or instances or 
mergings, for example. That is why I defined an interpretation 
relative to a vocabulary rather than to a graph.

>1. if LN(n) is defined and in L then I(n) = XL(LN(n))
>
>2. if LN(n) is defined and in V then I(n) = IS(LN(n))

Seems to me to be pointless to attach the interpretations to the 
nodes when nodes and labels are 1:1 in a ground graph. There is no 
need to introduce the LN and LE mappings into the equations, and they 
clutter up the equations to no useful purpose.

>    [NB: These two give an interpretation for all nodes in R.]
>    [NB: The ``defined'' part is used later.]
>
>3. if <f,g> is in E
>    then I(E) = true if IS(LE(<f,g>)) in IP and <I(f),I(g)> in 
>IEXT(IS(LE(<f,g>))),
>         I(E) = false otherwise

Shouldn't that be I(<f,g>) =true if..... ?

The terminology LE(<f,g>) presumes that an edge can be defined as a 
pair of its endpoints, which isn't generally true. It would be better 
to define a graph as a set E of edges - intuitively triples, since 
they define their endpoints - (with LE as here) and with an EE (Edge 
Ends) mapping to a pair of nodes. Then the condition is:

3. if e is in E
     then I(e)= true if IS(LE(e)) in IP and EE(e)=<f,g> and 
<I(f),I(g)> in IEXT(IS(LE(e)))
      I(E)=false otherwise.

or possibly to have a pair of mappings S  and O from edges to nodes, 
and then one gets

3. if e is in E
    then I(e)=true if IS(LE(e)) in IP and <I(S(e)),I(O(e))> in IEXT(IS(LE(e)))
    I(E)=false otherwise

But all of these seem to me to be needlessly complex, and also 
confusing, in that the various mappings all have different meanings, 
eg in the last term, LE is a syntactic graph labelling and IS a 
semantic interpretation.

>    [NB: This cleans up the meaning of edges whose labels are not mapped to
>	properties.

It seems to me that there is nothing to be cleaned up. There is no 
need to explicitly mention whether or not IS is defined on a label; 
it it is not, then the I(e)=true if... condition is evidently not 
satisfied (since it mentions membership in a set which is the value 
of that mapping, so if the mapping has no value then there is no such 
set), so the truth-conditions establish that I(e)=false. This usage 
of 'otherwise' is normal style in stating truth-conditions, but I 
will add a sentence (and maybe an example) to the document to make 
the point clear.

Apart from the tighter use of mathematical terminology and style, the 
only significant change here is the emphasis on assigning 
interpretations to nodes rather than to their labels, which 
introduces the LN and LE mappings into the semantic equations. While 
strictly correct, I feel this is overly pedantic. The whole utility 
of the graph syntax is that the tidiness of the graph means that 
labelled nodes can be identified by their labels: each node label 
occurs on only one node. So on the labelled nodes, nodes and their 
labels are in exact 1:1 correspondence , and it is a familiar and 
conventional 'abuse' of terminology, used throughout working 
mathematics under such circumstance, to use one of the corresponding 
things to refer to the other (eg not bothering to distinguish 
notationally between a singleton set and its only member, or between 
an algebra and its underlying carrier set). Since the vocabulary *is* 
the labels rather than the nodes, to phrase the model theory in the 
way I did seems to me to be clearer and more intuitive while being 
just as precise, and I would propose to keep it that way.

If we are dealing with untidy graphs, the graph syntax has no 
structural advantages over a lexical syntax, and it would then be 
preferable to simply attach the model theory directly to the 
N-triples notation (which in an earlier version of the model theory 
is what we in fact did, but that had problems of its own.).

>It might also be possible to make graphs with such
>	edges ill-formed in some way.]

I don't like that idea, it mixes up syntax and semantics 
inappropriately. Those 'undefined' URIs may be defined in a different 
interpretation.

>4. I(R) = false if I(e) = false for some e in E,
>    I(R) = true otherwise
>
>
>Interpretations of basic untidy RDF graphs:
>
>   [NB: I'm writing out extended interpretations completely for purposes of
>        being completely pedantic.

Right. I think this style is way too pendantic for most readers. The 
simpler language in the model theory document says the same thing 
much more concisely and isn't likely to be misunderstood by anyone.

>An extended interpretation on a vocabularly V is a two-tuple
>	<<IR,IP,IEXT,IS>,A'> 
>where <IR,IP,IEXT,IS> is an interpretation on V
>   and A' is a mapping to IR.
>
>   [NB: This means that unnamed nodes have to be resources, not literals
>        that are not resources.

Right. Actually, unnamed nodes have to *denote* resources, to be very 
picky. Some literal values may well be resources, of course; the 
model theory is agnostic on this.

>This may have consequences!]

It would if we allowed literals to be instances of blank nodes, but 
in section 3.1 I was careful to only allow URIs.  (It does mean that 
one needs to be sensitive about substitution when considering 
checking entailment, but that is a computational issue; still, I 
could add a warning comment about this point, and I will do so in the 
next update.)

All this pussyfooting about literals would be greatly eased if RDF 
simply declared that all labels were URIs and that literals were a 
particular syntactic subclass of URIs with a fixed interpretation. 
Then there would be no need to introduce any syntactic distinction, 
LV could be a subset of IR, and all would be more straightforward. 
The correct treatment of literals in RDF is being discussed now, and 
it's not clear to me what the WG will decide on this issue; there are 
many agendas to be considered, not just model-theoretic elegance. But 
I am confident that the RDF model theory can be fitted onto whatever 
is decided, without serious alterations.

However, Peter, a question for a professional Description Logician: 
this would allow literals to be assigned non-literal property values 
by an RDF assertion. Wouldn't that break DAML+OIL?

>
>Let R = < N, E, LN, LE > be a basic untidy RDF graph and
>let U = { n in N | LN(n) is not defined }
>
>An extended interpretation I = <<IR,IP,IEXT,IS>,A'> on the vocabulary V with
>	{ LN(n) | n in N } ^ URI <= V
>	{ LN(e) | e in E } <= V
>	and the domain of A' = U
>is extended to R as follows
>
>0. if LN(n) is not defined I(n) = A'(n)
>
>1. if LN(n) is defined and in L then I(n) = XL(LN(n))
>
>2. if LN(n) is defined and in V then I(n) = IS(LN(n))
>
>3. if <f,g> is in E
>    then I(E) = true if IS(LE(<f,g>)) in IP and <I(f),I(g)> in 
>IEXT(IS(LE(<f,g>))),
>         I(E) = false otherwise
>
>4. I(R) = false if I(e) = false for some e in E,
>    I(R) = true otherwise
>
>Let R = < N, E, LN, LE > be a basic untidy RDF graph and
>let U = { n in N | LN(n) is not defined }
>
>An interpretation I = <IR,IP,IEXT,IS> on the vocabulary V with
>	{ LN(n) | n in N } ^ URI <= V
>	{ LN(e) | e in E } <= V
>is extended to R as follows:
>
>   I(R) = true if there is some mapping A' from U to IR such that 
><I,A'>(R) = true,
>   I(R) = false otherwise
>
>
>An interpretation I with vocabulary V is called a model of a
>basic untidy RDF graph R = <N,E,LN,LE> iff 
>1.	{ LN(n) | n in N } ^ URI <= V
>	{ LN(e) | e in E } <= V, and
>2. I(R) = true
>
>
>
>Taking care of rdf:type:
>
>A core RDF interpretation, i.e., RDF without reification or containers, is
>an interpretation over a vocabularly that includes rdf:type with the
>following extra conditions
>
>      1. IS(rdf:type) is in IP
>      2. IEXT(IS(rdf:type)) <= IR x IR

Is there any real need for condition 2 here in RDF? I hope it can be 
avoided, since it would mean that a triple

aaa rdf:type LLL .

where LLL is a literal, comes close to being a contradiction. Right 
now, all it implies is that IR and LV overlap, but if anyone were to 
ever claim that they didn't overlap, then it would be. I don't like 
having land-mines concealed in the model theory.

>
>Now for the claims:
>
>All these sorts of claims have to be backed up with theorems and lemmas
>like the ones that Pat has.  I would go even further than Pat has,
>extending to RDF query languages, at least whenever such become
>well-defined.

In an earlier draft I did indeed do this, but we decided to remove it 
since RDF query languages are not yet stable. (The comments on 
detecting entailment by variable substitution in consequents, at the 
end of section 3, are what is left of of that material, summarized. )

I agree that the model theory should cover queries eventually.

>Claim 1:
>
>A core RDF interpretation that is a model of a basic untidy RDF graph R
>contains everything (and more) that is in the intended core RDF (i.e., RDF
>without reification or containers) meaning of R.  In other words, a model
>contains all the information implicitly (or explicity) in the graph, and
>maybe more, and contradicts nothing that is implicitly (or explicitly) in
>the graph.

I'm not quite sure what this means. Can you express the notion of 
'contains all the information' in model-theoretic terms?

>
>Claim 2:
>
>A core RDF interpretation that is not a model of a basic untidy RDF graph R
>has something that is not compatible the intended core RDF meaning of R.
>In other words, a non-model is missing or incorrect on something that is
>implicitly (or explicitly) in the graph.

This is what the anonymity lemmas try to make precise.

>
>Claim 3:
>
>For every basic untidy RDF graph R there is a core RDF interpretation that
>captures exactly the closure of the intended core RDF meaning of R and that
>is a model for R.  That is, roughly, that there is a model that makes
>everything implicitly (or explicitly) in the graph true, and everything
>else false.

Right. That is the strong Herbrand lemma.

>   [NB:  This claim is not true for more-powerful representation systems.]

Oh, indeed. Part of my motivation for stating these lemmas was to 
show how different RDF is from more expressive languages. (I didn't 
feel it was appropriate to spend a lot of prose on this in the 
document, however. Anyone who knows about the other systems will 
probably find it kind of obvious in any case.)

Pat




-- 
---------------------------------------------------------------------
IHMC					(850)434 8903   home
40 South Alcaniz St.			(850)202 4416   office
Pensacola,  FL 32501			(850)202 4440   fax
phayes@ai.uwf.edu 
http://www.coginst.uwf.edu/~phayes
Received on Thursday, 27 September 2001 20:42:40 UTC