Re: Review RDF 1.1 Semantics (ED 3rd June 2013)

On Jun 12, 2013, at 5:53 AM, Antoine Zimmermann wrote:

> Pat, Peter,
> 
> 
> This is my review of RDF 1.1 Semantics. Sorry for sending this so late.
> On the plus side, I'd say that overall the presentation has been much improved: making interpretations independent of a vocabulary is a big bonus, and making D-interpretations independent of the RDF vocabulary is also much better. Putting the rules in context with the corresponding entailment regime is also good.
> 
> Now, for the main criticism, I have two outstanding problems with the current version:
> 1. D-entailment using a set rather than a mapping;
> 2. entailment of a set defined as entailment of the union.
> 
> 
> 1. D-entailment
> ===============
> 
> Concerning 1, the implication of the new definition is that, given a D, it is not generally possible to know what the valid D-entailments are.
> 
> For instance, consider D = {http://example.com/dt}. What does the triple:
> 
> <s> <p> "abc"^^<http://example.com/dt> .
> 
> D-entails? The specification does not say.

How can it? Those entailments must depend on what is known about the datatype and its value space.

> Moreover, because of the absence of a known mapping from IRIs to datatypes, there are a few ill-defined conditions:
> For instance, in Section 9, the table "Semantic conditions for datatyped literals" says:
> 
> """
> For every other IRI aaa in D, and every literal "sss"^^aaa, IL("sss"^^aaa)=L2V(I(aaa))(sss)
> """
> 
> L2V is only defined for datatypes, whereas I(aaa) is not constrained to be a datatype. Even if it were constrained to be a datatype, this would not define the value of IL("sss"^^aaa), unless aaa is one of the normative XSD datatype IRIs.

True, I will adjust the wording to cover the anomalous case.
> 
> In any case, no matter how you tweak the definitions, the application MUST have a mapping from the set of "recognised" datatype IRIs to some specific datatypes.

If the application has no such mapping then it is not able to treat literals with that type in any special way, so they get treated exactly as though they were unknown names, which is how they would be treated in simple interpretations. So it is not true that the application MUST have this mapping: only that if it does not, then the presence of this datatype IRI does not change any entailments. Which is what the "recognized" terminology is supposed to suggest.

Exactly the same point could be made regarding the 2004 specification. The semantics there referred to a datatype map, but the syntax of RDF did not provide any way to describe or denote this map. When provided with some RDF containing literals typed with http://example.com/dt, there is nothing that defines the datatype map that is supposed to be used on this RDF. And both in 2004 and here, if you are faced with an IRI typing a literal and you do not know what datatype it is supposed to be denoting, then you simply treat the literal as you would any other unknown name.

What one should do, of course, faced with a literal typed with an unknown IRI, is to use that IRI itself to try to find out what it identifies. In other words, you should try to find out what its **denotation** is, because that will be the datatype. Which is exactly what the current account of datatypes suggests you should do. 
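
To make the "recognized" idea concrete, here is a minimal sketch (all names and the representation are invented for illustration) of how an engine might fall back to treating a literal with an unrecognized datatype IRI as an opaque name:

```python
# Minimal sketch (invented names): interpreting typed literals when only
# some datatype IRIs are "recognized".

XSD_INT = "http://www.w3.org/2001/XMLSchema#integer"

# The "datatype map" is just the interpretation mapping restricted to the
# recognized IRIs; here xsd:integer's L2V is crudely modelled by int().
RECOGNIZED = {XSD_INT: int}

def interpret_literal(lexical_form, datatype_iri):
    """Return the literal's value if its datatype IRI is recognized;
    otherwise treat the literal as an opaque, unknown name."""
    l2v = RECOGNIZED.get(datatype_iri)
    if l2v is None:
        # Unrecognized: no special treatment, exactly as in simple
        # interpretations; the literal behaves like an unknown name.
        return ("unknown", lexical_form, datatype_iri)
    return l2v(lexical_form)

print(interpret_literal("42", XSD_INT))                   # 42
print(interpret_literal("abc", "http://example.com/dt"))  # opaque tuple
```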
 
> 
> 
> Later in Section 10, it says that "if D < E and S E-entails G then S D-entails G." Since no constraints are given on how to interpret "recognised" non-XSD datatype IRIs, it is possible that the same IRI in D-entailment is interpreted differently in E-entailment.

There is normative prose to ensure that this cannot happen.
> 
> In Section 11, table "RDF semantic conditions:
> 
> """
> For every IRI aaa in D, <x,I(aaa)> is in IEXT(I(rdf:type)) if and only if x is in the value space of I(aaa)
> """
> 
> This is ill-defined because the value space of I(aaa) may not exist.

In which case, x cannot be in the nonexistent value space, so the definition says that <x, I(aaa)> is not in the extension, so any triple 

xxx rdf:type aaa .

is false. 

> Again, even if I(aaa) is constrained to be a datatype, how do we know what its value space is? Therefore, the condition cannot be verified in general.

In order to know the value space of any datatype, we have to appeal to external sources. There is a presumption that datatypes are identified by IRIs, which seems reasonable in the RDF context. 

> 
> Finally, the reasons why this change has been made are unclear.

It was an editorial decision, to make the exposition simpler and easier to understand. (I would add that putting datatype maps into the semantics in the first place was also an editorial decision.) It does not materially change the semantics and does not affect any entailments. 

> The working group was not chartered to do anything about that, the workshop in 2010 did not point at all to any problems with datatype maps, and this Working Group did not discuss or complain about D being a map when the change was made. No prior discussion was attempted before making the change.
> 
> Implementations that rely on custom datatypes are interpreting the custom datatype IRIs according to one specific, known datatype, therefore, they do have a datatype *map* implemented.

Or, you could say the same thing by saying, they have a fixed interpretation of the custom datatype IRIs. Which is a natural and IMO clearer way to express the same thing. 

> There is zero motivation to make such a change.

It simplifies the exposition by removing an irrelevant and confusing complication, making the semantics (marginally) easier to follow. As it has been repeatedly asserted that understanding the arcane specification documents is the greatest barrier to RDF deployment, this is not a trivial motivation.

The 2004 mode of presentation had its problems as well. It was quite possible, for example, to consider a D which mapped xsd:string to, say, rdf:PlainLiteral. There was nothing to constrain the 'standard' IRIs to map to their "obvious" meanings, or to the meanings obtained by conventional Web methods such as following HTTP links. Richard had already noted how crazy this was in one of his blogs about a year ago, and there was considerable comment to the effect that this ought to be fixed.

> 

> 2. using union
> ==============
> This issue is different from the previous one because it does not make the definitions and propositions incorrect.
> 
> I see two problems with the new definition: first, it makes the notion of entailment in RDF different from the standard, universally accepted notion of entailment in logic.

It is completely standard to treat a set of assertions as equivalent to its conjunction.

> In general, no matter what semantics is considered, entailment is defined as follows:
> 
> """
> A set S of formulas in the language entails a formula F in the same language if and only if all interpretations that satisfy all the formulas of S also satisfy F.
> """
> 
> That's what was in RDF 2004, that's what's in OWL, that's what's in any logic with a model-theory.

The key point is that in any normal logic, these two ways of phrasing are exactly equivalent, so the choice between them is purely aesthetic. But in RDF, now that we explicitly allow two different graphs to share a bnode, this equivalence is rather trickier. Basically, we now allow in RDF a situation where we can conjoin (union) graphs both inside and outside quantifier scopes. This is not a normal situation in any conventional logical syntax, so we are to some extent on our own, and appealing to what is "normal" isn't helpful.

> There are also inconvenient consequences for manipulation of RDF graphs:
> how is it supposed to be implemented? Assume we have two representations of two graphs. How do you know what the union of the two graphs is?

First you use the scoping rules on bnodeIDs to make sure that you keep all the various blank nodes distinct. Then you take the union of the sets of triple representations. Exactly like you do now. 

> You do not have access to the bnodes, only to identifiers or locations in files or in memory. There is a rule of thumb saying "different documents, different bnodes".

I prefer to talk about identifier scopes rather than documents, as it is more precise. But yes.

> And what about RDF graphs in an in-memory model?

Again, I presume that the model somehow specifies identity of blank nodes. In which case, what is the problem? 

> What about two examples of RDF graphs in Turtle in a written article? They are in the same document, they certainly share bnodes, right?

Well, that depends upon what the text of the article says about its use of blank node identifiers, presumably, unless it is using a diagrammatic network representation, in which case identity of blank nodes can be checked visually.

Your point about not having access to the actual blank nodes applies equally to ANY definition of ANY way of combining graphs. Go back in time to 2004 and imagine two RDF/XML documents coming from independent sources which do not use the same bnodeIDs anywhere. They each describe an RDF graph. Do those graphs share blank nodes? How could you possibly tell? If they did, how would you fix that situation? You don't have access to the *actual blank nodes*, only to some bnodeIDs in some surface (document) syntax. You can just tell the abstract-syntax story and say, make sure they don't share any bnodes before combining them, as we did in 2004 (without saying HOW you were supposed to do this.) The problem is, that blanket merge-not-union rule is now wrong in some cases. The graphs really can share bnodes, now, and then merging them would lose information. If they don't share blank nodes, then merging and unioning are the same operation; if they do share a blank node, then taking the union is the correct thing to do, because the merge loses the information about the sharing. Either way, unioning is correct. 

> Now if we take the simple case when the application is able to determine that the bnodes are disjoint, how can it perform a union? The answer is that it must *separate apart* the bnode identifiers.

All this is about the blank nodes *themselves*, not the bnodeIDs. BnodeIDs are purely a surface syntax matter. There are no bnodeIDs in the abstract RDF graph syntax. And to perform a union, you just, well, take the union of two sets. What could be simpler or more obvious?

> So, while in 2004 there was a coherence between the way merge was defined and the way it has to be implemented, now there is a discrepancy between the definition and the practice.

It is (and it always was) important to keep the levels clearly distinct. BnodeIDs are part of a surface syntax and obey the scoping rules of that surface syntax. Those rules determine when two bnodeIDs identify the same bnode, and (implicitly) when they don't. Once that is all worked out, then we define operations on the graphs (not on the surface documents), and it is these operations on the graphs that we are describing here. In 2004, you could do all this surface bnodeID standardizing apart, and you could STILL be left with a situation where the actual blank nodes needed to be "standardized apart" even after all the bnodeIDs had been dealt with. Which is ridiculous, but we had to say it because there was nothing in the 2004 RDF abstract syntax model that ensured that distinct graphs from unrelated sources did not accidentally share blank nodes. We have now made that clear, so there is no need to protect against "accidental" bnode identities; and we have also made it possible for distinct graphs to genuinely share bnodes, so we need to allow this case to be handled correctly. Both of these mean that the union, rather than the merge, is the correct way to combine graphs. If two graphs really do share the same blank node, then they should stay being that blank node, because then their union faithfully represents the larger graph of which they are parts. 
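
A toy sketch may make the difference concrete (the representation here, blank nodes as Python objects and triples as tuples, is invented for illustration, not any particular API): union keeps a shared blank node shared, while a 2004-style merge renames it apart and loses the link.

```python
# Toy illustration (invented representation): blank nodes as Python objects,
# so node identity is object identity, not bnodeID strings; graphs are sets
# of triples (tuples).

class BNode:
    def __init__(self, label=""):
        self.label = label        # for display only; identity is the object
    def __repr__(self):
        return f"_:{self.label}"

def union(g1, g2):
    # Union: a shared blank node stays the very same node in the result.
    return g1 | g2

def merge(g1, g2):
    # 2004-style merge: rename g2's blank nodes apart before combining.
    fresh = {}
    def rename(triple):
        return tuple(fresh.setdefault(x, BNode(x.label + "'"))
                     if isinstance(x, BNode) else x
                     for x in triple)
    return g1 | {rename(t) for t in g2}

b = BNode("x")                     # one blank node shared by both graphs
g1 = {("ex:a", "ex:p", b)}
g2 = {(b, "ex:q", "ex:b")}

u = union(g1, g2)   # the two triples are still linked through the same node
m = merge(g1, g2)   # the link is broken: g2's copy of the node was renamed
```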

I will write a paragraph or section to try to make this all very obvious. 

> Then, once the separation apart is made to produce a representation of the union, the created graph is, by definition of union, sharing bnodes with the two original graphs. But how can the overlap of bnodes be recognised in and out of the application? One would need meta-information about the relationship between the graphs.

No need to call it meta-information, but any dataset surface notation could create a situation where two distinct named graphs in a dataset share a common blank node. I guess this would probably be handled by the dataset surface syntax rules for identifier scopes. 

> And how to represent and store that relationship?

Ask the implementors of your dataset-representing system :-)

> Also, if one wants to keep two graphs that share bnodes separate (say, they are distinct graphs in the same TriG file), then these graphs cannot be stored separately if one wants to retain equivalent inferences on the set of graphs. That is, if I have {G1,G2} such that G1 and G2 share some bnodes, storing G1 apart would create a "copy" of G1 with disjoint bnodes. The new graph, H1, would be equivalent to G1, but the set {H1,G2} would not yield the same entailments as {G1,G2}.

Right, because you would have lost some information when you performed that separation. But note: if you insist upon enforcing graph merging rather than union, you would have lost this information even when using the dataset. In fact, if we insist upon merging rather than unioning graphs, there really is no point in allowing graphs to share bnodes in a dataset, since graph operations will "un-share" them.

> Finally, the decision to replace merge with union was first put into the document without prior discussion with the Working Group, without evidence that it follows practices,

It clearly does, since it is routine to take the union of graphs. In fact, this is by far the most common operation in RDF. All entailment rules, for example, union (rather than merge) consequences into a graph, and unioning rather than merging is required inside datasets.

> without evidence that it solves known issues.

It solves an issue with the 2004 definitions that I have been noting ever since the WG began (and in fact before then).

> The notion of merge was not identified as a subject of concern during the W3C workshop in 2010. Implementations do implement RDF 2004 correctly.

I do not think that there is a single RDF implementation that implements the 2004 notion of merge. Remember, it is not talking about standardizing apart bnodeIDs (that is indeed often done) but blank nodes *themselves*. If you can find me any RDF engine that does this, I will buy you a good steak dinner.

> Conclusion:
> ===========
> More generally, any change like this is disruptive to education. If this design is standardised at the end of the year, there will be a gap between what's in the standard and what has been written for years in tutorials, courses, research papers, and so on.

Actually I think there will not be. Tutorials, courses, and implementations have all considered standardizing apart to be an operation applied to blank node *identifiers*, even when they used the blank-node terminology. This has been a source of muddle and confusion ever since the original specs were published, in fact. The current way of describing things helps to keep the situation clearer.

> 
> Considering that I see no added value compared to 2004 from both these changes, and having even identified flaws, I oppose publication of RDF 1.1 Semantics in such a state.

To reproduce the merge language from the 2004 documents, without making corrective changes elsewhere (eg to the semantic rules for bnodes) would now be an error, and would make the specifications internally incoherent. That is not an option. IMO this change is the simplest, most conventional and clearest way to correct the confusion in the 2004 specs.

> Note that the solution I propose is simple, and simpler than what is currently proposed: go back to the old design for entailment of a set of graphs and for the datatype map.
> My proposal is also less likely to trigger unsupportive comments in the Last Call phase. We cannot afford to spend more time inventing new designs.
> 
> 
> 
> Minor remarks:
> ==============
> I think there are too many sections. Simple interpretations and simple entailment can be subsections of a common section. The same for D-interpretations and D-entailment.  Same for RDF interpretations and RDF entailment; same for RDFS.

You might be right. I will try making a simplified version with this more compact presentation style throughout.

> 
> Section 3:
> """For example, RDF statements of the form:
> 
> ex:a  rdfs:subClassOf  owl:Thing .
> 
> are forbidden in the OWL-DL [OWL2-SYNTAX] semantic extension."""
> 
> -> This triple can be a valid part of an OWL 2 DL ontology. A better example would be:
> 
> ex:a  rdfs:subClassOf  "Thing" .
> 
> Moreover, perhaps a reference to OWL 2 mapping to RDF graphs [1] would be better, since [OWL2-SYNTAX] defines OWL 2 ontologies in terms of a functional syntax that does not say anything about the constraints in the RDF serialisation.

Good point, I will change the reference. 

> 
> Section 4:
> "A typed literal contains two names" -> We do not have the notion of typed literals since all literals are typed.
> "Two graphs are isomorphic when each maps into the other by a 1:1 mapping on blank nodes." -> this is very much underspecified. There are other constraints on isomorphisms.

?There are? What are they? 

> "Graphs share blank nodes ... of distinct blank nodes." -> this discussion should not be here. In fact, it should rather appear in Concepts.

I am happy as long as it appears somewhere. Several people have suggested putting some of this material into Concepts, but after having it marked as an Issue for several drafts I just gave up on it in order to move forward to LC.

> In any case, it does not belong to notation and terminology.
> "This document will often treat a set of graphs as being identical to a single graph comprising their union, without further comments." -> if my concerns above are taken into account, this should be removed. A definition of merge should be added instead. By the way, I haven't seen many sets of graphs being treated as a single graph. Actually, I think I only saw it twice. So we cannot say "often".

I will remove "often"

> 
> Section 5:
> Make it a subsection of "Simple semantics"? "Simple entailment"?
> "a function from expressions (names, triples and graphs) to semantic values:" -> what's a "semantic value"?

Good question. I will rephrase.

> "triple s p o then ..." -> why not "triple (s, p, o)" ?

Yes

> Same remark in item 4 of Section 5.2
> 
> Section 6:
> Make it a subsection of "Simple semantics"? "Simple entailment"?
> "a graph G simply entails a graph E when every interpretation which satisfies G also satisfies E, and a set S of graphs simply entails E when the union of S simply entails E" -> change this to "a set S of graphs simply entails a graph E when every interpretation which satisfies all graphs in S also satisfies E"
> Remove the Change Note.

No. See above. (There is an alternative approach, which is to define the bnode truth conditions to apply to bnode scopes rather than to graphs. An earlier edit did have this option, but it was removed after objections from PFPS; and you have argued against the notion of scope. )

> Section 6.1:
> "the inference from (P and Q) to P, and the inference from foo(baz) to (exists (x) foo(x))." -> the notation "(P and Q)" etc is rather obscure in this context.

Indeed. I had intended to remove this, and I will.

> Perhaps it would be good to present the usual First Order Logic translation of the semantics. BTW, the usual FOL translation would not be valid for entailments over a set of graphs because {FOL(G1),FOL(G2)} is equivalent to FOL(merge(G1,G2)).

But this FOL map is no longer correct when the graphs share blank nodes. In any case, this is overkill for the presentation at this point. The analogy to FOL is intended only to be helpful to some readers in passing, and I now think it is causing more harm than good. 

> The example with ex:a ex:p _:x is confusing RDF graphs and RDF documents, as well as bnodes and bnode identifiers.

Standardizing apart is purely a bnodeID concern. You do this when you might accidentally conflate bnodes coming from different sources just because the local IDs happen to coincide. But when, after you have done due care with bnodeIDs, you find that two graphs really do share an actual blank node, then (as this example tries to illustrate) you should NOT separate that node into two blank nodes, because that loses information.

I will try to find a way to make the wording clearer. 

> Then, while the naive readers would intuitively imagine that taking the union of the two triples would simply amount to putting them together, they realise that they have to "standardise apart" the bnode identifiers.

Well, as in this example, they don't always have to. But if they do, the result loses information and is no longer exactly equivalent to the graph they started with. I think this is easy to understand. The real loss of information happens when a blank node shared between two graphs is separated into two blank nodes. 

> 
> Section 7:
> "For any graph H, if sk(G) entails H then there is a graph H' such that G entails H' and H=sk(H')" -> this should rather be: "For any graph H, if sk(G) entails H then there is a skolemization sk'(H) of H such that G entails sk'(H)"

No, because sk'(H) would use different Skolem vocabulary, so sk'(H) =/= sk(H')
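
For concreteness, here is a sketch of skolemization under an invented representation (blank nodes as strings beginning with "_:"; the genid IRI scheme is merely illustrative). Each run draws fresh Skolem IRIs, which is exactly why an independent skolemization sk'(H) need not coincide with sk(H').

```python
# Sketch of skolemization (invented representation: blank nodes are strings
# beginning with "_:", and the genid IRI scheme is merely illustrative).

import itertools

_fresh = itertools.count()

def skolemize(graph):
    """Replace every blank node by a fresh Skolem IRI, consistently
    within the graph, so co-reference structure is preserved."""
    mapping = {}
    def sk(term):
        if term.startswith("_:"):
            if term not in mapping:
                mapping[term] = f"http://example.org/.well-known/genid/{next(_fresh)}"
            return mapping[term]
        return term
    return {tuple(sk(t) for t in triple) for triple in graph}

G = {("ex:a", "ex:p", "_:x"), ("_:x", "ex:q", "ex:b")}
sk_G = skolemize(G)        # both occurrences of _:x get the SAME Skolem IRI

# A second, independent skolemization draws different fresh IRIs: the two
# results use disjoint Skolem vocabularies.
sk_G_again = skolemize(G)
```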

> 
> Section 8:
> Remove the second change note, as per my concerns above.
> "datatype d refers to (denotes) the value" -> why not just say "denotes"

Yes.

> "L2V(d)(string)" -> rather, L2V(d)(sss)
> "rdf:plainLiteral" -> "rdf:PlainLiteral"

OK

> "the datatype it refers to MUST be specified unambiguously" -> yes, there MUST be a mapping from datatype IRIs to datatypes, i.e., there must be a datatype map. This is a MUST, why doesn't it appear as a constraint in the formal semantics?

The datatype map is just the interpretation mapping restricted to the vocabulary of recognized IRIs. We have to allow for the case where an IRI is used as a datatype IRI but its interpretation isn't known to us. We could make this illegal in some way, but if it's legal (IMO as it should be) then what do we say, semantically? The standard way to handle lack of information in model theory is to encode it as multiple satisfying interpretations. So we allow I(ddd) to vary, to represent the case where we don't know what ddd actually means. The trouble is, however, that the way of finding out what it means is by a mechanism which is *outside of* and *invisible to* the model theory itself. This situation does not arise in conventional logical semantics, but it is central here on the Web. So we have to appeal to conditions which are not stated and are not even expressible in the formal equations. We can't give formal truth conditions for "being a recognized datatype IRI".

If we say that I(ddd)=d for some fixed datatype d, then we have in effect said that this IRI denoting its datatype is logically necessary, a tautology. Which is incorrect and confusing. What actually determines the datatype identified by an IRI is the state of the Web, which is something altogether outside of model theory, and we don't have any formal way to refer to it. Which is why I think just leaving it be the denotation of the datatype IRI is exactly appropriate. 

> 
> Section 9:
> Make it a subsection of "D-semantics"? "D-entailment"?
> 
> Section 10:
> Make it a subsection of "D-semantics"? "D-entailment"?
> "a set S of graphs (simply) D-entails or entails recognizing D a graph G when every D-interpretation which makes S true also D-satisfies G." -> "a set S of graphs (simply) D-entails a graph G when every D-interpretation which satisfies all graphs in S also D-satisfies G."
> 
> Section 10.1:
> why not put the general rule for datatype entailment:
> """
> aaa uuu "xxx"^^ddd => aaa uuu "yyy"^^eee
> where L2V(I(ddd))(xxx) = L2V(I(eee))(yyy)
> """

Good idea.
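
The proposed rule can be illustrated with a toy sketch (the datatype names and L2V mappings below are crude stand-ins, not the real XSD machinery):

```python
# Toy illustration of the proposed datatype entailment rule (datatype names
# and L2V mappings are crude stand-ins for the real XSD machinery).

L2V = {
    "xsd:integer": int,
    "xsd:decimal": float,
}

def same_value(lex1, dt1, lex2, dt2):
    """True when the two lexical forms map to the same value, i.e. when
    aaa uuu "lex1"^^dt1 . entails aaa uuu "lex2"^^dt2 . under the rule."""
    return L2V[dt1](lex1) == L2V[dt2](lex2)

# "01"^^xsd:integer and "1.0"^^xsd:decimal denote the same value:
print(same_value("01", "xsd:integer", "1.0", "xsd:decimal"))  # True
```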

> 
> Section 11:
> Make it a subsection of "RDF semantics"? "RDF entailment"?
> 
> 
> Section 12:
> Make it a subsection of "RDF semantics"? "RDF entailment"?
> 
> Section 12.1:
> Group the rules together, as in Section 14.1
> 
> Section 13:
> Make it a subsection of "RDFS semantics"? "RDFS entailment"?
> 
> Section 14:
> Make it a subsection of "RDFS semantics"? "RDFS entailment"?
> 
> Section 15:
> "plus an optional default graph" -> the default graph is not optional, there must be exactly one

Ah, OK. 
> 
> Appendix A:
> "follows exactly the terms used in [OWL2-SYNTAX]" -> it is [OWL2-PROFILES], in Section 4.3. OWL2-SYNTAX does not rely on RDF triples

OK

> "Every RDF(S) closure, even starting with the empty graph, will contain all RDF(S) tautologies" -> not all: the closure as defined is finite, while there are infinitely many tautologies. It contains all tautologies concerning the vocabulary of the initial graph, plus the tautologies in the RDF and RDFS vocabularies.

Yes. I will rewrite this.

> 
> Appendix C:
> The proof that every graph is satisfiable does not need to introduce Herbrand interpretations, nor to build an interpretation for each graph considered. There is a single interpretation that makes all RDF graphs simply true. Consider a domain comprising only one element x. Map all IRIs and literals to x, including those used as predicates. Make the IEXT of x be the single pair {(x,x)}. This simply satisfies all graphs.

Yes, it does. But I wanted to use the Herbrand interpretation idea in another proof. Hmmm, I will think about this.
> 
> Appendix D.1:
> "The subject of a reification,/a>" -> typo

OK
> 
> Appendix D.2:
> The RDF container vocabulary should also mention rdfs:member and rdfs:ContainerMembershipProperty.

Those are in RDFS, not RDF. Their meaning requires using RDFS classes and rdfs:subPropertyOf, respectively.

Pat



> -- 
> Antoine Zimmermann
> ISCOD / LSTI - Institut Henri Fayol
> École Nationale Supérieure des Mines de Saint-Étienne
> 158 cours Fauriel
> 42023 Saint-Étienne Cedex 2
> France
> Tél:+33(0)4 77 42 66 03
> Fax:+33(0)4 77 42 66 66
> http://zimmer.aprilfoolsreview.com/
> 

------------------------------------------------------------
IHMC                                     (850)434 8903 or (650)494 3973   
40 South Alcaniz St.           (850)202 4416   office
Pensacola                            (850)202 4440   fax
FL 32502                              (850)291 0667   mobile
phayesAT-SIGNihmc.us       http://www.ihmc.us/users/phayes

Received on Friday, 14 June 2013 06:33:07 UTC