Re: UNSAID drafted and mapped to SQL from Pat Hayes on 2004-12-22 (public-rdf-dawg-comments@w3.org from December 2004)

From: Pat Hayes <phayes@ihmc.us>
Date: Tue, 21 Dec 2004 20:06:49 -0800
To: Bob MacGregor <bmacgregor@siderean.com>
Cc: bgrosof@mit.edu, "Eric Prud'hommeaux" <eric@w3.org>, public-rdf-dawg-comments@w3.org
Message-Id: <p06001f01bdee7f08fad9@[192.168.1.4]>
>This issue of UNSAID is really much bigger than SPARQL -- it affects 
>a significant
>fraction of Web users.

True. But SPARQL is one place where I, personally, have a snowball's 
chance of drawing a line in the sand. If NAF reasoning becomes 
ubiquitous in SWeb notations, then IMO the Sweb is basically already 
a failure. It will just be conventional DB technology with URIs added.

>If my understanding is correct, all of the RuleML users
>will be taking the Deductive Database / SQL mindset,  which assumes 
>closed world
>semantics.

RuleML started this way, but now has adapted to the point where the 
reasons for allowing NEG (ie classical negation) as well as NAF are 
widely understood and acknowledged. I could wish for more, but this 
is a step in the right direction.

>The key part of the discussion involves whether or not a system can 
>trust/assume
>that closed world semantics applies to some graph / RDF dataset.

Right; though I would say that the real issue is how this fact about 
a graph (or a "knowledge resource" of some kind) can itself be 
represented and communicated across the sweb. I think we agree on 
this.

>While it's
>certainly the case that nothing in RDF sanctions that assumption, only a small
>amount of machinery is needed,  namely; named graphs and a single predicate
>that can assert that a particular graph has closed-world semantics.

That would do for simple cases. In practice however I think there 
will be different kinds of closure. A graph might have all the 
information about who works for a company - all the employees - but 
only partial information about, say, their marital status; or a KB 
might be complete wrt a certain kind of triple for the items in it, 
eg a list of folk which undertakes to be complete wrt their US 
citizenship, without of course claiming to include every US citizen. 
And so on. (see later)

However, I agree that using RDF to do this task, and named graphs to 
be the primary naming device, is a smart way to go.

>  Possibly,
>the predicate would be more specific, asserting that a particular property
>(and its subproperties) are closed within the named graph.

We will need a whole vocabulary. If we can invent a 'graph closure 
ontology' that would be of some actual use, I think that would be a 
valuable contribution.

Consider a property ppp, and the fact that the graph does not contain 
a triple of the form

aaa ppp bbb .

Now, what can we conclude? That the property is false of *any* such 
aaa and bbb? Call that total closure. That would be a very strong 
kind of closure.  Eg that would not handle the case where the graph 
gives US citizenship information for all the people it mentions, but 
doesn't undertake to list all US citizens. To do that we would need a 
notion like, if the graph contains a triple whose subject is aaa, but 
it does not contain

aaa ppp bbb .

then aaa does not have the value bbb for the property ppp (but if aaa 
is not mentioned at all, then nothing follows.) Call that subject 
closure.  Obviously there is an analogous version of this for the 
object rather than the subject, but maybe that will not be so 
interesting in practice. And then there might be special versions of 
this for rdf:type, ie closure conditions for classes. The simplest 
one would be that if the graph does not contain

aaa rdf:type ccc

then aaa is not in the class ccc; but again, it might be useful to 
have a version of this restricted to the aaa's actually named in the 
graph. Call them respectively class closure and subject class 
closure. (I'm making these names up arbitrarily, and if there are 
better names already in use, then let's use those, of course.)
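
To make the four notions concrete, here is a minimal Python sketch. It is not a proposal for an implementation: the graph is assumed to be a plain set of (subject, property, object) tuples, every function name is hypothetical, and "named in the graph" is read narrowly as "occurs as the subject of some triple", which is only one possible reading.

```python
# Illustrative sketch: a graph is a set of (subject, property, object) tuples.
# All function names are hypothetical, not part of any RDF library.
RDF_TYPE = "rdf:type"


def mentions_subject(graph, s):
    """True if the graph contains at least one triple whose subject is s."""
    return any(subj == s for (subj, _, _) in graph)


def total_closure_denies(graph, s, p, o):
    """Total closure: every triple absent from the graph is taken to be false."""
    return (s, p, o) not in graph


def subject_closure_denies(graph, s, p, o):
    """Subject closure: absence counts only when s occurs as a subject."""
    return mentions_subject(graph, s) and (s, p, o) not in graph


def class_closure_denies(graph, s, c):
    """Class closure: total closure specialised to rdf:type."""
    return total_closure_denies(graph, s, RDF_TYPE, c)


def subject_class_closure_denies(graph, s, c):
    """Subject class closure: class closure restricted to subjects in the graph."""
    return subject_closure_denies(graph, s, RDF_TYPE, c)


# A tiny sample graph (made-up data).
sample = {
    ("Joe", "citizenOf", "US"),
    ("Ann", "worksFor", "Acme"),
    ("Ann", RDF_TYPE, "Employee"),
}
```

With this sample, total closure concludes that Bob is not a US citizen even though Bob is never mentioned, whereas subject closure concludes nothing about Bob but does conclude that Ann is not a US citizen, because Ann appears as a subject and the citizenship triple is absent.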

Now we could have properties of classes and properties, with graphs as values:

ppp totalClosedIn ggg
ppp subjectClosedIn ggg
ccc classClosedIn ggg
ccc subjectClassClosedIn ggg

and write little ontologies. And of course now a named-graph ontology 
can include its own closure conditions by using its own graph name. 
However I bet a widely used technique would be to have the 'closure' 
ontology import the 'bare facts' ontology, so that one accesses the 
latter 'through' the former. This allows the 'bare facts' to actually 
be a non-RDF resource with an RDF skin over it, such as a database 
with a SPARQL interface.
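
As a rough illustration of the 'closure ontology over bare facts' arrangement, the following Python sketch keeps the closure declarations in one set of triples and the facts in another, and licenses a negative conclusion only when a matching declaration is present. The predicate names are the ones made up above; the graph name "ggg", the sample data and the function name are assumptions for the sake of the example, and "named in the graph" is again read as "occurs as a subject".

```python
# Illustrative sketch: both graphs are sets of (subject, property, object) tuples.
RDF_TYPE = "rdf:type"


def licensed_negation(closure_graph, facts, graph_name, s, p, o):
    """Return True only if a closure declaration licenses 'not (s p o)'."""
    if (s, p, o) in facts:
        return False  # the triple is asserted, so nothing negative follows
    s_is_subject = any(subj == s for (subj, _, _) in facts)
    # Property-level declarations.
    if (p, "totalClosedIn", graph_name) in closure_graph:
        return True
    if (p, "subjectClosedIn", graph_name) in closure_graph:
        return s_is_subject
    # Class-level declarations apply only to rdf:type triples.
    if p == RDF_TYPE:
        if (o, "classClosedIn", graph_name) in closure_graph:
            return True
        if (o, "subjectClassClosedIn", graph_name) in closure_graph:
            return s_is_subject
    return False  # no declaration: stay open-world


# Made-up 'bare facts' and a closure ontology that talks about them.
facts = {
    ("Joe", "citizenOf", "US"),
    ("Ann", "worksFor", "Acme"),
    ("Ann", RDF_TYPE, "Employee"),
}
closure = {
    ("citizenOf", "subjectClosedIn", "ggg"),
    ("Employee", "classClosedIn", "ggg"),
}
```

Here the absence of a worksFor triple licenses nothing, since no closure is declared for worksFor, while the absence of a citizenOf triple for Ann does license the negative conclusion.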

Can you think of any other kinds of closure condition we might need? 
How about being closed but only for certain values of the object? Or 
closed over property values but only for subjects in a certain class? 
One can go on making these up for ever, and I don't have a good 
feeling for where would be a 'natural' place to stop, or if there is 
a reasonably plausible basic set which can be used to compose all the 
cases we would need.

BTW, there are some delicate issues concerning blank nodes. Suppose 
we say that ppp is subject closed, and the graph contains a triple

aaa ppp _:x

What follows? Does aaa have a ppp value? I guess we have to say it 
does, and that this is enough to escape the NAF inference. But if ppp 
is 'citizenship', then just knowing that Fred has some kind of 
citizenship hardly seems the same kind of closure as knowing what 
country Fred actually is a citizen of. (For an extreme case, imagine 
a boolean-valued property, such as criminalRecord with value 
"true"^^xsd:boolean or "false"^^xsd:boolean, and suppose we just know 
that
Joe criminalRecord _:x .
I know it's kind of dumb, but it could happen: it's legal RDF.)
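
One way to read the 'escape' rule is sketched below in Python. This is only an assumption about how a checker might behave, not a settled design: blank nodes are represented as strings beginning with "_:", the graph is a set of (subject, property, object) tuples, and the function names are hypothetical.

```python
# Illustrative sketch: subject closure that refuses the NAF inference when the
# subject has a blank-node value for the property in question.


def is_blank(term):
    """Blank nodes are modelled here as strings starting with '_:'."""
    return isinstance(term, str) and term.startswith("_:")


def subject_closure_denies(graph, s, p, o):
    """True if subject closure licenses concluding 'not (s p o)'."""
    if (s, p, o) in graph:
        return False
    # s has *some* value for p that we cannot identify; it might be o.
    has_unknown_value = any(
        subj == s and prop == p and is_blank(obj) for (subj, prop, obj) in graph
    )
    if has_unknown_value:
        return False
    # Otherwise the usual subject-closure rule applies.
    return any(subj == s for (subj, _, _) in graph)


blank_case = {("Joe", "criminalRecord", "_:x")}
ground_case = {("Joe", "criminalRecord", "false")}
```

With blank_case, neither "true" nor "false" can be ruled out for Joe, which is exactly the uninformative situation described above; with ground_case, the value "true" is ruled out.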

>On the other hand, until named graphs become officially blessed 
>(instead of just
>something that everyone recognizes would be a major step forward) this
>solution might not be viable for SPARQL.

True. The smart thing for us to do would be to anticipate that named 
graphs will appear, and design SPARQL so that it doesn't actually 
break when they do, but can be smoothly extended to handle them.

>  On the other hand, our implementation
>of SPARQL includes UNSAID already -- no sense in waiting around for
>what will inevitably arrive sooner or later..

:-)  You are probably right.

Pat

>
>Cheers, Bob
>
>Pat Hayes wrote:
>
>>Re: UNSAID drafted and mapped to SQL
>>
>>>On 2004-12-18 21:58:34 -0800, Pat Hayes wrote, at
>>>http://lists.w3.org/Archives/Public/public-rdf-dawg/2004OctDec/0534.html:
>>>
>>>>  The message that started the thread
>>>> 
>>>>  http://lists.w3.org/Archives/Public/public-rdf-dawg-comments/2004Nov/0016.html
>>>>  has an example that illustrates the point in its use case 2, the
>>>>  financial institution that must not send its prospectus to
>>>>  customers in the US or Canada. For this institution to rely on an
>>>>  UNSAID query to ensure this rule was obeyed would be very risky,
>>>>  since in general the RDF content against which the query is being
>>>>  evaluated is not known to be complete with regard to citizenship
>>>>  information. It cannot be so known, except by special access to
>>>>  off-web information, as there are currently no Web protocols for
>>>>  communicating the fact that a source is complete in this way.
>>>
>>>Indeed.  The same applies to the truthfulness of the information
>>>contained in the RDF graph, or to the trustworthiness of information
>>>about the graph's truthfulness that's transmitted inside the
>>>protocol.  That's, obviously, not a reason to declare RDF and SPARQL
>>>"very risky", and to drop them.
>>>
>>
>>There is a key difference, however. If an agency publishes some 
>>RDF/OWL content which asserts that, say, Joe is an American 
>>citizen, then the specs do indeed establish that they are asserting 
>>this, so questions of trustworthiness and responsibility for 
>>published claims can be brought into the area of rational 
>>discussion. This does not apply to negation-as-failure.  If I 
>>publish some RDF/OWL which describes some facts about citizenship 
>>but which fails to mention that Joe is an American citizen, the 
>>specs insist that I have not thereby asserted that Joe is not a 
>>citizen. If you draw that conclusion, you do so at your risk, and 
>>I, the publisher, cannot be held responsible for any consequences 
>>of that inferential act by you. It would be a dangerous (IMO) 
>>mistake for SPARQL to imply in its design that this kind of 
>>(negation-by-failure) inference was intended or meant to be 
>>supported by an RDF or OWL reasoner; it could (will) be used to 
>>deflect responsibility for mistakes to the wrong agency.
>>
>>>The point of applying UNSAID in the way described in use case 2 is,
>>>precisely, that the graph that's queried is assumed to be
>>>sufficiently complete for the querying party's purposes.
>>>
>>
>>But that assumption is invisible on the semantic web. My point is 
>>that there is no way for a software agent to be told that a graph 
>>is 'sufficiently complete' in the required sense. (No way to 
>>transmit that using http, if you like.) And recall that the 
>>intended goal of the semantic web is to allow software agents to 
>>make rational decisions. If a designer really wants to use this 
>>kind of reasoning on a source that it knows to be complete, I 
>>believe it is quite easy to do so without having UNSAID in the 
>>querying protocol. For example, the application can explicitly 
>>query for the rejection case and reject the instance if it finds 
>>the relevant triple; then it has performed an invalid inference, 
>>but has done so by using valid protocols. My quarrel is not with 
>>the reasoning strategy (though I have my doubts about it) but with 
>>the incorporation of an invalid reasoning process into the querying 
>>protocols.
>>
>>A related matter. UNSAID refers simply to the absence of a triple. 
>>But RDF supports entailment of triples by other triples, and such 
>>entailments become quite complex in RDFS and extremely complex in 
>>OWL; and RDF/XML is required by the various W3C WG charters to be 
>>the interchange syntax for these more complex languages. Suppose an 
>>OWL/RDF or RDFS triple store does not contain a certain triple, but 
>>that triple can be inferred by valid OWL or RDFS reasoning from 
>>triples that it does contain. In this case, a reasoner that relied 
>>on UNSAID to implement negation-by-failure would become logically 
>>incoherent, not merely mistaken: quite simple inputs would cause it 
>>to become enmeshed in contradictions. (It might be better to have 
>>something like UNIMPLIED rather than UNSAID, particularly as an RDF 
>>graph can be reasonably taken to be 'saying' any RDF-valid 
>>consequence of itself.)
>>
>>>  The
>>>judgment whether or not this kind of assumption is "very risky"
>>>(whatever this means) is not the protocol designer's to make, but
>>>strictly a business decision made by the party that applies the
>>>protocol.
>>>
>>
>>The anticipated uses of SW technology require such decisions to be 
>>made by software, not by designers of software. Right now there is 
>>no way to transmit the necessary information to a piece of 
>>software. (I wish there were: the lack of this ability is a notable 
>>failure of the RDF/OWL effort, I now think, for which I must bear 
>>part of the responsibility.)
>>
>>>In fact, the word "complete" is ambiguous here: While a graph may be
>>>incomplete, in the sense that it lacks facts that are out there
>>>(this is the notion of "incompleteness" that you apply to use case
>>>2),
>>>
>>
>>Lacks a particular kind of fact. I agree that the notion of 
>>'completeness' here is ambiguous; that is part of the technical 
>>problem.
>>
>>>the same graph may quite well be the querying party's complete
>>>knowledge of facts at some point of time.  In this context, UNSAID
>>>also serves to help a party know what it does not know.
>>>
>>
>>I agree that is a potentially useful thing to be able to query. 
>>However, the very fact that your use cases relied on invalid 
>>reasoning (and the draft write-up explicitly mentioned invalid 
>>reasoning patterns) makes me worry that it will not be used in this 
>>way, but will almost certainly be used immediately and 
>>enthusiastically in invalid ways. And that this will produce a 
>>dangerous kind of inference-rot at a very basic layer of the 
>>semantic web.
>>
>>>Here's another use case, to illustrate this: Consider a party (say,
>>>our bank) that knows it has partial information stored in an RDF
>>>graph -- e.g., some social information (say, the grandmother's
>>>maiden name) is only associated with some of the subjects (say, of
>>>class account holder) in the graph. The party needs to collect this
>>>information for all subjects of class account holder (say, due to
>>>stricter money laundering legislation). UNSAID enables the bank to
>>>acquire the missing information from those account holders for which
>>>it is needed, and later on also enables sanctions against account
>>>holders who do not provide it.
>>>
>>
>>That is an excellent use case, I agree: using UNSAID to find out 
>>what is not said. I wish they were all like this. But is UNSAID 
>>really necessary for this? Or is it only a convenience? If it were 
>>possible to handle cases like this without using UNSAID explicitly, 
>>I would prefer that SPARQL require users to use a workaround.
>>
>>>>  If SPARQL contains UNSAID then it will be inconsistent with any
>>>>  account of meaning which is based on the RDF/RDFS/OWL normative
>>>>  semantics. This will not render SPARQL unusable, but it will place it
>>>>  outside the 'semantic web layer cake' and probably lead to the
>>>>  eventual construction of a different, and rival, query language for
>>>>  use by Web reasoners.
>>>
>>>Conversely, standardization of a too restricted version of SPARQL
>>>(e.g., one without UNSAID) will drive applications to either
>>>competing query languages, or to incompatible extensions that
>>>provide the expressivity they need.
>>>
>>
>>That would be a better outcome, IMO, than having an RDF query 
>>language in widespread use which would weaken the inferential 
>>foundations of much of the semantic web. If the basic RDF protocols 
>>do not respect the RDF semantics, then there really is no point in 
>>continuing with the semantic web effort.
>>
>>>Note that this risk is not created by specifying a full version of
>>>SPARQL, including UNSAID, and by additionally profiling some subset
>>>of it that satisfies whatever assumptions you want to be able to
>>>make.
>>>
>>
>>In an ideal world, everyone would read all the warnings in the spec 
>>and obey them rationally. However, a spec designer has to consider 
>>the real world. For example, it would be quite rational to allow 
>>blank nodes in query patterns; but we find in practice that if they 
>>are allowed, people often misuse them, or expect them to apply 
>>in ways that cannot be supported, or confuse them with query 
>>variables. So it is simpler, and better, to just not allow them, 
>>even though in some cases that requires users to express themselves 
>>more obliquely and use work-arounds. I feel strongly that UNSAID is 
>>in this category of a useful-if-you-know-exactly-how but 
>>dangerous-and-easy-to-misuse kind of a feature, and one that is 
>>better omitted than included. And I feel this way even more 
>>strongly when the email thread that suggested it, and the draft 
>>write-up of the language feature itself, both misuse it in exactly 
>>the dangerous way.
>>
>>Pat
>>
>>>Regards,
>>>--
>>>Thomas Roessler, W3C   <tlr@w3.org>
>>>
>>
>>
>>--
>>
>>---------------------------------------------------------------------
>>IHMC		(850)434 8903 or (650)494 3973   home
>>40 South Alcaniz St.	(850)202 4416   office
>>Pensacola			(850)202 4440   fax
>>FL 32502			(850)291 0667    cell
>>phayes@ihmc.us 
>>http://www.ihmc.us/users/phayes
>>
>
>--
>
>Bob MacGregor
>Chief Scientist
>
>
>Siderean Software Inc
>390 North Sepulveda Blvd., Suite 2070
>El Segundo, CA 90245 bmacgregor@siderean.com
>tel: +1-310 647-4266 fax: +1-310-647-3470
>


-- 
---------------------------------------------------------------------
IHMC		(850)434 8903 or (650)494 3973   home
40 South Alcaniz St.	(850)202 4416   office
Pensacola			(850)202 4440   fax
FL 32502			(850)291 0667    cell
phayes@ihmc.us       http://www.ihmc.us/users/phayes
Received on Wednesday, 22 December 2004 04:08:18 UTC