Re: Distributed querying on the semantic web from Peter F. Patel-Schneider on 2004-04-20 (www-rdf-interest@w3.org from April 2004)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Tue, 20 Apr 2004 11:45:31 -0400 (EDT)
To: pdawes@users.sourceforge.net
Cc: www-rdf-interest@w3.org
Message-Id: <20040420.114531.26222400.pfps@research.bell-labs.com>
> Hi Peter,
> 
> [My Ramblings snipped - see rest of thread for info]
> 
> Peter F. Patel-Schneider writes:
>  > 
> [...]
>  >
>  > Well, yes, but I don't think that the scheme that you propose is workable
>  > in general.  Why not, instead, use information from the document in which
>  > the URI reference occurred?  I would claim that this information is going to
>  > be at least as appropriate as the information found by using your scheme.
>  > (It may, indeed, be that the document in which the URI reference occurs
>  > does point to the document that you would get to, perhaps by using an
>  > owl:imports construct.  This is, to me, the usual way things would occur,
>  > but I view it as extremely important to allow for other states of affairs.)
>  >
> 
> Unfortunately, most of the RDF I consume doesn't contain this
> contextual linkage information (or even appear in well formed
> documents). Take RSS1.0 feeds for example: If there's a term I don't
> know about, the RSS feed doesn't contain enough context information
> for my SW agent to get me a description of that term.

Yes, this is a definite problem with some sources - they use terms
without providing information about their meaning.  Such sources are
broken and, in my view, violate the vision of the Semantic Web.

How, then, to do something useful in these situations?  A scheme that
goes to a standard location (namely the document accessible from the
URI of the URI reference) is probably no worse than any other.
However, it should always be kept in mind that this scheme incorporates
a leap of faith: faith that the standard document has information about
the term; faith that the standard document has usefully-complete
information about the term; faith that the document using the term is
using it in a way compatible with the information in the standard
document.  Each of these leaps of faith can be counter to reality
and, worse, they can be counter to reality in undetectable ways.
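
A minimal sketch of the "standard location" lookup described above,
assuming that the standard document for a term is simply the term's URI
reference with any fragment removed.  The example.org names and the
mentions_term helper are hypothetical and are not part of any proposal
in this thread.  Only the first leap of faith (that the document says
anything at all about the term) is checked here; completeness and
compatibility of use are not checked.

```python
from urllib.parse import urldefrag

Triple = tuple[str, str, str]


def standard_document_uri(uri_reference: str) -> str:
    """Return the URI reference with its fragment (if any) removed."""
    document_uri, _fragment = urldefrag(uri_reference)
    return document_uri


def mentions_term(triples: list[Triple], term: str) -> bool:
    """Check only that some triple has the term as its subject.

    This says nothing about whether the description is complete or
    whether the using document employs the term compatibly.
    """
    return any(subject == term for subject, _predicate, _obj in triples)


# Hypothetical term and hypothetical content of its standard document.
term = "http://example.org/vocab#Bicycle"
fetched = [(term, "http://www.w3.org/2000/01/rdf-schema#label", "Bicycle")]

print(standard_document_uri(term))
print(mentions_term(fetched, term))
```

A term in a namespace that ends in a slash has no fragment, so its
standard document URI under this assumption is the term URI itself.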

> I'd like a facility for doing this that doesn't rely on a centralised
> search-engine.
> 
> Ideally I'd also like a facility to search based on any URI: If
> somebody is selling Bicycles, I'd like my agent to be able to find
> other Bicycle sellers in order to compare prices etc..

But how do you expect your scheme to solve this problem?  It
(probably) won't find sellers of racing bicycles, whereas an IR
approach would (probably).  It (probably) won't find sellers of
personal non-motorized transportation devices, but neither would an IR
approach.  Moreover, to find such sellers, it places severe burdens on
information sources, both computational and legal.

>   [... more snipping ...] 
> 
>  > > I suspect that for trust to work on any implementation of an
>  > > open-world semantic web (centralised or decentralised), the authority
>  > > of a statement will have to be decoupled from the location it was
>  > > discovered. If that is the case, then it won't matter if you serve the
>  > > statement or it comes from somewhere else.
>  > 
>  > Aaah, but it really does matter who serves a statement.  Perhaps not to the
>  > model-theoretic semantics of the Semantic Web, but certainly the source of
>  > information matters in the external social (and legal) world. Ignoring the
>  > fact that the Semantic Web is part of our imperfect and messy world is not
>  > going to be helpful for its widespread adoption.
> 
> I agree with you that the source of the statement is important, but
> disagree in your implied definition of source.
> 
> I view this as a user-interface issue: The only reason that you
> attribute the document contents to the 'owner' of the location that
> served the document is that at present this is a reasonably good
> guess.  

No, I attribute the contents of a document to the ``owner'' of the
location because the ``owner'' of the location is responsible for the
contents of the document.  This is not a guess (modulo correct
operation of the World Wide Web).  The owner is responsible for the
document, even if the document is only a copy of some other document.
(Yes, sometimes, such as is the case for search engines, the extent of
this responsibility is somewhat limited and does not extend to a
claim by the owner that the contents of the document are correct.)

> Having said that, you wouldn't attempt to sue google for a
> page you read from its cache.

No?  Why not?  (Well, I wouldn't, but others might.)

Google has already removed pages from its cache for various legal
reasons - and this in the United States where the protections on
communication are (still) quite strong.  In other legal jurisdictions
there are even more kinds of communication and retransmission of
communications that are prohibited.  Note that Google puts an explicit
disclaimer at the beginning of pages served from its cache (which
means that its cache cannot really be viewed as a cache).

> The pages that you view on the internet come via your internet service
> provider, however you don't attribute the legal authority of these
> pages to your ISP. (Although people did try and do this in the early
> days of the web).

This is only because ISPs have managed to obtain a particular legal
status that is (almost certainly) not available to Google.  

> Similarly, if you have a knowledge agent that hunts the internet for
> information, you become less aware of the location that individual
> statements were loaded from. Better facilities will exist for
> automatically discovering the source of a statement than applying a
> 'it came from this location so it must be asserted by them' heuristic.

Well, I would expect that a semantic search engine would try to
present the results of its search in the form
	<information source> contains/contained <information>
(Amazing! A potential use of RDF reification.)
I really don't see any way of replacing the 
  It came from this location in a context that means that it is
  asserted so it must be asserted by the owner of this location.
rule by anything else.
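
A minimal sketch of how a result of that form could be written down
with the RDF reification vocabulary, using plain tuples as triples.
The rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are
the standard ones; the ex:contains property, the example.org names and
the report_hit helper are hypothetical.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/search#"  # hypothetical vocabulary for the search engine

Triple = tuple[str, str, str]


def reify(statement_id: str, triple: Triple) -> list[Triple]:
    """Describe a triple with the RDF reification vocabulary."""
    subject, predicate, obj = triple
    return [
        (statement_id, RDF + "type", RDF + "Statement"),
        (statement_id, RDF + "subject", subject),
        (statement_id, RDF + "predicate", predicate),
        (statement_id, RDF + "object", obj),
    ]


def report_hit(source: str, statement_id: str, triple: Triple) -> list[Triple]:
    """Say '<source> contains <statement>' without asserting the statement."""
    return reify(statement_id, triple) + [(source, EX + "contains", statement_id)]


found = (
    "http://example.org/shop#item1",
    RDF + "type",
    "http://example.org/vocab#Bicycle",
)
for t in report_hit("http://example.org/feed.rdf", "_:s1", found):
    print(t)
```

The found triple itself does not appear in the output, so the search
engine reports what the source contains without asserting it.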

>  > > Without such a facility on the semantic web, I struggle to see how
>  > > it will be bootstrapped to deal with open queries. At present,
>  > > there is no real 'web' of information to search.
>  > 
>  > Well, I think that there already exists the mechanism to have a truly
>  > decentralised web of information, with no central authorities or
>  > information sources, namely the owl:imports construct.  It is not
>  > perfect, but I think that it is better than trusting in the absence
>  > of a mechanism for supporting a network of trust.
> 
> True. But it doesn't allow me to do open queries. 

(What is an open query?)

>  > All this said, I have nothing particularly against imposing a notion
>  > of authority or mandating trust relationships in particular
>  > situations.  If it is indeed the case that contact information for
>  > employees is (normally, or even, often) stored at particular
>  > locations, then it can be useful to impose a trust relationship from
>  > outside the Semantic Web to get applications to use this
>  > information.  I view this as a (partial) failure of the goals of the
>  > Semantic Web, but it may be the case that this is the best that can
>  > be done.
>  > 
>  > However, I would consider it to be much more in line with the goals of
>  > the Semantic Web to instead have a document that explicitly points to
>  > these employee documents to establish the trust relationship.  
> 
> My experience has been that once you start writing SW applications,
> the notion of 'document' becomes clumsy and doesn't provide much
> value. For example, we have lots of RDF published in documents at
> work, but typically applications don't go to these documents to get
> this information - they query an RDF knowledge base (e.g. sesame)
> which sucks data in from these documents.

But how then do you determine which information to use?  There has to
be some limit to the amount of information that you use and I don't see
any method for so doing that does not ultimately depend on documents
(or other similar information sources such as databases).
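
One document-based way of setting such a limit, following the
owl:imports construct mentioned earlier in this thread, can be sketched
as below.  This is an illustration only: the documents are an in-memory
dictionary rather than anything fetched from the Web, any owl:imports
triple found in a document is followed, and the example.org names are
hypothetical.

```python
from collections import deque

OWL_IMPORTS = "http://www.w3.org/2002/07/owl#imports"

Triple = tuple[str, str, str]


def imports_closure(start: str, documents: dict[str, list[Triple]]) -> set[str]:
    """Return the start document plus everything reachable via owl:imports."""
    seen = {start}
    queue = deque([start])
    while queue:
        doc = queue.popleft()
        # Documents that are not available contribute no further imports.
        for _subject, predicate, obj in documents.get(doc, []):
            if predicate == OWL_IMPORTS and obj not in seen:
                seen.add(obj)
                queue.append(obj)
    return seen


# Hypothetical documents; "staff" and "vocab" import each other.
docs = {
    "http://example.org/staff": [
        ("http://example.org/staff", OWL_IMPORTS, "http://example.org/vocab"),
    ],
    "http://example.org/vocab": [
        ("http://example.org/vocab", OWL_IMPORTS, "http://example.org/staff"),
        ("http://example.org/vocab", OWL_IMPORTS, "http://example.org/units"),
    ],
    "http://example.org/unrelated": [],
}

print(sorted(imports_closure("http://example.org/staff", docs)))
```

Only information from documents in the returned set would be used;
anything else, such as the unrelated document here, is left out.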

>  > Even better would be to have some mechanism to implicitly point to
>  > these employee documents, but I do not believe that there are
>  > currently any mechanisms in the Semantic Web for so doing.
> 
> That is the mechanism I am attempting to envisage. I suspect that
> trust and authority is going to be a much simpler problem to solve
> (simply by signing statements, and treating any unsigned statements
> with suspicion) than that of decentralised information discovery to
> support open queries.
> 
> The problem is that if we don't do this soon, a number of centralized
> spike solutions will appear based on harvesting all the RDF in the
> world and putting it in one place (e.g. 'google marketplace').

Well, maybe, but I don't see much utility to harvesting the
information in random (and very large) collections of documents and
unioning all this information into one information source.  I do,
however, see lots of utility in analyzing semantic information from
lots of documents and providing pointers back to those documents, suitably
organized. 

> Cheers,
> 
> Phil

Peter F. Patel-Schneider
Bell Labs Research
Received on Tuesday, 20 April 2004 12:04:24 UTC