Re: Distributed querying on the semantic web from Peter F. Patel-Schneider on 2004-04-21 (www-rdf-interest@w3.org from April 2004)

From: Peter F. Patel-Schneider <pfps@research.bell-labs.com>
Date: Wed, 21 Apr 2004 03:20:13 -0400 (EDT)
To: pdawes@users.sourceforge.net
Cc: www-rdf-interest@w3.org
Message-Id: <20040421.032013.45260331.pfps@research.bell-labs.com>
From: "Phil Dawes" <pdawes@users.sourceforge.net>
Subject: Re: Distributed querying on the semantic web
Date: Tue, 20 Apr 2004 18:19:28 +0100

> Peter F. Patel-Schneider writes:
>  > 
>  > > Unfortunately, most of the RDF I consume doesn't contain this
>  > > contextual linkage information (or even appear in well formed
>  > > documents). Take RSS1.0 feeds for example: If there's a term I don't
>  > > know about, the RSS feed doesn't contain enough context information
>  > > for my SW agent to get me a description of that term.
>  > 
>  > Yes, this is a definite problem with some sources - they use terms
>  > without providing information about their meaning.  Such sources are
>  > broken and, in my view, violate the vision of the Semantic Web.
>  > 
>  > How, then, to do something useful in these situations?  A scheme that
>  > goes to a standard location (namely the document accessible from the
>  > URI of the URI reference) is probably no worse than any other.
>  > However, it should always be in mind that this scheme incorporates a
>  > leap of faith: faith that the standard document has information about
>  > the term; faith that the standard document has usefully-complete
>  > information about the term; faith that the document using the term is
>  > using it in a way compatible with the information in the standard
>  > document.  Each of these leaps of faith can be counter to reality
>  > and, worse, they can be counter to reality in undetectable ways.
> 
> All true. However the web shows us that people publishing information
> do tend to go to some lengths to ensure that it is as accessible and
> usable as possible. I suspect that if it becomes a convention that
> agents go to the URI when they don't have any other information,
> people will endeavour to put useful information there.

Sure, and this would be really good.  However, I firmly believe that there
must not be any requirement, or even expectation, that this is *the*
information about a URI reference.  It must be possible, and not 
stigmatized, to have different views concerning URI references.

>  > > I'd like a facility for doing this that doesn't rely on a centralised
>  > > search-engine.
>  > > 
>  > > Ideally I'd also like a facility to search based on any URI: If
>  > > somebody is selling Bicycles, I'd like my agent to be able to find
>  > > other Bicycle sellers in order to compare prices etc..
>  > 
>  > But how do you expect your scheme to solve this problem?  It
>  > (probably) won't find sellers of racing bicycles, whereas an IR
>  > approach would (probably).  It (probably) won't find sellers of
>  > personal non-motorized transportation devices, but neither would an IR
>  > approach.  Moreover, to find such sellers, it places severe burdens on
>  > information sources, both computational and legal.
> 
> I'm not familiar with IR in this context - is this a centralised
> search service?

Sorry, ``IR'' here means ``information retrieval'', i.e., retrieval
systems like Google that do not consider the meaning of words.

> I would imagine a decentralised version could work like this:
> 
> The original source says:
> <#superbike300> <rdf:type> <http://example.com/foo/Racingbike>
> 
> My agent dereferences 'http://example.com/foo/Racingbike' and the
> representation returned gives information about racing bikes
> (e.g. it's <rdfs:subClassOf> <bah:Bike>, <owl:equivalentClass>
> <baz:RacingBike> etc...). It provides some links <rdfs:seeAlso> to
> sources of racing bike instances, and maybe some metadata about those
> sources (source has an http joseki interface etc..).

But your agent was looking for bah:Bike, so how does it know to dereference
http://example.com/foo/Racingbike?  Yes, it would be *possible* to go
backwards by searching the Semantic Web for URI references that are stated
to be subclasses of bah:Bike.  I don't have much expectation that this
would work well in practice, at least for quite some time, as it is going
to pick up lots and lots of incorrect statements.
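
A minimal sketch of that backward search, in Python over an in-memory list
of statements. The class names bah:Bike and
http://example.com/foo/Racingbike come from the example above; the second
source, its quux:Toaster claim, and the helper function are hypothetical,
and a real agent would fetch and parse RDF rather than use hard-coded
tuples.

```python
# Each entry is (source, (subject, predicate, object)).
# The second source and its claim are invented for illustration.
SUBCLASS_OF = "rdfs:subClassOf"

STATEMENTS = [
    ("http://example.com/foo/Racingbike",
     ("http://example.com/foo/Racingbike", SUBCLASS_OF, "bah:Bike")),
    ("http://example.org/unreliable",
     ("quux:Toaster", SUBCLASS_OF, "bah:Bike")),
]

def claimed_subclasses(target, statements):
    """Backward search: every class that any source states to be a subclass of target."""
    return sorted(
        s for _source, (s, p, o) in statements
        if p == SUBCLASS_OF and o == target
    )

found = claimed_subclasses("bah:Bike", STATEMENTS)
print(found)
```

The search returns both classes: nothing in the statements themselves
distinguishes the plausible claim from the incorrect one.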

[...big snip...]

>  > > Similarly, if you have a knowledge agent that hunts the internet for
>  > > information, you become less aware of the location that individual
>  > > statements were loaded from. Better facilities will exist for
>  > > automatically discovering the source of a statement than applying a
>  > > 'it came from this location so it must be asserted by them' heuristic.
>  > 
>  > Well, I would expect that a semantic search engine would try to
>  > present the results of its search in the form
>  > 	<information source> contains/contained <information>
>  > (Amazing! A potential use of RDF reification.)
>  > I really don't see any way of replacing the 
>  >   It came from this location in a context that means that it is
>  >   asserted so it must be asserted by the owner of this location.
>  > rule by anything else.
>
> How about 'it's signed by this private key, therefore it's asserted by
> its owner'?

I don't think that this suffices.  The owner may have changed its mind, the
original context may have been non-assertive, etc., etc.
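
One way to picture the distinction between "contains" and "asserts" is to
record, for each statement, where it was found and whether the surrounding
context was assertive. The sketch below assumes such a flag is available;
the Observation record, the assertive field, and both source locations are
hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Observation:
    source: str        # location the statement was loaded from
    statement: tuple   # (subject, predicate, object)
    assertive: bool    # True if the context at that location asserts the statement

def asserted_by(observations):
    """Map each statement to the sources that asserted it, not merely contained it."""
    result = {}
    for obs in observations:
        if obs.assertive:
            result.setdefault(obs.statement, []).append(obs.source)
    return result

STATEMENT = ("#superbike300", "rdf:type", "http://example.com/foo/Racingbike")

observations = [
    Observation("http://example.com/shop", STATEMENT, True),
    # A copy that merely retransmits the statement without asserting it.
    Observation("http://example.org/archive", STATEMENT, False),
]

attribution = asserted_by(observations)
print(attribution)
```

Only the first location is treated as asserting the statement. A signature
on the statement would not supply the assertive flag; it would only
identify who produced the bytes.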

[...snip...]

>  > > My experience has been that once you start writing SW applications,
>  > > the notion of 'document' becomes clumsy and doesn't provide much
>  > > value. For example, we have lots of RDF published in documents at
>  > > work, but typically applications don't go to these documents to get
>  > > this information - they query an RDF knowledge base (e.g. sesame)
>  > > which sucks data in from these documents.
>  > 
>  > But how then do you determine which information to use?  There has to
>  > be some limit to the amount of information that you use and I don't see
>  > any method for so doing that does not ultimately depend on documents
>  > (or other similar information sources such as databases).
> 
> Signed statements. Web of trust.
> (actually we don't do this, but we should do, and would if the
> appropriate software existed)

How does this provide boundaries on what information to use?  All that
signatures provide is attribution.  All that the web of trust provides is a
sense of who can be assumed to be truthful.  None of this does anything to
keep me from following links and ending up using information that depends
on mutually-inconsistent world views.

[...]

>  > I do,
>  > however, see lots of utility in analyzing semantic information from
>  > lots of documents and providing pointers back to those documents, suitably
>  > organized. 
> 
> Is that because you are most familiar with documents and search
> engines?  

I don't think so.

> E.g. if all statements were signed, and society had come to accept
> that retransmitting statements didn't amount to asserting them, would
> you rather write software against one big queryable database or lots
> of individual documents and a big index to them?

The problem is that there are mutually inconsistent world views out there
held by trustworthy agents.  Simply getting information from multiple
sources and expecting it to be useful in combination is not going to work.  A
full solution to this problem is very difficult, but I believe that the
Semantic Web can get a long way by using explicit links between documents
to indicate that the information in these documents can be fruitfully
combined.
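
As a rough illustration of bounding information by explicit links, the
sketch below combines statements only from documents reachable from a
starting document through its declared links. The document names, the
"links" field, and the triples are all hypothetical; the linking mechanism
itself (for example something in the style of owl:imports) is an
assumption, not something specified here.

```python
from collections import deque

# Hypothetical documents: each lists the documents it explicitly links to
# and the triples it contains.
DOCUMENTS = {
    "doc:a": {"links": ["doc:b"], "triples": [("ex:x", "ex:p", "ex:y")]},
    "doc:b": {"links": ["doc:a"], "triples": [("ex:y", "ex:p", "ex:z")]},
    # doc:c is trustworthy in its own right but nothing links to it.
    "doc:c": {"links": [], "triples": [("ex:x", "ex:p", "ex:w")]},
}

def combinable_triples(start, documents):
    """Collect triples from start and every document reachable via explicit links."""
    seen = {start}
    queue = deque([start])
    triples = []
    while queue:
        doc = queue.popleft()
        triples.extend(documents[doc]["triples"])
        for linked in documents[doc]["links"]:
            if linked in documents and linked not in seen:
                seen.add(linked)
                queue.append(linked)
    return triples

combined = combinable_triples("doc:a", DOCUMENTS)
print(combined)
```

Starting from doc:a, the statements in doc:c are left out: no document in
the linked set has declared that they can be fruitfully combined with its
own.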

> Cheers,
> 
> Phil

Peter F. Patel-Schneider
Bell Labs Research
Received on Wednesday, 21 April 2004 03:39:42 UTC