Re: Distributed querying on the semantic web from Phil Dawes on 2004-04-20 (www-rdf-interest@w3.org from April 2004)

From: Phil Dawes <pdawes@users.sourceforge.net>
Date: Tue, 20 Apr 2004 18:19:28 +0100
To: "Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Cc: www-rdf-interest@w3.org
Message-ID: <16517.23456.240165.31082@gargle.gargle.HOWL>
Peter F. Patel-Schneider writes:
 > 
 > > Unfortunately, most of the RDF I consume doesn't contain this
 > > contextual linkage information (or even appear in well formed
 > > documents). Take RSS1.0 feeds for example: If there's a term I don't
 > > know about, the RSS feed doesn't contain enough context information
 > > for my SW agent to get me a description of that term.
 > 
 > Yes, this is a definite problem with some sources - they use terms
 > without providing information about their meaning.  Such sources are
 > broken and, in my view, violate the vision of the Semantic Web.
 > 
 > How, then, to do something useful in these situations?  A scheme that
 > goes to a standard location (namely the document accessible from the
 > URI of the URI reference) is probably no worse than any other.
 > However, it should always be in mind that this scheme incorporates a
 > leap of faith: faith that the standard document has information about
 > the term; faith that the standard document has usefully-complete
 > information about the term; faith that the document using the term is
 > using it in a way compatible with the information in the standard
 > document.  Each of these leaps of faith can be counter to reality
 > and, worse, they can be counter to reality in undetectable ways.
 > 

All true. However, the web shows us that people publishing information
do tend to go to some lengths to ensure that it is as accessible and
usable as possible. I suspect that if it becomes a convention that
agents go to the URI when they don't have any other information,
people will endeavour to put useful information there.


 > > I'd like a facility for doing this that doesn't rely on a centralised
 > > search-engine.
 > > 
 > > Ideally I'd also like a facility to search based on any URI: If
 > > somebody is selling Bicycles, I'd like my agent to be able to find
 > > other Bicycle sellers in order to compare prices etc..
 > 
 > But how do you expect your scheme to solve this problem?  It
 > (probably) won't find sellers of racing bicycles, whereas an IR
 > approach would (probably).  It (probably) won't find sellers of
 > personal non-motorized transportation devices, but neither would an IR
 > approach.  Moreover, to find such sellers, it places severe burdens on
 > information sources, both computational and legal.
 > 

I'm not familiar with IR in this context - is this a centralised
search service?

I would imagine a decentralised version could work like this:

The original source says:
<#superbike300> <rdf:type> <http://example.com/foo/Racingbike>

My agent dereferences 'http://example.com/foo/Racingbike' and the
representation returned gives information about racing bikes
(e.g. it's <rdfs:subClassOf> <bah:Bike>, <owl:equivalentClass>
<baz:RacingBike> etc...). It provides some links <rdfs:seeAlso> to
sources of racing bike instances, and maybe some metadata about those
sources (source has an http joseki interface etc..).

My agent then uses the sources to locate other <#seller>s of
<Racingbikes>.

N.B. The seller of <#superbike300> has chosen to use the term
<http://example.com/foo/Racingbike> over other identical terms because
it has a good information lookup service. 

The seller has also contacted the owner of
<http://example.com/foo/Racingbike> (maybe through some automated
mechanism) in order to let him know that she sells racing bikes.
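
The flow above might be sketched roughly as follows. This is only an
illustration under several assumptions: the WEB dictionary stands in for
HTTP dereferencing, triples are plain tuples rather than a real RDF
library, and the ex:sells property, the instance-source URI and the shop
URIs are all hypothetical. It is not an existing protocol.

```python
# Toy sketch of decentralised lookup: dereference a term URI, follow its
# rdfs:seeAlso links, and collect sellers of instances of that class.
# WEB stands in for HTTP GET; every URI and property below is made up.

RDF_TYPE = "rdf:type"
SEE_ALSO = "rdfs:seeAlso"
SELLS = "ex:sells"  # hypothetical property, not from any real vocabulary

RACINGBIKE = "http://example.com/foo/Racingbike"
INSTANCES = "http://example.com/foo/racingbike-instances"  # hypothetical

WEB = {
    RACINGBIKE: [
        (RACINGBIKE, "rdfs:subClassOf", "bah:Bike"),
        (RACINGBIKE, "owl:equivalentClass", "baz:RacingBike"),
        (RACINGBIKE, SEE_ALSO, INSTANCES),
    ],
    INSTANCES: [
        ("http://shop-a.example/#superbike300", RDF_TYPE, RACINGBIKE),
        ("http://shop-a.example/#seller", SELLS, "http://shop-a.example/#superbike300"),
        ("http://shop-b.example/#roadster", RDF_TYPE, RACINGBIKE),
        ("http://shop-b.example/#seller", SELLS, "http://shop-b.example/#roadster"),
        ("http://shop-c.example/#kettle", RDF_TYPE, "ex:Kettle"),
        ("http://shop-c.example/#seller", SELLS, "http://shop-c.example/#kettle"),
    ],
}


def dereference(uri):
    """Return the triples 'published' at uri, or nothing if it is unknown."""
    return WEB.get(uri, [])


def find_sellers(class_uri):
    """Find sellers of instances of class_uri via the term's seeAlso links."""
    description = dereference(class_uri)
    sources = [o for s, p, o in description if s == class_uri and p == SEE_ALSO]
    sellers = set()
    for source in sources:
        triples = dereference(source)
        instances = {s for s, p, o in triples if p == RDF_TYPE and o == class_uri}
        sellers |= {s for s, p, o in triples if p == SELLS and o in instances}
    return sorted(sellers)


print(find_sellers(RACINGBIKE))
```

A real agent would also have to decide how far to follow such links and
how much to trust each source, which this sketch ignores.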

[...snip...]

 > Google has already removed pages from its cache for various legal
 > reasons - and this in the United States where the protections on
 > communication are (still) quite strong.  In other legal jurisdictions
 > there are even more kinds of communication and retransmission of
 > communications that are prohibited.  Note that Google puts an explicit
 > disclaimer at the beginning of pages served from its cache (which
 > means that its cache cannot really be viewed as a cache).
 > 
 > > The pages that you view on the internet come via your internet service
 > > provider, however you don't attribute the legal authority of these
 > > pages to your ISP. (Although people did try and do this in the early
 > > days of the web).
 > 
 > This is only because ISPs have managed to obtain a particular legal
 > status that is (almost certainly) not available to Google.  
 > 

Do you think it will be available to Google in the future?
(ISPs have ~10 years on search engine caches)

Where does that leave open forums? wikis?


 > > Similarly, if you have a knowledge agent that hunts the internet for
 > > information, you become less aware of the location that individual
 > > statements were loaded from. Better facilities will exist for
 > > automatically discovering the source of a statement than applying a
 > > 'it came from this location so it must be asserted by them' heuristic.
 > 
 > Well, I would expect that a semantic search engine would try to
 > present the results of its search in the form
 > 	<information source> contains/contained <information>
 > (Amazing! A potential use of RDF reification.)
 > I really don't see any way of replacing the 
 >   It came from this location in a context that means that it is
 >   asserted so it must be asserted by the owner of this location.
 > rule by anything else.
 > 

How about 'it's signed by this private key, therefore it's asserted by
its owner'?
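
As a rough sketch of that rule: the example below attributes a statement
to whoever holds the key that signed it, wherever the statement was
loaded from. It assumes textbook RSA with no padding and two small,
well-known primes, purely so that it runs with the Python standard
library (3.8 or later). It is not secure, and the function names are
made up for illustration.

```python
# Toy illustration only: textbook RSA with no padding and small, well-known
# primes. It is NOT secure; it just shows attribution by signature check.
import hashlib

P = 2**61 - 1  # Mersenne prime
Q = 2**89 - 1  # Mersenne prime
N = P * Q      # public modulus
E = 65537      # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))  # private exponent (needs Python 3.8+)


def digest(statement):
    """Hash a (subject, predicate, object) tuple to an integer below N."""
    data = "\x1f".join(statement).encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest(), "big") % N


def sign(statement, private_exponent=D):
    """Sign a statement with the private exponent."""
    return pow(digest(statement), private_exponent, N)


def asserted_by_key_owner(statement, signature, public_exponent=E):
    """True if the signature matches the statement under the key (N, E)."""
    return pow(signature, public_exponent, N) == digest(statement)


stmt = ("<#superbike300>", "rdf:type", "<http://example.com/foo/Racingbike>")
sig = sign(stmt)
tampered = (stmt[0], stmt[1], "<http://example.com/foo/Tricycle>")

print(asserted_by_key_owner(stmt, sig))
print(asserted_by_key_owner(tampered, sig))
```

The check depends only on the statement and the key, not on the location
it was fetched from, which is the point of the rule.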


 > >  > > Without such a facility on the semantic web, I struggle to see how
 > >  > > it will be bootstrapped to deal with open queries. At present,
 > >  > > there is no real 'web' of information to search.
 > >  > 
 > >  > Well, I think that there already exists the mechanism to have a truly
 > >  > decentralised web of information, with no central authorities or
 > >  > information sources, namely the owl:imports construct.  It is not
 > >  > perfect, but I think that it is better than trusting in the absence
 > >  > of a mechanism for supporting a network of trust.
 > > 
 > > True. But it doesn't allow me to do open queries. 
 > 
 > (What is an open query?)
 > 

Sorry - I meant querying in an open-world environment.

[...snip...]

 > > My experience has been that once you start writing SW applications,
 > > the notion of 'document' becomes clumsy and doesn't provide much
 > > value. For example, we have lots of RDF published in documents at
 > > work, but typically applications don't go to these documents to get
 > > this information - they query an RDF knowledge base (e.g. sesame)
 > > which sucks data in from these documents.
 > 
 > But how then do you determine which information to use?  There has to
 > be some limit to the amount of information that you use and I don't see
 > any method for so doing that does not ultimately depend on documents
 > (or other similar information sources such as databases).
 > 

Signed statements. Web of trust.
(actually we don't do this, but we should do, and would if the
appropriate software existed)


 > >  > Even better would be to have some mechanism to implicitly point to
 > >  > these employee documents, but I do not believe that there are
 > >  > currently any mechanisms in the Semantic Web for so doing.
 > > 
 > > That is the mechanism I am attempting to envisage. I suspect that
 > > trust and authority is going to be a much simpler problem to solve
 > > (simply by signing statements, and treating any unsigned statements
 > > with suspicion) than that of decentralised information discovery to
 > > support open queries.
 > > 
 > > The problem is that if we don't do this soon, a number of centralized
 > > spike solutions will appear based on harvesting all the RDF in the
 > > world and putting it in one place (e.g. 'google marketplace').
 > 
 > Well, maybe, but I don't see much utility to harvesting the
 > information in random (and very large) collections of documents and
 > unioning all this information into one information source.  

Google does - see e.g. GMail.

 > I do,
 > however, see lots of utility in analyzing semantic information from
 > lots of documents and providing pointers back to those documents, suitably
 > organized. 
 > 

Is that because you are most familiar with documents and search
engines?  
E.g. if all statements were signed, and society had come to accept
that retransmitting statements didn't amount to asserting them, would
you rather write software against one big queryable database or lots
of individual documents and a big index to them?

Cheers,

Phil
Received on Tuesday, 20 April 2004 16:10:02 UTC