Re: Distributed querying on the semantic web from Patrick Stickler on 2004-04-22 (www-rdf-interest@w3.org from April 2004)

From: Patrick Stickler <patrick.stickler@nokia.com>
Date: Thu, 22 Apr 2004 11:31:03 +0300
To: "ext Peter F. Patel-Schneider" <pfps@research.bell-labs.com>
Cc: www-rdf-interest@w3.org, pdawes@users.sourceforge.net
Message-Id: <6493D783-9437-11D8-AB55-000A95EAFCEA@nokia.com>
On Apr 20, 2004, at 18:45, ext Peter F. Patel-Schneider wrote:

>
>
>> Hi Peter,
>>
>> [My Ramblings snipped - see rest of thread for info]
>>
>> Peter F. Patel-Schneider writes:
>>>
>> [...]
>>>
>>> Well, yes, but I don't think that the scheme that you propose is
>>> workable in general.  Why not, instead, use information from the
>>> document in which the URI reference occurred?  I would claim that this
>>> information is going to be at least as appropriate as the information
>>> found by using your scheme.  (It may, indeed, be that the document in
>>> which the URI reference occurs does point to the document that you
>>> would get to, perhaps by using an owl:imports construct.  This is, to
>>> me, the usual way things would occur, but I view it as extremely
>>> important to allow for other states of affairs.)
>>>
>>
>> Unfortunately, most of the RDF I consume doesn't contain this
>> contextual linkage information (or even appear in well formed
>> documents). Take RSS1.0 feeds for example: If there's a term I don't
>> know about, the RSS feed doesn't contain enough context information
>> for my SW agent to get me a description of that term.
>
> Yes, this is a definite problem with some sources - they use terms
> without providing information about their meaning.

???

The term is denoted by a URI. The authoritative meaning of that term
should be obtainable via that URI (e.g. by using a solution such as
URIQA).

Each source which uses a term should not have to bundle along the
definition of that term! Nor should it be mandatory that the source
indicate how/where that term is fully defined by the owner
of that term.

All that should matter is the URI. Period. That's all. Nothing more
should be required for the agent to obtain the authoritative description
of that term, if required.

There is *NOTHING* wrong with RSS 1.0 in this regard. There is no
reason whatsoever why an RSS instance should indicate how the
definitions of the terms used should be obtained.

If some client doesn't understand a term, there should be a standardized
SW-optimized means for the client to obtain the term's definition (and
IMO, that should be done using URIQA or something similar).
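
As a rough illustration of "all that should matter is the URI", here is a
minimal sketch of how a client might turn a term's URI into an MGET request
addressed to the web authority of that URI. Only the MGET method name comes
from this discussion; the exact request layout, the Accept header, and the
function name build_mget_request are assumptions made for the example, and
fragment identifiers are deliberately not handled.

```python
from urllib.parse import urlsplit


def build_mget_request(term_uri: str) -> str:
    """Build the text of an MGET request for the given term URI.

    Hypothetical helper: nothing is sent over the network, the request is
    only assembled so that its shape can be inspected.
    """
    parts = urlsplit(term_uri)
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("expected an absolute http(s) URI")
    if parts.fragment:
        # Assumption: fragment handling is out of scope for this sketch.
        raise ValueError("URIs with fragment identifiers are not handled here")

    # The request target is the path (plus query) of the term's own URI.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query

    lines = [
        f"MGET {target} HTTP/1.1",
        f"Host: {parts.netloc}",       # the web authority of the URI
        "Accept: application/rdf+xml",  # assumed preferred serialization
        "",
        "",
    ]
    return "\r\n".join(lines)


# The RSS 1.0 "title" term, used here purely as an example input.
request_text = build_mget_request("http://purl.org/rss/1.0/title")
print(request_text)
```

Note that the RSS feed itself contributes nothing but the URI: the client
needs no link to a schema document to form the request.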


> Such sources are
> broken and, in my view, violate the vision of the Semantic Web.

Then it would appear that your vision of the SW has little
intersection with the more commonly held vision of the SW.

>
> How, then, to do something useful in these situations?  A scheme that
> goes to a standard location (namely the document accessible from the
> URI of the URI reference) is probably no worse than any other.
> However, it should always be in mind that this scheme incorporates a
> leap of faith: faith that the standard document has information about
> the term; faith that the standard document has usefully-complete
> information about the term; faith that the document using the term is
> using it in a way compatible with the information in the standard
> document.  Each of these leaps of faith can be counter to reality
> and, worse, they can be counter to reality in undetectable ways.

Precisely, which is why thinking in terms of "documents" and limiting
one's search for information about a term to particular documents is
non-scalable and fragile.

Just as there are no standards-imposed constraints on how representations
are stored/managed internally by a web server which responds to a GET
request for a given URI and returns a representation -- so too should
there be no standards-imposed (or in any other way imposed) constraints
on how authoritative descriptions are stored/managed internally by a SW
server which responds to an MGET (or similar) request and returns the
description.
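
The opacity argued for here can be sketched as two interchangeable storage
back ends behind one interface. This is an illustration only: the names
DescriptionSource, ListBackedSource, SqliteBackedSource and answer_mget, the
example.org vocabulary, and the choice to treat a description as simply "all
triples whose subject is the URI" are assumptions, not anything defined by
URIQA.

```python
import sqlite3
from typing import Protocol

Triple = tuple[str, str, str]


class DescriptionSource(Protocol):
    """What the agent-facing side of a server relies on, and nothing more."""

    def describe(self, uri: str) -> set[Triple]: ...


class ListBackedSource:
    """Stands in for descriptions kept in a parsed RDF/XML instance."""

    def __init__(self, triples: list[Triple]) -> None:
        self._triples = list(triples)

    def describe(self, uri: str) -> set[Triple]:
        return {t for t in self._triples if t[0] == uri}


class SqliteBackedSource:
    """Stands in for descriptions kept in a database."""

    def __init__(self, triples: list[Triple]) -> None:
        self._db = sqlite3.connect(":memory:")
        self._db.execute("CREATE TABLE triples (s TEXT, p TEXT, o TEXT)")
        self._db.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

    def describe(self, uri: str) -> set[Triple]:
        rows = self._db.execute(
            "SELECT s, p, o FROM triples WHERE s = ?", (uri,)
        )
        return {(s, p, o) for s, p, o in rows}


def answer_mget(source: DescriptionSource, uri: str) -> set[Triple]:
    """Hypothetical request handler: it never learns how storage works."""
    return source.describe(uri)


# Hypothetical vocabulary used only for this example.
PRICE = "http://example.org/vocab/price"
DATA: list[Triple] = [
    (PRICE, "rdfs:label", '"price"'),
    (PRICE, "rdfs:comment", '"The amount asked for an item."'),
    ("http://example.org/vocab/color", "rdfs:label", '"color"'),
]

from_file = answer_mget(ListBackedSource(DATA), PRICE)
from_db = answer_mget(SqliteBackedSource(DATA), PRICE)
print(from_file == from_db)
```

Whichever back end is plugged in, the requesting agent receives the same
description and has no way (or need) to tell the difference.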

Thus, whether that term definition is expressed in one or a dozen places,
whether it is stored in a physical RDF/XML instance or a database,
whether one or a hundred people are involved in its creation or
management, all is irrelevant to the agent and should be rightly hidden
from view. All the agent wants is the authoritative description -- no
matter how it is defined/managed.

The SW needs a layer of opacity in the publication/access of resource
descriptions just as the web provides a layer of opacity in the
publication/access of representations.

RDF/XML and OWL "documents" simply get in the way, and are the wrong
level of resolution to try to provide a scalable, global, and efficient
infrastructure for the publication and interchange of resource
descriptions across the SW.

> Well, I would expect that a semantic search engine would try to
> present the results of its search in the form
> 	<information source> contains/contained <information>

Probably.

> (Amazing! A potential use of RDF reification.)

Named graphs will IMO provide a better solution (and certainly require
fewer triples).
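
To make the triple count concrete, here is a small sketch comparing the two
ways of recording "<information source> contains <information>". The
rdf:Statement, rdf:subject, rdf:predicate and rdf:object terms are the
standard RDF reification vocabulary; the foundIn property, the example.org
URIs, and the representation of a named graph as quads plus one provenance
triple are assumptions made for the illustration.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOUND_IN = "http://example.org/terms/foundIn"  # hypothetical property


def reify(statements, source):
    """Attach a source to each statement via RDF reification.

    Each statement needs four reification triples plus one triple
    linking the reified statement to its source.
    """
    triples = []
    for i, (s, p, o) in enumerate(statements):
        node = f"_:stmt{i}"
        triples += [
            (node, RDF + "type", RDF + "Statement"),
            (node, RDF + "subject", s),
            (node, RDF + "predicate", p),
            (node, RDF + "object", o),
            (node, FOUND_IN, source),
        ]
    return triples


def name_graph(statements, graph_name, source):
    """Attach a source once, to the graph that holds the statements."""
    quads = [(s, p, o, graph_name) for (s, p, o) in statements]
    provenance = [(graph_name, FOUND_IN, source)]
    return quads, provenance


statements = [
    ("http://example.org/wine/1", "http://example.org/terms/region", '"Rioja"'),
    ("http://example.org/wine/1", "http://example.org/terms/year", '"1998"'),
    ("http://example.org/wine/1", "http://example.org/terms/grape", '"Tempranillo"'),
]
source = "http://example.org/feeds/wine.rdf"

reified = reify(statements, source)
quads, provenance = name_graph(statements, "http://example.org/graphs/g1", source)

print(len(reified))                   # 5 per statement
print(len(quads) + len(provenance))   # 1 per statement, plus 1 per graph
```

Under these assumptions the reified form costs five triples per statement,
while the named-graph form costs one statement each plus a single
provenance triple for the whole graph.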

>> My experience has been that once you start writing SW applications,
>> the notion of 'document' becomes clumsy and doesn't provide much
>> value. For example, we have lots of RDF published in documents at
>> work, but typically applications don't go to these documents to get
>> this information - they query an RDF knowledge base (e.g. sesame)
>> which sucks data in from these documents.
>
> But how then do you determine which information to use?  There has to
> be some limit to the amount of information that you use and I don't see
> any method for so doing that does not ultimately depend on documents
> (or other similar information sources such as databases).

Documents are simply the wrong mechanism, at the wrong architectural
layer to construct our "webs of trust". Named, signed graphs are IMO
the answer.

(Jeremy Carroll, Chris Bizer, Pat Hayes, and I are finishing up a paper
on an approach to addressing this issue which should be web-visible
soon).

>> The problem is that if we don't do this soon, a number of centralized
>> spike solutions will appear based on harvesting all the RDF in the
>> world and putting it in one place (e.g. 'google marketplace').
>
> Well, maybe, but I don't see much utility to harvesting the
> information in random (and very large) collections of documents and
> unioning all this information into one information source.

Apart from a very few highly ambitious services (ultimately perhaps
only one, such as Google), most collections of knowledge will probably
be highly specialized (e.g. harvesting all wine related knowledge, or
all knowledge about vintage golf clubs, etc.).

And most likely, such collections would not (necessarily) be collections
of "documents" but collections of knowledge -- harvested via a
standardized interface which rightly hides the underlying mechanisms used
to manage such knowledge.

>  I do,
> however, see lots of utility in analyzing semantic information from
> lots of documents and providing pointers back to those documents,
> suitably organized.

Simply pointing back to documents is leaving all the real work for each
agent -- to parse and extract from such documents the individual bits
of information that are needed insofar as a particular term or resource
is concerned.

It's not the least bit efficient or scalable.

Consider a mobile client that needs to understand the meaning
of some property. The "document" that defines this is a monolithic
RDF/XML instance for an ontology defining 750 terms with labels
and descriptions in 17 languages. It is 2.4 MB in size.

What a fat lot of help getting the URI of that massive RDF/XML
"document" is going to be when all that is needed is a concise
description of a single property.

What the mobile client *should* be able to do is to ask the web
authority of the URI denoting that property for a concise bounded
description of that property, and then proceed with whatever it
was doing -- with no concern for how that knowledge was managed,
stored, partitioned, etc. etc.
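
A simplified sketch of what a "concise bounded description" could look
like follows. It assumes the description is every statement whose subject
is the resource, extended recursively through blank-node objects; the real
definition may include more than this. The tiny graph, the prefixed names
and the function names are hypothetical.

```python
def is_blank(node: str) -> bool:
    """Blank nodes are written with a '_:' prefix in this sketch."""
    return node.startswith("_:")


def concise_bounded_description(graph, start):
    """Collect the statements describing `start`, and nothing else.

    Simplifying assumption: take all triples with `start` as subject,
    then follow blank-node objects and repeat for each of them.
    """
    description = set()
    visited = set()
    pending = [start]
    while pending:
        node = pending.pop()
        if node in visited:
            continue
        visited.add(node)
        for (s, p, o) in graph:
            if s == node:
                description.add((s, p, o))
                if is_blank(o):
                    pending.append(o)
    return description


# Stand-in for a large ontology: only one of its terms is of interest.
PRICE = "http://example.org/vocab/price"
COLOR = "http://example.org/vocab/color"
GRAPH = {
    (PRICE, "rdfs:label", '"price"'),
    (PRICE, "rdfs:range", "_:r1"),
    ("_:r1", "rdf:type", "ex:Money"),
    (COLOR, "rdfs:label", '"color"'),
}

for triple in sorted(concise_bounded_description(GRAPH, PRICE)):
    print(triple)
```

The client receives only the statements about the one property it asked
about, however many other terms the underlying ontology happens to define.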

Thinking in terms of RDF or OWL documents insofar as global
access of resource-specific knowledge is concerned (either
authoritative or 3rd party) is not going to provide a scalable
and efficient solution.

Regards,

Patrick

--

Patrick Stickler
Nokia, Finland
patrick.stickler@nokia.com
Received on Thursday, 22 April 2004 04:34:00 UTC