
Re: The Power of Virtuoso Sponger Technology

From: Hugh Glaser <hg@ecs.soton.ac.uk>
Date: Sun, 18 Oct 2009 16:50:58 +0100
To: Olaf Hartig <hartig@informatik.hu-berlin.de>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <EMEW3|6e344644c668052a3a86a8a414f8f76cl9HGpD02hg|ecs.soton.ac.uk|131%hg@ecs.soton.ac.uk>

The SWCL-style approach works pretty well, as long as the RDF you want about
the URIs is what you get by resolving them.

It can be much more problematic if the relevant RDF lives on some other site,
such as (a wrapped) Amazon stating the price of a book that is identified by a
publisher's URI.
There are ways round this, but the technology is not really quite there yet.
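A minimal sketch of this limitation, with entirely invented URIs and data (this is not any real wrapper's output): dereferencing a URI only yields the document that its own publisher serves, so third-party statements about that URI, such as a shop's price, are never discovered by resolution alone.

```python
# Invented example: a simulated Web mapping document URLs to the
# RDF-like triples each document serves.
WEB = {
    "http://publisher.example/book/42": [
        ("http://publisher.example/book/42", "dc:title", "Example Book"),
        ("http://publisher.example/book/42", "dc:creator", "A. Author"),
    ],
    # The (wrapped) shop's document states the price, but nothing in the
    # publisher's document links here, so plain dereferencing never finds it.
    "http://shop.example/offer/xyz": [
        ("http://publisher.example/book/42", "gr:hasPrice", "19.99"),
    ],
}

def dereference(uri):
    """Resolve a URI to the triples served by its own document."""
    return WEB.get(uri, [])

triples = dereference("http://publisher.example/book/42")
has_price = any(p == "gr:hasPrice" for _, p, _ in triples)
print(has_price)  # the price triple is not in the resolved document
```

The price triple exists on the Web, but only in a document that resolving the book's URI never reaches.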

Cheers
Hugh

On 18/10/2009 16:34, "Olaf Hartig" <hartig@informatik.hu-berlin.de> wrote:

> Hey Giovanni,
> 
> On Sunday 18 October 2009 16:01:41 Giovanni Tummarello wrote:
>> I'd say, if I understand well, that it works only for queries where
>> you need the extra dereferenced data just "additionally", e.g. to add a
>> label to your result set.
> 
> I'm not sure what you mean by "additionally" here. The approach works for all
> queries that could be answered by traversing RDF links and building the result
> during this process. This approach doesn't assume a huge store/index that
> holds large parts of the Web of data. Instead, all data that contributes to
> the result is discovered during the execution of the query.
> (At least in the pure form of the approach; for efficiency reasons, or to
> allow for more complete results, you may want to reuse the data discovered
> during previous query executions.)
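A minimal sketch of the traversal Olaf describes, with invented URIs and data (the real SemWeb Client Lib is a Java library; none of its API is shown here): URIs are looked up during query evaluation, each retrieved graph is added to the queried dataset, and newly discovered URIs are followed in turn.

```python
# Invented example: a simulated Web of interlinked RDF documents.
WEB = {
    "http://ex.org/alice": [
        ("http://ex.org/alice", "foaf:knows", "http://ex.org/bob"),
    ],
    "http://ex.org/bob": [
        ("http://ex.org/bob", "foaf:name", "Bob"),
    ],
}

def dereference(uri):
    return WEB.get(uri, [])

def traverse_and_match(seed, predicate):
    """Evaluate the pattern (?s, predicate, ?o) by traversing RDF links."""
    dataset, seen, frontier = [], set(), [seed]
    while frontier:
        uri = frontier.pop()
        if uri in seen:
            continue
        seen.add(uri)
        for s, p, o in dereference(uri):   # look up the URI at query time
            dataset.append((s, p, o))
            for term in (s, o):            # follow links we just discovered
                if term.startswith("http://") and term not in seen:
                    frontier.append(term)
    return [(s, o) for s, p, o in dataset if p == predicate]

print(traverse_and_match("http://ex.org/alice", "foaf:name"))
```

Bob's name is found even though his document was not known when the query started; only the seed URI in the query itself is required.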
> 
>> if you need the remote, on the fly reference data to e.g. sort by
>> price you'd have to fetch all from the remote site ..
> 
> True. But fetching all the remote site data that is relevant for the query is
> possible with the link traversal based approach (as long as the single RDF
> graphs from the site are interlinked appropriately).
> Sure, this might be less efficient than systems that crawl in advance.
> But what if the descriptions of the price-sorted things come from multiple
> data sources? What if these descriptions change quite frequently, or the list
> of these things changes often? Maybe even the list of sources that provide
> the descriptions changes. In these cases the link traversal based approach
> helps because it allows for up-to-date answers, even if their calculation
> might take some time.
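A small sketch of the sort-by-price case, again with invented URIs and data: every price is fetched at query time by dereferencing the listed items, so the ordering reflects the sources' current state, at the cost of retrieving all relevant descriptions before sorting.

```python
# Invented example: two independent shops describing items in a catalogue.
WEB = {
    "http://catalog.example/list": [
        ("http://catalog.example/list", "ex:item", "http://shopA.example/b1"),
        ("http://catalog.example/list", "ex:item", "http://shopB.example/b2"),
    ],
    "http://shopA.example/b1": [("http://shopA.example/b1", "ex:price", "12.50")],
    "http://shopB.example/b2": [("http://shopB.example/b2", "ex:price", "9.99")],
}

def dereference(uri):
    return WEB.get(uri, [])

def prices_sorted(seed):
    """Dereference the list, then every listed item, then sort by price."""
    items = [o for _, p, o in dereference(seed) if p == "ex:item"]
    prices = []
    for item in items:                      # fetch each remote description
        for _, p, o in dereference(item):
            if p == "ex:price":
                prices.append((item, float(o)))
    return sorted(prices, key=lambda pair: pair[1])

print(prices_sorted("http://catalog.example/list"))
```

If a shop updates its price, the next query execution sees the change immediately; a crawl-in-advance system only sees it after its next crawl cycle.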
> 
> Greetings,
> Olaf
> 
>> Gio
>> 
>> 
>> 
>> On Sun, Oct 18, 2009 at 2:57 PM, Olaf Hartig
>> <hartig@informatik.hu-berlin.de> wrote:
>>> Hey,
>>> 
>>> On Sunday 18 October 2009 09:37:14 Martin Hepp (UniBW) wrote:
>>>> [...]
>>>> So it will boil down to technology that combines (1) crawling and
>>>> caching rather stable data sets with (2) distributing queries and parts
>>>> of queries among the right SPARQL endpoints (whatever actual DB
>>>> technology they expose).
>>>> 
>>>> You can keep a text index of the whole Web, if crawling cycles in the
>>>> order of magnitude of weeks are fine. For structured, linked data that
>>>> exposes dynamic database content, "dumb" crawling and caching will not
>>>> scale.
>>> 
>>> Interesting discussion!
>>> 
>>> An alternative approach to query federation is the link traversal based
>>> query execution as implemented in the SemWeb Client Lib. The main idea of
>>> this approach is to look up URIs during the query execution itself. With
>>> this approach you don't rely on the existence of SPARQL endpoints and,
>>> even more important, you don't have to know in advance all the sources
>>> that contribute to the query result. Plus, the results are based on the
>>> most up-to-date data you can get.
>>> 
>>> Greetings,
>>> Olaf
Received on Sunday, 18 October 2009 15:52:26 UTC
