Re: Just-in-time scraping, queries?

Danny Ayers wrote:
> I was wondering if anyone had come up with any strategies that might
> be useful in a scenario that came up on the SIMILE list [1]. Rickard
> is using Piggy Bank's scraper to harvest moderately large amounts of
> data into its store (30,000 items, 10 properties each), and is running
> into performance issues. I'm not sure, but he mentioned Wikipedia
> earlier, so that may be the data source.

Right now I'm using the databases at usgs.gov (earthquakes, volcanoes, 
etc.) as a start. Wikipedia is my next target though.

> I think it's reasonable to consider a triplestore as merely a cache of
> a certain chunk of the Semantic Web at large. So in a case like this,
> maybe it makes more sense to forget trying to cache /everything/, just
> grabbing things into the working model as required. But say there's a
> setup like a SPARQL interface to a store, and a scraper (HTTP GET+
> whatever translation is appropriate). How might you figure out what's
> needed to fulfil the query, what joins are required, especially where
> there isn't any direct subject-object kind of connection to the
> original data? (i.e. where there's lots of bnodes). Querying Wikipedia
> as-is via SPARQL is probably a good use case.

Indeed, I've been thinking about multi-layered triplestores as well, 
such as:
1) Multiple distributed specialized banks
fronted by:
2) Local persistent bank acting as proxy and file-store cache
fronted by:
3) In-memory cache that contains queried data

When you start working with data in Piggy Bank, for example, you start 
with some basic filtering, like "I want all earthquakes". This query is 
sent to 2), which retrieves the data from 1), using its local cache 
whenever possible. The aggregated dataset is then put into 3) and 
presented to the user. Since we then have a reasonably small subset of 
all the data in memory, further drill-down filtering, such as showing 
"all earthquakes in 1980, of magnitude 6-8", is very fast, because those 
operations never leave memory. 3) can even live on the user's computer, 
so that user sessions do not eat vast amounts of server resources. This 
way each layer is used to the maximum. 1) might even be a "fake" 
semantic bank, where the scraping is performed in real time as the 
queries come in. It is easy to imagine all of this being combined with 
P2P techniques as well.
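To make the layering concrete, here is a rough sketch in plain Java. The 
class names (RemoteBank, LocalProxyBank, WorkingModel) are made up for 
illustration, not actual Piggy Bank or SIMILE classes, and the "scraping" 
and persistence parts are stubbed out; it only shows how the three layers 
could delegate to each other:

import java.util.*;
import java.util.function.Predicate;

// Layer 1: a remote, specialized bank. In the "fake" bank case this is
// where scraping/translation would happen as queries come in.
interface Bank {
    // A statement is just a subject/predicate/object triple of strings here.
    List<String[]> query(String pattern);
}

class RemoteBank implements Bank {
    public List<String[]> query(String pattern) {
        // HTTP GET + scraping would go here; stubbed out for the sketch.
        return Collections.emptyList();
    }
}

// Layer 2: local persistent proxy. Answers from its own cache when it
// can, otherwise asks the remote banks and remembers the result
// (a real implementation would persist this to a file store).
class LocalProxyBank implements Bank {
    private final List<Bank> remotes;
    private final Map<String, List<String[]>> cache = new HashMap<>();

    LocalProxyBank(List<Bank> remotes) { this.remotes = remotes; }

    public List<String[]> query(String pattern) {
        return cache.computeIfAbsent(pattern, p -> {
            List<String[]> merged = new ArrayList<>();
            for (Bank b : remotes) merged.addAll(b.query(p));
            return merged;
        });
    }
}

// Layer 3: in-memory working model holding the current result set, so
// drill-down filtering never touches the server again.
class WorkingModel {
    private final List<String[]> statements;

    WorkingModel(List<String[]> statements) { this.statements = statements; }

    List<String[]> filter(Predicate<String[]> test) {
        List<String[]> out = new ArrayList<>();
        for (String[] s : statements) if (test.test(s)) out.add(s);
        return out;
    }
}

public class LayeredStoreSketch {
    public static void main(String[] args) {
        Bank proxy = new LocalProxyBank(List.of(new RemoteBank()));
        // The coarse query goes through layers 2 and 1:
        WorkingModel model = new WorkingModel(proxy.query("type=earthquake"));
        // Drill-down ("earthquakes in 1980, magnitude 6-8") stays in memory:
        List<String[]> hits = model.filter(s -> true /* date/magnitude test */);
        System.out.println(hits.size() + " matching statements");
    }
}

The point is that only the first, coarse query ever reaches 1) and 2); 
everything after that is a cheap pass over the in-memory model in 3).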

/Rickard

Received on Monday, 24 October 2005 01:41:19 UTC