Just-in-time scraping, queries?

I was wondering if anyone had come up with any strategies that might
be useful in a scenario that came up on the SIMILE list [1]. Rickard
is using Piggy Bank's scraper to harvest moderately large amounts of
data into its store (30,000 items, 10 properties each), and is running
into performance issues. I'm not sure, but he mentioned Wikipedia
earlier, so that may be the data source.

I think it's reasonable to consider a triplestore as merely a cache of
a certain chunk of the Semantic Web at large. So in a case like this,
maybe it makes more sense to forget trying to cache /everything/ and
just grab things into the working model as required. But say there's a
setup like a SPARQL interface to a store, plus a scraper (HTTP GET +
whatever translation is appropriate). How might you figure out what's
needed to fulfil the query, and what joins are required, especially
where there isn't any direct subject-object kind of connection to the
original data (i.e. where there are lots of bnodes)? Querying Wikipedia
as-is via SPARQL is probably a good use case.
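
To make that a little more concrete, here's a naive fetch-on-demand
sketch in Python with rdflib (names and the regex are just
illustrative): pull every URI mentioned in the query text, dereference
it into the working model, then evaluate the query locally. It assumes
the resources actually dereference to RDF, which is exactly where a
scraper/translation step would have to slot in, and it does nothing
clever about joins or bnodes, which is the hard part.

import re
from rdflib import Graph

def query_with_jit_fetch(graph, sparql):
    # Naively pull every absolute URI mentioned in the query text
    for uri in set(re.findall(r'<(https?://[^>]+)>', sparql)):
        try:
            # Dereference and merge into the working model
            graph.parse(uri)
        except Exception:
            # A scraper/translator would go here for non-RDF sources
            pass
    # Evaluate the query over whatever made it into the cache
    return graph.query(sparql)

# e.g.
# g = Graph()
# for row in query_with_jit_fetch(g,
#         "SELECT ?p ?o WHERE { <http://example.org/thing> ?p ?o }"):
#     print(row)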

I can't help thinking something akin to CBDs [2] might work, but I'm
not sure offhand how one would delegate the path-walking down to a
scraper. Or maybe someone has an approach to cross-triplestore
querying that will work (a SPARQL-squared kind of trick might be
useful [3], but I suspect there might not be enough linkage).
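
For reference, a rough rendering of the CBD idea (minus the
reification part) in the same rdflib-flavoured Python: take all the
statements with the node as subject, and recurse into any bnode
objects. The open question is how you'd push that recursion out to a
scraper rather than run it over a local graph.

from rdflib import Graph, BNode

def cbd(graph, node, seen=None):
    # Concise Bounded Description, ignoring reified statements
    seen = set() if seen is None else seen
    out = Graph()
    if node in seen:
        return out
    seen.add(node)
    for s, p, o in graph.triples((node, None, None)):
        out.add((s, p, o))
        if isinstance(o, BNode):
            # Bnodes can't be named in a follow-up request, so pull
            # their descriptions in now -- this is the path-walking
            # that would somehow have to be delegated to the scraper
            out += cbd(graph, o, seen)
    return out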

Thoughts?

Incidentally, there is some schadenfreudenesque comfort in knowing
that these kinds of problems aren't solely SW issues. From the same
list thread:

[[
> Glad you're pushing it to the limit :-) Just curious, have you tried
> plotting 30,000 items on Google Maps?!

Yes. Doesn't work :-)
]]

Cheers,
Danny.

[1] http://simile.mit.edu/mail/ReadMsg?listName=General&msgNo=1155
[2] http://www.w3.org/Submission/2004/SUBM-CBD-20040930/
[3] http://dannyayers.com/archives/2005/10/01/sparql-squared/

--

http://dannyayers.com
