RE: Google API -> Jena

On Fri, 19 Apr 2002, Jeremy Carroll wrote:

>
> I hadn't realised that there was a google api ...
>
>
> http://www.google.com/apis/
>
>
> wow, that's a cool idea.

I had a screenscraped version of this in my ropey old Perl RDF, which was
largely based on ideas from Guha's Mozilla RDF implementation. There are some
interesting issues raised when we try to plug in remote services to answer
certain types of triple-query. Dead simple lookups work fine (stock
tickers, backlinks): you can have certain property types trigger the
lookup. More complex remote services are harder to integrate (eg. I think
Mozilla tried to wrap IMAP behind a graph API). The problem there is that
the granularity of the API makes it hard to know when to pester the remote
service, how often, etc.

What I was basically doing was... (sat behind a graph match API, in
context where a predicate and object were supplied but no subject):

 my $NS_RUDOLF = 'http://xmlns.com/2000/08/goo/';
 if ($prop eq $NS_RUDOLF.'referer') {
   my @hits = $self->googlinks($obj);   # pages Google says link to $obj
   foreach my $h (@hits) {
     # each hit becomes a [subject, predicate, object] triple
     push(@triples, [$h, $NS_RUDOLF.'referer', $obj]);
   }
 }
 return @triples;

...with googlinks() being a quick four-line screenscraper hack. Maybe using
SOAP will reduce that to three lines, or at least improve reliability!
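
For flavour, a Ruby take on that scraper might look like the sketch below.
The 'link:' query prefix is real Google search syntax, but the href regexp
is a guess at the result-page markup and will break whenever Google changes
its HTML:

 require 'cgi'
 require 'net/http'
 require 'uri'

 # Return URLs of pages that (per Google) link to +url+, by scraping
 # an HTML 'link:' search results page.
 def googlinks(url)
   q    = CGI.escape("link:#{url}")
   html = Net::HTTP.get(URI("http://www.google.com/search?q=#{q}"))
   html.scan(/<a href="(http[^"]+)"/).flatten.uniq
 end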

So...

If a graph navigation API is too granular for dealing with remote services
that aren't easily conceptualised as simple triple queries, maybe we
should be doing this at the RDF query level: peeling apart an RDF query
(in one of the 'gimme bindings for this variable-name-decorated graph'
languages, Squish/RDFdbQL, Algae, RDQL etc), and sending a subquery to the
specialised Web service.

I've started to hack on this (in Ruby, having dumped Perl :)

rough scribbles:
http://www.w3.org/2001/12/rubyrdf/squish/service/webfetch_tests.rb

...currently gets a result set by doing part of the query (by hand)
against a local RDF graph, and part by calling a (different, still
screenscraped) Google backlinks API. This is an obvious candidate for
automation based on Web service description, and seems to offer a nice
curve from 'simple stuff do-able now' to 'PhD-territory distributed
query'. I've no plans to go near the latter end! I do want to make demos
that show some basic Web service composition techniques though, ie. a
single RDF query serviced by consulting multiple Web services and binding
the results together, where the decomposition into sub-tasks is done by
reading RDF service descriptions (WSDL++++?).
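
In toy form, that decomposition step could start out as simple as
partitioning the WHERE patterns by predicate namespace. A sketch, under the
assumed rule that goo: predicates go to Google and everything else stays
local:

 GOO = 'http://xmlns.com/2000/08/goo/'

 # Partition [subject, predicate, object] patterns into a local subquery
 # and a remote one destined for the Google service, keyed on namespace.
 def split_query(patterns)
   remote, local = patterns.partition { |_s, p, _o| p.start_with?(GOO) }
   [local, remote]
 end

The local patterns would then run against the parsed feed, the remote ones
against Google, with the binding tables joined afterwards (as the demo below
does by hand).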

The current hack is as follows:

We have a query that pulls event data from a locally parsed RSS feed:

squish='SELECT ?item, ?title, ?etype, ?org, ?loc, ?start, ?end
	WHERE
	(rss::title ?item ?title)
	(ev::type ?item ?etype)
	(ev::organizer ?item ?org)
	(ev::location ?item ?loc)
	(ev::startdate ?item ?start)
	(ev::enddate ?item ?end)
	USING
	rss for http://purl.org/rss/1.0/
	ev for http://purl.org/rss/1.0/modules/event/
	foaf for http://xmlns.com/foaf/0.1/ '

In theory, we have a larger query that additionally asks that the item is
backlinked from the URL 'http://www.bloomfieldhouse.com/'. This
corresponds to a user app which is 'find me economics events associated
with Bloomfield House'. Since the demo script embodies the decomposition
of the larger query into two smaller ones, and I've not automated that
task, I never wrote out what the full query looks like. But it'd be just
one more line in the WHERE clause (sketched below).
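
Going by the goo:backlinks convention described later (bound subject, bound
predicate, unbound object), my guess is that the missing line would read
roughly like this; a reconstruction for illustration, not a line from the
actual script:

	(goo::backlinks http://www.bloomfieldhouse.com/ ?item)

...plus, in the USING clause:

	goo for http://xmlns.com/2000/08/goo/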

We just do (APIs are still taking shape, some of this is counterintuitive):

url       = 'http://www.bloomfieldhouse.com/'
backlinks = SquishQuery.googleBacklinks url, 'item'   # scraped Google lookup
eventfeed = Loader.ntfile2graph 'events.nt'           # local N-Triples graph
events    = SquishQuery.ask SquishQuery.new.parseFromText(squish), eventfeed

total = events.match backlinks, 'item'   # merge in the backlinks table
total.each do |row|
  row.values.each_key do |field|
    puts "\t#{field}: #{row.values[field]}"
  end
  puts "\n\n"
end

This finds us one result:

        loc: "Mullingar, Republic of Ireland"
        title: "IEA Annual Conference 2002"
        org: "Irish Economic Association"
        end: "2002-04-14"
        start: "2002-04-12"
        item: http://www.iea.ie/conferences/
        etype: "Conference"

...corresponding to the event in the feed that Google assures us also
links to (or is it from, I forget; whichever) the Bloomfield House URL.

Neither Google nor the local RSS feed has enough information alone to
answer the question. The common data model, and common naming system
(URIs) help us get to the answer. ASIDE: Interestingly, having a common
'Ontology' between Google and RSS/events was mostly irrelevant to the
problem. We got a match not because the two data sources shared an
ontology, but because they both used the same URI to name an individual
thing.


I want to try a couple of things next:

 (1) make the query engine understand when bits of a query can be farmed
out to remote services; look at requirements on Web service description
that make this deployable in the wild

  eg: the Google lookup services (scraped; not in their SOAP API yet)

  a) map onto a property (eg. goo:backlinks), and
  b) expect BOUND, BOUND, UNBOUND for subject/predicate/object in your
     query; ie. you can ask for 'page1.html goo:backlinks ?p' and get back
     (several) values for ?p, but you can't (unlike other RDF data sources)
     ask it for UNBOUND BOUND UNBOUND and expect a dump of all their
     backlinks (see the sketch after this list)
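
One way to make that restriction machine-readable is sketched below. The
ServiceCapability record and the :bound/:unbound vocabulary are invented
for illustration; nothing like this exists in the current code:

 # Hypothetical capability record: which subject/predicate/object binding
 # shapes a remote triple source can actually answer.
 ServiceCapability = Struct.new(:predicate, :patterns)

 google_backlinks = ServiceCapability.new(
   'http://xmlns.com/2000/08/goo/backlinks',
   [[:bound, :bound, :unbound],   # page1.html goo:backlinks ?p
    [:bound, :bound, :bound]]     # confirm one specific backlink
 )

 # True if the service can answer a triple pattern with this binding shape.
 def can_answer?(service, s_bound, p_bound, o_bound)
   asked = [s_bound, p_bound, o_bound].map { |b| b ? :bound : :unbound }
   service.patterns.include?(asked)
 end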


 (2) investigate how this relates to the (also handy) goal of sitting
these services behind a graph API, hiding (partially) their remoteness
from users; ie. redo my old Perl hack properly.

We can create backends for most RDF APIs that do the trick of going off to
Google when certain kinds of questions are asked. But in some contexts we
also need to expose this behaviour, since application code will need to be
sensitive to such goings-on, eg. for the purposes of asking questions in
a sensible order.

ie. if I am a query engine that does the job of implementing RDF query
on top of a plain 'match these possibly blanked-out triples' API, I can't
be entirely agnostic about what's going on behind the RDF API. Or rather I
can, but if I ask the triple questions in the wrong order, I'll miss out
on answers. We need to know that the backend will only be able to answer
'bound goo:backlinks unbound' or 'bound goo:backlinks bound' but not
'unbound goo:backlinks unbound'.
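
One toy way to respect that in a query planner is sketched below; the
greedy strategy and the needs_bound_subject table are illustrative
assumptions, not part of any existing engine:

 # Greedily order triple patterns so that predicates whose backend can't
 # enumerate subjects (like goo:backlinks) only run once their subject
 # variable has been bound by an earlier pattern.
 def order_patterns(patterns, needs_bound_subject)
   ordered, pending, bound = [], patterns.dup, {}
   until pending.empty?
     i = pending.index { |s, p, _o|
       !needs_bound_subject.include?(p) || !s.start_with?('?') || bound[s]
     } || 0   # nothing satisfiable: fall back rather than spin forever
     triple = pending.delete_at(i)
     s, _p, o = triple
     [s, o].each { |t| bound[t] = true if t.start_with?('?') }
     ordered << triple
   end
   ordered
 end

With goo:backlinks listed in needs_bound_subject, a pattern like
'?page goo:backlinks ?item' would be deferred until something else had
bound ?page.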

The same sort of thing goes for substring searches etc., if they're being
plugged in secretly behind an RDF graph API and applications are
trying to do query on top of those (instead of passing entire queries and
subqueries through to systems closer to the data).


I guess it'd be healthy to come up with some more practical use cases for
queries where part (but not all) of the work is done by Google, then map
these onto properties, eg. goo:backlinks, goo:goodMatchForQueryString,
goo:relatedPage, goo:assignedDmozCategory etc.

Dan


-- 
mailto:danbri@w3.org
http://www.w3.org/People/DanBri/
