- From: Dan Brickley <danbri@w3.org>
- Date: Fri, 19 Apr 2002 08:34:49 -0400 (EDT)
- To: Jeremy Carroll <jjc@hplb.hpl.hp.com>
- cc: <www-rdf-interest@w3.org>
On Fri, 19 Apr 2002, Jeremy Carroll wrote:

> I hadn't realised that there was a google api ...
>
> http://www.google.com/apis/

Wow, that's a cool idea. I had a screenscraped version of this in my ropey old Perl RDF code, which was largely based on ideas from Guha's Mozilla RDF implementation.

Some interesting issues are raised when we try to plug in remote services to answer certain types of triple query. Dead simple lookups work fine (stock tickers, backlinks); you can have certain property types trigger the lookup. More complex remote services are harder to integrate (eg. I think Mozilla tried to wrap IMAP behind a graph API). The problem there is that the granularity of the API makes it hard to know when to pester the remote service, how often, etc.

What I was basically doing was... (sat behind a graph match API, in a context where a predicate and object were supplied but no subject):

    my $NS_RUDOLF = 'http://xmlns.com/2000/08/goo/';

    if ($prop eq $NS_RUDOLF.'referer') {
        # ask Google for the pages linked with $obj
        my @hits = $self->googlinks($obj);
        foreach my $h (@hits) {
            # each hit $h joins a goo:referer triple about $obj,
            # pushed as a (predicate, subject, object) run
            push(@triples, $NS_RUDOLF.'referer', $h, $obj);
        }
    }
    return @triples;

...with googlinks() being a quick-hack 4 line screenscraper. Maybe using SOAP will reduce that to 3, or at least improve reliability!

So... if a graph navigation API is too granular for dealing with remote services that aren't easily conceptualised as simple triple queries, maybe we should be doing this at the RDF query level: peeling apart an RDF query (in one of the 'gimme bindings for this variable-name-decorated graph' languages: Squish/RDFdbQL, Algae, RDQL etc.) and sending a subquery to the specialised Web service.

I've started to hack on this (in Ruby, having dumped Perl :) and have some rough scribbles:

    http://www.w3.org/2001/12/rubyrdf/squish/service/webfetch_tests.rb

...which currently gets a result set by doing (by hand) part of the query against a local RDF graph, and part by calling a (different, still scraped) Google backlinks API. This is an obvious candidate for automation based on Web service description, and seems to offer a nice curve from 'simple stuff do-able now' to 'PhD territory distributed query'. I've no plans to go near the latter end! I do want to make demos that show some basic Web service composition techniques though, ie. a single RDF query serviced by consulting multiple Web services and binding the results together, where the decomposition into sub-tasks is done by reading RDF service descriptions (WSDL++++?).

The current hack is as follows. We have a query that pulls event data from a locally parsed RSS feed:

    squish = '
    SELECT ?item, ?title, ?etype, ?org, ?loc, ?start, ?end
    WHERE
      (rss::title ?item ?title)
      (ev::type ?item ?etype)
      (ev::organizer ?item ?org)
      (ev::location ?item ?loc)
      (ev::startdate ?item ?start)
      (ev::enddate ?item ?end)
    USING
      rss for http://purl.org/rss/1.0/
      ev for http://purl.org/rss/1.0/modules/event/
      foaf for http://xmlns.com/foaf/0.1/
    '

In theory, we have a larger query that additionally asks that the item is backlinked from the URL 'http://www.bloomfieldhouse.com/'. This corresponds to a user app which is 'find me economics events associated with Bloomfield House'. Since the demo script embodies the decomposition of the larger query into two smaller ones, and I've not automated that task, I never wrote out what the full query looks like. But it'd be just one more line in the WHERE clause.
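For concreteness, it might look something like this (a guess, since I never wrote it out: the goo::backlinks property name and which end of the link goes where are assumptions, reusing the goo namespace from the Perl hack above):

      (goo::backlinks http://www.bloomfieldhouse.com/ ?item)

...plus one extra declaration in the USING clause:

      goo for http://xmlns.com/2000/08/goo/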
We just do (APIs are still taking shape, and some of this is counterintuitive):

    url = 'http://www.bloomfieldhouse.com/'

    # ask Google (scraped) for backlinks of url, binding the results to ?item
    backlinks = SquishQuery.googleBacklinks url, 'item'

    # load the events RSS feed into a local RDF graph
    eventfeed = Loader.ntfile2graph 'events.nt'

    # run the Squish query above against the local graph
    events = SquishQuery.ask SquishQuery.new.parseFromText(squish), eventfeed

    # merge in the backlinks table, joining on the shared 'item' binding
    total = events.match backlinks, 'item'

    total.each do |row|
      row.values.each_key do |field|
        puts "\t#{field}: #{row.values[field]} \n"
      end
      puts "\n\n"
    end

This finds us one result:

    loc: "Mullingar, Republic of Ireland"
    title: "IEA Annual Conference 2002"
    org: "Irish Economic Association"
    end: "2002-04-14"
    start: "2002-04-12"
    item: http://www.iea.ie/conferences/
    etype: "Conference"

...corresponding to the event in the feed that Google assures us also links to (or is it from, I forget; whichever) the Bloomfield House URL. Neither Google nor the local RSS feed has enough information alone to answer the question. The common data model and common naming system (URIs) help us get to the answer.

ASIDE: Interestingly, having a common 'ontology' between Google and RSS/events was mostly irrelevant to the problem. We got a match not because the two data sources shared an ontology, but because they both used the same URI to name an individual thing.

I want to try a couple of things next:

(1) Make the query engine understand when bits of a query can be farmed out to remote services, and look at the requirements on Web service description that make this deployable in the wild. Eg: the Google lookup services (scraped; not in their SOAP API yet)

    a) map onto a property (eg. goo:backlinks), and
    b) expect BOUND, BOUND, UNBOUND for subject/predicate/object in your query,

ie. you can ask for 'page1.html goo:backlinks ?p' and get back (several) values for ?p, but you can't (unlike with other RDF data sources) ask it for UNBOUND BOUND UNBOUND and expect a dump of all their backlinks.

(2) Investigate how this relates to the (also handy) goal of sitting these services behind a graph API, and hiding (partially) their remoteness from users; ie. redo my old Perl hack properly. We can create backends for most RDF APIs that do the trick of going off to Google when certain kinds of questions are asked. But in some contexts we also need to expose this behaviour, since application code will need to be sensitive to such goings-on, eg. for the purposes of asking questions in a sensible order (there's a rough sketch of this below). Ie. if I am a query engine that does the job of implementing RDF query against a plain 'match these possibly blanked-out triples' API, I can't be entirely agnostic about what's going on behind the RDF API. Or I can, but if I ask the triple questions in the wrong order, I'll miss out on answers. We need to know that the backend will only be able to answer 'bound goo:backlinks unbound' or 'bound goo:backlinks bound', but not 'unbound goo:backlinks unbound'. The same sort of thing goes for substring searches etc., if they're being plugged in secretly behind an RDF graph API and applications are trying to do query on top of those (instead of passing entire queries and subqueries through to systems closer to the data).

I guess it'd be healthy to come up with some more practical use cases for queries where part (but not all) of the work is done by Google, then map these onto properties, eg. goo:backlinks, goo:goodMatchForQueryString, goo:relatedPage, goo:assignedDmozCategory etc.
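To sketch the 'sensible order' point from (2): none of the names below exist anywhere yet (this isn't the RubyRdf API, just a toy), but the shape of the thing might be a table of binding patterns each backend can answer, plus a planner that defers restricted patterns until earlier matches will have bound their required positions:

    # Toy illustration only: all names here are made up.
    GOO = 'http://xmlns.com/2000/08/goo/'

    # Which [subject, object] shapes each remote predicate can answer.
    # Predicates absent from this table (ie. local data) can answer anything.
    CAPABILITIES = {
      GOO + 'backlinks' => [[:bound, :unbound], [:bound, :bound]]
    }

    # A triple pattern: strings are constants, symbols are variables.
    Pattern = Struct.new(:subject, :predicate, :object)

    # Can we ask this pattern yet, given the variables bound so far?
    def askable?(pat, bound_vars)
      shapes = CAPABILITIES[pat.predicate]
      return true if shapes.nil?   # local graph: any shape is fine
      s = (pat.subject.is_a?(Symbol) && !bound_vars.include?(pat.subject)) ? :unbound : :bound
      o = (pat.object.is_a?(Symbol) && !bound_vars.include?(pat.object)) ? :unbound : :bound
      shapes.include?([s, o])
    end

    # Greedily order the patterns so each restricted one is only asked
    # once its required positions will have been bound by earlier matches.
    def plan(patterns)
      todo = patterns.dup
      ordered = []
      bound = []
      until todo.empty?
        pat = todo.find { |p| askable?(p, bound) }
        raise 'no sensible order; query cannot be answered' unless pat
        todo.delete(pat)
        ordered << pat
        [pat.subject, pat.object].each { |t| bound << t if t.is_a?(Symbol) }
      end
      ordered
    end

    query = [
      Pattern.new(:item, GOO + 'backlinks', :linker),              # remote, restricted
      Pattern.new(:item, 'http://purl.org/rss/1.0/title', :title)  # local RSS data
    ]
    plan(query).each { |pat| puts [pat.subject, pat.predicate, pat.object].inspect }
    # The rss::title pattern runs first (binding ?item against the local
    # feed); only then does the goo:backlinks lookup become
    # 'bound goo:backlinks unbound' and safe to send to Google.

A real version would read CAPABILITIES from RDF service descriptions rather than hardwiring it, which is the WSDL++++ question again.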
Dan
--
mailto:danbri@w3.org
http://www.w3.org/People/DanBri/

Received on Friday, 19 April 2002 08:35:51 UTC