
RE: Google API -> Jena

From: Danny Ayers <danny666@virgilio.it>
Date: Fri, 19 Apr 2002 15:13:26 +0200
To: "Dan Brickley" <danbri@w3.org>, "Jeremy Carroll" <jjc@hplb.hpl.hp.com>
Cc: <www-rdf-interest@w3.org>
Message-ID: <EBEPLGMHCDOJJJPCFHEFMEKBFLAA.danny666@virgilio.it>

>If a graph navigation API is too granular for dealing with remote services
>that aren't easily conceptualised as simple triple queries, maybe we
>should be doing this at the rdf query level. Peeling apart an RDF query
>(in one of the 'gimme bindings for this variable-name-decorated graph'
>languages, Squish/RDFdbQL, Algae, RDQL etc), and sending a subquery to the
>specialised Web service.

Yep, that's why I thought Jena would be a good fit - not only is it Java (easy
Google interfacing), it's also got RDQL.

>I've started to hack on this (in Ruby, having dumped Perl :)
>
>rough scribbles:
>http://www.w3.org/2001/12/rubyrdf/squish/service/webfetch_tests.rb

Looking good (though I can't comment on Ruby)

>...currently gets a result set through (by hand) doing part of the query
>against a local RDF graph, and part by calling a (different, scraped
>still) Google backlinks API. This is an obvious candidate for automation
>based on Web service description, and seems to offer a nice curve from
>'simple stuff do-able now' to 'phd territory distributed query'. I've no
>plans to go near the latter end! I do want to make demos that show some
>basic Web service composition techniques though, ie. a single RDF query
>serviced by consulting multiple Web services and binding the results
>together, where the decomposition into sub-tasks is done by reading
>RDF service descriptions (WSDL++++?).

I'm not sure anyone needs to go near the latter end - as long as the
interfaces to the services are reasonably compatible (SOAP-RDF I guess) then
at the moment when the first service connects to the second, you're in
distributed query territory. Major potential for network effects!

(At least) a couple of issues need ironing out - rules for
stopping/preventing loops, though timeouts & passing an 'already visited' path
along with queries may be starting points (I bet this is in a web services
standard somewhere already). There's also the small matter of ensuring a
good quality/quantity ratio on what comes back from the queries (not unrelated
to the UNBOUND BOUND UNBOUND issue you mention below).
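
Something like this, perhaps - a toy sketch (in Python for brevity; the real
thing would presumably be Jena/Java or Dan's Ruby) where every forwarded query
carries the list of services already visited. All the names here are invented:

```python
# Sketch of loop prevention when query services call one another: each
# forwarded query carries the path of services already visited, and a
# service never forwards to one it has already seen. PEERS, service names
# and the query format are all hypothetical stand-ins.

def forward_query(query, service, visited=None):
    """Send `query` to `service`, carrying the path taken so far."""
    visited = list(visited or [])
    if service in visited:
        return []          # loop detected: this hop was already made
    visited.append(service)
    results = service_answer(query, service)
    # A real service would fan out over SOAP here, passing `visited` along.
    for peer in PEERS.get(service, []):
        results += forward_query(query, peer, visited)
    return results

def service_answer(query, service):
    return [(service, query)]   # stand-in for a real SOAP/RDF call

# Toy "network" where A and B point at each other; without the visited
# path the recursion would never terminate.
PEERS = {"A": ["B"], "B": ["A"]}

print(forward_query("?x backlinks page1", "A"))
# each of A and B answers exactly once despite the cycle
```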

>the current hack is as follows:

[snip]

>This finds us one result,

[snip]

which I reckon is a reasonable proof-of-concept ;-)

>Neither Google nor the local RSS feed has enough information alone to
>answer the question. The common data model, and common naming system
>(URIs) help us get to the answer. ASIDE: Interestingly, having a common
>'Ontology' between Google and RSS/events was mostly irrelevant to the
>problem. We got a match not because the two data sources shared an
>ontology, but because they both used the same URI to name an individual
>thing.

Interesting. Now what about scraping URLs from blogs...

>I want to try a couple of things next:
>
> (1) make the query engine understand when bits of a query can be
>farmed  out
>to remote services; look at requirements on web service description that
>make this deployable in the wild
>
>  eg: the Google lookup services (scraped; not in their SOAP API yet)
>
>  a) map onto a property (eg. goo:backlinks) and
>  b) expect BOUND, BOUND, UNBOUND for subject/predicate/object in
>your query
>     ie. you can ask for 'page1.html goo:backlinks ?p' and get
>back (several) values for ?p
>     but you can't (unlike other RDF data sources) ask it for
>	UNBOUND BOUND UNBOUND and expect a dump of all their backlinks


>investigate how this relates to the (also handy) goal of sitting
>these services behind a graph API, and hiding (partially) their remoteness
>from users. ie. redo my old Perl hack properly.

Again, I fancy Jena as a concentrator.

>We can create backends for most RDF APIs that do the trick of going off to
>Google when certain kinds of questions are asked. But we also need in some
>contexts to expose this behaviour, since application code will need to be
>sensitive to such goings on, eg. for the purposes of asking questions in
>a sensible order.
>
>ie. if I am a query engine that does the job of implementing rdf query
>against a plain 'match these possibly blanked-out triples', I can't be
>entirely agnostic about what's going on behind the RDF API. Or I can, but
>if I ask the triple questions in the wrong order, I'll miss out on answers.
>We need to know that the backend will only be able to answer
>'bound goo:backlinks unbound' or 'bound goo:backlinks bound' but not
>'unbound goo:backlinks unbound'.
>
>Same sort of thing goes for substring searches etc., if they're being
>plugged in secretly behind an RDF graph API and applications are
>trying to do query on top of those (instead of passing entire queries and
>subqueries through to systems closer to the data).

Hmm - this raises the idea of micro- and macro-reasoning, which could
potentially be done using exactly the same inference tools, only at a
different level of granularity?
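
On the ordering point: the query engine could greedily defer any pattern a
restricted backend can't yet answer until earlier patterns have bound its
subject. A sketch (Python, names invented):

```python
# Sketch of "asking questions in a sensible order": at each step, pick a
# triple pattern that is currently answerable given the variables bound so
# far, so a restricted service like goo:backlinks is only consulted once
# an earlier pattern (e.g. against local RSS data) has bound its subject.
# RESTRICTED and the pattern format are hypothetical.

RESTRICTED = {"goo:backlinks"}   # services that need a bound subject

def answerable(triple, bound_vars):
    s, p, o = triple
    if p in RESTRICTED and s.startswith("?") and s not in bound_vars:
        return False
    return True

def order_patterns(patterns):
    remaining, bound, ordered = list(patterns), set(), []
    while remaining:
        # raises StopIteration if the query is unanswerable in any order
        nxt = next(t for t in remaining if answerable(t, bound))
        remaining.remove(nxt)
        ordered.append(nxt)
        bound |= {t for t in nxt if t.startswith("?")}
    return ordered

query = [("?page", "goo:backlinks", "?who"),   # must wait for ?page
         ("?event", "rss:link", "?page")]      # binds ?page locally
print(order_patterns(query))   # rss:link pattern comes out first
```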

>I guess it'd be healthy to come up with some more practical use cases for
>queries where part (but not all) of the work is done by google. Then map
>these onto properties, eg. goo:backlinks, goo:goodMatchForQueryString,
>goo:relatedPage, goo:assignedDmozCategory etc etc.

Yep - that's the kind of mapping I had in mind, almost direct from the API.
There's also other metadata (HTML+scraped) available from the returned links
that could be repackaged as triples.
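
The repackaging itself is near-trivial - a sketch with invented field and
property names (only goo:assignedDmozCategory comes from your list; the rest
are made up):

```python
# Sketch of repackaging per-result metadata from the search API as RDF
# triples, almost directly from the API fields. The mapping keys and the
# goo: property names (except goo:assignedDmozCategory) are hypothetical.

def result_to_triples(url, result):
    """Turn one search-result record (a dict) into (s, p, o) triples."""
    mapping = {"title": "goo:title",
               "snippet": "goo:snippet",
               "dmoz_category": "goo:assignedDmozCategory"}
    return [(url, mapping[k], v) for k, v in result.items() if k in mapping]

triples = result_to_triples(
    "http://example.org/page1.html",
    {"title": "Page One", "dmoz_category": "Computers/Internet"})
print(triples)   # two triples, one per mapped field
```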

Simple, practical use cases are definitely needed ('I made a search engine out
of Google' doesn't sound very convincing ;-)

Cheers,
Danny.
Received on Friday, 19 April 2002 09:18:54 GMT
