Re: SPARQL and Web 2 from Henry Story on 2005-10-10 (semantic-web@w3.org from October 2005)

From: Henry Story <henry.story@bblfish.net>
Date: Mon, 10 Oct 2005 13:04:45 +0200
To: Gareth Andrew <freega@freegarethandrew.org>, SWIG SWIG <semantic-web@w3.org>
Message-Id: <AC6ED448-0457-478C-862F-ECD4A91A6A83@bblfish.net>
On 10 Oct 2005, at 03:21, Gareth Andrew wrote:
> Hi Henry,
>
> I just posted a rebuttal at
> http://gingerhendrix.blogspot.com/2005/10/sparql-and-web-2.html
> but in the interests of discussion I am reposting it here (Please excuse
> the third person).

No problem. Thanks for the feedback. I'll add a link to your post and to
this thread from my post. (It is easier to have a discussion on a mailing
list than on a blog.)

> [DISCLAIMER: I have no expertise in this area, I am just a lay
> commentator]

I think we are all explorers of this vast, new, unconquered land. :-)

> Henry Story has just posted describing SPARQL as a query language for
> Web 2.0. I think all his usage examples are good, but I think he's
> missed the point slightly. Henry suggests that Web 2.0 business will
> expose SPARQL endpoints over web services. This isn't going to happen
> for several reasons
>      1. Economics: There is a lot of value stored in the databases Henry
>         mentions and most companies will not want competitors/users to
>         have unrestricted access to this data. Current web service APIs
>         are designed so the expected value increase from user derived
>         software, is likely to exceed the loss of the value in the data.

I am not saying "open all your databases, and all information in all your
databases". That would be crazy and often illegal (think of information
held about customers, for example).

No, clearly the idea is to expose only a subset of the data that the
enterprise has available. Enterprises should consider, though, that the
most successful web businesses are all (in one way or another) search
engines. And search engines have made it their business to open a huge
amount of data to the world.


>      2. Performance: Even if the data is completely open, and the
>         economics doesn't come into play, performance is a major issue.
>         SPARQL queries are designed to be written by Semantic Web
>         engineers, much as SQL queries are designed to be written by
>         database engineers. As an example, consider the following query
>
>         PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>         PREFIX dc: <http://purl.org/dc/elements/1.1/>
>         SELECT ?book
>         WHERE { ?book dc:creator ?who .
>              ?who foaf:name "J. K. Rowling"
>         }
>
>         This query (if the WHERE clause is evaluated top to bottom) is
>         highly inefficient, it first searches for all triples with
>         property dc:creator, then filters those such that the
>         dc:creator's foaf:name is "J. K. Rowling". A much more efficient
>         query reverses the patterns in the WHERE clause. I believe
>         automated query rewriting is beyond state of the art at the
>         moment and will continue to be for the foreseeable future,

That is a good point. But this is going to be true whenever you open a
query interface to the web. I worked at AltaVista and we had to deal with
exactly the same problem. If someone asks for "The cat of Danny Ayers",
none of the search engines first goes and finds all the pages in which the
word "the" appears. There are just too many. Google even drops it. They
would first look for pages with "Danny" and "Ayers", and then look for
which of those pages contain the word "cat". And search engines have
absolutely VAST indexes to search through, which most companies don't
have.

So whatever query language you have, be it ever so simple as text search,
you will have the above problem. Clearly there is a huge opportunity here
for people to write SPARQL drivers that optimize the queries for the
database they are hooked into, be it an RDF database or a plain normal
relational database. Given that we now have a uniform interface to query
databases, the demand for such drivers will become very big, and so
competition will work out the details. Just think of Java servlets. You
can get some nice and simple implementations, and then you can get much
more sophisticated ones that reduce the number of threads by using pooling
or the NIO socket select call.
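
As a rough illustration of the kind of reordering such a driver might do,
here is a minimal Python sketch applied to the two-pattern query quoted
above. It is an assumption-laden toy, not how any particular SPARQL engine
works: the names is_var, bound_count and reorder are hypothetical, patterns
are plain tuples of strings, and "most bound terms first" is only a naive
stand-in for the statistics a real optimizer would consult.

```python
# Hypothetical sketch: run the most selective triple pattern first.
# A term starting with "?" is a variable; anything else counts as bound.

def is_var(term):
    return term.startswith("?")

def bound_count(pattern, known_vars):
    """Count terms that are constants or variables bound by earlier patterns."""
    return sum(1 for t in pattern if not is_var(t) or t in known_vars)

def reorder(patterns):
    """Greedy ordering: repeatedly pick the pattern with the most bound terms."""
    remaining = list(patterns)
    ordered, known_vars = [], set()
    while remaining:
        best = max(remaining, key=lambda p: bound_count(p, known_vars))
        remaining.remove(best)
        ordered.append(best)
        # Variables in the chosen pattern are bound for the patterns after it.
        known_vars.update(t for t in best if is_var(t))
    return ordered

query = [
    ("?book", "dc:creator", "?who"),
    ("?who", "foaf:name", '"J. K. Rowling"'),
]
for pattern in reorder(query):
    print(" ".join(pattern))
```

With this heuristic the foaf:name pattern, which has two bound terms, is
evaluated before the dc:creator pattern, which is the reversal suggested in
the quoted text.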

By the way, I don't recommend waiting for such drivers to be available
before starting to work on this, because it is those who get there first
who will have the advantage. Just start a service like this as a low-key
beta site. I did this at AltaVista with the BabelFish machine translation
service. We had no idea how successful it would be. So we just put a C CGI
script out there that did some pretty awful things (like fetch web pages
by forking lynx) -- though I did make sure it did not do the worst (I used
sockets to inform the translators that there was a new file requiring
translation instead of having them poll the file system, which could have
been deadly with volume). When it was clear that this was of interest, we
removed the lynx fork and improved a few other things that we could fix
immediately. Later I rewrote the CGI as a Java servlet, which was a lot
cleaner, less buggy, more scalable and of course Unicode-enabled.

>         especially when you consider the technical challenge of throwing
>         inferencing into the mix, and the social challenge of open
>         access (eg. consider the query "SELECT ?s ?p ?o WHERE
>         { ?s ?p ?o}").


With respect to some queries requiring too much processing on the server
side, there are already numerous well-established techniques for dealing
with this on the web. You can simply return an error message explaining
that the server does not allow that type of query. You can cut up the
results into little chunks. This is something to look into. The best way
is to try it out.
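
To make those two techniques concrete, here is a minimal Python sketch of
a server-side guard. Everything in it is an assumption for illustration:
QueryRejected, check_patterns and chunked are hypothetical names, patterns
are plain tuples of strings, and "reject any pattern with no bound term"
is just one possible policy. SPARQL's own LIMIT and OFFSET modifiers are
another way to page through results.

```python
from itertools import islice

class QueryRejected(Exception):
    """Raised when the server refuses a query as too expensive."""

def check_patterns(patterns):
    """Reject any triple pattern made only of variables, such as ?s ?p ?o."""
    for pattern in patterns:
        if all(term.startswith("?") for term in pattern):
            raise QueryRejected(
                "pattern '%s' has no bound term" % " ".join(pattern)
            )

def chunked(results, size):
    """Yield the results in lists of at most `size` items."""
    iterator = iter(results)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk
```

A server could call check_patterns before evaluating a query and turn
QueryRejected into an error response, then stream chunked(results, n) back
to the client one piece at a time.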

> that's not to say I don't see SPARQL becoming an integral part of Web
> 2.0. I envisage that the next generation of back-end storage products,
> will be based on triples, inferencing, and rules. SPARQL will be the
> query language used to interface with the backend. At the web tier,
> services will continue to be built on RESTful principles, however more
> services will expose data as RDF, and publish schemas based on RDFS, OWL
> etc to enhance their meaning. At the client side aggregators, smushers,
> inferencers and provers will be fundamental building blocks, and
> high-level special purpose APIs written to interface with them (eg.
> BlogEd's RDF Javabeans classes). I think there is room for SPARQL again
> at this level, but it's likely to be too general and complex for your
> average application programmer.

Mhh. I think that the SPARQL interface just makes life a lot easier for
client-side application programmers, and perhaps a little more difficult
for the server-side ones. I don't think writing SPARQL queries is complex.
But this is easy to test. We just need to start opening a few databases
out there, and see how it works out and how developers take to it. There's
nothing like empirical data.

In fact I'd like to post a blog entry pointing to a nice test database
with a well laid out ontology so that developers can play with SPARQL. Any
good ideas?


Henry Story

> On Mon, 2005-10-10 at 00:46 +0200, Henry Story wrote:
>
>> I just posted this: http://blogs.sun.com/roller/page/bblfish/20051009
[snip]
Received on Monday, 10 October 2005 11:04:58 UTC