RE: The king is dressed in void from Orri Erling on 2008-06-20 (semantic-web@w3.org from June 2008)

From: Orri Erling <erling@xs4all.nl>
Date: Fri, 20 Jun 2008 10:47:13 +0200
To: <martin.hepp@uibk.ac.at>, "'Yves Raimond'" <yves.raimond@gmail.com>
Cc: "'Giovanni Tummarello'" <giovanni.tummarello@deri.org>, "'Hausenblas, Michael'" <michael.hausenblas@joanneum.at>, <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>
Message-Id: <200806200847.m5K8ls7Y018332@smtp-vbr12.xs4all.nl>

Hi all,

As I see it, VOID is about creating a possibility for smart, somewhat
transparent federating of linked data.

Federating is the opposite of the search engine model.  As Giovanni will be
the first to know, scale in the search engine space is purchased at the
price of query expressivity.  This will continue to be the case even if the
search engine stores quads and runs SPARQL queries.

A search engine is geared for lookups, whether text or SPARQL, not for
running everybody's analytics on hardware paid for and operated by the
search provider.  So there is a segment of the problem space that should be
addressed outside of a Google-style approach.  Either people build their own
warehouses and pay for the hardware and management, or there just might be a
slice that could be addressed by federated query shipping, dividing the cost
more evenly.

VOID, along the lines drawn on the ESW wiki VOID page or in my blog post at
virtuoso.openlinksw.com/blog, would be a step in this direction, even a
necessary step.

So, the point is:

1. How to compose the queries that one will ship over?  This involves sameAs
links, vocabularies, and whether one can expect URIs to match between sets
or needs sameAs translations.  Even more importantly, in Andy Seaborne's
immortal words [in Beijing], the ultimate optimization is not to query at
all.  So one should know whether there are applicable instances at the end
point to begin with.

2. Cardinalities.  How much to expect?  This is like listing the top 100
high-cardinality predicates, classes, objects, for example.  Even better
would be to amend the SPARQL protocol to allow submitting a query to the
remote cost model and get the cost and cardinality guess back.
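
As a rough illustration of both points, here is a minimal Python sketch of how a federating client might use such descriptions: it skips end points that cannot match a pattern at all and orders the rest by a cardinality guess. The DatasetDescription class, its fields, and the example URLs are hypothetical and not taken from any VOID draft. The sketch also assumes the published counts are complete rather than a top-100 list, so a missing entry is read as "no instances here".

```python
from dataclasses import dataclass, field

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


@dataclass
class DatasetDescription:
    """Hypothetical summary an end point might publish about itself."""
    endpoint: str
    predicate_counts: dict = field(default_factory=dict)  # predicate URI -> triple count
    class_counts: dict = field(default_factory=dict)      # class URI -> instance count


def estimate(desc, pattern):
    """Upper-bound guess for matches of one (s, p, o) pattern; None is a variable."""
    s, p, o = pattern
    if p is None:
        # Unknown predicate: any triple in the set could match.
        return sum(desc.predicate_counts.values())
    if p == RDF_TYPE and o is not None:
        return desc.class_counts.get(o, 0)
    return desc.predicate_counts.get(p, 0)


def select_endpoints(descriptions, patterns):
    """Keep end points where every pattern may match, most selective first."""
    selected = []
    for desc in descriptions:
        estimates = [estimate(desc, pat) for pat in patterns]
        if all(n > 0 for n in estimates):
            selected.append((desc.endpoint, estimates))
    # Order by the smallest per-pattern estimate (assumed proxy for cost).
    selected.sort(key=lambda item: min(item[1], default=0))
    return selected


FOAF_PERSON = "http://xmlns.com/foaf/0.1/Person"
FOAF_NAME = "http://xmlns.com/foaf/0.1/name"

descriptions = [
    DatasetDescription(
        "http://a.example/sparql",
        predicate_counts={RDF_TYPE: 500, FOAF_NAME: 1000},
        class_counts={FOAF_PERSON: 400},
    ),
    DatasetDescription(
        "http://b.example/sparql",
        predicate_counts={"http://purl.org/dc/elements/1.1/title": 50},
    ),
]
patterns = [(None, RDF_TYPE, FOAF_PERSON), (None, FOAF_NAME, None)]
result = select_endpoints(descriptions, patterns)
```

With these made-up descriptions, only the first end point is kept, and the second is never queried at all. A remote cost model exposed through the protocol would replace the local estimate function with the end point's own guess.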

These have been discussed for a long time, on and off, in the context of
SPARQL end point introspection.  But now could be a time to do something, at
least experiment, since we see a certain amount of interest building around
the topic.  Strategically, we have to bet on both centralized models and
decentralized ones.  A centralized-only approach will hit problems as soon
as the workload is not uniform.  After all, the data web is also about
repurposing in novel ways.

Orri

  _____  

From: public-lod-request@w3.org [mailto:public-lod-request@w3.org] On Behalf
Of Martin Hepp
Sent: Thursday, June 19, 2008 3:09 PM
To: Yves Raimond
Cc: Giovanni Tummarello; Hausenblas, Michael; public-lod@w3.org; Semantic
Web
Subject: Re: The king is dressed in void

 

>However, there are some cases where you can't really afford that, for
>example when "looking inside" takes too much time - for example
>because of the size of "inside".
 
But how do you decide which part of the "inside" is contained in the
"outside" description? If you want all details from the inside in the
outside, then you have to replicate the whole inside - which does not gain
anything. And if the outside is just a subset (or even: proper abstraction)
of the inside, then you will face "false positive" (the outside indicates
something would be inside, but it actually isn't) and "false negative"
(there is something inside which the outside does not tell) situations. Now
for me the whole discussion boils down to the question of whether one can
produce good descriptions that are (1) substantially shorter than the inside
data and (2), on average, keep the false positive and false negative cases
low. So you would have to find a proper trade-off and then show by means of
a quantitative evaluation that there are relevant situations in which your
approach increases retrieval performance.
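
One way such a quantitative evaluation could be set up, sketched in Python under simple assumptions: the "outside" description and the "inside" data are each reduced to a set of terms, and a list of queried terms stands in for a workload. The function name and the sample data are hypothetical and only illustrate how the false positive and false negative rates would be counted.

```python
def summary_error_rates(outside, inside, queried_terms):
    """Return (false_positive_rate, false_negative_rate) of a description.

    outside: terms the description claims are present
    inside: terms actually present in the data
    queried_terms: terms looked up by the (assumed) workload
    """
    # Description says yes, data says no.
    false_pos = sum(1 for t in queried_terms if t in outside and t not in inside)
    # Description says no, data says yes.
    false_neg = sum(1 for t in queried_terms if t not in outside and t in inside)
    actual_neg = sum(1 for t in queried_terms if t not in inside)
    actual_pos = sum(1 for t in queried_terms if t in inside)
    fp_rate = false_pos / actual_neg if actual_neg else 0.0
    fn_rate = false_neg / actual_pos if actual_pos else 0.0
    return fp_rate, fn_rate


outside = {"a", "b", "x"}       # made-up description
inside = {"a", "b", "c", "d"}   # made-up actual content
rates = summary_error_rates(outside, inside, ["a", "b", "c", "d", "x", "y"])
```

Comparing these two rates against the size of the description relative to the data would give the trade-off described above.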
 
Btw, the problem seems to me pretty much analogous to full text vs.
keyword-based information retrieval. And I guess there the trend goes toward
clever indexing of the full inside data rather than relying on the manually
created outside description. From my experience, explicit keywords are now
less and less relevant.
 
 
Best
Martin
http://www.heppnetz.de

Received on Friday, 20 June 2008 08:49:01 UTC