- From: Orri Erling <erling@xs4all.nl>
- Date: Fri, 20 Jun 2008 10:47:13 +0200
- To: <martin.hepp@uibk.ac.at>, "'Yves Raimond'" <yves.raimond@gmail.com>
- Cc: "'Giovanni Tummarello'" <giovanni.tummarello@deri.org>, "'Hausenblas, Michael'" <michael.hausenblas@joanneum.at>, <public-lod@w3.org>, "'Semantic Web'" <semantic-web@w3.org>
- Message-Id: <200806200847.m5K8ls7Y018332@smtp-vbr12.xs4all.nl>
Hi all,

As I see it, VOID is about making possible a smart, somewhat transparent federation of linked data. Federating is the opposite of the search engine model.

As Giovanni will be the first to know, scale in the search engine space is purchased at the price of query expressivity. This will continue to be the case even if the search engine stores quads and runs SPARQL queries. A search engine is geared for lookups, whether text or SPARQL, not for running everybody's analytics on hardware paid for and operated by the search provider.

So there is a segment of the problem space that should be addressed outside of a Google-style approach. Either people build their own warehouses and pay for the hardware and management, or there just might be a slice that could be addressed by federated query shipping, dividing the cost more evenly.

VOID, along the lines drawn on the ESW wiki VOID page or my blog post at virtuoso.openlinksw.com/blog, would be a step in this direction, even a necessary step.

So, the point is:

1. How to compose the queries that one will ship over? This covers sameAs links and vocabularies, and whether one can expect URIs to match between sets or needs sameAs translations. Even more importantly, in Andy Seaborne's immortal words [in Beijing], the ultimate optimization is not to query at all. So know whether there are applicable instances at the end point to begin with.

2. Cardinalities. How much to expect? This could mean listing the top 100 high-cardinality predicates, classes and objects, for example. Even better would be to amend the SPARQL protocol to allow submitting a query to the remote cost model and getting the cost and cardinality guess back.

These have been discussed for a long time, on and off, in the context of SPARQL end point introspection. But now could be a time to do something, at least experiment, since we see a certain amount of interest building around the topic.

Strategically, we have to bet on both centralized models and decentralized ones.
A centralized-only approach will hit problems as soon as the workload is not uniform. After all, the data web is also about repurposing in novel ways.

Orri

_____

From: public-lod-request@w3.org [mailto:public-lod-request@w3.org] On Behalf Of Martin Hepp
Sent: Thursday, June 19, 2008 3:09 PM
To: Yves Raimond
Cc: Giovanni Tummarello; Hausenblas, Michael; public-lod@w3.org; Semantic Web
Subject: Re: The king is dressed in void

>However, there are some cases where you can't really afford that, for
>example when "looking inside" takes too much time - for example
>because of the size of "inside".

But how do you decide which part of the "inside" is contained in the "outside" description? If you want all details from the inside in the outside, then you have to replicate the whole inside, which does not gain anything. And if the outside is just a subset (or even a proper abstraction) of the inside, then you will face "false positive" situations (the outside indicates something would be inside, but it actually isn't) and "false negative" situations (there is something inside which the outside does not tell).

Now for me the whole discussion boils down to the question of whether one can produce good descriptions that (1) are substantially shorter than the inside data and (2), on average, keep the false positive and false negative cases low. So you would have to find a proper trade-off and then show by means of a quantitative evaluation that there are relevant situations in which your approach increases retrieval performance.

Btw, the problem seems to me pretty much analogous to full-text vs. keyword-based information retrieval. And I guess there the trend goes toward clever indexing of the full inside data rather than relying on the manually created outside description. From my experience, explicit keywords are now less and less relevant.

Best
Martin
http://www.heppnetz.de
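
As a rough illustration of the two ideas in this thread, the sketch below uses per-endpoint predicate counts (in the spirit of the "top 100 high-cardinality predicates" listing) to decide which end points are worth querying at all, and then measures the false positives and false negatives of that decision. Everything in it is an assumption made for illustration: the end point names, the counts, the function names and the min-based guess are hypothetical and are not part of VOID or of any SPARQL protocol.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class EndpointSummary:
    """Hypothetical 'outside' description: triple counts per predicate."""
    name: str
    predicate_counts: dict[str, int] = field(default_factory=dict)


def estimate_cardinality(summary: EndpointSummary, predicates: list[str]) -> int:
    """Crude guess for a conjunctive pattern: the rarest required predicate.

    A predicate missing from the summary counts as 0, so the end point
    is treated as having nothing applicable.
    """
    counts = [summary.predicate_counts.get(p, 0) for p in predicates]
    return min(counts) if counts else 0


def select_endpoints(summaries: list[EndpointSummary], predicates: list[str]) -> list[str]:
    """Return names of end points worth querying, smallest estimate first."""
    estimates = {s.name: estimate_cardinality(s, predicates) for s in summaries}
    return sorted((n for n, c in estimates.items() if c > 0), key=lambda n: estimates[n])


def evaluate(selected: set[str], actual: set[str]) -> tuple[set[str], set[str]]:
    """Compare the summary-based choice with the end points that really hold answers."""
    false_positives = selected - actual   # summary promised data that is not there
    false_negatives = actual - selected   # data exists that the summary did not reveal
    return false_positives, false_negatives


# Made-up summaries. "products" lists only its top predicates, so the few
# foaf:maker triples it is assumed to hold are invisible from the outside.
SUMMARIES = [
    EndpointSummary("music", {"foaf:maker": 120_000, "dc:title": 300_000}),
    EndpointSummary("geo", {"geo:lat": 500_000, "dc:title": 80_000}),
    EndpointSummary("products", {"dc:title": 40_000}),
]
QUERY_PREDICATES = ["foaf:maker", "dc:title"]

if __name__ == "__main__":
    chosen = select_endpoints(SUMMARIES, QUERY_PREDICATES)
    print("query only:", chosen)
    print("FP, FN:", evaluate(set(chosen), {"music", "products"}))
```

The truncated "products" summary shows the trade-off: a shorter outside description saves a round trip to "geo" but silently drops an end point that does hold matching data.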
Received on Friday, 20 June 2008 08:49:01 UTC