Re: Using Linking Open Data datasets from Giovanni Tummarello on 2008-05-30 (public-lod@w3.org from May 2008)

From: Giovanni Tummarello <giovanni.tummarello@deri.org>
Date: Fri, 30 May 2008 02:31:59 +0100
To: "Peter Ansell" <ansell.peter@gmail.com>
Cc: "Hugh Glaser" <hg@ecs.soton.ac.uk>, "public-lod@w3.org" <public-lod@w3.org>
Message-ID: <210271540805291831w19a55483i82d6438cfcf72717@mail.gmail.com>
Hi,

> endpoint, just whether sindice overall would publish these endpoints
> or links to known sitemaps possibly where information can be found,
> including the extra information about the dbpedia topics that are
> related to the individual endpoints to promote discovery. For the

As per my previous answer: we might return SPARQL endpoints as well,
why not. For the moment we return links to RDF documents (or pages
with embedded microformats; an API to get the RDF, better than the
usual GRDDL, is coming soon).

> moment the static web page approach that is taken by
> http://www.sindice.com/map is okay, although it would be nice as I say
> to also produce it as RDF in the linked data way.
>



> Sindice has limited resources I figure, and being able to utilise each
> producers sparql endpoint for queries instead of sindice would be
> valuable to everyone.

Sindice doesn't answer SPARQL queries but has its own (simple) query
language. It's not meant to be a substitute, just a place to find
general directions. Then you, as a client or integrator, can of
course interact directly with the target data provider's SPARQL
endpoint (e.g. discovered via a semantic sitemap).
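
To illustrate the client-side discovery described above, here is a minimal Python sketch that reads a semantic sitemap and lists the SPARQL endpoints it advertises. It is only an illustration under stated assumptions: the `sc` namespace URI and the `sc:dataset` wrapper element are assumed from the Semantic Sitemap extension document, the sample sitemap is hypothetical (only the two `sc:` values come from this thread), and `discover_endpoints` is a made-up helper name, not part of Sindice.

```python
import xml.etree.ElementTree as ET

# Assumed namespace URI for the sitemap extension; verify it against the
# sitemap you actually fetch before relying on it.
NS = {"sc": "http://sw.deri.org/2007/07/sitemapextension/scschema.xsd"}

# Hypothetical sitemap. The linkedDataPrefix and sparqlEndpointLocation
# values are the ones quoted further down in this thread.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:linkedDataPrefix slicing="scbd">http://dblp.rkbexplorer.com/id/</sc:linkedDataPrefix>
    <sc:sparqlEndpointLocation>http://dblp.rkbexplorer.com/sparql/</sc:sparqlEndpointLocation>
  </sc:dataset>
</urlset>
"""


def discover_endpoints(sitemap_xml):
    """Return one dict per SPARQL endpoint advertised in the sitemap."""
    root = ET.fromstring(sitemap_xml)
    found = []
    # Each sc:dataset is assumed to be a direct child of the urlset root.
    for dataset in root.findall("sc:dataset", NS):
        prefix = dataset.find("sc:linkedDataPrefix", NS)
        for endpoint in dataset.findall("sc:sparqlEndpointLocation", NS):
            found.append({
                "endpoint": (endpoint.text or "").strip(),
                "prefix": (prefix.text or "").strip() if prefix is not None else None,
                "slicing": prefix.get("slicing") if prefix is not None else None,
            })
    return found


if __name__ == "__main__":
    for entry in discover_endpoints(SAMPLE_SITEMAP):
        print(entry["endpoint"], entry["prefix"], entry["slicing"])
```

A client would then send its SPARQL queries to the discovered endpoint directly rather than to Sindice.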



> Btw, the map says something about billions of pieces of information
> but the index that is used says: "Now searching index V1 (around 7.02
> million documents and counting). Also try index V0 (26.9 M) " How much
> of a dump is actually utilised in the search index in order to
> compress the billions down to millions this way.

Billions of pieces of information = triples, or the microformat
equivalent of triples.
Currently there are over 2 billion in the beta0 index. (Beta1 is growing,
but it will take time, since some things need to be fixed before we can
index much faster.) Beta0 only has RDF sources, so most of it is dumps
collected via semantic sitemaps.
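
As a rough illustration of how many triples can collapse into far fewer indexed documents, the sketch below groups the triples of a dump by subject, so that each resource becomes one "document". This is not Sindice's actual slicing code: it assumes an N-Triples dump, uses naive whitespace splitting instead of a real parser, and the sample data and the `slice_by_subject` name are hypothetical.

```python
from collections import defaultdict

# Hypothetical three-triple dump in N-Triples syntax.
SAMPLE_DUMP = """\
<http://example.org/id/a> <http://purl.org/dc/elements/1.1/title> "Paper A" .
<http://example.org/id/a> <http://purl.org/dc/elements/1.1/creator> <http://example.org/id/p1> .
# comment lines and blank lines are skipped

<http://example.org/id/p1> <http://xmlns.com/foaf/0.1/name> "Person One" .
"""


def slice_by_subject(ntriples_text):
    """Group triple lines by their subject term (subject-only slicing)."""
    slices = defaultdict(list)
    for line in ntriples_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # In N-Triples the subject is the first whitespace-delimited token.
        subject = line.split(None, 1)[0]
        slices[subject].append(line)
    return dict(slices)


if __name__ == "__main__":
    slices = slice_by_subject(SAMPLE_DUMP)
    triples = sum(len(lines) for lines in slices.values())
    print(f"{triples} triples -> {len(slices)} per-resource documents")
```

Slicing modes that also pull in triples where the resource appears as object, or that follow blank nodes, would need more than this subject-only grouping.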

Giovanni


>
> Cheers,
>
> Peter
>
> 2008/5/30 Giovanni Tummarello <giovanni.tummarello@deri.org>:
>> Hi Peter,
>>
>> Sindice will not go and probe your sparql endpoint. The Sitemap
>> directive for exposing sparql endpoints is therefore mostly to be used
>> by clients or other forms of integration which do not involve indexing
>> the entire RDF model (or models) you offer.
>>
>> Sindice might return SPARQL endpoints in the future (why not); it
>> will certainly use the dump files you might want to provide and slice
>> them instead of crawling when a sitemap (and a dump!) is available.
>> The end result is that you serve a single file (the dump) but you can
>> find your resolvable URIs (URLs) returned as results when a matching
>> query is answered (e.g.
>> http://www.sindice.com/search?q=semantic+sitemaps&qt=keyword where the
>> first result (the talk) and the last (the paper) were indexed without
>> crawling the site).
>>
>> hope this helps
>>
>> Giovanni
>>
>>
>> On Fri, May 30, 2008 at 1:39 AM, Peter Ansell <ansell.peter@gmail.com> wrote:
>>> Does sindice utilise the SPARQL related pieces in any way for internal
>>> processing? Does it understand or replicate the slicing mode?
>>>
>>> <sc:linkedDataPrefix
>>> slicing="scbd">http://dblp.rkbexplorer.com/id/</sc:linkedDataPrefix>
>>> <sc:sparqlEndpointLocation>http://dblp.rkbexplorer.com/sparql/</sc:sparqlEndpointLocation>
>>>
>>> If my understanding is correct, this is aimed at a search engine
>>> mostly... so it should publish this information when it finds it in a
>>> directory of sorts to be most useful. Does sindice republish this
>>> information in some form to allow directory based access to the
>>> different linked data endpoints/sites?
>>>
>>> Cheers,
>>>
>>> Peter
>>>
>>> 2008/5/30 Giovanni Tummarello <giovanni.tummarello@deri.org>:
>>>>
>>>> A validator in sindice is possible and has been discussed but the list
>>>> of things to do is now quite scary :-)
>>>>
>>>> Poor man's validator: please post about your sitemap here:
>>>> http://forum.sindice.com/index.php . Free report and quick indexing to
>>>> those who do.
>>>>
>>>> Giovanni
>>>>
>>>>> Mind you, Giovanni says that a lot of sitemaps are broken, so they fix them
>>>>> and cache the fixed ones for Sindice purposes :-)
>>>>>
>>>>>
>>>>> On 30/05/2008 00:02, "Peter Ansell" <ansell.peter@gmail.com> wrote:
>>>>>
>>>>> ...
>>>>>>>
>>>>>>> Richard
>>>>>>>
>>>>>>> [1] http://sw.deri.org/2007/07/sitemapextension/
>>>>>>
>>>>>> That looks very usable to me. Has anyone used it for linked data? How
>>>>>> do you discover these sitemaps as a linked data user, as opposed to
>>>>>> sitemaps which are traditionally submitted to search engines for
>>>>>> searching. In either case, it would be nice to have an RDF description
>>>>>> submitted as part of a sitemap to a semantic search engine so it might
>>>>>> be good to standardise that mechanism based around these ideas.
>>>>>>
>>>>>> Also, there is a reference in that document to N-Quad format, what is
>>>>>> that exactly? [2] is a bit sparse on examples so it is hard to
>>>>>> understand what is meant by the syntax.
>>>>>>
>>>>>> Also, is the slicing declaration attempting to make up for a deficit
>>>>>> in the SPARQL protocol w.r.t. DESCRIBE? Why not utilise SELECT if you
>>>>>> had an idea of what pieces of information you desire, although I guess
>>>>>> the server is in the best position to recommend information to you
>>>>>> with DESCRIBE queries. I think slicing mechanisms should be defined
>>>>>> outside of that context, although the lack of progress with CBD [3] is
>>>>>> a little worrying with respect to that bit.
>>>>>>
>>>>>> [2] http://sw.deri.org/2008/02/nx/
>>>>>> [3] http://www.w3.org/Submission/CBD/
>>>>>>
>>>>>> Cheers,
>>>>>>
>>>>>> Peter
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>
Received on Friday, 30 May 2008 01:32:39 UTC