Re: Billion Triples Challenge Crawl 2014

Thanks Michel,
I think your point is well made.

On 18 Feb 2014, at 04:42, Michel Dumontier <michel.dumontier@gmail.com> wrote:

> Hi Tim,
>   That folder contains 350GB of compressed RDF. I'm not about to unzip it because a crawler can't decompress it on the fly.  Honestly, it worries me that people aren't considering the practicalities of storing, indexing, and presenting all this data. 
>   Nevertheless, Bio2RDF does provide void definitions, URI resolution, and access to SPARQL endpoints.  I can only hope our data gets discovered.
So this is more than enough to make it Linked Data, and hard work it must be too.
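As an aside on the gzip point: decompressing on the fly is not hard in principle. A rough Python sketch of the mechanics - the URL here is just a placeholder, not a real dump file, and a real crawler would hand each line to its N-Quads parser:

import gzip
import urllib.request

def iter_nquads(url):
    # Stream a gzipped N-Quads dump line by line, without
    # unpacking the whole file to disk first.
    with urllib.request.urlopen(url) as resp:
        with gzip.GzipFile(fileobj=resp) as stream:
            for line in stream:
                yield line.decode("utf-8").rstrip("\n")

for quad in iter_nquads("http://example.org/dump.nq.gz"):
    pass  # hand each line to the parser here

Whether that is easy to retrofit into an existing crawler is another matter, of course.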

I had forgotten that all your URIs are resolvable (even though I have harvested them from your dumps for sameAs.org! - it has been a while since I last did).
I share your pain in making URIs discoverable - I have quite a few sites with many millions of URIs (such as sameAs.org), and am resigned to the fact that they will be in some sense under-represented in the Crawl. In the case of sameAs.org especially, its use is essentially in a dynamic context, so it is unlikely that links to it will be found elsewhere.
(Actually, as I type that I realise it is no longer so true, as the Ordnance Survey has added rdfs:seeAlso links from all its URIs.)

I think the issue here is how the Crawl gets the URIs to crawl.
Personally I think that one solution is the existing one of sitemaps, possibly augmented by voiD.
In fact, sitemaps are how I crawl all the ePrints repositories, harvesting by doing URI resolution on each URI.
So http://researchrepository.murdoch.edu.au leads me to http://researchrepository.murdoch.edu.au/sitemap.xml, which leads to http://researchrepository.murdoch.edu.au/id/repository, which has all the URIs in it.
In that case I think it is a Semantic Sitemap, but it could be an ordinary one; a sketch of the semantic variety is below.
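For reference, a Semantic Sitemap entry looks roughly like this - I am sketching the extension from memory, so treat the element names as indicative and the dataset values as made up:

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:sc="http://sw.deri.org/2007/07/sitemapextension/scschema.xsd">
  <sc:dataset>
    <sc:datasetLabel>Example dataset</sc:datasetLabel>
    <sc:datasetURI>http://example.org/dataset</sc:datasetURI>
    <sc:linkedDataPrefix>http://example.org/id/</sc:linkedDataPrefix>
    <sc:sparqlEndpointLocation>http://example.org/sparql</sc:sparqlEndpointLocation>
    <sc:dataDumpLocation>http://example.org/dump.nq.gz</sc:dataDumpLocation>
  </sc:dataset>
</urlset>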
For an ordinary one: e.g. http://www.amazon.co.uk leads to http://www.amazon.co.uk/robots.txt, which leads to http://www.amazon.co.uk/sitemap-manual-index.xml, which leads to things like http://www.amazon.co.uk/sitemap_brands.xml

As I said earlier, I think doing it via URI resolution is great, and sitemaps would make it easy for the Crawl team and others to extend LDSpider when they get the time.
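To make that concrete, here is a minimal sketch of the loop in Python - not LDSpider's actual code, just an illustration of robots.txt to sitemap to URI resolution, with example.org standing in for a real host:

import urllib.request
import xml.etree.ElementTree as ET

SM_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(robots_url):
    # Collect the "Sitemap:" lines advertised in robots.txt.
    with urllib.request.urlopen(robots_url) as resp:
        text = resp.read().decode("utf-8", "replace")
    return [line.split(":", 1)[1].strip()
            for line in text.splitlines()
            if line.lower().startswith("sitemap:")]

def sitemap_locs(sitemap_url):
    # Pull every <loc> out of a sitemap or sitemap index.
    with urllib.request.urlopen(sitemap_url) as resp:
        tree = ET.parse(resp)
    return [loc.text.strip() for loc in tree.iter(SM_NS + "loc")]

def resolve(uri):
    # Content-negotiate for RDF, as a Linked Data crawler would.
    req = urllib.request.Request(
        uri, headers={"Accept": "application/rdf+xml, text/turtle"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

for sm in sitemap_urls("http://example.org/robots.txt"):
    for uri in sitemap_locs(sm):
        data = resolve(uri)  # hand the RDF to the indexer here

Gzipped sitemap files (like the .xml.gz ones in Amazon's robots.txt) would need the same streaming decompression as in the earlier sketch, but the shape of the loop is the same.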

Of course, what you or I could do is provide a page of sample URIs somewhere, and then ask Andreas to include that page as part of the Crawl?

By the way, I didn’t find your voiD document or a sitemap, I’m afraid - perhaps you could add something like what Amazon has at the bottom of its robots.txt:
# Sitemap files
Sitemap: http://www.amazon.co.uk/sitemap-manual-index.xml
Sitemap: http://www.amazon.co.uk/sitemaps.f3053414d236e84.SitemapIndex_0.xml.gz
Sitemap: http://www.amazon.co.uk/sitemaps.1946f6b8171de60.SitemapIndex_0.xml.gz

Very best
Hugh

> 
> m.
> 
> Michel Dumontier
> Associate Professor of Medicine (Biomedical Informatics), Stanford University
> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
> http://dumontierlab.com
> 
> 
> On Sat, Feb 15, 2014 at 10:31 PM, Tim Berners-Lee <timbl@w3.org> wrote:
> On 2014-02-14, at 09:46, Michel Dumontier wrote:
> 
>> Andreas,
>> 
>>  I'd like to help by getting Bio2RDF data into the crawl, really, but we gzip all of our files, and they are in N-Quads format.
>> 
>> http://download.bio2rdf.org/release/3/
>> 
>> Think you can add gzip/bzip2 support?
>> 
>> m.
>> 
>> Michel Dumontier
>> Associate Professor of Medicine (Biomedical Informatics), Stanford University
>> Chair, W3C Semantic Web for Health Care and the Life Sciences Interest Group
>> http://dumontierlab.com
>> 
> 
> And on 2014-02-15, at 18:00, Hugh Glaser wrote:
> 
>> Hi Andreas and Tobias.
>> Good luck!
>> Actually, I think essentially ignoring dumps and doing a “real” crawl is a feature, rather than a bug.
> 
> 
> Michel, 
> 
> Agree with Hugh. I would encourage you to unzip the data files on your own servers
> so the URIs will work and your data is really Linked Data.
> There are lots of advantages to the community to be compatible.
> 
> Tim 
> 
> 
> 

-- 
Hugh Glaser
   20 Portchester Rise
   Eastleigh
   SO50 4QS
Mobile: +44 75 9533 4155, Home: +44 23 8061 5652

Received on Tuesday, 18 February 2014 11:54:33 UTC