Re: Updated LOD Cloud Diagram - Missed data sources.

Hi David, Mark, and Hugh,

>>>>> But I wonder where so many other sites (including mine) went ?
>>>> 
>>>> The problem with crawling the Web of Linked Data is really that it is hard to get the datasets on the edges that set RDF links to other sources but are not the target of links from well-connected sources.
>>> 
>>> I'm curious, why you don't just crawl the whole Web looking for linked data?

Sorry, we are not Google and simply did not have the resources to crawl the whole Web and ask for RDF/XML when dereferencing each URL.
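
To illustrate what "asking for RDF/XML when dereferencing" means, here is a minimal Python sketch of the content-negotiation step. It only builds the request and checks a returned Content-Type; it does not send anything. The helper names build_rdf_request and looks_like_rdf_xml are made up for this example and are not part of our crawler.

```python
from urllib.request import Request

RDF_XML = "application/rdf+xml"


def build_rdf_request(uri):
    """Build (but do not send) an HTTP request that asks for RDF/XML."""
    # The Accept header is how a client requests a particular representation.
    return Request(uri, headers={"Accept": RDF_XML})


def looks_like_rdf_xml(content_type):
    """Return True if a Content-Type header value denotes RDF/XML."""
    # Drop parameters such as "; charset=utf-8" before comparing.
    return content_type.split(";")[0].strip().lower() == RDF_XML
```

A crawler would pass the request to urllib.request.urlopen and apply looks_like_rdf_xml to the Content-Type of the response; doing that for every URL on the Web is exactly the cost we could not afford.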

>> Or better yet, work with one of the search engines or Open Crawl so you can use their indexes. 
> Well there is possibly a quick answer to this.
> Google, at least, doesn’t index Linked Data.
> Well, certainly not the kind that does conneg.
> See other recent messages on this list about the problem of SEO of Linked Data, which is another side of the same coin.

Yes, as Hugh already pointed out, the major search engines do not index Linked Data. The same is true for the CommonCrawl corpus (http://commoncrawl.org/), which also includes only HTML pages and no RDF documents.

Alternatively, one could of course search for HTML documents that contain links pointing at RDF/Linked Data documents (for instance using <link rel="alternate" type="application/rdf+xml" ...> in the header part of an HTML document).
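
A minimal sketch of such a check, using only the Python standard library: it collects the href of every <link rel="alternate" type="application/rdf+xml"> element in an HTML document. The class and function names are invented for this example, and restricting the match to application/rdf+xml is an assumption (other RDF media types could be added to the set).

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumption: only RDF/XML is matched; extend the set for other serializations.
RDF_TYPES = {"application/rdf+xml"}


class AlternateLinkFinder(HTMLParser):
    """Collect hrefs of <link rel="alternate"> elements with an RDF media type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.rdf_links = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        a = {name: (value or "") for name, value in attrs}
        rels = a.get("rel", "").lower().split()
        mime = a.get("type", "").strip().lower()
        if "alternate" in rels and mime in RDF_TYPES and a.get("href"):
            # Resolve relative hrefs against the URL of the HTML page.
            self.rdf_links.append(urljoin(self.base_url, a["href"]))


def find_rdf_alternates(html, base_url):
    """Return absolute URLs of RDF alternates advertised in an HTML page."""
    finder = AlternateLinkFinder(base_url)
    finder.feed(html)
    return finder.rdf_links
```

The URLs returned this way would only be candidates; each one still has to be fetched and inspected before it can serve as a crawl seed.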

We did not do this for our current crawl, and using a high-coverage search engine like Google for this does not work either, as Google does not provide HTML source code search (at least as far as I know).

A search for "rel='alternate' type='application/rdf+xml'" on the alternative search engine NerdyData (https://search.nerdydata.com/), which offers source code search, indicates that at least 15,000 websites offer such links.

But looking at the first results, it appears that most of these sites point at RSS feeds rather than at Linked Data documents (RDF documents that contain RDF links pointing at other data items potentially in other data sources).
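
One rough way to separate the two cases automatically is sketched below. It is a heuristic for illustration only, not something we ran: it flags a document as a feed if it contains an RSS 1.0 channel element, and it lists the foreign hosts that rdf:resource attributes point at. The function name classify_rdf_xml and the returned dictionary keys are invented for this example, and the sketch only handles RDF/XML.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RSS10_NS = "http://purl.org/rss/1.0/"


def classify_rdf_xml(xml_text, source_host):
    """Heuristically describe an RDF/XML document.

    Returns a dict with:
      rss_feed        True if the document contains an RSS 1.0 <channel>
      external_hosts  sorted hosts other than source_host that are
                      referenced through rdf:resource attributes
    """
    root = ET.fromstring(xml_text)
    channel_tag = "{%s}channel" % RSS10_NS
    resource_attr = "{%s}resource" % RDF_NS

    is_rss = False
    external = set()
    for element in root.iter():
        if element.tag == channel_tag:
            is_rss = True
        target = element.get(resource_attr)
        if target:
            host = urlparse(target).netloc
            if host and host != source_host:
                external.add(host)
    return {"rss_feed": is_rss, "external_hosts": sorted(external)}
```

A document that is not a feed and references at least one external host would be a plausible Linked Data candidate, but feeds also link to other hosts, so the two results need to be read together.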

It would be great if somebody would investigate this more deeply and produce a list of Linked Data URIs that could be used as seeds for further crawls.

Concerning our crawl, one also needs to keep in mind that we only sampled each Linked Data site and did not crawl all the URIs that we discovered within each site. By crawling deeper, it is quite possible that you would find documents that contain additional links pointing at previously unknown Linked Data sites.

Tobias Käfer and Andreas Harth from KIT are currently working on a more complete Linked Data crawl (for the Billion Triple Challenge 2014), and I'm very much looking forward to this corpus being released and to seeing how many Linked Data sites they discovered.

Cheers,

Chris



> Checking Google:
> Looking at http://dbpedia.org/resource/Birching
> If I take a URI from (the RDF I get from) that page, and search for it in Google, I think I would expect it to take me to quite a few RDF documents in various formats.
> But, for example,
> https://www.google.com/#filter=0&q=%22http://ru.dbpedia.org/resource/Розги%22
> (asking for all results in the filter=0), shows no RDF documents at all.
> Of course, RDF documents would have …/data/… in them, rather than …/resource/… or …/page/… And, in fact, searching for dbpedia/data
> https://www.google.com/#q=%22dbpedia.org%2Fdata%22
> only gives 1.2M hits, which is way short of what it would be.
>
> Not my field, so I may have it wrong, but I felt like checking it out on a stormy Sunday afternoon!
>
> Best
> Hugh
>
>> 
>> Regards,
>> Dave
>> --
>> http://about.me/david_wood
>> Sent from my iPad
>> 
>>> 
>>> Mark.

Received on Monday, 18 August 2014 10:12:47 UTC