Re: ANN: WebDataCommons.org - Offering 3.2 billion quads current RDFa, Microdata and Microformat data extracted from 65.4 million websites from Dan Brickley on 2012-04-17 (public-lod@w3.org from April 2012)

From: Dan Brickley <danbri@danbri.org>
Date: Tue, 17 Apr 2012 19:22:10 +0200
To: Peter Mika <pmika@yahoo-inc.com>
Cc: Martin Hepp <martin.hepp@ebusiness-unibw.org>, "public-vocabs@w3.org Vocabularies" <public-vocabs@w3.org>, "public-lod@w3.org" <public-lod@w3.org>, Chris Bizer <chris@bizer.de>
Message-ID: <CAFfrAFoD1uCDVvDpNhS8kLXpL24EUZ8BYdfM0Kn8xH8_P7NQCA@mail.gmail.com>

On 17 April 2012 18:56, Peter Mika <pmika@yahoo-inc.com> wrote:
>
> Hi Martin,
>
> It's not as simple as that, because PageRank is a probabilistic algorithm (it includes random jumps between pages), and I wouldn't expect that wayfair.com would include 2M links on a single page (that would be one very long webpage).
>
> But again to reiterate the point, search engines would want to make sure that they index the main page more than they would want to index the detail pages.
>
> You can do a site query to get a rough estimate of the ranking without a query string:
>
> search.yahoo.com/search?p=site%3Awayfair.com
>
> You will see that most of the pages are category pages. If you go to the second page and onward, you will see an estimate of 1900 pages indexed.
>
> Of course, I agree with you that a search engine focused on structured data, especially if domain-specific, might want to reach all the pages and index all the data. I'm just saying that current search engines don't, and CommonCrawl is mostly trying to approximate them (if I understand correctly what they are trying to do).
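
(An aside on the "random jumps" point quoted above: a minimal sketch of the textbook PageRank power iteration, not what any particular search engine or CommonCrawl actually runs. The damping value 0.85, the function name and the toy link graph are all assumptions made up for illustration. It only shows why a main page that every detail page links back to ends up ranked above the detail pages.)

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    With probability (1 - damping) the surfer jumps to a random page;
    otherwise it follows one of the current page's outgoing links.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts with the share contributed by random jumps.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly across its outgoing links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page (no outgoing links): spread its rank uniformly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical toy site: a home page, two category pages, three detail pages.
toy_site = {
    "home": ["cat1", "cat2"],
    "cat1": ["home", "item1", "item2"],
    "cat2": ["home", "item3"],
    "item1": ["home"],
    "item2": ["home"],
    "item3": ["home"],
}

ranks = pagerank(toy_site)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page:6s} {score:.4f}")
```

In this toy graph the home page collects rank from every other page, so it comes out on top, with category pages next and detail pages last.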

According to http://commoncrawl.org/faq/

"What do you intend to do with the crawled content?
Our mission is to democratize access to web information by producing
and maintaining an open repository of web crawl data that is
universally accessible. We store the crawl data on Amazon’s S3
service, allowing it to be bulk downloaded as well as directly
accessed for map-reduce processing in EC2."

No mention of search as such. I'd imagine they're open to suggestions,
and that the project (and crawl) could take various paths as it
evolves. (With corresponding influence on the stats...).

Our problem here is in figuring out what can be taken from such stats
to help guide linked data vocabulary creation and management. Maybe
others will do deeper focussed crawls, who knows? But it's great to
see this focus on stats lately; I hope others have more to share.

Dan

Received on Tuesday, 17 April 2012 17:22:42 UTC