Large hyperlink graph published, covering 3.5 billion web pages and 128 billion hyperlinks from Robert Meusel on 2013-11-12 (public-html@w3.org from November 2013)

From: Robert Meusel <robert@informatik.uni-mannheim.de>
Date: Tue, 12 Nov 2013 14:33:53 +0100
To: public-html@w3.org
Message-ID: <52822E41.5010103@informatik.uni-mannheim.de>

Hi all,

the Web Data Commons team is happy to announce the publication of a new 
large hyperlink graph.

The graph has been extracted from the Common Crawl 2012 web corpus [1] 
and covers 3.5 billion web pages and 128 billion hyperlinks between 
these pages. To the best of our knowledge, the graph is the largest 
hyperlink graph that is available to the public.

The graph can be downloaded in various formats from

http://webdatacommons.org/hyperlinkgraph

We provide initial statistics about the topology of the graph at

http://webdatacommons.org/hyperlinkgraph/topology.html

We hope that the graph will be useful for researchers who develop

·Search algorithms that rank results based on the hyperlinks between pages.

·SPAM detection methods which identify networks of web pages that are 
published in order to trick search engines.

·Graph analysis algorithms and who can use the hyperlink graph for testing 
the scalability and performance of their tools.

·Web Science analyses of the linking patterns within 
specific topical domains in order to identify the social mechanisms that 
govern these domains.

We want to thank the Common Crawl project for providing their great web 
crawl and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research 
project PlanetData and by Amazon Web Services. We thank our sponsors a lot.

Best Regards,

Chris, Oliver & Robert

[1] http://commoncrawl.org/

Received on Tuesday, 12 November 2013 13:34:15 UTC