ANN: Large hyperlink graph published, covering 3.5 billion web pages and 128 billion hyperlinks

Hi all,

 

We are happy to announce the publication of a new large hyperlink graph.

 

The WDC Hyperlink Graph has been extracted from the Common Crawl 2012 web
corpus [1] and covers 3.5 billion web pages and 128 billion hyperlinks
between these pages.  To the best of our knowledge, the graph is the largest
hyperlink graph that is available to the public.

 

As the graph covers hyperlinks between web pages and not hyperlinks between
data items, it is of course a bit off-topic for this list, but we thought it
might still be interesting to some people on this list, maybe for comparing
linking patterns in the LOD cloud with linking patterns in the classic
document Web.

 

The graph can be downloaded in various formats from 

 

http://webdatacommons.org/hyperlinkgraph

 

We provide initial statistics about the topology of the graph at

 

http://webdatacommons.org/hyperlinkgraph/topology.html

 

We hope that the graph will be useful for researchers who

+ develop search algorithms that rank results based on the hyperlinks
between pages.

+ develop spam detection methods which identify networks of web pages that
are published in order to trick search engines.

+ develop graph analysis algorithms and can use the hyperlink graph for
testing the scalability and performance of their tools.

+ work in Web Science and want to analyze the linking patterns within
specific topical domains in order to identify the social mechanisms that
govern these domains.
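
To illustrate the first use case, here is a minimal sketch of link-based
ranking (PageRank by power iteration) over a tiny, made-up edge list. The
(source, target) pair representation, the function name pagerank and its
parameters are assumptions for illustration only; they do not describe the
actual file formats of the graph, which are documented on the download page.
At the scale of the full graph a distributed or out-of-core implementation
would be needed.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over (source, target) pairs (illustrative only)."""
    nodes = sorted({node for edge in edges for node in edge})
    if not nodes:
        return {}

    # Adjacency list: every node gets an entry, even without outgoing links.
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for src, targets in out_links.items():
            if not targets:
                continue
            # Each page passes an equal share of its rank to every link target.
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


# Hypothetical toy graph: page "d" links to "a" but has no incoming links.
toy_edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
toy_ranks = pagerank(toy_edges)
for page in sorted(toy_ranks, key=toy_ranks.get, reverse=True):
    print(page, round(toy_ranks[page], 4))
```

The ranks always sum to 1, and a page without incoming links (here "d")
ends up with the minimum score of (1 - damping) / n.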

 

We want to thank the Common Crawl project for providing their great web
crawl and thus enabling the creation of the WDC Hyperlink Graph.

The creation of the WDC Hyperlink Graph was supported by the EU research
project PlanetData and by Amazon Web Services.  We thank our sponsors a
lot.

 

Best,

 

Christian Bizer, Oliver Lehmberg & Robert Meusel

 

[1] http://commoncrawl.org/

 

--

Prof. Dr. Christian Bizer

Data and Web Science Research Group

Universität Mannheim, Germany 
chris@informatik.uni-mannheim.de

www.bizer.de

 

Received on Tuesday, 12 November 2013 16:37:44 UTC