[ANN] Distributed DBpedia Extraction (Open Beta)

Dear all,

We are happy to announce an early beta version of Distributed DBpedia
Extraction with Hadoop / Spark. Things are still rough, but we would like
beta testers to report their experiences and, of course, their extraction
times. :)

https://github.com/dbpedia/distributed-extraction-framework

Read on if you are interested
==================================

Right now we only support extraction, which means that you still need to
download the dumps with the existing method (distributed downloading is our
next step).

Setting up the framework and performing a distributed extraction is fairly
easy; we have outlined all the details in the README and added a script for
firing up a Spark+HDFS cluster quickly on Google Compute Engine.

For a single language, the whole extraction job (including redirects) is
executed in parallel. If you add multiple languages, all jobs are submitted
to Spark, and based upon Spark’s configured scheduling mode, they’ll be
scheduled over the cluster in parallel either in a FIFO (default) or FAIR
manner.
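
To make the FIFO / FAIR difference concrete, here is a toy simulation in
plain Python. It is only a rough sketch and not how Spark's scheduler is
actually implemented: every task is assumed to take one time step, the
function name `simulate` and the task counts are made up for illustration,
and the 8 slots correspond to the 2 slaves x 4 workers used in our test
cluster.

```python
def simulate(jobs, slots, mode="FIFO"):
    """Toy scheduler: return {job_name: step at which the job finishes}.

    jobs  -- dict of job name -> number of unit-length tasks, in
             submission order (hypothetical numbers, not measurements)
    slots -- number of tasks the cluster can run per time step
    mode  -- "FIFO" or "FAIR"
    """
    remaining = dict(jobs)  # dicts keep insertion (submission) order
    finished = {}
    step = 0
    while remaining:
        step += 1
        free = slots
        if mode == "FIFO":
            # Earlier jobs grab as many slots as they can use; later
            # jobs only get what is left over.
            for name in list(remaining):
                take = min(free, remaining[name])
                remaining[name] -= take
                free -= take
                if free == 0:
                    break
        elif mode == "FAIR":
            # Hand out slots one at a time, round-robin over the jobs.
            while free > 0 and any(n > 0 for n in remaining.values()):
                for name in list(remaining):
                    if free == 0:
                        break
                    if remaining[name] > 0:
                        remaining[name] -= 1
                        free -= 1
        else:
            raise ValueError("mode must be 'FIFO' or 'FAIR'")
        # Record and drop every job that ran out of tasks in this step.
        for name in [n for n, left in remaining.items() if left == 0]:
            finished[name] = step
            del remaining[name]
    return finished


# Two languages submitted in this order; task counts are invented.
jobs = {"en": 16, "de": 8}
print(simulate(jobs, slots=8, mode="FIFO"))
print(simulate(jobs, slots=8, mode="FAIR"))
```

In this toy model the large "en" job finishes first under FIFO while "de"
waits, whereas under FAIR both jobs share the slots and the smaller "de"
job finishes earlier. In a real cluster the mode is selected with Spark's
`spark.scheduler.mode` property.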

We ran some tests on a small 3-node cluster: 1 master (2 cores, 7.5 GB RAM,
GCE n1-standard-2) and 2 slaves (4 cores, 15 GB RAM each, GCE
n1-standard-4) with 4 workers on each slave. Using the English Wikipedia,
the distributed framework took a total of 3 hours 21 minutes to finish the
extraction (including the pre-extraction redirects computation). We’ll add
more tests and benchmarks to the GitHub wiki pages very soon.

Any feedback is more than welcome. We keep track of our future tasks and
bugs on GitHub:
https://github.com/dbpedia/distributed-extraction-framework/issues

Cheers,
Nilesh, Sang & Dimitris

Acknowledgements: This project is sponsored by the Google Summer of Code
program.
https://www.google-melange.com/gsoc/project/details/google/gsoc2014/nileshc/5841554954518528



You can also email me at contact@nileshc.com or visit my website
<http://nileshc.com/>

Received on Thursday, 31 July 2014 07:26:43 UTC