- From: Paul Houle <ontology2@gmail.com>
- Date: Wed, 18 Sep 2013 11:15:26 -0400
- To: Linked Data community <public-lod@w3.org>, "semantic-web@w3.org" <semantic-web@w3.org>
- Message-ID: <CAE__kdR8EXVuMN4zv=+QCwBFF5+hn_pC6GzC2VcqZjg8OEjoXg@mail.gmail.com>
I am reporting the first really useful product from the work on Infovore, an open source framework for processing large RDF data sets with Hadoop. Even if you have no experience with Hadoop, you can run Infovore in the AWS cloud by simply providing your AWS credentials.

As of https://github.com/paulhoule/infovore/releases/tag/t20130917 there is a first draft of 'sieve3', which splits an RDF data set into mutually exclusive parts. A list of rules is applied to each triple: the first rule a triple matches diverts it to a particular output, and triples that match no rule fall into the 'other' output. The horizontal subdivision looks like this: http://www.slideshare.net/paulahoule/horizontal-decomposition-of-freebase

Here are the segments:

- `a` -- rdf:type
- `description`
- `key` -- keys represented as expanded strings
- `keyNs` -- keys represented in the key namespace
- `label` -- rdfs:label
- `name` -- type.object.name entries that are probably duplicative of rdfs:label
- `text` -- additional large text blobs
- `web` -- links to external web sites
- `links` -- all other triples where the ?o is a URI
- `other` -- all other triples where the ?o is a Literal

Overall this segmentation isn't all that different from how DBpedia is broken down. Last night I downloaded 4.5 GB worth of data from `links` and `other` out of the 20 GB dump supplied by Freebase, and I expect to be able to write interesting SPARQL queries against it. The process is fast, completing in about half an hour with a smallAwsCluster.

I think all of these data sets could be of interest to people who are working with triple stores and with Hadoop, since the physical separation can speed most operations up considerably. The plan for firming up sieve3 is to get Spring configuration working inside Hadoop (I probably won't put Spring in charge of Hadoop at first) so that it will be easy to create new rule sets either by writing Java or XML.
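Infovore itself is Java on Hadoop, but the first-match-wins routing that sieve3 performs can be sketched compactly. The rule names below mirror a few of the segments above; the matching predicates and function names are my own illustration, not Infovore's actual rule API:

```python
# Sketch of sieve-style routing over (s, p, o) triples in N-Triples term syntax:
# each rule is (segment_name, match_function); the first matching rule claims
# the triple, and unmatched triples fall through to the "other" output.

def is_uri(term):
    # N-Triples URIs are written as <...>; literals start with a quote.
    return term.startswith("<") and term.endswith(">")

RULES = [
    ("a",     lambda s, p, o: p == "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"),
    ("label", lambda s, p, o: p == "<http://www.w3.org/2000/01/rdf-schema#label>"),
    ("links", lambda s, p, o: is_uri(o)),
]

def sieve(triples, rules=RULES):
    """Partition triples into mutually exclusive outputs, first match wins."""
    outputs = {name: [] for name, _ in rules}
    outputs["other"] = []
    for s, p, o in triples:
        for name, match in rules:
            if match(s, p, o):
                outputs[name].append((s, p, o))
                break
        else:
            outputs["other"].append((s, p, o))
    return outputs
```

In the real system each output corresponds to a separate file set, which is what lets you load just one segment into a triple store instead of scanning the whole dump.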
This data can be downloaded from the requester-pays bucket s3n://basekb-lime/freebase-rdf-2013-09-15-00/sieved/
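The s3n:// scheme is Hadoop's S3 filesystem notation; to fetch the files with ordinary S3 tools you need the plain bucket and key prefix, and because the bucket is requester-pays, every request must carry the requester-pays flag (e.g. `aws s3 cp --request-payer requester` or boto3's `RequestPayer='requester'`). A small sketch of splitting such a URI (the helper name is mine):

```python
from urllib.parse import urlparse

def s3n_to_bucket_prefix(uri):
    """Split an s3n:// (or s3://) URI into (bucket, key prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme not in ("s3n", "s3"):
        raise ValueError("not an S3 URI: " + uri)
    return parsed.netloc, parsed.path.lstrip("/")

bucket, prefix = s3n_to_bucket_prefix(
    "s3n://basekb-lime/freebase-rdf-2013-09-15-00/sieved/")
# bucket == "basekb-lime"
# prefix == "freebase-rdf-2013-09-15-00/sieved/"
```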
Received on Wednesday, 18 September 2013 15:15:57 UTC