On the horizontal decomposition of Freebase

I am reporting the first really useful product from my work on Infovore,
an open-source framework for processing large RDF data sets with Hadoop.
Even if you have no experience with Hadoop, you can run Infovore in the
AWS cloud simply by providing your AWS credentials.

As of


there is a first draft of `sieve3`, which splits an RDF data set into
mutually exclusive parts.  A list of rules is applied to each triple:
matching a rule diverts the triple to a particular output, and triples
that fail to match any rule fall into the catch-all output.
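The rule-matching step can be sketched roughly as follows.  This is a
minimal Python sketch of the idea, not sieve3's actual Java code, and the
rule patterns here are illustrative stand-ins:

```python
import re

# Ordered list of (predicate pattern, output name); the first matching
# rule wins.  These patterns are illustrative, not Infovore's rule set.
RULES = [
    (re.compile(r".*rdf-syntax-ns#type$"), "a"),
    (re.compile(r".*rdf-schema#label$"), "label"),
    (re.compile(r".*type\.object\.name$"), "name"),
]

def route(subject, predicate, obj):
    """Return the name of the output segment a triple is diverted to."""
    for pattern, output in RULES:
        if pattern.match(predicate):
            return output
    # Triples that match no rule fall into the catch-all output.
    return "other"
```

In a Hadoop job the same first-match-wins loop would run inside the
mapper, with each output name selecting a separate set of output files.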

The horizontal subdivision looks like this


Here are the segments:

`a` -- rdf:type
`key` -- keys represented as expanded strings
`keyNs` -- keys represented in the key namespace
`label` -- rdfs:label
`name` -- type.object.name entries that are probably duplicative of `label`
`text` -- additional large text blobs
`web` -- links to external web sites
`links` -- all other triples where the ?o is a URI
`other` -- all other triples where the ?o is not a URI
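The segmentation above can be demonstrated by splitting a stream of
N-Triples lines into one file per segment.  This is a hedged sketch of my
reading of the rules; it handles only a few of the predicates (the real
Freebase predicates for `key`, `keyNs`, `text`, and `web` are omitted for
brevity), and the naive whitespace split is not a full N-Triples parser:

```python
import os
import tempfile

RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

def segment_of(predicate, obj):
    # Decision order follows the segment list above; predicate-based
    # segments first, then a split on the shape of the object term:
    # URIs go to `links`, everything else to `other`.
    if predicate == RDF_TYPE:
        return "a"
    if predicate == RDFS_LABEL:
        return "label"
    return "links" if obj.startswith("<") else "other"

def split_dump(lines, out_dir):
    """Write each triple line into <out_dir>/<segment>.nt."""
    handles = {}
    for line in lines:
        s, p, o = line.split(None, 2)  # naive N-Triples split
        name = segment_of(p, o)
        if name not in handles:
            handles[name] = open(os.path.join(out_dir, name + ".nt"), "w")
        handles[name].write(line + "\n")
    for h in handles.values():
        h.close()
```

In Infovore proper this partitioning happens in parallel across a Hadoop
cluster rather than through a dictionary of local file handles.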

Overall this segmentation isn't all that different from how DBpedia is
broken down.

Last night I downloaded 4.5 GB worth of data from `links` and `other` out
of the 20 GB dump supplied by Freebase, and I expect to be able to write
interesting SPARQL queries against this.  The process is fast, completing
in about half an hour with a smallAwsCluster.  I think all of these data
sets could be of interest to people who are working with triple stores and
with Hadoop, since the physical separation can speed up most operations.

The future plan for firming up sieve3 is to get Spring configuration
working inside Hadoop (I probably won't put Spring in charge of Hadoop at
first) so that it will be easy to create new rule sets by writing either
Java or XML.
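A rule set declared in Spring XML might look something along these lines.
This is purely hypothetical: the class and property names are invented
here to show the shape of the idea, not sieve3's actual configuration:

```xml
<beans xmlns="http://www.springframework.org/schema/beans">
  <!-- Hypothetical: one bean per rule, applied in order; triples that
       match no rule would fall through to the catch-all output. -->
  <bean id="labelRule" class="example.PredicateRule">
    <property name="predicate"
              value="http://www.w3.org/2000/01/rdf-schema#label"/>
    <property name="output" value="label"/>
  </bean>
</beans>
```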

This data can be downloaded from the requester-pays bucket


Received on Wednesday, 18 September 2013 15:15:59 UTC