
On the horizontal decomposition of Freebase

From: Paul Houle <ontology2@gmail.com>
Date: Wed, 18 Sep 2013 11:15:26 -0400
Message-ID: <CAE__kdR8EXVuMN4zv=+QCwBFF5+hn_pC6GzC2VcqZjg8OEjoXg@mail.gmail.com>
To: Linked Data community <public-lod@w3.org>, "semantic-web@w3.org" <semantic-web@w3.org>
I am reporting the first really useful product from the work on Infovore,
an open source framework for processing large RDF data sets with Hadoop.
Even if you have no experience with Hadoop, you can run Infovore in the
AWS cloud simply by providing your AWS credentials.

As of

https://github.com/paulhoule/infovore/releases/tag/t20130917

there is a first draft of 'sieve3', which splits an RDF data set into
mutually exclusive parts. Each triple is tested against a list of rules:
matching a rule diverts the triple to a particular output, and triples
that fail to match any rule fall into a catch-all output.
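
In case it helps to picture the mechanism, the routing amounts to roughly
the following. This is a minimal sketch with made-up names (Rule, Triple,
Sieve); it is not Infovore's actual API.

    import java.util.List;

    // Sketch only: Rule, Triple and Sieve are illustrative names,
    // not Infovore's real classes.
    interface Triple { String s(); String p(); String o(); }

    interface Rule {
        boolean matches(Triple t);   // does this rule apply to the triple?
        String segment();            // name of the output it diverts to
    }

    class Sieve {
        static String route(Triple t, List<Rule> rules) {
            for (Rule r : rules) {
                if (r.matches(t)) return r.segment();  // first match wins
            }
            return "other";  // unmatched triples go to the catch-all segment
        }
    }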

The horizontal subdivision looks like this:

http://www.slideshare.net/paulahoule/horizontal-decomposition-of-freebase

Here are the segments:

'a' -- rdf:type
'description'
'key' -- keys represented as expanded strings
'keyNs' -- keys represented in the key namespace
'label' -- rdfs:label
'name' -- type.object.name entries that are probably duplicative of
rdfs:label
'text' -- additional large text blobs
'web' -- links to external web sites
'links' -- all other triples where the ?o is a URI
'other' -- all remaining triples that match none of the rules above

Overall this segmentation isn't all that different from how DBpedia is
broken down.

Last night I downloaded 4.5 GB worth of data from the 'links' and 'other'
segments out of the 20 GB dump supplied by Freebase, and I expect to be
able to write interesting SPARQL queries against this. The process is
fast, completing in about half an hour with the smallAwsCluster
configuration. I think all of these data sets could be of interest to
people who are working with triple stores and with Hadoop, since the
physical separation can speed up most operations considerably.
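
To give a flavor of the kind of query I mean, here is a minimal sketch
using Jena 2.x. The file name and the place_of_birth predicate are just
placeholders for whatever you pull out of the 'links' segment; load only a
small slice if you are using an in-memory model.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSetFormatter;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class LinksQuery {
        public static void main(String[] args) {
            // "links-sample.nt" is a placeholder for a small slice of 'links'
            Model m = ModelFactory.createDefaultModel();
            m.read("links-sample.nt", "N-TRIPLES");

            // The predicate below is only an example of a Freebase object
            // property whose triples would land in the 'links' segment.
            String q =
                "PREFIX ns: <http://rdf.freebase.com/ns/> " +
                "SELECT ?person ?place WHERE { " +
                "  ?person ns:people.person.place_of_birth ?place " +
                "} LIMIT 10";

            QueryExecution qe = QueryExecutionFactory.create(q, m);
            try {
                ResultSetFormatter.out(System.out, qe.execSelect());
            } finally {
                qe.close();
            }
        }
    }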

The future plan for firming up sieve3 is to get Spring configuration
working inside Hadoop (I probably won't put Spring in charge of Hadoop at
first) so that it will be easy to create new rule sets by writing either
Java or XML.

This data can be downloaded from the requester-pays bucket

s3n://basekb-lime/freebase-rdf-2013-09-15-00/sieved/
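
If you want to pull it down with the AWS SDK for Java rather than through
Hadoop, the main thing to remember is to mark the request as
requester-pays. A rough sketch, assuming the v1 SDK; the object key under
sieved/ is a placeholder, so list the prefix first to see what is there.

    import java.io.File;
    import com.amazonaws.auth.BasicAWSCredentials;
    import com.amazonaws.services.s3.AmazonS3Client;
    import com.amazonaws.services.s3.model.GetObjectRequest;

    public class FetchSieved {
        public static void main(String[] args) {
            // Your own AWS credentials; you are billed for the transfer.
            AmazonS3Client s3 = new AmazonS3Client(
                new BasicAWSCredentials(args[0], args[1]));

            // The key below is a placeholder for one of the sieved files.
            GetObjectRequest req = new GetObjectRequest(
                "basekb-lime",
                "freebase-rdf-2013-09-15-00/sieved/links/part-r-00000");
            req.setRequesterPays(true);  // required for requester-pays buckets

            s3.getObject(req, new File("links-part-r-00000"));
        }
    }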