Re: [Freebase-discuss] [BULK] 13 Million triples are invalid in the Freebase Quad Dump

From: Kingsley Idehen <kidehen@openlinksw.com>
Date: Wed, 13 Feb 2013 17:33:24 -0500
Message-ID: <511C14B4.3050203@openlinksw.com>
To: "public-lod@w3.org" <public-lod@w3.org>

On 2/13/13 5:26 PM, paul@ontology2.com wrote:
> A system called parallelSuperEyeball has been added to the Freebase
> processing chain. I took apart the parser from the Jena framework to
> extract something that parses individual nodes in N-Triples files, so
> that invalid triples do not stop the triple parsing process. The
> earlier partitionFreebaseRDF stage removes superfluous information and
> reformats the data for scalable parallel processing.
> I call the resulting product, which separates valid from invalid
> facts in Freebase, ":BaseKB Lime", and it's a refreshing
> alternative to the difficulties that people have with off-brand
> Linked Data products that don't conform to industry standards.
> You can confirm these claims for yourself by downloading
> https://github.com/paulhoule/infovore/archive/t20130213.tar.gz
> and then running:
>
>   cd infovore
>   mvn clean install
>   cd hydroxide-apps
>   mvn appassembler:assemble
>   cd ..
>   source ./hydroxide-apps/path.sh
>   export INFOVORE_BASE=/freebase/
>   export INFOVORE_FREEBASE_FILE=/freebase/freebase-rdf-2013-01-27-00-00.gz
>   export INFOVORE_INSTANCE=2013-01-27
>   mkdir /freebase/data/$INFOVORE_INSTANCE
>   partitionFreebaseRDF
>   superParallelEyeball
>
> And then in /freebase/data/2013-01-27/work you'll find
> baseKBLime -- 716 million valid triples to load in your RDF store or 
> otherwise use
> baseKBLimeRejected -- 13 million invalid "triples"
> freebase-raw-rejected.tsv -- quite literally a handful of completely 
> broken lines from the quad dump that don't even end with a period.
> I'm planning on fine tuning the rules on what the first stage 
> accepts,  getting a newer version of the quad dump, and publishing 
> :BaseKB Lime for download soon.
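
For anyone who wants a feel for the partitioning described above without
building the toolchain, here is a minimal sketch in Python. It is not the
infovore code and does not use the Jena parser: the regular expressions only
approximate the N-Triples grammar, and the names partition, TRIPLE, IRI, BNODE
and LITERAL are made up for illustration. The point it shows is that each line
is checked on its own, so a bad line is set aside instead of aborting the run.

```python
import re

# Simplified term patterns. These approximate the N-Triples grammar and are
# an assumption of this sketch, not the rules used by the real tools.
IRI = r'<[^<>"{}|^`\\\x00-\x20]*>'
BNODE = r'_:[A-Za-z0-9][A-Za-z0-9_.-]*'
LITERAL = (r'"(?:[^"\\\n\r]|\\.)*"'
           r'(?:\^\^' + IRI + r'|@[A-Za-z]+(?:-[A-Za-z0-9]+)*)?')

TRIPLE = re.compile(
    r'^(?:' + IRI + r'|' + BNODE + r')\s+'                   # subject
    + IRI + r'\s+'                                           # predicate
    + r'(?:' + IRI + r'|' + BNODE + r'|' + LITERAL + r')'    # object
    + r'\s*\.$'                                              # terminating period
)


def partition(lines):
    """Split lines into (valid, rejected, raw_rejected), one line at a time."""
    valid, rejected, raw_rejected = [], [], []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith('#'):
            continue  # skip blank lines and comments
        if not stripped.endswith('.'):
            raw_rejected.append(line)  # does not even end with a period
        elif TRIPLE.match(stripped):
            valid.append(line)
        else:
            rejected.append(line)      # period-terminated but malformed
    return valid, rejected, raw_rejected


if __name__ == '__main__':
    sample = [
        '<http://example.org/s> <http://example.org/p> <http://example.org/o> .',
        '<http://example.org/s> <http://example.org/p> "hello"@en .',
        '<http://example.org/bad iri> <http://example.org/p> "x" .',
        '<http://example.org/s> <http://example.org/p> "unterminated .',
        '<http://example.org/s> <http://example.org/p> <http://example.org/o>',
    ]
    valid, rejected, raw_rejected = partition(sample)
    print(len(valid), len(rejected), len(raw_rejected))
```

The three lists correspond loosely to baseKBLime, baseKBLimeRejected and
freebase-raw-rejected.tsv. A real run over hundreds of millions of lines would
stream to files instead of building lists in memory.
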



Kingsley Idehen
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

Received on Wednesday, 13 February 2013 22:33:52 UTC
