
Re: [Freebase-discuss] [BULK] 13 Million triples are invalid in the Freebase Quad Dump

From: Suresh Partha <sureshpartha21@yahoo.com>
Date: Mon, 18 Feb 2013 02:01:33 -0800 (PST)
Message-ID: <1361181693.41356.YahooMailNeo@web161004.mail.bf1.yahoo.com>
To: Kingsley Idehen <kidehen@openlinksw.com>, "public-lod@w3.org" <public-lod@w3.org>, "paul@ontology2.com" <paul@ontology2.com>


Hi Paul,

If you could make the extracted valid triples (the 716 million valid triples from Freebase) available for download, it would be very helpful.

Thanks.

________________________________
 From: Kingsley Idehen <kidehen@openlinksw.com>
To: "public-lod@w3.org" <public-lod@w3.org> 
Sent: Thursday, February 14, 2013 4:03 AM
Subject: Re: [Freebase-discuss] [BULK] 13 Million triples are invalid in the  Freebase Quad Dump
 

FYI

On 2/13/13 5:26 PM, paul@ontology2.com wrote:

>A system called parallelSuperEyeball has been added to the Freebase processing chain. I took apart the parser from the Jena framework to extract something that parses individual nodes in N-Triples files, so that invalid triples do not stop the parsing process. The earlier partitionFreebaseRDF stage removes superfluous information and reformats the data for scalable parallel processing.
> 
>I call the resulting product, which separates valid from invalid facts in Freebase, “:BaseKB Lime”, and it’s a refreshing alternative to the difficulties that people have with off-brand
>Linked Data products that don’t conform to industry standards.
> 
>You can confirm these claims for yourself by downloading
> 
>https://github.com/paulhoule/infovore/archive/t20130213.tar.gz
> 
>cd infovore
>mvn clean install
>cd hydroxide-apps
>mvn appassembler:assemble
>cd ..
>source ./hydroxide-apps/path.sh
>export INFOVORE_BASE=/freebase/
>export INFOVORE_FREEBASE_FILE=/freebase/freebase-rdf-2013-01-27-00-00.gz
>export INFOVORE_INSTANCE=2013-01-27
>mkdir /freebase/data/$INFOVORE_INSTANCE
> 
>partitionFreebaseRDF
>superParallelEyeball
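
Before starting the two jobs above, it may be worth checking that the exported variables point at things that exist. The following is only a sketch: check_infovore_env is a hypothetical helper, not part of infovore, and it assumes nothing beyond the three variables and the data directory shown above.

```shell
# Hypothetical pre-flight check (not part of infovore).
# Verifies the variables exported above before the long-running jobs start.
check_infovore_env() {
    # The instance name must be non-empty.
    [ -n "${INFOVORE_INSTANCE:-}" ] || { echo "INFOVORE_INSTANCE is not set" >&2; return 1; }
    # The base directory must exist.
    [ -d "${INFOVORE_BASE:-}" ] || { echo "INFOVORE_BASE is not a directory" >&2; return 1; }
    # The Freebase dump must be readable.
    [ -r "${INFOVORE_FREEBASE_FILE:-}" ] || { echo "cannot read INFOVORE_FREEBASE_FILE" >&2; return 1; }
    # The per-instance data directory (created with mkdir above) must exist.
    [ -d "${INFOVORE_BASE%/}/data/$INFOVORE_INSTANCE" ] || { echo "instance data directory is missing" >&2; return 1; }
}
```

Usage would be along the lines of: check_infovore_env && partitionFreebaseRDF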
> 
>And then in /freebase/data/2013-01-27/work you’ll find
> 
>baseKBLime – 716 million valid triples to load in your RDF store or otherwise use
>baseKBLimeRejected – 13 million invalid “triples”
>freebase-raw-rejected.tsv – quite literally a handful of completely broken lines from the quad dump that don’t even end with a period.
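
To illustrate the coarsest of these checks (a line that does not even end with a period), here is a minimal sketch using awk. It is not how infovore does it, which relies on a parser extracted from Jena; split_by_terminator and the file names in the usage line are hypothetical, and a line that passes this check may still be an invalid triple.

```shell
# Sketch only, not the infovore implementation.
# split_by_terminator INPUT ACCEPTED REJECTED
# Lines ending with a period (optionally followed by whitespace) go to
# ACCEPTED; every other line, including blank lines, goes to REJECTED.
split_by_terminator() {
    # Create or truncate both outputs so they exist even when empty.
    : > "$2"
    : > "$3"
    awk -v ok="$2" -v bad="$3" '
        /\.[ \t\r]*$/ { print > ok; next }   # terminated by a period
                      { print > bad }        # anything else is set aside
    ' "$1"
}
```

For example: split_by_terminator dump.nt accepted.nt rejected.tsv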
> 
>I’m planning on fine-tuning the rules for what the first stage accepts, getting a newer version of the quad dump, and publishing :BaseKB Lime for download soon.
> 


--
Regards,
Kingsley Idehen
Founder & CEO 
OpenLink Software     
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen
Received on Monday, 18 February 2013 10:02:01 UTC
