Re: [Dbpedia-discussion] RDF Validator puts Freebase and DBpedia Live to the test

FYI

On 4/9/13 8:01 PM, Paul A. Houle wrote:
> PRESS RELEASE
> Paul Houle,  Ontology2 founder,  stated that "we updated Infovore to 
> accept
> data from DBpedia,  and ran a head to head test,  in terms of RDF 
> validity,
> between Freebase and DBpedia Live."
> "Unlike most scientific results",  he said,  "these results are 
> repeatable,
> because you can reproduce them yourself with Infovore 1.1.  I 
> encourage you
> to use this tool to put other RDF data sets,  large and small,  to the 
> test."
> The tool parallelSuperEyeball was run against both the 2013-03-31 Freebase
> RDF dump and the 2012-04-30 edition of DBpedia Live.
> Although Freebase asserts roughly 1.2 billion facts, Infovore rejects
> roughly 200 million useless facts in pre-filtering. Downstream of that we
> found 944,909,025 valid facts and 66,781,906 invalid facts, in addition
> to 5 especially malformed facts.
> This is a serious regression compared to the 2013-01-27 RDF dump, in
> which only about 13 million invalid triples were discovered. The main
> cause of the increase is the introduction of 40 million or so "triples"
> lacking an object, connected with the predicate
> ns:common.topic.notable_for. Previously, the bulk of the invalid triples
> were incorrectly formatted dates.
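
For readers who want a concrete picture of those two error classes (a
statement with no object term, and a malformed date literal), here is a
minimal Python sketch, assuming plain N-Triples input. It is an
illustration only, not Infovore's actual Java code, and far cruder than
the checks parallelSuperEyeball performs:

# Illustration only -- NOT Infovore code.  A crude line-level check for
# N-Triples statements that are missing an object term or that carry a
# malformed xsd:date literal, the two problem classes described above.
import re
import sys
from datetime import datetime

TERM = r'(?:<[^>]*>|_:\S+|"(?:[^"\\]|\\.)*"(?:\^\^<[^>]*>|@[\w-]+)?)'
TRIPLE = re.compile(r'^\s*' + TERM + r'\s+<[^>]*>\s+(' + TERM + r')\s*\.\s*$')
DATE = re.compile(r'^"(\d{4}-\d{2}-\d{2})"\^\^'
                  r'<http://www\.w3\.org/2001/XMLSchema#date>$')

def classify(line):
    """Return 'valid', 'no-object', or 'bad-date' for one N-Triples line."""
    m = TRIPLE.match(line)
    if not m:
        return "no-object"      # missing object, or otherwise malformed
    d = DATE.match(m.group(1))
    if d:
        try:
            datetime.strptime(d.group(1), "%Y-%m-%d")
        except ValueError:
            return "bad-date"   # e.g. "2013-02-30"
    return "valid"

if __name__ == "__main__":
    counts = {"valid": 0, "no-object": 0, "bad-date": 0}
    for raw in sys.stdin:
        if raw.strip() and not raw.lstrip().startswith("#"):
            counts[classify(raw)] += 1
    print(counts)

A filter like this can simply be fed a decompressed dump on standard
input; each line is judged independently, which is what makes this style
of validation easy to parallelize.
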
> The rate of invalid triples in DBpedia Live was found to be orders of
> magnitude lower than in Freebase.
> Only 8,664 invalid facts were found in DBpedia Live, compared to
> 247,557,030 valid facts. The predominant problem in DBpedia Live turned
> out to be nonconformant IRIs that came in from Wikipedia. This is
> comparable in magnitude to the number of facts found invalid in the old
> Freebase quad dump in the process of creating :BaseKB Pro.
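
As an aside on what "nonconformant IRIs" means in practice: an IRI that
contains characters RFC 3987 requires to be percent-encoded (spaces,
quotes, braces, and so on), for example one built directly from a raw
Wikipedia page title, will be rejected by strict parsers. A rough Python
sketch of such a check, again purely illustrative and not DBpedia or
Infovore code:

# Illustration only -- a rough check for IRI terms containing characters
# that RFC 3987 does not allow unencoded (spaces, quotes, braces, etc.).
import re
import sys

# Characters that must not appear raw inside an IRI reference.
BAD_CHARS = re.compile(r'[\x00-\x20<>"{}|\\^`]')

def bad_iris(ntriples_line):
    """Yield every <...> term on the line containing a forbidden character."""
    for iri in re.findall(r'<([^>]*)>', ntriples_line):
        if BAD_CHARS.search(iri):
            yield iri

if __name__ == "__main__":
    for line in sys.stdin:
        for iri in bad_iris(line):
            print("nonconformant IRI:", iri)
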
> Just one of the tools included with Infovore, parallelSuperEyeball is an
> industrial-strength RDF validator that uses streaming processing and the
> Map/Reduce paradigm to attain nearly perfect parallel speedup at many
> tasks on common four-core computers. Infovore 1.1 brings many
> improvements, including a threefold speedup of parallelSuperEyeball and
> the new Infovore shell.
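
The near-perfect speedup claim is plausible because per-triple validation
is embarrassingly parallel: each worker validates its own slice of the
input and the per-slice counts are simply summed at the end. A toy Python
sketch of that map/reduce shape, with a placeholder validator standing in
for the real checks (again, not Infovore's Java code):

# Illustration only -- the map/reduce shape of parallel validation.
# Each worker ("map") validates a slice of lines; the per-slice counts
# are summed in the parent process ("reduce"), so a four-core machine
# can keep all cores busy on a single streamed dump.
from multiprocessing import Pool
from collections import Counter
from itertools import islice
import sys

def validate(line):
    """Stand-in validator: 'valid' or 'invalid' for one line."""
    return "valid" if line.rstrip().endswith(".") else "invalid"

def map_chunk(lines):
    return Counter(validate(l) for l in lines)

def chunks(stream, size=100_000):
    while True:
        block = list(islice(stream, size))
        if not block:
            return
        yield block

if __name__ == "__main__":
    total = Counter()
    with Pool(processes=4) as pool:          # one worker per core
        for partial in pool.imap_unordered(map_chunk, chunks(sys.stdin)):
            total += partial                 # the "reduce" step
    print(dict(total))
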
> Please take a look at our GitHub project at
> https://github.com/paulhoule/infovore/wiki
> and feel free to fork or star it. Note that many Infovore data products
> are also available at
> http://basekb.com/
> Because Infovore is memory efficient, it is possible to use it to handle
> much larger data sets than can be kept in a triple store on any given
> hardware. The main limitation in handling large RDF data sets is running
> out of disk space; by avoiding random-access I/O, Infovore can work
> through such data sets quickly.
> "We challenge RDF data providers to put their data to the test",  said 
> Paul
> Houle,  "Today it's an expectation that people and organizations 
> publish only
> valid XML files,  and the publication of superParallelEyeball is a step to
> a world that speaks valid RDF and that can clean and repair invalid 
> files."
> Ontology2 is a privately held company that develops web sites and data
> products based on Freebase, DBpedia, and other sources. Contact
> paul@ontology2.com with questions about Ontology2 products and services.


-- 

Regards,

Kingsley Idehen	
Founder & CEO
OpenLink Software
Company Web: http://www.openlinksw.com
Personal Weblog: http://www.openlinksw.com/blog/~kidehen
Twitter/Identi.ca handle: @kidehen
Google+ Profile: https://plus.google.com/112399767740508618350/about
LinkedIn Profile: http://www.linkedin.com/in/kidehen

Received on Wednesday, 10 April 2013 01:37:02 UTC