
Re: DBpedia hosting burden

From: Andy Seaborne <andy.seaborne@talis.com>
Date: Thu, 15 Apr 2010 13:36:04 +0100
Message-ID: <4BC70834.9020700@talis.com>
To: public-lod@w3.org
CC: dbpedia-discussion <dbpedia-discussion@lists.sourceforge.net>
I ran the DBpedia files through an N-Triples parser with checking.

The report is here (it's 25K lines long):

http://www.openjena.org/~afs/DBPedia35-parse-log-2010-04-15.txt
It covers both strict errors and warnings of ill-advised forms.

A few examples:

Bad IRI: <=?(''[[Nepenthes>
Bad IRI: <http://www.european-athletics.org‎>

Bad lexical forms for the value space:
(there is no February the 31st)

Warnings about well-known ports of other protocols:

Warnings about explicit port 80:


and use of . and .. in absolute URIs, all of which are from the standard 
list of IRI warnings.

Bad IRI: <http://dbpedia.org/resource/..> Code: 
8/NON_INITIAL_DOT_SEGMENT in PATH: The path contains a segment /../ not 
at the beginning of a relative reference, or it contains a /./ These 
should be removed.
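
For anyone who wants a feel for these IRI checks without running the full toolchain, here is a rough Python sketch. It is not the IRI library's algorithm: the function name iri_problems, the port tables and the character sets are all assumptions made up for illustration, and it only covers the few cases mentioned above (forbidden or bidi formatting characters, a missing scheme, an explicit default port, a well-known port of another protocol, and . or .. path segments).

```python
from urllib.parse import urlsplit

# Illustrative tables only (assumptions); the real IRI library has its own lists.
DEFAULT_PORTS = {"http": 80, "https": 443}
WELL_KNOWN_PORTS = {21: "ftp", 22: "ssh", 25: "smtp", 80: "http", 443: "https"}
BIDI_CONTROLS = set("\u200e\u200f\u202a\u202b\u202c\u202d\u202e")
FORBIDDEN = set('<>"{}|\\^` ')


def iri_problems(iri):
    """Return a list of problems found in a string meant to be an absolute IRI."""
    problems = []
    if any(ch in FORBIDDEN or ch in BIDI_CONTROLS for ch in iri):
        problems.append("forbidden or bidi formatting character")
    try:
        parts = urlsplit(iri)
        port = parts.port  # raises ValueError for a non-numeric port
    except ValueError:
        return problems + ["unparseable authority or port"]
    if not parts.scheme:
        return problems + ["no scheme, so not an absolute IRI"]
    if port is not None:
        if DEFAULT_PORTS.get(parts.scheme) == port:
            problems.append("explicit default port %d" % port)
        elif WELL_KNOWN_PORTS.get(port, parts.scheme) != parts.scheme:
            problems.append(
                "well-known port of another protocol (%s)" % WELL_KNOWN_PORTS[port]
            )
    if any(seg in (".", "..") for seg in parts.path.split("/")):
        problems.append("path contains a . or .. segment")
    return problems
```

Calling iri_problems("http://dbpedia.org/resource/..") reports the dot segment, and an empty list means none of these particular checks fired, not that the IRI is valid.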


Software used:

The IRI checker, by Jeremy Carroll, is available from
http://www.openjena.org/iri/ and Maven.

The lexical form checking is done by Apache Xerces.

The N-Triples parser is the one from TDB v0.8.5, which bundles the above 
two together.
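
To illustrate what checking a lexical form against the value space means (the "February the 31st" case above), here is a minimal Python sketch. It is not how Xerces does it: the name is_valid_xsd_date and the simplified pattern are assumptions for illustration, and negative years or years beyond four digits are deliberately out of scope.

```python
import re
from datetime import date

# Simplified xsd:date pattern (assumption): YYYY-MM-DD plus optional timezone.
XSD_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})(Z|[+-][0-9]{2}:[0-9]{2})?")


def is_valid_xsd_date(lexical):
    """True if the string matches the pattern and names a real calendar day."""
    m = XSD_DATE.fullmatch(lexical)
    if m is None:
        return False
    try:
        # date() rejects impossible days such as 31 February.
        date(int(m.group(1)), int(m.group(2)), int(m.group(3)))
    except ValueError:
        return False
    return True
```

A string such as "2009-02-31" (a made-up example, not taken from the report) matches the pattern but is rejected because no such day exists.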

On 15/04/2010 9:54 AM, Malte Kiesel wrote:
> Ivan Mikhailov wrote:
>> If I were The Emperor of LOD I'd ask all grand dukes of datasources to
>> put fresh dumps at some torrent with control of UL/DL ratio :)
> Last time I checked (which was quite a while ago though), loading
> DBpedia in a normal triple store such as Jena TDB didn't work very well
> due to many issues with the DBpedia RDF (e.g., problems with the URIs of
> external links scraped from Wikipedia).
> I don't know whether this is a bug in TDB or DBpedia but I guess this is
> one of the problems causing people to use DBpedia online only - even if,
> due to performance reasons, running it locally would be far better.
> Regards
> Malte
Received on Thursday, 15 April 2010 12:36:33 UTC
