[UCR] Idiosyncratic data (Geonames) dump in RDF

From: Ghislain Atemezing <auguste.atemezing@eurecom.fr>
Date: Sat, 9 Aug 2014 11:32:11 +0200
Message-Id: <F9A3C474-60EC-4164-BEE3-9293AB70145E@eurecom.fr>
Cc: atemezin@eurecom.fr
To: public-dwbp-wg@w3.org
Hi all,
I’ve just come across this issue with the Geonames dump in RDF, which seems to be a quite “normal” situation.
The issue is well described here [1], which says:

"Geonames is a great resource for geographical information. Helpfully they publish data exports in a variety of formats, allowing others to process and manipulate the data locally. Unfortunately the RDF data dump that is available from: 
[http://download.geonames.org/export/dump/all-geonames-rdf.txt.zip] 
is a little idiosyncratic. Rather than provide a single ntriples or even RDF/XML file the dump consists of a text file that consists of alternating lines like this: 
...feature URI.... <rdf:RDF>...RDF/XML description of feature....</rdf:RDF>
This means you need to script up unpacking the file in order to load it into a triple store. "
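
To make the unpacking step concrete, here is a minimal sketch in Python. It assumes the dump strictly alternates one feature-URI line with one single-line RDF/XML document, as the description above suggests; the function names are my own, and I have not checked it against the real file.

```python
from typing import Iterable, Iterator, Tuple


def iter_features(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (feature_uri, rdf_xml) pairs from a dump with alternating lines.

    Assumption: a URI line is always followed by exactly one line holding
    the RDF/XML description of that feature. Blank lines are skipped.
    """
    uri = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # tolerate stray blank lines
        if uri is None:
            uri = line  # first line of a pair: the feature URI
        else:
            yield uri, line  # second line of a pair: its RDF/XML
            uri = None


def count_features(lines: Iterable[str]) -> int:
    """Count the features (URI + RDF/XML pairs) found in the dump."""
    return sum(1 for _ in iter_features(lines))
```

Passing an open file object for the unzipped text file to `count_features` would give a feature count to compare against the provider's figure. Each RDF/XML string could then be handed to an RDF parser to emit N-Triples.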

As you can imagine, this raises three issues:
 1- Users/consumers have to write scripts to “harmonize” the dump into clean triples.
 2- The provider claims [2] to have 8,514,201 features and about 125 million RDF triples (as of 2013-08-27).
	2-1: How can we ensure that this original number is preserved after the data is loaded into a third-party endpoint?
      For example, I looked at LOD cache and Factforge to find Geonames features:
       #- Results for http://lod.openlinksw.com/sparql -> see http://goo.gl/VFfQ4x (4,989,694 / 5,539,694 features)
       #- Results for factforge.net: http://goo.gl/VC2YuE (8,060,727 features). This seems more “realistic” compared with the original dump.
 3- Trust issue: Which endpoint should I trust when I don’t have enough resources to build a script and load the Geonames dump locally?
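
To put the counts above side by side, here is a small Python sketch that expresses each endpoint's feature count as a share of the provider's claimed total. The labels and the `coverage` helper are only illustrative; the numbers are the ones quoted in this message.

```python
# Feature count claimed by the provider in [2] (dated 2013-08-27).
CLAIMED_FEATURES = 8_514_201

# Counts observed on the two endpoints, as quoted in this message.
# The labels are illustrative; the LOD cache query returned two figures.
OBSERVED = {
    "LOD cache (first figure)": 4_989_694,
    "LOD cache (second figure)": 5_539_694,
    "Factforge": 8_060_727,
}


def coverage(observed: int, claimed: int = CLAIMED_FEATURES) -> float:
    """Return the observed count as a fraction of the claimed count."""
    if claimed <= 0:
        raise ValueError("claimed count must be positive")
    return observed / claimed


for name, count in OBSERVED.items():
    print(f"{name}: {count} features, {coverage(count):.1%} of the claimed total")
```

A ratio well below 1 only shows that features are missing relative to the claim; it does not say whether the loss happened during unpacking, loading, or because the endpoint holds an older dump.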

Given all the issues above, do you think this could be a “valid” Use Case for this group to deal with?

WDYT ? 

Best,
Ghislain 


[1] https://github.com/ldodds/geonames
[2] http://www.geonames.org/ontology/documentation.html 
-------------
Ghislain Atemezing
EURECOM, Multimedia Communication Department
Campus SophiaTech
450, route des Chappes, 06410 Biot, France
email: auguste.atemezing@eurecom.fr & ghislain.atemezing@gmail.com
Tel: +33 (0)4- 9300 8178
Fax: +33 (0)4- 9000 8200
Web: http://www.eurecom.fr/~atemezin
Google+: http://google.com/+GhislainATEMEZING
Twitter: @gatemezing
Received on Saturday, 9 August 2014 09:32:42 UTC
