Fwd: GeoKnow Public Datasets [via Geospatial Semantic Web Community Group] from Frans Knibbe on 2015-11-20 (public-sdw-wg@w3.org from November 2015)

From: Frans Knibbe <frans.knibbe@geodan.nl>
Date: Fri, 20 Nov 2015 10:59:41 +0100
To: SDW WG Public List <public-sdw-wg@w3.org>
Message-ID: <CAFVDz426rQPpp19RYzpqvduZzvJEDM5_bXQF6PuixEtg1nhR5Q@mail.gmail.com>
Hello all, especially BP editors,

Some results from the GeoKnow project <http://geoknow.eu/Welcome.html> were
shared on the geosemweb list (the W3C version). I thought it would be good to
forward the message because it is about practices that could be
investigated in the search for best practices.

Regards,
Frans


---------- Forwarded message ----------
From: W3C Community Development Team <team-community-process@w3.org>
Date: 2015-11-20 9:20 GMT+01:00
Subject: GeoKnow Public Datasets [via Geospatial Semantic Web Community
Group]
To: public-geosemweb@w3.org


In this blog post we present three public datasets that were improved or
created in the GeoKnow project.

LinkedGeoData
Size: 177GB zipped turtle file
URL: http://linkedgeodata.org/
LinkedGeoData is the RDF version of OpenStreetMap (OSM), which provides
geospatial data covering the entire planet. As of September 2014 the zipped
XML file from OSM held 36GB of data, while the zipped LGD files in Turtle
format take up 177GB. A detailed description of the dataset can be found in
the D1.3.2 Continuous Report on Performance Evaluation.
Technically, LinkedGeoData is a set of SQL files, database-to-RDF (RDB2RDF)
mappings, and bash scripts. The actual RDF conversion is carried out by the
SPARQL-to-SQL rewriter Sparqlify. You can view the Sparqlify Mappings for
LinkedGeoData here. The mappings required to transform OSM data to RDF were
maintained and improved throughout the project.
This dataset has been used in several use cases, and especially for all
benchmarking tasks within GeoKnow.
Wikimapia
URL: http://wikimapia.org/api/
Wikimapia is a crowdsourced, open-content, collaborative mapping initiative
where users can contribute mapping information. This dataset already existed
before the project started. However, it was only accessible through
Wikimapia’s API and provided in XML or JSON formats. Within GeoKnow, we
downloaded several sets of geospatial entities from Wikimapia, including
both spatial and non-spatial attributes for each entity, and transformed
them into RDF data. The process we followed is described next.
We considered a set of cities throughout the world (Athens, London, Leipzig,
Berlin, New York) and downloaded the whole content provided by Wikimapia
regarding the geospatial entities included in those geographical areas.
These cities were preferred since they are the base cities of several
partners in the project, while the remaining two cities were randomly
selected, with the aim of reaching our target of more than 100,000 spatial
entities from Wikimapia. Apart from geometries, Wikimapia provided a very
rich set of metadata (non-spatial properties) for each entity (e.g. tags and
categories describing the geospatial entities, topological relations with
nearby entities, comments of the users, etc.).
The aforementioned dumps were transformed into RDF triples in a
straightforward way: (a) defining intermediate resources (functioning as
blank nodes) where information was organized in more than one level, (b)
flattening the information of deeper levels where possible in order to
simplify the structure of the dataset and (c) transforming tags into OWL
classes.
Specifically, we developed a parsing tool to communicate with the Wikimapia
API and construct appropriate N-Triples from the dataset. The tool takes as
input a bounding box in the form of WGS84 coordinates (min long, min lat,
max long, max lat). We chose five initial bounding boxes, one for each of
the cities mentioned above. Each bounding box was defined so that it covered
the whole area of the selected city. Each bounding box was then further
divided by the tool into a grid of smaller bounding boxes in order to
overcome the upper limit on the number of entities the Wikimapia API returns
per area.
For each place returned, we transformed all properties into RDF triples.
Every tag was assigned an OWL class and an appropriate label, corresponding
to the textual description in the initial Wikimapia XML file. Each place
became an instance of the classes provided by its tags. For the rest of the
returned Wikimapia attributes, we created a custom property in a uniform way
for each attribute of the returned Wikimapia XML file. The properties
resulting from the Wikimapia XML attributes point to their literal values.
For example, we construct properties about each place’s language id,
Wikipedia link, URL link, title, description, edit info, location info,
global administrative areas, available languages and geometry information.
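The grid subdivision of a city bounding box described above can be
illustrated with a short Python sketch. This is not the project's parsing
tool: the function name, the fixed rows-by-columns split and the example
coordinates are assumptions made purely for illustration, and the Wikimapia
API call itself is left out.

```python
from typing import Iterator, Tuple

# (min_lon, min_lat, max_lon, max_lat) in WGS84 degrees, as in the post.
BBox = Tuple[float, float, float, float]


def split_bbox(bbox: BBox, rows: int, cols: int) -> Iterator[BBox]:
    """Yield rows * cols equally sized cells that together cover bbox."""
    min_lon, min_lat, max_lon, max_lat = bbox
    if rows < 1 or cols < 1:
        raise ValueError("rows and cols must be at least 1")
    if min_lon >= max_lon or min_lat >= max_lat:
        raise ValueError("bbox must have positive width and height")

    lon_step = (max_lon - min_lon) / cols
    lat_step = (max_lat - min_lat) / rows
    for r in range(rows):
        for c in range(cols):
            west = min_lon + c * lon_step
            south = min_lat + r * lat_step
            # Snap the outermost edges to the original box so that
            # floating-point drift cannot leave an uncovered sliver.
            east = max_lon if c == cols - 1 else min_lon + (c + 1) * lon_step
            north = max_lat if r == rows - 1 else min_lat + (r + 1) * lat_step
            yield (west, south, east, north)


if __name__ == "__main__":
    # Made-up box, not one of the boxes used in GeoKnow.
    for cell in split_bbox((23.6, 37.9, 23.8, 38.1), rows=2, cols=2):
        print(cell)
```

A real harvester would probably subdivide a cell again whenever the API
reports that the per-area limit was hit, rather than relying on one fixed
grid size.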
If these attributes follow a deeper tree structure, we assign the properties
to intermediate custom nodes, named by concatenating the property with the
place ID; these nodes function as blank nodes and connect the initial entity
with a set of properties and the respective values. This process resulted in
an initial geospatial RDF dataset containing, for each entity, the polygon
geometry that represents it, along with a wealth of non-spatial properties
of the entity.
The dataset contains 102,019 geospatial entities and 4,629,223 triples.
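The conversion rules above (tags become OWL classes, flat attributes become
custom properties with literal values, nested attributes hang off an
intermediate node named from the property and the place ID) can be sketched
as follows. Everything specific here is an assumption for illustration: the
example.org namespace, the IRI layout and the input field names ("id",
"tags", "title") are hypothetical and are not taken from the Wikimapia API
or from the GeoKnow tool. Attribute names are assumed to be simple
identifiers that are safe to use inside an IRI.

```python
from typing import Any, Dict, List

BASE = "http://example.org/wikimapia/"  # hypothetical namespace
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"


def literal(value: Any) -> str:
    """Serialise a value as a plain N-Triples string literal."""
    text = str(value)
    text = text.replace("\\", "\\\\").replace('"', '\\"')
    text = text.replace("\n", "\\n").replace("\r", "\\r")
    return f'"{text}"'


def triple(subject: str, predicate: str, obj: str) -> str:
    """Format one N-Triples line; obj is already serialised."""
    return f"<{subject}> <{predicate}> {obj} ."


def place_to_ntriples(place: Dict[str, Any]) -> List[str]:
    place_id = place["id"]
    subject = f"{BASE}place/{place_id}"
    lines: List[str] = []

    # Every tag becomes an OWL class with a label; the place is an
    # instance of each of those classes.
    for tag in place.get("tags", []):
        cls = f"{BASE}class/{tag['id']}"
        lines.append(triple(cls, RDF_TYPE, f"<{OWL_CLASS}>"))
        lines.append(triple(cls, RDFS_LABEL, literal(tag["title"])))
        lines.append(triple(subject, RDF_TYPE, f"<{cls}>"))

    for name, value in place.items():
        if name in ("id", "tags"):
            continue
        prop = f"{BASE}property/{name}"
        if isinstance(value, dict):
            # Nested attribute: intermediate node named by concatenating
            # the property name with the place ID.
            node = f"{BASE}node/{name}_{place_id}"
            lines.append(triple(subject, prop, f"<{node}>"))
            for sub_name, sub_value in value.items():
                sub_prop = f"{BASE}property/{sub_name}"
                lines.append(triple(node, sub_prop, literal(sub_value)))
        else:
            # Flat attribute: custom property pointing to a literal.
            lines.append(triple(subject, prop, literal(value)))
    return lines
```

The sketch only handles one level of nesting; deeper structures would need
either recursion or the flattening step mentioned in the post.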
On top of that, in order to create a synthetically interlinked pair of
datasets, we split the Wikimapia RDF dataset, duplicating the geometries and
dividing them between the two datasets in the following way. For each
polygon geometry, we created a point geometry located at the centroid of the
polygon and then shifted the point by a random (but bounded) factor. The
polygon was left in the first dataset, while the point was transferred to
the second dataset. The rest of the properties were distributed between the
two datasets as follows. The first dataset consists of metadata containing
the main information about the Wikimapia places and edit information about
users, timestamps, deletion state and editors. The second dataset consists
of metadata concerning basic info, location and language information. This
way, the two sub-datasets essentially refer to the same Wikimapia entities,
differing only in geometric and metadata information. Each of the two
sub-datasets contains 102,019 geospatial entities; the first one contains
1,225,049 triples and the second one 4,633,603 triples.
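The centroid-plus-shift step used to derive the second dataset's point
geometries might look like the sketch below. The post does not say how the
shift was bounded or distributed, so the uniform offset of at most
max_offset degrees on each axis is an assumption, as are the function
names. The centroid is the usual area-weighted (shoelace) centroid computed
on plain lon/lat coordinates, which is adequate for city-scale polygons.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]  # (lon, lat)


def polygon_centroid(ring: List[Point]) -> Point:
    """Area-weighted centroid of a simple polygon ring (planar maths)."""
    # Accept both closed rings (first == last) and open vertex lists.
    pts = list(ring[:-1]) if len(ring) > 1 and ring[0] == ring[-1] else list(ring)
    if len(pts) < 3:
        raise ValueError("a polygon needs at least three distinct vertices")

    twice_area = 0.0
    cx = 0.0
    cy = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:] + pts[:1]):
        cross = x0 * y1 - x1 * y0
        twice_area += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    if twice_area == 0:
        raise ValueError("degenerate polygon with zero area")
    # Centroid = sum / (6 * area), and 6 * area == 3 * twice_area.
    return (cx / (3 * twice_area), cy / (3 * twice_area))


def shifted_point(ring: List[Point], max_offset: float,
                  rng: random.Random) -> Point:
    """Centroid moved by a random offset of at most max_offset per axis."""
    cx, cy = polygon_centroid(ring)
    dx = rng.uniform(-max_offset, max_offset)
    dy = rng.uniform(-max_offset, max_offset)
    return (cx + dx, cy + dy)


if __name__ == "__main__":
    square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]
    print(polygon_centroid(square))
    print(shifted_point(square, max_offset=0.001, rng=random.Random(0)))
```

Passing a seeded random.Random makes the synthetic pair reproducible, which
is convenient when the two datasets serve as a benchmark for interlinking
tools.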

Seven Greek INSPIRE-compliant data themes of Annex I
URL: http://geodata.gov.gr/sparql/
For the INSPIRE to RDF use case, we selected seven data themes from Annex I,
which are described in the table below. Although all metadata in
geodata.gov.gr is fully compatible with INSPIRE regulations, the data is not,
because it has been integrated from several diverse sources, which have
rarely followed the proper standards. Thus, due to data variety, provenance,
and excessive volume, its transformation into INSPIRE-compliant datasets is
a time-consuming and demanding task.
The first step was the alignment of the data to INSPIRE Annex I. To this
end, we utilised the Humboldt Alignment Editor, a powerful open-source tool
with a graphical interface and a high-level language for expressing custom
alignments. Such a transformation can be used to turn a non-harmonised data
source into an INSPIRE-compliant dataset. It only requires a source schema
(an .xsd for the local GML file) and a target one (an .xsd implementing an
INSPIRE data schema). As soon as the schema mapping was defined, the source
GML data was loaded, and the INSPIRE-aligned GML file was produced.
The second step was the transformation into RDF. This process was quite
straightforward, given a set of suitable XSL stylesheets. We developed all
these transformations in XSLT 2.0, implementing one parametrised stylesheet
per selected data theme. By default, all geometries were encoded in WKT
serialisations according to GeoSPARQL. The produced RDF triples were finally
loaded and made available in both Virtuoso and Parliament RDF stores, at
http://geodata.gov.gr/sparql, as a proof of concept.
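To make the GeoSPARQL encoding concrete, the sketch below shows what a WKT
geometry looks like once attached to a feature as RDF. The project did this
with XSLT 2.0 stylesheets; the Python here is only a stand-in for
illustration, and the function names and example.org IRIs are hypothetical.
The geo:hasGeometry and geo:asWKT properties and the geo:wktLiteral datatype
come from the GeoSPARQL vocabulary. Without an explicit CRS IRI in the
literal, GeoSPARQL assumes CRS84, so coordinates are written as longitude
then latitude.

```python
from typing import List, Sequence, Tuple

GEO = "http://www.opengis.net/ont/geosparql#"


def polygon_wkt(ring: Sequence[Tuple[float, float]]) -> str:
    """Return a WKT POLYGON for one exterior ring of (lon, lat) pairs."""
    pts = list(ring)
    # WKT rings must be closed: repeat the first vertex if needed.
    if pts and pts[0] != pts[-1]:
        pts.append(pts[0])
    if len(pts) < 4:
        raise ValueError("a ring needs at least three distinct vertices")
    coords = ", ".join(f"{x} {y}" for x, y in pts)
    return f"POLYGON (({coords}))"


def geometry_triples(feature_iri: str, geometry_iri: str,
                     ring: Sequence[Tuple[float, float]]) -> List[str]:
    """N-Triples lines linking a feature to its WKT geometry."""
    wkt = polygon_wkt(ring)
    return [
        f"<{feature_iri}> <{GEO}hasGeometry> <{geometry_iri}> .",
        f'<{geometry_iri}> <{GEO}asWKT> "{wkt}"^^<{GEO}wktLiteral> .',
    ]


if __name__ == "__main__":
    # Hypothetical IRIs and a made-up triangle, for illustration only.
    for line in geometry_triples(
        "http://example.org/feature/1",
        "http://example.org/feature/1/geometry",
        [(22.95, 40.57), (22.96, 40.57), (22.96, 40.58)],
    ):
        print(line)
```

Triples of this shape can be loaded into any GeoSPARQL-aware store and
queried with the standard spatial filter functions.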


INSPIRE Data Theme         Number of  Number of  Greek dataset
                            features    triples
[GN] Geographical names       13 259    304 957  Settlements, towns, and localities in
                                                 Greece.
[AU] Administrative units        326      9 454  All Greek municipalities after the most
                                                 recent restructuring (“Kallikratis”).
[AD] Addresses                10 776    277 838  Street addresses in Kalamaria
                                                 municipality.
[CP] Cadastral parcels           965     13 510  The building blocks in Kalamaria are
                                                 used. Data from the official Greek
                                                 Cadastre are not available through
                                                 geodata.gov.gr.
[TN] Transport networks        2 584     59 432  Urban road network in Kalamaria.
[HY] Hydrography               4 299    120 372  All rivers and waterstreams in Greece.
[PS] Protected sites             419     10 894  All areas of natural preservation in
                                                 Greece according to the EU Natura 2000
                                                 network.




----------

This post sent on Geospatial Semantic Web Community Group



'GeoKnow Public Datasets'

https://www.w3.org/community/geosemweb/2015/11/20/geoknow-public-datasets/



Learn more about the Geospatial Semantic Web Community Group:

https://www.w3.org/community/geosemweb
Received on Friday, 20 November 2015 10:00:13 UTC