- From: Frans Knibbe <frans.knibbe@geodan.nl>
- Date: Fri, 20 Nov 2015 11:06:57 +0100
- To: SDW WG Public List <public-sdw-wg@w3.org>
- Message-ID: <CAFVDz40LyM_w9VDschxpKEDi45Wtb4HBB_dN=a8tcEfX_7Xkyg@mail.gmail.com>
Oops, I missed Andrea's message. Sorry for the duplicate. Frans

2015-11-20 10:59 GMT+01:00 Frans Knibbe <frans.knibbe@geodan.nl>:

Hello all, especially BP editors,

Some results from the GeoKnow project <http://geoknow.eu/Welcome.html> were shared on the geosemweb list (the W3C version). I thought it would be good to forward the message, because it is about practices that could be investigated in the search for best practices.

Regards,
Frans

---------- Forwarded message ----------
From: W3C Community Development Team <team-community-process@w3.org>
Date: 2015-11-20 9:20 GMT+01:00
Subject: GeoKnow Public Datasets [via Geospatial Semantic Web Community Group]
To: public-geosemweb@w3.org

In this blog post we want to present three public datasets that were improved or created in the GeoKnow project.

LinkedGeoData
Size: 177 GB zipped Turtle file
URL: http://linkedgeodata.org/

LinkedGeoData is the RDF version of OpenStreetMap (OSM), covering geospatial data for the entire planet. As of September 2014 the zipped XML file from OSM held 36 GB of data, while the zipped LGD files in Turtle format amount to 177 GB. A detailed description of the dataset can be found in deliverable D1.3.2, Continuous Report on Performance Evaluation.

Technically, LinkedGeoData is a set of SQL files, database-to-RDF (RDB2RDF) mappings, and bash scripts. The actual RDF conversion is carried out by the SPARQL-to-SQL rewriter Sparqlify. You can view the Sparqlify mappings for LinkedGeoData here. The maintenance and improvement of the mappings required to transform OSM data to RDF has been ongoing throughout the project. This dataset has been used in several use cases, and especially in all benchmarking tasks within GeoKnow.

Wikimapia
URL: http://wikimapia.org/api/

Wikimapia is a crowdsourced, open-content, collaborative mapping initiative where users can contribute mapping information. This dataset already existed before the project started; however, it was only accessible through Wikimapia's API and provided in XML or JSON format. Within GeoKnow, we downloaded several sets of geospatial entities from Wikimapia, including both spatial and non-spatial attributes for each entity, and transformed them into RDF data. The process we followed is described next.

We considered a set of cities throughout the world (Athens, London, Leipzig, Berlin, New York) and downloaded the whole content provided by Wikimapia for the geospatial entities included in those geographical areas. Most of these cities were preferred since they are the base cities of several partners in the project, while the remaining two were selected randomly, with the aim of reaching our target of more than 100,000 spatial entities from Wikimapia. Apart from geometries, Wikimapia provides a very rich set of metadata (non-spatial properties) for each entity, e.g. tags and categories describing the geospatial entities, topological relations with nearby entities, and comments of the users. The aforementioned dumps were transformed into RDF triples in a straightforward way: (a) defining intermediate resources (functioning as blank nodes) where information was organised in more than one level, (b) flattening the information of deeper levels where possible, in order to simplify the structure of the dataset, and (c) transforming tags into OWL classes.
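To make steps (a) and (c) concrete before the detailed description that follows, here is a minimal Python/rdflib sketch, assuming a hypothetical namespace and a Wikimapia place already parsed into a dict; all names here are illustrative, not taken from the actual GeoKnow tooling:

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, RDFS

    WM = Namespace("http://example.org/wikimapia/")  # assumed namespace

    def place_to_triples(g, place):
        """Convert one parsed Wikimapia place (a dict) into RDF triples."""
        s = WM[f"place/{place['id']}"]
        # (c) every tag becomes an OWL class; the place is an instance of it
        for tag in place.get("tags", []):
            cls = WM[f"class/{tag['id']}"]
            g.add((cls, RDF.type, OWL.Class))
            g.add((cls, RDFS.label, Literal(tag["title"])))
            g.add((s, RDF.type, cls))
        # (a) nested attributes hang off an intermediate resource whose
        # name concatenates the attribute with the place ID
        edit = WM[f"place/{place['id']}/editInfo"]
        g.add((s, WM["editInfo"], edit))
        for key, value in place.get("edit_info", {}).items():
            g.add((edit, WM[key], Literal(value)))

    g = Graph()
    place_to_triples(g, {"id": 12345,
                         "tags": [{"id": 46, "title": "church"}],
                         "edit_info": {"user_id": 7, "date": "2014-09-01"}})
    print(g.serialize(format="nt"))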
Specifically, we developed a parsing tool to communicate with the Wikimapia API and construct appropriate N-Triples from the dataset. The tool takes as input a bounding box in the form of WGS84 coordinates (min long, min lat, max long, max lat). We chose five initial bounding boxes, one for each of the cities mentioned above; each bounding box was defined in such a way that it covered the whole area of the selected city. Each bounding box was then further divided by the tool into a grid of smaller bounding boxes, in order to overcome the upper limit per area on the number of entities returned by the Wikimapia API (a sketch of this subdivision follows at the end of this section).

For each place returned, we transformed all properties into RDF triples. Every tag was assigned an OWL class and an appropriate label, corresponding to the textual description in the initial Wikimapia XML file. Each place became an instance of the classes provided by its tags. For the rest of the returned Wikimapia attributes, we created a custom property in a uniform way for each attribute of the returned Wikimapia XML file. The properties resulting from the Wikimapia XML attributes point to their literal values. For example, we construct properties for each place's language id, Wikipedia link, URL, title, description, edit info, location info, global administrative areas, available languages, and geometry information. Where these attributes follow a deeper tree structure, we assign the properties to intermediate custom nodes named by concatenating the property with the place ID; these nodes function as blank nodes and connect the initial entity with a set of properties and their respective values. This process resulted in an initial geospatial RDF dataset containing, for each entity, the polygon geometry that represents it, along with a wealth of non-spatial properties. The dataset contains 102,019 geospatial entities and 4,629,223 triples.

On top of that, in order to create a synthetically interlinked pair of datasets, we split the Wikimapia RDF dataset, duplicating the geometries and dividing them between two datasets in the following way. For each polygon geometry, we created a point geometry located at the centroid of the polygon and then shifted the point by a random (but bounded) factor. The polygon was left in the first dataset, while the point was transferred to the second dataset. The rest of the properties were distributed between the two datasets as follows: the first dataset consists of metadata containing the main information about the Wikimapia places and edit information about users, timestamps, deletion state, and editors; the second dataset consists of metadata concerning basic info, location, and language information. This way, the two sub-datasets essentially refer to the same Wikimapia entities, differing only in geometric and metadata information. Each of the two sub-datasets contains 102,019 geospatial entities; the first contains 1,225,049 triples and the second 4,633,603 triples.
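A minimal sketch of the two geometric operations described above (the bounding-box subdivision and the shifted centroid), assuming simple WGS84 (lon, lat) tuples; the function names and the shift bound are assumptions, not the project's actual code:

    import random

    def split_bbox(min_lon, min_lat, max_lon, max_lat, n):
        """Divide a bounding box into an n x n grid of smaller boxes,
        so that each API request stays under the per-area entity limit."""
        dx = (max_lon - min_lon) / n
        dy = (max_lat - min_lat) / n
        return [(min_lon + i * dx, min_lat + j * dy,
                 min_lon + (i + 1) * dx, min_lat + (j + 1) * dy)
                for i in range(n) for j in range(n)]

    def shifted_centroid(polygon, max_shift=0.001):
        """Return the vertex average of a polygon (a simple stand-in for
        the true centroid), shifted by a bounded random amount."""
        xs = [lon for lon, lat in polygon]
        ys = [lat for lon, lat in polygon]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        return (cx + random.uniform(-max_shift, max_shift),
                cy + random.uniform(-max_shift, max_shift))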
Seven Greek INSPIRE-compliant data themes of Annex I
URL: http://geodata.gov.gr/sparql/

For the INSPIRE-to-RDF use case, we selected seven data themes from Annex I, which are described in the table below. Although all metadata in geodata.gov.gr is fully compatible with the INSPIRE regulations, the data is not, because it has been integrated from several diverse sources, which have rarely followed the proper standards. Thus, due to data variety, provenance, and excessive volume, its transformation into INSPIRE-compliant datasets is a time-consuming and demanding task.

The first step was the alignment of the data to INSPIRE Annex I. To this end, we utilised the Humboldt Alignment Editor, a powerful open-source tool with a graphical interface and a high-level language for expressing custom alignments. Such a transformation can be used to turn a non-harmonised data source into an INSPIRE-compliant dataset. It only requires a source schema (an .xsd for the local GML file) and a target one (an .xsd implementing an INSPIRE data schema). As soon as the schema mapping was defined, the source GML data was loaded and the INSPIRE-aligned GML file was produced.

The second step was the transformation into RDF. This process was quite straightforward, given a set of suitable XSL stylesheets. We developed all these transformations in XSLT 2.0, implementing one parametrised stylesheet per selected data theme. By default, all geometries were encoded in WKT serialisations according to GeoSPARQL. The produced RDF triples were finally loaded into both Virtuoso and Parliament RDF stores and made available at http://geodata.gov.gr/sparql as a proof of concept (a minimal query sketch follows the table below).

INSPIRE data theme        | Greek dataset                                                | Features | Triples
[GN] Geographical names   | Settlements, towns, and localities in Greece                 | 13,259   | 304,957
[AU] Administrative units | All Greek municipalities after the most recent restructuring ("Kallikratis") | 326 | 9,454
[AD] Addresses            | Street addresses in Kalamaria municipality                   | 10,776   | 277,838
[CP] Cadastral parcels    | Building blocks in Kalamaria (data from the official Greek Cadastre are not available through geodata.gov.gr) | 965 | 13,510
[TN] Transport networks   | Urban road network in Kalamaria                              | 2,584    | 59,432
[HY] Hydrography          | All rivers and water streams in Greece                       | 4,299    | 120,372
[PS] Protected sites      | All areas of natural preservation in Greece according to the EU Natura 2000 network | 419 | 10,894
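To illustrate how the published triples can be consumed, here is a minimal query sketch against the endpoint, assuming the geometries are exposed through the standard GeoSPARQL geo:hasGeometry / geo:asWKT pattern (the predicates actually used by the dataset may differ):

    from SPARQLWrapper import SPARQLWrapper, JSON

    endpoint = SPARQLWrapper("http://geodata.gov.gr/sparql")
    endpoint.setQuery("""
        PREFIX geo: <http://www.opengis.net/ont/geosparql#>
        SELECT ?feature ?wkt WHERE {
          ?feature geo:hasGeometry ?geom .
          ?geom geo:asWKT ?wkt .
        }
        LIMIT 10
    """)
    endpoint.setReturnFormat(JSON)
    for row in endpoint.query().convert()["results"]["bindings"]:
        print(row["feature"]["value"], row["wkt"]["value"])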
----------
This post sent on Geospatial Semantic Web Community Group

'GeoKnow Public Datasets'
https://www.w3.org/community/geosemweb/2015/11/20/geoknow-public-datasets/

Learn more about the Geospatial Semantic Web Community Group:
https://www.w3.org/community/geosemweb

Received on Friday, 20 November 2015 10:07:35 UTC