Basque Government Linked Open Data: URIs for content in different languages from Mikel Egaña Aranguren on 2017-02-10 (public-lod@w3.org from February 2017)

From: Mikel Egaña Aranguren <mikel.egana.aranguren@gmail.com>
Date: Fri, 10 Feb 2017 17:25:11 +0100
To: public-lod <public-lod@w3.org>
Message-ID: <CABf_9zJgjquHtwNEBirqLak5TbiGwZrge0XgkpCOrFOVS2wS2Q@mail.gmail.com>
Hi all;

We have been hired by the Basque Government [1] to help them "enter the
Linked Data world" through a 90,000 EUR pilot project [2]. I have a
question about multilingual content, but any thoughts on our first
approach, presented below, are welcome.

The project comprises two different but overlapping areas:

*1.- Web Content*

Add schema.org data, serialized as JSON-LD, to the current Web pages,
describing the content of the pages. For example, add something like the
following snippet (with more data) to the Web document about the President
(note the different URIs; see the explanation below):

[http://gida.irekia.euskadi.eus/eu/people/8080]


{
  "@context": "http://schema.org",
  "@type": "Person",
  "@id": "http://euskadi.eus/id/inigo-urkullu",
  "email": "gab-lehendak@euskadi.eus",
  "telephone": "945017900",
  "mainEntityOfPage": {
    "@type": "WebPage",
    "@id": "http://euskadi.eus/page/inigo-urkullu"
  }
}


*2.- Open Data*

Convert some of the datasets of the current Open Data Portal (
http://opendata.euskadi.eus) to RDF and publish them as Linked Data. These
datasets may also refer to entities from Web content: for example, the
president might appear in a CSV staff list in the Open Data portal.


The overlapping part is a further requirement to migrate some of the
content from the current URL (Document) based system to a URI (Resource)
based system. Thus, for example, the current Web page for the President (
http://gida.irekia.euskadi.eus/eu/people/8080) should become something like
http://euskadi.eus/id/inigo-urkullu (a persistent URI that follows the
usual best practices for URIs). The RDF representation of that URI will
contain RDF from both the Open Data portal (e.g. staff-related data) and
the JSON-LD of the Web page (e.g. email), hence the URIs in the JSON-LD
above.
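
A minimal Python sketch of that merge, to show that the shared /id/ URI is
what joins the two sources. The staff row, its "slug" and "jobTitle"
fields, and the function name are hypothetical and only illustrate the
idea; the real pipeline would presumably work on RDF in the Triple Store
rather than on dictionaries.

```python
import json

# Trimmed copy of the JSON-LD embedded in the Web page.
WEB_JSONLD = """
{
  "@context": "http://schema.org",
  "@type": "Person",
  "@id": "http://euskadi.eus/id/inigo-urkullu",
  "email": "gab-lehendak@euskadi.eus"
}
"""

# Hypothetical row from a staff CSV in the Open Data portal.
STAFF_ROW = {"slug": "inigo-urkullu", "jobTitle": "President"}

ID_BASE = "http://euskadi.eus/id/"


def merge_descriptions(jsonld_doc: dict, staff_row: dict) -> dict:
    """Combine both sources into one description of the same resource."""
    uri = ID_BASE + staff_row["slug"]
    # Both sources must talk about the same /id/ URI, otherwise do not merge.
    if jsonld_doc.get("@id") != uri:
        raise ValueError(f"JSON-LD describes {jsonld_doc.get('@id')}, not {uri}")
    merged = dict(jsonld_doc)
    for key, value in staff_row.items():
        if key != "slug":
            # Values already present in the Web content win over the CSV ones.
            merged.setdefault(key, value)
    return merged


merged = merge_descriptions(json.loads(WEB_JSONLD), STAFF_ROW)
print(json.dumps(merged, indent=2))
```

Letting the Web content win on conflicts is an arbitrary choice made for
the sketch, not a decision of the project.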

In terms of content negotiation, the situation is a bit more complex than
the usual Linked Data setting I'm acquainted with. For datasets from the
Open Data portal that have no pre-existing Web content, there is no
problem: 303 redirects will do the usual job. For example, for the URI
of a sensor that measures air quality, we would have something like this
(remember that there was no prior Web page describing the sensor):

http://euskadi.eus/id/sensor-1 [Resource identifier of the entity "sensor
1"]
303 http://euskadi.eus/data/sensor-1 [RDF data about the sensor]
303 http://euskadi.eus/doc/sensor-1 [An HTML, "ugly" rendering of the RDF
data, a la DBpedia]

For content that already existed on the Web, like the president, the
process is a bit more convoluted:

http://euskadi.eus/id/inigo-urkullu [Resource identifier of the entity
"Iñigo Urkullu"]
303 http://euskadi.eus/data/inigo-urkullu [RDF data about the president,
including both data from Open Data and data from the Web content, via
JSON-LD]
303 http://euskadi.eus/page/inigo-urkullu [A nice HTML page containing some
of the RDF data, as JSON-LD, plus other, pure Web content that does not
exist as data]

The page http://euskadi.eus/page/inigo-urkullu has two HTML links, with
appropriate icons, pointing at:
http://euskadi.eus/data/inigo-urkullu [RDF data about the president,
including both data from Open Data and data from the Web content, via
JSON-LD, as already described]
http://euskadi.eus/doc/inigo-urkullu [An HTML, "ugly" rendering of all the
RDF data about the president, a la DBpedia]

When an HTML representation of the ID is requested, the redirect target
depends on the schema:mainEntityOfPage predicate: if the predicate exists
(the president), the fancy Web page (/page/) is provided via 303; otherwise
(the sensor, which has no schema:mainEntityOfPage predicate because there
was no "prior" Web page), the "ugly" Web page (/doc/) is provided.
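
The redirect rules above can be sketched in Python as follows. This is
only an illustration: the function names, the lists of media types and
the fallback to HTML when nothing matches are assumptions, and the
Accept parsing is deliberately simplistic (no wildcards, no
specificity rules).

```python
BASE = "http://euskadi.eus"

# Assumed sets of media types; a real deployment may support others.
RDF_MEDIA_TYPES = {
    "text/turtle",
    "application/rdf+xml",
    "application/ld+json",
    "application/n-triples",
}
HTML_MEDIA_TYPES = {"text/html", "application/xhtml+xml"}


def preferred_kind(accept: str) -> str:
    """Return "rdf" or "html" for the highest-q media type we recognise."""
    best_kind, best_q = "html", -1.0  # fall back to HTML if nothing matches
    for part in accept.split(","):
        fields = part.strip().split(";")
        media = fields[0].strip().lower()
        q = 1.0
        for field in fields[1:]:
            field = field.strip()
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if media in RDF_MEDIA_TYPES:
            kind = "rdf"
        elif media in HTML_MEDIA_TYPES:
            kind = "html"
        else:
            continue
        if q > best_q:
            best_kind, best_q = kind, q
    return best_kind


def redirect_target(slug: str, accept: str, has_main_entity_of_page: bool):
    """Return the (status, Location) pair for a request to /id/<slug>."""
    if preferred_kind(accept) == "rdf":
        path = "data"
    elif has_main_entity_of_page:
        path = "page"  # pre-existing, "nice" Web page
    else:
        path = "doc"   # generic rendering of the RDF
    return 303, f"{BASE}/{path}/{slug}"


print(redirect_target("sensor-1", "text/html", False))
print(redirect_target("inigo-urkullu", "text/html", True))
print(redirect_target("inigo-urkullu", "text/turtle", True))
```

In practice has_main_entity_of_page would come from a lookup in the
Triple Store (for instance a SPARQL ASK query); here it is just a
boolean argument.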

Web content (the president) is of high quality and very linkable, and
that's why we want to include it in the Triple Store, via JSON-LD, to have
some "anchor" entities in the data, with a lot of links. (The JSON-LD is
programmatically created by the current content manager software).

This is a very preliminary, sketchy architecture, and thoughts about it
are welcome, but my question is about the URIs themselves: there are two
official languages in the Basque Country (Spanish and Basque) and the same
content is usually duplicated (or even triplicated, sometimes including
English):

- Web pages:
http://gida.irekia.euskadi.eus/eu/people/8080
http://gida.irekia.euskadi.eus/es/people/8080

- Datasets:
http://opendata.euskadi.eus/katalogoa/-/2015eko-igorpen-eta-jatorri-kutsagarrien-euskal-erregistroa-eper-e-prtr/
http://opendata.euskadi.eus/catalogo/-/registro-vasco-de-emisiones-y-fuentes-contaminantes-del-2015-eper-euskadi-e-prtr/

Therefore the easiest solution would be to mint URIs according to language,
like DBpedia. Thus the president would have two URIs:

http://euskadi.eus/id/es/inigo-urkullu ("Spanish" president)
http://euskadi.eus/id/eu/inigo-urkullu ("Basque" president)

Both resources would have to be related via owl:sameAs in the Triple Store
[3]. The advantage of this is that one can follow the current division when
it comes to converting data to RDF. However, my gut feeling is that I
should go for a "pure Linked Data" solution: mint a unique ID (
http://euskadi.eus/id/inigo-urkullu) and use RDF language tags (@es and
@eu) on the literals for content in different languages. The latter
solution implies that the content negotiation above should include language
negotiation, which I don't know to be widespread, and it has other side
effects.
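
To make the two options concrete, here is a small Python sketch that
writes both modelings as N-Triples and adds the language selection that
the single-URI option would need. The schema:jobTitle literals, the
function names and the fallback to Basque are assumptions made for the
example, not project decisions.

```python
SCHEMA_JOB_TITLE = "http://schema.org/jobTitle"
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
ID_BASE = "http://euskadi.eus/id"

# Illustrative literals only, keyed by language tag.
TITLES = {"eu": "Lehendakaria", "es": "Presidente"}


def per_language_triples(slug, literals):
    """Option 1: one URI per language, chained together with owl:sameAs."""
    langs = sorted(literals)
    lines = [
        f'<{ID_BASE}/{lang}/{slug}> <{SCHEMA_JOB_TITLE}> "{literals[lang]}"@{lang} .'
        for lang in langs
    ]
    for a, b in zip(langs, langs[1:]):
        lines.append(f"<{ID_BASE}/{a}/{slug}> <{OWL_SAME_AS}> <{ID_BASE}/{b}/{slug}> .")
    return lines


def single_uri_triples(slug, literals):
    """Option 2: one URI, one language-tagged literal per language."""
    # No escaping of quotes in literals; fine for this sketch only.
    return [
        f'<{ID_BASE}/{slug}> <{SCHEMA_JOB_TITLE}> "{literals[lang]}"@{lang} .'
        for lang in sorted(literals)
    ]


def pick_literal(literals, accept_language, default="eu"):
    """Choose a literal from an Accept-Language header (simplified parsing)."""
    prefs = []
    for part in accept_language.split(","):
        fields = part.strip().split(";")
        tag = fields[0].strip().lower()
        if not tag:
            continue
        q = 1.0
        for field in fields[1:]:
            field = field.strip()
            if field.startswith("q="):
                try:
                    q = float(field[2:])
                except ValueError:
                    q = 0.0
        if q > 0:
            prefs.append((tag, q))
    prefs.sort(key=lambda pref: -pref[1])  # stable sort, highest q first
    for tag, _ in prefs:
        primary = tag.split("-")[0]  # "es-ES" matches "es"
        if primary in literals:
            return primary, literals[primary]
    return default, literals[default]


print("\n".join(per_language_triples("inigo-urkullu", TITLES)))
print("\n".join(single_uri_triples("inigo-urkullu", TITLES)))
print(pick_literal(TITLES, "es-ES,es;q=0.9,en;q=0.8"))
```

With two languages, option 1 needs three triples and owl:sameAs handling
at query time, while option 2 needs two triples but moves the language
choice into the HTTP layer.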

So I'm more inclined towards one URI per language, because it is the
easiest option, but I would still like to hear any thoughts on the "One
URI - different rdfs:labels" solution before discarding it completely.

Thanks!

Regards

[1] https://en.wikipedia.org/wiki/Basque_Government
[2] http://www.contratacion.euskadi.eus/w32-1084/es/contenidos/anuncio_contratacion/expx74j21656/es_doc/es_arch_expx74j21656.html
[3] Stardog and GraphDB (and probably others) implement special methods for
owl:sameAs "inference" in SPARQL queries, to make queries efficient.

-- 
Mikel Egaña Aranguren, Ph.D.

https://mikel-egana-aranguren.github.io
Received on Friday, 10 February 2017 16:25:47 UTC