W3C home > Mailing lists > Public > public-dwbp-wg@w3.org > March 2014

Re: Semantics and Data Consumption

From: Carlos Iglesias <carlos.iglesias.moro@gmail.com>
Date: Sat, 29 Mar 2014 21:06:43 +0100
Message-ID: <CAAa1Xz=aY5mEtdUaBE=8m7w6dWrJiWy0PT2Zci3LKhjhxdme2Q@mail.gmail.com>
To: Phil Archer <phila@w3.org>
Cc: Laufer <laufer@globo.com>, Ig Ibert Bittencourt <ig.ibert@gmail.com>, Public DWBP WG <public-dwbp-wg@w3.org>
>
> [...] DKAN and Socrata have their own metamodels. These metamodels define
>> the
>> things. They define the way Data is exposed when you use the tool.
>>
>
> That's where we I believe should have something to say - how data is
> exposed. How it is stored internally is irrelevant. [...]


I agree the relevant part is how data is exposed, but we can't forget that
metamodels also affect how data is exposed. If something has not previously
been stored in the metamodel, there is no way to expose it later.
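To make that concrete, here is a minimal sketch of the point (the field names are purely illustrative, not any real platform's schema): whatever the exporter does, it can only serialize what the metamodel had a slot for.

```python
# Hypothetical sketch: a catalog whose metamodel has no slot for a
# license field. Field names are illustrative, not any real platform's.

STORED_FIELDS = {"title", "description"}  # the metamodel's schema

def store(record: dict) -> dict:
    """Ingestion silently drops anything the metamodel cannot hold."""
    return {k: v for k, v in record.items() if k in STORED_FIELDS}

def export(stored: dict) -> dict:
    """Any exporter, however good, can only serialize what was stored."""
    return dict(stored)

incoming = {"title": "311 Service Requests",
            "license": "http://example.org/licence"}
exported = export(store(incoming))
print("license" in exported)  # False: lost at ingestion, unrecoverable later
```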


> [...]
>
>  They could tell us how they see this market and at
>
>> the same time they could talk about their metamodels and they could be one
>> of the targets of the Best Practices, including in their metamodels
>> features that could help to implement the recommendations of the WG.
>>
>
> In theory, CKAN exposes metadata about the datasets using DCAT. They don't
> do it very well, however.


Phil is being more politically correct here than I will be. In practice
they are hardly using DCAT at all, at least the DCAT we all know. We did an
interesting comparative analysis for the Government of Aragón (the Spanish
regional one) based on CKAN 1.8 (support has not improved since then), the
DCAT WD of 5 Nov (no changes with respect to the final recommendation), the
final DCAT-AP, and the Spanish National Interoperability Framework. It is
available at [1] (sorry, only in Spanish, but you can still get the overall
picture). The companion in-depth analysis report is also available at [2],
again in Spanish. I could provide more insights if someone is interested
and a translation tool does not do a good job here. By the way, the
conclusions show that the underlying metadata model is one of the reasons
for the poor support, although probably not the most important one.


> If there is one question I can be sure I'm going to be asked at pretty
> well every event I go to it's "why is CKAN so bad at using DCAT?" As you
> can imagine, I talk to OKFN about that from time to time ;-) Their model
> remains focussed on helping humans find the data rather than machines.


I can't imagine how a human could make use of CKAN other than by means of a
machine :-) So they have apparently forgotten that in fact they are already
helping machines to find data and expose it to humans.


> However, the European Commission's DCAT Application Profile is, I hope
> leading to change. Since the EC uses CKAN and wants its own portals, and
> everyone else's, to use DCAT in the same way, there is pressure and, more
> importantly I believe money, forthcoming to fix this in CKAN.
>

In the end it is open source, so you don't need to wait for any money from
the EC. In fact there are already some test implementations of
DCAT/DCAT-AP compliance based on CKAN, and there is also a CKAN extension
with improved DCAT support [3]. The nice question here will be how to fix
all the previous historical data (or rather metadata) once DCAT support in
CKAN has been fixed.
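As a rough sketch of what consuming such DCAT output could look like (the URL pattern below is an assumption about how a DCAT-enabled portal exposes serializations, not a documented guarantee of any extension; check the portal's own routes before relying on it):

```python
import urllib.request

def dataset_rdf_url(portal: str, dataset_id: str) -> str:
    """Build the RDF export URL for a dataset's metadata.

    The /dataset/<id>.rdf pattern is an assumption about a
    DCAT-enabled CKAN portal's routes, for illustration only.
    """
    return f"{portal.rstrip('/')}/dataset/{dataset_id}.rdf"

def fetch_dataset_rdf(portal: str, dataset_id: str) -> bytes:
    """GET the RDF serialization of a single dataset's metadata."""
    with urllib.request.urlopen(dataset_rdf_url(portal, dataset_id)) as resp:
        return resp.read()

# e.g. fetch_dataset_rdf("http://demo.ckan.org", "some-dataset-id")
```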

[1] http://opendata.aragon.es/public/documentos/AnexoI_Analisis_Metadatos_Aragon_OpenData_v31-01-14.pdf
[2] http://opendata.aragon.es/public/documentos/Informe_NTI_Aragon_OpenData_v31-01-14.pdf
[3] https://github.com/ckan/ckanext-dcat

Best,
 CI.



>
>
>> Best Regards,
>> Laufer
>>
>>
>> 2014-03-27 18:11 GMT-03:00 Ig Ibert Bittencourt <ig.ibert@gmail.com>:
>>
>>  Hi Laufer,
>>>
>>> On Mar 27, 2014 11:27 PM, "Laufer" <laufer@globo.com> wrote:
>>>
>>>>
>>>> Ig,
>>>>
>>>> What I am trying to expose is that we should differentiate the ideas of
>>>>
>>> the RDF Model and Linked Data from the way Data is stored.
>>>
>>> +1
>>>
>>>
>>>> Besides that, I think we should take into account the tools that are
>>>>
>>> being used to expose Data on the Web.
>>>
>>>>
>>>>
>>> What do you mean?
>>>
>>>  Best,
>>>> Laufer
>>>>
>>>>
>>>> 2014-03-27 5:38 GMT-03:00 Ig Ibert Bittencourt <ig.ibert@gmail.com>:
>>>>
>>>>  Hi Laufer,
>>>>>
>>>>> Thank you for your didactic e-mail. :)
>>>>>
>>>>> I agree that Data semantics is very important and we should definitely
>>>>>
>>>> try to connect our data as much as possible to some kind of schema
>>> of
>>> other people's data.
>>>
>>>>
>>>>> As far as I understand, your proposal goes in the same way as the fifth
>>>>>
>>>> star of Tim's 5-star open data plan [1] and also with the third
>>> principle of the Linked Data Principles [2]. Is that right?
>>>
>>>>
>>>>> Even so, IMHO it could be a good idea to reinforce the LD
>>>>>
>>>> principles as best practices.
>>>
>>>>
>>>>> [1] http://5stardata.info/
>>>>> [2] http://www.w3.org/DesignIssues/LinkedData.html
>>>>>
>>>>> All the Best,
>>>>> Ig
>>>>>
>>>>>
>>>>> 2014-03-25 16:25 GMT-03:00 Laufer <laufer@globo.com>:
>>>>>
>>>>>  Hello, All,
>>>>>>
>>>>>>
>>>>>>
>>>>>> I apologize for the long message.
>>>>>>
>>>>>>
>>>>>>
>>>>>> I would like to talk about some concepts that are being discussed by
>>>>>>
>>>>> the WG and are related to Data Formats and Semantics
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> Bernardette published a page in the wiki where she defines phases for
>>>>>>
>>>>> the Data on the Web Lifecycle.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> When we inspect some of the Use Cases and the Stories listed in the
>>>>>>
>>>>> wiki, including the webinars presentations, we can see that there are
>>> more
>>> than one player, a chain of players, that is responsible for allowing the
>>> consumption of Data.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> The Data Generation and the Data Distribution phases are done by
>>>>>>
>>>>> persons that access the raw data to be published but use platforms for
>>> distribution that have their own metamodels as, for example, CKAN and
>>> Socrata.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> The issue "what is the Data format that is consumed" is mixed with the
>>>>>>
>>>>> idea that the Data format of the stored Data is the same format of the
>>> consumed Data. In some Use Cases we can see, in some instances, that the
>>> Publishers store different formats to be downloaded by the Consumers.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> At first sight, it is not important what is the Data format that is
>>>>>>
>>>>> stored in the repository. When someone requests Data, the transformation
>>> (serialization) of the stored Data could (should?) be done by the Data
>>> provider.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> Let's take Socrata as an example. A Dataset in Socrata could be
>>>>>>
>>>>> uploaded from an Excel file, but once it is stored in Socrata cloud, we
>>> don't know what is the Data format of the original Excel file that is
>>> stored as a Dataset. A Data consumer has a standard interface where she
>>> can
>>> browse the Dataset and she can ask the platform to export Data in
>>> different
>>> formats, including pdf, json, xml, rdf and xls.
>>>
>>>>
>>>>>> Socrata also provides an individual Endpoint with an API for each
>>>>>>
>>>>> Dataset. It considers the Endpoint as a way of exporting Data, a slice
>>> of
>>> the whole Dataset.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> When we think about Data semantics, these semantics should be described
>>>>>>
>>>>> as metadata. It can be stored, for example, in a pdf file describing
>>> the
>>> data model, in a technical style or in a free style. What is important is
>>> that the Consumer could understand what is being said about the Data that
>>> she is consuming.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> What could be a Best Practice would be to use a wider common
>>>>>>
>>>>> understanding of this metadata. This is one of the contributions of rdf
>>> model when it defines the use of common vocabularies as a way to describe
>>> the properties of resources. Besides that, it also introduces the idea of
>>> universal identifiers in a way of linking Data from different Datasets.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> There is a huge amount of Data to be loaded on the web that has its
>>>>>>
>>>>> own semantics. People can publish these Data in their own view, letting
>>> the
>>> developers understand each of these semantics and make the
>>> mashups. It's ok. But if the Publishers used common vocabularies,
>>> this
>>> would facilitate the work for the Developers to integrate Data.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> Let's take an example. In NYC Open Data Dataset "311 Service Requests
>>>>>>
>>>>> from 2010 to Present" there are two columns labeled "Latitude" and
>>> "Longitude". The type of these two columns is Number. Well, we can guess
>>> that they are related to the latitude and longitude of the address where
>>> a
>>> service was requested.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> There is a human interface where it is possible to browse the Dataset:
>>>>>>
>>>>>>
>>>>>>  https://data.cityofnewyork.us/Social-Services/311-Service-
>>> Requests-from-2010-to-Present/stnw-hdrd
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> To get the information about a service request we can use the Endpoint
>>>>>>
>>>>> to export Data in json or rdf formats. The columns are identified by
>>> property names derived from the column labels: "Latitude" is identified
>>> as
>>> "latitude"; "Longitude" as "longitude."
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> Using the endpoint created for the Dataset we can obtain the json
>>>>>>
>>>>> output of the first row:
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd.json?$limit=1
>>>>>>
>>>>>> [ {
>>>>>>
>>>>>>
>>>>>>
>>>>>> "longitude" : "-73.76983198736392",
>>>>>>
>>>>>> "latitude" : "40.71159894212768",
>>>>>>
>>>>>>
>>>>>>
>>>>>>   }  ]
>>>>>>
>>>>>>
>>>>>>
>>>>>> Using the endpoint created for the Dataset we can obtain the rdf
>>>>>>
>>>>> output of the first row:
>>>
>>>>
>>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd.rdf?$limit=1
>>>>>>
>>>>>>
>>>>>>
>>>>>> <rdf:RDF
>>>>>>
>>>>>> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>>
>>>>>> xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>>>>>>
>>>>>> xmlns:socrata="http://www.socrata.com/rdf/terms#"
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> xmlns:dsbase="http://data.cityofnewyork.us/resource/"
>>>>>>
>>>>>> xmlns:ds="http://data.cityofnewyork.us/resource/stnw-hdrd/"
>>>>>>
>>>>>> xmlns:usps="http://www.w3.org/2000/10/swap/pim/usps#">
>>>>>>
>>>>>>
>>>>>>
>>>>>> <dsbase:stnw-hdrd rdf:about="
>>>>>>
>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
>>>
>>>>
>>>>>> <socrata:rowID>7055868</socrata:rowID>
>>>>>>
>>>>>> <rdfs:member rdf:resource="
>>>>>>
>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd"/>
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> <ds:latitude>40.71159894212768</ds:latitude>
>>>>>>
>>>>>> <ds:longitude>-73.76983198736392</ds:longitude>
>>>>>>
>>>>>>
>>>>>>
>>>>>> </dsbase:stnw-hdrd>
>>>>>>
>>>>>> </rdf:RDF>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Well, the rdf does not introduce any kind of semantics in this case.
>>>>>>
>>>>> It is only a different serialized format of the Data returned in json.
>>> The
>>> property http://data.cityofnewyork.us/resource/stnw-hdrd/latitude doesn't
>>> have more semantics than the label "Latitude".
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> But Socrata allows the owner of the Dataset to associate an rdf
>>>>>>
>>>>> property to a column. The user can associate any URL as a metadata of
>>> the
>>> column and, besides that, Socrata lists some properties that it
>>> understands
>>> from some vocabularies: dcat; foaf; dublin core; geo.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> I associate to the column "Latitude" the URL:
>>>>>>
>>>>> http://www.w3.org/2003/01/geo/wgs84_pos#lat
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> I associate to the column "Longitude" the URL:
>>>>>>
>>>>> http://www.w3.org/2003/01/geo/wgs84_pos#long
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> I made the endpoint call again:
>>>>>>
>>>>>>
>>>>>>
>>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd.rdf?$limit=1
>>>>>>
>>>>>> <rdf:RDF
>>>>>>
>>>>>> xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
>>>>>>
>>>>>> xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
>>>>>>
>>>>>> xmlns:socrata="http://www.socrata.com/rdf/terms#"
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> xmlns:dsbase="http://data.cityofnewyork.us/resource/"
>>>>>>
>>>>>> xmlns:ds="http://data.cityofnewyork.us/resource/stnw-hdrd/"
>>>>>>
>>>>>> xmlns:usps="http://www.w3.org/2000/10/swap/pim/usps#">
>>>>>>
>>>>>>
>>>>>>
>>>>>> <dsbase:stnw-hdrd rdf:about="
>>>>>>
>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
>>>
>>>>
>>>>>> <socrata:rowID>7055868</socrata:rowID>
>>>>>>
>>>>>> <rdfs:member rdf:resource="
>>>>>>
>>>>> http://data.cityofnewyork.us/resource/stnw-hdrd"/>
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> <geo:lat>40.71159894212768</geo:lat>
>>>>>>
>>>>>> <geo:long>-73.76983198736392</geo:long>
>>>>>>
>>>>>>
>>>>>>
>>>>>> </dsbase:stnw-hdrd>
>>>>>>
>>>>>> </rdf:RDF>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Well, the rdf returned geo:lat and geo:long as the properties of two
>>>>>>
>>>>> numbers that have well-known semantics.
>>>
>>>>
>>>>>>
>>>>>>
>>>>>> For me, this is a Best Practice.
>>>>>>
>>>>>>
>>>>>>
>>>>>> What do you think about this?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I apologize, again, for the long message.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Kind Regards,
>>>>>>
>>>>>> Laufer
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> .  .  .  .. .  .
>>>>>> .        .   . ..
>>>>>> .     ..       .
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>> Ig Ibert Bittencourt
>>>>> Professor Adjunto III - Universidade Federal de Alagoas (UFAL)
>>>>> Vice-Coordenador da Comissão Especial de Informática na Educação
>>>>> Líder do Centro de Excelência em Tecnologias Sociais
>>>>> Co-fundador da Startup MeuTutor Soluções Educacionais LTDA.
>>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> .  .  .  .. .  .
>>>> .        .   . ..
>>>> .     ..       .
>>>>
>>>
>>>
>>
>>
>>
> --
>
>
> Phil Archer
> W3C Data Activity Lead
> http://www.w3.org/2013/data/
>
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1
>
>

-- 
---

Carlos Iglesias.
Internet & Web Consultant.
+34 687 917 759
contact@carlosiglesias.es
@carlosiglesias
http://es.linkedin.com/in/carlosiglesiasmoro/en
Received on Saturday, 29 March 2014 22:11:18 UTC
