Semantics and Data Consumption from Laufer on 2014-03-26 (public-dwbp-wg@w3.org from March 2014)

From: Laufer <laufer@globo.com>
Date: Wed, 26 Mar 2014 00:08:41 -0300
To: public-dwbp-wg <public-dwbp-wg@w3.org>
Message-ID: <CA+pXJij7JCQoHmdzN=X2-a9nAnx1EQsSGAkxycFONtxsB86BJw@mail.gmail.com>

Hello All,

I apologize for the long message.

I would like to talk about some concepts that are being discussed by the WG
and are related to Data Formats and Semantics.

Bernardette published a page in the wiki where she defines phases for the
Data on the Web Lifecycle.

When we inspect some of the Use Cases and the Stories listed in the wiki,
including the webinar presentations, we can see that there is more than
one player, a chain of players, responsible for enabling the consumption
of Data.

The Data Generation and the Data Distribution phases are carried out by
people who access the raw data to be published but distribute it through
platforms that have their own metamodels, for example CKAN and Socrata.

The issue "what is the Data format that is consumed" gets mixed up with the
idea that the format of the stored Data is the same as the format of the
consumed Data. In some Use Cases we can see that the Publishers store the
Data in several formats to be downloaded by the Consumers.

At first sight, the Data format that is stored in the repository is not
important. When someone requests Data, the transformation (serialization)
of the stored Data could (should?) be done by the Data provider.
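
A minimal sketch, in Python, of what "serialization done by the provider"
could look like. The stored rows, the function names and the SERIALIZERS
registry are all hypothetical, chosen only for illustration; they are not
how Socrata or CKAN actually work.

```python
import csv
import io
import json

# Hypothetical stored rows; how the platform keeps them internally is its own business.
STORED_ROWS = [
    {"latitude": "40.71159894212768", "longitude": "-73.76983198736392"},
]


def to_json(rows):
    """Serialize the rows as a JSON array of objects."""
    return json.dumps(rows)


def to_csv(rows):
    """Serialize the rows as CSV, using the keys of the first row as the header."""
    if not rows:
        return ""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()),
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical registry mapping a requested format name to a serializer.
SERIALIZERS = {"json": to_json, "csv": to_csv}


def export(rows, requested_format):
    """Return the stored rows in the format the Consumer asked for."""
    try:
        serializer = SERIALIZERS[requested_format]
    except KeyError:
        raise ValueError(f"unsupported format: {requested_format}")
    return serializer(rows)
```

The Consumer only names a format, for example export(STORED_ROWS, "csv");
the stored representation never leaks out.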

Let's take Socrata as an example. A Dataset in Socrata could be uploaded
from an Excel file, but once it is stored in the Socrata cloud, we don't
know in what format the Data from the original Excel file is stored as a
Dataset. A Data consumer has a standard interface where she can browse the
Dataset and ask the platform to export Data in different formats,
including pdf, json, xml, rdf and xls.

Socrata also provides an individual Endpoint with an API for each Dataset.
It considers the Endpoint as a way of exporting Data, a slice of the whole
Dataset.

When we think about Data semantics, these semantics could be described as
metadata in different forms. The metadata can be stored, for example, in a
pdf file describing the data model, in a technical style or in a free
style. What is important is that the Consumer can understand what is being
said about the Data that she is consuming.

A Best Practice would be to rely on a more widely shared understanding of
this metadata. This is one of the contributions of the rdf model, which
defines the use of common vocabularies as a way to describe the properties
of resources. Besides that, it also introduces the idea of universal
identifiers as a way of linking Data from different Datasets.

There is a huge amount of Data to be loaded on the web, each with its own
semantics. People can publish these Data from their own point of view,
leaving the developers to understand each of these semantics and build the
mashups. That's ok. But if the Publishers used common vocabularies, it
would make it easier for the Developers to integrate Data.

Let's take an example. In the NYC Open Data Dataset "311 Service Requests
from 2010 to Present" there are two columns labeled "Latitude" and
"Longitude". The type of these two columns is Number. Well, we can guess
that they are the latitude and longitude of the address where a service
was requested.

There is a human interface where it is possible to browse the Dataset:

https://data.cityofnewyork.us/Social-Services/311-Service-Requests-from-2010-to-Present/stnw-hdrd

To get the information about a service request we can use the Endpoint to
export Data in json or rdf formats. The columns are identified by property
names derived from the column labels: "Latitude" is identified as
"latitude"; "Longitude" as "longitude".

Using the endpoint created for the Dataset we can obtain the json output of
the first row:

http://data.cityofnewyork.us/resource/stnw-hdrd.json?$limit=1

[ {
  "longitude" : "-73.76983198736392",
  "latitude" : "40.71159894212768"
} ]
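
A small sketch, in Python, of how a Consumer might build that endpoint call
and read the two values. It does not make the network request;
SAMPLE_RESPONSE is just the two fields shown above, and the helper names
are made up for illustration. Note that the json carries the numbers as
strings, so the Consumer has to guess that they should be converted.

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the message; everything else here is illustrative.
BASE = "http://data.cityofnewyork.us/resource/stnw-hdrd.json"

# The two fields quoted in the message, as the response body would carry them.
SAMPLE_RESPONSE = (
    '[ { "longitude" : "-73.76983198736392", '
    '"latitude" : "40.71159894212768" } ]'
)


def endpoint_url(limit):
    """Build the endpoint URL; safe="$" keeps the "$limit" name unescaped."""
    return BASE + "?" + urlencode({"$limit": limit}, safe="$")


def coordinates(response_text):
    """Return (latitude, longitude) float pairs for rows that have both fields."""
    rows = json.loads(response_text)
    return [
        (float(row["latitude"]), float(row["longitude"]))
        for row in rows
        if "latitude" in row and "longitude" in row
    ]
```

Nothing in the json itself says that "latitude" means a WGS84 latitude; the
conversion to float is an assumption made by the Consumer.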

Using the endpoint created for the Dataset we can obtain the rdf output of
the first row:

http://data.cityofnewyork.us/resource/stnw-hdrd.rdf?$limit=1

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:socrata="http://www.socrata.com/rdf/terms#"
  ...
  xmlns:dsbase="http://data.cityofnewyork.us/resource/"
  xmlns:ds="http://data.cityofnewyork.us/resource/stnw-hdrd/"
  xmlns:usps="http://www.w3.org/2000/10/swap/pim/usps#">
  <dsbase:stnw-hdrd rdf:about="http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
    <socrata:rowID>7055868</socrata:rowID>
    <rdfs:member rdf:resource="http://data.cityofnewyork.us/resource/stnw-hdrd"/>
    <ds:latitude>40.71159894212768</ds:latitude>
    <ds:longitude>-73.76983198736392</ds:longitude>
  </dsbase:stnw-hdrd>
</rdf:RDF>

Well, the rdf does not introduce any kind of semantics in this case. It is
only a different serialization of the Data returned in json. The property
http://data.cityofnewyork.us/resource/stnw-hdrd/latitude doesn't carry any
more semantics than the label "Latitude".
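
To make this concrete, here is a small Python sketch that expands the
property names of such a response into full URIs. SAMPLE_RDF is a trimmed
copy of the response above that keeps only the namespaces it needs, and
property_uris is a made-up helper; it assumes the simple layout shown here
(resources directly under rdf:RDF, properties directly under each
resource).

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the rdf response quoted in the message.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dsbase="http://data.cityofnewyork.us/resource/"
    xmlns:ds="http://data.cityofnewyork.us/resource/stnw-hdrd/">
  <dsbase:stnw-hdrd rdf:about="http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
    <ds:latitude>40.71159894212768</ds:latitude>
    <ds:longitude>-73.76983198736392</ds:longitude>
  </dsbase:stnw-hdrd>
</rdf:RDF>"""


def property_uris(rdf_xml):
    """List the full URI of every property element, in document order."""
    root = ET.fromstring(rdf_xml)
    uris = []
    for resource in root:
        for prop in resource:
            # ElementTree stores namespaced tags as "{namespace}localname";
            # the property URI is the namespace followed by the local name.
            namespace, local = prop.tag[1:].split("}", 1)
            uris.append(namespace + local)
    return uris
```

Both URIs it returns live under the Dataset's own address, so a Consumer
who has never seen this Dataset learns nothing new from them.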

But Socrata allows the owner of the Dataset to associate an rdf property
with a column. The user can associate any URL as metadata of the column
and, besides that, Socrata lists some properties that it understands from
some vocabularies: dcat; foaf; Dublin Core; geo.

I associated the column "Latitude" with the URL:
http://www.w3.org/2003/01/geo/wgs84_pos#lat

I associated the column "Longitude" with the URL:
http://www.w3.org/2003/01/geo/wgs84_pos#long

I made the endpoint call again:

http://data.cityofnewyork.us/resource/stnw-hdrd.rdf?$limit=1

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
  xmlns:socrata="http://www.socrata.com/rdf/terms#"
  ...
  xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
  ...
  xmlns:dsbase="http://data.cityofnewyork.us/resource/"
  xmlns:ds="http://data.cityofnewyork.us/resource/stnw-hdrd/"
  xmlns:usps="http://www.w3.org/2000/10/swap/pim/usps#">
  <dsbase:stnw-hdrd rdf:about="http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
    <socrata:rowID>7055868</socrata:rowID>
    <rdfs:member rdf:resource="http://data.cityofnewyork.us/resource/stnw-hdrd"/>
    <geo:lat>40.71159894212768</geo:lat>
    <geo:long>-73.76983198736392</geo:long>
  </dsbase:stnw-hdrd>
</rdf:RDF>

Well, now the rdf returns geo:lat and geo:long as the properties of the
two numbers, and these properties have well-known semantics.
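
As a sketch of what this buys the Developer, the Python below reads
coordinates from any rdf response that uses geo:lat and geo:long, without
knowing anything about the Dataset. NYC_SAMPLE is a trimmed copy of the
response above; OTHER_SAMPLE, its example.org addresses and its values are
invented for illustration, and geo_points is a made-up helper that assumes
the same simple layout.

```python
import xml.etree.ElementTree as ET

GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Trimmed copy of the second rdf response quoted in the message.
NYC_SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:dsbase="http://data.cityofnewyork.us/resource/">
  <dsbase:stnw-hdrd rdf:about="http://data.cityofnewyork.us/resource/stnw-hdrd/27702159">
    <geo:lat>40.71159894212768</geo:lat>
    <geo:long>-73.76983198736392</geo:long>
  </dsbase:stnw-hdrd>
</rdf:RDF>"""

# Hypothetical second Dataset from another Publisher (invented values).
OTHER_SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#"
    xmlns:ex="http://example.org/resource/">
  <ex:station rdf:about="http://example.org/resource/station/1">
    <geo:lat>-22.9</geo:lat>
    <geo:long>-43.2</geo:long>
  </ex:station>
</rdf:RDF>"""


def geo_points(rdf_xml):
    """Map each resource URI to (lat, long), for resources that have both."""
    root = ET.fromstring(rdf_xml)
    points = {}
    for resource in root:
        lat = resource.findtext(f"{{{GEO}}}lat")
        lon = resource.findtext(f"{{{GEO}}}long")
        if lat is not None and lon is not None:
            about = resource.get(f"{{{RDF}}}about")
            points[about] = (float(lat), float(lon))
    return points
```

The same function handles both samples, so merging them is just
{**geo_points(NYC_SAMPLE), **geo_points(OTHER_SAMPLE)}. With the ds:latitude
and ds:longitude properties of the first response, a separate mapping would
be needed for each Dataset.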

For me, this is a Best Practice.

What do you think about this?

I apologize, again, for the long message.

Kind Regards,

Laufer

Received on Wednesday, 26 March 2014 03:09:11 UTC