- From: Steven Adler <adler1@us.ibm.com>
- Date: Thu, 26 Mar 2015 14:19:20 -0400
- To: Laufer <laufer@globo.com>
- Cc: Bernadette Farias Lóscio <bfl@cin.ufpe.br>, Christophe Guéret <christophe.gueret@dans.knaw.nl>, Eric Stephan <ericphb@gmail.com>, Phil Archer <phila@w3.org>, DWBP WG <public-dwbp-wg@w3.org>
- Message-ID: <OF0F6A0915.1CF9278F-ON85257E14.00642D9B-85257E14.0064A601@us.ibm.com>
Laufer,
I agree we need to be careful and the discussion here is helping to clarify
the issues. I have been working with Data Quality for over 10 years and
have not seen any really good DQ rating systems in use beyond very small
scale enterprise deployments. I am not sure that ODI is a source we need
to rely on for guidance in this matter as their bench of DQ experts is
quite narrow.
I would recommend that we continue to discuss this together and seek out
simple methods that can be easily implemented. It is easier to start
simple with something no one has today and then add to it as we gain
insights into usage patterns from use cases that emerge over time.
Best Regards,
Steve
Motto: "Do First, Think, Do it Again"
From: Laufer <laufer@globo.com>
To: Steven Adler/Somers/IBM@IBMUS
Cc: Christophe Guéret <christophe.gueret@dans.knaw.nl>, Bernadette Farias Lóscio <bfl@cin.ufpe.br>, Eric Stephan <ericphb@gmail.com>, Phil Archer <phila@w3.org>, DWBP WG <public-dwbp-wg@w3.org>
Date: 03/26/2015 01:02 PM
Subject: The 5 stars path
Hi all,
I've started this thread because of the misunderstanding about the LOD 5
stars scale, and how people are using it as a way of classifying the
quality of data published on the web.
I think that different axes of quality, each one with its own 5 stars
scale, could confuse people even more when someone attaches a number of
stars to a dataset. Besides that, there will be certificates around these
issues, probably taking into account several axes of quality. ODI already
has a certification process.
So, I think we must be very careful with this subject and be very clear in
the text of our documents.
Warm regards,
Laufer
On Thursday, 26 March 2015, Steven Adler <adler1@us.ibm.com> wrote:
I like that approach, but that 5-star scheme is not a Data Quality rating
system, which I still think we need as part of the BP.
Best Regards,
Steve
Motto: "Do First, Think, Do it Again"
From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
To: Steven Adler/Somers/IBM@IBMUS
Cc: Phil Archer <phila@w3.org>, Laufer <laufer@globo.com>, Bernadette Farias Lóscio <bfl@cin.ufpe.br>, DWBP WG <public-dwbp-wg@w3.org>, Eric Stephan <ericphb@gmail.com>
Date: 03/25/2015 09:53 PM
Subject: Re: The 5 stars path
BTW, speaking about stars and feedback we may want to have a look at the
5 star scheme for community engagement from Tim Davies:
http://www.opendataimpacts.net/engagement/
We could probably do something with it, if only linking to it somewhere.
Cheers,
Christophe
--
Sent with difficulties. Sorry for the brevity and typos...
On 24 Mar 2015 07:18, "Steven Adler" <adler1@us.ibm.com> wrote:
Rating a dataset is only valuable if records within the dataset
have ratings whose sum or average validates the dataset rating.
That is, there has to be provenance to the ratings.
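One way to read this requirement, as a minimal sketch: the dataset rating is not stored on its own but derived from per-record ratings, so anyone can recompute and check it. The record identifiers, the 1-5 scale, the tolerance and the function names below are all hypothetical; nothing in the thread defines them.

```python
from statistics import mean

def dataset_rating(record_ratings):
    """Derive a dataset rating as the average of its per-record ratings."""
    values = list(record_ratings.values())
    if not values:
        # Without record-level ratings there is nothing to back the claim.
        raise ValueError("no record ratings: the dataset rating has no provenance")
    return mean(values)

def rating_is_supported(claimed, record_ratings, tolerance=0.25):
    """Check a claimed dataset rating against the record-level evidence."""
    return abs(dataset_rating(record_ratings) - claimed) <= tolerance

# Hypothetical per-record ratings on an assumed 1-5 scale.
ratings = {"rec-001": 4, "rec-002": 5, "rec-003": 3, "rec-004": 4}
```

With these sample values, a claimed dataset rating of 4 is supported by the records, while a claimed rating of 5 is not.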
Best Regards,
Steve
Motto: "Do First, Think, Do it Again"
From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
To: Eric Stephan <ericphb@gmail.com>
Cc: Phil Archer <phila@w3.org>, Laufer <laufer@globo.com>, Christophe Guéret <christophe.gueret@dans.knaw.nl>, DWBP WG <public-dwbp-wg@w3.org>
Date: 03/24/2015 10:11 AM
Subject: Re: The 5 stars path
Hi all,
Thanks for the great discussion!
I like the idea of having a star rating discussion, but we need to
be aware that publishing data on the Web is more than just
publishing data and metadata. It also concerns issues like data
access and feedback.
I've been thinking a lot about this rating system and it would be
great to consider all aspects related to data on the Web (e.g. data
format, metadata, identifiers, data access, feedback,
versioning...), but I'm not sure if this is the best choice. Maybe
we can have a rating system based just on data and metadata, which
is similar to Phil's initial proposal.
Cheers,
Bernadette
2015-03-22 18:38 GMT-03:00 Eric Stephan <ericphb@gmail.com>:
Wow what a wonderful thread to read. Thank you Phil! Many
many thanks for this wonderful note of clarity!
>>if Eric and Annette can provide similar examples for NetCDF
that would be terrific (I'm out of my depth here).
Yes, I think we can show this quite easily. Just off the top
of my head:
NetCDF:
- is an open format for storing multi-dimensional data
streams [NETCDF]
- can be annotated with self describing metadata (called
attributes)
- has existing conventions for representing different
forms of data. E.g. CF convention.
- has a CF vocabulary [CFNAMES] for curated climate and
forecasting terminology.
- In addition the climate community within the Earth
System Grid (ESG) has adopted fully documented protocols
[CMIP5] to show how regional and climate model datasets must
be organized so that they can be inter-related to support
regional and global climate studies.
- leverages existing ISO standards used in the geospatial,
Dublin Core, and metadata communities.
- Finally, an ontology called SWEET [SWEET] was developed by
NASA JPL, and there is previous research showing how the CF
terms can be inter-related.
I would submit that even without the ontology in terms of
open data, the climate community is already at 5 star.
Eric
References
[NETCDF] http://en.wikipedia.org/wiki/NetCDF
[CFNAMES]
http://cfconventions.org/Data/cf-standard-names/28/build/cf-standard-name-table.html
[CMIP5] http://cmip-pcmdi.llnl.gov/cmip5/
[SWEET] https://sweet.jpl.nasa.gov/
On Sun, Mar 22, 2015 at 10:45 AM, Phil Archer <phila@w3.org>
wrote:
We are in full agreement.
One of my hopes for this WG is that we can indeed lead
people to publish formats like CSV in the best way
(i.e. with good quality metadata) without them feeling
somehow inferior.
If that leads us to define our own star rating system,
I wouldn't mind. Something like:
* It's available on the Web in an open format with a
declared licence (anything less is all but useless).
** As level 1 with good quality discovery metadata (we
might refer to the DCAT Application profile work as an
example).
*** All the above plus structural metadata in the
relevant format (e.g. CSV+ for CSV, VoID for RDF etc).
This doesn't include quality metrics (which it should),
and contact details (which it should) - but they might
be defined at level 2?
Maybe a start anyway.
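The three levels above are cumulative, which can be sketched as a simple check. This is only an illustration: the class and field names are hypothetical, and reducing "good quality" metadata to a yes/no flag is a simplifying assumption that the proposal itself does not make.

```python
from dataclasses import dataclass

@dataclass
class Publication:
    on_web: bool
    open_format: bool
    declared_licence: bool
    discovery_metadata: bool   # e.g. a DCAT description
    structural_metadata: bool  # e.g. CSV metadata for CSV, VoID for RDF

def star_level(p: Publication) -> int:
    """Return 0-3; each level requires everything from the levels below it."""
    if not (p.on_web and p.open_format and p.declared_licence):
        return 0  # "anything less is all but useless"
    if not p.discovery_metadata:
        return 1
    if not p.structural_metadata:
        return 2
    return 3

# A CSV file on the Web with a licence and discovery metadata,
# but no structural metadata yet.
csv_only = Publication(True, True, True, True, False)
```

Here `star_level(csv_only)` evaluates to 2. Because the levels are cumulative, structural metadata without a declared licence still rates 0.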
Phil.
On 22/03/2015 13:50, Laufer wrote:
I agree, Phil.
What I want to reinforce is that it would be nice if we could make clear
in the document that 5 stars LD (or OD?) is not a scale of how well a
dataset is published on the Web. We can have, for example, a "CSV
dataset" (3 stars) that is better published than a "LD dataset" (5
stars). Or, maybe, we can avoid using the 5 stars when what we want to
say is that a dataset is being published in a CSV format.
If we say that one dataset is 3 stars and another is 5 stars, people get
the idea that the 5 one is better than the 3 one (as in reviews or
hotels, for example).
We probably will not define our own scale, but I hope that our set of BPs
could help people to publish "Well Published Data on The Web".
Best Regards,
Laufer
On Sunday, 22 March 2015, Christophe Guéret
<christophe.gueret@dans.knaw.nl> wrote:
+1!
Christophe
--
Sent with difficulties. Sorry for the brevity and typos...
On 22 Mar 2015 08:47, "Phil Archer" <phila@w3.org> wrote:
I've just been reading through Friday's minutes and I see that this was
the hot topic of the day. As ever, I'm sorry I wasn't able to be there.
Let me add my 2 cents.
LD forms a small part of the available data on the Web. It would be
silly of us to push for everyone to convert their data into perfectly
linked 5 star data before they make it available publicly or behind a
pay-wall of some kind.
What we *can* do IMO is:
- Promote the publication of human readable metadata as Laufer has
described;
- promote the publication of machine readable metadata and then show how
this can be (and is) done with RDF using DCAT as an example;
- promote the publication of structural metadata which, for CSV at
least, we have a very clear route - use the CSV on the Web work;
- if Eric and Annette can provide similar examples for NetCDF that would
be terrific (I'm out of my depth here).
- We can leave it to the Spatial Data on the Web WG to handle spatial
stuff (as they are leaving some of their generic issues to this group).
As an aside, the CSV WG has resolved its issues now and is expecting to
publish pretty much the stable version of its specs in the first week of
April.
If you publish data in your favourite format + structural metadata in
whatever format goes with that (and the CSV WG is using JSON for its
metadata) then you are providing a route through which your users can
readily create 5 star data if they so wish. They may or may not use LD
themselves but the concept behind it is, I hope, clear enough to readers?
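As a rough illustration of "CSV + structural metadata in JSON", the sketch below builds a small metadata document for one CSV file. The property names (`@context`, `url`, `tableSchema`, `columns`, `name`, `titles`, `datatype`) follow the CSV on the Web metadata vocabulary as it was eventually published, so the draft current at the time of this email may have differed. The file URL, the column names and the `csv_metadata` helper are hypothetical.

```python
import json

def csv_metadata(csv_url, columns):
    """Build a minimal structural-metadata document for one CSV file."""
    return {
        "@context": "http://www.w3.org/ns/csvw",
        "url": csv_url,
        "tableSchema": {
            "columns": [
                {"name": name, "titles": title, "datatype": datatype}
                for name, title, datatype in columns
            ]
        },
    }

metadata = csv_metadata(
    "http://example.org/data/stations.csv",  # hypothetical CSV file
    [
        ("station_id", "Station ID", "string"),
        ("reading", "Reading", "decimal"),
    ],
)

# Serialise the description so it can be published next to the CSV file.
print(json.dumps(metadata, indent=2))
```

The CSV file itself is left untouched; the JSON document sits alongside it and tells a consumer what each column means, which is the route to richer data that the paragraph describes.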
From what I've read of Friday and the list since then, I dare to hope
this is in line with the general mood of the WG?
Phil.
On 20/03/2015 18:09, Laufer wrote:
Thank you, Eric.
Warm regards,
Laufer
2015-03-20 12:31 GMT-03:00 Eric Stephan <ericphb@gmail.com>:
Laufer and Bernadette,
I raised an issue relating to this, asking the question: can we use 5
star as a metric and not a path?
http://www.w3.org/2013/dwbp/track/issues/148
Eric S.
On Fri, Mar 20, 2015 at 7:54 AM, Bernadette Farias Lóscio
<bfl@cin.ufpe.br> wrote:
Hi Laufer,
Thanks for the message! It is a very useful explanation!
I fully agree with you: "In this dataset publishing I can see the idea of
publishing metadata and using standard vocabularies, but is not a LD
dataset."
IMHO, we can use vocabularies to publish metadata, but we are not doing
linked data, i.e., there are no links between resources.
I also agree that "we should differentiate the idea of a Best Practice of
a non LD dataset of the idea of an implicit Best Practice to go to a LD
dataset, that is what the 5 stars scale says.".
If we have a BP whose implementation proposes the use of the RDF model to
publish data, then we are moving towards the 5 stars. It is important to
note that publishing data using the RDF model may be just one of the
proposed approaches for implementation, i.e., we may show other ways of
publishing data without using RDF.
Cheers,
Bernadette
2015-03-20 11:32 GMT-03:00 Laufer <laufer@globo.com>:
Hi all,
I will start my comment using an example:
Someone publishes a page where there are links to 2 files:
- a CSV file with a dataset;
- a text file that explains the structure of the dataset, in natural
language (metadata).
On the page there is a lot of metadata provided in natural language, for
example an overview of the dataset, license, organization, version,
creator, rights, etc.
At the same time, the page has an embedded DCAT instance using RDFa,
with information about the dataset, the distribution, etc.
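To make the example concrete, here is a minimal sketch of the kind of DCAT description such a page could carry. The email describes RDFa embedded in the page; the sketch uses a JSON-LD-shaped Python dictionary instead, only because that is easier to show and check in code. The class and property names are DCAT and Dublin Core terms; every URL, title and the `download_urls` helper are hypothetical, and only the CSV distribution is modelled.

```python
import json

# Hypothetical URLs and titles; only the DCAT and Dublin Core terms are real.
dataset = {
    "@context": {
        "dcat": "http://www.w3.org/ns/dcat#",
        "dct": "http://purl.org/dc/terms/",
    },
    "@id": "http://example.org/dataset/bus-stops",
    "@type": "dcat:Dataset",
    "dct:title": "Bus stops",
    "dct:description": "Overview of the dataset, as also given in the page text.",
    "dcat:landingPage": {"@id": "http://example.org/bus-stops.html"},
    "dcat:distribution": [
        {
            "@type": "dcat:Distribution",
            "dcat:downloadURL": {"@id": "http://example.org/bus-stops.csv"},
            "dcat:mediaType": "text/csv",
            "dct:license": {"@id": "http://example.org/licence"},
        }
    ],
}

def download_urls(description):
    """List the download URL of every distribution in the description."""
    return [d["dcat:downloadURL"]["@id"] for d in description["dcat:distribution"]]

print(json.dumps(dataset, indent=2))
```

The description uses standard vocabularies but contains no links to resources in other datasets, so it is metadata published with semantic web tooling rather than a Linked Data dataset.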
What I want to say is that we have here the metadata concept mixed with
semantic web concepts, and it is a way of publishing data that, if all
the things are well described, could be very useful to society.
In this dataset publishing I can see the idea of publishing metadata and
using standard vocabularies, but is not a LD dataset.
What I was discussing in the last meeting is: will we support in the
document the idea that the best way to publish is LD? I am not saying
whether I am for or against the idea. I am favorable to LD. But we should
differentiate the idea of a Best Practice of a non LD dataset of the idea
of an implicit Best Practice to go to a LD dataset, that is what the 5
stars scale says.
Maybe this is too much care with the words, sorry about this.
Best Regards,
Laufer
--
. . . .. . .
. . . ..
. .. .
--
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de
Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------
--
Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/
http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Thursday, 26 March 2015 18:20:11 UTC