Re: The 5 stars path from Steven Adler on 2015-03-26 (public-dwbp-wg@w3.org from March 2015)

From: Steven Adler <adler1@us.ibm.com>
Date: Thu, 26 Mar 2015 14:19:20 -0400
To: Laufer <laufer@globo.com>
Cc: Bernadette Farias Lóscio <bfl@cin.ufpe.br>, Christophe Guéret <christophe.gueret@dans.knaw.nl>, Eric Stephan <ericphb@gmail.com>, Phil Archer <phila@w3.org>, DWBP WG <public-dwbp-wg@w3.org>
Message-ID: <OF0F6A0915.1CF9278F-ON85257E14.00642D9B-85257E14.0064A601@us.ibm.com>
Laufer,

I agree we need to be careful and the discussion here is helping to clarify
the issues.  I have been working with Data Quality for over 10 years and
have not seen any really good DQ rating systems in use beyond very small
scale enterprise deployments.  I am not sure that ODI is a source we need
to rely on for guidance in this matter as their bench of DQ experts is
quite narrow.

I would recommend that we continue to discuss this together and seek out
simple methods that can be easily implemented.  It is easier to start
simple with something no one has today and then add to it as we gain
insights into usage patterns from use cases that emerge over time.


Best Regards,

Steve

Motto: "Do First, Think, Do it Again"


From: Laufer <laufer@globo.com>
To: Steven Adler/Somers/IBM@IBMUS
Cc: Christophe Guéret <christophe.gueret@dans.knaw.nl>, Bernadette Farias Lóscio <bfl@cin.ufpe.br>, Eric Stephan <ericphb@gmail.com>, Phil Archer <phila@w3.org>, DWBP WG <public-dwbp-wg@w3.org>
Date: 03/26/2015 01:02 PM
Subject: The 5 stars path

Hi all,

I've started this thread because of the misunderstanding about the LOD 5
stars scale, and how people are using it as a way of classifying the
quality of data published on the web.

I think that different axes of quality, each one with its own 5 stars
scale, could confuse people even more when someone attaches a number of
stars to a dataset. Besides that, there will be certificates around these
issues, probably taking into account several axes of quality. ODI already
has a certification process.

So, I think we must be very careful with this subject and be very clear in
our texts in the documents.

Regards,
Laufer

On Thursday, 26 March 2015, Steven Adler <adler1@us.ibm.com>
wrote:
  I like that approach, but that 5-star scheme is not a Data Quality
  rating system, which I still think we need as part of BP.


  Best Regards,

  Steve

  Motto: "Do First, Think, Do it Again"

  From: Christophe Guéret <christophe.gueret@dans.knaw.nl>
  To: Steven Adler/Somers/IBM@IBMUS
  Cc: Phil Archer <phila@w3.org>, Laufer <laufer@globo.com>,
      Bernadette Farias Lóscio <bfl@cin.ufpe.br>, DWBP WG
      <public-dwbp-wg@w3.org>, Eric Stephan <ericphb@gmail.com>
  Date: 03/25/2015 09:53 PM
  Subject: Re: The 5 stars path

  BTW, speaking about stars and feedback we may want to have a look at the
  5 star scheme for community engagement from Tim Davies:
  http://www.opendataimpacts.net/engagement/


  We could probably do something with it, if only linking to it somewhere.


  Cheers,
  Christophe


  --
  Sent with difficulties. Sorry for the brievety and typos...


  On 24 Mar 2015 07:18, "Steven Adler" <adler1@us.ibm.com> wrote:


        Rating a dataset is only valuable if records within the dataset
        have ratings whose sum or average validates the dataset rating.
        That is, there has to be provenance to the ratings.
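
One way to read the provenance requirement above is that a dataset-level
rating should be reproducible from record-level ratings. A minimal Python
sketch, assuming per-record ratings on a 0-5 scale and a tolerance of half
a star; the function name, the scale, and the tolerance are illustrative
assumptions, not something proposed in this thread:

```python
from statistics import mean


def dataset_rating_is_supported(record_ratings, claimed_rating, tolerance=0.5):
    """Return True if the average record rating backs the claimed dataset rating.

    record_ratings: iterable of per-record ratings (assumed 0-5 scale).
    claimed_rating: the rating attached to the dataset as a whole.
    tolerance: hypothetical allowed gap between the two, in stars.
    """
    ratings = list(record_ratings)
    if not ratings:
        # No record-level ratings means the dataset rating has no provenance.
        return False
    return abs(mean(ratings) - claimed_rating) <= tolerance
```

A dataset rated 5 whose records average 1.3 would fail this check, which
is the kind of unsupported rating the message warns about.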


        Best Regards,

        Steve

        Motto: "Do First, Think, Do it Again"

        From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
        To: Eric Stephan <ericphb@gmail.com>
        Cc: Phil Archer <phila@w3.org>, Laufer <laufer@globo.com>,
            Christophe Guéret <christophe.gueret@dans.knaw.nl>, DWBP WG
            <public-dwbp-wg@w3.org>
        Date: 03/24/2015 10:11 AM
        Subject: Re: The 5 stars path

        Hi all,

        Thanks for the great discussion!

        I like the idea of having a star rating discussion, but we need to
        be aware that publishing data on the Web is more than just
        publishing data and metadata. It also concerns issues like data
        access and feedback.

        I've been thinking a lot about this rating system and it would be
        great to consider all aspects related to data on the Web (e.g. data
        format, metadata, identifiers, data access, feedback,
        versioning...), but I'm not sure if this is the best choice. Maybe
        we can have a rating system based just on data and metadata, which
        is similar to the initial proposal of Phil.

        Cheers,
        Bernadette

        2015-03-22 18:38 GMT-03:00 Eric Stephan <ericphb@gmail.com>:
              Wow what a wonderful thread to read.  Thank you Phil!  Many
              many thanks for this wonderful note of clarity!

              >>if Eric and Annette can provide similar examples for NetCDF
              that would be terrific (I'm out of my depth here).

              Yes I think we can show this quite easily.  Just off the top
              of my head.

              NetCDF:
                 - is an open format for storing multi-dimensional data
                   streams [NETCDF]
                 - can be annotated with self describing metadata (called
                   attributes)
                 - has existing conventions for representing different
                   forms of data, e.g. the CF convention.
                 - has a CF vocabulary [CFNAMES] for curated climate and
                   forecasting terminology.
                 - In addition the climate community within the Earth
                   System Grid (ESG) has adopted fully documented
                   protocols [CMIP5] to show how regional and climate
                   model datasets must be organized so that they can be
                   inter-related to support regional and global climate
                   studies.
                 - Leverages existing ISO standards used in the
                   geospatial, Dublin Core, and metadata communities.
                 - Finally an ontology was developed by NASA JPL called
                   SWEET [SWEET]; there is previous research showing how
                   the CF terms can be inter-related.

              I would submit that even without the ontology in terms of
              open data, the climate community is already at 5 star.



              Eric


              References

              [NETCDF] http://en.wikipedia.org/wiki/NetCDF
              [CFNAMES]
              http://cfconventions.org/Data/cf-standard-names/28/build/cf-standard-name-table.html

              [CMIP5] http://cmip-pcmdi.llnl.gov/cmip5/
              [SWEET] https://sweet.jpl.nasa.gov/


              On Sun, Mar 22, 2015 at 10:45 AM, Phil Archer <phila@w3.org>
              wrote:
                    We are in full agreement.

                    One of my hopes for this WG is that we can indeed lead
                    people to publish formats like CSV in the best way
                    (i.e. with good quality metadata) without them feeling
                    somehow inferior.

                    If that leads us to define our own star rating system,
                    I wouldn't mind. Something like:

                    * It's available on the Web in an open format with a
                    declared licence (anything less is all but useless).

                    ** As level 1 with good quality discovery metadata (we
                    might refer to the DCAT Application profile work as an
                    example).

                    *** All the above plus structural metadata in the
                    relevant format (e.g. CSV+ for CSV, VoID for RDF etc).

                    This doesn't include quality metrics (which it should),
                    and contact details (which it should) - but they might
                    be defined at level 2?

                    Maybe a start anyway.
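
The three levels listed above are cumulative, so each star requires all
of the earlier ones. A minimal Python sketch of that ordering; the
function and parameter names are hypothetical, and how each condition
would actually be checked is left open, as it is in the proposal:

```python
def star_level(open_format_with_licence, discovery_metadata, structural_metadata):
    """Return 0-3 stars under the cumulative scheme sketched in the message.

    Each argument is a boolean the caller has already established:
    open_format_with_licence: on the Web, open format, declared licence.
    discovery_metadata: good quality discovery metadata is published.
    structural_metadata: structural metadata in the relevant format.
    """
    if not open_format_with_licence:
        return 0  # "anything less is all but useless"
    if not discovery_metadata:
        return 1
    if not structural_metadata:
        return 2
    return 3
```

Because the levels are cumulative, a dataset with structural metadata but
no declared licence still scores 0 in this sketch.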

                    Phil.

                    On 22/03/2015 13:50, Laufer wrote:
                          I agree, Phil.

                          What I want to reinforce is that it would be nice
                          if we could make clear in the document that 5
                          stars LD (or OD?) is not a scale of how well a
                          dataset is published on the web. We can have, for
                          example, a "CSV dataset" (3 stars) that is better
                          published than an "LD dataset" (5 stars). Or,
                          maybe, we can avoid using the 5 stars when what we
                          want to say is that a dataset is being published
                          in a CSV format.

                          If we say that one dataset is 3 stars and another
                          is 5 stars, people get the idea that the 5-star
                          one is better than the 3-star one (as in reviews
                          or hotels, for example).

                          We probably will not define our own scale, but I
                          hope that our set of BPs could help people to
                          publish "Well Published Data on The Web".

                          Best Regards,
                          Laufer

                          On Sunday, 22 March 2015, Christophe Guéret
                          <christophe.gueret@dans.knaw.nl> wrote:
                                +1!

                                Christophe

                                --
                                Sent with difficulties. Sorry for the
                                brievety and typos...
                                On 22 Mar 2015 08:47, "Phil Archer"
                                <phila@w3.org> wrote:
                                      I've just been reading through Friday's
                                      minutes and I see that this was the hot
                                      topic of the day. As ever, I'm sorry I
                                      wasn't able to be there.

                                      Let me add my 2 cents.

                                      LD forms a small part of the available
                                      data on the Web. It would be silly of us
                                      to push for everyone to convert their
                                      data into perfectly linked 5 star data
                                      before they make it available publicly
                                      or behind a pay-wall of some kind.

                                      What we *can* do IMO is:

                                      - Promote the publication of human
                                        readable metadata as Laufer has
                                        described;

                                      - promote the publication of machine
                                        readable metadata and then show how
                                        this can be (and is) done with RDF
                                        using DCAT as an example;

                                      - promote the publication of structural
                                        metadata for which, for CSV at least,
                                        we have a very clear route - use the
                                        CSV on the Web work;

                                      - if Eric and Annette can provide
                                        similar examples for NetCDF that would
                                        be terrific (I'm out of my depth
                                        here).

                                      - We can leave it to the Spatial Data on
                                        the Web WG to handle spatial stuff (as
                                        they are leaving some of their generic
                                        issues to this group).

                                      As an aside, the CSV WG has resolved its
                                      issues now and is expecting to publish
                                      pretty much the stable version of its
                                      specs in the first week of April.

                                      If you publish data in your favourite
                                      format + structural metadata in whatever
                                      format goes with that (and the CSV WG is
                                      using JSON for its metadata) then you
                                      are providing a route through which your
                                      users can readily create 5 star data if
                                      they so wish. They may or may not use LD
                                      themselves but the concept behind it is,
                                      I hope, clear enough to readers?

                                      From what I've read of Friday and the
                                      list since then, I dare to hope this is
                                      in line with the general mood of the WG?

                                      Phil.



                                      On 20/03/2015 18:09, Laufer wrote:
                                            Thank you, Eric.

                                            Regards,
                                            Laufer

                                            2015-03-20 12:31 GMT-03:00 Eric
                                            Stephan <ericphb@gmail.com>:
                                                  Laufer and Bernadette,

                                                  I raised an issue relating to this,
                                                  asking the question: can we use 5
                                                  star as a metric and not a path?
                                                  http://www.w3.org/2013/dwbp/track/issues/148


                                                  Eric S.

                                            On Fri, Mar 20, 2015 at 7:54 AM,
                                            Bernadette Farias Lóscio
                                            <bfl@cin.ufpe.br> wrote:
                                                  Hi Laufer,

                                                  Thanks for the message! It is a
                                                  very useful explanation!

                                                  I fully agree with you: "In this
                                                  dataset publishing I can see the
                                                  idea of publishing metadata and
                                                  using standard vocabularies, but
                                                  is not a LD dataset."

                                                  IMHO, we can use vocabularies to
                                                  publish metadata, but we are not
                                                  doing linked data, i.e., there are
                                                  no links between resources.

                                                  I also agree that "we should
                                                  differentiate the idea of a Best
                                                  Practice of a non LD dataset of
                                                  the idea of an implicit Best
                                                  Practice to go to a LD dataset,
                                                  that is what the 5 stars scale
                                                  says.".

                                                  If we have a BP whose
                                                  implementation proposes the use of
                                                  the RDF model to publish data,
                                                  then we are moving towards the 5
                                                  stars. It is important to note
                                                  that publishing data using the RDF
                                                  model may be just one of the
                                                  proposed approaches for
                                                  implementation, i.e., we may show
                                                  other ways of publishing data
                                                  without using RDF.

                                                  Cheers,
                                                  Bernadette




                                            2015-03-20 11:32 GMT-03:00
                                            Laufer <laufer@globo.com>:

                                            Hi all,

                                            I will start my comment using an example:

                                            Someone publishes a page where there are
                                            links to 2 files:
                                            - a csv file with a dataset;
                                            - a text file that explains the
                                              structure of the dataset, in natural
                                              language (metadata).

                                            In the page there is a lot of metadata
                                            provided in natural language, as for
                                            example, an overview of the dataset,
                                            license, organization, version, creator,
                                            rights, etc...

                                            At the same time, the page has an
                                            embedded DCAT instance using RDFa where
                                            there is info about the dataset, the
                                            distribution, etc.

                                            What I want to say is that we have here
                                            the metadata concept mixed with semantic
                                            web concepts, and it is a way of
                                            publishing data that, if all the things
                                            are well described, could be very useful
                                            to the society.

                                            In this dataset publishing I can see the
                                            idea of publishing metadata and using
                                            standard vocabularies, but is not a LD
                                            dataset.

                                            What I was discussing in the last
                                            meeting is: will we support in the
                                            document the idea that the best way to
                                            publish is LD? I am not saying whether I
                                            am against the idea or not. I am
                                            favorable to LD. But we should
                                            differentiate the idea of a Best
                                            Practice of a non LD dataset of the idea
                                            of an implicit Best Practice to go to a
                                            LD dataset, that is what the 5 stars
                                            scale says.

                                            Maybe this is too much care with the
                                            words, sorry about this.

                                            Best Regards,
                                            Laufer

                                            --
                                            .  .  .  .. .  .
                                            .        .   . ..
                                            .     ..       .



                                            --
                                            Bernadette Farias Lóscio
                                            Centro de Informática
                                            Universidade Federal de
                                            Pernambuco - UFPE, Brazil
                                      ----------------------------------------------------------------------------



                                      --


                                      Phil Archer
                                      W3C Data Activity Lead
                                      http://www.w3.org/2013/data/

                                      http://philarcher.org
                                      +44 (0)7887 767755
                                      @philarcher1

                    --


                    Phil Archer
                    W3C Data Activity Lead
                    http://www.w3.org/2013/data/

                    http://philarcher.org
                    +44 (0)7887 767755
                    @philarcher1



        --
        Bernadette Farias Lóscio
        Centro de Informática
        Universidade Federal de Pernambuco - UFPE, Brazil
        ----------------------------------------------------------------------------







--
.  .  .  .. .  .
.        .   . ..
.     ..       .
Received on Thursday, 26 March 2015 18:20:11 UTC