Re: ISSUE-80: We need a definition of "dataset"

From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Date: Tue, 11 Nov 2014 11:33:33 -0200
Message-ID: <CANx1Pzx-ruAUpvuB65=bDZjzHE363132hV11RPH6=ReaOPHFXg@mail.gmail.com>
To: Steven Adler <adler1@us.ibm.com>
Cc: Laufer <laufer@globo.com>, Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
+1 to Makx!

But I think we should have a better understanding of how to specify a
dataset using the DCAT definition. We should also have examples that
illustrate dataset definitions and their corresponding distributions.

Phil sent an example of a dataset definition and I sent another version of
the same example. If possible, it would be nice to discuss these ideas as well.
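
As one possible illustration of the kind of example meant here, the
following sketch (Python, standard library only) prints a Turtle description
of a single dcat:Dataset linked to its dcat:Distribution resources. The
function name to_turtle, the identifiers, and the media types are
hypothetical choices made for illustration; the shape is only one reading of
DCAT, not an agreed WG pattern.

```python
def to_turtle(dataset_id, distributions):
    """Return Turtle text for one dcat:Dataset and its distributions.

    `distributions` maps a (hypothetical) distribution id to a media type.
    """
    if not distributions:
        raise ValueError("at least one distribution is required")

    lines = [
        "@prefix dcat: <http://www.w3.org/ns/dcat#> .",
        "",
        f"<#{dataset_id}> a dcat:Dataset ;",
    ]
    # One dcat:distribution statement per distribution, ';'-separated,
    # with the final statement closed by '.'.
    links = [f"  dcat:distribution <#{dist_id}>" for dist_id in distributions]
    lines.append(" ;\n".join(links) + " .")

    # Describe each distribution separately, with its own media type.
    for dist_id, media_type in distributions.items():
        lines.append("")
        lines.append(f"<#{dist_id}> a dcat:Distribution ;")
        lines.append(f'  dcat:mediaType "{media_type}" .')
    return "\n".join(lines)


example = to_turtle(
    "sensor-readings",
    {"sensor-readings.csv": "text/csv", "sensor-readings.ttl": "text/turtle"},
)
print(example)
```

Here the dataset stays abstract and each format gets its own distribution
description, which is where format-specific metadata would be attached.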

Thanks!
Bernadette

2014-11-11 11:26 GMT-02:00 Steven Adler <adler1@us.ibm.com>:

> +1.  Well said.
>
>
> Best Regards,
>
> Steve
>
> Motto: "Do First, Think, Do it Again"
>
> From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
> To: Laufer <laufer@globo.com>
> Cc: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
> Date: 11/10/2014 04:21 PM
> Subject: Re: ISSUE-80: We need a definition of "dataset"
> ------------------------------
>
>
>
> Hi all,
>
> I like the idea of using the dataset definition from DCAT and I fully
> agree with Makx that "we should not try to redefine what's already
> well-defined".
>
> I like the examples of Phil, but I think that we still need to clarify the
> meaning of a dataset. To do this, I'd like to make a comparison between
> DCAT concepts and database concepts.
>
> In the database world, a database is defined as "a collection of related
> data. By data, we mean known facts that can be recorded and that have
> implicit meaning". Moreover, "a database is a logically coherent
> collection of data with some inherent meaning. A random assortment of data
> cannot correctly be referred to as a database." [1]
>
> The relational model represents the database as a collection of relations
> (or tables). Informally, each relation resembles a table of values or, to
> some extent, a flat file of records.  For example, to construct the
> database that describes a university, we store data to represent each
> student, course, section, grade report, and prerequisite as a record in the
> appropriate table.
>
> In DCAT, a dataset is a collection of data, published or curated by a
> single agent, and available for access or download in one or more formats.
> A data catalog is a curated collection of metadata about datasets. In DCAT
> there is no notion of related data or of a collection of data with some
> inherent meaning.
>
> In my opinion, if we compare DCAT with databases, a *dataset* is similar
> to a database, and the data organization of a given dataset will depend
> on the data model used to represent the data.
>
> Considering that a given dataset may have multiple distributions, the
> organization of files will depend on the data model of each distribution
> (e.g., csv, xml, rdf, json). For example, a csv distribution may have
> multiple tables and an xml distribution may have one or more xml documents.
> When a given distribution has more than one file (e.g., multiple csv files),
> I agree with Phil that we can use dcterms:isPartOf to associate multiple
> files with a given distribution.
>
> In other words, I think that DCAT's definition is more abstract and is
> not concerned with the organization of the data in different files. This
> should be defined by the data model used in each distribution.
>
> Concerning the metadata, I think that there will be metadata related to
> the dataset and metadata related to each available distribution, as
> proposed by DCAT.
>
> Given this, in Phil's example, instead of having several datasets, there
> will be one dataset and six distributions. Each distribution will be
> composed of several files. The composition of a distribution will depend
> on the data model (e.g., csv tables, xml documents). Below, I present a
> new proposal for the dataset definition and its corresponding csv
> distribution (the other distributions are similar).
>
> <#sensor-readings> a dcat:Dataset;
>   dcat:distribution <#sensor-readings.csv> ;
>   dcat:distribution <#sensor-readings.pdf> ;
>   dcat:distribution <#sensor-readings.html> ;
>   dcat:distribution <#sensor-readings.xml> ;
>   dcat:distribution <#sensor-readings.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>
> <#sensor-readings.csv> a dcat:Distribution ;
>   dcterms:isPartOf <#sensor-readings> ;
>   dcterms:hasPart <#readings-2014-11-08T00:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-08T06:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-08T12:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-08T18:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-08.zip> ;
>   dcterms:hasPart <#readings-2014-11-08.csv> ;
>   dcterms:hasPart <#readings-2014-11-09T00:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-09T06:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-09T12:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-09T18:00.csv> ;
>   dcterms:hasPart <#readings-2014-11-09.zip> ;
>   dcterms:hasPart <#readings-2014-11-09.csv> .
>
> <#readings-2014-11-08T00:00.csv> a csv:table ;
>   dcat:format "text/csv" .
>
> <#readings-2014-11-08T06:00.csv> a csv:table ;
>   dcat:format "text/csv" .
>
> <#readings-2014-11-08T18:00.csv> a csv:table ;
>   dcat:format "text/csv" .
>
> [...Table descriptions]
>
> [...Distribution descriptions]
>
>
> In this case, the publication of new data, i.e., a new sensor reading,
> implies the insertion of a new file in each one of the distributions. It is
> important to note that data is distributed at two different levels of
> granularity (per day and every six hours).
>
> I'm not sure of how to define the type of a specific file. For example, I
> used "csv:table" to define the type of <#readings-2014-11-08T00:00.csv>.
> However, I think that the csv model doesn't define this.
>
> @Phil, please let me know if this definition makes sense to you and if it
> is DCAT conformant.
>
> I think we can use the dataset definition from DCAT; however, it is
> important to have a better understanding of how to specify datasets and
> their corresponding distributions.
>
> I'm sorry for the long message :)
>
> kind regards,
> Bernadette
>
> [1] Ramez Elmasri, Shamkant B. Navathe, "Fundamentals of Database Systems",
> 6th Edition. ISBN-13: 978-0136086208
>
> 2014-11-09 11:40 GMT-03:00 Laufer <laufer@globo.com>:
>
>    Ok, Phil.
>
>    Let's continue with the example.
>
>    Suppose that there is a metadata type, for example, license (a
>    metadata type that exists in DCAT spec), that applies to these datasets and
>    is common to all files.
>
>    DCAT defines a license property for the catalog and "Even if the
>    license of the catalog applies to all of its datasets and distributions, it
>    should be replicated on each distribution."
>
>    In the CSV WG they discussed levels of metadata definitions, with ways
>    of linking the metadata file to the CSV file. They defined a priority chain
>    that states that the inner definition has a higher priority level, so, if,
>    for example, there is a definition of the same type of metadata related to
>    a package and to a file, the metadata related to the file will be the valid
>    one.
>
>    Will we assume that metadata related to a dataset that has parts
>    applies to all of its parts, or, as with dcat:license for the catalog,
>    will the same metadata have to be linked to all distributions of all
>    datasets that are parts of the dataset that groups them all?
>
>    Reading the DCAT spec, I feel that the way of grouping datasets is the
>    catalog. A catalog has parts that are the datasets. But the type catalog
>    is different from the type dataset. In your example, a dataset that
>    groups other datasets could be seen as a type of catalog, a hierarchy in
>    the definition of the catalog. And, at the same time, be a dataset.
>
>    Maybe I am not seeing things correctly, but I think that here we are
>    defining a type of dataset grouping that is not addressed in the DCAT
>    spec. The use of dcterms:hasPart and dcterms:isPartOf is interesting.
>    Will the DWBP WG recommend that?
>
>    In the CSV WG they have the idea of metadata inheritance. The semantics
>    of dcterms:hasPart and dcterms:isPartOf say nothing about inheritance. I
>    am not saying that we will have inheritance (or not), but it is common
>    when we have collections, packages, etc. The DWBP WG will have to make
>    this issue explicit to the users. Will our extension of DCAT address
>    this issue?
>
>    I am not sure that replicating the information of all types of
>    metadata that are common to a group of datasets is the best solution. Or a
>    thing that users usually do. I guess that this issue probably was
>    exhaustively discussed when defining DCAT. Sorry about the repetition.
>
>    Best Regards,
>    Laufer
>
>    2014-11-09 9:52 GMT-02:00 Phil Archer <phila@w3.org>:
>
>
>       On 08/11/2014 17:06, Laufer wrote:
>          I am not against the definition of DCAT. What I am saying is
>          that the DCAT dataset does not address multiple datasets with
>          different distributions that could form a bundle.
>
>       OK I was being a little lazy. The following RDF expands on my
>       original example and is DCAT conformant. I'm thinking of some sort of
>       sensor readings taken every 6 hours and made available in different
>       formats. Once a day all formats are bundled up and available as a single
>       day's readings in all original formats as well as a zip file with
>       everything.
>
>       <#readings-2014-11-08T00:00> a dcat:Dataset;
>         dcterms:isPartOf <#readings-2014-11-08> ;
>         dcat:distribution <#readings-2014-11-08T00:00.csv> ;
>         dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
>         dcat:distribution <#readings-2014-11-08T00:00.html> ;
>         dcat:distribution <#readings-2014-11-08T00:00.xml> ;
>         dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
>         dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .
>
>       <#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
>         dcat:format "text/csv";
>         dcterms:isPartOf <#readings-2014-11-08.csv> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
>         dcat:format "application/pdf" ;
>         dcterms:isPartOf <#readings-2014-11-08.pdf> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readings-2014-11-08T00:00.html> a dcat:Distribution ;
>         dcat:format "text/html" ;
>         dcterms:isPartOf <#readings-2014-11-08.html> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
>         dcat:format "application/xml" ;
>         dcterms:isPartOf <#readings-2014-11-08.xml> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
>         dcat:format "text/turtle" ;
>         dcterms:isPartOf <#readings-2014-11-08.ttl> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
>         dcat:format "application/json" ;
>         dcterms:isPartOf <#readings-2014-11-08.json> ;
>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>       <#readings-2014-11-08T06:00> a dcat:Dataset;
>         dcterms:isPartOf <#readings-2014-11-08> ;
>         dcat:distribution <#readings-2014-11-08T06:00.csv> ;
>         dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
>         dcat:distribution <#readings-2014-11-08T06:00.html> ;
>         dcat:distribution <#readings-2014-11-08T06:00.xml> ;
>         dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
>         dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .
>
>       [.. Distribution descriptions]
>
>       <#readings-2014-11-08T12:00> a dcat:Dataset;
>         dcterms:isPartOf <#readings-2014-11-08> ;
>         dcat:distribution <#readings-2014-11-08T12:00.csv> ;
>         dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
>         dcat:distribution <#readings-2014-11-08T12:00.html> ;
>         dcat:distribution <#readings-2014-11-08T12:00.xml> ;
>         dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
>         dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .
>
>       [.. Distribution descriptions]
>
>       <#readings-2014-11-08T18:00> a dcat:Dataset;
>         dcterms:isPartOf <#readings-2014-11-08> ;
>         dcat:distribution <#readings-2014-11-08T18:00.csv> ;
>         dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
>         dcat:distribution <#readings-2014-11-08T18:00.html> ;
>         dcat:distribution <#readings-2014-11-08T18:00.xml> ;
>         dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
>         dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .
>
>       [.. Distribution descriptions]
>
>       <#readings-2014-11-08> a dcat:Dataset;
>         dcterms:hasPart <#readings-2014-11-08T00:00>;
>         dcterms:hasPart <#readings-2014-11-08T06:00>;
>         dcterms:hasPart <#readings-2014-11-08T12:00>;
>         dcterms:hasPart <#readings-2014-11-08T18:00>;
>         dcat:distribution <#readings-2014-11-08.zip> ;
>         dcat:distribution <#readings-2014-11-08.csv> ;
>         dcat:distribution <#readings-2014-11-08.html> ;
>         dcat:distribution <#readings-2014-11-08.pdf> ;
>         dcat:distribution <#readings-2014-11-08.xml> ;
>         dcat:distribution <#readings-2014-11-08.ttl> ;
>         dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>
>       [.. Distribution descriptions]
>
>       Such a set up might also have
>       <#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .
>
>       i.e. a Dataset can also be a Distribution, in this case conneg
>       would determine which version you got back - and I'm not sure of the best
>       way to make this explicit. One could simply make no statement about the
>       format of the returned data but I'm not aware of a commonly accepted way of
>       stating this explicitly. The HTTP Response header 'Vary' does this job but
>       if we want to make it explicit before the request is sent we'd need to do
>       some work (and find people who care!).
>
>       Of course there's no need for each Dataset to have the same variety
>       of Distributions as each other.
>
>
>          In your example, Phil, there is only one file, the zip one. What
>          if each one of the files has different distributions? If you are
>          sure that this case will never happen, and that multiple files
>          will always be distributed in one single file, maybe the current
>          definition of DCAT could be sufficient.
>
>          For CKAN and DSPL, a dataset is always the set of files.
>
>          I prefer to restrict the idea of dataset to a collection of
>          resources (in the sense of RDF resources). I do not like the idea
>          of using dataset as a collection of datasets. But we have to
>          discuss and collect examples.
>
>       I don't think I'm understanding your concern I'm afraid.
>       dcat:Dataset is a very abstract concept and says nothing about the number
>       of files that materialise it. The Distributions do that and
>       dcterms:isPartOf/hasPart should cover it, I think but, of course, if there
>       are cases where this doesn't work then we will indeed need to look at them.
>
>          I think that this granularity is important. There would be
>          metadata at each of these levels.
>
>       The CSV WG is reusing the idea of a package (with JSON metadata)
>       but that's specifically about CSVs.
>
>       Does this help?
>
>       Phil.
>
>
>
>          On Saturday, 8 November 2014, Phil Archer <phila@w3.org> wrote:
>             I'm confident that DCAT supports this already. The DCAT
>             definition does not say whether the collection of data is in a
>             single file or multiple files since a dcat:Dataset is an
>             abstract concept that may be accessible by a distribution.
>
>             dcterms:hasPart and dcterms:isPartOf are probably useful here,
>             and I'd want to use those at the Dataset level, not the
>             distribution level, something like:
>
>             <readings-2014-11-08T00:00> a dcat:Dataset;
>                dcterms:isPartOf <readings-2014-11-08> .
>
>             <readings-2014-11-08T06:00> a dcat:Dataset;
>                dcterms:isPartOf <readings-2014-11-08> .
>
>             <readings-2014-11-08T12:00> a dcat:Dataset;
>                dcterms:isPartOf <readings-2014-11-08> .
>
>             <readings-2014-11-08T18:00> a dcat:Dataset;
>                dcterms:isPartOf <readings-2014-11-08> .
>
>
>             <readings-2014-11-08> a dcat:Dataset;
>                dcterms:hasPart <readings-2014-11-08T00:00>;
>                dcterms:hasPart <readings-2014-11-08T06:00>;
>                dcterms:hasPart <readings-2014-11-08T12:00>;
>                dcterms:hasPart <readings-2014-11-08T18:00>;
>                dcat:distribution <readings-2014-11-08.zip> .
>
>             <readings-2014-11-08.zip> a dcat:Distribution;
>                dcat:mediaType "application/zip" .
>
>
>             The 4 timed readings and the collected readings for the day are
>             all dcat:Datasets, i.e. they are all "A collection of data,
>             published or curated by a single agent, and available for access
>             or download in one or more formats."
>
>             Would that work for you Laufer?
>
>
>             On 07/11/2014 23:40, Laufer wrote:
>                I agree with you Phil. But as there are many different
>                definitions of this term being used, we have to assert the
>                definition that we would accept.
>
>                I think that we will also need to use a term to talk about
>                bundles that include multiple files, multiple datasets.
>                Maybe container, package...
>
>                As I understand, DCAT's definition of dataset does not
>                include a dataset as a set of files, for example.
>
>                Regards,
>                Laufer
>
>                On Friday, 7 November 2014, Phil Archer <phila@w3.org>
>                wrote:
>
>                   I tried to word the issue relatively objectively just now
>                   in tracker, allowing for the possibility of the WG to come
>                   up with a definition of 'dataset' other than that in DCAT.
>                   More subjectively, I would personally be very opposed to
>                   any such redefinition unless there were very strong
>                   arguments for doing so.
>
>                   Phil.
>
>
>                   On 07/11/2014 14:25, Data on the Web Best Practices
>                   Working Group Issue
>                   Tracker wrote:
>
>                     ISSUE-80: We need a definition of "dataset"
>
>                      http://www.w3.org/2013/dwbp/track/issues/80
>
>                      Raised by:
>                      On product:
>
>                        --
>                   Phil Archer
>                   W3C Data Activity Lead
>                   http://www.w3.org/2013/data/
>                   http://philarcher.org
>                   +44 (0)7887 767755
>                   @philarcher1
>
>
>
>
>
>
>
>    --
>    .  .  .  .. .  .
>    .        .   . ..
>    .     ..       .
>
>
>
>
> --
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
>
> ----------------------------------------------------------------------------
>
>


-- 
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------


Received on Tuesday, 11 November 2014 13:34:25 UTC

This archive was generated by hypermail 2.3.1 : Tuesday, 6 January 2015 20:24:18 UTC