Re: ISSUE-80: We need a definition of "dataset"

Sorry about being the one not so sure.

For me, datasets are instances of data generated at a specific time, even
if they come from the same database, from different databases with the
same schema, or from RDF data using the same ontology.

The example of sensors involves the concept of real-time data, which
introduces another issue already identified by the WG.

Let's take the example of agencies that publish data from Excel files on a
regular basis. For me, these are different datasets and they are not part of
another dataset. Others may see all the Excel files as part of one huge
dataset, composed of all the files generated up to the current date. Each
one of these datasets has its own distributions, maybe in different formats
for the same set of data (CSV, XML, etc.). But all the CSVs have the same
schema, all the XMLs have the same schema, and so on. Probably all of them
will have the same license: a single license.
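
As a rough illustration of these two readings, here is a minimal Python
sketch. The class names, the `publish` helper, the file names, and the
license URL are all hypothetical and only model the situation described
above; this is not DCAT or any real catalog API.

```python
from dataclasses import dataclass

# Hypothetical media types for the formats mentioned above.
MEDIA_TYPES = {"csv": "text/csv", "xml": "application/xml"}


@dataclass(frozen=True)
class Distribution:
    file_name: str
    media_type: str


@dataclass(frozen=True)
class Dataset:
    identifier: str
    license: str
    distributions: tuple


def publish(period: str, license_url: str) -> Dataset:
    """Build the dataset for one publication period, one distribution per format."""
    distributions = tuple(
        Distribution(f"spending-{period}.{ext}", media_type)
        for ext, media_type in MEDIA_TYPES.items()
    )
    return Dataset(f"spending-{period}", license_url, distributions)


LICENSE = "https://example.org/license"  # placeholder URL, not a real license
releases = [publish(p, LICENSE) for p in ("2014-09", "2014-10", "2014-11")]

# First reading: three independent datasets, each with its own distributions.
# Second reading: the same files seen as one dataset that grows over time.
all_files = [d.file_name for r in releases for d in r.distributions]
```

Both views describe exactly the same files; they differ only in where the
dataset boundary is drawn.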

My doubt is about the metadata that is common to these datasets. If we are
happy with the idea that common metadata has to be replicated for each
distribution (as stated by DCAT), and that we do not need a mechanism to
define metadata at a higher level of composition, I could agree with
using only the DCAT definition. It means that, for each one of these
datasets, publishers will have to replicate the links to metadata about
schema, license, etc.
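
To make the trade-off concrete, here is a minimal sketch in Python. The
names `Node`, `resolve`, and `replicate` are hypothetical, and the rule
that the nearest definition wins is an assumption borrowed from the CSV WG
priority chain mentioned in the quoted thread; neither DCAT nor
dcterms:isPartOf defines such a rule.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """A grouping, dataset, or distribution carrying its own metadata."""
    name: str
    metadata: dict = field(default_factory=dict)
    parent: Optional["Node"] = None


def resolve(node: Node, key: str) -> Optional[str]:
    """Return the value for key from the nearest level that defines it.

    Assumed inheritance rule: the inner definition wins and outer levels
    act as fallbacks. DCAT itself does not specify this behaviour.
    """
    current = node
    while current is not None:
        if key in current.metadata:
            return current.metadata[key]
        current = current.parent
    return None


def replicate(node: Node, keys: list) -> dict:
    """Copy every resolvable value onto the node itself (the replication approach)."""
    explicit = {}
    for key in keys:
        value = resolve(node, key)
        if value is not None:
            explicit[key] = value
    return explicit


# Illustrative names and placeholder URLs only.
series = Node(
    "agency-spending",
    {"license": "https://example.org/license", "schema": "https://example.org/schema"},
)
january = Node("spending-2014-01", parent=series)
january_csv = Node("spending-2014-01.csv", {"format": "text/csv"}, parent=january)

explicit = replicate(january_csv, ["license", "schema", "format"])
```

With inheritance, the license and schema are stated once on the grouping
level; with replication, the result of `replicate` has to be written out
for every distribution of every dataset.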

Cheers,
Laufer


2014-11-11 11:33 GMT-02:00 Bernadette Farias Lóscio <bfl@cin.ufpe.br>:

> +1 to Makx!
>
> But I think we should have a better understanding of how to specify a
> dataset using the DCAT definition. We should also have examples that
> illustrate dataset definitions and their corresponding distributions.
>
> Phil sent an example of a dataset definition and I sent another version
> of the same example. If possible, it would be nice to discuss these ideas
> too.
>
> Thanks!
> Bernadette
>
> 2014-11-11 11:26 GMT-02:00 Steven Adler <adler1@us.ibm.com>:
>
>> +1.  Well said.
>>
>>
>> Best Regards,
>>
>> Steve
>>
>> Motto: "Do First, Think, Do it Again"
>>
>> From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
>> To: Laufer <laufer@globo.com>
>> Cc: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
>> Date: 11/10/2014 04:21 PM
>> Subject: Re: ISSUE-80: We need a definition of "dataset"
>> ------------------------------
>>
>>
>>
>> Hi all,
>>
>> I like the idea of using the dataset definition from DCAT and I fully
>> agree with Makx that "we should not try to redefine what's already
>> well-defined".
>>
>> I like the examples of Phil, but I think that we still need to clarify
>> the meaning of a dataset. To do this, I'd like to make a comparison between
>> DCAT concepts and database concepts.
>>
>> In the database world, a database is defined as "a collection of related
>> data. By data, we mean known facts that can be recorded and that have
>> implicit meaning". Moreover, "a database is a logically coherent
>> collection of data with some inherent meaning. A random assortment of data
>> cannot correctly be referred to as a database." [1]
>>
>> The relational model represents the database as a collection of relations
>> (or tables). Informally, each relation resembles a table of values or, to
>> some extent, a flat file of records.  For example, to construct the
>> database that describes a university, we store data to represent each
>> student, course, section, grade report, and prerequisite as a record in the
>> appropriate table.
>>
>> In DCAT, a dataset is a collection of data, published or curated by a
>> single agent, and available for access or download in one or more formats.
>> A data catalog is a curated collection of metadata about datasets. In DCAT
>> there is no notion of related data or of a collection of data with some
>> inherent meaning.
>>
>> In my opinion, if we compare DCAT and databases, a *dataset* is similar
>> to a database, and the data organization of a given dataset will depend
>> on the data model used to represent the data.
>>
>> Considering that a given dataset may have multiple distributions, the
>> organization of files will depend on the data model of each
>> distribution (e.g. csv, xml, rdf, json). For example, a csv distribution may
>> have multiple tables and an xml distribution may have one or more xml
>> documents. When a given distribution has more than one file (e.g.
>> multiple csv files), I agree with Phil that we can use dcterms:isPartOf
>> to associate multiple files with a given distribution.
>>
>> In other words, I think that DCAT's definition is more abstract and
>> is not concerned with the organization of the data in different files.
>> This should be defined by the data model used in each distribution.
>>
>> Concerning the metadata, I think that there will be metadata related to
>> the dataset and metadata related to each available distribution, as
>> proposed by DCAT.
>>
>> In this context, in Phil's example, instead of having
>> several datasets, there will be one dataset and six distributions. Each
>> distribution will be composed of several files. The composition of a
>> distribution will depend on the data model (e.g. csv tables, xml
>> documents, ...). Below I present a new proposal for the dataset
>> definition and its corresponding csv distribution (the other distributions
>> are similar).
>>
>> <#sensor-readings> a dcat:Dataset;
>>   dcat:distribution <#sensor-readings.csv> ;
>>   dcat:distribution <#sensor-readings.pdf> ;
>>   dcat:distribution <#sensor-readings.html> ;
>>   dcat:distribution <#sensor-readings.xml> ;
>>   dcat:distribution <#sensor-readings.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>>
>> <#sensor-readings.csv> a dcat:Distribution ;
>>   dcterms:isPartOf <#sensor-readings> ;
>>   dcterms:hasPart <#readings-2014-11-08T00:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-08T06:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-08T12:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-08T18:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-08.zip> ;
>>   dcterms:hasPart <#readings-2014-11-08.csv> ;
>>   dcterms:hasPart <#readings-2014-11-09T00:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-09T06:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-09T12:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-09T18:00.csv> ;
>>   dcterms:hasPart <#readings-2014-11-09.zip> ;
>>   dcterms:hasPart <#readings-2014-11-09.csv> .
>>
>> <#readings-2014-11-08T00:00.csv> a csv:table ;
>>   dcat:format "text/csv" .
>>
>> <#readings-2014-11-08T06:00.csv> a csv:table ;
>>   dcat:format "text/csv" .
>>
>> <#readings-2014-11-08T18:00.csv> a csv:table ;
>>   dcat:format "text/csv" .
>>
>> [...Table descriptions]
>>
>> [...Distribution descriptions]
>>
>>
>> In this case, the publication of new data, i.e., a new sensor reading,
>> implies the insertion of a new file in each one of the distributions. It is
>> important to note that data is distributed at two different levels of
>> granularity (e.g. per day and every six hours).
>>
>> I'm not sure how to define the type of a specific file. For example, I
>> used "csv:table" to define the type of <#readings-2014-11-08T00:00.csv>.
>> However, I think that the csv model doesn't define this.
>>
>> @Phil, please let me know if this definition makes sense to you and if it
>> is DCAT conformant.
>>
>> I think we can use the dataset definition from DCAT; however, it is
>> important to have a better understanding of how to specify datasets and
>> their corresponding distributions.
>>
>> I'm sorry for the long message :)
>>
>> kind regards,
>> Bernadette
>>
>> [1] Ramez Elmasri, Shamkant B. Navathe, "Fundamentals of Database
>> Systems", 6th Edition. ISBN-13: 978-0136086208
>>
>> 2014-11-09 11:40 GMT-03:00 Laufer <laufer@globo.com>:
>>
>>    Ok, Phil.
>>
>>    Let's continue with the example.
>>
>>    Suppose that there is a metadata type, for example, license (a
>>    metadata type that exists in the DCAT spec), that applies to these
>>    datasets and is common to all files.
>>
>>    DCAT defines a license property for the catalog and "Even if the
>>    license of the catalog applies to all of its datasets and distributions, it
>>    should be replicated on each distribution."
>>
>>    In the CSV WG they discussed levels of metadata definitions, with
>>    ways of linking the metadata file to the CSV file. They defined a priority
>>    chain that states that the inner definition has a higher priority level,
>>    so, if, for example, there is a definition of the same type of metadata
>>    related to a package and to a file, the metadata related to the file will
>>    be the valid one.
>>
>>    Will we assume that metadata related to a dataset that has
>>    parts applies to all of its parts, or, as with dcat:license for the
>>    catalog, will the same metadata have to be linked to all distributions of
>>    all datasets that are parts of the dataset that groups them all?
>>
>>    Reading the DCAT spec, I feel that the way of grouping datasets is
>>    the catalog. A catalog has parts, which are the datasets. But the type
>>    catalog is different from the type dataset. In your example, a dataset
>>    that groups other datasets could be seen as a type of catalog, a hierarchy
>>    in the definition of the catalog, and at the same time be a dataset.
>>
>>    Maybe I am not seeing things correctly, but I think that here we are
>>    defining a type of dataset grouping that is not addressed in the DCAT spec.
>>    The use of dcterms:hasPart and dcterms:isPartOf is interesting. Will the
>>    DWBP WG recommend that?
>>
>>    In the CSV WG they have the idea of metadata inheritance. The semantics
>>    of dcterms:hasPart and dcterms:isPartOf say nothing about inheritance. I
>>    am not saying that we will have inheritance (or not), but it is something
>>    that is common when we have collections, packages, etc. The DWBP WG will
>>    have to make this issue explicit to the users. Will our extension of DCAT
>>    address this issue?
>>
>>    I am not sure that replicating the information of all types of
>>    metadata that are common to a group of datasets is the best solution, or
>>    something that users usually do. I guess that this issue was probably
>>    discussed exhaustively when defining DCAT. Sorry about the repetition.
>>
>>    Best Regards,
>>    Laufer
>>
>>    2014-11-09 9:52 GMT-02:00 Phil Archer <phila@w3.org>:
>>
>>
>>       On 08/11/2014 17:06, Laufer wrote:
>>          I am not against the DCAT definition. What I am saying is that
>>          the DCAT dataset does not address multiple datasets with
>>          different distributions that could form a bundle.
>>
>>       OK I was being a little lazy. The following RDF expands on my
>>       original example and is DCAT conformant. I'm thinking of some sort of
>>       sensor readings taken every 6 hours and made available in different
>>       formats. Once a day all formats are bundled up and available as a single
>>       day's readings in all original formats as well as a zip file with
>>       everything.
>>
>>       <#readings-2014-11-08T00:00> a dcat:Dataset;
>>         dcterms:isPartOf <#readings-2014-11-08> ;
>>         dcat:distribution <#readings-2014-11-08T00:00.csv> ;
>>         dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
>>         dcat:distribution <#readings-2014-11-08T00:00.html> ;
>>         dcat:distribution <#readings-2014-11-08T00:00.xml> ;
>>         dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
>>         dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .
>>
>>       <#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
>>         dcat:format "text/csv";
>>         dcterms:isPartOf <#readings-2014-11-08.csv> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
>>         dcat:format "application/pdf" ;
>>         dcterms:isPartOf <#readings-2014-11-08.pdf> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readings-2014-11-08T00:00.html> a dcat:Distribution ;
>>         dcat:format "text/html" ;
>>         dcterms:isPartOf <#readings-2014-11-08.html> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
>>         dcat:format "application/xml" ;
>>         dcterms:isPartOf <#readings-2014-11-08.xml> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
>>         dcat:format "text/turtle" ;
>>         dcterms:isPartOf <#readings-2014-11-08.ttl> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
>>         dcat:format "application/json" ;
>>         dcterms:isPartOf <#readings-2014-11-08.json> ;
>>         dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>>       <#readings-2014-11-08T06:00> a dcat:Dataset;
>>         dcterms:isPartOf <#readings-2014-11-08> ;
>>         dcat:distribution <#readings-2014-11-08T06:00.csv> ;
>>         dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
>>         dcat:distribution <#readings-2014-11-08T06:00.html> ;
>>         dcat:distribution <#readings-2014-11-08T06:00.xml> ;
>>         dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
>>         dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .
>>
>>       [.. Distribution descriptions]
>>
>>       <#readings-2014-11-08T12:00> a dcat:Dataset;
>>         dcterms:isPartOf <#readings-2014-11-08> ;
>>         dcat:distribution <#readings-2014-11-08T12:00.csv> ;
>>         dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
>>         dcat:distribution <#readings-2014-11-08T12:00.html> ;
>>         dcat:distribution <#readings-2014-11-08T12:00.xml> ;
>>         dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
>>         dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .
>>
>>       [.. Distribution descriptions]
>>
>>       <#readings-2014-11-08T18:00> a dcat:Dataset;
>>         dcterms:isPartOf <#readings-2014-11-08> ;
>>         dcat:distribution <#readings-2014-11-08T18:00.csv> ;
>>         dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
>>         dcat:distribution <#readings-2014-11-08T18:00.html> ;
>>         dcat:distribution <#readings-2014-11-08T18:00.xml> ;
>>         dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
>>         dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .
>>
>>       [.. Distribution descriptions]
>>
>>       <#readings-2014-11-08> a dcat:Dataset;
>>         dcterms:hasPart <#readings-2014-11-08T00:00>;
>>         dcterms:hasPart <#readings-2014-11-08T06:00>;
>>         dcterms:hasPart <#readings-2014-11-08T12:00>;
>>         dcterms:hasPart <#readings-2014-11-08T18:00>;
>>         dcat:distribution <#readings-2014-11-08.zip> ;
>>         dcat:distribution <#readings-2014-11-08.csv> ;
>>         dcat:distribution <#readings-2014-11-08.html> ;
>>         dcat:distribution <#readings-2014-11-08.pdf> ;
>>         dcat:distribution <#readings-2014-11-08.xml> ;
>>         dcat:distribution <#readings-2014-11-08.ttl> ;
>>         dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>>
>>       [.. Distribution descriptions]
>>
>>       Such a set up might also have
>>       <#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .
>>
>>       i.e. a Dataset can also be a Distribution, in this case conneg
>>       would determine which version you got back - and I'm not sure of the best
>>       way to make this explicit. One could simply make no statement about the
>>       format of the returned data but I'm not aware of a commonly accepted way of
>>       stating this explicitly. The HTTP Response header 'Vary' does this job but
>>       if we want to make it explicit before the request is sent we'd need to do
>>       some work (and find people who care!).
>>
>>       Of course there's no need for each Dataset to have the same
>>       variety of Distributions as each other.
>>
>>
>>          In your example, Phil, there is only one file, the zip one. What
>>          if each one of the files has different distributions? If you
>>          are sure that this case will never happen, and that multiple
>>          files will always be distributed as one single file, maybe the
>>          current definition of DCAT could be sufficient.
>>
>>          For CKAN and DSPL, a dataset is always the set of files.
>>
>>          I prefer to restrict the idea of dataset to a collection of
>>          resources (in the sense of RDF resources). I do not like the
>>          idea of using dataset as a collection of datasets. But we have
>>          to discuss and collect examples.
>>
>>       I don't think I'm understanding your concern I'm afraid.
>>       dcat:Dataset is a very abstract concept and says nothing about the number
>>       of files that materialise it. The Distributions do that and
>>       dcterms:isPartOf/hasPart should cover it, I think but, of course, if there
>>       are cases where this doesn't work then we will indeed need to look at them.
>>
>>          I think that this granularity is important. There would be
>>          metadata at each of these levels.
>>
>>       The CSV WG is reusing the idea of a package (with JSON metadata)
>>       but that's specifically about CSVs.
>>
>>       Does this help?
>>
>>       Phil.
>>
>>
>>
>>          On Saturday, 8 November 2014, Phil Archer <phila@w3.org> wrote:
>>             I'm confident that DCAT supports this already. The DCAT
>>             definition does not say whether the collection of data is in
>>             a single file or multiple files since a dcat:Dataset is an
>>             abstract concept that may be accessible by a distribution.
>>
>>             dcterms:hasPart and dcterms:isPartOf are probably useful
>>             here, and I'd want to use those at the Dataset level, not
>>             the distribution level, something like:
>>
>>             <readings-2014-11-08T00:00> a dcat:Dataset;
>>                dcterms:isPartOf <readings-2014-11-08> .
>>
>>             <readings-2014-11-08T06:00> a dcat:Dataset;
>>                dcterms:isPartOf <readings-2014-11-08> .
>>
>>             <readings-2014-11-08T12:00> a dcat:Dataset;
>>                dcterms:isPartOf <readings-2014-11-08> .
>>
>>             <readings-2014-11-08T18:00> a dcat:Dataset;
>>                dcterms:isPartOf <readings-2014-11-08> .
>>
>>
>>             <readings-2014-11-08> a dcat:Dataset;
>>                dcterms:hasPart <readings-2014-11-08T00:00>;
>>                dcterms:hasPart <readings-2014-11-08T06:00>;
>>                dcterms:hasPart <readings-2014-11-08T12:00>;
>>                dcterms:hasPart <readings-2014-11-08T18:00>;
>>                dcat:distribution <readings-2014-11-08.zip> .
>>
>>             <readings-2014-11-08.zip> a dcat:Distribution;
>>                dcat:mediaType "application/zip" .
>>
>>
>>             The 4 timed readings and the collected readings for the day
>>             are all dcat:Datasets, i.e. they are all "A collection of
>>             data, published or curated by a single agent, and available
>>             for access or download in one or more formats."
>>
>>             Would that work for you Laufer?
>>
>>
>>             On 07/11/2014 23:40, Laufer wrote:
>>                I agree with you Phil. But as there are many different
>>                definitions of this term being used, we have to assert the
>>                definition that we would accept.
>>
>>                I think that we will also need to use a term to talk
>>                about bundles that include multiple files, multiple
>>                datasets. Maybe container, package...
>>
>>                As I understand, DCAT's definition of dataset does not
>>                include a dataset as a set of files, for example.
>>
>>                Regards,
>>                Laufer
>>
>>                On Friday, 7 November 2014, Phil Archer <phila@w3.org> wrote:
>>
>>                   I tried to word the issue relatively objectively just
>>                   now in tracker, allowing for the possibility of the WG
>>                   coming up with a definition of 'dataset' other than
>>                   that in DCAT. More subjectively, I would personally be
>>                   very opposed to any such redefinition unless there
>>                   were very strong arguments for doing so.
>>
>>                   Phil.
>>
>>
>>                   On 07/11/2014 14:25, Data on the Web Best Practices
>>                   Working Group Issue
>>                   Tracker wrote:
>>
>>                     ISSUE-80: We need a definition of "dataset"
>>
>>                      http://www.w3.org/2013/dwbp/track/issues/80
>>
>>                      Raised by:
>>                      On product:
>>
>>
>>
>>
>>
>>
>>
>>                        --
>>
>>
>>                   Phil Archer
>>                   W3C Data Activity Lead
>>                   http://www.w3.org/2013/data/
>>
>>                   http://philarcher.org
>>                   +44 (0)7887 767755
>>                   @philarcher1
>>
>>
>>    --
>>    .  .  .  .. .  .
>>    .        .   . ..
>>    .     ..       .
>>
>>
>>
>>
>> --
>> Bernadette Farias Lóscio
>> Centro de Informática
>> Universidade Federal de Pernambuco - UFPE, Brazil
>>
>> ----------------------------------------------------------------------------
>>
>>
>
>
>



-- 
.  .  .  .. .  .
.        .   . ..
.     ..       .

Received on Tuesday, 11 November 2014 14:14:03 UTC