Re: ISSUE-80: We need a definition of "dataset"

Hi all,

I like the idea of using the dataset definition from DCAT and I fully agree
with Makx that "we should not try to redefine what's already well-defined".

I like Phil's examples, but I think that we still need to clarify the
meaning of a dataset. To do this, I'd like to compare DCAT concepts with
database concepts.

In the database world, a database is defined as "a collection of related
data. By data, we mean known facts that can be recorded and that have
implicit meaning". Moreover, "a database is a logically coherent
collection of data with some inherent meaning. A random assortment of data
cannot correctly be referred to as a database." [1]

The relational model represents the database as a collection of relations
(or tables). Informally, each relation resembles a table of values or, to
some extent, a flat file of records.  For example, to construct the
database that describes a university, we store data to represent each
student, course, section, grade report, and prerequisite as a record in the
appropriate table.

In DCAT, a dataset is a collection of data, published or curated by a
single agent, and available for access or download in one or more formats.
A data catalog is a curated collection of metadata about datasets. DCAT,
however, has no notion of related data or of a collection of data with some
inherent meaning.

In my opinion, if we compare DCAT with databases, a *dataset* is similar to
a database, and the data organization of a given dataset will depend on the
data model used to represent the data.

Considering that a given dataset may have multiple distributions, the
organization of files will depend on the data model of each distribution
(e.g. csv, xml, rdf, json). For example, a csv distribution may have
multiple tables, and an xml distribution may have one or more xml documents.
When a given distribution has more than one file (e.g. multiple csv files),
I agree with Phil that we can use dcterms:isPartOf to associate the files
with that distribution.

In other words, I think that DCAT's definition is more abstract and does not
concern itself with how the data is organized into files. That should be
defined by the data model used in each distribution.

Concerning the metadata, I think that there will be metadata related to the
dataset and metadata related to each available distribution, as proposed by
DCAT.
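
For instance, a minimal sketch of this split (the title, publisher, license
and download URLs below are hypothetical placeholders, not part of Phil's
example):

<#sensor-readings> a dcat:Dataset ;
  # dataset-level metadata
  dcterms:title "Sensor readings" ;
  dcterms:publisher <http://example.org/sensor-agency> ;
  dcterms:license <http://example.org/license> ;
  dcat:distribution <#sensor-readings.csv> .

<#sensor-readings.csv> a dcat:Distribution ;
  # distribution-level metadata
  dcat:mediaType "text/csv" ;
  dcat:downloadURL <http://example.org/sensor-readings.csv> .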

In this context, in Phil's example, instead of having several datasets there
would be one dataset and six distributions. Each distribution would be
composed of several files, and the composition of a distribution would
depend on its data model (e.g. csv tables, xml documents). Below, I present
a new proposal for the dataset definition and its corresponding csv
distribution (the other distributions are similar).

<#sensor-readings> a dcat:Dataset;
  dcat:distribution <#sensor-readings.csv> ;
  dcat:distribution <#sensor-readings.pdf> ;
  dcat:distribution <#sensor-readings.html> ;
  dcat:distribution <#sensor-readings.xml> ;
  dcat:distribution <#sensor-readings.ttl> ;
  dcat:distribution <#readingsapi?date=2014-11-08&time=all> .

<#sensor-readings.csv> a dcat:Distribution ;
  dcterms:isPartOf <#sensor-readings> ;
  dcterms:hasPart <#readings-2014-11-08T00:00.csv> ;
  dcterms:hasPart <#readings-2014-11-08T06:00.csv> ;
  dcterms:hasPart <#readings-2014-11-08T12:00.csv> ;
  dcterms:hasPart <#readings-2014-11-08T18:00.csv> ;
  dcterms:hasPart <#readings-2014-11-08.zip> ;
  dcterms:hasPart <#readings-2014-11-08.csv> ;
  dcterms:hasPart <#readings-2014-11-09T00:00.csv> ;
  dcterms:hasPart <#readings-2014-11-09T06:00.csv> ;
  dcterms:hasPart <#readings-2014-11-09T12:00.csv> ;
  dcterms:hasPart <#readings-2014-11-09T18:00.csv> ;
  dcterms:hasPart <#readings-2014-11-09.zip> ;
  dcterms:hasPart <#readings-2014-11-09.csv> .

<#readings-2014-11-08T00:00.csv> a csv:table ;
  dcterms:format "text/csv" .

<#readings-2014-11-08T06:00.csv> a csv:table ;
  dcterms:format "text/csv" .

<#readings-2014-11-08T12:00.csv> a csv:table ;
  dcterms:format "text/csv" .

[...Table descriptions]


[...Distribution descriptions]

In this case, the publication of new data, i.e., a new sensor reading,
implies inserting a new file into each of the distributions. It is important
to note that the data is distributed at two different levels of granularity
(per day and every six hours).
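
For example, a sketch of what publishing the readings of a new (hypothetical)
day would add to the csv distribution:

# the new six-hourly table is attached to the existing csv distribution
<#sensor-readings.csv> dcterms:hasPart <#readings-2014-11-10T00:00.csv> .

# once the day is complete, the daily aggregate and bundle are attached too
<#sensor-readings.csv> dcterms:hasPart <#readings-2014-11-10.csv> ;
                       dcterms:hasPart <#readings-2014-11-10.zip> .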

I'm not sure how to define the type of a specific file. For example, I used
"csv:table" as the type of <#readings-2014-11-08T00:00.csv>, but I don't
think the csv model actually defines such a class.
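
As a placeholder, one option (just a sketch, not a settled recommendation)
would be to drop the class assertion and record only the format of each part,
leaving room to slot in a class from the CSV on the Web WG's vocabulary once
that work stabilizes:

# no class from the csv model is asserted; only the part's format is recorded
<#readings-2014-11-08T00:00.csv>
  dcterms:format "text/csv" .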

@Phil, please let me know if this definition makes sense to you and if it
is DCAT conformant.

I think we can use the dataset definition from DCAT; however, it is
important to have a better understanding of how to specify datasets and
their corresponding distributions.

I'm sorry for the long message :)

kind regards,
Bernadette

[1] Ramez Elmasri, Shamkant B. Navathe, "Fundamentals of Database Systems",
6th Edition. ISBN-13: 978-0136086208

2014-11-09 11:40 GMT-03:00 Laufer <laufer@globo.com>:

> Ok, Phil.
>
> Let's continue with the example.
>
> Suppose that there is a metadata type, for example, license (a metadata
> type that exists in DCAT spec), that applies to these datasets and is
> common to all files.
>
> DCAT defines a license property for the catalog and "Even if the license
> of the catalog applies to all of its datasets and distributions, it should
> be replicated on each distribution."
>
> In the CSV WG they discussed levels of metadata definitions, with ways of
> linking the metadata file to the CSV file. They defined a priority chain
> that states that the inner definition has a higher priority level, so, if,
> for example, there is a definition of the same type of metadata related to
> a package and to a file, the metadata related to the file will be the valid
> one.
>
> Will we assume that metadata related to a dataset that has parts applies
> to all of its parts, or, as with dcat:license for the catalog, will the
> same metadata have to be linked to all distributions of all datasets that
> are parts of the dataset that groups them?
>
> Reading the DCAT spec, I feel that the way of grouping datasets is the
> catalog. A catalog has parts that are the datasets, but the type catalog is
> different from the type dataset. In your example, a dataset that groups
> other datasets could be seen as a type of catalog, a hierarchy in the
> definition of the catalog, and, at the same time, a dataset.
>
> Maybe I am not seeing things correctly, but I think that here we are
> defining a type of dataset grouping that is not addressed in the DCAT spec.
> The use of dcterms:hasPart and dcterms:isPartOf is interesting. Will the
> DWBP WG recommend that?
>
> In the CSV WG they have the idea of metadata inheritance. The semantics of
> dcterms:hasPart and dcterms:isPartOf say nothing about inheritance. I am
> not saying that we will (or will not) have inheritance, but it is something
> that is common when we have collections, packages, etc. The DWBP WG will
> have to make this issue explicit to the users. Will our extension of DCAT
> address this issue?
>
> I am not sure that replicating all the metadata that is common to a group
> of datasets is the best solution, or something that users usually do. I
> guess this issue was probably discussed exhaustively when DCAT was defined.
> Sorry about the repetition.
>
> Best Regards,
> Laufer
>
> 2014-11-09 9:52 GMT-02:00 Phil Archer <phila@w3.org>:
>
>
>>
>> On 08/11/2014 17:06, Laufer wrote:
>>
>>> I am not against the definition of DCAT. What I am saying is that the
>>> dataset to DCAT do not address multiple datasets with different
>>> distributions that could be a bundle.
>>>
>>
>> OK I was being a little lazy. The following RDF expands on my original
>> example and is DCAT conformant. I'm thinking of some sort of sensor
>> readings taken every 6 hours and made available in different formats. Once
>> a day all formats are bundled up and available as a single day's readings
>> in all original formats as well as a zip file with everything.
>>
>> <#readings-2014-11-08T00:00> a dcat:Dataset;
>>   dcterms:isPartOf <#readings-2014-11-08> ;
>>   dcat:distribution <#readings-2014-11-08T00:00.csv> ;
>>   dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
>>   dcat:distribution <#readings-2014-11-08T00:00.html> ;
>>   dcat:distribution <#readings-2014-11-08T00:00.xml> ;
>>   dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .
>>
>> <#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
>>   dcat:format "text/csv";
>>   dcterms:isPartOf <#readings-2014-11-08.csv> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
>>   dcat:format "application/pdf" ;
>>   dcterms:isPartOf <#readings-2014-11-08.pdf> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readings-2014-11-08T00:00.html> a dcat:Distribution ;
>>   dcat:format "text/html" ;
>>   dcterms:isPartOf <#readings-2014-11-08.html> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
>>   dcat:format "application/xml" ;
>>   dcterms:isPartOf <#readings-2014-11-08.xml> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
>>   dcat:format "text/turtle" ;
>>   dcterms:isPartOf <#readings-2014-11-08.ttl> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
>>   dcat:format "application/json" ;
>>   dcterms:isPartOf <#readings-2014-11-08.json> ;
>>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>>
>> <#readings-2014-11-08T06:00> a dcat:Dataset;
>>   dcterms:isPartOf <#readings-2014-11-08> ;
>>   dcat:distribution <#readings-2014-11-08T06:00.csv> ;
>>   dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
>>   dcat:distribution <#readings-2014-11-08T06:00.html> ;
>>   dcat:distribution <#readings-2014-11-08T06:00.xml> ;
>>   dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .
>>
>> [.. Distribution descriptions]
>>
>> <#readings-2014-11-08T12:00> a dcat:Dataset;
>>   dcterms:isPartOf <#readings-2014-11-08> ;
>>   dcat:distribution <#readings-2014-11-08T12:00.csv> ;
>>   dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
>>   dcat:distribution <#readings-2014-11-08T12:00.html> ;
>>   dcat:distribution <#readings-2014-11-08T12:00.xml> ;
>>   dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .
>>
>> [.. Distribution descriptions]
>>
>> <#readings-2014-11-08T18:00> a dcat:Dataset;
>>   dcterms:isPartOf <#readings-2014-11-08> ;
>>   dcat:distribution <#readings-2014-11-08T18:00.csv> ;
>>   dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
>>   dcat:distribution <#readings-2014-11-08T18:00.html> ;
>>   dcat:distribution <#readings-2014-11-08T18:00.xml> ;
>>   dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .
>>
>> [.. Distribution descriptions]
>>
>> <#readings-2014-11-08> a dcat:Dataset;
>>   dcterms:hasPart <#readings-2014-11-08T00:00>;
>>   dcterms:hasPart <#readings-2014-11-08T06:00>;
>>   dcterms:hasPart <#readings-2014-11-08T12:00>;
>>   dcterms:hasPart <#readings-2014-11-08T18:00>;
>>   dcat:distribution <#readings-2014-11-08.zip> ;
>>   dcat:distribution <#readings-2014-11-08.csv> ;
>>   dcat:distribution <#readings-2014-11-08.html> ;
>>   dcat:distribution <#readings-2014-11-08.pdf> ;
>>   dcat:distribution <#readings-2014-11-08.xml> ;
>>   dcat:distribution <#readings-2014-11-08.ttl> ;
>>   dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>>
>> [.. Distribution descriptions]
>>
>> Such a setup might also have
>> <#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .
>>
>> i.e. a Dataset can also be a Distribution; in this case conneg would
>> determine which version you got back - and I'm not sure of the best way to
>> make this explicit. One could simply make no statement about the format of
>> the returned data, but I'm not aware of a commonly accepted way of stating
>> this explicitly. The HTTP response header 'Vary' does this job, but if we
>> want to make it explicit before the request is sent we'd need to do some
>> work (and find people who care!).
>>
>> Of course there's no need for each Dataset to have the same variety of
>> Distributions as each other.
>>
>>
>>
>>> In your example, Phil, there is only one file, the zip one. And what if
>>> each one of the files has different distributions? If you are sure that
>>> this case will never happen, that when you have multiple files they will
>>> always be distributed in one single file, maybe the current definition of
>>> DCAT could be sufficient.
>>>
>>> For Ckan and DSPL, dataset is always the set of files.
>>>
>>> I prefer to restrict the idea of dataset to a collection of resources (in
>>> the sense of rdf resources). I do not like the idea of using dataset as a
>>> collection of datasets. But we have to discuss and collect examples.
>>>
>>
>> I don't think I'm understanding your concern, I'm afraid. dcat:Dataset is
>> a very abstract concept and says nothing about the number of files that
>> materialise it. The Distributions do that, and dcterms:isPartOf/hasPart
>> should cover it, I think, but of course, if there are cases where this
>> doesn't work then we will indeed need to look at them.
>>
>>
>>> I think that this granularity is important. There would be metadata in
>>> each
>>> of these levels.
>>>
>>
>> The CSV WG is reusing the idea of a package (with JSON metadata) but
>> that's specifically about CSVs.
>>
>> Does this help?
>>
>> Phil.
>>
>>
>>
>>
>>> Em sábado, 8 de novembro de 2014, Phil Archer <phila@w3.org> escreveu:
>>>
>>>  I'm confident that DCAT supports this already. The DCAT definition does
>>>> not say whether the collection of data is in a single file or multiple
>>>> files since a dcat:Dataset is an abstract concept that may be
>>>> accessible by a distribution.
>>>>
>>>> dcterms:hasPart and dcterms:isPartOf are probably useful here, and I'd
>>>> want to use those at the Dataset level, not the distribution level,
>>>> something like:
>>>>
>>>> <readings-2014-11-08T00:00> a dcat:Dataset;
>>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>>
>>>> <readings-2014-11-08T06:00> a dcat:Dataset;
>>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>>
>>>> <readings-2014-11-08T12:00> a dcat:Dataset;
>>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>>
>>>> <readings-2014-11-08T18:00> a dcat:Dataset;
>>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>>
>>>>
>>>> <readings-2014-11-08> a dcat:Dataset;
>>>>    dcterms:hasPart <readings-2014-11-08T00:00>;
>>>>    dcterms:hasPart <readings-2014-11-08T06:00>;
>>>>    dcterms:hasPart <readings-2014-11-08T12:00>;
>>>>    dcterms:hasPart <readings-2014-11-08T18:00>;
>>>>    dcat:distribution <readings-2014-11-08.zip> .
>>>>
>>>> <readings-2014-11-08.zip> a dcat:Distribution;
>>>>    dcat:mediaType "application/zip" .
>>>>
>>>>
>>>> The 4 timed readings and the collected readings for the day are all
>>>> dcat:Datasets, i.e. they are all "A collection of data, published or
>>>> curated by a single agent, and available for access or download in one
>>>> or more formats."
>>>>
>>>> Would that work for you Laufer?
>>>>
>>>>
>>>> On 07/11/2014 23:40, Laufer wrote:
>>>>
>>>>> I agree with you Phil. But as there are many different definitions of
>>>>> this term being used, we have to assert the definition that we would
>>>>> accept.
>>>>>
>>>>> I think that we will also need to use a term to talk about bundles that
>>>>> include multiple files, multiple datasets. Maybe container, package...
>>>>>
>>>>> As I understand, DCAT's definition of dataset does not include a
>>>>> dataset as a set of files, for example.
>>>>>
>>>>> Regards,
>>>>> Laufer
>>>>>
>>>>> Em sexta-feira, 7 de novembro de 2014, Phil Archer <phila@w3.org>
>>>>> escreveu:
>>>>>
>>>>>> I tried to word the issue relatively objectively just now in tracker,
>>>>>> allowing for the possibility of the WG to come up with a definition of
>>>>>> 'dataset' other than that in DCAT. More subjectively, I would
>>>>>> personally be very opposed to any such redefinition unless there were
>>>>>> very strong arguments for doing so.
>>>>>>
>>>>>> Phil.
>>>>>>
>>>>>>
>>>>>> On 07/11/2014 14:25, Data on the Web Best Practices Working Group
>>>>>> Issue Tracker wrote:
>>>>>>
>>>>>>   ISSUE-80: We need a definition of "dataset"
>>>>>>
>>>>>>>
>>>>>>> http://www.w3.org/2013/dwbp/track/issues/80
>>>>>>>
>>>>>>> Raised by:
>>>>>>> On product:
>>>>>>>
>>>>>>>   --
>>>>>>>
>>>>>>
>>>>>>
>>>>>> Phil Archer
>>>>>> W3C Data Activity Lead
>>>>>> http://www.w3.org/2013/data/
>>>>>>
>>>>>> http://philarcher.org
>>>>>> +44 (0)7887 767755
>>>>>> @philarcher1
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>  --
>>>>
>>>>
>>>> Phil Archer
>>>> W3C Data Activity Lead
>>>> http://www.w3.org/2013/data/
>>>>
>>>> http://philarcher.org
>>>> +44 (0)7887 767755
>>>> @philarcher1
>>>>
>>>>
>>>
>>>
>> --
>>
>>
>> Phil Archer
>> W3C Data Activity Lead
>> http://www.w3.org/2013/data/
>>
>> http://philarcher.org
>> +44 (0)7887 767755
>> @philarcher1
>>
>
>
>
> --
> .  .  .  .. .  .
> .        .   . ..
> .     ..       .
>



-- 
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------

Received on Monday, 10 November 2014 21:21:06 UTC