Re: ISSUE-80: We need a definition of "dataset" from Laufer on 2014-11-09 (public-dwbp-wg@w3.org from November 2014)

From: Laufer <laufer@globo.com>
Date: Sun, 9 Nov 2014 12:40:32 -0200
To: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
Message-ID: <CA+pXJigh_eqhsKvxZ8rFBA4JiQdx59j0Zz3yL_Wt82Xw+b2atg@mail.gmail.com>
Ok, Phil.

Let's continue with the example.

Suppose that there is a metadata type, for example license (a metadata
type that exists in the DCAT spec), that applies to these datasets and is
common to all files.

DCAT defines a license property for the catalog and says: "Even if the
license of the catalog applies to all of its datasets and distributions, it
should be replicated on each distribution."

In the CSV WG they discussed levels of metadata definitions, with ways of
linking the metadata file to the CSV file. They defined a priority chain
that states that the inner definition has a higher priority level, so, if,
for example, there is a definition of the same type of metadata related to
a package and to a file, the metadata related to the file will be the valid
one.

Will we assume that metadata attached to a dataset that has parts applies
to all of its parts? Or, as with the license (dcterms:license) of the
catalog, will the same metadata have to be linked to every distribution of
every dataset that is part of the dataset that groups them all?
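
To make the two options concrete, here is a minimal sketch in Turtle. It
reuses the identifiers from Phil's example; the base URI and the license URI
are made up for illustration, and placing dcterms:license on a dataset
(rather than on a catalog or a distribution) is itself an assumption, not
something the DCAT text quoted above covers.

```turtle
@base <http://example.org/sensors> .             # hypothetical base URI
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Option A: the license is stated once, on the dataset that groups the parts.
<#readings-2014-11-08> a dcat:Dataset ;
  dcterms:license <http://example.org/licenses/open> ;   # hypothetical license URI
  dcterms:hasPart <#readings-2014-11-08T00:00> .

<#readings-2014-11-08T00:00> a dcat:Dataset ;
  dcterms:isPartOf <#readings-2014-11-08> ;
  dcat:distribution <#readings-2014-11-08T00:00.csv> .

# Option B: the same license is repeated on every distribution of every part.
<#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
  dcterms:license <http://example.org/licenses/open> .
```

With option A alone, a consumer has to decide for itself whether the license
reaches the parts. With option B nothing is left to interpretation, at the
cost of repeating the statement everywhere.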

Reading the DCAT spec, I feel that the catalog is the way of grouping
datasets. A catalog has parts, which are the datasets. But the type catalog
is different from the type dataset. In your example, a dataset that groups
other datasets could be seen as a type of catalog, a hierarchy in the
definition of the catalog, and, at the same time, as a dataset.

Maybe I am not seeing things correctly, but I think that here we are
defining a type of dataset grouping that is not addressed in the DCAT spec.
The use of dcterms:hasPart and dcterms:isPartOf is interesting. Will the DWBP
WG recommend that?
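
As a rough sketch of the difference between the two kinds of grouping (the
<#catalog> identifier and the base URI are hypothetical; the dataset
identifiers are the ones from Phil's example):

```turtle
@base <http://example.org/sensors> .             # hypothetical base URI
@prefix dcat:    <http://www.w3.org/ns/dcat#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

# Grouping as DCAT describes it: a catalog lists its datasets.
<#catalog> a dcat:Catalog ;                      # hypothetical identifier
  dcat:dataset <#readings-2014-11-08T00:00> ,
               <#readings-2014-11-08T06:00> .

# Grouping as in Phil's example: a dataset whose parts are themselves datasets.
<#readings-2014-11-08> a dcat:Dataset ;
  dcterms:hasPart <#readings-2014-11-08T00:00> ,
                  <#readings-2014-11-08T06:00> .
```

Both graphs group the same two datasets, but the grouping resource has a
different type in each case, and a different property links it to its
members.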

In the CSV WG they have the idea of metadata inheritance. The semantics of
dcterms:hasPart and dcterms:isPartOf say nothing about inheritance. I am
not saying that we will have inheritance (or not), but it is a thing that is
common when we have collections, packages, etc. The DWBP WG will have to
make this issue explicit to the users. Will our extension of DCAT address
this issue?

I am not sure that replicating the information of all types of metadata
that are common to a group of datasets is the best solution, or something
that users usually do. I guess that this issue was probably discussed
exhaustively when DCAT was defined. Sorry about the repetition.

Best Regards,
Laufer

2014-11-09 9:52 GMT-02:00 Phil Archer <phila@w3.org>:

>
>
> On 08/11/2014 17:06, Laufer wrote:
>
>> I am not against the DCAT definition. What I am saying is that the
>> DCAT dataset does not address multiple datasets with different
>> distributions that could form a bundle.
>>
>
> OK I was being a little lazy. The following RDF expands on my original
> example and is DCAT conformant. I'm thinking of some sort of sensor
> readings taken every 6 hours and made available in different formats. Once
> a day all formats are bundled up and available as a single day's readings
> in all original formats as well as a zip file with everything.
>
> <#readings-2014-11-08T00:00> a dcat:Dataset;
>   dcterms:isPartOf <#readings-2014-11-08> ;
>   dcat:distribution <#readings-2014-11-08T00:00.csv> ;
>   dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
>   dcat:distribution <#readings-2014-11-08T00:00.html> ;
>   dcat:distribution <#readings-2014-11-08T00:00.xml> ;
>   dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .
>
> <#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
>   dcat:mediaType "text/csv" ;
>   dcterms:isPartOf <#readings-2014-11-08.csv> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
>   dcat:mediaType "application/pdf" ;
>   dcterms:isPartOf <#readings-2014-11-08.pdf> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readings-2014-11-08T00:00.html> a dcat:Distribution ;
>   dcat:mediaType "text/html" ;
>   dcterms:isPartOf <#readings-2014-11-08.html> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
>   dcat:mediaType "application/xml" ;
>   dcterms:isPartOf <#readings-2014-11-08.xml> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
>   dcat:mediaType "text/turtle" ;
>   dcterms:isPartOf <#readings-2014-11-08.ttl> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
>   dcat:mediaType "application/json" ;
>   dcterms:isPartOf <#readings-2014-11-08.json> ;
>   dcterms:isPartOf <#readings-2014-11-08.zip> .
>
> <#readings-2014-11-08T06:00> a dcat:Dataset;
>   dcterms:isPartOf <#readings-2014-11-08> ;
>   dcat:distribution <#readings-2014-11-08T06:00.csv> ;
>   dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
>   dcat:distribution <#readings-2014-11-08T06:00.html> ;
>   dcat:distribution <#readings-2014-11-08T06:00.xml> ;
>   dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .
>
> [.. Distribution descriptions]
>
> <#readings-2014-11-08T12:00> a dcat:Dataset;
>   dcterms:isPartOf <#readings-2014-11-08> ;
>   dcat:distribution <#readings-2014-11-08T12:00.csv> ;
>   dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
>   dcat:distribution <#readings-2014-11-08T12:00.html> ;
>   dcat:distribution <#readings-2014-11-08T12:00.xml> ;
>   dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .
>
> [.. Distribution descriptions]
>
> <#readings-2014-11-08T18:00> a dcat:Dataset;
>   dcterms:isPartOf <#readings-2014-11-08> ;
>   dcat:distribution <#readings-2014-11-08T18:00.csv> ;
>   dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
>   dcat:distribution <#readings-2014-11-08T18:00.html> ;
>   dcat:distribution <#readings-2014-11-08T18:00.xml> ;
>   dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .
>
> [.. Distribution descriptions]
>
> <#readings-2014-11-08> a dcat:Dataset;
>   dcterms:hasPart <#readings-2014-11-08T00:00>;
>   dcterms:hasPart <#readings-2014-11-08T06:00>;
>   dcterms:hasPart <#readings-2014-11-08T12:00>;
>   dcterms:hasPart <#readings-2014-11-08T18:00>;
>   dcat:distribution <#readings-2014-11-08.zip> ;
>   dcat:distribution <#readings-2014-11-08.csv> ;
>   dcat:distribution <#readings-2014-11-08.html> ;
>   dcat:distribution <#readings-2014-11-08.pdf> ;
>   dcat:distribution <#readings-2014-11-08.xml> ;
>   dcat:distribution <#readings-2014-11-08.ttl> ;
>   dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>
> [.. Distribution descriptions]
>
> Such a setup might also have
> <#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .
>
> i.e. a Dataset can also be a Distribution, in this case conneg would
> determine which version you got back - and I'm not sure of the best way to
> make this explicit. One could simply make no statement about the format of
> the returned data but I'm not aware of a commonly accepted way of stating
> this explicitly. The HTTP Response header 'Vary' does this job but if we
> want to make it explicit before the request is sent we'd need to do some
> work (and find people who care!).
>
> Of course there's no need for each Dataset to have the same variety of
> Distributions as each other.
>
>
>
>> In your example, Phil, there is only one file, the zip one. What if each
>> one of the files has different distributions? If you are sure that this
>> case will never happen, and that multiple files will always be distributed
>> as one single file, maybe the current definition in DCAT could be
>> sufficient.
>>
>> For CKAN and DSPL, a dataset is always the set of files.
>>
>> I prefer to restrict the idea of dataset to a collection of resources (in
>> the sense of rdf resources). I do not like the idea of using dataset as a
>> collection of datasets. But we have to discuss and collect examples.
>>
>
> I don't think I'm understanding your concern I'm afraid. dcat:Dataset is a
> very abstract concept and says nothing about the number of files that
> materialise it. The Distributions do that and dcterms:isPartOf/hasPart
> should cover it, I think but, of course, if there are cases where this
> doesn't work then we will indeed need to look at them.
>
>
>> I think that this granularity is important. There would be metadata at
>> each of these levels.
>>
>
> The CSV WG is reusing the idea of a package (with JSON metadata) but
> that's specifically about CSVs.
>
> Does this help?
>
> Phil.
>
>
>
>
>> On Saturday, 8 November 2014, Phil Archer <phila@w3.org> wrote:
>>
>>> I'm confident that DCAT supports this already. The DCAT definition does
>>> not say whether the collection of data is in a single file or multiple
>>> files since a dcat:Dataset is an abstract concept that may be accessible
>>> by a distribution.
>>>
>>> dcterms:hasPart and dcterms:isPartOf are probably useful here, and I'd
>>> want to use those at the Dataset level, not the distribution level,
>>> something like:
>>>
>>> <readings-2014-11-08T00:00> a dcat:Dataset;
>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>
>>> <readings-2014-11-08T06:00> a dcat:Dataset;
>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>
>>> <readings-2014-11-08T12:00> a dcat:Dataset;
>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>
>>> <readings-2014-11-08T18:00> a dcat:Dataset;
>>>    dcterms:isPartOf <readings-2014-11-08> .
>>>
>>>
>>> <readings-2014-11-08> a dcat:Dataset;
>>>    dcterms:hasPart <readings-2014-11-08T00:00>;
>>>    dcterms:hasPart <readings-2014-11-08T06:00>;
>>>    dcterms:hasPart <readings-2014-11-08T12:00>;
>>>    dcterms:hasPart <readings-2014-11-08T18:00>;
>>>    dcat:distribution <readings-2014-11-08.zip> .
>>>
>>> <readings-2014-11-08.zip> a dcat:Distribution;
>>>    dcat:mediaType "application/zip" .
>>>
>>>
>>> The 4 timed readings and the collected readings for the day are all
>>> dcat:Datasets, i.e. they are all "A collection of data, published or
>>> curated by a single agent, and available for access or download in one or
>>> more formats."
>>>
>>> Would that work for you Laufer?
>>>
>>>
>>> On 07/11/2014 23:40, Laufer wrote:
>>>
>>>> I agree with you, Phil. But as there are many different definitions of
>>>> this term being used, we have to assert the definition that we would
>>>> accept.
>>>>
>>>> I think that we will also need to use a term to talk about bundles that
>>>> include multiple files, multiple datasets. Maybe container, package...
>>>>
>>>> As I understand it, DCAT's definition of dataset does not include a
>>>> dataset as a set of files, for example.
>>>>
>>>> Regards,
>>>> Laufer
>>>>
>>>> On Friday, 7 November 2014, Phil Archer <phila@w3.org> wrote:
>>>>
>>>>> I tried to word the issue relatively objectively just now in tracker,
>>>>> allowing for the possibility of the WG to come up with a definition of
>>>>> 'dataset' other than that in DCAT. More subjectively, I would
>>>>> personally be very opposed to any such redefinition unless there were
>>>>> very strong arguments for doing so.
>>>>>
>>>>> Phil.
>>>>>
>>>>>
>>>>> On 07/11/2014 14:25, Data on the Web Best Practices Working Group Issue
>>>>> Tracker wrote:
>>>>>
>>>>>> ISSUE-80: We need a definition of "dataset"
>>>>>>
>>>>>> http://www.w3.org/2013/dwbp/track/issues/80
>>>>>>
>>>>>> Raised by:
>>>>>> On product:
>>>>>>
>>>>>>   --
>>>>>>
>>>>>
>>>>>
>>>>> Phil Archer
>>>>> W3C Data Activity Lead
>>>>> http://www.w3.org/2013/data/
>>>>>
>>>>> http://philarcher.org
>>>>> +44 (0)7887 767755
>>>>> @philarcher1
>>>>>
>>>>>
>>>>>
>>>>>
>>
>>



-- 
.  .  .  .. .  .
.        .   . ..
.     ..       .
Received on Sunday, 9 November 2014 14:41:01 UTC