Re: ISSUE-80: We need a definition of "dataset" from Phil Archer on 2014-11-09 (public-dwbp-wg@w3.org from November 2014)

From: Phil Archer <phila@w3.org>
Date: Sun, 09 Nov 2014 11:52:20 +0000
To: Laufer <laufer@globo.com>
CC: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
Message-ID: <545F5574.2040707@w3.org>
On 08/11/2014 17:06, Laufer wrote:
> I am not against the DCAT definition. What I am saying is that the
> DCAT notion of dataset does not address multiple datasets with
> different distributions that could form a bundle.

OK I was being a little lazy. The following RDF expands on my original 
example and is DCAT conformant. I'm thinking of some sort of sensor 
readings taken every 6 hours and made available in different formats. 
Once a day all formats are bundled up and available as a single day's 
readings in all original formats as well as a zip file with everything.

<#readings-2014-11-08T00:00> a dcat:Dataset;
   dcterms:isPartOf <#readings-2014-11-08> ;
   dcat:distribution <#readings-2014-11-08T00:00.csv> ;
   dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
   dcat:distribution <#readings-2014-11-08T00:00.html> ;
   dcat:distribution <#readings-2014-11-08T00:00.xml> ;
   dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
   dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .

<#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
   dcat:mediaType "text/csv" ;
   dcterms:isPartOf <#readings-2014-11-08.csv> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
   dcat:mediaType "application/pdf" ;
   dcterms:isPartOf <#readings-2014-11-08.pdf> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readings-2014-11-08T00:00.html> a dcat:Distribution ;
   dcat:mediaType "text/html" ;
   dcterms:isPartOf <#readings-2014-11-08.html> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
   dcat:mediaType "application/xml" ;
   dcterms:isPartOf <#readings-2014-11-08.xml> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
   dcat:mediaType "text/turtle" ;
   dcterms:isPartOf <#readings-2014-11-08.ttl> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
   dcat:mediaType "application/json" ;
   dcterms:isPartOf <#readings-2014-11-08.json> ;
   dcterms:isPartOf <#readings-2014-11-08.zip> .

<#readings-2014-11-08T06:00> a dcat:Dataset;
   dcterms:isPartOf <#readings-2014-11-08> ;
   dcat:distribution <#readings-2014-11-08T06:00.csv> ;
   dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
   dcat:distribution <#readings-2014-11-08T06:00.html> ;
   dcat:distribution <#readings-2014-11-08T06:00.xml> ;
   dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
   dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .

[.. Distribution descriptions]

<#readings-2014-11-08T12:00> a dcat:Dataset;
   dcterms:isPartOf <#readings-2014-11-08> ;
   dcat:distribution <#readings-2014-11-08T12:00.csv> ;
   dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
   dcat:distribution <#readings-2014-11-08T12:00.html> ;
   dcat:distribution <#readings-2014-11-08T12:00.xml> ;
   dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
   dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .

[.. Distribution descriptions]

<#readings-2014-11-08T18:00> a dcat:Dataset;
   dcterms:isPartOf <#readings-2014-11-08> ;
   dcat:distribution <#readings-2014-11-08T18:00.csv> ;
   dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
   dcat:distribution <#readings-2014-11-08T18:00.html> ;
   dcat:distribution <#readings-2014-11-08T18:00.xml> ;
   dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
   dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .

[.. Distribution descriptions]

<#readings-2014-11-08> a dcat:Dataset;
   dcterms:hasPart <#readings-2014-11-08T00:00>;
   dcterms:hasPart <#readings-2014-11-08T06:00>;
   dcterms:hasPart <#readings-2014-11-08T12:00>;
   dcterms:hasPart <#readings-2014-11-08T18:00>;
   dcat:distribution <#readings-2014-11-08.zip> ;
   dcat:distribution <#readings-2014-11-08.csv> ;
   dcat:distribution <#readings-2014-11-08.html> ;
   dcat:distribution <#readings-2014-11-08.pdf> ;
   dcat:distribution <#readings-2014-11-08.xml> ;
   dcat:distribution <#readings-2014-11-08.ttl> ;
   dcat:distribution <#readingsapi?date=2014-11-08&time=all> .

[.. Distribution descriptions]

Such a setup might also have:
<#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .

i.e. a Dataset can also be a Distribution. In this case content
negotiation (conneg) would determine which version you got back, and
I'm not sure of the best way to make this explicit. One could simply
make no statement about the format of the returned data, but I'm not
aware of a commonly accepted way of stating explicitly that the format
is negotiated. The HTTP response header 'Vary' does this job, but if we
want to make it explicit before the request is sent we'd need to do
some work (and find people who care!).
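
To make the conneg case concrete, here is a minimal Python sketch of a
server choosing among the Distributions of the 00:00 Dataset from the
request's Accept header and answering with 'Vary: Accept'. The function
names and the DISTRIBUTIONS table are my own assumptions for
illustration, not part of DCAT or of any library, and the matching is
deliberately simplified (it ranks by q value only and ignores the
specificity rules of full HTTP content negotiation).

```python
# Hypothetical table: media type -> Distribution IRI for one Dataset.
DISTRIBUTIONS = {
    "text/csv": "#readings-2014-11-08T00:00.csv",
    "application/pdf": "#readings-2014-11-08T00:00.pdf",
    "text/html": "#readings-2014-11-08T00:00.html",
    "application/xml": "#readings-2014-11-08T00:00.xml",
    "text/turtle": "#readings-2014-11-08T00:00.ttl",
}


def parse_accept(header):
    """Split an Accept header into (media_range, q) pairs, highest q first."""
    prefs = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        media_range = parts[0].lower()
        if not media_range:
            continue
        q = 1.0  # default quality when no q parameter is given
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                try:
                    q = float(value)
                except ValueError:
                    q = 0.0
        prefs.append((media_range, q))
    # sorted() is stable, so equal q values keep their header order.
    return sorted(prefs, key=lambda pair: pair[1], reverse=True)


def matches(media_range, media_type):
    """True if a media range such as 'text/*' covers a concrete media type."""
    if media_range == "*/*":
        return True
    if media_range.endswith("/*"):
        return media_type.startswith(media_range[:-1])
    return media_range == media_type


def negotiate(accept_header, distributions):
    """Return (distribution IRI, response headers), or None if nothing fits."""
    for media_range, q in parse_accept(accept_header):
        if q <= 0:
            continue  # q=0 means "not acceptable"
        for media_type, iri in distributions.items():
            if matches(media_range, media_type):
                headers = {"Content-Type": media_type, "Vary": "Accept"}
                return iri, headers
    return None


if __name__ == "__main__":
    print(negotiate("text/turtle;q=0.9, text/csv;q=0.5", DISTRIBUTIONS))
```

The point of the sketch is only that the choice is made at request
time: nothing in the RDF description tells a client beforehand that the
Dataset IRI is negotiable, which is the gap described above.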

Of course there's no need for each Dataset to have the same variety of 
Distributions as each other.


>
> In your example, Phil, there is only one file, the zip one. What if
> each of the files has different distributions? If you are sure that
> this case will never happen, and that multiple files will always be
> distributed as one single file, maybe the current definition of
> DCAT could be sufficient.
>
> For CKAN and DSPL, a dataset is always the set of files.
>
> I prefer to restrict the idea of dataset to a collection of resources (in
> the sense of rdf resources). I do not like the idea of using dataset as a
> collection of datasets. But we have to discuss and collect examples.

I'm afraid I don't think I understand your concern. dcat:Dataset is a
very abstract concept and says nothing about the number of files that
materialise it. The Distributions do that, and dcterms:isPartOf/hasPart
should cover it, I think. But of course, if there are cases where this
doesn't work then we will indeed need to look at them.
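
As a rough illustration of that split, the Python sketch below treats a
few of the statements from the example as plain (subject, predicate,
object) tuples, follows dcterms:hasPart / dcterms:isPartOf between
Datasets, and then reads off the Distributions that materialise each
one. The helper names and the reduced TRIPLES list are assumptions made
for the illustration; a real application would load the Turtle with an
RDF library instead.

```python
HAS_PART = "dcterms:hasPart"
IS_PART_OF = "dcterms:isPartOf"
DISTRIBUTION = "dcat:distribution"

# A reduced, hand-copied subset of the example graph (illustration only).
TRIPLES = [
    ("#readings-2014-11-08", HAS_PART, "#readings-2014-11-08T00:00"),
    ("#readings-2014-11-08T06:00", IS_PART_OF, "#readings-2014-11-08"),
    ("#readings-2014-11-08", DISTRIBUTION, "#readings-2014-11-08.zip"),
    ("#readings-2014-11-08T00:00", DISTRIBUTION,
     "#readings-2014-11-08T00:00.csv"),
    ("#readings-2014-11-08T00:00", DISTRIBUTION,
     "#readings-2014-11-08T00:00.ttl"),
    ("#readings-2014-11-08T06:00", DISTRIBUTION,
     "#readings-2014-11-08T06:00.csv"),
]


def direct_parts(dataset, triples):
    """Parts stated with hasPart on the whole or isPartOf on the part."""
    parts = set()
    for s, p, o in triples:
        if p == HAS_PART and s == dataset:
            parts.add(o)
        elif p == IS_PART_OF and o == dataset:
            parts.add(s)
    return parts


def all_parts(dataset, triples):
    """Transitively collect every Dataset that is part of `dataset`."""
    seen = set()
    todo = [dataset]
    while todo:
        current = todo.pop()
        for part in direct_parts(current, triples):
            if part not in seen and part != dataset:
                seen.add(part)
                todo.append(part)
    return seen


def distributions(dataset, triples):
    """The Distributions that materialise one Dataset."""
    return {o for s, p, o in triples if s == dataset and p == DISTRIBUTION}


def materialisations(dataset, triples):
    """Map the Dataset and each of its parts to their Distributions."""
    result = {dataset: distributions(dataset, triples)}
    for part in all_parts(dataset, triples):
        result[part] = distributions(part, triples)
    return result


if __name__ == "__main__":
    for ds, dists in sorted(materialisations("#readings-2014-11-08",
                                             TRIPLES).items()):
        print(ds, sorted(dists))
```

The number of files never appears on the Dataset itself; it falls out
of counting the Distributions attached to the Dataset and to its parts.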

>
> I think that this granularity is important. There would be metadata in each
> of these levels.

The CSV WG is reusing the idea of a package (with JSON metadata) but 
that's specifically about CSVs.

Does this help?

Phil.


>
> On Saturday, 8 November 2014, Phil Archer <phila@w3.org> wrote:
>
>> I'm confident that DCAT supports this already. The DCAT definition does
>> not say whether the collection of data is in a single file or multiple
>> files since a dcat:Dataset is an abstract concept that may be accessible by
>> a distribution.
>>
>> dcterms:hasPart and dcterms:isPartOf are probably useful here, and I'd
>> want to use those at the Dataset level, not the distribution level,
>> something like:
>>
>> <readings-2014-11-08T00:00> a dcat:Dataset;
>>    dcterms:isPartOf <readings-2014-11-08> .
>>
>> <readings-2014-11-08T06:00> a dcat:Dataset;
>>    dcterms:isPartOf <readings-2014-11-08> .
>>
>> <readings-2014-11-08T12:00> a dcat:Dataset;
>>    dcterms:isPartOf <readings-2014-11-08> .
>>
>> <readings-2014-11-08T18:00> a dcat:Dataset;
>>    dcterms:isPartOf <readings-2014-11-08> .
>>
>>
>> <readings-2014-11-08> a dcat:Dataset;
>>    dcterms:hasPart <readings-2014-11-08T00:00>;
>>    dcterms:hasPart <readings-2014-11-08T06:00>;
>>    dcterms:hasPart <readings-2014-11-08T12:00>;
>>    dcterms:hasPart <readings-2014-11-08T18:00>;
>>    dcat:distribution <readings-2014-11-08.zip> .
>>
>> <readings-2014-11-08.zip> a dcat:Distribution;
>>    dcat:mediaType "application/zip" .
>>
>>
>> The 4 timed readings and the collected readings for the day are all
>> dcat:Datasets, i.e. they are all "A collection of data, published or
>> curated by a single agent, and available for access or download in one or
>> more formats."
>>
>> Would that work for you Laufer?
>>
>>
>> On 07/11/2014 23:40, Laufer wrote:
>>
>>> I agree with you Phil. But as there are many different definitions of this
>>> term being used, we have to assert the definition that we would accept.
>>>
>>> I think that we will also need to use a term to talk about bundles that
>>> include multiple files, multiple datasets. Maybe container, package...
>>>
>>> As I understand, DCAT's definition of dataset does not include a dataset
>>> as a set of files, for example.
>>>
>>> Regards,
>>> Laufer
>>>
>>> On Friday, 7 November 2014, Phil Archer <phila@w3.org> wrote:
>>>
>>>> I tried to word the issue relatively objectively just now in tracker,
>>>> allowing for the possibility of the WG to come up with a definition of
>>>> 'dataset' other than that in DCAT. More subjectively, I would personally
>>>> be very opposed to any such redefinition unless there were very strong
>>>> arguments for doing so.
>>>>
>>>> Phil.
>>>>
>>>>
>>>> On 07/11/2014 14:25, Data on the Web Best Practices Working Group Issue
>>>> Tracker wrote:
>>>>
>>>>> ISSUE-80: We need a definition of "dataset"
>>>>>
>>>>> http://www.w3.org/2013/dwbp/track/issues/80
>>>>>
>>>>> Raised by:
>>>>> On product:
>>>>>   --
>>>>
>>>>
>>>> Phil Archer
>>>> W3C Data Activity Lead
>>>> http://www.w3.org/2013/data/
>>>>
>>>> http://philarcher.org
>>>> +44 (0)7887 767755
>>>> @philarcher1
>>>>
>>>>
>>>>
>>>
>
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
Received on Sunday, 9 November 2014 11:52:22 UTC