RE: ISSUE-80: We need a definition of "dataset" from Makx Dekkers on 2014-11-14 (public-dwbp-wg@w3.org from November 2014)

From: Makx Dekkers <mail@makxdekkers.com>
Date: Fri, 14 Nov 2014 17:19:21 +0100
To: "'Laufer'" <laufer@globo.com>
Cc: "'Ed Staub'" <estaub2@comcast.net>, "'DWBP WG'" <public-dwbp-wg@w3.org>
Message-ID: <000601d00026$c05f4250$411dc6f0$@makxdekkers.com>
Laufer,

I understand what you’re saying. I just don’t see how we could improve the definition of dataset, or even how a different definition would help us create best practices.

Makx.

From: Laufer [mailto:laufer@globo.com] 
Sent: Friday, November 14, 2014 4:01 PM
To: Makx Dekkers
Cc: Ed Staub; DWBP WG
Subject: Re: ISSUE-80: We need a definition of "dataset"

Makx,

I agree with you that DCAT's definition is good. The problem I see is whether, with this definition, DCAT can express (map) all the other definitions using the current DCAT data model, including the DCAT definition of distribution (we must also define this term), and whether our group should care if DCAT can do these mappings. As you also pointed out, and I agree, the issue of inheritance is also very broad and has different interpretations in different groups, and it would be impossible to define the "best" inheritance schema.

When, for example, a user publishes data on a CKAN platform, the DCAT description instance is invisible to her. The CKAN platform is responsible for generating a DCAT instance that corresponds to the datasets and distributions published by the user. The same holds for other publishing/distribution platforms. Can CKAN map its data model to DCAT's data model?
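
As a rough illustration of the kind of mapping in question, here is a minimal Python sketch. It assumes a CKAN-style package dictionary with `title`, `notes` and `resources` fields, and it emits a plain dictionary keyed by DCAT and Dublin Core property names rather than a validated RDF graph. The function name `ckan_package_to_dcat` and the example URLs are hypothetical, and this is not the code any platform actually uses to export DCAT.

```python
# Hypothetical sketch: describe one CKAN-style package in DCAT terms.
# Assumption: each CKAN package maps to one dcat:Dataset and each of its
# resources maps to one dcat:Distribution. Other mappings are possible.

def ckan_package_to_dcat(package):
    """Return a DCAT-like description (plain dict) of a CKAN-style package."""
    distributions = []
    for resource in package.get("resources", []):
        distributions.append({
            "@type": "dcat:Distribution",
            "dcat:accessURL": resource.get("url"),
            "dct:format": resource.get("format"),
        })
    return {
        "@type": "dcat:Dataset",
        "dct:title": package.get("title"),
        "dct:description": package.get("notes"),
        "dcat:distribution": distributions,
    }


# Made-up example input, not taken from a real catalogue.
example_package = {
    "title": "City budget 2014",
    "notes": "Annual budget figures.",
    "resources": [
        {"url": "http://example.org/budget.csv", "format": "CSV"},
        {"url": "http://example.org/budget.json", "format": "JSON"},
    ],
}

description = ckan_package_to_dcat(example_package)
```

The one-package-to-one-dataset choice is exactly the kind of decision that depends on which definition of dataset is adopted; a platform with a different notion of dataset would need a different mapping.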

I think that this issue divides into three issues:

1 - the DWBP WG definition of dataset;

2 - the DCAT definition of dataset;

3 - the mapping of other data models to DCAT's data model.

I agree that the best course for our WG would be to stay out of this discussion, adopt DCAT's definition, and not worry about the other issues. But I don't know whether we can leave the matter without stating in our documents all these issues of the data-on-the-web ecosystem. The fact, for me, is that in this ecosystem we have different definitions of dataset, with different implementations tied to those definitions.

I think that our suggested best practices should influence the publishing/distribution platforms in a way that could, in some sense, create a common definition of dataset/distribution, maybe the DCAT one or an extended version.

Best Regards,

Laufer

2014-11-14 11:18 GMT-02:00 Makx Dekkers <mail@makxdekkers.com>:

Ed,

In my mind, there is nothing that would prevent people from using DCAT for a
collection of unrelated data, and I don't think we want to tell them
they can't. Also, it would depend on someone's perspective on what
constitutes 'related'.

Again, my position is that the definition of dataset in DCAT is good
enough, and that we should not spend time trying to make it better.
(http://www.brainyquote.com/quotes/quotes/v/voltaire109643.html)

Makx.




> -----Original Message-----
> From: Ed Staub [mailto:ed.staub@semanterra.org] On Behalf Of Ed Staub
> Sent: Thursday, November 13, 2014 5:11 AM
> To: public-dwbp-wg@w3.org
> Subject: Re: ISSUE-80: We need a definition of "dataset"
>
> Note that the RDF Data Cube vocabulary has a different definition of
> "dataset" than DCAT:
>
> "Represents a collection of observations, possibly organized into
> various slices, conforming to some common dimensional structure."
>
> Assuming the DCAT definition is used, I think it useful to make clear
> that a "common dimensional structure" is not implied.  FWIW, my prior
> experience led me to assume the "common dimensional structure" meaning
> for DCAT until I dug into the DCAT spec.
>
>
> On the "too-broad" side, there probably are collections of data
> published or curated by a single agent that are larger than is intended
> by this definition.  In particular, I agree with Bernadette Lóscio in
> thinking that the collection's content should be related - not "a
> random assortment of data".  As an extreme example, imagine the entire
> content of datahub.io described as a single dataset!
>
>
> So... I'd suggest adding the word "related":
>
> "A related collection of data, published or curated by a single agent,
>    ^^^^^^^
> and available for access or download in one or more formats."
>
> The addition of "related" deals with both concerns at once; it would be
> strange and tautological to require all the data in a single cube to be
> "related".
>
>
> -Ed Staub
>
>







-- 

.  .  .  .. .  . 
.        .   . ..
.     ..       .
Received on Friday, 14 November 2014 16:19:53 UTC