Re: DCAT Last Call issues: definition of Dataset -- examples from John Erickson on 2013-03-25 (public-gld-wg@w3.org from March 2013)

From: John Erickson <olyerickson@gmail.com>
Date: Tue, 26 Mar 2013 00:19:28 +0100
To: Richard Cyganiak <richard@cyganiak.de>
Cc: Makx Dekkers <makx@makxdekkers.com>, Public GLD WG <public-gld-wg@w3.org>
Message-ID: <CAC1Gg8RmGvhVF7mbgduhiZf22qPOXv2dLDAZ4mRv9bLkAcyAVA@mail.gmail.com>

I'm with Richard on this one; I think DCAT needs to be viewed in a
similar light to Dublin Core, which clarifies much ambiguity but does
not solve all problems for all use cases. And DCAT is definitely NOT
just about linked data; indeed while the publication of data will
increasingly follow the linked data model, DCAT is useful for a wide
variety of data that does not. No minimum number of TBL stars...

Being fresh off the Research Data Alliance kick-off in Göteborg last
week, I can tell you that even the scientific data community is all
over the map regarding what a "dataset" is. For one introduction to
this, see Allen Renear's oft-cited "Definitions of Dataset in the
Scientific and Technical Literature" <http://bit.ly/ZQ0SEh>. Community
members don't agree on precisely what datasets are, but they DO agree
that datasets need to be persistently and unambiguously identified,
and mostly agree that they should be unambiguously typed. More about
THAT later ;)

Frankly I think CKAN's working definition of "dataset" is useful here
--- an aggregation of arbitrary data resources --- although I would
probably prefer that level of abstraction to be an "object" or "data
object" (following the Kahn/Wilensky notion of the digital object).

On Mon, Mar 25, 2013 at 6:51 PM, Richard Cyganiak <richard@cyganiak.de> wrote:
> On 25 Mar 2013, at 16:03, "Makx Dekkers" <makx@makxdekkers.com> wrote:
>> DCAT defines Dataset as "A collection of data, published or curated by a single source, and available for access or download in one or more formats".
>>
>> This definition does not give a clear indication of characteristics that distinguish a Dataset from a more general rdfs:Resource. Would it be possible to at least provide some examples of existing resources that fall within this definition, and (even more importantly) some examples that do not?
>>
>> In a conversation on the public mailing list (http://lists.w3.org/Archives/Public/public-gld-wg/2012Sep/0062.html), it was mentioned that “Any file stored on disk is a data set”.  This implies that any machine-readable information (including PDF files!) can be considered a dcat:Dataset. That doesn’t sound right to me.
>
> We've had that discussion many times. The best definition of “dataset” I've heard is still: “A set of data.”
>
> I don't see why a PDF file containing a big table shouldn't be considered a dataset. That's not the most useful form for re-use, of course.
>
> You seem to be suggesting that datasets must have some minimum number of TimBL stars [1] in order to be described with DCAT. I don't think such a restriction helps anybody.
>
> Best,
> Richard
>
> [1] http://5stardata.info

-- 
John S. Erickson, Ph.D.
Director, Web Science Operations
Tetherless World Constellation (RPI)
<http://tw.rpi.edu> <olyerickson@gmail.com>
Twitter & Skype: olyerickson

Received on Monday, 25 March 2013 23:20:00 UTC