- From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
- Date: Tue, 11 Nov 2014 11:33:33 -0200
- To: Steven Adler <adler1@us.ibm.com>
- Cc: Laufer <laufer@globo.com>, Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
- Message-ID: <CANx1Pzx-ruAUpvuB65=bDZjzHE363132hV11RPH6=ReaOPHFXg@mail.gmail.com>
+1 to Makx!

But I think we should have a better understanding of how to specify a dataset using the DCAT definition. We should also have examples that illustrate dataset definitions and their corresponding distributions. Phil sent an example of a dataset definition and I sent another version of the same example. If possible, it would be nice to discuss these ideas as well.

Thanks!
Bernadette

2014-11-11 11:26 GMT-02:00 Steven Adler <adler1@us.ibm.com>:

> +1. Well said.
>
> Best Regards,
>
> Steve
>
> Motto: "Do First, Think, Do it Again"
>
> From: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
> To: Laufer <laufer@globo.com>
> Cc: Data on the Web Best Practices Working Group <public-dwbp-wg@w3.org>
> Date: 11/10/2014 04:21 PM
> Subject: Re: ISSUE-80: We need a definition of "dataset"
> ------------------------------
>
> Hi all,
>
> I like the idea of using the dataset definition from DCAT and I fully agree with Makx that "we should not try to redefine what's already well-defined".
>
> I like Phil's examples, but I think that we still need to clarify the meaning of a dataset. To do this, I'd like to make a comparison between DCAT concepts and database concepts.
>
> In the database world, a database is defined as "a collection of related data. By data, we mean known facts that can be recorded and that have implicit meaning". Moreover, "a database is a logically coherent collection of data with some inherent meaning. A random assortment of data cannot correctly be referred to as a database." [1]
>
> The relational model represents the database as a collection of relations (or tables).
> Informally, each relation resembles a table of values or, to some extent, a flat file of records. For example, to construct the database that describes a university, we store data to represent each student, course, section, grade report, and prerequisite as a record in the appropriate table.
>
> In DCAT, a dataset is a collection of data, published or curated by a single agent, and available for access or download in one or more formats. A data catalog is a curated collection of metadata about datasets. In DCAT there is no notion of related data or of a collection of data with some inherent meaning.
>
> In my opinion, if we make a comparison between DCAT and databases, a *dataset* is similar to a database, and the data organization of a given dataset will depend on the data model used to represent the data.
>
> Considering that a given dataset may have multiple distributions, the organization of files will depend on the data model of each distribution (e.g. csv, xml, rdf, json). For example, a csv distribution may have multiple tables and an xml distribution may have one or more xml documents. When a given distribution has more than one file (e.g. multiple csv files), I agree with Phil that we can use dcterms:isPartOf to associate multiple files with a given distribution.
>
> In other words, I think that DCAT's definition is more abstract and doesn't concern the organization of the data in different files. This should be defined by the data model used in each distribution.
>
> Concerning the metadata, I think that there will be metadata related to the dataset and metadata related to each available distribution, as proposed by DCAT.
>
> Considering this context, in Phil's example, instead of having several datasets, there will be one dataset and six distributions. Each distribution will be composed of several files.
> The composition of a distribution will depend on the data model (e.g. csv tables, xml documents). In the following, I present a new proposal for the dataset definition and its corresponding csv distribution (other distributions are similar).
>
>   <#sensor-readings> a dcat:Dataset ;
>     dcat:distribution <#sensor-readings.csv> ;
>     dcat:distribution <#sensor-readings.pdf> ;
>     dcat:distribution <#sensor-readings.html> ;
>     dcat:distribution <#sensor-readings.xml> ;
>     dcat:distribution <#sensor-readings.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>
>   <#sensor-readings.csv> a dcat:Distribution ;
>     dcterms:isPartOf <#sensor-readings> ;
>     dcterms:hasPart <#readings-2014-11-08T00:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-08T06:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-08T12:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-08T18:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-08.zip> ;
>     dcterms:hasPart <#readings-2014-11-08.csv> ;
>     dcterms:hasPart <#readings-2014-11-09T00:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-09T06:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-09T12:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-09T18:00.csv> ;
>     dcterms:hasPart <#readings-2014-11-09.zip> ;
>     dcterms:hasPart <#readings-2014-11-09.csv> .
>
>   <#readings-2014-11-08T00:00.csv> a csv:table ;
>     dcat:format "text/csv" .
>
>   <#readings-2014-11-08T06:00.csv> a csv:table ;
>     dcat:format "text/csv" .
>
>   <#readings-2014-11-08T18:00.csv> a csv:table ;
>     dcat:format "text/csv" .
>
>   [...Table descriptions]
>
>   [...Distribution descriptions]
>
> In this case, the publication of new data, i.e., a new sensor reading, implies the insertion of a new file in each of the distributions. It is important to note that data is distributed at two different levels of granularity (per day and every six hours).
>
> I'm not sure how to define the type of a specific file.
> For example, I used "csv:table" to define the type of <#readings-2014-11-08T00:00.csv>. However, I think that the csv model doesn't define this.
>
> @Phil, please let me know if this definition makes sense to you and if it is DCAT conformant.
>
> I think we can use the dataset definition from DCAT; however, it is important to have a better understanding of how to specify datasets and their corresponding distributions.
>
> I'm sorry for the long message :)
>
> kind regards,
> Bernadette
>
> [1] Ramez Elmasri, Shamkant B. Navathe, "Fundamentals of Database Systems", 6th Edition. ISBN-13: 978-0136086208
>
> 2014-11-09 11:40 GMT-03:00 Laufer <laufer@globo.com>:
>
> Ok, Phil.
>
> Let's continue with the example.
>
> Suppose that there is a metadata type, for example, license (a metadata type that exists in the DCAT spec), that applies to these datasets and is common to all files.
>
> DCAT defines a license property for the catalog and "Even if the license of the catalog applies to all of its datasets and distributions, it should be replicated on each distribution."
>
> In the CSV WG they discussed levels of metadata definitions, with ways of linking the metadata file to the CSV file. They defined a priority chain that states that the inner definition has a higher priority level, so if, for example, there is a definition of the same type of metadata related to a package and to a file, the metadata related to the file will be the valid one.
>
> Will we assume that metadata related to a dataset that has parts applies to all of its parts, or, as with the dcat:license for the catalog, will the same metadata have to be linked to all distributions of all datasets that are parts of the dataset that groups them all?
>
> Reading the DCAT spec, I feel that the way of grouping datasets is the catalog. A catalog has parts that are the datasets. But the type catalog is different from the type dataset.
> In your example, a dataset that groups other datasets could be seen as a type of catalog, a hierarchy in the definition of the catalog. And, at the same time, be a dataset.
>
> Maybe I am not seeing things correctly, but I think that here we are defining a type of dataset grouping that is not addressed in the DCAT spec. The use of dcterms:hasPart and dcterms:isPartOf is interesting. Will the DWBP WG recommend that?
>
> In the CSV WG they have the idea of metadata inheritance. The semantics of dcterms:hasPart and dcterms:isPartOf say nothing about inheritance. I am not saying that we will have inheritance (or not), but it is a thing that is common when we have collections, packages, etc. The DWBP WG will have to make this issue explicit to the users. Will our extension of DCAT address this issue?
>
> I am not sure that replicating the information of all types of metadata that are common to a group of datasets is the best solution, or a thing that users usually do. I guess that this issue was probably exhaustively discussed when defining DCAT. Sorry about the repetition.
>
> Best Regards,
> Laufer
>
> 2014-11-09 9:52 GMT-02:00 Phil Archer <phila@w3.org>:
>
> On 08/11/2014 17:06, Laufer wrote:
> > I am not against the definition of DCAT. What I am saying is that the dataset definition in DCAT does not address multiple datasets with different distributions that could be a bundle.
>
> OK I was being a little lazy. The following RDF expands on my original example and is DCAT conformant. I'm thinking of some sort of sensor readings taken every 6 hours and made available in different formats. Once a day all formats are bundled up and available as a single day's readings in all original formats as well as a zip file with everything.
>
>   <#readings-2014-11-08T00:00> a dcat:Dataset ;
>     dcterms:isPartOf <#readings-2014-11-08> ;
>     dcat:distribution <#readings-2014-11-08T00:00.csv> ;
>     dcat:distribution <#readings-2014-11-08T00:00.pdf> ;
>     dcat:distribution <#readings-2014-11-08T00:00.html> ;
>     dcat:distribution <#readings-2014-11-08T00:00.xml> ;
>     dcat:distribution <#readings-2014-11-08T00:00.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=00:00> .
>
>   <#readings-2014-11-08T00:00.csv> a dcat:Distribution ;
>     dcat:format "text/csv" ;
>     dcterms:isPartOf <#readings-2014-11-08.csv> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readings-2014-11-08T00:00.pdf> a dcat:Distribution ;
>     dcat:format "application/pdf" ;
>     dcterms:isPartOf <#readings-2014-11-08.pdf> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readings-2014-11-08T00:00.html> a dcat:Distribution ;
>     dcat:format "text/html" ;
>     dcterms:isPartOf <#readings-2014-11-08.html> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readings-2014-11-08T00:00.xml> a dcat:Distribution ;
>     dcat:format "application/xml" ;
>     dcterms:isPartOf <#readings-2014-11-08.xml> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readings-2014-11-08T00:00.ttl> a dcat:Distribution ;
>     dcat:format "text/turtle" ;
>     dcterms:isPartOf <#readings-2014-11-08.ttl> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readingsapi?date=2014-11-08&time=00:00> a dcat:Distribution ;
>     dcat:format "application/json" ;
>     dcterms:isPartOf <#readings-2014-11-08.json> ;
>     dcterms:isPartOf <#readings-2014-11-08.zip> .
>
>   <#readings-2014-11-08T06:00> a dcat:Dataset ;
>     dcterms:isPartOf <#readings-2014-11-08> ;
>     dcat:distribution <#readings-2014-11-08T06:00.csv> ;
>     dcat:distribution <#readings-2014-11-08T06:00.pdf> ;
>     dcat:distribution <#readings-2014-11-08T06:00.html> ;
>     dcat:distribution <#readings-2014-11-08T06:00.xml> ;
>     dcat:distribution <#readings-2014-11-08T06:00.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=06:00> .
>
>   [.. Distribution descriptions]
>
>   <#readings-2014-11-08T12:00> a dcat:Dataset ;
>     dcterms:isPartOf <#readings-2014-11-08> ;
>     dcat:distribution <#readings-2014-11-08T12:00.csv> ;
>     dcat:distribution <#readings-2014-11-08T12:00.pdf> ;
>     dcat:distribution <#readings-2014-11-08T12:00.html> ;
>     dcat:distribution <#readings-2014-11-08T12:00.xml> ;
>     dcat:distribution <#readings-2014-11-08T12:00.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=12:00> .
>
>   [.. Distribution descriptions]
>
>   <#readings-2014-11-08T18:00> a dcat:Dataset ;
>     dcterms:isPartOf <#readings-2014-11-08> ;
>     dcat:distribution <#readings-2014-11-08T18:00.csv> ;
>     dcat:distribution <#readings-2014-11-08T18:00.pdf> ;
>     dcat:distribution <#readings-2014-11-08T18:00.html> ;
>     dcat:distribution <#readings-2014-11-08T18:00.xml> ;
>     dcat:distribution <#readings-2014-11-08T18:00.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=18:00> .
>
>   [.. Distribution descriptions]
>
>   <#readings-2014-11-08> a dcat:Dataset ;
>     dcterms:hasPart <#readings-2014-11-08T00:00> ;
>     dcterms:hasPart <#readings-2014-11-08T06:00> ;
>     dcterms:hasPart <#readings-2014-11-08T12:00> ;
>     dcterms:hasPart <#readings-2014-11-08T18:00> ;
>     dcat:distribution <#readings-2014-11-08.zip> ;
>     dcat:distribution <#readings-2014-11-08.csv> ;
>     dcat:distribution <#readings-2014-11-08.html> ;
>     dcat:distribution <#readings-2014-11-08.pdf> ;
>     dcat:distribution <#readings-2014-11-08.xml> ;
>     dcat:distribution <#readings-2014-11-08.ttl> ;
>     dcat:distribution <#readingsapi?date=2014-11-08&time=all> .
>
>   [.. Distribution descriptions]
>
> Such a set up might also have
>
>   <#readings-2014-11-08T00:00> a dcat:Dataset, dcat:Distribution .
>
> i.e. a Dataset can also be a Distribution. In this case conneg (content negotiation) would determine which version you got back, and I'm not sure of the best way to make this explicit.
> One could simply make no statement about the format of the returned data but I'm not aware of a commonly accepted way of stating this explicitly. The HTTP Response header 'Vary' does this job but if we want to make it explicit before the request is sent we'd need to do some work (and find people who care!).
>
> Of course there's no need for each Dataset to have the same variety of Distributions as each other.
>
> > In your example, Phil, there is only one file, the zip one. What if each of the files has different distributions? If you are sure that this case will never happen, and that multiple files will always be distributed in one single file, maybe the current definition of DCAT could be sufficient.
> >
> > For Ckan and DSPL, dataset is always the set of files.
> >
> > I prefer to restrict the idea of dataset to a collection of resources (in the sense of rdf resources). I do not like the idea of using dataset as a collection of datasets. But we have to discuss and collect examples.
>
> I don't think I'm understanding your concern I'm afraid. dcat:Dataset is a very abstract concept and says nothing about the number of files that materialise it. The Distributions do that and dcterms:isPartOf/hasPart should cover it, I think but, of course, if there are cases where this doesn't work then we will indeed need to look at them.
>
> > I think that this granularity is important. There would be metadata in each of these levels.
>
> The CSV WG is reusing the idea of a package (with JSON metadata) but that's specifically about CSVs.
>
> Does this help?
>
> Phil.
>
> On Saturday, 8 November 2014, Phil Archer <phila@w3.org> wrote:
>
> I'm confident that DCAT supports this already. The DCAT definition does not say whether the collection of data is in a single file or multiple files since a dcat:Dataset is an abstract concept that may be accessible by a distribution.
>
> dcterms:hasPart and dcterms:isPartOf are probably useful here, and I'd want to use those at the Dataset level, not the distribution level, something like:
>
>   <readings-2014-11-08T00:00> a dcat:Dataset ;
>     dcterms:isPartOf <readings-2014-11-08> .
>
>   <readings-2014-11-08T06:00> a dcat:Dataset ;
>     dcterms:isPartOf <readings-2014-11-08> .
>
>   <readings-2014-11-08T12:00> a dcat:Dataset ;
>     dcterms:isPartOf <readings-2014-11-08> .
>
>   <readings-2014-11-08T18:00> a dcat:Dataset ;
>     dcterms:isPartOf <readings-2014-11-08> .
>
>   <readings-2014-11-08> a dcat:Dataset ;
>     dcterms:hasPart <readings-2014-11-08T00:00> ;
>     dcterms:hasPart <readings-2014-11-08T06:00> ;
>     dcterms:hasPart <readings-2014-11-08T12:00> ;
>     dcterms:hasPart <readings-2014-11-08T18:00> ;
>     dcat:distribution <readings-2014-11-08.zip> .
>
>   <readings-2014-11-08.zip> a dcat:Distribution ;
>     dcat:mediaType "application/zip" .
>
> The 4 timed readings and the collected readings for the day are all dcat:Datasets, i.e. they are all "A collection of data, published or curated by a single agent, and available for access or download in one or more formats."
>
> Would that work for you Laufer?
>
> On 07/11/2014 23:40, Laufer wrote:
>
> I agree with you Phil. But as there are many different definitions of this term being used, we have to assert the definition that we would accept.
>
> I think that we will also need to use a term to talk about bundles that include multiple files, multiple datasets. Maybe container, package...
>
> As I understand, DCAT's definition of dataset does not include a dataset as a set of files, for example.
>
> Regards,
> Laufer
>
> On Friday, 7 November 2014, Phil Archer <phila@w3.org> wrote:
>
> I tried to word the issue relatively objectively just now in tracker, allowing for the possibility of the WG to come up with a definition of 'dataset' other than that in DCAT.
> More subjectively, I would personally be very opposed to any such redefinition unless there were very strong arguments for doing so.
>
> Phil.
>
> On 07/11/2014 14:25, Data on the Web Best Practices Working Group Issue Tracker wrote:
>
> ISSUE-80: We need a definition of "dataset"
>
> http://www.w3.org/2013/dwbp/track/issues/80
>
> Raised by:
> On product:
>
> --
> Phil Archer
> W3C Data Activity Lead
> http://www.w3.org/2013/data/
>
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1

--
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
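
A minimal sketch of the priority chain Laufer describes from the CSV WG, where an inner metadata definition overrides an outer one. Neither DCAT nor dcterms:hasPart/isPartOf defines such inheritance, so the function name, the four levels, and the license values below are assumptions made purely for illustration.

```python
from collections import ChainMap


def effective_metadata(file_meta, distribution_meta, dataset_meta, catalog_meta):
    """Resolve metadata so that the innermost level that defines a key wins.

    Hypothetical helper: the four levels and their order are an assumption
    based on the "inner definition has higher priority" rule in the thread.
    """
    # ChainMap searches its mappings left to right, so the file level is
    # consulted first and the catalog level last.
    return dict(ChainMap(file_meta, distribution_meta, dataset_meta, catalog_meta))


# Illustrative values only; none of these come from the DCAT examples above.
catalog = {"license": "CC-BY-4.0"}
dataset = {"title": "Sensor readings"}
distribution = {"format": "text/csv"}
file_level = {"license": "CC0-1.0"}

resolved = effective_metadata(file_level, distribution, dataset, catalog)
inherited = effective_metadata({}, distribution, dataset, catalog)
```

Here `resolved` takes its license from the file level, while `inherited` falls back to the catalog license because no inner level defines one. Replicating the license on every distribution, as the DCAT text quoted above asks, would make this kind of lookup unnecessary.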
Attachments
- image/gif attachment: graycol.gif
- image/gif attachment: ecblank.gif
Received on Tuesday, 11 November 2014 13:34:25 UTC