- From: Phil Archer <phila@w3.org>
- Date: Wed, 3 Feb 2016 09:55:14 +0000
- To: Maik Riechert <m.riechert@reading.ac.uk>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>
- Cc: public-sdw-comments@w3.org
As an aside, I'm trying to gather evidence for sufficient support to set up a WG specifically to update and extend DCAT. Lots of obstacles in the road, mostly around the small size of the intersection of the community that's most interested and the W3C Membership, but I'm trying...

On 03/02/2016 09:43, Maik Riechert wrote:
> Hi Andrea,
>
> Interesting points, exactly what I wanted to talk about.
>
> I tried to have spatial partitioning in scenario 3 for example, but I
> just saw that GitHub swallowed some of my <varname> notations. I'll fix
> that to make it clearer. But in the end it's similar to temporal
> partitioning, just a different dimension.
>
> Now, about modelling subsets as distributions. Currently, this is a no-go
> for me since you cannot indicate the subset extent (temporal, spatial,
> ...) inside a distribution element. Basically the same opinion as comment
> #2 in your linked page.
>
> Having said that, I agree that it would be way more convenient to manage
> as distributions, on many levels. But then someone has to step up and
> define that I can use dct:temporal and dct:spatial (as defined in
> GeoDCAT-AP) within a Distribution. Of course you can do that already,
> but no one cares and it will be ignored because it is not recommended
> anywhere.
>
> That still doesn't solve partitioning across other dimension types, but
> it is a start and should work for the majority of datasets.
>
> I guess you could use actual sub datasets then for cases where each sub
> dataset is a stand-alone product in its own right that can be used without
> knowing about the siblings. The parent would then just group them
> together for convenience/discoverability. But it's a fuzzy line and I
> currently don't know when I would recommend subset-as-distribution vs.
> subset-as-dataset if both are allowed/recommended.
>
> Cheers
> Maik
>
>
> On 03.02.2016 at 09:05, Andrea Perego wrote:
>> Many thanks for sharing this work, Maik!
>>
>> Just a couple of notes from my side:
>>
>> 1. Besides temporal coverage, it may be worth adding in your scenarios
>> also spatial coverage as another criterion of dataset partitioning.
>> Actually, both criteria are frequently used concurrently.
>>
>> 2. In many of the scenarios you describe, dataset subsets are modelled
>> as datasets. An alternative would be to model them just as
>> distributions. So, I wonder whether those scenarios have requirements
>> that cannot be met by the latter option.
>>
>> Some more words on point (2):
>>
>> As you probably know, there has been quite a long discussion in the
>> DCAT-AP WG concerning this issue. The main points are probably
>> summarised in the conversation recorded here:
>>
>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>
>>
>> Of course, in DCAT-AP the objective was how to describe dataset
>> subsets, not the criteria for dataset subsetting.
>>
>> Notably, the discussion highlighted two different approaches: (a)
>> dataset subsets modelled as datasets or (b) dataset subsets modelled
>> simply as distributions.
>>
>> I don't see the two approaches above as mutually exclusive. You can use
>> one or the other depending on your use case and requirements. And you
>> can use both (e.g., referring to point (1): time-related subsets
>> modelled as child datasets, and their space-related subsets as
>> distributions). However, I personally favour the idea of using
>> distributions as the recommended option, and datasets only if you
>> cannot do otherwise. In particular, I see two main issues with the
>> dataset-based approach:
>>
>> - It includes an additional step to get to the data (dataset ->
>> dataset -> distribution). Moreover, subsetting can be recursive -
>> which increases the number of steps needed to get to the data.
>>
>> - I understand that your focus is on data discovery from a machine
>> perspective. However, looking at how this will be reflected in
>> catalogues used by people, the result is that you're going to have a
>> record for each child dataset, in addition to the parent one. This
>> scenario is quite typical nowadays (I know quite a few examples of
>> tens of records having the same title, description, etc. - or just a
>> slightly different one), and it doesn't help at all people trying to
>> find what they're looking for.
>>
>> Thanks
>>
>> Andrea
>>
>>
>> On 02/02/2016 12:02, Maik Riechert wrote:
>>> Hi all,
>>>
>>> There has been a lot of discussion about subsetting data. I'd like to
>>> give a slightly different perspective which is purely motivated from the
>>> point of view of someone who wants to publish data, and in parallel
>>> someone who wants to discover and access that data without much hassle.
>>>
>>> Of course it is hard to think about all scenarios, so I picked what I
>>> think are common ones:
>>> - a bunch of static data files without any API
>>> - an API without static data files
>>> - both
>>>
>>> And then some specific variations on what structure the data has (yearly
>>> data files, daily, or another dimension used as splitting point, such as
>>> spatial).
>>>
>>> It is in no way final or complete and may even be wrong, but here is
>>> what I came up with:
>>> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>
>>> So it always starts by looking at what data exists and how it is
>>> exposed, and based on those constraints I tried to model that as DCAT
>>> datasets, sometimes with subdatasets. Again, it is purely motivated from
>>> a machine-access point of view. There may be other things to consider.
>>>
>>> The point of this wiki page is to have something concrete to discuss
>>> and not just abstract ideas. It should uncover problems, possibly
>>> solutions, perspectives... etc.
>>>
>>> Happy to hear your thoughts,
>>> Maik
>>>
>>
>
>
>

--
Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1
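
The two modelling options debated in this thread, (a) subsets as child datasets and (b) subsets as distributions, can be compared with a minimal sketch. It is only an illustration under stated assumptions: the records are plain Python dicts, the titles and example.org URLs are invented, the use of dct:hasPart to link parent and child datasets is an assumption, and dct:temporal on a Distribution is the not-yet-recommended usage that Maik describes. The helper hops_to_data is hypothetical and simply counts how many links a client follows to reach the download for a given year.

```python
# Hypothetical catalogue records as plain dicts; all titles and URLs are invented.

def distribution(url, temporal=None):
    d = {"type": "dcat:Distribution", "dcat:downloadURL": url}
    if temporal is not None:
        # Assumption: dct:temporal on a Distribution, which DCAT does not
        # currently recommend (the gap pointed out in the thread).
        d["dct:temporal"] = temporal
    return d

def dataset(title, temporal=None, distributions=(), parts=()):
    return {
        "type": "dcat:Dataset",
        "dct:title": title,
        "dct:temporal": temporal,
        "dcat:distribution": list(distributions),
        # Assumption: dct:hasPart links a parent dataset to its child datasets.
        "dct:hasPart": list(parts),
    }

def hops_to_data(ds, year, depth=1):
    """Count links followed from `ds` to the download covering `year`.

    A distribution matches if it states the year itself, or if it states
    no extent and its owning dataset covers exactly that year.
    Returns None when no matching download is found.
    """
    for dist in ds["dcat:distribution"]:
        own = dist.get("dct:temporal")
        if own == year or (own is None and ds["dct:temporal"] == year):
            return depth
    for child in ds["dct:hasPart"]:
        found = hops_to_data(child, year, depth + 1)
        if found is not None:
            return found
    return None

# Option (a): each yearly subset is a child dataset with one distribution.
as_datasets = dataset(
    "Yearly observations",
    parts=[
        dataset("Yearly observations 2014", "2014",
                [distribution("http://example.org/data/2014.nc")]),
        dataset("Yearly observations 2015", "2015",
                [distribution("http://example.org/data/2015.nc")]),
    ],
)

# Option (b): one dataset; each yearly subset is a distribution carrying its extent.
as_distributions = dataset(
    "Yearly observations",
    distributions=[
        distribution("http://example.org/data/2014.nc", "2014"),
        distribution("http://example.org/data/2015.nc", "2015"),
    ],
)

print(hops_to_data(as_datasets, "2015"))
print(hops_to_data(as_distributions, "2015"))
```

In this sketch option (a) needs one more hop than option (b), which mirrors Andrea's first concern, while option (b) only works because each distribution states its own extent, which mirrors Maik's objection.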
Received on Wednesday, 3 February 2016 09:56:17 UTC