- From: Maik Riechert <m.riechert@reading.ac.uk>
- Date: Wed, 10 Feb 2016 12:07:06 +0000
- To: Rob Atkinson <rob@metalinkage.com.au>, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>, Jon Blower <j.d.blower@reading.ac.uk>
- Cc: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <56BB27EA.8000901@reading.ac.uk>
See below

On 04/02/2016 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> let's nicely describe a dimensional dataset (i.e. we can subset on ranges on any dimension) - it's kind of nice to use RDF-QB for this - as we can describe dimensions using SKOS, OWL etc - all very powerful and a lot more useful than DCAT for machines to use the data.
>
> (If DCAT is for cataloguing and discovery - then we should not overload it with the description that RDF-QB can provide.)
>
> so let's say we generate a dataset on the fly via an API (pre-prepared subsets provided as files are just a case of doing this at a different point in the delivery chain.)
>
> I would think it would be possible to take a DCAT and an RDF-QB description and generate a DCAT description for each subset - provided your description of the dimension is good enough to define the granularity of access. So the question of how to do it might boil down to: is there enough information to generate a new DCAT record on the fly?

Just for clarification: with "DCAT record", do you mean a DCAT dataset or a DCAT distribution? I would say the latter.

Cheers
Maik

> this needs more thought than I am giving it here - but I would have thought there should be enough information in such a DCAT record to
> a) distinguish it from other subsets and allow a search using the dimensions of the original dataset to find the DCAT record in a large catalog.
> b) be able to collate such subsets and rebuild the original data cube and its metadata (i.e. the domain of each dimension of the subset is retained, but its range is made explicit)
> c) define how it relates to the original dataset and the methods used to subset the data - to make it possible to re-create the dataset
>
> If DCAT can be used safely in these modes then how to use DCAT to describe data subsets should be clear. If you cannot support these approaches then IMHO you are better off avoiding DCAT and treating subsets as datasets - and move to a different information model designed explicitly for this.
>
> Rob Atkinson
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
>
> Hi Jon,
> I agree completely here. Many times we are 'forced' to partition data due to availability of, or improvements in, query techniques... or simply requests from customers!
> Our data(set) partitioning strategies are dependent on a multitude of data modeling assumptions and decisions. They can also be determined by the hardware and software we are using to persist and query the data.
> Lewis
>
> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk> wrote:
>
> Hi all,
>
> Just to chip in - I think that dataset partitioning is *not* (necessarily) intrinsic to the dataset [1], but is a property of data distribution (hence perhaps in scope for DCAT). A dataset might be partitioned differently depending on user preference. Some users may prefer a geographic partitioning, others may prefer a temporal partitioning. Still others might want to partition by variable. One can imagine different catalogues serving the “same” data to different users in different ways (and in fact this does happen with large-volume geographic data like satellite imagery or global models).
>
>>> like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets.
>
> I agree with this. I think this is the “level zero” requirement for partitioning.
>
> [1] Actually, it probably depends on what you mean by "the dataset". If you mean the logical entity, then the partitioning is not a property of the dataset. But if you regard the dataset as a set of physical files then maybe the partitioning *is* a property of the dataset.
>
> Cheers,
> Jon
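[Editorial aside: a minimal sketch of Jon's point, in Python with rdflib. It shows one logical dcat:Dataset whose temporal and geographic partitionings both hang off it as alternative sets of distributions, so a catalogue could offer either slicing of the "same" data without multiplying datasets. All URIs and file names are hypothetical, and the period modelling is deliberately simplistic.]

```python
# Sketch of Jon's point: partitioning as a property of the distribution,
# not of the (logical) dataset. All URIs and file names are hypothetical.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

ds = EX["global-temperature"]  # one logical dataset
g.add((ds, RDF.type, DCAT.Dataset))
g.add((ds, DCT.title, Literal("Global surface temperature")))

# Temporal partitioning: one distribution per year.
for year in (2014, 2015):
    dist = EX[f"global-temperature/{year}.nc"]
    g.add((ds, DCAT.distribution, dist))
    g.add((dist, RDF.type, DCAT.Distribution))
    g.add((dist, DCT.temporal, Literal(str(year))))  # simplistic period label
    g.add((dist, DCAT.downloadURL, dist))

# Geographic partitioning of the *same* dataset: one distribution per hemisphere.
for region in ("north", "south"):
    dist = EX[f"global-temperature/{region}.nc"]
    g.add((ds, DCAT.distribution, dist))
    g.add((dist, RDF.type, DCAT.Distribution))
    g.add((dist, DCT.spatial, EX[f"region/{region}"]))
    g.add((dist, DCAT.downloadURL, dist))

print(g.serialize(format="turtle"))
```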
>> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>>
>> Hi Frans,
>>
>> In my opinion, it all depends on how the actual data is made available. If it's a nice (possibly standard) API, then just link that as a distribution and you're done, I would say. Clients can explore subsets etc. through that API (which in itself should be self-describing and doesn't need any further metadata at the Distribution level, except media type if possible).
>>
>> However, if you really *just* have a bunch of files, as is quite common and which may be OK depending on data volume and intended users, then it gets more complicated if you want to allow efficient machine access without first being forced to download everything.
>>
>> So, yes, partitioning is intrinsic to the dataset, and that detail is exposed to DCAT to allow more efficient access, both for humans and machines. It is an optimization in the end, but in my opinion a quite useful one.
>>
>> I wonder how many different partition strategies are really used in the wild.
>>
>> Cheers
>> Maik
>>
>> On 03.02.2016 11:13, Frans Knibbe wrote:
>>> Hello Andrea, all,
>>>
>>> I like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets. DCAT does not seem necessary for those three. Is there really a need to see dataset partitioning as DCAT territory? DCAT is a vocabulary for data catalogs; I see dataset partitioning as something intrinsic to the dataset - its structure.
>>>
>>> That said, data about the structure of a dataset is metadata, so it is interesting to think about how data and metadata are coupled. For easy navigation through the structure (by either man or machine) it is probably best to keep the data volume small - metadata only. But it would be nice to have the option to get the actual data from any dataset (at any structural level). That means that additional elements are needed: an indication of ways to get the actual data, dcat:Distribution for instance. Also an indication of the size of the actual data would be very useful, to help decide whether to get the data or to dig a bit deeper for smaller subsets. Only at the highest level of the structure, the leaves of the tree, could the actual data be returned by default. A friendly data provider will take care that those subsets contain manageable volumes of data.
>>>
>>> My thoughts have little basis in practice, but I am trying to set up an experiment with spatially partitioned data. I think there are many interesting possibilities. I hope to be able to share something practical with the group soon.
>>>
>>> Regards,
>>> Frans
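[Editorial aside: Frans's "three semantic ingredients" are small enough to sketch directly, in the same rdflib style. The choice of dct:hasPart / dct:isPartOf for the subset/superset links is one plausible vocabulary option, not something the thread settles on, and the numbers and URIs are made up.]

```python
# Frans's three ingredients: (1) say a resource is a dataset,
# (2) point to subsets, (3) point to supersets -- plus his two extras:
# a way to get the data (dcat:Distribution) and a size hint (dcat:byteSize).
# dct:hasPart / dct:isPartOf is one plausible vocabulary choice, not a given.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, XSD

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

parent, child = EX["precip"], EX["precip/2015"]
for ds in (parent, child):
    g.add((ds, RDF.type, DCAT.Dataset))  # ingredient 1
g.add((parent, DCT.hasPart, child))      # ingredient 2: subset link
g.add((child, DCT.isPartOf, parent))     # ingredient 3: superset link

# Leaf-level access and size, so a client can decide whether to download
# the data or dig deeper for smaller subsets.
dist = EX["precip/2015.csv"]
g.add((child, DCAT.distribution, dist))
g.add((dist, RDF.type, DCAT.Distribution))
g.add((dist, DCAT.downloadURL, dist))
g.add((dist, DCAT.byteSize, Literal(734003200, datatype=XSD.decimal)))

print(g.serialize(format="turtle"))
```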
>>> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>>>
>>> Many thanks for sharing this work, Maik!
>>>
>>> Just a couple of notes from my side:
>>>
>>> 1. Besides temporal coverage, it may be worth adding spatial coverage to your scenarios as another criterion of dataset partitioning. Actually, both criteria are frequently used concurrently.
>>>
>>> 2. In many of the scenarios you describe, dataset subsets are modelled as datasets. An alternative would be to model them just as distributions. So, I wonder whether those scenarios have requirements that cannot be met by the latter option.
>>>
>>> Some more words on point (2):
>>>
>>> As you probably know, there has been quite a long discussion in the DCAT-AP WG concerning this issue. The main points are probably summarised in the conversation recorded here:
>>>
>>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>
>>> Of course, in DCAT-AP the objective was how to describe dataset subsets, not criteria for dataset subsetting.
>>>
>>> Notably, the discussion highlighted two different approaches: (a) dataset subsets modelled as datasets, or (b) dataset subsets modelled simply as distributions.
>>>
>>> I don't see the two scenarios above as mutually exclusive. You can use one or the other depending on your use case and requirements. And you can use both (e.g., referring to point (1): time-related subsets modelled as child datasets, and their space-related subsets as distributions). However, I personally favour the idea of using distributions as the recommended option, and datasets only if you cannot do otherwise. In particular, I see two main issues with the dataset-based approach:
>>>
>>> - It includes an additional step to get to the data (dataset -> dataset -> distribution). Moreover, subsetting can be recursive - which increases the number of steps needed to get to the data.
>>>
>>> - I understand that your focus is on data discovery from a machine perspective. However, looking at how this will be reflected in catalogues used by people, the result is that you're going to have a record for each child dataset, in addition to the parent one. This scenario is quite typical nowadays (I know quite a few examples of tens of records having the same title, description, etc. - or just a slightly different one), and it doesn't help at all people trying to find what they're looking for.
>>>
>>> Thanks
>>>
>>> Andrea
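[Editorial aside: to make Andrea's contrast concrete, here is a hedged sketch of his two approaches used together, reusing his combined example - time-related subsets as child datasets (a), space-related subsets as plain distributions of each child (b). URIs and tile names are hypothetical.]

```python
# Andrea's two modelling options, combined as he suggests:
# (a) time-related subsets as child datasets (note the extra hop to the data),
# (b) space-related subsets as plain distributions of each child dataset.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

parent = EX["landcover"]
g.add((parent, RDF.type, DCAT.Dataset))

for year in (2014, 2015):  # (a) one child dataset per year
    child = EX[f"landcover/{year}"]
    g.add((parent, DCT.hasPart, child))
    g.add((child, RDF.type, DCAT.Dataset))
    for tile in ("north", "south"):  # (b) one distribution per spatial tile
        dist = EX[f"landcover/{year}/{tile}.tif"]
        g.add((child, DCAT.distribution, dist))
        g.add((dist, RDF.type, DCAT.Distribution))
        g.add((dist, DCT.spatial, EX[f"tile/{tile}"]))
        g.add((dist, DCAT.downloadURL, dist))

# The cost Andrea flags: dataset -> dataset -> distribution is one hop more
# than modelling the yearly subsets directly as distributions of the parent.
print(g.serialize(format="turtle"))
```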
>>> On 02/02/2016 12:02, Maik Riechert wrote:
>>>
>>> Hi all,
>>>
>>> There has been a lot of discussion about subsetting data. I'd like to give a slightly different perspective which is purely motivated from the point of view of someone who wants to publish data, and in parallel someone who wants to discover and access that data without much hassle.
>>>
>>> Of course it is hard to think about all scenarios, so I picked what I think are common ones:
>>> - a bunch of static data files without any API
>>> - an API without static data files
>>> - both
>>>
>>> And then some specific variations on what structure the data has (yearly data files, daily, or another dimension used as splitting point, such as spatial).
>>>
>>> It is in no way final or complete and may even be wrong, but here is what I came up with:
>>> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>
>>> So it always starts by looking at what data exists and how it is exposed, and based on those constraints I tried to model that as DCAT datasets, sometimes with subdatasets. Again, it is purely motivated from a machine-access point of view. There may be other things to consider.
>>>
>>> The point of this wiki page is to have something concrete to discuss, and not just abstract ideas. It should uncover problems, possibly solutions, perspectives... etc.
>>>
>>> Happy to hear your thoughts,
>>> Maik
>>>
>>> --
>>> Andrea Perego, Ph.D.
>>> Scientific / Technical Project Officer
>>> European Commission DG JRC
>>> Institute for Environment & Sustainability
>>> Unit H06 - Digital Earth & Reference Data
>>> Via E. Fermi, 2749 - TP 262
>>> 21027 Ispra VA, Italy
>>>
>>> https://ec.europa.eu/jrc/
>
> --
> /Lewis/
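[Editorial aside: finally, Rob's straw man at the top of the thread, read together with Maik's clarification that each generated record is best thought of as a distribution, can also be sketched: if the dimension description states the granularity of access, per-subset DCAT records can be stamped out mechanically. Everything below - the dimension dict standing in for a real RDF-QB description, the URL template - is hypothetical scaffolding, not an agreed design.]

```python
# Rob's straw man: given a dimension description with explicit granularity
# (the dict below is a bare stand-in for what an RDF-QB dimension spec would
# say), generate one DCAT record per subset on the fly. Per Maik's
# clarification, each record is emitted as a dcat:Distribution of the parent.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")
EX = Namespace("http://example.org/")

# Hypothetical, minimal dimension description; a real one would be RDF-QB.
time_dim = {"name": "year", "values": range(2010, 2016)}
url_template = "http://example.org/api/obs?year={}"  # hypothetical API

def subset_records(dataset_uri):
    """Build a graph with one distribution record per point on the dimension."""
    g = Graph()
    g.bind("dcat", DCAT)
    g.bind("dct", DCT)
    g.add((dataset_uri, RDF.type, DCAT.Dataset))
    for v in time_dim["values"]:
        dist = EX[f"obs/{time_dim['name']}/{v}"]
        g.add((dataset_uri, DCAT.distribution, dist))
        g.add((dist, RDF.type, DCAT.Distribution))
        # Retain the dimension and make the subrange explicit -- Rob's point (b).
        g.add((dist, DCT.temporal, Literal(str(v))))
        g.add((dist, DCAT.accessURL, URIRef(url_template.format(v))))
    return g

print(subset_records(EX["obs"]).serialize(format="turtle"))
```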
Received on Wednesday, 10 February 2016 12:16:17 UTC