- From: Rob Atkinson <rob@metalinkage.com.au>
- Date: Wed, 10 Feb 2016 21:58:12 +0000
- To: Maik Riechert <m.riechert@reading.ac.uk>, Rob Atkinson <rob@metalinkage.com.au>, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>, Jon Blower <j.d.blower@reading.ac.uk>
- Cc: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <CACfF9LxKGDV9vOejN8hmu4Gt0WtOveYvi7CPzMo4Fpoa3qit7w@mail.gmail.com>
I was really thinking about being parsimonious with the information management - having the smallest number of manual (meta)data curation tasks and the maximum consistency and usefulness in the derived information. This smacks of OLAP with data warehouses to meet user needs, but highly normalised transactional backends to manage data in the fastest, most reliable and cheapest way.

How the DCAT description is best realised is a separate issue - and this depends on what operations clients expect to be able to do using it (what are the Use Cases here?)

Rob

On Wed, 10 Feb 2016 at 23:07 Maik Riechert <m.riechert@reading.ac.uk> wrote:

> See below
>
> On 04/02/2016 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> Let's nicely describe a dimensional dataset (i.e. we can subset on ranges on any dimension) - it's kind of nice to use RDF-QB for this, as we can describe dimensions using SKOS, OWL etc. - all very powerful and a lot more useful than DCAT for machines to use the data.
>
> (If DCAT is for cataloguing and discovery, then we should not overload it with the description that RDF-QB can provide.)
>
> So let's say we generate a dataset on the fly via an API (pre-prepared subsets provided as files are just a case of doing this at a different point in the delivery chain).
>
> I would think it would be possible to take a DCAT and an RDF-QB description and generate a DCAT description for each subset - provided your description of the dimension is good enough to define the granularity of access. So the question of how to do it might boil down to: is there enough information to generate a new DCAT record on the fly?
>
> Just for clarification: with "DCAT record", do you mean a DCAT dataset or a DCAT distribution? I would say the latter.
>
> Cheers
> Maik
>
> This needs more thought than I am giving it here - but I would have thought there should be enough information in such a DCAT record to:
> a) distinguish it from other subsets, and allow a search using the dimensions of the original dataset to find the DCAT record in a large catalog;
> b) collate such subsets and rebuild the original data cube and its metadata (i.e. the domain of each dimension of the subset is retained, but its range is made explicit);
> c) define how it relates to the original dataset and the methods used to subset the data - to make it possible to re-create the dataset.
>
> If DCAT can be used safely in these modes, then how to use DCAT to describe data subsets should be clear. If you cannot support these approaches, then IMHO you are better off avoiding DCAT and treating subsets as datasets - and moving to a different information model designed explicitly for this.
>
> Rob Atkinson
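A minimal Turtle sketch of the idea above - a DCAT record generated on the fly for one subset of an RDF-QB cube. All URIs are hypothetical, and dct:isPartOf / dct:temporal are stand-ins for whatever predicates are eventually agreed, not an established pattern:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix qb:   <http://purl.org/linked-data/cube#> .
    @prefix ex:   <http://example.org/> .

    # Hypothetical source cube; its structure definition declares a time dimension.
    ex:temperature-cube a qb:DataSet ;
        qb:structure ex:temperature-dsd .

    # DCAT record generated on the fly for one subset of that cube:
    ex:temperature-2015 a dcat:Dataset ;
        dct:title "Temperature observations - 2015 subset"@en ;
        dct:isPartOf ex:temperature-cube ;   # (c) relation to the original dataset
        dct:temporal ex:year2015 .           # (a)/(b) the subsetted dimension,
                                             #         its range made explicit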
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Jon,
>> I agree completely here. Many times we are 'forced' to partition data due to availability of, or improvements in, query techniques... or simply requests from customers!
>> Our data(set) partitioning strategies are dependent on a multitude of data modeling assumptions and decisions. They can also be determined by the hardware and software we are using to persist and query the data.
>> Lewis
>>
>> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk> wrote:
>>
>>> Hi all,
>>>
>>> Just to chip in - I think that dataset partitioning is *not* (necessarily) intrinsic to the dataset [1], but is a property of data distribution (hence perhaps in scope for DCAT). A dataset might be partitioned differently depending on user preference. Some users may prefer a geographic partitioning, others may prefer a temporal partitioning. Still others might want to partition by variable. One can imagine different catalogues serving the “same” data to different users in different ways (and in fact this does happen with large-volume geographic data like satellite imagery or global models).
>>>
>>> "I like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets."
>>>
>>> I agree with this. I think this is the “level zero” requirement for partitioning.
>>>
>>> [1] Actually, it probably depends on what you mean by "the dataset". If you mean the logical entity, then the partitioning is not a property of the dataset. But if you regard the dataset as a set of physical files, then maybe the partitioning *is* a property of the dataset.
>>>
>>> Cheers,
>>> Jon
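Jon's "level zero" requirement is small enough to sketch in a handful of triples. Assuming dct:hasPart / dct:isPartOf as the subset and superset links (one plausible choice - the thread does not fix the predicates) and hypothetical URIs:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/> .

    # 1. say that a resource is a dataset
    ex:globalModel a dcat:Dataset ;
        # 2. point to a subset
        dct:hasPart ex:globalModel-2015 .

    ex:globalModel-2015 a dcat:Dataset ;
        # 3. point back to the superset
        dct:isPartOf ex:globalModel .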
>>>
>>> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>>>
>>> Hi Frans,
>>>
>>> In my opinion, it all depends on how the actual data is made available. If it's a nice (possibly standard) API, then just link that as a distribution and you're done, I would say. Clients can explore subsets etc. through that API (which in itself should be self-describing and doesn't need any further metadata at the Distribution level, except media type if possible).
>>>
>>> However, if you really *just* have a bunch of files, as is quite common and which may be OK depending on data volume and intended users, then it gets more complicated if you want to allow efficient machine access without first being forced to download everything.
>>>
>>> So, yes, partitioning is intrinsic to the dataset, and that detail is exposed to DCAT to allow more efficient access, both for humans and machines. It is an optimization in the end, but in my opinion a quite useful one.
>>>
>>> I wonder how many different partition strategies are really used in the wild.
>>>
>>> Cheers
>>> Maik
>>>
>>> On 03.02.2016 11:13, Frans Knibbe wrote:
>>>
>>> Hello Andrea, all,
>>>
>>> I like to think about dataset partitioning as something simple, needing only three semantic ingredients: being able to say that a resource is a dataset, and being able to point to subsets and supersets. DCAT does not seem necessary for those three. Is there really a need to see dataset partitioning as DCAT territory? DCAT is a vocabulary for data catalogs; I see dataset partitioning as something intrinsic to the dataset - its structure.
>>>
>>> That said, data about the structure of a dataset is metadata, so it is interesting to think about how data and metadata are coupled. For easy navigation through the structure (by either man or machine) it is probably best to keep the data volume small - metadata only. But it would be nice to have the option to get the actual data from any dataset (at any structural level). That means that additional elements are needed: an indication of ways to get the actual data, dcat:Distribution for instance. Also, an indication of the size of the actual data would be very useful, to help decide whether to get the data or to dig a bit deeper for smaller subsets. Only at the deepest level of the structure, the leaves of the tree, could the actual data be returned by default. A friendly data provider will take care that those subsets contain manageable volumes of data.
>>>
>>> My thoughts have little basis in practice, but I am trying to set up an experiment with spatially partitioned data. I think there are many interesting possibilities. I hope to be able to share something practical with the group soon.
>>>
>>> Regards,
>>> Frans
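Frans's "metadata-only navigation plus a size hint" could look roughly as follows. The URIs and the CSV file are hypothetical; dcat:byteSize is the size indication he asks for:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
    @prefix ex:   <http://example.org/> .

    # A leaf of the structure: small enough to offer the actual data.
    ex:sensors-utrecht a dcat:Dataset ;
        dct:isPartOf ex:sensors-nl ;
        dcat:distribution [
            a dcat:Distribution ;
            dcat:downloadURL <http://example.org/data/sensors-utrecht.csv> ;
            dcat:mediaType <https://www.iana.org/assignments/media-types/text/csv> ;
            dcat:byteSize "2048576"^^xsd:decimal  # helps decide: fetch, or dig deeper?
        ] .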
>>>
>>> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>>>
>>>> Many thanks for sharing this work, Maik!
>>>>
>>>> Just a couple of notes from my side:
>>>>
>>>> 1. Besides temporal coverage, it may be worth adding spatial coverage to your scenarios as another criterion of dataset partitioning. Actually, both criteria are frequently used concurrently.
>>>>
>>>> 2. In many of the scenarios you describe, dataset subsets are modelled as datasets. An alternative would be to model them just as distributions. So, I wonder whether those scenarios have requirements that cannot be met by the latter option.
>>>>
>>>> Some more words on point (2):
>>>>
>>>> As you probably know, there has been quite a long discussion in the DCAT-AP WG concerning this issue. The main points are probably summarised in the conversation recorded here:
>>>>
>>>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>>
>>>> Of course, in DCAT-AP the objective was how to describe dataset subsets, not the criteria for dataset subsetting.
>>>>
>>>> Notably, the discussion highlighted two different approaches: (a) dataset subsets modelled as datasets, or (b) dataset subsets modelled simply as distributions.
>>>>
>>>> I don't see the two scenarios above as mutually exclusive. You can use one or the other depending on your use case and requirements. And you can use both (e.g., referring to point (1): time-related subsets modelled as child datasets, and their space-related subsets as distributions). However, I personally favour the idea of using distributions as the recommended option, and datasets only if you cannot do otherwise. In particular, I see two main issues with the dataset-based approach:
>>>>
>>>> - It includes an additional step to get to the data (dataset -> dataset -> distribution). Moreover, subsetting can be recursive - which increases the number of steps needed to get to the data.
>>>>
>>>> - I understand that your focus is on data discovery from a machine perspective. However, looking at how this will be reflected in catalogues used by people, the result is that you're going to have a record for each child dataset, in addition to the parent one. This scenario is quite typical nowadays (I know quite a few examples of tens of records having the same title, description, etc. - or just a slightly different one), and it doesn't help at all people trying to find what they're looking for.
>>>>
>>>> Thanks
>>>>
>>>> Andrea
>>>>
>>>> On 02/02/2016 12:02, Maik Riechert wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> There has been a lot of discussion about subsetting data. I'd like to give a slightly different perspective, which is purely motivated from the point of view of someone who wants to publish data, and in parallel someone who wants to discover and access that data without much hassle.
>>>>>
>>>>> Of course it is hard to think about all scenarios, so I picked what I think are common ones:
>>>>> - a bunch of static data files without any API
>>>>> - an API without static data files
>>>>> - both
>>>>>
>>>>> And then some specific variations on what structure the data has (yearly data files, daily files, or another dimension used as a splitting point, such as spatial).
>>>>>
>>>>> It is in no way final or complete, and may even be wrong, but here is what I came up with: https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>>>
>>>>> So it always starts by looking at what data exists and how it is exposed, and based on those constraints I tried to model that as DCAT datasets, sometimes with subdatasets. Again, it is purely motivated from a machine-access point of view. There may be other things to consider.
>>>>>
>>>>> The point of the wiki page is to have something concrete to discuss, not just abstract ideas. It should uncover problems, possible solutions, perspectives, etc.
>>>>>
>>>>> Happy to hear your thoughts,
>>>>> Maik
>>>>
>>>> --
>>>> Andrea Perego, Ph.D.
>>>> Scientific / Technical Project Officer
>>>> European Commission DG JRC
>>>> Institute for Environment & Sustainability
>>>> Unit H06 - Digital Earth & Reference Data
>>>> Via E. Fermi, 2749 - TP 262
>>>> 21027 Ispra VA, Italy
>>>>
>>>> https://ec.europa.eu/jrc/
>>
>> --
>> *Lewis*
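To make Andrea's approaches (a) and (b) concrete, a hedged side-by-side sketch with hypothetical URIs. Note the caveat on (b): DCAT itself does not define subset criteria such as temporal coverage on a Distribution, so the dct:temporal triple there is an assumption, not standard usage:

    @prefix dcat: <http://www.w3.org/ns/dcat#> .
    @prefix dct:  <http://purl.org/dc/terms/> .
    @prefix ex:   <http://example.org/> .

    # (a) subsets as child datasets: an extra hop, one catalogue record per subset
    ex:imagery a dcat:Dataset ;
        dct:hasPart ex:imagery-2015 .
    ex:imagery-2015 a dcat:Dataset ;
        dct:temporal ex:year2015 ;
        dcat:distribution ex:imagery-2015-dist .

    # (b) subsets as distributions of the parent: one hop, a single record
    ex:imagery2 a dcat:Dataset ;
        dcat:distribution ex:imagery2-2015-dist .
    ex:imagery2-2015-dist a dcat:Distribution ;
        dct:temporal ex:year2015 ;   # assumption: not defined by DCAT for
                                     # distributions, shown for illustration
        dcat:downloadURL <http://example.org/imagery-2015.tiff> .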
Received on Wednesday, 10 February 2016 21:59:00 UTC