- From: Rob Atkinson <rob@metalinkage.com.au>
- Date: Thu, 04 Feb 2016 12:06:21 +0000
- To: Maik Riechert <m.riechert@reading.ac.uk>, Rob Atkinson <rob@metalinkage.com.au>, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>, Jon Blower <j.d.blower@reading.ac.uk>
- Cc: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <CACfF9LwE2hMSUJqS1MvNnqFuZkELQqCMctpDNeMkZvN-bP17fQ@mail.gmail.com>
In the scope of the SDW is an ontology for temporal dimensions (in two places, actually - both explicitly and implicitly in the use of RDF-QB to describe Web Coverage Services). :-)

This feels better to me than trying to overload DCAT or define a new vocabulary - but I'd like to see how static and dynamic subsets would share definition elements with the "master" dataset description.

On Thu, 4 Feb 2016 at 19:46 Maik Riechert <m.riechert@reading.ac.uk> wrote:

> Interesting!
>
> Let's try it out...
>
> (pseudo JSON-LD)
>
> ...DCAT...
> "distributions": [{
>   "title": "Global hourly temperature for Jan 2016 as netCDF file",
>   "accessURL": "http://../data/2016-01.nc",
>   "qb:slice": {
>     "qb:sliceStructure": {
>       "qb:componentProperty": "eg:refPeriod"
>     },
>     "eg:refPeriod": {
>       "type": "Interval",
>       "hasBeginning": {
>         "inXSDDateTime": "2016-01-01T00:00:00"
>       },
>       "hasEnd": {
>         "inXSDDateTime": "2016-02-01T00:00:00"
>       }
>     }
>   }
> }]
>
> (eg: is a custom namespace)
>
> I left out the qb:DataStructureDefinition since it's not really needed
> here, I think.
>
> It has some challenges, but I can see how this could work. The main
> challenge is that some common dimensions would have to be defined (like
> eg:refPeriod) if they don't exist already somewhere. Often in a dataset
> there are separate spatial dimensions like X and Y (e.g. lat/long), but
> in the above they would very likely be grouped into a single spatial
> dimension.
>
> Cheers
> Maik
>
> Am 04.02.2016 um 03:39 schrieb Rob Atkinson:
>
> > As a straw man...
> >
> > Let's nicely describe a dimensional dataset (i.e. we can subset on
> > ranges on any dimension). It's kind of nice to use RDF-QB for this, as
> > we can describe dimensions using SKOS, OWL etc. - all very powerful
> > and a lot more useful than DCAT for machines to use the data.
> >
> > (If DCAT is for cataloguing and discovery, then we should not overload
> > it with the description that RDF-QB can provide.)
> >
> > So let's say we generate a dataset on the fly via an API (pre-prepared
> > subsets provided as files are just a case of doing this at a different
> > point in the delivery chain).
> >
> > I would think it would be possible to take a DCAT and an RDF-QB
> > description and generate a DCAT description for each subset - provided
> > your description of the dimension is good enough to define the
> > granularity of access. So the question of how to do it might boil down
> > to: is there enough information to generate a new DCAT record on the
> > fly?
> >
> > This needs more thought than I am giving it here, but I would have
> > thought there should be enough information in such a DCAT record to:
> > a) distinguish it from other subsets, and allow a search using the
> > dimensions of the original dataset to find the DCAT record in a large
> > catalog;
> > b) collate such subsets and rebuild the original data cube and its
> > metadata (i.e. the domain of each dimension of the subset is retained,
> > but its range is made explicit);
> > c) define how it relates to the original dataset and the methods used
> > to subset the data - to make it possible to re-create the dataset.
> >
> > If DCAT can be used safely in these modes, then how to use DCAT to
> > describe data subsets should be clear. If you cannot support these
> > approaches then IMHO you are better off avoiding DCAT, treating
> > subsets as datasets, and moving to a different information model
> > designed explicitly for this.
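> >
> > To make (a)-(c) concrete: a generated record for one subset might
> > look something like this (pseudo JSON-LD again - the eg: terms, the
> > subsetting-method identifier and the URI pattern are all invented for
> > illustration):
> >
> > {
> >   "@id": "http://example.org/dataset/temperature/2016-01",
> >   "@type": "dcat:Dataset",
> >   "dct:title": "Global hourly temperature, subset for Jan 2016",
> >   "dct:isPartOf": "http://example.org/dataset/temperature",
> >   "eg:subsettingMethod": "eg:sliceByRefPeriod",
> >   "eg:refPeriod": {
> >     "type": "Interval",
> >     "hasBeginning": { "inXSDDateTime": "2016-01-01T00:00:00" },
> >     "hasEnd": { "inXSDDateTime": "2016-02-01T00:00:00" }
> >   },
> >   "dcat:distribution": [{
> >     "dcat:accessURL": "http://example.org/data/2016-01.nc"
> >   }]
> > }
> >
> > The explicit eg:refPeriod range is aimed at (a) and (b) - each subset
> > is distinguishable and the cube can be re-collated - and the method
> > property is a first stab at (c).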
> >
> > Rob Atkinson

> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com> wrote:
>
>> Hi Jon,
>> I agree completely here. Many times we are 'forced' to partition data
>> due to the availability of, or improvements in, query techniques... or
>> simply requests from customers!
>> Our data(set) partitioning strategies are dependent on a multitude of
>> data modeling assumptions and decisions. They can also be determined
>> by the hardware and software we are using to persist and query the
>> data.
>> Lewis
>>
>> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk> wrote:
>>
>>> Hi all,
>>>
>>> Just to chip in - I think that dataset partitioning is *not*
>>> (necessarily) intrinsic to the dataset [1], but is a property of data
>>> distribution (hence perhaps in scope for DCAT). A dataset might be
>>> partitioned differently depending on user preference. Some users may
>>> prefer a geographic partitioning, others may prefer a temporal
>>> partitioning. Still others might want to partition by variable. One
>>> can imagine different catalogues serving the "same" data to different
>>> users in different ways (and in fact this does happen with
>>> large-volume geographic data like satellite imagery or global models).
>>>
>>> I like to think about dataset partitioning as something simple,
>>> needing only three semantic ingredients: being able to say that a
>>> resource is a dataset, and being able to point to subsets and
>>> supersets.
>>>
>>> I agree with this. I think this is the "level zero" requirement for
>>> partitioning.
>>>
>>> [1] Actually, it probably depends on what you mean by "the dataset".
>>> If you mean the logical entity, then the partitioning is not a
>>> property of the dataset. But if you regard the dataset as a set of
>>> physical files, then maybe the partitioning *is* a property of the
>>> dataset.
>>>
>>> Cheers,
>>> Jon
>>>
>>> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>>>
>>> Hi Frans,
>>>
>>> In my opinion, it all depends on how the actual data is made
>>> available. If it's a nice (possibly standard) API, then just link
>>> that as a distribution and you're done, I would say. Clients can
>>> explore subsets etc. through that API (which in itself should be
>>> self-describing and doesn't need any further metadata at the
>>> Distribution level, except media type if possible).
>>>
>>> However, if you really *just* have a bunch of files, as is quite
>>> common and which may be OK depending on data volume and intended
>>> users, then it gets more complicated if you want to allow efficient
>>> machine access without first being forced to download everything.
>>>
>>> So, yes, partitioning is intrinsic to the dataset, and that detail is
>>> exposed to DCAT to allow more efficient access, both for humans and
>>> machines. It is an optimization in the end, but in my opinion a quite
>>> useful one.
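>>>
>>> A pseudo JSON-LD sketch of the two cases (all URLs are invented, and
>>> dct:hasPart is just one possible choice for the parent-child link):
>>>
>>> The API case - one distribution, and the API describes its own
>>> subsetting:
>>>
>>> {
>>>   "@id": "http://example.org/dataset/temperature",
>>>   "@type": "dcat:Dataset",
>>>   "dcat:distribution": [{
>>>     "dcat:accessURL": "http://example.org/api/temperature"
>>>   }]
>>> }
>>>
>>> The bunch-of-files case - the partitioning surfaces as subdatasets:
>>>
>>> {
>>>   "@id": "http://example.org/dataset/temperature",
>>>   "@type": "dcat:Dataset",
>>>   "dct:hasPart": [{
>>>     "@id": "http://example.org/dataset/temperature/2016-01",
>>>     "@type": "dcat:Dataset",
>>>     "dcat:distribution": [{
>>>       "dcat:downloadURL": "http://example.org/data/2016-01.nc"
>>>     }]
>>>   }]
>>> }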
>>>
>>> I wonder how many different partition strategies are really used in
>>> the wild.
>>>
>>> Cheers
>>> Maik
>>>
>>> Am 03.02.2016 um 11:13 schrieb Frans Knibbe:
>>>
>>> Hello Andrea, all,
>>>
>>> I like to think about dataset partitioning as something simple,
>>> needing only three semantic ingredients: being able to say that a
>>> resource is a dataset, and being able to point to subsets and
>>> supersets. DCAT does not seem necessary for those three. Is there
>>> really a need to see dataset partitioning as DCAT territory? DCAT is
>>> a vocabulary for data catalogs; I see dataset partitioning as
>>> something intrinsic to the dataset - its structure.
>>>
>>> That said, data about the structure of a dataset is metadata, so it
>>> is interesting to think about how data and metadata are coupled. For
>>> easy navigation through the structure (by either man or machine) it
>>> is probably best to keep the data volume small - metadata only. But
>>> it would be nice to have the option to get the actual data from any
>>> dataset (at any structural level). That means that additional
>>> elements are needed: an indication of ways to get the actual data,
>>> dcat:Distribution for instance. Also, an indication of the size of
>>> the actual data would be very useful, to help decide whether to get
>>> the data or to dig a bit deeper for smaller subsets. Only at the
>>> deepest level of the structure, the leaves of the tree, could the
>>> actual data be returned by default. A friendly data provider will
>>> take care that those subsets contain manageable volumes of data.
>>>
>>> My thoughts have little basis in practice, but I am trying to set up
>>> an experiment with spatially partitioned data. I think there are many
>>> interesting possibilities. I hope to be able to share something
>>> practical with the group soon.
>>>
>>> Regards,
>>> Frans
>>>
>>> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>>>
>>>> Many thanks for sharing this work, Maik!
>>>>
>>>> Just a couple of notes from my side:
>>>>
>>>> 1. Besides temporal coverage, it may be worth also adding spatial
>>>> coverage to your scenarios as another criterion of dataset
>>>> partitioning. Actually, both criteria are frequently used
>>>> concurrently.
>>>>
>>>> 2. In many of the scenarios you describe, dataset subsets are
>>>> modelled as datasets. An alternative would be to model them just as
>>>> distributions. So, I wonder whether those scenarios have
>>>> requirements that cannot be met by the latter option.
>>>>
>>>> Some more words on point (2):
>>>>
>>>> As you probably know, there has been quite a long discussion in the
>>>> DCAT-AP WG concerning this issue. The main points are probably
>>>> summarised in the conversation recorded here:
>>>>
>>>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>>
>>>> Of course, in DCAT-AP the objective was how to describe dataset
>>>> subsets, not criteria for dataset subsetting.
>>>>
>>>> Notably, the discussion highlighted two different approaches: (a)
>>>> dataset subsets modelled as datasets, or (b) dataset subsets
>>>> modelled simply as distributions.
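>>>>
>>>> To illustrate the two in pseudo JSON-LD (titles and URLs invented,
>>>> and dct:hasPart is just one way to express the parent-child link):
>>>>
>>>> (a) subsets as child datasets:
>>>>
>>>> {
>>>>   "@type": "dcat:Dataset",
>>>>   "dct:title": "Global hourly temperature",
>>>>   "dct:hasPart": [{
>>>>     "@type": "dcat:Dataset",
>>>>     "dct:title": "Global hourly temperature, Jan 2016",
>>>>     "dcat:distribution": [{
>>>>       "dcat:downloadURL": "http://example.org/data/2016-01.nc"
>>>>     }]
>>>>   }]
>>>> }
>>>>
>>>> (b) subsets as distributions of a single dataset:
>>>>
>>>> {
>>>>   "@type": "dcat:Dataset",
>>>>   "dct:title": "Global hourly temperature",
>>>>   "dcat:distribution": [{
>>>>     "dct:title": "January 2016 as netCDF",
>>>>     "dcat:downloadURL": "http://example.org/data/2016-01.nc"
>>>>   }]
>>>> }
>>>>
>>>> (b) keeps a single catalogue record, but note that DCAT has no
>>>> standard slot for the temporal or spatial extent of an individual
>>>> distribution.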
>>>>
>>>> I don't see the two scenarios above as mutually exclusive. You can
>>>> use one or the other depending on your use case and requirements.
>>>> And you can use both (e.g., referring to point (1): time-related
>>>> subsets modelled as child datasets, and their space-related subsets
>>>> as distributions). However, I personally favour the idea of using
>>>> distributions as the recommended option, and datasets only if you
>>>> cannot do otherwise. In particular, I see two main issues with the
>>>> dataset-based approach:
>>>>
>>>> - It includes an additional step to get to the data (dataset ->
>>>> dataset -> distribution). Moreover, subsetting can be recursive,
>>>> which increases the number of steps needed to get to the data.
>>>>
>>>> - I understand that your focus is on data discovery from a machine
>>>> perspective. However, looking at how this will be reflected in
>>>> catalogues used by people, the result is that you're going to have a
>>>> record for each child dataset, in addition to the parent one. This
>>>> scenario is quite typical nowadays (I know quite a few examples of
>>>> tens of records having the same title, description, etc. - or just a
>>>> slightly different one), and it doesn't help at all people trying to
>>>> find what they're looking for.
>>>>
>>>> Thanks
>>>>
>>>> Andrea
>>>>
>>>> On 02/02/2016 12:02, Maik Riechert wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> There has been a lot of discussion about subsetting data. I'd like
>>>>> to give a slightly different perspective, motivated purely by the
>>>>> point of view of someone who wants to publish data, and in parallel
>>>>> someone who wants to discover and access that data without much
>>>>> hassle.
>>>>>
>>>>> Of course it is hard to think about all scenarios, so I picked what
>>>>> I think are common ones:
>>>>> - a bunch of static data files without any API
>>>>> - an API without static data files
>>>>> - both
>>>>>
>>>>> And then some specific variations on what structure the data has
>>>>> (yearly data files, daily, or another dimension used as the
>>>>> splitting point, such as spatial).
>>>>>
>>>>> It is in no way final or complete and may even be wrong, but here is
>>>>> what I came up with:
>>>>> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>>>
>>>>> So it always starts by looking at what data exists and how it is
>>>>> exposed, and based on those constraints I tried to model that as
>>>>> DCAT datasets, sometimes with subdatasets. Again, it is purely
>>>>> motivated from a machine-access point of view. There may be other
>>>>> things to consider.
>>>>>
>>>>> The point of this wiki page is to have something concrete to
>>>>> discuss, not just abstract ideas. It should uncover problems,
>>>>> possibly solutions, perspectives, etc.
>>>>>
>>>>> Happy to hear your thoughts,
>>>>> Maik
>>>>
>>>> --
>>>> Andrea Perego, Ph.D.
>>>> Scientific / Technical Project Officer
>>>> European Commission DG JRC
>>>> Institute for Environment & Sustainability
>>>> Unit H06 - Digital Earth & Reference Data
>>>> Via E. Fermi, 2749 - TP 262
>>>> 21027 Ispra VA, Italy
>>>>
>>>> https://ec.europa.eu/jrc/
>>
>> --
>> *Lewis*
Received on Thursday, 4 February 2016 12:07:15 UTC