Re: Exposing datasets with DCAT (partitioning, subsets..)

Interesting!

Let's try it out...

(pseudo JSON-LD)

...DCAT...
"distributions": [{
   "title": "Global hourly temperature for Jan 2016 as netCDF file",
   "accessURL": "http://../data/2012-01.nc",
   "qb:slice": {
     "qb:sliceStructure": {
       "qb:componentProperty": "eg:refPeriod"
     },
     "eg:refPeriod": {
       "type": "Interval",
       "hasBeginning": {
         "inXSDDateTime": "2016-01-01T00:00:00"
       },
       "hasEnd": {
         "inXSDDateTime": "2016-02-01T00:00:00"
       }
     }
   }
}]

(eg: is a custom namespace)

I left out the qb:DataStructureDefinition since I don't think it's 
really needed here.
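
For completeness, if one did include it, a minimal structure 
definition might look roughly like this (again pseudo JSON-LD; 
eg:airTemperature is just an invented measure for illustration):

"qb:structure": {
  "type": "qb:DataStructureDefinition",
  "qb:component": [
    { "qb:dimension": "eg:refPeriod" },
    { "qb:measure": "eg:airTemperature" }
  ]
}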

The approach has some challenges, but I can see how it could work. The 
main challenge is that some common dimensions (like eg:refPeriod) would 
have to be defined if they don't already exist somewhere. Also, a 
dataset often has separate spatial dimensions like X and Y (e.g. 
lat/long), but in the approach above they would very likely have to be 
grouped into a single spatial dimension.
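
For example, a spatial slice over such a combined dimension could then 
be expressed with a bounding box rather than separate X/Y components 
(just a sketch; eg:refArea, eg:BoundingBox and the corner properties 
are all invented here):

"qb:slice": {
  "qb:sliceStructure": {
    "qb:componentProperty": "eg:refArea"
  },
  "eg:refArea": {
    "type": "eg:BoundingBox",
    "eg:lowerCorner": [-10.0, 50.0],
    "eg:upperCorner": [2.0, 60.0]
  }
}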

Cheers
Maik

On 04.02.2016 at 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> Let's nicely describe a dimensional dataset (i.e. one we can subset 
> on ranges on any dimension). It's kind of nice to use RDF-QB for 
> this, as we can describe dimensions using SKOS, OWL etc. - all very 
> powerful and a lot more useful than DCAT for machines using the data.
>
> (If DCAT is for cataloguing and discovery - then we should not 
> overload it with the description that RDF-QB can provide.)
>
> So let's say we generate a dataset on the fly via an API 
> (pre-prepared subsets provided as files are just a case of doing 
> this at a different point in the delivery chain).
>
> I would think it would be possible to take a DCAT and an RDF-QB 
> description and generate a DCAT description for each subset - 
> provided your description of the dimensions is good enough to define 
> the granularity of access. So the question of how to do it might 
> boil down to whether there is enough information to generate a new 
> DCAT record on the fly.
>
> This needs more thought than I am giving it here - but I would have 
> thought there should be enough information in such a DCAT record to:
> a) distinguish it from other subsets and allow a search, using the 
> dimensions of the original dataset, to find the DCAT record in a 
> large catalog;
> b) collate such subsets and rebuild the original data cube and its 
> metadata (i.e. the domain of each dimension of the subset is 
> retained, but its range is made explicit);
> c) define how it relates to the original dataset and the methods 
> used to subset the data, to make it possible to re-create the 
> dataset (see the sketch below).
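>
> As a rough sketch, a generated record for such a subset might carry 
> something like this (pseudo JSON-LD in the style above; the parent 
> dataset URI is invented, dct:isPartOf is standard Dublin Core):
>
>    {
>      "type": "dcat:Dataset",
>      "title": "Global hourly temperature, Jan 2016 subset",
>      "dct:isPartOf": "http://example.org/dataset/global-temp",
>      "eg:refPeriod": {
>        "hasBeginning": { "inXSDDateTime": "2016-01-01T00:00:00" },
>        "hasEnd": { "inXSDDateTime": "2016-02-01T00:00:00" }
>      }
>    }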
>
> If DCAT can be used safely in these modes, then how to use DCAT to 
> describe data subsets should be clear. If you cannot support these 
> approaches, then IMHO you are better off avoiding DCAT and treating 
> subsets as datasets, and moving to a different information model 
> designed explicitly for this.
>
> Rob Atkinson
>
>
>
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney 
> <lewis.mcgibbney@gmail.com> wrote:
>
>     Hi Jon,
>     I agree completely here. Many times we are 'forced' to partition
>     data due to availability of, or improvements in, query
>     techniques... or simply requests from customers!
>     Our data(set) partitioning strategies are dependent on a multitude
>     of data modeling assumptions and decisions. They can also be
>     determined by the hardware and software we are using to persist
>     and query the data.
>     Lewis
>
>     On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower
>     <j.d.blower@reading.ac.uk> wrote:
>
>         Hi all,
>
>         Just to chip in - I think that dataset partitioning is *not*
>         (necessarily) intrinsic to the dataset [1], but is a property
>         of data distribution (hence perhaps in scope for DCAT). A
>         dataset might be partitioned differently depending on user
>         preference. Some users may prefer a geographic partitioning,
>         others may prefer a temporal partitioning. Still others might
>         want to partition by variable. One can imagine different
>         catalogues serving the “same” data to different users in
>         different ways (and in fact this does happen with large-volume
>         geographic data like satellite imagery or global models).
>
>>>         I like to think about dataset partitioning as something simple,
>>>         needing only three semantic ingredients: being able to say
>>>         that a resource is a dataset, and being able to point to
>>>         subsets and supersets.
>
>         I agree with this. I think this is the “level zero”
>         requirement for partitioning.
>
>         [1] Actually, it probably depends on what you mean by "the
>         dataset". If you mean the logical entity, then the
>         partitioning is not a property of the dataset. But if you
>         regard the dataset as a set of physical files then maybe the
>         partitioning *is* a property of the dataset.
>
>         Cheers,
>         Jon
>
>
>>         On 3 Feb 2016, at 11:34, Maik Riechert
>>         <m.riechert@reading.ac.uk> wrote:
>>
>>         Hi Frans,
>>
>>         In my opinion, it all depends on how the actual data is made
>>         available. If it's a nice (possibly standard) API, then just
>>         link that as a distribution and you're done, I would say.
>>         Clients can explore subsets etc. through that API (which in
>>         itself should be self-describing and doesn't need any further
>>         metadata at the Distribution level, except media type if
>>         possible).
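>>
>>         Roughly like this, say (pseudo JSON-LD again; the endpoint
>>         URL is made up and the media type is just an example):
>>
>>            "distributions": [{
>>              "title": "Temperature data API",
>>              "accessURL": "http://example.org/api/temperature",
>>              "mediaType": "application/json"
>>            }]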
>>
>>         However, if you really *just* have a bunch of files, as is
>>         quite common and which may be ok depending on data volume and
>>         intended users, then it gets more complicated if you want to
>>         allow efficient machine access without first being forced to
>>         download everything.
>>
>>         So, yes, partitioning is intrinsic to the dataset, and that
>>         detail is exposed through DCAT to allow more efficient access,
>>         both for humans and machines. It is an optimization in the
>>         end, but in my opinion a quite useful one.
>>
>>         I wonder how many different partition strategies are really
>>         used in the wild.
>>
>>         Cheers
>>         Maik
>>
>>         On 03.02.2016 at 11:13, Frans Knibbe wrote:
>>>         Hello Andrea, all,
>>>
>>>         I like to think about dataset partitioning as something
>>>         simple, needing only three semantic ingredients: being able
>>>         to say that a resource is a dataset, and being able to point
>>>         to subsets and supersets. DCAT does not seem necessary for
>>>         those three. Is there really a need to see dataset
>>>         partitioning as DCAT territory? DCAT is a vocabulary for data
>>>         catalogs; I see dataset partitioning as something intrinsic
>>>         to the dataset - its structure.
>>>
>>>         That said, data about the structure of a dataset is
>>>         metadata, so it is interesting to think about how data and
>>>         metadata are coupled. For easy navigation through the
>>>         structure (by either man or machine) it is probably best to
>>>         keep the data volume small - metadata only. But it would be
>>>         nice to have the option to get the actual data from any
>>>         dataset (at any structural level). That means that
>>>         additional elements are needed: an indication of ways to get
>>>         the actual data, dcat:Distribution for instance. Also, an
>>>         indication of the size of the actual data would be very
>>>         useful, to help decide whether to get the data or to dig a
>>>         bit deeper for smaller subsets. Only at the deepest level of
>>>         the structure, the leaves of the tree, could the actual data
>>>         be returned by default. A friendly data provider will take
>>>         care that those subsets contain manageable volumes of data.
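>>>
>>>         For instance, a node in such a structure might look like
>>>         this (pseudo JSON-LD; all URIs invented; dct:hasPart and
>>>         byteSize are existing DC/DCAT terms):
>>>
>>>            {
>>>              "type": "dcat:Dataset",
>>>              "dct:hasPart": [
>>>                "http://example.org/dataset/2016/01",
>>>                "http://example.org/dataset/2016/02"
>>>              ],
>>>              "distributions": [{
>>>                "accessURL": "http://example.org/data/2016.nc",
>>>                "byteSize": 52000000000
>>>              }]
>>>            }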
>>>
>>>         My thoughts have little basis in practice, but I am trying
>>>         to set up an experiment with spatially partitioned data. I
>>>         think there are many interesting possibilities. I hope to be
>>>         able to share something practical with the group soon.
>>>
>>>         Regards,
>>>         Frans
>>>
>>>
>>>
>>>
>>>
>>>         2016-02-03 10:05 GMT+01:00 Andrea Perego
>>>         <andrea.perego@jrc.ec.europa.eu>:
>>>
>>>             Many thanks for sharing this work, Maik!
>>>
>>>             Just a couple of notes from my side:
>>>
>>>             1. Besides temporal coverage, it may be worth also
>>>             adding spatial coverage to your scenarios as another
>>>             criterion of dataset partitioning. In fact, both
>>>             criteria are frequently used together.
>>>
>>>             2. In many of the scenarios you describe, dataset
>>>             subsets are modelled as datasets. An alternative would
>>>             be to model them just as distributions. So, I wonder
>>>             whether those scenarios have requirements that cannot be
>>>             met by the latter option.
>>>
>>>             Some more words on point (2):
>>>
>>>             As you probably know, there has been quite a long
>>>             discussion in the DCAT-AP WG concerning this issue. The
>>>             main points are probably summarised in the conversation
>>>             recorded here:
>>>
>>>             https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>
>>>             Of course, in DCAT-AP the objective was how to describe
>>>             dataset subsets, not the criteria for dataset
>>>             subsetting.
>>>
>>>             Notably, the discussion highlighted two different
>>>             approaches: (a) dataset subsets modelled as datasets or
>>>             (b) dataset subsets modelled simply as distributions.
>>>
>>>             I don't see the two scenarios above as mutually
>>>             exclusive. You can use one or the other depending on
>>>             your use case and requirements. And you can use both
>>>             (e.g., referring to point (1): time-related subsets
>>>             modelled as child datasets, and their space-related
>>>             subsets as distributions). However, I personally favour
>>>             the idea of using distributions as the recommended
>>>             option, and datasets only if you cannot do otherwise. In
>>>             particular, I see two main issues with the dataset-based
>>>             approach:
>>>
>>>             - It includes an additional step to get to the data
>>>             (dataset -> dataset -> distribution). Moreover,
>>>             subsetting can be recursive - which increases the number
>>>             of steps needed to get to the data.
>>>
>>>             - I understand that your focus is on data discovery from
>>>             a machine perspective. However, looking at how this will
>>>             be reflected in catalogues used by people, the result is
>>>             that you're going to have a record for each child
>>>             dataset, in addition to the parent one. This scenario is
>>>             quite typical nowadays (I know quite a few examples of
>>>             tens of records having the same title, description, etc.
>>>             - or just slightly different ones), and it doesn't help
>>>             people trying to find what they're looking for at all.
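>>>
>>>             As a small sketch of the distribution-based option
>>>             (pseudo JSON-LD, file URLs invented), the parent
>>>             dataset would simply list one distribution per subset:
>>>
>>>               "distributions": [
>>>                 { "title": "January 2016",
>>>                   "accessURL": "http://example.org/data/2016-01.nc" },
>>>                 { "title": "February 2016",
>>>                   "accessURL": "http://example.org/data/2016-02.nc" }
>>>               ]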
>>>
>>>             Thanks
>>>
>>>             Andrea
>>>
>>>
>>>
>>>             On 02/02/2016 12:02, Maik Riechert wrote:
>>>
>>>                 Hi all,
>>>
>>>                 There has been a lot of discussion about subsetting
>>>                 data. I'd like to
>>>                 give a slightly different perspective which is
>>>                 purely motivated from the
>>>                 point of view of someone who wants to publish data,
>>>                 and in parallel
>>>                 someone who wants to discover and access that data
>>>                 without much hassle.
>>>
>>>                 Of course it is hard to think about all scenarios,
>>>                 so I picked what I
>>>                 think are common ones:
>>>                 - a bunch of static data files without any API
>>>                 - an API without static data files
>>>                 - both
>>>
>>>                 And then some specific variations on what structure
>>>                 the data has (yearly
>>>                 data files, daily, or another dimension used as
>>>                 splitting point, such as
>>>                 spatial).
>>>
>>>                 It is in no way final or complete and may even be
>>>                 wrong, but here is
>>>                 what I came up with:
>>>                 https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>
>>>                 So it always starts by looking at what data exists
>>>                 and how it is
>>>                 exposed, and based on those constraints I tried to
>>>                 model that as DCAT
>>>                 datasets, sometimes with subdatasets. Again, it is
>>>                 written purely from
>>>                 a machine-access point of view. There may be other
>>>                 things to consider.
>>>
>>>                 The point of this wiki page is to have something
>>>                 concrete to discuss,
>>>                 not just abstract ideas. It should uncover
>>>                 problems, possible
>>>                 solutions, perspectives, etc.
>>>
>>>                 Happy to hear your thoughts,
>>>                 Maik
>>>
>>>
>>>             -- 
>>>             Andrea Perego, Ph.D.
>>>             Scientific / Technical Project Officer
>>>             European Commission DG JRC
>>>             Institute for Environment & Sustainability
>>>             Unit H06 - Digital Earth & Reference Data
>>>             Via E. Fermi, 2749 - TP 262
>>>             21027 Ispra VA, Italy
>>>
>>>             https://ec.europa.eu/jrc/
>>>
>>>
>>
>
>
>
>
>     -- 
>     /Lewis/
>

Received on Thursday, 4 February 2016 08:46:38 UTC