Re: Exposing datasets with DCAT (partitioning, subsets..)

There is a huge amount of work to do, IMHO, to get subsetting well
described. The QB4ST note is intended as a building block for a small part
of that, allowing hierarchical dimensions to be described with
spatio-temporal characteristics. The use cases for aggregation ("rollup"
functions in OLAP) and for drilling down into detail need to be fleshed
out, and how services that deliver dynamic subsets are handled in a way
consistent with pre-packaged subsets also needs consideration. Let's hope
this gets onto the agenda moving forward :-)
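To make the rollup idea concrete, here is a minimal Python sketch of aggregating observations along a day-to-year dimension hierarchy, the OLAP-style operation that a QB4ST-described hierarchy would enable; all data and names are invented for illustration:

```python
# Sketch: rolling up daily observations to yearly totals along a
# hierarchical time dimension (day -> year). Everything here is
# illustrative, not taken from any spec.
from collections import defaultdict

observations = [
    {"day": "2016-02-03", "value": 4.0},
    {"day": "2016-07-12", "value": 6.0},
    {"day": "2017-01-09", "value": 5.0},
]

def rollup_to_year(obs):
    """Aggregate observations to the coarser 'year' level of the hierarchy."""
    totals = defaultdict(float)
    for o in obs:
        year = o["day"][:4]  # parent level in the day -> year hierarchy
        totals[year] += o["value"]
    return dict(totals)

print(rollup_to_year(observations))  # {'2016': 10.0, '2017': 5.0}
```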

Rob Atkinson

On Tue, 4 Apr 2017 at 23:04 Ed Parsons <eparsons@google.com> wrote:

> Hello Maik,
>
> I am working my way through the public comments made to the Spatial Data
> on the Web Working Group prior to the release of the final draft of the
> Best Practice document; the current version is here:
> https://www.w3.org/TR/sdw-bp/
>
> Although there was a good email discussion of subsetting (creating
> partitions of larger datasets) using DCAT amongst other approaches, the
> topic was not taken up much further within the realms of the Best Practice
> deliverable. There is continuing work looking at publishing coverage data,
> which is likely to carry on within a future W3C/OGC activity after this
> working group ends in June.
>
> Would you therefore allow me to mark this comment against the Best
> Practice document as closed?
>
> Many Thanks for your contribution.
>
> Ed
> Co-Chair W3C/OGC Spatial Data on the Web Working Group
>
> On Wed, 10 Feb 2016 at 21:59 Rob Atkinson <rob@metalinkage.com.au> wrote:
>
> I was really thinking about being parsimonious with the information
> management - having the smallest number of manual (meta)data curation tasks
> and the maximum consistency and usefulness in the derived information.
> This smacks of OLAP: data warehouses to meet user needs, but highly
> normalised transactional backends to manage the data in the fastest, most
> reliable, and cheapest way.
>
> How the DCAT description is best realised is a separate issue - it
> depends on what operations clients expect to be able to perform using it
> (what are the use cases here?)
>
>
> rob
>
> On Wed, 10 Feb 2016 at 23:07 Maik Riechert <m.riechert@reading.ac.uk>
> wrote:
>
> See below
>
>
> On 04/02/2016 03:39, Rob Atkinson wrote:
>
>
> As a straw man...
>
> Let's nicely describe a dimensional dataset (i.e. one we can subset on
> ranges along any dimension). It's quite natural to use RDF-QB for this, as
> we can describe dimensions using SKOS, OWL, etc. - all very powerful and a
> lot more useful than DCAT for machines using the data.
>
> (If DCAT is for cataloguing and discovery - then we should not overload it
> with the description that RDF-QB can provide.)
>
> So let's say we generate a dataset on the fly via an API (pre-prepared
> subsets provided as files are just a case of doing this at a different
> point in the delivery chain).
>
> I would think it possible to take a DCAT description and an RDF-QB
> description and generate a DCAT description for each subset, provided your
> description of the dimension is good enough to define the granularity of
> access. So the question of how to do it might boil down to: is there
> enough information to generate a new DCAT record on the fly?
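A rough Python sketch of that generation step, with invented field names that only loosely echo DCAT properties (dct:title, dct:isPartOf) - not a normative mapping:

```python
# Sketch: given a parent dataset description and the granularity of one
# dimension (as an RDF-QB description might supply), emit a catalogue
# record per subset on the fly. All names are illustrative.
parent = {"id": "sst", "title": "Sea surface temperature"}
time_dimension = {"name": "year", "values": [2014, 2015, 2016]}

def subset_records(parent, dim):
    """One record per value of the subsetting dimension."""
    records = []
    for v in dim["values"]:
        records.append({
            "id": f"{parent['id']}-{dim['name']}-{v}",
            "title": f"{parent['title']} ({dim['name']}={v})",
            "isPartOf": parent["id"],        # link back to the source dataset
            "subsetting": {dim["name"]: v},  # how this subset was derived
        })
    return records

recs = subset_records(parent, time_dimension)
print(len(recs))  # 3
```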
>
> Just for clarification, with DCAT record, do you mean a DCAT dataset or a
> DCAT distribution? I would say the latter.
>
> Cheers
>
> Maik
>
>
>
> This needs more thought than I am giving it here, but I would have
> thought there should be enough information in such a DCAT record to:
> a) distinguish it from other subsets and allow a search, using the
> dimensions of the original dataset, to find the DCAT record in a large
> catalogue;
> b) collate such subsets and rebuild the original data cube
> and its metadata (i.e. the domain of each dimension of the subset is
> retained, but its range is made explicit);
> c) define how it relates to the original dataset and the methods used
> to subset the data, so as to make it possible to re-create the dataset.
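Requirement (b) in particular amounts to collating the explicit ranges of the subsets back into the extent of the original cube; a Python sketch with invented structures:

```python
# Sketch of requirement (b): union the explicit ranges carried by each
# subset record to recover the extent of the original cube along a
# shared dimension. Field names are illustrative only.
subsets = [
    {"dimension": "time", "range": (2014, 2015)},
    {"dimension": "time", "range": (2015, 2017)},
]

def collate_extent(subs):
    """Union of the subsets' explicit ranges along one dimension."""
    lows = [s["range"][0] for s in subs]
    highs = [s["range"][1] for s in subs]
    return (min(lows), max(highs))

print(collate_extent(subsets))  # (2014, 2017)
```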
>
> If DCAT can be used safely in these modes, then how to use DCAT to
> describe data subsets should be clear. If you cannot support these
> approaches then IMHO you are better off avoiding DCAT and treating subsets
> as datasets in their own right - moving to a different information model
> designed explicitly for this.
>
> Rob Atkinson
>
>
>
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> Hi Jon,
> I agree completely here. Many times we are 'forced' to partition data due
> to the availability of, or improvements in, query techniques... or simply
> requests from customers!
> Our data(set) partitioning strategies are dependent on a multitude of data
> modeling assumptions and decisions. They can also be determined by the
> hardware and software we are using to persist and query the data.
> Lewis
>
> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk>
> wrote:
>
> Hi all,
>
> Just to chip in - I think that dataset partitioning is *not* (necessarily)
> intrinsic to the dataset [1], but is a property of data distribution (hence
> perhaps in scope for DCAT). A dataset might be partitioned differently
> depending on user preference. Some users may prefer a geographic
> partitioning, others may prefer a temporal partitioning. Still others might
> want to partition by variable. One can imagine different catalogues serving
> the “same” data to different users in different ways (and in fact this does
> happen with large-volume geographic data like satellite imagery or global
> models).
>
> I like to think about dataset partitioning as something simple, needing
> only three semantic ingredients: being able to say that a resource is a
> dataset, and being able to point to subsets and supersets.
>
>
> I agree with this. I think this is the “level zero” requirement for
> partitioning.
>
> [1] Actually, it probably depends on what you mean by "the dataset". If
> you mean the logical entity, then the partitioning is not a property of the
> dataset. But if you regard the dataset as a set of physical files then
> maybe the partitioning *is* a property of the dataset.
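A minimal sketch of that level-zero model - just datasets plus subset/superset links; the property names here are placeholders (dct:hasPart / dct:isPartOf would be obvious candidates):

```python
# Sketch: the three "level zero" ingredients as plain data - a resource
# is a dataset, it has subsets, it has supersets. Identifiers and keys
# are invented for illustration.
datasets = {
    "global-temp": {"isDataset": True, "subsets": ["global-temp-2016"]},
    "global-temp-2016": {"isDataset": True, "supersets": ["global-temp"]},
}

def subsets_of(ds_id):
    """Follow the subset links from one dataset."""
    return datasets[ds_id].get("subsets", [])

def supersets_of(ds_id):
    """Follow the superset links back up."""
    return datasets[ds_id].get("supersets", [])

print(subsets_of("global-temp"))       # ['global-temp-2016']
print(supersets_of("global-temp-2016"))  # ['global-temp']
```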
>
> Cheers,
> Jon
>
>
> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>
> Hi Frans,
>
> In my opinion, it all depends on how the actual data is made available. If
> it's a nice (possibly standard) API, then just link that as a distribution
> and you're done I would say. Clients can explore subsets etc. through that
> API (which in itself should be self-describing and doesn't need any further
> metadata at the Distribution level, except media type if possible).
>
> However, if you really *just* have a bunch of files, as is quite common
> and which may be ok depending on data volume and intended users, then it
> gets more complicated if you want to allow efficient machine access without
> first being forced to download everything.
>
> So, yes, partitioning is intrinsic to the dataset, and that detail is
> exposed to DCAT to allow more efficient access, both for humans and
> machines. It is an optimization in the end, but in my opinion a quite
> useful one.
>
> I wonder how many different partition strategies are really used in the
> wild.
>
> Cheers
> Maik
>
> Am 03.02.2016 um 11:13 schrieb Frans Knibbe:
>
> Hello Andrea, all,
>
> I like to think about dataset partitioning as something simple, needing
> only three semantic ingredients: being able to say that a resource is a
> dataset, and being able to point to subsets and supersets. DCAT does not
> seem necessary for those three. Is there really a need to see dataset
> partitioning as DCAT territory? DCAT is a vocabulary for data catalogues;
> I see dataset partitioning as something intrinsic to the dataset - its
> structure.
>
> That said, data about the structure of a dataset is metadata, so it is
> interesting to think about how data and metadata are coupled. For easy
> navigation through the structure (by either man or machine) it is probably
> best to keep the data volume small - metadata only. But it would be nice to
> have the option to get the actual data from any dataset (at any structural
> level). That means that additional elements are needed: an indication of
> ways to get the actual data, dcat:Distribution for instance. An indication
> of the size of the actual data would also be very useful, to help decide
> whether to get the data or to dig a bit deeper for smaller subsets. Only at
> the deepest level of the structure, the leaves of the tree, could the
> actual data be returned by default. A friendly data provider will take
> care that those subsets contain manageable volumes of data.
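That decision - fetch the data or dig deeper - can be sketched as a walk over a metadata-only tree carrying size hints; all field names here are invented:

```python
# Sketch: walk a metadata-only partition tree and decide, per node,
# whether to fetch its data directly (small enough) or descend to
# smaller subsets. "byteSize" loosely echoes dcat:byteSize.
tree = {
    "id": "obs", "byteSize": 10_000_000_000,
    "subsets": [
        {"id": "obs-2016", "byteSize": 40_000_000, "subsets": []},
        {"id": "obs-2017", "byteSize": 35_000_000, "subsets": []},
    ],
}

def fetch_plan(node, limit):
    """Return ids of the shallowest subsets at or below the size limit."""
    if node["byteSize"] <= limit:
        return [node["id"]]
    plan = []
    for child in node["subsets"]:
        plan.extend(fetch_plan(child, limit))
    return plan

print(fetch_plan(tree, 50_000_000))  # ['obs-2016', 'obs-2017']
```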
>
> My thoughts have little basis in practice, but I am trying to set up an
> experiment with spatially partitioned data. I think there are many
> interesting possibilities. I hope to be able to share something practical
> with the group soon.
>
> Regards,
> Frans
>
>
>
>
>
> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>
> Many thanks for sharing this work, Maik!
>
> Just a couple of notes from my side:
>
> 1. Besides temporal coverage, it may be worth adding spatial coverage to
> your scenarios as another criterion of dataset partitioning. Actually, the
> two criteria are frequently used together.
>
> 2. In many of the scenarios you describe, dataset subsets are modelled as
> datasets. An alternative would be to model them just as distributions. So,
> I wonder whether those scenarios have requirements that cannot be met by
> the latter option.
>
> Some more words on point (2):
>
> As you probably know, there has been quite a long discussion in the
> DCAT-AP WG concerning this issue. The main points are probably summarised
> in the conversation recorded here:
>
>
> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>
> Of course, in DCAT-AP the objective was how to describe dataset subsets,
> not the criteria for dataset subsetting.
>
> Notably, the discussion highlighted two different approaches: (a) dataset
> subsets modelled as datasets or (b) dataset subsets modelled simply as
> distributions.
>
> I don't see the two scenarios above as mutually exclusive. You can use one
> or the other depending on your use case and requirements. And you can use
> both (e.g., referring to point (1): time-related subsets modelled as child
> datasets, and their space-related subsets as distributions). However, I
> personally favour the idea of using distributions as the recommended
> option, and datasets only if you cannot do otherwise. In particular, I see
> two main issues with the dataset-based approach:
>
> - It includes an additional step to get to the data (dataset -> dataset ->
> distribution). Moreover, subsetting can be recursive - which increases the
> number of steps needed to get to the data.
>
> - I understand that your focus is on data discovery from a machine
> perspective. However, looking at how this will be reflected in catalogues
> used by people, the result is that you're going to have a record for each
> child dataset, in addition to the parent one. This scenario is quite
> typical nowadays (I know quite a few examples of tens of records having the
> same title, description, etc. - or just a slightly different one), and it
> doesn't at all help people trying to find what they're looking for.
>
> Thanks
>
> Andrea
>
>
>
> On 02/02/2016 12:02, Maik Riechert wrote:
>
> Hi all,
>
> There has been a lot of discussion about subsetting data. I'd like to
> give a slightly different perspective which is purely motivated from the
> point of view of someone who wants to publish data, and in parallel
> someone who wants to discover and access that data without much hassle.
>
> Of course it is hard to think about all scenarios, so I picked what I
> think are common ones:
> - a bunch of static data files without any API
> - an API without static data files
> - both
>
> And then some specific variations on what structure the data has (yearly
> data files, daily, or another dimension used as splitting point, such as
> spatial).
>
> It is in no way final or complete and may even be wrong, but here is
> what I came up with:
> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>
> So it always starts by looking at what data exists and how it is
> exposed, and based on those constraints I tried to model that as DCAT
> datasets, sometimes with subdatasets. Again, it is purely motivated from
> a machine-access point of view. There may be other things to consider.
>
> The point of this wiki page is to have something concrete to discuss,
> rather than just abstract ideas. It should uncover problems, possible
> solutions, perspectives, etc.
>
> Happy to hear your thoughts,
> Maik
>
>
> --
> Andrea Perego, Ph.D.
> Scientific / Technical Project Officer
> European Commission DG JRC
> Institute for Environment & Sustainability
> Unit H06 - Digital Earth & Reference Data
> Via E. Fermi, 2749 - TP 262
> 21027 Ispra VA, Italy
>
> https://ec.europa.eu/jrc/
>
>
>
>
>
>
>
> --
> *Lewis*
>
>
> --
>
>
> *Ed Parsons *FRGS
> Geospatial Technologist, Google
>
> +44 7825 382263 @edparsons
> www.edparsons.com
>

Received on Tuesday, 4 April 2017 22:43:39 UTC