Re: Exposing datasets with DCAT (partitioning, subsets..)

Thanks Jon, much obliged!

Ed


On Wed, 5 Apr 2017 at 10:28 Jon Blower <j.d.blower@reading.ac.uk> wrote:

> Hi Ed,
>
>
>
> Maik’s no longer at that email address, but I talked to him off-line and
> he’s happy for you to mark the comment as closed.
>
>
>
> Cheers,
> Jon
>
>
>
> *From: *Ed Parsons <eparsons@google.com>
> *Date: *Tuesday, 4 April 2017 14:04
>
>
> *To: *Maik Riechert <m.riechert@reading.ac.uk>
> *Cc: *"public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
> *Subject: *Re: Exposing datasets with DCAT (partitioning, subsets..)
>
> *Resent-From: *<public-sdw-comments@w3.org>
> *Resent-Date: *Tuesday, 4 April 2017 14:05
>
>
>
> Hello Maik,
>
>
>
> I am working my way through the public comments made to the Spatial Data
> on the Web Working Group prior to the release of the final draft of the
> Best Practice document; the current version is here: https://www.w3.org/TR/sdw-bp/
>
>
>
> Although there was a good email discussion of subsetting - creating
> partitions of larger datasets using DCAT, amongst other approaches - the
> topic was not taken up much further within the Best Practice deliverable.
> There is continuing work on publishing coverage data, which is likely to
> continue within a future W3C/OGC activity after this working group ends in
> June.
>
>
>
> Would you therefore allow me to mark this comment against the Best
> Practice document as closed?
>
>
>
> Many thanks for your contribution.
>
>
>
> Ed
>
> Co-Chair W3C/OGC Spatial Data on the Web Working Group
>
>
>
> On Wed, 10 Feb 2016 at 21:59 Rob Atkinson <rob@metalinkage.com.au> wrote:
>
> I was really thinking about being parsimonious with the information
> management - having the smallest number of manual (meta)data curation tasks
> and the maximum consistency and usefulness in the derived information. This
> smacks of OLAP: data warehouses to meet user needs, but highly normalised
> transactional backends to manage the data in the fastest, most reliable and
> cheapest way.
>
>
>
> How the DCAT description is best realised is a separate issue - and this
> depends on what operations clients expect to be able to perform using it
> (what are the use cases here?).
>
>
>
> rob
>
>
>
> On Wed, 10 Feb 2016 at 23:07 Maik Riechert <m.riechert@reading.ac.uk>
> wrote:
>
> See below
>
>
>
> On 04/02/2016 03:39, Rob Atkinson wrote:
>
>
>
> As a straw man...
>
>
>
> let's nicely describe a dimensional dataset (i.e. one we can subset on
> ranges along any dimension) - it's kind of nice to use RDF-QB for this, as
> we can describe dimensions using SKOS, OWL etc. - all very powerful and a
> lot more useful than DCAT for machines wanting to use the data.
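>
> A minimal RDF Data Cube sketch of that kind of dimension description (the
> ex: URIs and the particular dimensions are hypothetical, purely for
> illustration):
>
>   @prefix qb:   <http://purl.org/linked-data/cube#> .
>   @prefix skos: <http://www.w3.org/2004/02/skos/core#> .
>   @prefix ex:   <http://example.org/> .
>
>   # Structure of the cube: one temporal and one spatial dimension.
>   ex:sst-dsd a qb:DataStructureDefinition ;
>       qb:component [ qb:dimension ex:refTime ] ,
>                    [ qb:dimension ex:region ] .
>
>   ex:refTime a qb:DimensionProperty ;
>       skos:prefLabel "Reference time"@en .
>
>   ex:region a qb:DimensionProperty ;
>       skos:prefLabel "Region"@en ;
>       qb:codeList ex:regionScheme .    # the subsettable regions, as SKOS
>
>   ex:regionScheme a skos:ConceptScheme ;
>       skos:prefLabel "Sea regions"@en .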
>
>
>
> (If DCAT is for cataloguing and discovery, then we should not overload it
> with the description that RDF-QB can provide.)
>
>
>
> So let's say we generate a dataset on the fly via an API (pre-prepared
> subsets provided as files are just a case of doing this at a different
> point in the delivery chain).
>
>
>
> I would think it would be possible to take a DCAT and an RDF-QB description
> and generate a DCAT description for each subset - provided your description
> of the dimensions is good enough to define the granularity of access. So the
> question of how to do it might boil down to whether there is enough
> information to generate a new DCAT record on the fly...
>
> Just for clarification: by “DCAT record”, do you mean a DCAT dataset or a
> DCAT distribution? I would say the latter.
>
> Cheers
>
>
> Maik
>
>
>
>
>
>
> This needs more thought than I am giving it here - but I would have
> thought there should be enough information in such a DCAT record to (see
> the sketch after this list):
>
> a) distinguish it from other subsets, and allow a search using the
> dimensions of the original dataset to find the DCAT record in a large
> catalog;
>
> b) collate such subsets and rebuild the original data cube and its
> metadata (i.e. the domain of each dimension of the subset is retained,
> but its range is made explicit);
>
> c) define how it relates to the original dataset and the methods used to
> subset the data - making it possible to re-create the dataset.
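>
> A minimal Turtle sketch of such a generated record, covering (a)-(c);
> dct:isPartOf and prov:wasDerivedFrom are standard terms, while
> ex:subsetMethod is a hypothetical placeholder, not agreed vocabulary:
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix dct:  <http://purl.org/dc/terms/> .
>   @prefix prov: <http://www.w3.org/ns/prov#> .
>   @prefix ex:   <http://example.org/> .
>
>   ex:sst-2015-northsea a dcat:Dataset ;
>       dct:title "SST subset: North Sea, 2015" ;
>       # (a) discoverable: the record states its own dimension ranges
>       dct:temporal ex:year2015 ;
>       dct:spatial ex:northSea ;
>       # (b) collatable: an explicit link back to the parent cube
>       dct:isPartOf ex:sst-full ;
>       # (c) reproducible: how the subset was derived
>       prov:wasDerivedFrom ex:sst-full ;
>       ex:subsetMethod ex:boundingBoxExtraction .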
>
>
>
> If DCAT can be used safely in these modes, then how to use DCAT to
> describe data subsets should be clear. If you cannot support these
> approaches, then IMHO you are better off avoiding DCAT, treating subsets as
> datasets, and moving to a different information model designed explicitly
> for this.
>
>
>
> Rob Atkinson
>
>
>
>
>
>
>
>
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>
> wrote:
>
> Hi Jon,
>
> I agree completely here. Many times we are 'forced' to partition data due
> to the availability of, or improvements in, query techniques... or simply
> requests from customers!
>
> Our data(set) partitioning strategies are dependent on a multitude of data
> modeling assumptions and decisions. They can also be determined by the
> hardware and software we are using to persist and query the data.
>
> Lewis
>
>
>
> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk>
> wrote:
>
> Hi all,
>
>
>
> Just to chip in - I think that dataset partitioning is *not* (necessarily)
> intrinsic to the dataset [1], but is a property of data distribution (hence
> perhaps in scope for DCAT). A dataset might be partitioned differently
> depending on user preference. Some users may prefer a geographic
> partitioning, others may prefer a temporal partitioning. Still others might
> want to partition by variable. One can imagine different catalogues serving
> the “same” data to different users in different ways (and in fact this does
> happen with large-volume geographic data like satellite imagery or global
> models).
>
>
>
> “I like to think about dataset partitioning as something simple, needing
> only three semantic ingredients: being able to say that a resource is a
> dataset, and being able to point to subsets and supersets.”
>
>
>
> I agree with this. I think this is the “level zero” requirement for
> partitioning.
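>
> In Turtle, that level-zero requirement needs very little. A minimal
> sketch, assuming dct:hasPart/dct:isPartOf as the subset/superset links
> (one option among several; the ex: identifiers are made up):
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix dct:  <http://purl.org/dc/terms/> .
>   @prefix ex:   <http://example.org/> .
>
>   ex:whole a dcat:Dataset ;        # "this resource is a dataset"
>       dct:hasPart ex:subset1 .     # "it has this subset"
>
>   ex:subset1 a dcat:Dataset ;
>       dct:isPartOf ex:whole .      # "its superset is ex:whole"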
>
>
>
> [1] Actually, it probably depends on what you mean by "the dataset". If
> you mean the logical entity, then the partitioning is not a property of the
> dataset. But if you regard the dataset as a set of physical files then
> maybe the partitioning *is* a property of the dataset.
>
>
>
> Cheers,
>
> Jon
>
>
>
>
>
> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>
>
>
> Hi Frans,
>
> In my opinion, it all depends on how the actual data is made available. If
> it's a nice (possibly standard) API, then just link that as a distribution
> and, I would say, you're done. Clients can explore subsets etc. through that
> API (which should itself be self-describing and doesn't need any further
> metadata at the Distribution level, except the media type if possible).
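>
> For the API case, a sketch could be as small as this (the endpoint URL and
> media type are illustrative only):
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix ex:   <http://example.org/> .
>
>   ex:obs a dcat:Dataset ;
>       dcat:distribution [
>           a dcat:Distribution ;
>           dcat:accessURL <http://example.org/api/obs> ;    # self-describing API
>           dcat:mediaType "application/prs.coverage+json"   # if known
>       ] .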
>
> However, if you really *just* have a bunch of files - which is quite common
> and may be OK depending on data volume and intended users - then it gets
> more complicated if you want to allow efficient machine access without first
> forcing clients to download everything.
>
> So, yes, partitioning is intrinsic to the dataset, and that detail is
> exposed in DCAT to allow more efficient access, both for humans and
> machines. It is an optimization in the end, but in my opinion quite a
> useful one.
>
> I wonder how many different partition strategies are really used in the
> wild.
>
> Cheers
> Maik
>
> On 03/02/2016 at 11:13, Frans Knibbe wrote:
>
> Hello Andrea, all,
>
>
>
> I like to think about dataset partitioning as something simple, needing
> only three semantic ingredients: being able to say that a resource is a
> dataset, and being able to point to subsets and supersets. DCAT does not
> seem necessary for those three. Is there really a need to see dataset
> partitioning as DCAT territory? DCAT is a vocabulary for data catalogs; I
> see dataset partitioning as something intrinsic to the dataset - its
> structure.
>
>
>
> That said, data about the structure of a dataset is metadata, so it is
> interesting to think about how data and metadata are coupled. For easy
> navigation through the structure (by either man or machine) it is probably
> best to keep the data volume small - metadata only. But it would be nice to
> have the option to get the actual data from any dataset (at any structural
> level). That means that additional elements are needed: an indication of
> ways to get the actual data, dcat:Distribution for instance. An indication
> of the size of the actual data would also be very useful, to help decide
> whether to get the data or to dig a bit deeper for smaller subsets. Only at
> the finest level of the structure - the leaves of the tree - could the
> actual data be returned by default. A friendly data provider will take care
> that those subsets contain manageable volumes of data.
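>
> A sketch of that idea, using dct:hasPart/dct:isPartOf for the structural
> links and dcat:byteSize as the size hint (standard terms, but this pattern
> is a proposal in this thread, not settled practice):
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix dct:  <http://purl.org/dc/terms/> .
>   @prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
>   @prefix ex:   <http://example.org/> .
>
>   # Intermediate node: metadata only, no data returned by default.
>   ex:precip a dcat:Dataset ;
>       dct:hasPart ex:precip-2015 .
>
>   # Leaf: small enough to fetch, so it carries a distribution and a size.
>   ex:precip-2015 a dcat:Dataset ;
>       dct:isPartOf ex:precip ;
>       dcat:distribution [
>           a dcat:Distribution ;
>           dcat:downloadURL <http://example.org/data/precip-2015.nc> ;
>           dcat:byteSize "73400320"^^xsd:decimal   # ~70 MB: helps clients decide
>       ] .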
>
>
>
> My thoughts have little basis in practice, but I am trying to set up an
> experiment with spatially partitioned data. I think there are many
> interesting possibilities. I hope to be able to share something practical
> with the group soon.
>
>
>
> Regards,
>
> Frans
>
>
>
>
>
>
>
>
>
>
>
> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>
> Many thanks for sharing this work, Maik!
>
> Just a couple of notes from my side:
>
> 1. Besides temporal coverage, it may be worth adding spatial coverage to
> your scenarios as another criterion of dataset partitioning. Actually, the
> two criteria are frequently used concurrently.
>
> 2. In many of the scenarios you describe, dataset subsets are modelled as
> datasets. An alternative would be to model them just as distributions. So,
> I wonder whether those scenarios have requirements that cannot be met by
> the latter option.
>
> Some more words on point (2):
>
> As you probably know, there has been quite a long discussion in the
> DCAT-AP WG concerning this issue. The main points are probably summarised
> in the conversation recorded here:
>
>
> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>
> Of course, in DCAT-AP the objective was how to describe dataset subsets,
> not to define criteria for dataset subsetting.
>
> Notably, the discussion highlighted two different approaches: (a) dataset
> subsets modelled as datasets or (b) dataset subsets modelled simply as
> distributions.
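>
> Side by side in Turtle, the two options might look like this (a sketch
> only; all identifiers are made up):
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix dct:  <http://purl.org/dc/terms/> .
>   @prefix ex:   <http://example.org/> .
>
>   # (a) subsets modelled as child datasets
>   ex:parent a dcat:Dataset ;
>       dct:hasPart ex:child-2015 .
>   ex:child-2015 a dcat:Dataset ;
>       dcat:distribution ex:child-2015-file .
>
>   # (b) subsets modelled as distributions of the one dataset
>   ex:parent2 a dcat:Dataset ;
>       dcat:distribution ex:slice-2015 , ex:slice-2016 .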
>
> I don't see the two approaches above as mutually exclusive. You can use one
> or the other depending on your use case and requirements. And you can use
> both (e.g., referring to point (1): time-related subsets modelled as child
> datasets, and their space-related subsets as distributions). However, I
> personally favour the idea of using distributions as the recommended
> option, and datasets only if you cannot do otherwise. In particular, I see
> two main issues with the dataset-based approach:
>
> - It adds an extra step to get to the data (dataset -> dataset ->
> distribution). Moreover, subsetting can be recursive, which further
> increases the number of steps needed to get to the data.
>
> - I understand that your focus is on data discovery from a machine
> perspective. However, looking at how this will be reflected in catalogues
> used by people, the result is that you're going to have a record for each
> child dataset, in addition to the parent one. This scenario is quite
> typical nowadays (I know of quite a few examples of tens of records with
> the same title, description, etc. - or only slightly different ones), and
> it doesn't at all help people trying to find what they're looking for.
>
> Thanks
>
> Andrea
>
>
>
>
> On 02/02/2016 12:02, Maik Riechert wrote:
>
> Hi all,
>
> There has been a lot of discussion about subsetting data. I'd like to
> give a slightly different perspective, motivated purely by the point of
> view of someone who wants to publish data and, in parallel, someone who
> wants to discover and access that data without much hassle.
>
> Of course it is hard to think about all scenarios, so I picked what I
> think are common ones:
> - a bunch of static data files without any API
> - an API without static data files
> - both
>
> And then some specific variations on what structure the data has (yearly
> data files, daily files, or another dimension used as the splitting point,
> such as space).
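>
> For the first scenario (static yearly files, no API), one hedged sketch:
> each year becomes a distribution, with the year carried only in the title,
> since plain DCAT defines no temporal-coverage property for distributions:
>
>   @prefix dcat: <http://www.w3.org/ns/dcat#> .
>   @prefix dct:  <http://purl.org/dc/terms/> .
>   @prefix ex:   <http://example.org/> .
>
>   ex:temps a dcat:Dataset ;
>       dct:title "Daily temperatures, 2014-2015" ;
>       dcat:distribution
>           [ a dcat:Distribution ;
>             dct:title "2014 data file" ;
>             dcat:downloadURL <http://example.org/temps-2014.csv> ] ,
>           [ a dcat:Distribution ;
>             dct:title "2015 data file" ;
>             dcat:downloadURL <http://example.org/temps-2015.csv> ] .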
>
> It is in no way final or complete and may even be wrong, but here is
> what I came up with:
> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>
> So it always starts by looking at what data exists and how it is
> exposed; based on those constraints I tried to model it as DCAT
> datasets, sometimes with subdatasets. Again, this is motivated purely by a
> machine-access point of view. There may be other things to consider.
>
> The point of this wiki page is to have something concrete to discuss,
> rather than just abstract ideas. It should uncover problems, possible
> solutions, perspectives, etc.
>
> Happy to hear your thoughts,
> Maik
>
>
>
> --
> Andrea Perego, Ph.D.
> Scientific / Technical Project Officer
> European Commission DG JRC
> Institute for Environment & Sustainability
> Unit H06 - Digital Earth & Reference Data
> Via E. Fermi, 2749 - TP 262
> 21027 Ispra VA, Italy
>
> https://ec.europa.eu/jrc/
>
>
>
>
>
>
>
>
>
> --
>
> *Lewis*
>
>
>
> --
>
>
> *Ed Parsons *FRGS
> Geospatial Technologist, Google
>
> +44 7825 382263 @edparsons
> www.edparsons.com
>
-- 


*Ed Parsons *FRGS
Geospatial Technologist, Google

+44 7825 382263 @edparsons
www.edparsons.com

Received on Wednesday, 5 April 2017 11:25:25 UTC