- From: Ed Parsons <eparsons@google.com>
- Date: Wed, 05 Apr 2017 11:24:39 +0000
- To: Jon Blower <j.d.blower@reading.ac.uk>
- Cc: "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <CAHrFjcmek154pKKzFPpn7vmGo__B7r06RMguF9Pamy8g3PFYYQ@mail.gmail.com>
Thanks Jon, much obliged!

Ed

On Wed, 5 Apr 2017 at 10:28 Jon Blower <j.d.blower@reading.ac.uk> wrote:

> Hi Ed,
>
> Maik’s no longer at that email address, but I talked to him off-line and
> he’s happy for you to mark the comment as closed.
>
> Cheers,
> Jon
>
> *From: *Ed Parsons <eparsons@google.com>
> *Date: *Tuesday, 4 April 2017 14:04
> *To: *Maik Riechert <m.riechert@reading.ac.uk>
> *Cc: *"public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
> *Subject: *Re: Exposing datasets with DCAT (partitioning, subsets..)
> *Resent-From: *<public-sdw-comments@w3.org>
> *Resent-Date: *Tuesday, 4 April 2017 14:05
>
> Hello Maik,
>
> I am working my way through the public comments made to the Spatial Data
> on the Web Working Group prior to the release of the final draft of the
> Best Practice document, current version here: https://www.w3.org/TR/sdw-bp/
>
> Although there was a good email discussion of subsetting (creating
> partitions of larger datasets) using DCAT amongst other approaches, the
> topic was not taken up much further within the Best Practice
> deliverable. There is ongoing work on publishing coverage data, which is
> likely to continue within a future W3C/OGC activity after this working
> group ends in June.
>
> Would you therefore allow me to mark this comment against the Best
> Practice document as closed?
>
> Many thanks for your contribution.
>
> Ed
> Co-Chair, W3C/OGC Spatial Data on the Web Working Group
>
> On Wed, 10 Feb 2016 at 21:59 Rob Atkinson <rob@metalinkage.com.au> wrote:
>
> I was really thinking about being parsimonious with the information
> management: having the smallest number of manual (meta)data curation
> tasks and the maximum consistency and usefulness in the derived
> information. This smacks of OLAP with data warehouses to meet user
> needs, but highly normalised transactional backends to manage data in
> the most fast/reliable/cheap way.
>
> How the DCAT description is best realised is a separate issue - and
> this depends on what operations clients expect to be able to do using
> it (what are the use cases here?)
>
> Rob
>
> On Wed, 10 Feb 2016 at 23:07 Maik Riechert <m.riechert@reading.ac.uk>
> wrote:
>
> See below
>
> On 04/02/2016 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> Let's nicely describe a dimensional dataset (i.e. we can subset on
> ranges on any dimension). It's kind of nice to use RDF-QB for this, as
> we can describe dimensions using SKOS, OWL etc. - all very powerful and
> a lot more useful than DCAT for machines using the data.
>
> (If DCAT is for cataloguing and discovery, then we should not overload
> it with the description that RDF-QB can provide.)
>
> So let's say we generate a dataset on the fly via an API (pre-prepared
> subsets provided as files are just a case of doing this at a different
> point in the delivery chain).
>
> I would think it would be possible to take a DCAT and an RDF-QB
> description and generate a DCAT description for each subset, provided
> your description of the dimension is good enough to define the
> granularity of access. So the question of how to do it might boil down
> to whether there is enough information to generate a new DCAT record on
> the fly.
>
> Just for clarification: by "DCAT record", do you mean a DCAT dataset or
> a DCAT distribution? I would say the latter.
> Cheers
> Maik
>
> This needs more thought than I am giving it here - but I would have
> thought there should be enough information in such a DCAT record to:
>
> a) distinguish it from other subsets, and allow a search using the
> dimensions of the original dataset to find the DCAT record in a large
> catalog;
>
> b) collate such subsets and rebuild the original data cube and its
> metadata (i.e. the domain of each dimension of the subset is retained,
> but its range is made explicit);
>
> c) define how it relates to the original dataset and the methods used
> to subset the data, to make it possible to re-create the dataset.
>
> If DCAT can be used safely in these modes, then how to use DCAT to
> describe data subsets should be clear. If you cannot support these
> approaches, then IMHO you are better off avoiding DCAT, treating
> subsets as datasets, and moving to a different information model
> designed explicitly for this.
>
> Rob Atkinson
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney
> <lewis.mcgibbney@gmail.com> wrote:
>
> Hi Jon,
>
> I agree completely here. Many times we are 'forced' to partition data
> due to the availability of, or improvements in, query techniques... or
> simply requests from customers!
>
> Our data(set) partitioning strategies are dependent on a multitude of
> data modeling assumptions and decisions. They can also be determined by
> the hardware and software we are using to persist and query the data.
>
> Lewis
>
> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower <j.d.blower@reading.ac.uk>
> wrote:
>
> Hi all,
>
> Just to chip in - I think that dataset partitioning is *not*
> (necessarily) intrinsic to the dataset [1], but is a property of data
> distribution (hence perhaps in scope for DCAT). A dataset might be
> partitioned differently depending on user preference. Some users may
> prefer a geographic partitioning, others may prefer a temporal
> partitioning. Still others might want to partition by variable. One can
> imagine different catalogues serving the “same” data to different users
> in different ways (and in fact this does happen with large-volume
> geographic data like satellite imagery or global models).
>
> > I like to think about dataset partitioning as something simple,
> > needing only three semantic ingredients: being able to say that a
> > resource is a dataset, and being able to point to subsets and
> > supersets.
>
> I agree with this. I think this is the "level zero" requirement for
> partitioning.
>
> [1] Actually, it probably depends on what you mean by "the dataset". If
> you mean the logical entity, then the partitioning is not a property of
> the dataset. But if you regard the dataset as a set of physical files,
> then maybe the partitioning *is* a property of the dataset.
>
> Cheers,
> Jon
>
> On 3 Feb 2016, at 11:34, Maik Riechert <m.riechert@reading.ac.uk> wrote:
>
> Hi Frans,
>
> In my opinion, it all depends on how the actual data is made available.
> If it's a nice (possibly standard) API, then just link that as a
> distribution and you're done, I would say. Clients can explore subsets
> etc. through that API (which in itself should be self-describing and
> doesn't need any further metadata at the Distribution level, except
> media type if possible).
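A minimal Turtle sketch of the API-as-distribution pattern described
above; the dataset URI, API endpoint, and media type are hypothetical
placeholders, not part of the original discussion:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .

  # One dataset, one distribution pointing at a self-describing API.
  # Subset navigation is delegated to the API, so the catalogue carries
  # no per-subset metadata.
  <http://example.org/dataset/sea-surface-temperature>
      a dcat:Dataset ;
      dct:title "Sea surface temperature" ;
      dcat:distribution [
          a dcat:Distribution ;
          dcat:accessURL <http://example.org/api/sst> ;
          dcat:mediaType "application/prs.coverage+json"
      ] .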
> However, if you really *just* have a bunch of files, as is quite common
> and may be OK depending on data volume and intended users, then it gets
> more complicated if you want to allow efficient machine access without
> first being forced to download everything.
>
> So, yes, partitioning is intrinsic to the dataset, and that detail is
> exposed to DCAT to allow more efficient access, both for humans and
> machines. It is an optimization in the end, but in my opinion a quite
> useful one.
>
> I wonder how many different partition strategies are really used in the
> wild.
>
> Cheers
> Maik
>
> Am 03.02.2016 um 11:13 schrieb Frans Knibbe:
>
> Hello Andrea, all,
>
> I like to think about dataset partitioning as something simple, needing
> only three semantic ingredients: being able to say that a resource is a
> dataset, and being able to point to subsets and supersets. DCAT does
> not seem necessary for those three. Is there really a need to see
> dataset partitioning as DCAT territory? DCAT is a vocabulary for data
> catalogs; I see dataset partitioning as something intrinsic to the
> dataset - its structure.
>
> That said, data about the structure of a dataset is metadata, so it is
> interesting to think about how data and metadata are coupled. For easy
> navigation through the structure (by either man or machine) it is
> probably best to keep the data volume small - metadata only. But it
> would be nice to have the option to get the actual data from any
> dataset (at any structural level). That means that additional elements
> are needed: an indication of ways to get the actual data,
> dcat:Distribution for instance. Also, an indication of the size of the
> actual data would be very useful, to help decide whether to get the
> data or to dig a bit deeper for smaller subsets. Only at the deepest
> level of the structure, the leaves of the tree, could the actual data
> be returned by default. A friendly data provider will take care that
> those subsets contain manageable volumes of data.
>
> My thoughts have little basis in practice, but I am trying to set up an
> experiment with spatially partitioned data. I think there are many
> interesting possibilities. I hope to be able to share something
> practical with the group soon.
>
> Regards,
> Frans
>
> 2016-02-03 10:05 GMT+01:00 Andrea Perego <andrea.perego@jrc.ec.europa.eu>:
>
> Many thanks for sharing this work, Maik!
>
> Just a couple of notes from my side:
>
> 1. Besides temporal coverage, it may be worth adding to your scenarios
> spatial coverage as another criterion of dataset partitioning.
> Actually, both criteria are frequently used concurrently.
>
> 2. In many of the scenarios you describe, dataset subsets are modelled
> as datasets. An alternative would be to model them just as
> distributions. So, I wonder whether those scenarios have requirements
> that cannot be met by the latter option.
>
> Some more words on point (2):
>
> As you probably know, there has been quite a long discussion in the
> DCAT-AP WG concerning this issue. The main points are probably
> summarised in the conversation recorded here:
>
> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>
> Of course, in DCAT-AP the objective was how to describe dataset
> subsets, not to define criteria for dataset subsetting.
>
> Notably, the discussion highlighted two different approaches: (a)
> dataset subsets modelled as datasets, or (b) dataset subsets modelled
> simply as distributions.
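For concreteness, a minimal Turtle sketch of the two approaches, using
hypothetical URIs and a yearly subset as the example; the two blocks
are alternatives, written against separate dataset URIs so they can be
read side by side:

  @prefix dcat: <http://www.w3.org/ns/dcat#> .
  @prefix dct:  <http://purl.org/dc/terms/> .

  # (a) Subsets modelled as child datasets, linked in both directions.
  <http://example.org/dataset/temperature>
      a dcat:Dataset ;
      dct:hasPart <http://example.org/dataset/temperature-2015> .

  <http://example.org/dataset/temperature-2015>
      a dcat:Dataset ;
      dct:isPartOf <http://example.org/dataset/temperature> ;
      dct:temporal <http://reference.data.gov.uk/id/year/2015> ;
      dcat:distribution [
          a dcat:Distribution ;
          dcat:downloadURL <http://example.org/files/temperature-2015.nc>
      ] .

  # (b) Subsets modelled simply as distributions of a single dataset.
  # Note: DCAT defines dct:temporal on datasets, not on distributions,
  # so here the subsetting criterion survives only informally in the
  # title.
  <http://example.org/dataset/temperature-alt>
      a dcat:Dataset ;
      dcat:distribution [
          a dcat:Distribution ;
          dct:title "Temperature, 2015 subset" ;
          dcat:downloadURL <http://example.org/files/temperature-2015.nc>
      ] .

Approach (a) also amounts to the "three semantic ingredients" Frans
describes above: typing a resource as a dataset, plus part-of links
pointing to subsets and supersets.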
> I don't see the two scenarios above as mutually exclusive. You can use
> one or the other depending on your use case and requirements. And you
> can use both (e.g., referring to point (1): time-related subsets
> modelled as child datasets, and their space-related subsets as
> distributions). However, I personally favour the idea of using
> distributions as the recommended option, and datasets only if you
> cannot do otherwise. In particular, I see two main issues with the
> dataset-based approach:
>
> - It adds an additional step to get to the data (dataset -> dataset ->
> distribution). Moreover, subsetting can be recursive, which increases
> the number of steps needed to get to the data.
>
> - I understand that your focus is on data discovery from a machine
> perspective. However, looking at how this will be reflected in
> catalogues used by people, the result is that you're going to have a
> record for each child dataset, in addition to the parent one. This
> scenario is quite typical nowadays (I know quite a few examples of tens
> of records having the same title, description, etc. - or just a
> slightly different one), and it doesn't help at all people trying to
> find what they're looking for.
>
> Thanks
> Andrea
>
> On 02/02/2016 12:02, Maik Riechert wrote:
>
> Hi all,
>
> There has been a lot of discussion about subsetting data. I'd like to
> give a slightly different perspective, purely motivated from the point
> of view of someone who wants to publish data, and in parallel someone
> who wants to discover and access that data without much hassle.
>
> Of course it is hard to think about all scenarios, so I picked what I
> think are common ones:
>
> - a bunch of static data files without any API
> - an API without static data files
> - both
>
> And then some specific variations on what structure the data has
> (yearly data files, daily, or another dimension used as a splitting
> point, such as spatial).
>
> It is in no way final or complete and may even be wrong, but here is
> what I came up with:
> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>
> So it always starts by looking at what data exists and how it is
> exposed, and based on those constraints I tried to model that as DCAT
> datasets, sometimes with sub-datasets. Again, it is purely motivated
> from a machine-access point of view. There may be other things to
> consider.
>
> The point of this wiki page is to have something concrete to discuss,
> not just abstract ideas. It should uncover problems, possibly
> solutions, perspectives, etc.
>
> Happy to hear your thoughts,
> Maik
>
> --
> Andrea Perego, Ph.D.
> Scientific / Technical Project Officer
> European Commission DG JRC
> Institute for Environment & Sustainability
> Unit H06 - Digital Earth & Reference Data
> Via E. Fermi, 2749 - TP 262
> 21027 Ispra VA, Italy
> https://ec.europa.eu/jrc/
>
> --
> *Lewis*
>
> --
> *Ed Parsons* FRGS
> Geospatial Technologist, Google
> +44 7825 382263 @edparsons
> www.edparsons.com

--
*Ed Parsons* FRGS
Geospatial Technologist, Google
+44 7825 382263 @edparsons
www.edparsons.com
Received on Wednesday, 5 April 2017 11:25:25 UTC