- From: Maik Riechert <m.riechert@reading.ac.uk>
- Date: Thu, 4 Feb 2016 08:46:07 +0000
- To: Rob Atkinson <rob@metalinkage.com.au>, Lewis John Mcgibbney <lewis.mcgibbney@gmail.com>, Jon Blower <j.d.blower@reading.ac.uk>
- Cc: Frans Knibbe <frans.knibbe@geodan.nl>, Andrea Perego <andrea.perego@jrc.ec.europa.eu>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>
- Message-ID: <56B30FCF.3020708@reading.ac.uk>
Interesting!
Let's try it out...
(pseudo JSON-LD)
...DCAT...
"distributions": [{
"title": "Global hourly temperature for Jan 2016 as netCDF file",
"accessURL": "http://../data/2012-01.nc",
"qb:slice": {
"qb:sliceStructure": {
"qb:componentProperty": "eg:refPeriod"
},
"eg:refPeriod": {
"type": "Interval",
"hasBeginning": {
"inXSDDateTime": "2016-01-01T00:00:00"
},
"hasEnd": {
"inXSDDateTime": "2016-02-01T00:00:00"
}
}
}
}]
(eg: is a custom namespace)
I left out the qb:DataStructureDefinition since I don't think it's
really needed here.
There are some challenges, but I can see how this could work. The main
one is that common dimensions (like eg:refPeriod) would have to be
defined somewhere if they don't already exist. Also, datasets often
have separate spatial dimensions like X and Y (e.g. lat/long), but in
the above they would very likely be grouped into a single spatial
dimension.
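
For example, a spatial slice might look like this (just a sketch:
eg:spatialExtent and the bounding box fields are made-up names, not
existing vocabulary terms):

"qb:slice": {
  "qb:sliceStructure": {
    "qb:componentProperty": "eg:spatialExtent"
  },
  "eg:spatialExtent": {
    "type": "BoundingBox",
    "westLongitude": -10.0,
    "eastLongitude": 10.0,
    "southLatitude": 40.0,
    "northLatitude": 60.0
  }
}
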
Cheers
Maik
On 4 Feb 2016, at 03:39, Rob Atkinson wrote:
>
> As a straw man...
>
> let's nicely describe a dimensional dataset (i.e. we can subset on
> ranges on any dimension) - it's kind of nice to use RDF-QB for this,
> as we can describe dimensions using SKOS, OWL etc. - all very powerful
> and a lot more useful than DCAT for machines that need to use the data.
>
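> For instance, a dimension could be described roughly like this
> (pseudo JSON-LD; the names are illustrative, not a fixed scheme):
>
> "eg:refPeriod": {
>   "@type": "qb:DimensionProperty",
>   "qb:concept": "sdmx-concept:refPeriod",
>   "rdfs:label": "reference period",
>   "rdfs:range": "time:Interval"
> }
>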
> (If DCAT is for cataloguing and discovery - then we should not
> overload it with the description that RDF-QB can provide.)
>
> so let's say we generate a dataset on the fly via an API (pre-prepared
> subsets provided as files are just a case of doing this at a different
> point in the delivery chain).
>
> I would think it would be possible to take a DCAT description and an
> RDF-QB description and generate a DCAT description for each subset -
> provided your description of the dimension is good enough to define the
> granularity of access. So the question of how to do it might boil down
> to whether there is enough information to generate a new DCAT record on
> the fly.
>
> This needs more thought than I am giving it here - but I would have
> thought there should be enough information in such a DCAT record to:
> a) distinguish it from other subsets, and allow a search using the
> dimensions of the original dataset to find the DCAT record in a large
> catalog;
> b) collate such subsets and rebuild the original data cube and its
> metadata (i.e. the domain of each dimension of the subset is retained,
> but its range is made explicit);
> c) define how it relates to the original dataset and the methods used
> to subset the data, to make it possible to re-create the dataset (see
> the sketch below).
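>
> As a rough sketch (pseudo JSON-LD; all names and URLs are
> illustrative), such a generated record might contain:
>
> "dataset": {
>   "title": "Global hourly temperature, subset for Jan 2016",
>   "dct:isPartOf": "http://example.org/dataset/global-temperature",
>   "eg:refPeriod": "2016-01",
>   "distributions": [{
>     "accessURL": "http://example.org/data/2016-01.nc"
>   }]
> }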
>
> If DCAT can be used safely in these modes, then how to use DCAT to
> describe data subsets should be clear. If these approaches cannot be
> supported, then IMHO you are better off avoiding DCAT, treating
> subsets as datasets, and moving to a different information model
> designed explicitly for this.
>
> Rob Atkinson
>
>
>
>
> On Thu, 4 Feb 2016 at 04:55 Lewis John Mcgibbney
> <lewis.mcgibbney@gmail.com> wrote:
>
> Hi Jon,
> I agree completely here. Many times we are 'forced' to partition
> data due to the availability of, or improvements in, query
> techniques... or simply due to requests from customers!
> Our data(set) partitioning strategies are dependent on a multitude
> of data modeling assumptions and decisions. They can also be
> determined by the hardware and software we are using to persist
> and query the data.
> Lewis
>
> On Wed, Feb 3, 2016 at 4:56 AM, Jon Blower
> <j.d.blower@reading.ac.uk> wrote:
>
> Hi all,
>
> Just to chip in - I think that dataset partitioning is *not*
> (necessarily) intrinsic to the dataset [1], but is a property
> of data distribution (hence perhaps in scope for DCAT). A
> dataset might be partitioned differently depending on user
> preference. Some users may prefer a geographic partitioning,
> others may prefer a temporal partitioning. Still others might
> want to partition by variable. One can imagine different
> catalogues serving the “same” data to different users in
> different ways (and in fact this does happen with large-volume
> geographic data like satellite imagery or global models).
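>
> As a minimal sketch (pseudo JSON-LD; titles and URLs are
> illustrative), the "same" dataset could offer both partitionings
> side by side as distributions:
>
> "dataset": {
>   "title": "Global hourly temperature",
>   "distributions": [
>     { "title": "Jan 2016, global", "accessURL": "http://example.org/2016-01.nc" },
>     { "title": "All years, Europe", "accessURL": "http://example.org/europe.nc" }
>   ]
> }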
>
>>> like to think about dataset partitioning as something simple,
>>> needing only three semantic ingredients: being able to say
>>> that a resource is a dataset, and being able to point to
>>> subsets and supersets.
>
> I agree with this. I think this is the “level zero”
> requirement for partitioning.
>
> [1] Actually, it probably depends on what you mean by "the
> dataset". If you mean the logical entity, then the
> partitioning is not a property of the dataset. But if you
> regard the dataset as a set of physical files then maybe the
> partitioning *is* a property of the dataset.
>
> Cheers,
> Jon
>
>
>> On 3 Feb 2016, at 11:34, Maik Riechert
>> <m.riechert@reading.ac.uk>
>> wrote:
>>
>> Hi Frans,
>>
>> In my opinion, it all depends on how the actual data is made
>> available. If it's a nice (possibly standard) API, then just
>> link it as a distribution and, I would say, you're done.
>> Clients can explore subsets etc. through that API (which should
>> itself be self-describing and doesn't need any further metadata
>> at the Distribution level, except the media type if possible).
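>>
>> For instance (a sketch - the URL and media type are made up):
>>
>> "distributions": [{
>>   "title": "Temperature data API",
>>   "accessURL": "http://example.org/api/temperature",
>>   "mediaType": "application/json"
>> }]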
>>
>> However, if you really *just* have a bunch of files - as is
>> quite common, and may be OK depending on data volume and
>> intended users - then it gets more complicated if you want to
>> allow efficient machine access without clients first being
>> forced to download everything.
>>
>> So, yes, partitioning is intrinsic to the dataset, and that
>> detail is exposed through DCAT to allow more efficient access
>> for both humans and machines. It is an optimization in the end,
>> but in my opinion quite a useful one.
>>
>> I wonder how many different partition strategies are really
>> used in the wild.
>>
>> Cheers
>> Maik
>>
>> On 3 Feb 2016, at 11:13, Frans Knibbe wrote:
>>> Hello Andrea, all,
>>>
>>> I like to think about dataset partitioning as something
>>> simple, needing only three semantic ingredients: being able
>>> to say that a resource is a dataset, being able to point to
>>> subsets, and being able to point to supersets. DCAT does not
>>> seem necessary for those three. Is there really a need to
>>> see dataset partitioning as DCAT territory? DCAT is a
>>> vocabulary for data catalogs; I see dataset partitioning as
>>> something intrinsic to the dataset - its structure.
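>>>
>>> In pseudo JSON-LD, those three ingredients could be as little
>>> as this (dct:hasPart and dct:isPartOf are picked just for
>>> illustration; the URLs are made up):
>>>
>>> "@id": "http://example.org/dataset/temperature-2016",
>>> "@type": "dcat:Dataset",
>>> "dct:isPartOf": "http://example.org/dataset/temperature",
>>> "dct:hasPart": "http://example.org/dataset/temperature-2016-01"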
>>>
>>> That said, data about the structure of a dataset is metadata,
>>> so it is interesting to think about how data and metadata
>>> are coupled. For easy navigation through the structure (by
>>> either man or machine) it is probably best to keep the data
>>> volume small - metadata only. But it would be nice to have
>>> the option to get the actual data from any dataset (at any
>>> structural level). That means that additional elements are
>>> needed: an indication of ways to get the actual data,
>>> dcat:Distribution for instance. An indication of the size of
>>> the actual data would also be very useful, to help decide
>>> whether to get the data or to dig a bit deeper for smaller
>>> subsets. Only at the deepest level of the structure, the
>>> leaves of the tree, could the actual data be returned by
>>> default. A friendly data provider will take care that those
>>> subsets contain manageable volumes of data.
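>>>
>>> A sketch of such a node (pseudo JSON-LD; dcat:byteSize is an
>>> existing DCAT property, the other names and URLs are
>>> illustrative):
>>>
>>> "dataset": {
>>>   "title": "Temperature, Europe, 2016",
>>>   "dct:hasPart": "http://example.org/dataset/temperature-2016-01",
>>>   "distributions": [{
>>>     "accessURL": "http://example.org/data/temperature-2016.nc",
>>>     "dcat:byteSize": 2500000000
>>>   }]
>>> }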
>>>
>>> My thoughts have little basis in practice, but I am trying
>>> to set up an experiment with spatially partitioned data. I
>>> think there are many interesting possibilities. I hope to be
>>> able to share something practical with the group soon.
>>>
>>> Regards,
>>> Frans
>>>
>>>
>>>
>>>
>>>
>>> 2016-02-03 10:05 GMT+01:00 Andrea Perego
>>> <andrea.perego@jrc.ec.europa.eu>:
>>>
>>> Many thanks for sharing this work, Maik!
>>>
>>> Just a couple of notes from my side:
>>>
>>> 1. Besides temporal coverage, it may be worth adding
>>> spatial coverage to your scenarios as another criterion
>>> for dataset partitioning. Actually, both criteria are
>>> frequently used concurrently.
>>>
>>> 2. In many of the scenarios you describe, dataset
>>> subsets are modelled as datasets. An alternative would
>>> be to model them just as distributions. So, I wonder
>>> whether those scenarios have requirements that cannot be
>>> met by the latter option.
>>>
>>> Some more words on point (2):
>>>
>>> As you probably know, there has been quite a long
>>> discussion in the DCAT-AP WG concerning this issue. The
>>> main points are probably summarised in the conversation
>>> recorded here:
>>>
>>> https://joinup.ec.europa.eu/asset/dcat_application_profile/issue/mo12-grouping-datasets
>>>
>>> Of course, in DCAT-AP the objective was how to describe
>>> dataset subsets, not what criteria to use for dataset
>>> subsetting.
>>>
>>> Notably, the discussion highlighted two different
>>> approaches: (a) dataset subsets modelled as datasets or
>>> (b) dataset subsets modelled simply as distributions.
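>>>
>>> As a rough sketch (pseudo JSON-LD; names and URLs are
>>> illustrative), option (a) nests datasets:
>>>
>>> "dataset": {
>>>   "title": "Global temperature",
>>>   "dct:hasPart": {
>>>     "title": "Global temperature, Jan 2016",
>>>     "distributions": [{ "accessURL": "http://example.org/2016-01.nc" }]
>>>   }
>>> }
>>>
>>> while option (b) keeps a single dataset with one distribution
>>> per subset:
>>>
>>> "dataset": {
>>>   "title": "Global temperature",
>>>   "distributions": [
>>>     { "title": "Jan 2016 subset", "accessURL": "http://example.org/2016-01.nc" },
>>>     { "title": "Feb 2016 subset", "accessURL": "http://example.org/2016-02.nc" }
>>>   ]
>>> }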
>>>
>>> I don't see the two scenarios above as mutually
>>> exclusive. You can use one or the other depending on
>>> your use case and requirements. And you can use both
>>> (e.g., referring to point (1): time-related subsets
>>> modelled as child datasets, and their space-related
>>> subsets as distributions). However, I personally favour
>>> the idea of using distributions as the recommended
>>> option, and datasets only if you cannot do otherwise. In
>>> particular, I see two main issues with the dataset-based
>>> approach:
>>>
>>> - It adds an additional step to get to the data
>>> (dataset -> dataset -> distribution). Moreover,
>>> subsetting can be recursive, which further increases the
>>> number of steps needed to get to the data.
>>>
>>> - I understand that your focus is on data discovery from
>>> a machine perspective. However, looking at how this will
>>> be reflected in catalogues used by people, the result is
>>> that you're going to have a record for each child
>>> dataset, in addition to the parent one. This scenario is
>>> quite typical nowadays (I know quite a few examples of
>>> tens of records having the same title, description, etc.
>>> - or just a slightly different one), and it doesn't help
>>> people trying to find what they're looking for at all.
>>>
>>> Thanks
>>>
>>> Andrea
>>>
>>>
>>>
>>> On 02/02/2016 12:02, Maik Riechert wrote:
>>>
>>> Hi all,
>>>
>>> There has been a lot of discussion about subsetting data. I'd
>>> like to give a slightly different perspective, motivated
>>> purely by the point of view of someone who wants to publish
>>> data and, in parallel, someone who wants to discover and
>>> access that data without much hassle.
>>>
>>> Of course it is hard to think about all scenarios, so I
>>> picked what I think are common ones:
>>> - a bunch of static data files without any API
>>> - an API without static data files
>>> - both
>>>
>>> And then some specific variations on what structure the data
>>> has (yearly data files, daily ones, or another dimension used
>>> as the splitting point, such as a spatial one).
>>>
>>> It is in no way final or complete and may even be wrong, but
>>> here is what I came up with:
>>> https://github.com/ec-melodies/wp02-dcat/wiki/DCAT-partitioning-ideas
>>>
>>> So it always starts by looking at what data exists and how it
>>> is exposed; based on those constraints, I tried to model that
>>> as DCAT datasets, sometimes with subdatasets. Again, it is
>>> motivated purely by a machine-access point of view. There may
>>> be other things to consider.
>>>
>>> The point of this wiki page is to have something concrete to
>>> discuss, not just abstract ideas. It should uncover problems,
>>> possibly solutions, perspectives, etc.
>>>
>>> Happy to hear your thoughts,
>>> Maik
>>>
>>>
>>> --
>>> Andrea Perego, Ph.D.
>>> Scientific / Technical Project Officer
>>> European Commission DG JRC
>>> Institute for Environment & Sustainability
>>> Unit H06 - Digital Earth & Reference Data
>>> Via E. Fermi, 2749 - TP 262
>>> 21027 Ispra VA, Italy
>>>
>>> https://ec.europa.eu/jrc/
>>>
>>>
>>
>
>
>
>
> --
> Lewis
>
Received on Thursday, 4 February 2016 08:46:38 UTC