- From: Frans Knibbe <frans.knibbe@geodan.nl>
- Date: Tue, 5 Jan 2016 09:30:16 +0100
- To: Annette Greiner <amgreiner@lbl.gov>
- Cc: Jon Blower <j.d.blower@reading.ac.uk>, Peter Baumann <p.baumann@jacobs-university.de>, Phil Archer <phila@w3.org>, Manolis Koubarakis <koubarak@di.uoa.gr>, "public-sdw-comments@w3.org" <public-sdw-comments@w3.org>, Eric Stephan <ericphb@gmail.com>, "Tandy, Jeremy" <jeremy.tandy@metoffice.gov.uk>, "public-dwbp-comments@w3.org" <public-dwbp-comments@w3.org>
- Message-ID: <CAFVDz40CjU6QjJfTdYy2bnXi8oWNBO_9304UMiOz89fHVh5r+w@mail.gmail.com>
Hello Annette,

I think the two types of subsetting I mentioned earlier (a dataset
partition that is predetermined by the data provider, and subsets that are
based on *ad hoc* queries using an API) each have their advantages so the
best situation would be for them to coexist.

Greetings,
Frans

2016-01-04 20:14 GMT+01:00 Annette Greiner <amgreiner@lbl.gov>:

> Funny you should mention worldwide temperature data since 1901. That
> exists at NERSC, and it's even more complex.
> http://portal.nersc.gov/project/20C_Reanalysis/
> contains data with many more potentially organizing variables than lat/lon
> and time. Making every potential organization of the data available
> separately would be a waste of time. Instead, we allow the user to make a
> query via a web form that retrieves (via Opendap) the data they want. They
> can choose a specific measure of interest and get data over a range
> combining lats, lons, ensemble members, and times.
> This is what I mean by enabling retrieval of subsets.
>
> The organization by measurement type works for our users, but one might
> want to grab different subsets if one's users were, say the residents of a
> certain locale who wanted to look at temperature changes in their home
> town. That would not be worth the effort for us, but someone else could
> come along and download the relevant larger subsets and reuse the data in
> that way. I guess what I'm saying is that one has to strike a balance
> between making the data easy to access for the expected uses and enabling
> others to take things further with a little more work.
> -Annette
>
>
> On 1/4/16 10:05 AM, Frans Knibbe wrote:
>
> 2016-01-04 17:57 GMT+01:00 Jon Blower <j.d.blower@reading.ac.uk>:
>
>> Hi all,
>>
>> [snip]
>>
>> For what it’s worth, I agree with Frans’ view of starting with a very
>> simple approach to subsetting. The first problem to solve is simply how to
>> record that Subset A is a part of Dataset X (and vice versa), assuming that
>> both can be identified (with URIs). In MELODIES we’ve been using
>> dct:isPartOf and dct:hasPart for this, but we’re just experimenting. Both
>> the subset and the parent dataset can be typed as Datasets.
>>
>
> Dct:Dataset, I presume? It is nice to see that a single vocabulary
> already provides the main ingredients. Although perhaps some reasoning is
> required to handle datasets with subsets (if a dataset has dct:isPartOf
> properties it is a subset, if it has dct:hasPart properties it is a
> superset, if it only has dct:hasPart properties it is a root dataset),
> something that perhaps not all data consumers can be expected to perform?
>
> I do wonder if there already is a good way of describing the
> subsetting method, or the fact that multiple subsetting methods are used.
> Perhaps that is not vital information, but it could be useful. Suppose
> there is a dataset containing worldwide temperature values since 1901.
> Both temporal and spatial subdivisions would make sense, so a data provider
> could decide to make subsets available by year and by country. Each
> partitioning tree would encompass all data, so if it is not made clear that
> two partitioning schemes are used for the same dataset a not-so-smart
> consumer might wind up downloading the data twice.
>
> Greetings,
> Frans
>
>> Cheers,
>> Jon
>>
>> On 4 Jan 2016, at 16:36, Peter Baumann <p.baumann@jacobs-university.de> wrote:
>>
>> Frans-
>>
>> On 2016-01-04 13:48, Frans Knibbe wrote:
>>
>> 2016-01-04 13:20 GMT+01:00 Peter Baumann <p.baumann@jacobs-university.de>:
>>
>>> Hi Frans,
>>>
>>> data partitioning is an implementation detail which serves to quicker
>>> determine the subset.
>>>
>>
>> But that does not need to be the only reason for dataset partitioning.
>> I can think of some others:
>>
>> 1. It could allow complete automatic retrieval of a complete dataset
>>    without the need for a data dump (in a specific format) and without the
>>    need for a specialized API. That could be very important for building
>>    remote indexes (e.g. by a search engine);
>>
>> IMHO a search engine will not want to download full datasets - and if so,
>> it will not be interested in how such a large dataset is split internally.
>> Compare to an HTML file, who would want to know about the file system
>> blocks it is split into?
>>
>> 2. It allows automatic representation of the data in a human friendly
>>    way (small enough HTML pages with annotation and navigation);
>>
>> while there might be special cases where you have a textual
>> representation this is not the case with most data, such as sets of vectors
>> or pixels. An image is best represented for a human as a visual pixel
>> matrix, and delivering this again is independent from the storage
>> organisation on a server.
>>
>> 3. It allows caching of meaningful and self-describing chunks of
>>    data.
>>
>> not sure what self-describing means in this context, but caching is
>> independent from that. Again, looking at your HTML file do we want to know
>> which of its file system blocks is in cache?
>>
>>> As such, any subsetting or querying interface should remain agnostic of
>>> it, otherwise it runs the risk of supporting a particular implementation
>>> (and there are quite a few specialized implementations out there).
>>>
>>
>> I think the concept of dataset partitioning could be handled in a very
>> general way. Perhaps the only required elements are:
>>
>> 1. express that something (a web resource) is a dataset (and could
>>    therefore be a subset or a superset);
>> 2. express that a dataset is a superset or subset of another dataset.
>>
>> This could work with any type of data and with any query interface, I
>> think.
>>
>> well, we don't know - I have raised the question a couple of times, but
>> it does not find much sympathy: what type of data are we talking about, and
>> what does subsetting mean for each of them?
>>
>> -Peter
>>
>> Re
>>
>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
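
The reasoning step mentioned in the thread (a dataset with dct:isPartOf
properties is a subset, one with dct:hasPart properties is a superset, and
one with only dct:hasPart properties is a root dataset) can be sketched
roughly as below. This is only an illustration under stated assumptions:
plain Python tuples stand in for RDF triples, and the `ex:` identifiers and
the `roles` function are hypothetical names that do not come from MELODIES
or any other project mentioned here.

```python
# Property names as abbreviated in the thread (Dublin Core terms).
IS_PART_OF = "dct:isPartOf"
HAS_PART = "dct:hasPart"


def roles(dataset, triples):
    """Classify a dataset by the part/whole properties it carries.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Only statements whose subject is `dataset` are considered, which
    mirrors the rule of thumb given in the thread.
    """
    has_parent = any(s == dataset and p == IS_PART_OF for s, p, _ in triples)
    has_child = any(s == dataset and p == HAS_PART for s, p, _ in triples)

    found = set()
    if has_parent:
        found.add("subset")
    if has_child:
        found.add("superset")
    if has_child and not has_parent:
        found.add("root")
    return found


# Hypothetical example: a temperature dataset split by year, then by country.
triples = [
    ("ex:temperature", HAS_PART, "ex:temperature-1901"),
    ("ex:temperature-1901", IS_PART_OF, "ex:temperature"),
    ("ex:temperature-1901", HAS_PART, "ex:temperature-1901-NL"),
    ("ex:temperature-1901-NL", IS_PART_OF, "ex:temperature-1901"),
]

for name in ("ex:temperature", "ex:temperature-1901", "ex:temperature-1901-NL"):
    print(name, sorted(roles(name, triples)))
```

Note that an intermediate dataset comes out as both a subset and a superset,
and that the sketch relies on both directions of the relation being
recorded, as Jon describes. A consumer that sees only one direction would
first have to infer the other.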
Received on Tuesday, 5 January 2016 08:30:47 UTC