Re: Subsetting data

Dear all,
my name is Stefan Pröll and I am working with Andreas Rauber (CC'ed) in
the RDA Working Group on Data Citation. Peter, thanks for including me
in the discussion; this is very interesting.

As Peter wrote, the RDA Working Group on Data Citation (WG-DC) deals
with the question of how subsets of changing (i.e. growing, shrinking or
updated) data sets can be identified in a unique and reproducible way.
Although the working group is named "data citation" for historical
reasons, we focus rather on data identification (which of course is an
important building block for all kinds of citations). To this end, we
use a query-based approach, which allows us to identify and retrieve all
previous versions of subsets of data.

In this little paragraph, I have already used quite a few terms which
are prone to misunderstandings, as they are used and interpreted
differently in many communities. The approach which we developed during
the RDA WG phase is based on versioned data, queries to retrieve subsets
from this data, and the idea of associating persistent identifiers with
the queries. We store additional metadata (such as the query execution
time) in a piece of infrastructure we call the "query store".

Thus a data set can be any data which is stored in some sort of data
store (database systems, file systems, NoSQL systems etc.) that
provides a query mechanism (think SQL, XPath, file system search etc.).
Users utilise this query mechanism to retrieve a specific subset. A
subset is therefore (by our definition) a selection of records from the
data set which fulfil specific selection criteria. Users could, for
instance, use SQL statements to create a subset, but in general this
process is transparently encapsulated behind a Web interface, as you
described below.

This Web interface translates the human request for a subset into some
sort of query language. We can record this process including its
parameters and re-execute it on demand on versioned data, thus allowing
us to retrieve the very same data as of any specific point in time. This
is what we call a version. The trick that we apply is that we know the
execution time of the query, and therefore we can map the query
execution to a specific state of the data in the data store. As a
result, we can retrieve the data on demand exactly as it was, without
having to export and archive different versions of subsets externally.
Of course this is not something that we invented; most RDBMS already
have such mechanisms in place. What is new is that we link this
information to a persistent identifier.
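
The re-execution idea can be sketched in a few lines of Python with
SQLite. This is only an illustration under stated assumptions, not the
WG-DC reference implementation: the table layout (valid_from/valid_to
columns), the integer "logical clock" timestamps, the "pid:" identifier
format and all function names are hypothetical, and the station rows are
made-up sample data.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Versioned data: rows are never overwritten, only closed.
    CREATE TABLE stations (
        name TEXT, city TEXT,
        valid_from INTEGER NOT NULL,
        valid_to INTEGER            -- NULL means "still current"
    );
    -- Hypothetical query store: PID, query text, parameter, execution time.
    CREATE TABLE query_store (
        pid TEXT PRIMARY KEY, query TEXT, city TEXT, executed_at INTEGER
    );
""")

# The subset query always filters on the state of the data at time "as_of".
SUBSET_QUERY = (
    "SELECT name FROM stations WHERE city = ? "
    "AND valid_from <= ? AND (valid_to IS NULL OR valid_to > ?) "
    "ORDER BY name"
)

def insert_station(name, city, at):
    conn.execute("INSERT INTO stations VALUES (?, ?, ?, NULL)", (name, city, at))

def delete_station(name, at):
    # A delete only closes the validity interval; the old row stays.
    conn.execute(
        "UPDATE stations SET valid_to = ? WHERE name = ? AND valid_to IS NULL",
        (at, name),
    )

def run_subset(city, as_of):
    rows = conn.execute(SUBSET_QUERY, (city, as_of, as_of)).fetchall()
    return [r[0] for r in rows]

def register_query(city, executed_at):
    # Placeholder identifier; a real system would mint a proper PID.
    pid = "pid:" + uuid.uuid4().hex
    conn.execute(
        "INSERT INTO query_store VALUES (?, ?, ?, ?)",
        (pid, SUBSET_QUERY, city, executed_at),
    )
    return pid

def resolve(pid):
    # Re-execute the stored query against the data as of its execution time.
    query, city, executed_at = conn.execute(
        "SELECT query, city, executed_at FROM query_store WHERE pid = ?", (pid,)
    ).fetchone()
    return [r[0] for r in conn.execute(query, (city, executed_at, executed_at))]

insert_station("Piccadilly", "Manchester", at=1)
insert_station("Victoria", "Manchester", at=2)
pid = register_query("Manchester", executed_at=3)
delete_station("Victoria", at=4)
insert_station("Oxford Road", "Manchester", at=5)

print(resolve(pid))                  # subset as it was at time 3
print(run_subset("Manchester", 6))   # subset as it is now
```

Resolving the identifier returns the subset as it was when the query was
first executed, even though the underlying data has changed since; no
export of the subset is kept anywhere.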

Regarding your question about the uniqueness of queries, I could offer
the following: Most database management systems (DBMS) provide flexible,
declarative query languages, which allow the very same subset to be
retrieved by a range of different queries. What should never happen is
that you get different results for the very same query (ceteris
paribus). Computationally, it is very hard to detect in advance whether
two differently phrased queries will deliver identical results. We
mitigate this problem by utilising query interfaces (e.g. a Web
application allowing only specific filters to be used), so that the
query structure remains the same for all queries. We also apply query
normalisation. Usually the query parser is able to re-arrange queries in
such a way that the optimizer can reduce the execution cost. We can
exploit this in order to normalise queries further.
Hence, we can detect duplicate queries, and there is only limited
potential for multiple queries delivering the same result. How this
normalisation is done depends of course on the designated community and
the use case, so there needs to be some agreement.
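
As a rough illustration of normalisation behind a restricted query
interface, the following Python sketch brings a set of form filters into
a canonical shape and fingerprints it. The normalisation rules chosen
here (case-insensitive keys and values, trimmed whitespace, sorted
multi-value filters) are assumptions made for the example; which rules
are appropriate is exactly what a community would have to agree on. The
names normalise, query_fingerprint and pid_for, and the "pid:" format,
are hypothetical.

```python
import hashlib
import json

def normalise(filters):
    """Return a canonical string for a set of filters from a Web form."""
    canonical = {}
    for key, value in filters.items():
        key = key.strip().lower()
        if isinstance(value, (list, tuple, set)):
            # The order of multi-value filters does not change the subset.
            value = sorted(str(v).strip().lower() for v in value)
        else:
            value = str(value).strip().lower()
        canonical[key] = value
    # sort_keys makes the serialisation independent of filter order.
    return json.dumps(canonical, sort_keys=True, separators=(",", ":"))

def query_fingerprint(filters):
    return hashlib.sha256(normalise(filters).encode("utf-8")).hexdigest()

query_store = {}  # fingerprint -> persistent identifier (placeholder values)

def pid_for(filters):
    """Reuse the existing PID if an equivalent query was seen before."""
    fingerprint = query_fingerprint(filters)
    if fingerprint not in query_store:
        query_store[fingerprint] = f"pid:{len(query_store) + 1}"
    return query_store[fingerprint]

a = pid_for({"city": "Manchester", "facilities": ["ticket shop", "parking"]})
b = pid_for({"Facilities": ["parking", "ticket shop"], "city": " manchester "})
c = pid_for({"city": "London"})
print(a, b, c)
```

The first two requests differ only in spelling and ordering, so they map
to the same fingerprint and share one identifier; the third gets a new
one.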

However, if this simple restriction cannot be applied in your use case,
what can always be applied is an ex-post analysis, which allows us to
detect duplicate (identical) subsets delivered by dissimilar queries. In
this case, we compute a hash value for each query result (i.e. for each
subset) and can therefore detect duplicate subsets. We can then map
these together, assign the most efficient query (for instance in terms
of execution cost) as the proper query string for this result, and
dismiss the duplicate. This then also gives us exactly one persistent
identifier per subset.

I just noticed that my email is quite lengthy already. There are a lot
of effects which could be considered, which make this topic quite a
challenge. I would gladly point you to our Working Group Web site:
https://www.rd-alliance.org/group/data-citation-wg.html

As an outcome of the WG-DC, we produced 14 recommendations, which are
briefly described in the flyer here:
https://rd-alliance.org/system/files/documents/RDA-DC-Recommendations_150924.pdf

We also wrote a journal article providing more details, which is
currently under review. Just drop me an email and I can send you the draft.


Please let me know if this email helps or if you have additional questions.

Kind regards,
Stefan





On 2016-01-01 14:48, Peter Baumann wrote:
>
>
> On 2016-01-01 10:26, Phil Archer wrote:
>>
>>
>> On 31/12/2015 09:33, Peter Baumann wrote:
>>> Hi all,
>>>
>>> there is work already in this realm which might be useful.
>>>
>>> - Stephan Proell has been working on subset identifiers in the context
>>>   of RDA.
>>
>> That's interesting. Can you put us in touch, please? I'll be engaging more
>> fully with the RDA as of now and hope to get to the Japan plenary.
>
> no problem, I put Stephan on cc herewith.
>
> @Stephan: please meet the folks on cc, they are with the W3C Spatial Data on
> the Web WG; currently we are having a discussion about identifying subsets of
> data on the Web, ie: addressing "inside" an object. I feel that this is
> overlapping with your research.
>
> cheers,
> Peter
>
>
>>
>>>
>>> - In the context of data/metadata linking there is work on connecting arrays
>>> into tables (ie, relational -> ISO SQL/MDA [1]), into hierarchies (ie, XML ->
>>> OGC WCPS [2]) and into RDF (have to find the paper). This allows to determine
>>> subsets via the resp. query mechanism, which I consider the most general way.
>>> As is the case with URLs already, different queries can point to the same
>>> result = "subset". Path expressions, as Phil used in his example, is one way
>>> of expressing subsets from composite entities.
>>
>> This feels like live queries. Nothing wrong with that of course, but I'm
>> trying to focus on persistent IDs for 'typical subsets' like 'latest
>> satellite image of location X.'
>>
>>>
>>> Generally, "subsetting" can mean many, many things. In the most basic case it
>>> denotes identifying a part of a coverage that is a coverage again: spatial and
>>> temporal subsetting in WCS Core, and also range subsetting, ie: extraction of
>>> bands/channels/variables from a coverage, resulting in a coverage again. With
>>> more general options, this can be transcended - such as retrieving _sets_ of
>>> pixels from an image _matrix_. You can replace "coverage" with anything where
>>> you wish to maintain some particular properties (array, set uniqueness,
>>> hierarchy, closure under a given ontology, ...).
>>
>> So you'd bake some dimensions into the URI and they could persist even when
>> your great grand child writes WCS Core 27.0 in 2216
>>
>>>
>>> re change of a subset target over time: that is of course always the case, any
>>> resource to which a URL points can change so this does not add substantial new
>>> problems. A subset may even yield an empty result at some time (such as maybe
>>> /UK/Edinburgh or EU/UK at some time).
>>
>> Indeed, yes.
>>
>> Cheers
>>
>> Phil
>>
>>>
>>> re clashes etc: what you are talking about below is not subsetting, but fusion
>>> (a "join" or "union"). This is a different mechanism with different rules (cf
>>> ontology matching when merging two ontologies).
>>>
>>> Happy 2016,
>>> Peter
>>>
>>> [1] D. Misev, P. Baumann: /Extending the SQL Array Concept to Support
>>> Scientific Analytics/. Proc. Intl. Conf. on Scientific and Statistical
>>> Database Management (SSDBM'2014), June 30 - July 2, 2014, Aalborg, Denmark,
>>> paper #10
>>> [2] P. Baumann: The OGC Web Coverage Processing Service (WCPS) Standard.
>>> Geoinformatica, 14(4)2010, pp 447-479
>>>
>>>
>>> On 2015-12-31 09:07, Clemens Portele wrote:
>>>> Rob,
>>>>
>>>> what you describe seems to apply to the dataset (resource) the same way it
>>>> would apply to any subset resource. I.e. are you discussing a more general
>>>> question, not the subsetting question?
>>>>
>>>> Phil,
>>>>
>>>> a (probably often unproblematic) restriction to the temperature/uk/london or
>>>> stations/manchester approach is that there is only one path, so you end up
>>>> with limitations on the subsets. If you want to support multiple subsets, e.g.
>>>> also stations where high speed trains stop, stations that have a ticket shop,
>>>> etc. then there are several issues with a
>>>> /{dataset}/{subset}/…/{subset}/{object} approach. These include an unclear URI
>>>> scheme ("manchester" and "eurostar" would be on the same path level),
>>>> potential name collisions of subset names of different subsetting categories,
>>>> and multiple URIs for the same feature/object.
>>>>
>>>> Best regards,
>>>> Clemens
>>>>
>>>>
>>>>> On 31 Dec 2015, at 03:07, Rob Atkinson <rob@metalinkage.com.au> wrote:
>>>>>
>>>>> I'm not a strong set-theoretician - but it strikes me there are some
>>>>> tensions here:
>>>>>
>>>>> Does the identifier of a set mean that the members of that set are constant,
>>>>> known in advance and always retrievable? Is a query endpoint a resource
>>>>> (does either URI or URL have meaning against a query that delivers real time
>>>>> data - including the use case of "at this point in time we think these things
>>>>> are members of this set?")
>>>>>
>>>>> If the subset is the result of a query - and you care that it is the same
>>>>> subset another time you look at it - are you actually assigning an identifier
>>>>> to the artefact - which is the query response, whose properties include the
>>>>> original query, where it was made, and the time it was made?
>>>>>
>>>>> Can you define an ontology for terms like subset, query, response that you
>>>>> all agree on?
>>>>>
>>>>> I share Phil's implicit concern that subsetting by type with URI patterns may
>>>>> not be universally applicable - IMHO that equates to a "sub-register"
>>>>> pattern, where a set has its members defined by some identifiable process
>>>>> (independent of any query functions available) - which may include explicit
>>>>> subsets - for example by object type, or delegated registration processes.
>>>>> That probably fits the UK implementation better than a query-defined subset.
>>>>>
>>>>> If subsets have some prior meaning - and a query is used to access them from
>>>>> a service endpoint - then the query is a URL that needs to be bound to the
>>>>> object URI. AFAICT that's a very different thing to saying an arbitrary query
>>>>> result defines a subset of data.
>>>>>
>>>>> I think you may, in general, assign an ID to the artefact which is the result
>>>>> of a query at a given time, and if you want to make that into something with
>>>>> more semantics then you need to make it into a new type of object which can be
>>>>> described in terms of what it means. I think currently the conversation is
>>>>> conflating these two perspectives of "subset".
>>>>>
>>>>> Cheers, and farewell to 2015.
>>>>> Rob Atkinson.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 31 Dec 2015 at 08:26 <Simon.Cox@csiro.au> wrote:
>>>>>
>>>>>      Another way of looking at it is that a query, encoded as a URI pattern,
>>>>>      defines an implicit set of potential URIs, each of which denotes a
>>>>>      subset.
>>>>>
>>>>>      Simon J D Cox
>>>>>      Environmental Informatics
>>>>>      CSIRO Land and Water
>>>>>
>>>>>      E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
>>>>>      Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
>>>>>      Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>>>>      Postal: Private Bag 10, Clayton South, Vic 3169
>>>>>      http://people.csiro.au/Simon-Cox
>>>>>      http://orcid.org/0000-0002-3884-3420
>>>>>      http://researchgate.net/profile/Simon_Cox3
>>>>>
>>>>>      ----------------------------------------------------------------------
>>>>>      *From:* Phil Archer
>>>>>      *Sent:* Wednesday, 30 December 2015 6:31:16 PM
>>>>>      *To:* Manolis Koubarakis; 'public-sdw-comments@w3.org'; Annette Greiner;
>>>>>      Eric Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
>>>>>      *Subject:* Subsetting data
>>>>>
>>>>>      At various times in recent months I have promised to look into the topic
>>>>>      of persistent identifiers for subsets of data. This came up at the SDW
>>>>>      F2F in Sapporo but has also been raised by Annette in DWBP. In between
>>>>>      festive activities I've been giving this some thought and have tried to
>>>>>      begin to commit some ideas to a page [1].
>>>>>
>>>>>      During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
>>>>>      way forward, including its geo-temporal extensions defined by the OGC.
>>>>>      There is also the Linked Data API as a means of doing this, and what
>>>>>      they both have in common is that they offer an intermediate layer that
>>>>>      turns a URL into a query.
>>>>>
>>>>>      How do you define a persistent identifier for a subset of a dataset? IMO
>>>>>      you mint a URI and say "this identifies a subset of a dataset" - and
>>>>>      then provide a means of programmatically going from the URI to a query
>>>>>      that returns the subset. As long as you can replace the intermediate
>>>>>      layer with another one that also returns the same subset, we're done.
>>>>>
>>>>>      The UK Government Linked Data examples tend to be along the lines of:
>>>>>
>>>>>      http://transport.data.gov.uk/id/stations
>>>>>      returns a list of all stations in Britain.
>>>>>
>>>>>      http://transport.data.gov.uk/id/stations/Manchester
>>>>>      returns a list of stations in Manchester
>>>>>
>>>>>      http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>>>>      identifies Manchester Piccadilly station.
>>>>>
>>>>>      All of that data of course comes from a single dataset.
>>>>>
>>>>>      Does this work in the real worlds of meteorology and UBL/PNNL?
>>>>>
>>>>>      Phil.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>      [1] https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>      --
>>>>>
>>>>>
>>>>>      Phil Archer
>>>>>      W3C Data Activity Lead
>>>>>      http://www.w3.org/2013/data/
>>>>>
>>>>>      http://philarcher.org
>>>>>      +44 (0)7887 767755
>>>>>      @philarcher1
>>>>>
>>>>
>>>
>>
>

Received on Monday, 11 January 2016 21:27:24 UTC