Re: Subsetting data from Rob Atkinson on 2016-01-06 (public-dwbp-comments@w3.org from January 2016)

From: Rob Atkinson <rob@metalinkage.com.au>
Date: Wed, 06 Jan 2016 07:07:02 +0000
To: janowicz@ucsb.edu, Annette Greiner <amgreiner@lbl.gov>, Rob Atkinson <rob@metalinkage.com.au>, Simon.Cox@csiro.au, p.baumann@jacobs-university.de, danbri@google.com, portele@interactive-instruments.de
Cc: phila@w3.org, ericphb@gmail.com, jeremy.tandy@metoffice.gov.uk, koubarak@di.uoa.gr, public-dwbp-comments@w3.org, public-sdw-comments@w3.org
Message-ID: <CACfF9LwoGceoyt+nRvBbA6H0DHAxUrT41fRD9pei+ZyuE-Oa8Q@mail.gmail.com>
The last para in my email was about "just enough OWL in digestible pieces"
- i.e. it cuts across Annette's and your concerns - use OWL rather than
reinvent the wheel - but use it in a way that makes the key
statements about how subsets are related - and your own choice of subset
vocabulary - interpretable.

So is there a best practice that models useful forms of subsets?  What I
have proposed is simply to allow more explicit forms of subset to be
modelled but bound to the common isPartOf so at least there is a simple,
common starting point for finding out what the context of a subset is.
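
One way to read "bound to the common isPartOf" is as a sub-property declaration. The sketch below is only an illustration under that assumption: the ex: predicates and resources are hypothetical, and the loop is a hand-rolled version of the RDFS sub-property rule rather than a real reasoner.

```python
# Hypothetical predicate and resource names, for illustration only.
DCT_IS_PART_OF = "dct:isPartOf"
RDFS_SUB_PROPERTY_OF = "rdfs:subPropertyOf"

triples = {
    # Two hypothetical, more explicit subset relations bound to dct:isPartOf.
    ("ex:temporalSubsetOf", RDFS_SUB_PROPERTY_OF, DCT_IS_PART_OF),
    ("ex:spatialSubsetOf", RDFS_SUB_PROPERTY_OF, DCT_IS_PART_OF),
    # Data that only uses the specific relations.
    ("ex:obs-2008-01", "ex:temporalSubsetOf", "ex:observations"),
    ("ex:obs-london", "ex:spatialSubsetOf", "ex:observations"),
}


def infer_super_properties(triples):
    """Apply 'p subPropertyOf q, s p o => s q o' until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        sub_props = {(s, o) for s, p, o in inferred if p == RDFS_SUB_PROPERTY_OF}
        for s, p, o in list(inferred):
            for sub, sup in sub_props:
                if p == sub and (s, sup, o) not in inferred:
                    inferred.add((s, sup, o))
                    changed = True
    return inferred


def parents_of(subset, triples):
    """Find what a subset is part of, knowing nothing but dct:isPartOf."""
    return sorted(
        o for s, p, o in infer_super_properties(triples)
        if s == subset and p == DCT_IS_PART_OF
    )
```

A client that understands only dct:isPartOf can call parents_of() and still find the parent dataset, while a domain that cares keeps the more specific relation.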

If we want subsets to be self-describing - that means persisting "the
context within which the data was requested" in a canonical way - and that's
a huge job to make interoperable as you point out.

Is there an absolutely minimal OWL pattern we can use for this minimal scope
that could be used as a simple RDF statement by something that knows just
that part of the problem, but doesn't stop us integrating these statements
into a more comprehensive OWL model if we need to?

Rob


On Wed, 6 Jan 2016 at 15:39 Krzysztof Janowicz <janowicz@ucsb.edu> wrote:

> The data served up can be plenty useful without involving OWL or any
> vocabulary. Users have the context within which the data was requested to
> understand the relationship.
>
>
> I am not sure about this. The last years show that using data outside of
> their original creation context is often not possible and even if the
> context is known and machine understandable reuse remains a major
> challenge. OWL and other KR languages try to address exactly this
> challenge. The idea that Linked Data, for instance, can be used without
> proper vocabularies is disappearing quickly. In fact, this was intensively
> discussed at ISWC last year.
>
> @Rob: I largely agree except for the last paragraph (or maybe I simply do
> not understand it). IMHO, the role of OWL is to restrict the interpretation
> of domain terms towards their intended meaning, e.g., to improve semantic
> interoperability.  In the case at hand, OWL could be used to specify how
> data is split into subsets, what those subsets mean, in which relation they
> stand to the entire dataset, and so forth.
>
> It is still important to keep in mind that we are talking about models now
> and not URIs for subsets. These topics are related but also pretty
> different. There is a lot of work out there that formally models concepts
> such as dataobject, dataset, data, collection, and so forth.
>
> Best,
> Krzysztof
>
>
>
> On 01/05/2016 06:53 PM, Annette Greiner wrote:
>
> The data served up can be plenty useful without involving OWL or any
> vocabulary. Users have the context within which the data was requested to
> understand the relationship.
>
> On 1/5/16 6:36 PM, Rob Atkinson wrote:
>
> That's fine - but the issue for these being useful is the ability to link
> them with some useful semantics - isPartOf is not generally going to be
> useful for any processing.
>
> You may want to state that whatever relationships you do use should
> have an rdfs:subPropertyOf dct:isPartOf declaration
> (i.e. this OWL model needs to be available - but not necessarily in the
> relationship-defining ontology artefact - i.e. we should be able to
> retrofit it to existing vocabularies).
> Then, within a domain that cares, these minimal OWL declarations can be
> used to support useful integration.
>
> The way I have used VoID is to subclass void:Dataset to define datasets
> with specific properties - much like a void:Linkset is a specialised
> Dataset.  We don't have to specify VoID - but I think it's reasonable to ask
> that if someone defines subsetting semantics for a particular purpose
> those semantics are published in OWL, and dereferenceable via the link
> property URI used.
>
>  Is OWL an acceptable "Best practice"?  I think there is an issue that
> OWL is not heavily used to provide useful snippets to help link resources -
> people want to remodel the world in OWL when they use it, not use it for
> sharing models. If we can't use OWL as a canonical way of sharing the most
> basic semantics here we really don't have much to work with, but IMHO we
> could point to it as a best practice and at the same time recommend that small
> pieces of OWL are deployed to support specific interoperability
> requirements - rather than attempt to model every possible aspect of
> everything in a monolithic document.
>
> Rob
>
> On Wed, 6 Jan 2016 at 11:24 <Simon.Cox@csiro.au> wrote:
>
>> +1
>>
>>
>> Simon J D Cox
>>
>> Research Scientist
>>
>> Environmental Information Infrastructures
>>
>> Land and Water
>>
>> CSIRO
>>
>>
>>
>> E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
>>
>>    Physical: Reception Central, Bayview Avenue, Clayton, Vic 3168
>>
>>
>>    Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>
>>    Postal: Private Bag 10, Clayton South, Vic 3169
>>
>> people.csiro.au/Simon-Cox
>>
>> orcid.org/0000-0002-3884-3420
>>
>> researchgate.net/profile/Simon_Cox3
>>
>>
>>
>>
>> ------------------------------
>> *From:* Annette Greiner
>> *Sent:* Tuesday, 5 January 2016 10:40:13 PM
>> *To:* Peter Baumann; Rob Atkinson; janowicz@ucsb.edu; Dan Brickley;
>> Clemens Portele
>> *Cc:* Phil Archer; Cox, Simon (L&W, Clayton); ericphb@gmail.com;
>> jeremy.tandy@metoffice.gov.uk; koubarak@di.uoa.gr;
>> public-dwbp-comments@w3.org; public-sdw-comments@w3.org
>> *Subject:* Re: Subsetting data
>>
>> I would stop at saying URIs for subsets are a Good Thing, and maybe
>> mention in the implementation section that they will naturally be assigned
>> URIs if you use a REST-based architecture. What those URIs look like will
>> depend on the implementation. (There's a difference between nobody knowing
>> how to use them and the fact that different contexts call for different
>> implementations, which I think is the case here.)
>> -Annette
>>
>> On 1/5/16 1:43 PM, Peter Baumann wrote:
>>
>> +1
>> -Peter
>>
>> On 2016-01-05 22:18, Rob Atkinson wrote:
>>
>>
>> Given stable URIs for subsets (which I don't think there is any
>> disagreement about) AFAICT there are two unresolved issues - both concerned
>> with the scope of the BP:
>> 1) What is the BP for describing how subsets relate to each other and the
>> master data set (avoiding implementation details)?
>> 2) What is the relationship between identifiable subsets, query endpoints
>> and the subsets returned - do they all have identifiers, and what is the BP
>> for a common vocabulary to relate these different aspects?
>>
>> Maybe a valid result is to say that there really isn't a BP in terms of
>> these requirements - and stop at saying URI identifiers for subsets is a Good
>> Thing Nobody Knows How To Use and throw out a challenge.
>>
>> Rob
>>
>>
>> On Wed, 6 Jan 2016 at 05:13 Krzysztof Janowicz <janowicz@ucsb.edu> wrote:
>>
>>> Hi Dan,
>>>
>>>
>>> Isn't a "subset" just a query result, of which there are effectively an
>>> unlimited number?
>>>
>>>
>>> I would say so.
>>>
>>>
>>> Storing a query so it can be re-run against evolving data has value.
>>> Having a URI for that, perhaps less so.
>>>
>>>
>>> There are many cases, particularly if dealing with streams of data,
>>> e.g., from sensors, where having URIs for subsets is very useful, and, in
>>> fact, there are many ways to do so. Some years ago we implemented one such
>>> solution at 52North where we developed a transparent (i.e., invisible)
>>> proxy on top of an endpoint that translates a URI minted in a specific way
>>> to return specific query results.  For instance, the URI
>>> <http://my.authority.org/observations/samplingtimes/ont:time:relation:between,2008-01-10T14:00,2008-01-12T16:00/sensors/thermometer1/observedproperties/temperature>
>>> points to the observation collection with all temperature observations from
>>> January 10th 2008 at 2pm until January 12th at 4pm made by thermometer1.
>>> Having such URIs (and proxies) is one way of mitigating the problem that
>>> the content  referenced by URIs should be as stable as possible which is
>>> (often) not the case for sensor data. In our case we worked on
>>> transparently mapping between SPARQL and OGC's Sensor Observation Services
>>> but the idea can be (and has been) used in many other settings.
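
A minimal sketch of what the URI-to-query step of such a proxy could look like. This is not the 52North implementation: the function name, the assumption that the path is a collection name followed by key/value segment pairs, and the returned dictionary layout are all mine.

```python
from datetime import datetime
from urllib.parse import urlparse

URI = (
    "http://my.authority.org/observations/samplingtimes/"
    "ont:time:relation:between,2008-01-10T14:00,2008-01-12T16:00/"
    "sensors/thermometer1/observedproperties/temperature"
)


def parse_subset_uri(uri):
    """Turn a subset URI into query parameters (assumed path layout)."""
    segments = urlparse(uri).path.strip("/").split("/")
    collection, rest = segments[0], segments[1:]
    if len(rest) % 2 != 0:
        raise ValueError("expected key/value segment pairs after the collection")
    # Pair up segments: samplingtimes -> ..., sensors -> ..., and so on.
    params = dict(zip(rest[0::2], rest[1::2]))
    relation, start, end = params["samplingtimes"].split(",")
    return {
        "collection": collection,
        "relation": relation,
        "start": datetime.strptime(start, "%Y-%m-%dT%H:%M"),
        "end": datetime.strptime(end, "%Y-%m-%dT%H:%M"),
        "sensor": params["sensors"],
        "property": params["observedproperties"],
    }
```

The dictionary returned by parse_subset_uri(URI) is what a proxy would then translate into a SPARQL or SOS request; that translation is left out here.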
>>>
>>> Happy new year.
>>> Krzysztof
>>>
>>>
>>> On 12/31/2015 03:09 AM, Dan Brickley wrote:
>>>
>>>
>>> Isn't a "subset" just a query result, of which there are effectively an
>>> unlimited number?
>>>
>>> Storing a query so it can be re-run against evolving data has value.
>>> Having a URI for that, perhaps less so.
>>>
>>> Dan
>>>
>>> On Thu, 31 Dec 2015, 08:14 Clemens Portele <
>>> portele@interactive-instruments.de> wrote:
>>>
>>>> Rob,
>>>>
>>>> what you describe seems to apply to the dataset (resource) the same way
>>>> it would apply to any subset resource. I.e. are you discussing a more
>>>> general question, not the subsetting question?
>>>>
>>>> Phil,
>>>>
>>>> a (probably often unproblematic) restriction to the
>>>> temperature/uk/london or stations/manchester approach is that there is only
>>>> one path, so you end up with limitations on the subsets. If you want to
>>>> support multiple subsets, e.g. also stations where high speed trains stop,
>>>> stations that have a ticket shop, etc. then there are several issues with a
>>>> /{dataset}/{subset}/…/{subset}/{object} approach. These include an unclear
>>>> URI scheme ("manchester" and "eurostar" would be on the same path level),
>>>> potential name collisions of subset names of different subsetting
>>>> categories, and multiple URIs for the same feature/object.
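
To make the "multiple URIs for the same object" point concrete, here is a small sketch that enumerates the paths a /{dataset}/{subset}/.../{object} scheme would allow for one object. The memberships and the object name are hypothetical, and the sketch assumes subsets may be combined in any order.

```python
from itertools import permutations

# Hypothetical subset memberships for one station, keyed by subsetting category.
memberships = {"city": "manchester", "service": "eurostar"}


def path_uris(dataset, subsets, obj):
    """List every /{dataset}/{subset}/.../{object} path that could name obj."""
    values = list(subsets.values())
    uris = set()
    # Use zero, one, or more subset segments, in any order.
    for n in range(len(values) + 1):
        for combo in permutations(values, n):
            uris.add("/".join(["", dataset, *combo, obj]))
    return sorted(uris)
```

With only two subsetting categories, path_uris("stations", memberships, "piccadilly") already yields five candidate paths for a single station, and nothing in a path says which category a segment belongs to.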
>>>>
>>>> Best regards,
>>>> Clemens
>>>>
>>>>
>>>> On 31 Dec 2015, at 03:07, Rob Atkinson <rob@metalinkage.com.au> wrote:
>>>>
>>>> I'm not a strong set-theoretician - but it strikes me there are some
>>>> tensions here:
>>>>
>>>> Does the identifier of a set mean that the members of that set are
>>>> constant, known in advance and always retrievable?   Is a query endpoint a
>>>> resource (does either URI or URL have meaning against a query that delivers
>>>> real time data - including the use case of "at this point in time we think
>>>> these things are members of this set?" )
>>>>
>>>> If the subset is the result of a query - and you care that it is the
>>>> same subset another time you look at it - are you actually assigning an
>>>> identifier to the artefact - which is the query response, whose properties
>>>> include the original query, where it was made, and the time it was made?
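
A sketch of the "identifier for the artefact" reading, assuming the identifier is derived from the query, the endpoint, the time of the request and the returned bytes. The class name, the urn:example: prefix and the hashing scheme are illustrative choices, not anything proposed in this thread.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class QueryResponse:
    """The artefact: a response plus the context in which it was requested."""
    query: str
    endpoint: str
    issued_at: datetime
    body: bytes

    @property
    def identifier(self):
        # Hash the request context and the content together.
        digest = hashlib.sha256()
        for part in (self.query, self.endpoint, self.issued_at.isoformat()):
            digest.update(part.encode("utf-8"))
            digest.update(b"\x00")  # separator so adjacent fields cannot blur
        digest.update(self.body)
        return "urn:example:response:" + digest.hexdigest()[:16]


first = QueryResponse(
    "temperature > 20", "http://example.org/sparql",
    datetime(2015, 12, 31, 3, 0, tzinfo=timezone.utc), b"row1\nrow2\n",
)
later = QueryResponse(
    "temperature > 20", "http://example.org/sparql",
    datetime(2016, 1, 6, 7, 0, tzinfo=timezone.utc), b"row1\nrow2\nrow3\n",
)
```

Here first and later come from the same query but get different identifiers, which is the distinction between identifying a query and identifying what it returned at a given time.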
>>>>
>>>> Can you define an ontology for terms like subset, query, response that
>>>> you all agree on?
>>>>
>>>> I share Phil's implicit concern that subsetting by type with URI
>>>> patterns may not be universally applicable - IMHO that equates to a
>>>> "sub-register" pattern, where a set has its members defined by some
>>>> identifiable process (independent of any query functions available) - which
>>>> may include explicit subsets - for example by object type, or delegated
>>>> registration processes. That probably fits the UK implementation better
>>>> than a query-defined subset.
>>>>
>>>> If subsets have some prior meaning - and a query is used to access them
>>>> from a service endpoint - then the query is a URL that needs to be bound to
>>>> the object URI. AFAICT that's a very different thing to saying an arbitrary
>>>> query result defines a subset of data.
>>>>
>>>> I think you may, in general, assign an ID to the artefact which is the
>>>> result of a query at a given time, and if you want to make that into
>>>> something with more semantics then you need to make it into a new type of
>>>> object which can be described in terms of what it means. I think currently
>>>> the conversation is conflating these two perspectives of "subset".
>>>>
>>>> Cheers, and farewell to 2015.
>>>> Rob Atkinson.
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, 31 Dec 2015 at 08:26 <Simon.Cox@csiro.au> wrote:
>>>>
>>>>> Another way of looking at it is that a query, encoded as a URI
>>>>> pattern, defines an implicit set of potential URIs, each of which denotes a
>>>>> subset.
>>>>>
>>>>> Simon J D Cox
>>>>> Environmental Informatics
>>>>> CSIRO Land and Water
>>>>>
>>>>> E simon.cox@csiro.au T +61 3 9545 2365 M +61 403 302 672
>>>>> Physical: Central Reception, Bayview Avenue, Clayton, Vic 3168
>>>>> Deliveries: Gate 3, Normanby Road, Clayton, Vic 3168
>>>>> Postal: Private Bag 10, Clayton South, Vic 3169
>>>>> http://people.csiro.au/Simon-Cox
>>>>> http://orcid.org/0000-0002-3884-3420
>>>>> http://researchgate.net/profile/Simon_Cox3
>>>>>
>>>>> ------------------------------
>>>>> *From:* Phil Archer
>>>>> *Sent:* Wednesday, 30 December 2015 6:31:16 PM
>>>>> *To:* Manolis Koubarakis; 'public-sdw-comments@w3.org'; Annette
>>>>> Greiner; Eric Stephan; Tandy, Jeremy; public-dwbp-comments@w3.org
>>>>> *Subject:* Subsetting data
>>>>>
>>>>> At various times in recent months I have promised to look into the topic
>>>>> of persistent identifiers for subsets of data. This came up at the SDW
>>>>> F2F in Sapporo but has also been raised by Annette in DWBP. In between
>>>>> festive activities I've been giving this some thought and have tried to
>>>>> begin to commit some ideas to a page [1].
>>>>>
>>>>> During the CEO-LD meeting, Jeremy pointed to OpenSearch as a possible
>>>>> way forward, including its geo-temporal extensions defined by the OGC.
>>>>> There is also the Linked Data API as a means of doing this, and what
>>>>> they both have in common is that they offer an intermediate layer that
>>>>> turns a URL into a query.
>>>>>
>>>>> How do you define a persistent identifier for a subset of a dataset? IMO
>>>>> you mint a URI and say "this identifies a subset of a dataset" - and
>>>>> then provide a means of programmatically going from the URI to a query
>>>>> that returns the subset. As long as you can replace the intermediate
>>>>> layer with another one that also returns the same subset, we're done.
>>>>>
>>>>> The UK Government Linked Data examples tend to be along the lines of:
>>>>>
>>>>> http://transport.data.gov.uk/id/stations
>>>>> returns a list of all stations in Britain.
>>>>>
>>>>> http://transport.data.gov.uk/id/stations/Manchester
>>>>> returns a list of stations in Manchester
>>>>>
>>>>> http://transport.data.gov.uk/id/stations/Manchester/Piccadilly
>>>>> identifies Manchester Piccadilly station.
>>>>>
>>>>> All of that data of course comes from a single dataset.
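
A sketch of the "replaceable intermediate layer" idea using the station URIs above. The in-memory rows, both resolver functions and the index layout are assumptions for illustration; the real service sits on a Linked Data API, not on code like this.

```python
# Hypothetical in-memory dataset; the rows are illustrative only.
STATIONS = [
    {"city": "Manchester", "name": "Piccadilly"},
    {"city": "Manchester", "name": "Victoria"},
    {"city": "Leeds", "name": "Leeds"},
]
BASE = "http://transport.data.gov.uk/id/stations"


def resolve_by_filter(uri):
    """One intermediate layer: turn path segments into row filters."""
    parts = [p for p in uri[len(BASE):].split("/") if p]
    rows = STATIONS
    for key, value in zip(("city", "name"), parts):
        rows = [r for r in rows if r[key] == value]
    return rows


def build_index():
    """Precompute the subset behind every URI the scheme can produce."""
    index = {BASE: list(STATIONS)}
    for row in STATIONS:
        index.setdefault(f"{BASE}/{row['city']}", []).append(row)
        index[f"{BASE}/{row['city']}/{row['name']}"] = [row]
    return index


INDEX = build_index()


def resolve_by_index(uri):
    """A replacement layer: look the same URI up in the precomputed index."""
    return INDEX.get(uri, [])
```

Both resolvers return the same rows for the same URI, so either can sit behind the identifier without the identifier changing.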
>>>>>
>>>>> Does this work in the real worlds of meteorology and LBL/PNNL?
>>>>>
>>>>> Phil.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> [1] <https://github.com/w3c/sdw/blob/gh-pages/subsetting/index.md>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>> Phil Archer
>>>>> W3C Data Activity Lead
>>>>> http://www.w3.org/2013/data/
>>>>>
>>>>> http://philarcher.org
>>>>> +44 (0)7887 767755
>>>>> @philarcher1
>>>>>
>>>>>
>>>>
>>>
>>> --
>>> Krzysztof Janowicz
>>>
>>> Geography Department, University of California, Santa Barbara
>>> 4830 Ellison Hall, Santa Barbara, CA 93106-4060
>>>
>>> Email: jano@geog.ucsb.edu
>>> Webpage: http://geog.ucsb.edu/~jano/
>>> Semantic Web Journal: http://www.semantic-web-journal.net
>>>
>>>
>> --
>> Dr. Peter Baumann
>>  - Professor of Computer Science, Jacobs University Bremen
>>    www.faculty.jacobs-university.de/pbaumann
>>    mail: p.baumann@jacobs-university.de
>>    tel: +49-421-200-3178, fax: +49-421-200-493178
>>  - Executive Director, rasdaman GmbH Bremen (HRB 26793)
>>    www.rasdaman.com, mail: baumann@rasdaman.com
>>    tel: 0800-rasdaman, fax: 0800-rasdafax, mobile: +49-173-5837882
>> "Si forte in alienas manus oberraverit hec peregrina epistola incertis ventis dimissa, sed Deo commendata, precamur ut ei reddatur cui soli destinata, nec preripiat quisquam non sibi parata." (mail disclaimer, AD 1083)
>>
>>
>>
>>
>> --
>> Annette Greiner
>> NERSC Data and Analytics Services
>> Lawrence Berkeley National Laboratory
>>
>>
>>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
>
>
>
>
> --
> Krzysztof Janowicz
>
> Geography Department, University of California, Santa Barbara
> 4830 Ellison Hall, Santa Barbara, CA 93106-4060
>
> Email: jano@geog.ucsb.edu
> Webpage: http://geog.ucsb.edu/~jano/
> Semantic Web Journal: http://www.semantic-web-journal.net
>
>
Received on Wednesday, 6 January 2016 07:07:47 UTC