Re: Schema.org extension for (geo)sciences

Simon,
  Thanks. At this point I think Adam and I need to review and document some resources and run some tests. After that we could revisit this.

  I concur that many of my points near the end are not in scope for DCAT.  They were more to express some of the functional pressures we are trying to address.

  Really appreciate this thread Simon!

  Adam and I will be at RDA. If anyone wants to chat more about this there, we'd be interested.

Doug

________________________________
From: Simon.Cox@csiro.au <Simon.Cox@csiro.au>
Sent: Thursday, February 14, 2019 12:02 AM
To: Douglas Fils; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: RE: Schema.org extension for (geo)sciences


As we note in the DCAT ED - https://w3c.github.io/dxwg/dcat/#Class:Distribution




dcat:Distribution - A specific representation of a dataset. A dataset might be available in multiple serializations that may differ in various ways, including natural language, media-type or format, schematic organization, temporal and spatial resolution, level of detail or profiles (which might specify any or all of the above).



The corollary is that concerns of format, and even organization, are not essential in the description of a dataset. Most of the data services and APIs that we deploy provide access to data in multiple formats, and sometimes in multiple schematic representations as well, so the user can select the one that suits. That’s why the main association is from DataDistributionService to Dataset, and not just to Distribution (i.e. the dcat:servesDataset property).



For streaming data I would suggest that the dataset description will have a single-ended temporal-coverage.

Currently the DCAT Dataset description has only one property related to temporal spacing, `dct:accrualPeriodicity`, and there is some ambiguity as to whether this implies dataset-versioning or item-spacing (for time-series data). See https://github.com/w3c/dxwg/issues/728




The second part of your mail seems to be primarily about the ranking of results from a catalog query.

DCAT does not address this, though I see your concern – if all the files in a legacy-catalog are classified as Datasets, then they could swamp the more ‘valuable’ catalog entries. But this must be handled some other way – ranking and relevance.



Simon





From: Douglas Fils [mailto:dfils@oceanleadership.org]
Sent: Thursday, 14 February, 2019 16:27
To: Cox, Simon (L&W, Clayton) <Simon.Cox@csiro.au>; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: Re: Schema.org extension for (geo)sciences



Simon,

   I can't say right off whether it would still be true in the case of 'dataset' as concept vs. file. I will definitely work with Adam to address that, though.



  Given a conceptual dataset that is distinct from its distribution (as in schema.org), we would want to associate services with the distribution. Then we can describe those services in a way that lets us detail the query parameters (which we can do), including temporal ranges, or point to a service description document.



  For example, a common function the groups seem to want is a search along the lines of:

instrument X in spatial Y during Z, where "during" is something like Nov 2009. The instrument and spatial parts are addressable, but we needed more to deal with the time search.



For speed and simplicity we would likely initially require something along the lines of a key-value pair in the search request, such as date:iso8601 (where iso8601 is, obviously, an ISO 8601 encoded date), or something like daterange:startiso;endiso (we can review best practices for date ranges).
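
As a minimal sketch of that key-value idea (Python): the function names `parse_time_filter` and `overlaps`, the `daterange` key spelling, and the restriction to calendar dates rather than full timestamps are all assumptions made for illustration, not an agreed syntax. A coverage end of `None` stands for an open-ended (still streaming) resource.

```python
from datetime import date


def parse_time_filter(term):
    """Parse 'date:<iso>' or 'daterange:<start>;<end>' into a (start, end) pair of dates."""
    key, sep, value = term.partition(":")
    if not sep:
        raise ValueError(f"expected key:value, got {term!r}")
    if key == "date":
        day = date.fromisoformat(value)
        return day, day
    if key == "daterange":
        start_text, sep, end_text = value.partition(";")
        if not sep:
            raise ValueError(f"expected start;end, got {value!r}")
        start = date.fromisoformat(start_text)
        end = date.fromisoformat(end_text)
        if end < start:
            raise ValueError("range end is before range start")
        return start, end
    raise ValueError(f"unknown key {key!r}")


def overlaps(query, coverage):
    """True if the closed query interval intersects the coverage interval.

    coverage is (start, end); end may be None for an open-ended stream.
    """
    q_start, q_end = query
    c_start, c_end = coverage
    return c_start <= q_end and (c_end is None or q_start <= c_end)
```

With this, "during Nov 2009" would be sent as `daterange:2009-11-01;2009-11-30` and tested against each resource's temporal coverage with `overlaps`.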



I think this is the initial approach Adam and I would like to investigate.  Describe streaming data as a resource with a distribution that is defined by a service description.



There are some other points that are perhaps not related to DCAT or schema.org though.  Describing a resource, whether 'classic' or 'conceptual' is one part.   Another is when we gather and index these and try to search across both types.



Some questions we have.



a) Given hundreds of thousands of 'classic' datasets, could a relatively small number of 'conceptual datasets with service distributions' get lost in the results? Are there issues we need to address when building an index of conceptual and classic datasets?

b) How do we weight temporal results (or any streamed dimension)? If a dataset matches a time range it might get a large score boost. However, if it matches nothing else in the search, it's likely not relevant that it has the correct time range. That extreme case is easy, but milder versions of this might make ranking the two types in a comparable manner harder (not sure). What is the role of a domain ontology in such ranking?

c) Is the citable unit of a streaming service the conceptual dataset as a whole? That doesn't help much, so we might want to cite a conceptual dataset combined with the parameters on the distribution service. What is the model to do that, then? (Both RDA and ESIP have a lot to help/say there.)

d) Can we digitally sign a subset of a conceptual dataset with something like a sha256 hash, to essentially fingerprint the results for later validation?
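
For point (d), a rough sketch of the fingerprinting idea in Python, using only the standard library. The function name `fingerprint`, the payload layout, and the choice of sorted-key JSON as the canonical form are assumptions for illustration. Note that a bare SHA-256 digest only fingerprints the content; an actual digital signature would additionally need a signing key.

```python
import hashlib
import json


def fingerprint(dataset_id, params, records):
    """Return a SHA-256 hex digest over a dataset id, query parameters, and result records.

    The payload is serialized as JSON with sorted keys and fixed separators so that
    the same logical content always produces the same bytes (assumed canonical form).
    """
    payload = {"dataset": dataset_id, "params": params, "records": records}
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Recomputing the digest later over the same dataset identifier, parameters, and returned records and comparing it to the stored value would show whether the subset has changed. Including the parameters in the payload also ties in with (c), since the digest then identifies the conceptual dataset together with the request made against it.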



Part of the reason for this is simple.  With things like Google Dataset Search out there, groups with these "conceptual" datasets described by distributions defined by a service still want their "dataset" in those results.



We're interested in exploring that question, given that groups like DataONE and others are also interested in that goal; it's not just a Google game.



Take care

Doug









________________________________

From: Simon.Cox@csiro.au <Simon.Cox@csiro.au>
Sent: Wednesday, February 13, 2019 7:35 PM
To: Douglas Fils; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: RE: Schema.org extension for (geo)sciences



>  we had many groups in the NSF geo space that didn't have resources that mapped well to the concept of a "dataset"



Does that remain true if ‘dataset’ is understood to be a ‘conceptual’ thing, and not a simple file?

This is very much the philosophy behind DCAT (and schema.org), which separates Dataset (the conceptual thing) and Distribution (~DataDownload) (its representation).

Then DCAT-2014 conflated Distribution and API/Application, so in DCAT-rev we attempt to unpick this.

The triangle of relationships between Dataset, Distribution, and DataDistributionService is key: https://w3c.github.io/dxwg/dcat/#UML_DCAT_All_Attr




(It might be that an additional relationship ‘servedBy’ is required from Dataset to DataService – the inverse of `dcat:servesDataset` - to support key use-cases?

That is the kind of feedback which would be useful to us in finalizing the revised vocabulary. )
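
To make the triangle concrete, here is a rough sketch in Python using plain dictionaries in a JSON-LD-like shape. The example.org IRIs and the helper name `served_by` are made up for illustration. The sketch uses `dcat:distribution`, `dcat:accessService` and `dcat:servesDataset` as I understand them from the draft, and the general `dcat:DataService` class. The helper derives the candidate inverse relationship by walking `dcat:servesDataset` backwards, which is what a consumer has to do today without a 'servedBy' property.

```python
# Hypothetical IRIs; only the dcat: terms are taken from the DCAT draft.
DATASET = "https://example.org/dataset/seismic-stream"
DISTRIBUTION = "https://example.org/distribution/seismic-stream-json"
SERVICE = "https://example.org/service/waveform-api"

GRAPH = [
    # Dataset -> Distribution
    {"@id": DATASET, "@type": "dcat:Dataset",
     "dcat:distribution": [{"@id": DISTRIBUTION}]},
    # Distribution -> DataService
    {"@id": DISTRIBUTION, "@type": "dcat:Distribution",
     "dcat:accessService": [{"@id": SERVICE}]},
    # DataService -> Dataset (the main association)
    {"@id": SERVICE, "@type": "dcat:DataService",
     "dcat:servesDataset": [{"@id": DATASET}]},
]


def served_by(graph):
    """Invert dcat:servesDataset: map each dataset IRI to the services that serve it."""
    index = {}
    for node in graph:
        if node.get("@type") != "dcat:DataService":
            continue
        for target in node.get("dcat:servesDataset", []):
            index.setdefault(target["@id"], []).append(node["@id"])
    return index
```

Calling `served_by(GRAPH)` gives a lookup from dataset to services, which is the direction a dataset landing page would need.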



Simon



From: Douglas Fils [mailto:dfils@oceanleadership.org]
Sent: Thursday, 14 February, 2019 11:38
To: Cox, Simon (L&W, Clayton) <Simon.Cox@csiro.au>; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: Re: Schema.org extension for (geo)sciences



Simon,

  Thanks, I'll check those out. Obviously my stuff is just 0th order ideas as we begin to think about these things. Our main issue was that we had many groups in the NSF geo space that didn't have resources that mapped well to the concept of a "dataset" and so were sidelined a bit in Project 418: for example, the seismic, atmospheric, etc. people with more streaming data accessed by APIs. Our interest was to see if there was a way to still expose them so as to allow at least some level of integration with the harvesting, graph generation and search results coming back from the more traditional dataset resources. Obviously relevancy ranking is a huge issue there, and we are trying to look at some ways to present results from across the two classes in a useful way to the end user.



  We'll keep digging..  and testing concepts...



Doug



________________________________

From: Simon.Cox@csiro.au <Simon.Cox@csiro.au>
Sent: Wednesday, February 13, 2019 5:25 PM
To: Douglas Fils; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: RE: Schema.org extension for (geo)sciences



Hey Doug –



Note that dcat:DataService is intended to be a ‘minimum viable’ approach, delegating all the details to values of the `dct:conformsTo` and `dcat:endpointDescription` properties, which SHOULD use external standards. See https://w3c.github.io/dxwg/dcat/#ex-access-service and https://w3c.github.io/dxwg/dcat/#data-service-examples and the third example here https://rawgit.com/w3c/dxwg/dcat-issue317-simon/dcat/index.html#bag-of-files (on a branch that will likely be merged into gh-pages shortly). The last one is mirrored by an attempt I made to translate it into schema.org – see https://github.com/w3c/dxwg/blob/dcat-issue317-simon/dcat/examples/csiro-stratchart.schema.ttl (on the branch again). I struggled a bit to interpret how to use schema:EntryPoint and may have got it ‘wrong’ at line 42 etc.



Simon



From: Douglas Fils [mailto:dfils@oceanleadership.org]
Sent: Thursday, 14 February, 2019 06:24
To: Cox, Simon (L&W, Clayton) <Simon.Cox@csiro.au>; Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: ashepherd@whoi.edu
Subject: Re: Schema.org extension for (geo)sciences



Simon,

  Thanks for the ping. As Adam mentioned, he and I got funding from NSF/EarthCube for follow-on work and should be starting around the time RDA P13 takes place. So we actually have some funding to do some of this work. I'll get that geologic time stuff in for the scientific drilling!!!  I've owed you that for TOOOOOO long.

We want to do the vocabulary stuff inside ESIP to make sure we have a broader geoscience audience than just Adam and me.  😊    Hat tip to Adam for doing that and to Lewis for fostering the momentum at ESIP.



  One of the other items Adam and I talked about at the ESIP Winter meeting was the dcat:DataService. I'm keen to review that in more detail with regard to its mapping to schema.org. I put a simple image at http://labs.geodex.org/pres/01.html#slide3 that we used to talk about this with some of the people at ESIP. We are working with IRIS and UNAVCO, both of whom deal more with streaming data, and we are interested in how we can better represent that data. The full set of groups involved with us is at http://geodex.org/about.html .



  Adam and I will be at RDA in April, and we have been following bioschemas.org but never had the chance to really interact with them. Not for lack of interest! Sadly, more lack of time. However, it would be good for us to focus on communicating and exchanging experience with them. Also, we have been involved with the Data Discovery Paradigms IG in RDA, mostly through talking and interacting with Mingfang.



  I quickly read through section E, and much of it resonates with me based on the experience we got harvesting the NSF providers and building a test graph and UI at https://geodex.org/ (lots of UI bugs there; it is a test for sure and needs some attention in this next phase). The experience of dealing with the material provided by so many groups really highlighted some approaches and procedure changes we need to incorporate into this next round of work.



Thanks..  fun and exciting..

Perhaps there would be some opportunity at RDA to meet and discuss this more?



Take care

Doug







________________________________

From: Simon.Cox@csiro.au <Simon.Cox@csiro.au>
Sent: Wednesday, February 13, 2019 5:02 AM
To: Lewis.J.McGibbney@jpl.nasa.gov; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: Douglas Fils; ashepherd@whoi.edu
Subject: RE: Schema.org extension for (geo)sciences



Hi Lewis -

First we are seeking feedback from the community as a whole on the proposed revisions to DCAT.
The editor's draft is https://w3c.github.io/dxwg/dcat/ and the particular things that we are seeking comments on are
(a) the proposed changes as listed in the change-log Annex E ;
(b) any obvious errors or omissions.

Bear in mind that DCAT is a general-purpose vocabulary for data catalogs, not specific to research or science.
And it sits in the context of the W3C suite of RDF vocabularies and ontologies.
There is more reference made to some complementary vocabularies such as PROV-O and DQV in this version of DCAT.
The other main area of innovation in the revision is the addition of DataServices.

And, as I mentioned below, the catalog and dataset elements in schema.org were drawn more or less directly from an earlier version of DCAT.

Next there is some interest in whether the extension points available in DCAT for attaching descriptors that are important for research data are sufficient. We are primarily expecting to recommend use of elements from PROV-O for provenance and versioning requirements, and do not expect to be prescriptive on the details otherwise.

Simon

-----Original Message-----
From: Mcgibbney, Lewis J (398M) [mailto:Lewis.J.McGibbney@jpl.nasa.gov]
Sent: Wednesday, 13 February, 2019 17:17
To: Cox, Simon (L&W, Clayton) <Simon.Cox@csiro.au>; kcoyle@kcoyle.net; public-dxwg-wg@w3.org
Cc: dfils@oceanleadership.org; ashepherd@whoi.edu
Subject: Re: Schema.org extension for (geo)sciences

Thank you for connecting the dots. Can you point us at the specific material you are looking for feedback on? Discovering the overlaps and reducing the duplication of effort is exactly where the science-on-schema.org (soon to be renamed geosci.schema.org) effort is at.
Thanks

Dr. Lewis John McGibbney Ph.D., B.Sc.
Data Scientist II

Computer Science for Data Intensive Applications Group (398M) Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibbney@jpl.nasa.gov
ORCID: orcid.org/0000-0003-2185-928X







 Dare Mighty Things

On 2/12/19, 10:12 PM, "Simon.Cox@csiro.au" <Simon.Cox@csiro.au> wrote:

    Karen - science-on-schema.org is based on schema.org.
    And in turn, the dataset/catalog parts of schema.org were based on DCAT 0.9.
    So this is no coincidence - see

    https://schema.org/Dataset

    https://schema.org/distribution etc.

    The close relationship with schema.org was already mentioned and a partial mapping provided in the DCAT-rev draft
    https://w3c.github.io/dxwg/dcat/#dcat-sdo


    Or perhaps I am missing your point?

    FWIW I tried mapping one of the examples that we've recently been working on into schema.org -
    Compare https://github.com/w3c/dxwg/blob/dcat-issue317-simon/dcat/examples/csiro-stratchart.schema.ttl

    with https://github.com/w3c/dxwg/blob/dcat-issue317-simon/dcat/examples/csiro-stratchart.ttl

    It is almost complete, though the schema.org `EntryPoint` model is a little different (more elaborate) than the proposed `dcat:DataService` modeling - see https://github.com/w3c/dxwg/blob/dcat-issue317-simon/dcat/examples/csiro-stratchart.schema.ttl#L42


    I think I already contacted the (geo)science guys for feedback on our work, but I've cced them again here in case I missed it.

    Simon

    -----Original Message-----
    From: Karen Coyle [mailto:kcoyle@kcoyle.net]
    Sent: Wednesday, 13 February, 2019 01:43
    To: public-dxwg-wg@w3.org
    Subject: Re: Schema.org extension for (geo)sciences

    They appear to have directly borrowed from DCAT, using "data catalog"
    and "distribution" as DCAT does. It definitely makes sense to ping this group for any comments on DCAT. Andrea, can you do that?

    Thanks!
    kc

    On 2/12/19 2:57 AM, andrea.perego@ec.europa.eu wrote:
    > Dears,
    >
    > I don't remember if we have already mentioned this work:
    >
    > https://github.com/ESIPFed/science-on-schema.org

    >
    > (which, AFAIS, follows-up from:
    > https://github.com/earthcubearchitecture-project418/p418Vocabulary )
    >
    > They provide a way for describing repositories and datasets which
    > include most of the features under discussion in the revision of DCAT
    > (e.g., funding sources, identifiers, access to data via services).
    >
    > It may be worth getting in touch with them, to have their feedback.
    >
    > WDYT?
    >
    > Cheers,
    >
    > Andrea
    >
    > ----
    > Andrea Perego, Ph.D.
    > Scientific / Technical Project Officer European Commission DG JRC
    > Directorate B - Growth and Innovation Unit B6 - Digital Economy Via E.
    > Fermi, 2749 - TP 262
    > 21027 Ispra VA, Italy
    >
    > https://ec.europa.eu/jrc/

    >
    > ----
    > The views expressed are purely those of the writer and may not in any
    > circumstances be regarded as stating an official position of the
    > European Commission.
    >
    >

    --
    Karen Coyle
    kcoyle@kcoyle.net http://kcoyle.net

    m: 1-510-435-8234 (Signal)
    skype: kcoylenet/+1-510-984-3600

Received on Thursday, 14 February 2019 12:47:13 UTC