Re: Google search and Datasets from Rob Atkinson on 2018-09-05 (public-dxwg-wg@w3.org from September 2018)

From: Rob Atkinson <rob@metalinkage.com.au>
Date: Thu, 6 Sep 2018 09:41:44 +1000
To: Dan Brickley <danbri@google.com>
Cc: Rob Atkinson <rob@metalinkage.com.au>, Annette Greiner <amgreiner@lbl.gov>, Dataset Exchange Working Group <public-dxwg-wg@w3.org>, Natasha Noy <noy@google.com>
Message-ID: <CACfF9LzF8wJrViBjY4zzDq-NZNv7Bk0GP+TBnrPZHwb1x4-M-Q@mail.gmail.com>
Hi Dan,

I think it's worth iterating around the notion: the nuance in "profile" is
probably one of formalism:
- my viewpoint is formalising the identification of the supported set and
content of (meta)data elements - what subset you need and care about and
will act on - so we can let people know what to expect when they see the
dataset metadata, and can ask systems to validate the relevant profiles etc
- the "expression" of the profile - which may or may not be complete or
authoritative (i.e. you can implement something and someone else could read
your docs and try to make a useful machine readable expression - this is
what we would need to do to retrofit semantic metadata to many resources)

These converge if we are publishing or registering key interoperability
profiles in a metadata-rich, machine-readable canonical form (e.g. SHACL +
profileDesc) - and there is no reason, for example, that a SHACL document
cannot be accessed from a profile URI by content negotiation, if we wanted
to use its URI as the formal identifier of the conceptual profile.
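
As a rough illustration of the content-negotiation idea above, the sketch
below builds (but does not send) HTTP requests that ask one profile URI for
different representations. The URI and the helper name are hypothetical, and
it assumes the SHACL expression is served as Turtle (text/turtle); whether a
given server actually honours the Accept header is outside this sketch.

```python
from urllib.request import Request

# Hypothetical profile URI, used only for illustration.
PROFILE_URI = "https://example.org/profiles/my-dcat-profile"


def build_profile_request(profile_uri, media_type="text/turtle"):
    """Build (without sending) a GET request for one representation of a profile.

    The same URI identifies the conceptual profile; the Accept header
    selects which expression of it the client would like to receive.
    """
    return Request(profile_uri, headers={"Accept": media_type}, method="GET")


# Ask for the machine-readable (assumed Turtle/SHACL) expression of the profile.
shacl_request = build_profile_request(PROFILE_URI)

# Ask for human-readable documentation from the very same identifier.
html_request = build_profile_request(PROFILE_URI, "text/html")
```

Passing either request to urllib.request.urlopen would perform the actual
fetch; that step is left out here because the URI is made up.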

I doubt there is much out there yet - GeoDCAT or DCAT-AP may be the best
bet - these are formally published, but not yet (authoritatively)
machine-readable forms of profile - and we are only just getting to
practices for declaring conformance to these.

Nevertheless, as you point out - your challenge is to identify patterns,
irrespective of the low level of support for declaring these currently, so
you will settle on some implicit profile that can be articulated at some
stage - so a pathway to better metadata via the motivation to optimise use
of deployed search functionality seems likely (as you have been doing
naturally via schema.org).

Rob



On Thu, 6 Sep 2018 at 07:29 Dan Brickley <danbri@google.com> wrote:

> It may be that there are various notions of "profile" in play here. I'll
> check in with Ed! If there are interesting quantities of data out there
> expressed in DCAT-based patterns (potentially captured via shex/shacl
> shapes) and if they're written in a form we extract (json-ld etc) then
> there's certainly potential.  Can you give examples of any pages (rather
> than the underlying specs) with the kind of dataset-describing profile you
> have in mind? Re fora, I'm happy having a mail thread here until the WG
> chairs nudge us to move along elsewhere :)
>
> Dan
>
> On Wed, 5 Sep 2018 at 22:18, Rob Atkinson <rob@metalinkage.com.au> wrote:
>
>>
>> Hi Dan, et al
>>
>> I spoke to Ed Parsons about this, and he advised that it was unlikely
>> that any specific DCAT profiles would be supported, but my thinking is that
>> if you support DCAT + some way of handling, say, statistical datasets using
>> datacube - that support would actually constitute a DCAT profile logically,
>> and could be described as such.
>>
>> Happy to work with you therefore to describe what you do support AS
>> profiles, rather than push a profile at you :-)  It would make sense to
>> formalise governance of geospatial data profiles via OGC - as a sub-profile
>> of GeoDCAT for example, if you support GeoDCAT (????)
>>
>> I'm trying to track this issue across a number of statistical data fora -
>> but struggling to identify a center of gravity for the discussion - do you
>> have any suggestions?
>>
>> Rob Atkinson
>>
>>
>> On Thu, 6 Sep 2018 at 06:33 Dan Brickley <danbri@google.com> wrote:
>>
>>> You beat me to it :)
>>>
>>> (cc:'ing Natasha Noy who led this work at Google, and who might not
>>> be able to post to this list directly but I can relay any bounced posts)
>>>
>>> I am really happy to see this work launch and am happy to answer any
>>> questions, here or offlist as folk prefer.
>>>
>>> Schema.org's dataset vocab is based on the core pattern from the early
>>> DCAT drafts a few years ago (and so shares its strengths and weaknesses).
>>> The Google implementation is based on JSON-LD, RDFa and Microdata embedded
>>> in the main per-dataset pages. While we focussed more on Schema.org there
>>> is some understanding of DCAT too and our support for both will hopefully
>>> evolve with the ecosystem (and updated W3C specs) over time. Other
>>> questions of course loom, e.g. how this relates to markup for fact
>>> checking, or for describing funders and projects, specialist domains (e.g.
>>> bioschemas, ...), or other W3C efforts like Data Cube and CSVW....
>>>
>>> Dan
>>>
>>> On Wed, 5 Sep 2018, 19:38 Annette Greiner, <amgreiner@lbl.gov> wrote:
>>>
>>>> I noticed their developer guide says "We can understand structured data
>>>> in Web pages about datasets, using either schema.org Dataset markup
>>>> <http://schema.org/Dataset>, or equivalent structures represented in
>>>> W3C <http://www.w3.org/>'s Data Catalog Vocabulary (DCAT) format
>>>> <https://www.w3.org/TR/vocab-dcat/>." :)
>>>>
>>>> -Annette
>>>>
>>>> On 9/5/18 11:16 AM, Karen Coyle wrote:
>>>>
>>>> "Making it easier to find datasets" at the Google Blog:
>>>> https://www.blog.google/products/search/making-it-easier-discover-datasets/
>>>>
>>>> You may already be aware of their developer guide for datasets:
>>>> https://developers.google.com/search/docs/data-types/dataset
>>>>
>>>> which advises the use of schema.org.
>>>>
>>>> Apologies if this is old news to some of you.
>>>>
>>>>
>>>> --
>>>> Annette Greiner
>>>> NERSC Data and Analytics Services
>>>> Lawrence Berkeley National Laboratory
>>>>
>>>>
>>>>
Received on Wednesday, 5 September 2018 23:42:33 UTC