Re: [BP - MET] - Best Practices - Guidance on the Provision of Metadata

Reading through this thread today I have a couple of comments.

I think what Laufer is doing is starting from the abstract position, 
with a view to turning that into practical advice later. +1 to that. 
DCAT is the vocab for describing datasets in a catalogue (it's the Data 
CATalogue vocabulary) and it does that job fine - I don't think anyone's 
suggesting we reinvent it.
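
For reference, the core DCAT pattern we'd be building on is just this 
(all names and URLs invented for illustration, and assuming the usual 
dcat: and dct: prefixes):

<#myCatalog> a dcat:Catalog ;
   dcat:dataset <#myDataset> .

<#myDataset> a dcat:Dataset ;
   dct:title "Example dataset" ;
   dcat:distribution <#myDistro> .

<#myDistro> a dcat:Distribution ;
   dcat:mediaType "text/csv" ;
   dcat:downloadURL <http://example.com/data.csv> .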

But there are several pieces missing from the landscape and our job is 
to guide people on how to use what pieces exist and, where necessary, 
fix those gaps.

Semantics of the dataset
========================
This is indeed what the CSVW WG is working on for tabular data. And VoID 
does a similar job for Linked Data. So a link from a dcat:Distribution 
to machine-readable metadata about the semantics could well be useful. 
But I agree with Bernadette that that's as far as we should go, i.e. we 
just provide the hooks.

That said, we should be mindful that the CSVW work will include links 
from the data to the (semantic) metadata. VoID uses my least favourite 
method (a well-known location) to achieve the same thing.
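
For comparison, the VoID discovery convention is that a document at 
/.well-known/void on the site describes the dataset(s), something like 
this (example.com is, of course, made up):

@prefix void: <http://rdfs.org/ns/void#> .
@prefix dcterms: <http://purl.org/dc/terms/> .

<http://example.com/dataset> a void:Dataset ;
   dcterms:title "Example Linked Data set" ;
   void:vocabulary <http://xmlns.com/foaf/0.1/> ;
   void:sparqlEndpoint <http://example.com/sparql> .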

<#myDistro> a dcat:Distribution ;
   dcat:semanticMetadata <http://example.com/myDistro-meta> .

(dcat:semanticMetadata is an invented property here, just to show the 
hook.) Such a triple doesn't prevent or conflict with any other link 
that may exist between the dataset and its metadata and could possibly 
be useful.

NB. Semantic metadata is going to be format-specific so I guess it has 
to be linked from each distribution, not from the (abstract) 
dcat:Dataset. WDYT?

Application profiles
====================
We're trying to get a new WG up and running on this - i.e. a method to 
make things like the DCAT-AP machine-readable. My colleague Eric 
Prud'hommeaux is working on this. If W3C member organizations 
represented in *this* WG would be interested in that work, please let me 
know - we're building the community for what we expect to become the RDF 
Data Shapes WG. The hope is that it will hold its first f2f meeting at 
TPAC (so you could go to both f2f meetings in one trip :-) )
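
To give a flavour of what a machine-readable DCAT-AP might look like, 
here's a shape-style sketch in the ShEx syntax Eric has been 
experimenting with. Purely illustrative - neither the syntax nor these 
particular constraints are agreed anywhere:

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

<DatasetShape> {
   dct:title xsd:string ,
   dct:description xsd:string ,
   dcat:distribution @<DistributionShape>*
}

<DistributionShape> {
   dcat:accessURL IRI
}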

Data Quality
============
Yep - that's what we're working on too.

CKAN
====
The message from the Open Knowledge Foundation is exactly what you'd 
expect from any open source project: if you want new or improved 
features, create and improve them! Bernadette's students' work on 
building extensions for CKAN is a very important part of our work here. 
I hope that we can see real instances of CKAN with the extension 
installed. Implementing the same vocabulary extensions in non-CKAN 
portals is just as important (I'm nudging you, Martin & Carlos ;-) )


Best Practices
==============
So... in terms of BP, I suggest we explain the high level needs - which 
I think was Laufer's starting point - and then dive into how to do it, 
pointing to whatever method is or will be available for doing so.

HTH

Phil.



On 16/05/2014 08:26, Ghislain Atemezing wrote:
> Hi Laufer, all,
> Thanks for this great starting discussion. Find below my 2 cents ...
>> I created a page on the wiki, "Best Practices – Guidance on the
>> Provision of Metadata", where we can put the information about this
>> topic. I took the liberty to define a prefix in the subject of the
>> e-mails related to these discussions: [BP- MET].
>>
>> I would like to expose some thoughts that I think are related to the
>> data on the web ecosystem. I see a kind of data architecture that has
>> three big roles: a data Publisher, a data Consumer and a data Broker.
>> The Broker is the one that has information that can be used by the
>> Consumer to find data published by the Publisher.
>>
>> As an example of Brokers we can think about implementations of CKAN,
>> used by data.gov <http://data.gov>, dados.gov.br <http://dados.gov.br>,
>> etc. CKAN has metadata (provided by Publishers) that are useful for
>> Consumers to find data. CKAN is a registry and can also be a repository
>> for the data to be consumed. Almost all use cases of DWBP WG are
>> examples of Brokers.
>>
>> At the same time, data published in CKAN implementations can have
>> multiple formats, such as CSV. Once a Consumer chooses some data
>> to use from a Publisher, she needs another kind of metadata to
>> understand how to access the data and its semantics.
>>
>> I propose to create categories and types of metadata. I see two
>> categories: metadata for search and metadata for use. Each of these
>> categories would have types of metadata. For example:
>>
> +1. I could also consider metadata "computed" from provenance
> data + metrics, e.g. if a dataset is published by a "certified
> organization" and is reused by many users/applications, then it has
> higher quality.
>> Metadata Types for Search
>>
>> Human Content Description (free text)
> ..and categories/themes
>>
>> Machine Content Description (vocabularies)
>>
>> Provenance
>>
>> License
>>
>> Revenue
>>
>> Credentials
>>
>> Quality / Metrics
>>
>> Release Schedule
>>
>> Data Format
>>
>> Data Access
> +1 for all this first metadata types
>>
>> Metadata Types for Use
>>
>> URI Design Principles
>>
>> Machine Access to Data
>>
>> API specification
>>
> I am not sure I understand the above types. Could you give us an
> example of why "vocabularies" are not in this list, but "URI design
> principles" is? One might think that there are no principles in
> designing URIs for vocabs.
>> Format Specification
>>
> What's the difference between "format spec" and "data format"?
>
> As others pointed out, we could define a small set of mandatory fields
> when providing the metadata.
>
> Thanks again for taking care of this section.
>
> Cheers,
> Ghislain
>

-- 


Phil Archer
W3C Data Activity Lead
http://www.w3.org/2013/data/

http://philarcher.org
+44 (0)7887 767755
@philarcher1

Received on Friday, 16 May 2014 08:38:01 UTC