Re: RFC Words - Levels

Regarding the two questions, I don’t think we need to worry about whether a maturity level applies to a single dataset or to a collection. It’s up to the publisher of the data to decide whether they will follow/claim a certain level for a dataset or a collection. For each BP, the relevant aspects will be different. Breaking BPs out into multiple levels is a matter of determining the aspects that are relevant to that BP or group of BPs. So, for metadata, you could say the lowest level is “provide structural metadata and provide localization metadata for locale-sensitive fields”, because we’d rather have some incomplete metadata than none at all, but the data is meaningless without structural metadata, and locale-sensitive fields are meaningless without the localization info. The next level could be “provide descriptive metadata”, which is less crucial, but it is still a huge help to have at least something. The third level could be “provide complete descriptive metadata, including license information, provenance, quality information, and versioning information.”
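
To make the metadata example concrete, here is a minimal sketch of how the three levels could be checked. The key names (`structural`, `locale`, and so on) are hypothetical labels chosen for illustration, not terms from the BP document, and the sketch assumes the dataset has locale-sensitive fields.

```python
# Hypothetical metadata categories per level; the names are illustrative
# assumptions, not identifiers taken from the BP document.
LEVEL_REQUIREMENTS = [
    {"structural", "locale"},                            # level 1
    {"descriptive"},                                     # level 2
    {"license", "provenance", "quality", "versioning"},  # level 3
]


def metadata_level(provided):
    """Return the highest level reached, counting from the lowest level up.

    A level counts only if every lower level is also satisfied.
    Returns 0 when even the lowest level is not met.
    """
    provided = set(provided)
    level = 0
    for required in LEVEL_REQUIREMENTS:
        if not required <= provided:  # some required category is missing
            break
        level += 1
    return level


if __name__ == "__main__":
    print(metadata_level({"structural", "locale", "descriptive"}))
```

Under this sketch, a publisher who supplies license and provenance information but no descriptive metadata still sits at the lowest level, which is one possible reading of "levels" and not the only one.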

Some of the metadata BPs I mention above seem like they could still be separate BPs, each split into its own levels. For example, “provide license information” could be satisfied at a low level by providing a custom description of licensing rules, and a higher level of maturity would be to use a standard license.

This raises the question of how many levels we should have, and how we will assign them. If we go for three, how do we assign the groups that have only two? We might be able to determine that by splitting up each BP or set of BPs however seems natural for that group, and seeing what turns out to be the highest number of levels for any such group. Then we can try to come up with some general rules to describe the levels. (I think the lowest and highest levels will be easy to generalize, but the middle ones will be hard.) Once we have generalized rules, it should be easy to assign the groups that have fewer levels.
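
The procedure above can be sketched roughly as follows. The group contents are placeholders, and the rule for placing a group with fewer levels (pin its lowest level to the bottom and its highest to the top of the overall scale) is only an assumption for illustration; the actual rules would come out of the generalization step.

```python
def overall_level_count(groups):
    """The overall number of levels is the largest natural split of any group."""
    return max(len(levels) for levels in groups.values())


def spread_levels(natural_levels, total):
    """Place a group's natural levels on a scale of `total` levels.

    Assumed rule (not from the discussion): the lowest natural level maps
    to level 1, the highest to level `total`, and any others are spread
    in between. A group with a single level is placed at level 1.
    """
    k = len(natural_levels)
    if k == 1:
        return {1: natural_levels[0]}
    return {
        1 + (i * (total - 1)) // (k - 1): name
        for i, name in enumerate(natural_levels)
    }


# Placeholder groups, split "however seems natural" for each one.
groups = {
    "metadata": [
        "structural and localization metadata",
        "descriptive metadata",
        "complete descriptive metadata",
    ],
    "license": ["custom license description", "standard license"],
}

if __name__ == "__main__":
    total = overall_level_count(groups)
    for name, levels in groups.items():
        print(name, spread_levels(levels, total))
```

With these placeholder groups the overall scale has three levels, and the two-level license group lands on levels 1 and 3, leaving the middle level empty for that group.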

I would like us to try to avoid the use of SHOULD and MUST altogether, since their use in a best practice recommendation cannot agree with their RFC 2119 meanings. (The web will not break if you fail to provide metadata with your dataset.) Instead of saying “Datasets must have x, y, and z,” we can simply say “Provide x, y, and z.”
-Annette

--
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
510-495-2935

On Sep 14, 2015, at 5:00 PM, Bernadette Farias Lóscio <bfl@cin.ufpe.br> wrote:

> 
> Hi Laufer,
> 
> I agree with you that we should have more fine-grained sets of best practices. It is also important to review the BPs to make sure that SHOULD and MUST were used correctly. IMO we should also discuss what type of classification we'd like to have with the maturity model. I have some questions about this:
> 
> Will the maturity model be used to evaluate a single dataset or a set of datasets?
> Which main aspects should be considered for the evaluation?
> 
> Thanks!
> Bernadette
> 
> 2015-09-04 11:21 GMT-03:00 Laufer <laufer@globo.com>:
> Hi All,
> 
> After our discussions about whether to keep the RFC words and whether to create a maturity model in conjunction with a set of BP levels, I grouped the BPs by RFC words:
> 
> MUST
>     Best Practice  1: Provide metadata
>     Best Practice  2: Provide descriptive metadata
>     Best Practice  4: Provide structural metadata
>     Best Practice 10: Use persistent URIs as identifiers
>     Best Practice 12: Use machine-readable standardized data formats
>     Best Practice 21: Preserve people's right to privacy
>     Best Practice 26: Provide data up to date
>     Best Practice 29: Use a trusted serialization format for preserved data dumps
> 
> SHOULD
>     Best Practice  3: Provide locale parameters metadata
>     Best Practice  5: Provide data license information
>     Best Practice  6: Provide data provenance information
>     Best Practice  7: Provide data quality information
>     Best Practice  8: Provide versioning information
>     Best Practice  9: Provide version history
>     Best Practice 11: Assign URIs to dataset versions and series
>     Best Practice 13: Use non-proprietary data formats
>     Best Practice 14: Provide data in multiple formats
>     Best Practice 15: Use standardized terms
>     Best Practice 16: Document vocabularies
>     Best Practice 17: Share vocabularies in an open way
>     Best Practice 18: Vocabulary versioning
>     Best Practice 19: Re-use vocabularies
>     Best Practice 20: Choose the right formalization level
>     Best Practice 22: Provide data unavailability reference
>     Best Practice 23: Provide bulk download
>     Best Practice 24: Follow REST principles when designing APIs
>     Best Practice 25: Provide real-time access
>     Best Practice 27: Maintain separate versions for a data API
>     Best Practice 28: Assess dataset coverage
>     Best Practice 30: Update the status of identifiers
>     Best Practice 31: Gather feedback from data consumers
>     Best Practice 32: Provide information about feedback
>     Best Practice 33: Enrich data by generating new metadata.
> 
> We currently have two groups of BPs to guide the publisher.
> 
> Maybe we could use these two groups as the starting point for an exercise to define a more fine-grained set of groups that would, in some sense, assert a level of "quality" (maturity) for a published dataset.
> 
> What do you think about this?
> 
> Cheers,
> Laufer
> 
> -- 
> .  .  .  .. .  . 
> .        .   . ..
> .     ..       .
> 
> 
> 
> -- 
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
> ----------------------------------------------------------------------------

Received on Tuesday, 15 September 2015 01:05:00 UTC