Re: old issues we've been ignoring

Hi Annette,

Thanks again for your review. I'm sorry for the late reply; only now have I
had time to read your message carefully and respond. Please find my
comments below.

2016-03-24 16:48 GMT-03:00 Annette Greiner <amgreiner@lbl.gov>:

> Following up on the list of issues I had in my email of 6/18/15, I see we
> are just now getting around to addressing the things I marked as essential
> to address before the next publication. That made me wonder where we are on
> the items that weren't starred, so I went through the list.
>
> The following are still issues:
> -------------------------
>
> Data Quality
> The introduction says that quality “can affect the potentiality of the
> application that use data”. I don’t understand that phrase.
>

---> Suggestion: Data quality can have a big impact on the quality of the
applications that consume the data; as a consequence, its inclusion in the
data publishing and consumption pipelines is of primary importance.


>
> Provide descriptive metadata
> Re the possible approach to implementation, the list of metadata fields to
> be included is not an implementation, so that should be moved up and listed
> under intended outcome. Spatial coverage and temporal period are irrelevant
> for lots of datasets, so they should be marked “if relevant". Keywords and
> themes/categories are dependent on the context of a catalog, so I think we
> should leave them out of this list, or say that they are needed in that
> case only.
>

---> Suggestion:

The machine-readable version of the descriptive metadata can be provided
using the vocabulary recommended by W3C for describing datasets, i.e. the
Data Catalog Vocabulary [VOCAB-DCAT
<http://w3c.github.io/dwbp/bp.html#bib-VOCAB-DCAT>]. This provides a
framework in which datasets can be described as abstract entities. (An
example sketch of such a description follows the lists below.)

Descriptive metadata should include the following overall features of a
dataset:

   - The *title* and a *description* of the dataset.
   - The *keywords* describing the dataset.
   - The *date of publication* of the dataset.
   - The *entity responsible (publisher)* for making the dataset available.
   - The *contact point* of the dataset.

When relevant, the following metadata can also be included:

   - The *spatial coverage* of the dataset.
   - The *temporal period* that the dataset covers.
   - The *themes/categories* covered by the dataset.
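
To make this concrete, here is a minimal sketch of such a description
written in Python with rdflib. All URIs, the publisher, and the field
values are hypothetical, made up just for illustration, and only the
fields listed above are filled in:

# Minimal DCAT description sketch -- all URIs and values are hypothetical.
# Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

# Namespaces declared explicitly so the sketch works on any rdflib version.
DCAT = Namespace("http://www.w3.org/ns/dcat#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("dcat", DCAT)
g.bind("dct", DCT)

ds = URIRef("http://example.org/dataset/bus-stops")  # hypothetical dataset URI
g.add((ds, RDF.type, DCAT.Dataset))

# Overall features of the dataset:
g.add((ds, DCT.title, Literal("Bus stops of Example City", lang="en")))
g.add((ds, DCT.description, Literal("Locations of public bus stops.", lang="en")))
g.add((ds, DCAT.keyword, Literal("bus")))
g.add((ds, DCAT.keyword, Literal("public transport")))
g.add((ds, DCT.issued, Literal("2016-03-31", datatype=XSD.date)))
g.add((ds, DCT.publisher, URIRef("http://example.org/transport-agency")))
g.add((ds, DCAT.contactPoint, URIRef("http://example.org/transport-agency/contact")))

# When relevant:
g.add((ds, DCT.spatial, URIRef("http://example.org/geo/example-city")))
g.add((ds, DCT.temporal, URIRef("http://example.org/period/2015-2016")))
g.add((ds, DCAT.theme, URIRef("http://example.org/themes/mobility")))

print(g.serialize(format="turtle"))  # str in rdflib 6+, bytes in older versions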


> Use standardized terms
> should be “Standardized terms should be used to provide metadata whenever
> they are available.” In scientific domains, often there are no standard
> terms yet available.
> (The test for this one should at least allow for some terms to not be
> standardized, because often there is no standard.)
>

---> I think it is not necessary to include "whenever they are available",
because this is implicit. If we include this for this BP, then we should do
the same for other BPs. The Why section already mentions that standardized
lists of codes or other commonly used terms for data and metadata values
should be used as much as possible.



>
> Use a trusted serialization format for preserved data dumps
> To the extent that this is in scope, it is covered under the BP about
> using standardized formats. We could add a note to that mentioning the
> value for preservation. I don’t think this needs to be a separate BP.
>
> Update the status of identifiers
> To the extent that this is in scope, it should be covered under versioning
> or unavailability. What are “preserved” datasets? Are they available on the
> web? If not, it is out of scope. If they are, then they are versions.
>

---> I created an issue to discuss this with the group - ISSUE-251 [1]

>
> Feedback
> We say “blogs and other publicly available feedback should be displayed in
> a human-readable form through the user interface.” That suggests that
> publishers should re-publish blog content, which is probably not what we
> want (copyright issues, for one thing). Publishers of data can’t control
> the format of other people’s publications.
>

---> Suggestion: remove "blogs and other". The phrase would then read:
"Publicly available feedback should be displayed in a human-readable form
through the user interface."


>
>
> The following are new issues related to issues in that same email:
> -----------------------------------------------------------------
> BP28, Assess dataset coverage, is still written in the context of
> archiving data, which we have agreed was out of scope. It is valuable for
> the point that datasets should have minimal dependencies on external
> entities that may not be preserved. It needs to be rewritten to be about
> that rather than about assessing a dataset for its value in an archive.
>

---> See ISSUE-251 [1].


>
> Sensitive Data: The introduction gives a lot of advice that sounds like it
> should be in a BP. I find it awkward that we offer it in this form instead
> of a BP. If we want to say that it is out of scope, then we shouldn't be
> offering all this advice in an introduction.
>

---> I don't agree. I think it is out of scope for the document to identify
sensitive data and to tell how to protect it. But once the sensitive data
has been identified and properly protected, the BP shows what should be done
to tell consumers why the data is not available.


> BP32, provide information about feedback
> The possible approach to implementation is about assigning metadata about
> the feedback. I don't think this is a best practice, and in any case, it's
> not an implementation of providing *useful* information about feedback. The
> useful information is the actual feedback, not metadata about it. I would
> suggest implementation with an issue tracker. The tests have the same
> problem, they are about testing metadata, not testing that the feedback
> itself can be read by other users.
>

---> I agree that we shouldn't mention metadata about feedback. I have a
suggestion for rewriting this BP:

Best Practice 32: Make feedback available

Feedback should be available for both human users and computer applications.

Why

Making feedback about datasets and distributions publicly available allows
users to become aware of other data consumers, supports a collaborative
environment, and lets users know whether community experiences, concerns or
questions are currently being addressed. Providing feedback in a
machine-readable format allows computer applications to automatically
collect and process feedback about datasets.

Intended Outcome

It should be possible for humans to have access to feedback on a dataset or
distribution given by one or more data consumers.

It should be possible for machines to automatically process feedback about
a dataset or distribution.

Possible Approach to Implementation

Feedback can be available as part of an HTML Web page, but it can also be
provided in a machine-readable format using the vocabulary to describe
dataset usage [DUV <http://w3c.github.io/dwbp/bp.html#bib-DUV>], as in the
sketch below.
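
As a sketch of what the machine-readable side could look like, here is a
small Python/rdflib example. The URIs and the feedback text are
hypothetical, and the exact modeling (DUV describes user feedback as a kind
of Web Annotation) should be double-checked against the DUV document:

# Machine-readable feedback sketch using DUV -- hypothetical URIs/values.
# DUV models user feedback as a Web Annotation, so oa: terms are used to
# point at the dataset (target) and carry the feedback text (body).
# Requires: pip install rdflib
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

DUV = Namespace("http://www.w3.org/ns/duv#")
OA = Namespace("http://www.w3.org/ns/oa#")
DCT = Namespace("http://purl.org/dc/terms/")

g = Graph()
g.bind("duv", DUV)
g.bind("oa", OA)
g.bind("dct", DCT)

dataset = URIRef("http://example.org/dataset/bus-stops")  # hypothetical
feedback = URIRef("http://example.org/feedback/42")       # hypothetical

g.add((feedback, RDF.type, DUV.UserFeedback))
g.add((feedback, OA.hasTarget, dataset))  # the dataset the feedback is about
g.add((feedback, OA.hasBody, Literal("The stop coordinates seem outdated.")))
g.add((feedback, DCT.creator, URIRef("http://example.org/user/jdoe")))

print(g.serialize(format="turtle"))  # feedback an application could re-collect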

How to Test

Check if a human consumer can access the feedback about the dataset or
distribution and check if a computer application can automatically process
the feedback.

Please let me know if you agree with my suggestions.

Thanks!
Bernadette

[1] https://www.w3.org/2013/dwbp/track/issues/251



>
> --
> Annette Greiner
> NERSC Data and Analytics Services
> Lawrence Berkeley National Laboratory
>
>
>


-- 
Bernadette Farias Lóscio
Centro de Informática
Universidade Federal de Pernambuco - UFPE, Brazil
----------------------------------------------------------------------------

Received on Thursday, 31 March 2016 22:52:08 UTC