Re: old issues we've been ignoring from Annette Greiner on 2016-04-01 (public-dwbp-wg@w3.org from April 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Thu, 31 Mar 2016 21:39:32 -0700
To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Cc: DWBP Public List <public-dwbp-wg@w3.org>, Caroline Burle <cburle@nic.br>
Message-ID: <56FDFB84.2090508@lbl.gov>

Thanks, Berna, for working through this at this late moment. I really
appreciate the effort you make to move things forward. I'll type 
responses inline below.

On 3/31/16 3:51 PM, Bernadette Farias Lóscio wrote:
> Hi Annette,
>
> Thanks again for your review. I'm sorry for the late answer, but I only
> now had time to carefully read and answer your message. Please find
> my comments below.
>
> 2016-03-24 16:48 GMT-03:00 Annette Greiner <amgreiner@lbl.gov>:
>
>     Following up on the list of issues I had in my email of 6/18/15, I
>     see we are just now getting around to addressing the things I
>     marked as essential to address before the next publication. That
>     made me wonder where we are on the items that weren't starred, so
>     I went through the list.
>
>     The following are still issues:
>     -------------------------
>
>     Data Quality
>     The introduction says that quality “can affect the potentiality of
>     the application that use data”. I don’t understand that phrase.
>
>
> ---> Suggestion: Data quality can have a big impact on the quality of 
> the application that use data, as a consequence, its inclusion in the 
> data publishing and consumption pipelines is of primary importance

This is much better. I can see what you're trying to say. It just needs 
to have some logical connections clarified. Is this what you mean?
"The quality of a dataset can have a big impact on the quality of 
applications that use it. As a consequence, the inclusion of data 
quality considerations in data publishing and consumption pipelines is 
of primary importance."
>
>
>     Provide descriptive metadata
>     Re the possible approach to implementation, the list of metadata
>     fields to be included is not an implementation, so that should be
>     moved up and listed under intended outcome. Spatial coverage and
>     temporal period are irrelevant for lots of datasets, so they
>     should be marked “if relevant". Keywords and themes/categories are
>     dependent on the context of a catalog, so I think we should leave
>     them out of this list, or say that they are needed in that case only.
>
>
> ---> Suggestion:
>
> The machine readable version of the descriptive metadata can be 
> provided using the vocabulary recommended by W3C to describe datasets, 
> i.e. the Data Catalog Vocabulary [VOCAB-DCAT 
> <http://w3c.github.io/dwbp/bp.html#bib-VOCAB-DCAT>]. This provides a 
> framework in which datasets can be described as abstract entities.
>
> Descriptive metadata should include the following overall features of 
> a dataset:
>
>   * The *title* and a *description* of the dataset.
>   * The *keywords* describing the dataset.
>   * The *date of publication* of the dataset.
>   * The *entity responsible (publisher)* for making the dataset available.
>   * The *contact point* of the dataset.
>
> When relevant, the following metadata can also be included:
>
>   * The *spatial coverage* of the dataset.
>   * The *temporal period* that the dataset covers.
>   * The *themes/categories* covered by a dataset.
>

I'm a little confused about this one. Are we saying that all the fields
listed in the first group should be included in order to meet the criteria
of the BP? If that's the case, I think that list belongs in the intended 
outcome rather than the implementation. The implementation section 
shouldn't be telling us what we *should* do, right? I think it would be 
okay if we just removed the "shoulds".
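
To make the two groups of fields concrete, here is a minimal sketch,
assuming a JSON-LD style record that uses DCAT and Dublin Core property
names. The dataset values, the example.org URLs, and the helper name
missing_fields are hypothetical and are not taken from the draft; the
grouping simply mirrors the suggestion quoted above.

```python
import json

# Property names from DCAT / Dublin Core Terms; the grouping follows the
# suggestion in this thread and is an assumption, not normative text.
ALWAYS = ["dct:title", "dct:description", "dcat:keyword",
          "dct:issued", "dct:publisher", "dcat:contactPoint"]
WHEN_RELEVANT = ["dct:spatial", "dct:temporal", "dcat:theme"]

# Hypothetical dataset description (all values are made up).
dataset = {
    "@context": {
        "dct": "http://purl.org/dc/terms/",
        "dcat": "http://www.w3.org/ns/dcat#",
    },
    "@type": "dcat:Dataset",
    "dct:title": "Example bus stops",
    "dct:description": "Locations of bus stops in an example city.",
    "dcat:keyword": ["bus", "transport"],
    "dct:issued": "2016-03-31",
    "dct:publisher": "http://example.org/org/transport-agency",
    "dcat:contactPoint": "http://example.org/contact",
}


def missing_fields(record, fields):
    """Return the listed fields that are absent or empty in the record."""
    return [name for name in fields if not record.get(name)]


print(json.dumps(dataset, indent=2))
print("missing (always):", missing_fields(dataset, ALWAYS))
print("missing (when relevant):", missing_fields(dataset, WHEN_RELEVANT))
```

A check like this only reports which fields are present; whether the
"when relevant" fields apply to a given dataset is still a judgment
call for the publisher.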
>
>
>      Use standardized terms
>     should be “Standardized terms should be used to provide metadata
>     whenever they are available.” In scientific domains, often there
>     are no standard terms yet available.
>     (The test for this one should at least allow for some terms to not
>     be standardized, because often there is no standard.)
>
>
> ---> I think it is not necessary to include "whenever they are
> available." because I think this is implicit. If we include this for
> this BP, then we should do the same for other BPs. The Why section
> already mentions that standardized lists of codes and other commonly
> used terms for data and metadata values should be used as much as possible.
>
I think that's reasonable. I'm probably just being pedantic.
>
>
>     Use a trusted serialization format for preserved data dumps
>     To the extent that this is in scope, it is covered under the BP
>     about using standardized formats. We could add a note to that
>     mentioning the value for preservation. I don’t think this needs to
>     be a separate BP.
>
>     Update the status of identifiers
>     To the extent that this is in scope, it should be covered under
>     versioning or unavailability. What are “preserved” datasets? Are
>     they available on the web? If not, it is out of scope. If they
>     are, then they are versions.
>
>
> --> I created an issue to discuss this with the group - ISSUE-251 [1]
>
>
>     Feedback
>     We say “blogs and other publicly available feedback should be
>     displayed in a human-readable form through the user interface.”
>     That suggests that publishers should re-publish blog content,
>     which is probably not what we want (copyright issues, for one
>     thing). Publishers of data can’t control the format of other
>     people’s publications.
>
>
> Suggestion: To remove "blogs and other". The phrase will be: "Publicly 
> available feedback should be displayed in a human-readable form 
> through the user interface"
Sounds good!
>
>
>
>     The following are new issues related to issues in that same email:
>     -----------------------------------------------------------------
>     BP28, Assess dataset coverage, is still written in the context of
>     archiving data, which we have agreed was out of scope. It is
>     valuable for the point that datasets should have minimal
>     dependencies on external entities that may not be preserved. It
>     needs to be rewritten to be about that rather than about assessing
>     a dataset for its value in an archive.
>
>
> ---> see ISSUE-251 [1]
>
>
>     Sensitive Data: The introduction gives a lot of advice that sounds
>     like it should be in a BP. I find it awkward that we offer it in
>     this form instead of a BP. If we want to say that it is out of
>     scope, then we shouldn't be offering all this advice in an
>     introduction.
>
>
> ---> I don't agree. I think it is out of scope of the document to 
> identify the sensitive data and to tell how to protect the sensitive 
> data. But once the sensitive data was identified and properly 
> protected, then the BP shows what should be done to tell consumers why 
> the data is not available.
I agree that it's out of scope to tell how to identify sensitive data 
and how to protect it. But the introduction still says to "identify all
sensitive data, assess the exposure risk, determine the intended usage,
data user audience and any related usage policies, obtain appropriate
approval, and determine the appropriate security measures needed to
taken to protect the data" and to "preserve the privacy of individuals
where the release of personal information would endanger safety 
(unintended accidents) or security (deliberate attack)." Those sound 
like BPs to me. I'd like to hear what other people in the group think, 
though.
>
>
>     BP32, provide information about feedback
>     The possible approach to implementation is about assigning
>     metadata about the feedback. I don't think this is a best
>     practice, and in any case, it's not an implementation of providing
>     *useful* information about feedback. The useful information is the
>     actual feedback, not metadata about it. I would suggest
>     implementation with an issue tracker. The tests have the same
>     problem, they are about testing metadata, not testing that the
>     feedback itself can be read by other users.
>
>
> ---> I agree that we shouldn't mention metadata about feedback. I have 
> a suggestion for the rewriting of this BP:
>
> Best Practice 32: Make feedback available
>
> Feedback should be available for both human users and computer
> applications
>
> Why
>
> Making feedback about datasets and distributions publicly available
> allows users to become aware of other data consumers, supports a
> collaborative environment, and shows how user community experiences,
> concerns or questions are currently being addressed. Providing
> feedback in a machine-readable format allows computer applications to
> automatically collect and process feedback about datasets.
>
> Intended Outcome
>
> It should be possible for humans to have access to feedback on a 
> dataset or distribution given by one or more data consumers.
>
> It should be possible for machines to automatically process feedback
> about a dataset or distribution.
>
> Possible Approach to Implementation
>
> Feedback can be available as part of an HTML Web page, but it can also
> be provided in a machine-readable format according to the vocabulary
> to describe dataset usage [DUV
> <http://w3c.github.io/dwbp/bp.html#bib-DUV>].
>
> How to Test
>
> Check if a human consumer can access the feedback about the dataset or 
> distribution and check if a computer application can automatically 
> process the feedback.
> Please let me know if you agree with my suggestions.

I like this except for the requirement of having the feedback machine 
readable. I think it's a best practice to make it human readable, but I 
don't see a compelling reason to make the feedback machine readable.  I 
have never done that. Do other people think that is a common practice? 
It seems to me one could get caught in an infinite loop of providing 
feedback as a dataset and getting feedback on the feedback dataset, etc.
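
For comparison, here is a minimal sketch of what "human readable" versus
"machine readable" could look like for a single feedback item. The
record, its field names, and the helper names to_html and to_json are
hypothetical; they are plain illustrations and are not terms from the
DUV vocabulary.

```python
import html
import json

# Hypothetical feedback item; field names are illustrative only.
feedback = {
    "dataset": "http://example.org/dataset/bus-stops",
    "author": "A. Consumer",
    "date": "2016-03-31",
    "comment": "Stop 42 has <wrong> coordinates.",
}


def to_html(item):
    """Render the feedback as an HTML fragment for human readers."""
    return "<p><strong>{}</strong> ({}): {}</p>".format(
        html.escape(item["author"]),
        html.escape(item["date"]),
        html.escape(item["comment"]),
    )


def to_json(item):
    """Serialize the same feedback so a program can parse it."""
    return json.dumps(item, sort_keys=True)


print(to_html(feedback))
print(to_json(feedback))
```

Both forms carry the same information; the open question in this thread
is only whether publishing the second form should be part of the BP.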
>
> Thanks!
> Bernadette
>
> [1] https://www.w3.org/2013/dwbp/track/issues/251
>
>
>     -- 
>     Annette Greiner
>     NERSC Data and Analytics Services
>     Lawrence Berkeley National Laboratory
>
>
>
>
>
> -- 
> Bernadette Farias Lóscio
> Centro de Informática
> Universidade Federal de Pernambuco - UFPE, Brazil
> ----------------------------------------------------------------------------

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Friday, 1 April 2016 04:40:24 UTC