Re: old issues we've been ignoring from Annette Greiner on 2016-04-06 (public-dwbp-wg@w3.org from April 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Tue, 5 Apr 2016 17:34:46 -0700
To: Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Cc: DWBP Public List <public-dwbp-wg@w3.org>, Caroline Burle <cburle@nic.br>
Message-ID: <570459A6.3010403@lbl.gov>
Thanks, Bernadette!

---> I replaced "should" with "can". This is just a suggestion of 
information that can be provided as descriptive metadata. I think it 
should be part of the approach to implementation. Do you agree?

yes, that's great, thanks.

---> As agreed in our last meeting, I included a note in the introduction 
of the Data Preservation section [2]

Looks good. I hope we get some feedback on these issues.

---> We had a long discussion about this during the F2F in São Paulo 
[3]. We agreed to remove the BP on Preserve People's Right to privacy 
and to review the sensitive data section. So, I think we shouldn't 
create a new BP. The paragraph in the introduction was rewritten 
considering the discussion that we had during the F2F.

I'm sorry that I missed the discussion about this. I do agree with 
everything that's in the text; it just seems very odd to provide 
information about privacy best practices there. It would be less awkward 
if we could add a sentence or two to the effect of "As the laws about 
publishing sensitive data differ from country to country, and it is 
beyond the mandate of this group to advise about such policy matters, we 
do not provide best practices for determining which data should or 
should not be published. Rather, we provide guidance about dealing with 
gaps in published data." I also think that people may remove data from a 
dataset for other reasons, so it's also odd to find that the only BP in 
the section about sensitive data might need to be considered by people 
without sensitive data. For example, data could be removed because it is 
found to be wrong or misleading or irrelevant. And data may be sensitive 
for more reasons than personal privacy. For example, there can be 
corporate secrets, issues of national security, intellectual property, 
etc. I think we need to show that we've given this some more 
comprehensive thought if we're going to say anything about it at all.


---> One of the reasons for having machine-readable feedback is to make 
it easier to collect feedback about datasets. It will also be possible 
to process the feedback, and it will be easier to share feedback with 
consumers. Does that make sense to you?

That makes some sense, though I would say that meeting the criteria of 
machine readability as described here makes it considerably more 
difficult to collect feedback. I think it's not very clear what level of 
machine readability you are thinking of. I don't think it's worthwhile 
to do a lot of transformation of the data so that it can be released as 
a dataset on its own. I do acknowledge that it's useful to have a 
capability for users to determine whether others have given similar 
feedback, so I agree that it's good for the feedback to be made 
searchable. It's a good idea to collect the data so that it is 
machine-readable for that purpose and for internal reuse, but I don't 
think it has to be the case that any random semweb application can 
automatically process the feedback. Maybe this just needs to be 
rephrased so that the bar is not quite so high.


On 4/5/16 9:27 AM, Bernadette Farias Lóscio wrote:
> Hi Annette,
>
> Thanks again for your review and your comments! I made some updates on 
> the doc [1] considering your last message, but I still have some 
> comments.
>
>>
>>         Provide descriptive metadata
>>         Re the possible approach to implementation, the list of
>>         metadata fields to be included is not an implementation, so
>>         that should be moved up and listed under intended outcome.
>>         Spatial coverage and temporal period are irrelevant for lots
>>         of datasets, so they should be marked "if relevant". Keywords
>>         and themes/categories are dependent on the context of a
>>         catalog, so I think we should leave them out of this list, or
>>         say that they are needed in that case only.
>>
>>
>>     ---> Suggestion:
>>
>>     The machine readable version of the descriptive metadata can be
>>     provided using the vocabulary recommended by W3C to describe
>>     datasets, i.e. the Data Catalog Vocabulary [VOCAB-DCAT
>>     <http://w3c.github.io/dwbp/bp.html#bib-VOCAB-DCAT>]. This
>>     provides a framework in which datasets can be described as
>>     abstract entities.
>>
>>     Descriptive metadata should include the following overall
>>     features of a dataset:
>>
>>       * The *title* and a *description* of the dataset.
>>       * The *keywords* describing the dataset.
>>       * The *date of publication* of the dataset.
>>       * The *entity responsible (publisher)* for making the dataset
>>         available.
>>       * The *contact point* of the dataset.
>>
>>     When relevant, the following metadata can also be included:
>>
>>       * The *spatial coverage* of the dataset.
>>       * The *temporal period* that the dataset covers.
>>       * The *themes/categories* covered by a dataset.
>>
>
>     I'm a little confused about this one. Are we saying that all the
>     fields listed in the first group should be included in order to meet
>     the criteria of the BP? If that's the case, I think that list
>     belongs in the intended outcome rather than the implementation.
>     The implementation section shouldn't be telling us what we
>     *should* do, right? I think it would be okay if we just removed
>     the "shoulds".
>
>
> ---> I replaced "should" with "can". This is just a suggestion of 
> information that can be provided as descriptive metadata. I think it 
> should be part of the approach to implementation. Do you agree?
>
>>
>>         Use a trusted serialization format for preserved data dumps
>>         To the extent that this is in scope, it is covered under the
>>         BP about using standardized formats. We could add a note to
>>         that mentioning the value for preservation. I don’t think
>>         this needs to be a separate BP.
>>
>>         Update the status of identifiers
>>         To the extent that this is in scope, it should be covered
>>         under versioning or unavailability. What are “preserved”
>>         datasets? Are they available on the web? If not, it is out of
>>         scope. If they are, then they are versions.
>>
>>
>>     --> I created an issue to discuss this with the group - ISSUE-251
>>     [1]
>
>
> ---> As agreed in our last meeting, I included a note in the 
> introduction of the Data Preservation section [2]
>
>>
>>         Sensitive Data: The introduction gives a lot of advice that
>>         sounds like it should be in a BP. I find it awkward that we
>>         offer it in this form instead of a BP. If we want to say that
>>         it is out of scope, then we shouldn't be offering all this
>>         advice in an introduction.
>>
>>
>>     ---> I don't agree. I think it is out of scope of the document to
>>     identify the sensitive data and to tell how to protect the
>>     sensitive data. But once the sensitive data was identified and
>>     properly protected, then the BP shows what should be done to tell
>>     consumers why the data is not available.
>
>     I agree that it's out of scope to tell how to identify sensitive
>     data and how to protect it. But the introduction still says to "
>     identify all sensitive data, assess the exposure risk, determine
>     the intended usage, data user audience and any related usage
>     policies, obtain appropriate approval, and determine the
>     appropriate security measures needed to be taken to protect the data"
>     and to "preserve the privacy of individuals where the release of
>     personal information would endanger safety (unintended accidents)
>     or security (deliberate attack)." Those sound like BPs to me. I'd
>     like to hear what other people in the group think, though.
>
>
> ---> We had a long discussion about this during the F2F in São Paulo 
> [3]. We agreed to remove the BP on Preserve People's Right to privacy 
> and to review the sensitive data section. So, I think we shouldn't 
> create a new BP. The paragraph in the introduction was rewritten 
> considering the discussion that we had during the F2F.
>
>>     BP32, provide information about feedback
>>     The possible approach to implementation is about assigning
>>     metadata about the feedback. I don't think this is a best
>>     practice, and in any case, it's not an implementation of
>>     providing *useful* information about feedback. The useful
>>     information is the actual feedback, not metadata about it. I
>>     would suggest implementation with an issue tracker. The tests
>>     have the same problem, they are about testing metadata, not
>>     testing that the feedback itself can be read by other users.
>>
>>
>> ---> I agree that we shouldn't mention metadata about feedback. I 
>> have a suggestion for the rewriting of this BP:
>>
>> Best Practice 32: Make feedback available
>>
>> Feedback should be available for both human users and computer 
>> applications
>>
>> Why
>>
>> Making feedback about datasets and distributions publicly available 
>> allows users to become aware of other data consumers, supports a 
>> collaborative environment, and shows whether user community experiences, 
>> concerns or questions are currently being addressed. Providing 
>> feedback in a machine-readable format allows computer applications to 
>> automatically collect and process feedback about datasets.
>>
>> Intended Outcome
>>
>> It should be possible for humans to have access to feedback on a 
>> dataset or distribution given by one or more data consumers.
>>
>> It should be possible for machines to automatically process feedback  
>> about a dataset or distribution.
>>
>> Possible Approach to Implementation
>>
>> Feedback can be available as part of an HTML Web page, but it can 
>> also be provided in a machine-readable format according to the 
>> vocabulary to describe dataset usage [DUV 
>> <http://w3c.github.io/dwbp/bp.html#bib-DUV>].
>>
>> How to Test
>>
>> Check if a human consumer can access the feedback about the dataset 
>> or distribution and check if a computer application can automatically 
>> process the feedback.
>> Please let me know if you agree with my suggestions.
>
> I like this except for the requirement of having the feedback machine 
> readable. I think it's a best practice to make it human readable, but 
> I don't see a compelling reason to make the feedback machine readable. 
> I have never done that. Do other people think that is a common 
> practice? It seems to me one could get caught in an infinite loop of 
> providing feedback as a dataset and getting feedback on the feedback 
> dataset, etc.
>
> ---> One of the reasons for having machine-readable feedback is to make 
> it easier to collect feedback about datasets. It will also be possible 
> to process the feedback, and it will be easier to share feedback with 
> consumers. Does that make sense to you?
>
> Thanks!
> Berna
>
> [1] http://w3c.github.io/dwbp/bp.html
> [2] http://w3c.github.io/dwbp/bp.html#dataPreservation
> [3] https://www.w3.org/2015/09/24-dwbp-minutes
>
>
>
>
>>
>> Thanks!
>> Bernadette
>>
>> [1] https://www.w3.org/2013/dwbp/track/issues/251
>>
>>
>>     -- 
>>     Annette Greiner
>>     NERSC Data and Analytics Services
>>     Lawrence Berkeley National Laboratory
>>
>>
>>
>>
>>
>> -- 
>> Bernadette Farias Lóscio
>> Centro de Informática
>> Universidade Federal de Pernambuco - UFPE, Brazil
>> ----------------------------------------------------------------------------

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Wednesday, 6 April 2016 00:35:22 UTC