Re: My additions today (was Re: Sensitive data text for enrichment section) from Annette Greiner on 2016-05-10 (public-dwbp-wg@w3.org from May 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Tue, 10 May 2016 11:04:52 -0700
To: Phil Archer <phila@w3.org>, Eric Stephan <ericphb@gmail.com>
Cc: Public DWBP WG <public-dwbp-wg@w3.org>, Bernadette Farias Lóscio <bfl@cin.ufpe.br>, Eric Kauz <eric.kauz@gs1.org>, Caroline Burle <cburle@nic.br>, Newton Calegari <newton@nic.br>
Message-ID: <5763071e-21f7-43c0-d4ef-d1513aa09844@lbl.gov>
+1 from me too, BTW.

-Annette


On 5/10/16 8:35 AM, Phil Archer wrote:
> Thanks Eric, I've made those changes.
>
> @Editors, these are included in my current pull request.
>
> Phil
>
> On 09/05/2016 14:22, Eric Stephan wrote:
>> Phil and Annette,
>>
>> I agree with the placement of the sensitive data in the introduction of the
>> document.
>>
>> ~~~
>>
>> Could you change "Not all data" to "Not all data (and metadata)"? If this
>> were changed there would be no need for adding anything to the metadata
>> section.
>>
>> ~~~
>>
>> Could you change the statement:
>>
>> "It is for data publishers, not a technical standards working group, to
>> determine policy on which data should be shared and under what
>> circumstances. "
>>
>> To:
>>
>> "It is for data publishers to determine policy on which data should be
>> shared and under what circumstances. "
>>
>> I understand the sentiment "not a technical standards working group" in the
>> context of the DWBP, but the tone of the statement just sounds a bit too
>> universal. Some domain-specific (health care) technical standards for data
>> sharing are constrained by policy and protocol.
>>
>> Thanks and great work,
>>
>> Eric S.
>>
>> On Mon, May 9, 2016 at 3:38 AM, Phil Archer <phila@w3.org> wrote:
>>
>>> Dear all,
>>>
>>> I have implemented a couple of minor changes to the BP doc that came up as
>>> a result of our discussions on Friday.
>>>
>>> 1. Thanks Eric K for the suggested text that allows us to include direct
>>> refs to the GS1 work which, IMHO, is well worth including. See versions of
>>> your text in
>>> http://philarcher1.github.io/dwbp/bp.html#identifiersWithinDatasets
>>>
>>> (For tracker, action-279)
>>>
>>> 2. Thanks Eric S for words on privacy - I think you'll agree that what
>>> Annette has suggested covers exactly what you were talking about?
>>>
>>> (For tracker: action-278)
>>>
>>> 3. Annette, thanks for reviewing and improving the suggestions about
>>> handling sensitive data - I agree entirely with your suggestions and, more
>>> importantly, I am confident that they reflect the wider discussions we had
>>> on Friday's call. Therefore I have put your improved text in my latest
>>> version of the doc. See
>>> http://philarcher1.github.io/dwbp/bp.html#intro
>>> and
>>> http://philarcher1.github.io/dwbp/bp.html#enrichment
>>>
>>> 4. Having done that, I have deleted the Sensitive Data section and moved
>>> the Data Unavailability to within the Access section, just before the sub
>>> section on APIs. I really wasn't sure where to put it but that seemed as
>>> good as anywhere?
>>>
>>> 5. I've made that sub group on BPs on APIs into a section so it becomes
>>> 8.10.1, complete with ID and all the rest of it.
>>>
>>> 6. I've updated the date of the doc to today's date.
>>>
>>> 7. Editors - I have issued a Pull Request for your consideration.
>>>
>>> HTH
>>>
>>> Phil.
>>>
>>>
>>>
>>> On 07/05/2016 21:05, Annette Greiner wrote:
>>>
>>>> Hi Phil,
>>>> Thanks for letting me weigh in. I understand the connection you’re making
>>>> here, and I think it’s a good thing to mention in the enrichment section.
>>>> What I think is crucial but is not yet reflected in here is the issue of
>>>> privacy breach arising from putting together disparate data that presents
>>>> less risk separately. The second paragraph here is a good but more general
>>>> discussion of security and privacy issues that strikes me as not belonging
>>>> in this particular section. I would suggest instead addressing the more
>>>> general issues in the introduction to our document. Most of the third
>>>> paragraph would also be better in the document introduction, but the last
>>>> sentence is relevant here. As I see it, the real issue with data enrichment
>>>> is combining datasets that each hold so little information about any
>>>> individual that they cannot be identified but that together offer enough
>>>> information that they can be. I would suggest that here we just say,
>>>>
>>>>> Data enrichment refers to a set of processes that can be used to enhance,
>>>>> refine or otherwise improve raw or previously processed data. This idea and
>>>>> other similar concepts contribute to making data a valuable asset for
>>>>> almost any modern business or enterprise. It is a diverse topic in itself,
>>>>> details of which are beyond the scope of the current document. However, it
>>>>> is worth noting that some of these techniques should be approached with
>>>>> caution, as ethical concerns may arise. In scientific research, care must
>>>>> be taken to avoid enrichment that distorts results or statistical outcomes.
>>>>> For data about individuals, privacy issues may arise when combining
>>>>> datasets. That is, enriching one dataset with another, when neither
>>>>> contains sufficient information about any individual to identify them, may
>>>>> yield a combined dataset that compromises privacy. Furthermore, these
>>>>> techniques can be carried out at scale, which in turn highlights the need
>>>>> for caution.
>>>>>
>>>>
>>>> Then, in the document introduction, I would suggest adding the following,
>>>> after the paragraph that begins “In this context…”.
>>>>
>>>>> Not all data should be shared openly, however. Security, commercial
>>>>> sensitivity and, above all, individuals' privacy need to be taken into
>>>>> account. It is for data publishers, not a technical standards working
>>>>> group, to determine policy on which data should be shared and under what
>>>>> circumstances. Data sharing policies are likely to assess the exposure risk
>>>>> and determine the appropriate security measures to be taken to protect
>>>>> sensitive data, such as secure authentication and authorization.
>>>>>
>>>>> Depending on circumstances, sensitive information about individuals
>>>>> might include full name, home address, email address, national
>>>>> identification number, IP address, vehicle registration plate number,
>>>>> driver's license number, face, fingerprints, or handwriting, credit card
>>>>> numbers, digital identity, date of birth, birthplace, genetic information,
>>>>> telephone number, login name, screen name, nickname, health records etc.
>>>>> Although it is likely to be safe to share some of that information openly,
>>>>> and even more within a controlled environment, publishers should bear in
>>>>> mind that combining data from multiple sources may allow inadvertent
>>>>> identification of individuals.
>>>>>
>>>>
>>>> (I took out mention of https, as it will soon be everywhere, which would
>>>> make our doc out of date.)
>>>>
>>>> Also, I noticed a grammatical error in the implementation section of BP
>>>> 31. (Subject-verb agreement is off.) It should read "Techniques for data
>>>> enrichment are complex and go well beyond the scope of this document, which
>>>> can only highlight the possibilities."
>>>> -Annette
>>>>
>>>> On May 6, 2016, at 7:50 AM, Phil Archer <phila@w3.org> wrote:
>>>>>
>>>>> Berna,
>>>>>
>>>>> As promised, I've copied the text from the sensitive data section and
>>>>> merged some of it with the data enrichment intro to end up with this as a
>>>>> suggestion.
>>>>>
>>>>> @Annette - we resolved to do this and move the BP about data
>>>>> unavailability to the data access section. Do you agree with this?
>>>>>
>>>>> ===Begins==
>>>>>
>>>>> Data enrichment refers to a set of processes that can be used to
>>>>> enhance, refine or otherwise improve raw or previously processed data. This
>>>>> idea and other similar concepts contribute to making data a valuable asset
>>>>> for almost any modern business or enterprise. It is a diverse topic in
>>>>> itself, details of which are beyond the scope of the current document.
>>>>> However, it is worth noting that techniques exist to carry out such
>>>>> enrichment at scale which in turn highlights the need for caution.
>>>>>
>>>>> Not all data should be shared openly. Security, commercial sensitivity
>>>>> and, above all, individuals' privacy need to be taken into account. It is
>>>>> for data publishers, not a technical standards working group, to determine
>>>>> policy on which data should be shared and under what circumstances. Data
>>>>> sharing policies are likely to assess the exposure risk and determine the
>>>>> appropriate security measures to be taken to protect sensitive data, such
>>>>> as secure authentication and use of HTTPS.
>>>>>
>>>>> Depending on circumstance, sensitive information about individuals might
>>>>> include: full name, home address, email address, national identification
>>>>> number, IP address, vehicle registration plate number, driver's license
>>>>> number, face, fingerprints, or handwriting, credit card numbers, digital
>>>>> identity, date of birth, birthplace, genetic information, telephone number,
>>>>> login name, screen name, nickname, health records etc. Although it is
>>>>> likely to be safe to share some of that information openly, and even more
>>>>> within a controlled environment, publishers should bear in mind that data
>>>>> enrichment techniques may allow some elements to be discovered and linked
>>>>> from elsewhere.
>>>>>
>>>>> Notwithstanding that caution, data enrichment offers exciting
>>>>> possibilities for both data publishers and consumers.
>>>>>
>>>>>
>>>>>
>>>>> == ends==
>>>>>
>>>>> -- 
>>>>>
>>>>>
>>>>> Phil Archer
>>>>> W3C Data Activity Lead
>>>>> http://www.w3.org/2013/data/
>>>>>
>>>>> http://philarcher.org
>>>>> +44 (0)7887 767755
>>>>> @philarcher1
>>>>>
>>>>
>>>>
>>>>
>>> -- 
>>>
>>>
>>> Phil Archer
>>> W3C Data Activity Lead
>>> http://www.w3.org/2013/data/
>>>
>>> http://philarcher.org
>>> +44 (0)7887 767755
>>> @philarcher1
>>>
>>
>

-- 
Annette Greiner
NERSC Data and Analytics Services
Lawrence Berkeley National Laboratory
Received on Tuesday, 10 May 2016 19:56:20 UTC