Re: Sensitive data text for enrichment section from Annette Greiner on 2016-05-07 (public-dwbp-wg@w3.org from May 2016)

From: Annette Greiner <amgreiner@lbl.gov>
Date: Sat, 7 May 2016 13:05:35 -0700
To: Phil Archer <phila@w3.org>
Cc: Public DWBP WG <public-dwbp-wg@w3.org>, Bernadette Farias Lóscio <bfl@cin.ufpe.br>
Message-Id: <CE264BA3-759E-4383-97DB-5EB22D2B72CB@lbl.gov>
Hi Phil,
Thanks for letting me weigh in. I understand the connection you’re making here, and I think it’s a good thing to mention in the enrichment section. What I think is crucial but not yet reflected here is the risk of a privacy breach arising from putting together disparate data that present less risk separately. The second paragraph here is a good but more general discussion of security and privacy issues that strikes me as not belonging in this particular section. I would suggest instead addressing the more general issues in the introduction to our document. Most of the third paragraph would also be better in the document introduction, but the last sentence is relevant here. As I see it, the real issue with data enrichment is combining datasets that each hold too little information about any individual to identify that person, but that together offer enough information to do so. I would suggest that here we just say:

> Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise. It is a diverse topic in itself, details of which are beyond the scope of the current document. However, it is worth noting that some of these techniques should be approached with caution, as ethical concerns may arise. In scientific research, care must be taken to avoid enrichment that distorts results or statistical outcomes. For data about individuals, privacy issues may arise when combining datasets. That is, enriching one dataset with another, when neither contains sufficient information about any individual to identify them, may yield a combined dataset that compromises privacy. Furthermore, these techniques can be carried out at scale, which in turn highlights the need for caution.
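
To make the combination risk in the proposed text concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration and is not taken from the working group's document: the toy tables `residence` and `employment`, the shared pseudonymous key `pid`, and the helper names `smallest_group` and `enrich` are all hypothetical. The sketch measures the smallest group of records that share the same attribute values. A value of 1 means at least one record is unique on those attributes and so may be identifiable.

```python
from collections import Counter

# Hypothetical toy data: both tables share a pseudonymous key "pid".
residence = [
    {"pid": 1, "postcode": "94720", "birth_year": 1970},
    {"pid": 2, "postcode": "94720", "birth_year": 1970},
    {"pid": 3, "postcode": "94110", "birth_year": 1985},
    {"pid": 4, "postcode": "94110", "birth_year": 1985},
]
employment = [
    {"pid": 1, "occupation": "nurse"},
    {"pid": 2, "occupation": "teacher"},
    {"pid": 3, "occupation": "nurse"},
    {"pid": 4, "occupation": "teacher"},
]


def smallest_group(rows, attrs):
    """Return the size of the smallest group of rows sharing the same attrs values."""
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    return min(counts.values())


def enrich(left, right, key):
    """Inner-join two lists of dicts on key (an assumed, simplified enrichment step)."""
    index = {row[key]: row for row in right}
    return [{**row, **index[row[key]]} for row in left if row[key] in index]


combined = enrich(residence, employment, "pid")

# Each source on its own leaves every record hidden in a group of two.
k_residence = smallest_group(residence, ["postcode", "birth_year"])
k_employment = smallest_group(employment, ["occupation"])

# After enrichment, every combination of attributes is unique.
k_combined = smallest_group(combined, ["postcode", "birth_year", "occupation"])

print(k_residence, k_employment, k_combined)
```

In this toy data each source leaves every record in a group of two, while the enriched table leaves every record alone in its group, which is the situation the paragraph above warns about.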

Then, in the document introduction, I would suggest adding the following, after the paragraph that begins “In this context…”.

> Not all data should be shared openly, however. Security, commercial sensitivity and, above all, individuals' privacy need to be taken into account. It is for data publishers, not a technical standards working group, to determine policy on which data should be shared and under what circumstances. Data sharing policies are likely to assess the exposure risk and determine the appropriate security measures to be taken to protect sensitive data, such as secure authentication and authorization.
> 
> Depending on circumstances, sensitive information about individuals might include full name, home address, email address, national identification number, IP address, vehicle registration plate number, driver's license number, face, fingerprints, or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc. Although it is likely to be safe to share some of that information openly, and even more within a controlled environment, publishers should bear in mind that combining data from multiple sources may allow inadvertent identification of individuals.

(I took out mention of https, as it will soon be everywhere, which would make our doc out of date.)

Also, I noticed a grammatical error in the implementation section of BP 31. (Subject-verb agreement is off.) It should read "Techniques for data enrichment are complex and go well beyond the scope of this document, which can only highlight the possibilities."
-Annette

> On May 6, 2016, at 7:50 AM, Phil Archer <phila@w3.org> wrote:
> 
> Berna,
> 
> As promised, I've copied the text from the sensitive data section and merged some of it with the data enrichment intro to end up with this as a suggestion.
> 
> @Annette - we resolved to do this and move the BP about data unavailability to the data access section. Do you agree with this?
> 
> ===Begins==
> 
> Data enrichment refers to a set of processes that can be used to enhance, refine or otherwise improve raw or previously processed data. This idea and other similar concepts contribute to making data a valuable asset for almost any modern business or enterprise. It is a diverse topic in itself, details of which are beyond the scope of the current document. However, it is worth noting that techniques exist to carry out such enrichment at scale, which in turn highlights the need for caution.
> 
> Not all data should be shared openly. Security, commercial sensitivity and, above all, individuals' privacy need to be taken into account. It is for data publishers, not a technical standards working group, to determine policy on which data should be shared and under what circumstances. Data sharing policies are likely to assess the exposure risk and determine the appropriate security measures to be taken to protect sensitive data, such as secure authentication and use of HTTPS.
> 
> Depending on circumstance, sensitive information about individuals might include: full name, home address, email address, national identification number, IP address, vehicle registration plate number, driver's license number, face, fingerprints, or handwriting, credit card numbers, digital identity, date of birth, birthplace, genetic information, telephone number, login name, screen name, nickname, health records etc. Although it is likely to be safe to share some of that information openly, and even more within a controlled environment, publishers should bear in mind that data enrichment techniques may allow some elements to be discovered and linked from elsewhere.
> 
> Notwithstanding that caution, data enrichment offers exciting possibilities for both data publishers and consumers.
> 
> 
> 
> == ends==
> 
> -- 
> 
> 
> Phil Archer
> W3C Data Activity Lead
> http://www.w3.org/2013/data/
> 
> http://philarcher.org
> +44 (0)7887 767755
> @philarcher1
Received on Saturday, 7 May 2016 21:35:57 UTC