- From: Roy T. Fielding <fielding@gbiv.com>
- Date: Thu, 31 Jul 2014 15:30:53 -0700
- To: Roy T. Fielding <fielding@gbiv.com>
- Cc: "public-tracking@w3.org List" <public-tracking@w3.org>
On Jul 23, 2014, at 9:19 AM, Roy T. Fielding wrote:

> Alternatively, I would be happy with:
>
> A data set is considered de-identified when there exists a reasonable
> level of justified confidence that none of the data within it can be
> linked to a particular user, user agent, or device.

I did a bit of research and again found that this definition is stronger than what is commonly called de-identified in existing US regulations and human subject studies. A closer version would be

   A data set is considered de-identified when a person with appropriate expertise has determined (with justification) that no human subject can be identified, directly or through an identifier linked to the subject, by that data alone or in combination with other reasonably available information.

or a simpler variant

   A data set is considered de-identified when there exists a reasonable level of justified confidence that no user can be identified, directly or through an identifier linked to the user, by that data alone or in combination with other reasonably available information.

Note that this does not prevent a person with information that is not reasonably available, such as the holder of a secret key, from re-identifying the data set. I assume that is because most of the background in US regulations is from medical studies, where there is some obligation to go back and inform the subjects if a correlation is later found which might indicate a health concern.

My question is: Is that what the working group wants?

I find it incredibly frustrating that some folks (including me in my proposal quoted above) are trying to make a new definition of de-identified so that it looks a lot more like a different commonly used term in this space: anonymized.

Obviously, I wandered into a tar pit. I finally understand why the red-yellow-green states were proposed instead. However, I am also ridiculously stubborn when it comes to tar pits.
I suggest that, rather than continue trying to mangle the definition of an existing term to be more acceptable, we first decide whether we want to be using that term in the first place. In other words, assume we have one big requirement up front that says:

   Data that is noa is out of scope: none of these restrictions on collection, retention, use, or sharing apply when data is noa.

[I am using "noa" here in the faint hope that nobody here has a preconceived understanding of that term (it does have one, but not one in English).]

So, my next question is do we want to define that as:

   Data is noa if only a small set of people sworn to secrecy are capable of identifying any human data subject observed by that data.

or

   Data is noa if it is impossible (as far as we know) for anyone, including those who made it noa, to identify or re-identify any human data subject observed by that data.

because the former is closer to de-identified and the latter is closer to anonymized.

Cheers,

Roy T. Fielding <http://roy.gbiv.com/>
Senior Principal Scientist, Adobe <http://www.adobe.com/>
Received on Thursday, 31 July 2014 22:31:17 UTC