Re: Deidentification (ISSUE-188) from Roy T. Fielding on 2014-07-31 (public-tracking@w3.org from July 2014)

From: Roy T. Fielding <fielding@gbiv.com>
Date: Thu, 31 Jul 2014 15:30:53 -0700
To: Roy T. Fielding <fielding@gbiv.com>
Cc: "public-tracking@w3.org List" <public-tracking@w3.org>
Message-Id: <83C6EC15-38A2-4DF0-8393-53C3D87BA55E@gbiv.com>

On Jul 23, 2014, at 9:19 AM, Roy T. Fielding wrote:
> Alternatively, I would be happy with:
> 
>  A data set is considered de-identified when there exists a reasonable
>  level of justified confidence that none of the data within it can be
>  linked to a particular user, user agent, or device.

I did a bit of research and again found that this definition is stronger
than what is commonly called de-identified in existing US regulations
and human subject studies.  A closer version would be

  A data set is considered de-identified when a person with appropriate
  expertise has determined (with justification) that no human subject
  can be identified, directly or through an identifier linked to the
  subject, by that data alone or in combination with other reasonably
  available information.

or a simpler variant

  A data set is considered de-identified when there exists a reasonable
  level of justified confidence that no user can be identified,
  directly or through an identifier linked to the user, by that data
  alone or in combination with other reasonably available information.

Note that this does not prevent a person with information that is not
reasonably available, such as the holder of a secret key, from
re-identifying the data set.  I assume that is because most of the
background in US regulations is from medical studies, where there is
some obligation to go back and inform the subjects if a correlation
is later found which might indicate a health concern.
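
To make the secret-key caveat above concrete, here is a minimal sketch of keyed pseudonymization, assuming an HMAC-SHA256 over the user identifier. The function names, record fields, and example addresses are hypothetical and are not taken from any proposal in this thread. Whether the released set would actually qualify as de-identified also depends on the other fields it contains.

```python
import hashlib
import hmac
import secrets


def pseudonymize(user_id: str, key: bytes) -> str:
    """Replace an identifier with a keyed hash (hypothetical helper)."""
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def reidentify(pseudonym: str, candidates, key: bytes):
    """Only a key holder can recompute pseudonyms to find a match."""
    for user_id in candidates:
        if hmac.compare_digest(pseudonymize(user_id, key), pseudonym):
            return user_id
    return None


# The secret key is the information that is "not reasonably available".
key = secrets.token_bytes(32)

records = [
    {"user": "alice@example.com", "page": "/a"},
    {"user": "bob@example.com", "page": "/b"},
]

# The released data set carries pseudonyms instead of raw identifiers.
released = [
    {"user": pseudonymize(r["user"], key), "page": r["page"]} for r in records
]

known_users = ["bob@example.com", "alice@example.com"]

# With the key, the holder can link a released row back to a user.
with_key = reidentify(released[0]["user"], known_users, key)

# With a different key, the same search finds nothing.
without_key = reidentify(released[0]["user"], known_users, secrets.token_bytes(32))
```

In this sketch the key holder can still go back to a subject, which is the behavior that separates the de-identified reading from the anonymized one.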

My question is: Is that what the working group wants?

I find it incredibly frustrating that some folks (including me in my
proposal quoted above) are trying to make a new definition of
de-identified so that it looks a lot more like a different commonly
used term in this space: anonymized.

Obviously, I wandered into a tar pit.  I finally understand why the
red-yellow-green states were proposed instead.

However, I am also ridiculously stubborn when it comes to tar pits.

I suggest that, rather than continue trying to mangle the definition
of an existing term to be more acceptable, we first decide whether
we want to be using that term in the first place.  In other words,
assume we have one big requirement up front that says:

   Data that is noa is out of scope: none of these restrictions on
   collection, retention, use, or sharing apply when data is noa.

   [I am using "noa" here in the faint hope that nobody here has a
   preconceived understanding of that term (it does have one, but
   not one in English).]

So, my next question is whether we want to define that as:

   Data is noa if only a small set of people sworn to secrecy
   are capable of identifying any human data subject observed by
   that data.

or

   Data is noa if it is impossible (as far as we know) for anyone,
   including those who made it noa, to identify or re-identify
   any human data subject observed by that data.

because the former is closer to de-identified and the latter is
closer to anonymized.

Cheers,

Roy T. Fielding                     <http://roy.gbiv.com/>
Senior Principal Scientist, Adobe   <http://www.adobe.com/>

Received on Thursday, 31 July 2014 22:31:17 UTC