Re: [Bug 167] explanation of identified, identifiable, and linked from Ari Schwartz on 2003-07-18 (public-p3p-spec@w3.org from July 2003)

From: Ari Schwartz <ari@cdt.org>
Date: Fri, 18 Jul 2003 14:15:23 -0400
To: Rigo Wenning <rigo@w3.org>, public-p3p-spec@w3.org
Message-Id: <a05200f17bb3de677b9c6@[67.201.25.179]>
Hi Rigo,

My definitions were not meant to be from a US cultural perspective. 
In fact, if I understand them correctly, your definitions are much 
closer to the experience (legal interpretations) of the US Privacy 
Act of 1974 then mine are.

The reason that I strayed from this view is because of the use of 
relational databases.  A data collector may create a system where 
fields are not searched at the time of collection but the person is 
still generally identified within the system.

My understanding from Europe was that different countries were 
interpreting this differently and it hadn't been worked out.  If you 
can point to some definitive writing on this subject, I'd definitely 
like to see it.

Ari




At 6:55 PM +0200 7/18/03, Rigo Wenning wrote:
>Ari,
>
>I owe you the response to your very stringent analysis of the word and
>definition of 'identifiable' and 'identified' from a US cultural
>perspective.
>
>So if you read the following, keep in mind that I try to do the same
>from a european perspective. I try to explain more about the model
>behind the european definitions and understanding of identity and
>identifiable. Note that 'identified' does not exist in there, but it is an
>original take-up by the P3P Specification WG that might make it's way
>into data protection (as it makes a lot of sense)
>
>So I risk to be long, but I tried to be short in Bugzilla and was not
>understood. So I try from to explain from Adam and Eve on.
>
>The EU-Directive defines identifiable at two places:
>1/ The scope of data protection as such is limited to identifiable data
>Article 2 a states:
>  'personal data' shall mean any information relating to an identified or
>  identifiable natural person ('data subject'); an identifiable person is
>  one who can be identified, directly or indirectly, in particular by
>  reference to an identification number or to one or more factors
>  specific to his physical, physiological, mental, economic, cultural or
>  social identity;
>
>2/ Identifiable is explained further in consideration 26 of the
>Directive:
>
>(26) Whereas the principles of protection must apply to any information
>concerning an identified or identifiable person; whereas, to determine
>whether a person is identifiable, account should be taken of all the
>means likely reasonably to be used either by the controller or by any
>other person to identify the said person; whereas the principles of
>protection shall not apply to data rendered anonymous in such a way that
>the data subject is no longer identifiable; whereas codes of conduct
>within the meaning of Article 27 may be a useful instrument for
>providing guidance as to the ways in which data may be rendered
>anonymous and retained in a form in which identification of the data
>subject is no longer possible;
>
>Those are my assumptions and definitions as I want to keep P3P somewhat
>in sync with the Directive (see Art. 10 Taskforce for this goal)
>
>So the EU-Directive, if you look carefully at the definitions, does not
>have recurse on the intent of the data controller (or processor) to
>process data (use in your case) to identify. To trigger the application
>of data protection law (scope) it is sufficient that with likely and
>reasonable means, a person is able to say: This is person X (e.g. Ari
>Schwartz). The scope is not limited to the data processor.
>For this email, identifying is easy, as your email-address is a unique
>identifier making clear that you're the particular Ari Schwartz who sent
>the email cited.  So if you would use 12345@w3.org, the W3-Systeam could
>still identify you. So it is still identifiable data. (any person has to
>be taken into account)
>
>This makes the scope too large to give a basis for a technology tool.
>While it might be understandable that a law wants to define a very broad
>scope, it is unusable for most of the decisions that P3P wants to help
>with. That's why we have invented/taken-up the 'identified' term in
>P3P 1.0, mostly because of the access-issues (access identified and not
>all potential identifiable data, which would require a lot of processing
>and identify the requester by doing his request)
>
>So despite all the discussions in this world what PII is or what it
>isn't, despite the misconceptions floating around the world, we better
>ought to have a conscise concept of identity behind P3P. This concept of
>identity distinguishes IMHO between non-identifiable, identifiable and
>identified. I want to prevent that we lose overview by packing
>processing, purpose, disclosure, sharing, retention, sensitive data
>types and other things into this one term and definition. Doing so could
>only bring us into confusion.
>
>So for the sake of clarity, I propose the following definitions for the
>suggested three terms.
>
>1/ Identifiable
>We always operated under the assumption of the definition in whereas 26
>of the EU-Directive. If there is a possibility to get to 'who is that
>guy' with likely reasonable means by any party, than we have the
>assumption of identifiable information, thus information that makes
>assertions about a natural person. The likely and reasonable prevent
>to take science fiction scenarios into account like invading Iraq to
>find out who has joked on the phone about those wappons. (I think you
>made this point in our earlier discussions ;)
>
>2/ non-identifiable
>This is the flip-side of 1/. This is reflected in the <non-identifiable>
>element, which is based directly on the definition. (second phrase
>after colon) of whereas 26. P3P gives a hint to render IP-addresses
>anonymous. This could be taken up by a code of conduct e.g.
>
>3/ identified
>The means (processing of data) described in 1/ have already been
>applied. This makes access to assertions about a natural person easier
>and faster. This is particularly important for access rights but also
>for the overall risk analysis. So identified is data, that does not need
>further processing in this direction (I think the technical term is
>'crossing')
>
>Those three are all relative to 'data'. Data is an object here. The
>attributes to the term data are the three terms above describing the
>relationsship between a given natural person (also an object here) and
>the object 'data'. This relationsship has different levels and
>qualities. We deliberatly choosed the three described (IMHO)
>
>Now, like a cross-roads, there is another concept: The data processing life
>cycle:
>Data is
>a/ collected
>b/ processed
>c/ transferred
>d/ deleted
>
>(Retention is a sub-group to collection, transfer is a subgroup to
>processing)
>
>Now if you suggest to describe 'identified data' as 'how the information
>can be or is being used', or if you describe it as 'reasonably can be
>used by the data collector to identify an individual, your definition
>mixes up identifiable, the processing of some raw data to identify
>an individual.
>
>In fact you say: We use this raw traffic data to mine it and poke until
>we know who you are, where you live and how you shop.
>
>This mixes up the model as it describes a purpose of processing with the
>collection of data and also with the relationsship between our objects
>'natural person' and 'data'. That's why I suggest to keep this
>relationsship and the purpose of processing separate. This is done in
>my definitions. So in non-identifiable, they can't find out, in
>identifiable, they could find out and in identified, they found out.
>
>If I collect identifiable information like IP (not domain), we have to
>separate the question whether this is used to find out or not. The
>question of whether they want to find out or not is an important
>question, but it has to be treated separatly from the term identity as
>it is a purpose and normally not a primary purpose ;).
>
>A/You might say: We collect identifiable information. This is for the user
>-agent to decide to use an anonymizer.
>
>B/You might say: We use the data in A to find out who you are.
>(The problem is, today, we can't say that as the purpose section is
>still too poor)
>
>It doesn't make sense to say:
>
>A/ We collect identifiable information
>B/ We do not use them to identify you (but we could, so trust us and our
>sysadmin ;)
>
>negative assertions just don't fit into the P3P model.
>
>Another cross-roads is so called 'sensitive' data.
>
>According to Shannon/Weaver, any data can be sensitive, if it carries
>enough information. Did Clinten have sex with Lewinsky? A simple 'yes'
>from Mister C., a simple bit, could be sufficient.
>
>So the EU-Directive takes some info up that is 'usually' very senitive,
>like financial info, health info, religion and politics. Those are just
>defined in the law. I don't think we have sufficient technology today to
>define whether data is sensitive or not. We can take those from the OECD
>and the EU-Directive up as it seems to make sense in front of such a
>broad agreement. But outside this narrow scope, it is on the user
>himself, confronted with the data, to judge whether data is sensitive or
>not. And it should surely not mix into the definition of identity,
>identifiable or identified.
>
>That's why I think we should stick with my definitions described in
>1/ 2/ and 3/ and revise the Spec accordingly as far as we can get in 1.1
>
>Best,
>
>Rigo
>
>
>
>
>On Fri, Jun 20, 2003 at 12:09:27PM -0400, Ari Schwartz wrote:
>>
>>  Here is my draft text for addressing the confusion around identity
>>  terms in the spec.
>>
>>
>>
>>  >http://www.w3.org/Bugs/Public/show_bug.cgi?id=167
>>
>>
>>
>>  Identity Definitions in the P3P Specification
>>
>>  In privacy regulations, guidelines and papers about privacy a variety
>>  of terms are used to describe data that identifies an individual to
>>  varying degrees.  Some common terms such as "personally identifiable
>>  information (PII)" are often not defined or the cause for heated
>>  debate.  In different documents, "identity" can be tied to:
>>
>>  1) how the information can be or is being used,
>>  2) how the information is stored, or
>>  3) the type of information.
>>
>>  The P3P Specification Working Group tried to capture all three of
>>  these ideas so that different implementers and users can make
>>  decisions based on the importance they place on these various
>>  definitions of identity. (1)
>>
>>  Identity Through Usage ("identified" data)
>>
>>  The most common term in the specification is "identified data" and
>>  focuses on how the information can be or is being used.
>>
>>  "Identified data" is information that reasonably can be used by the
>>  data collector to identify an individual.  Admittedly, this is a
>>  somewhat subjective standard.  For example, a data collector storing
>>  Internet Protocol (IP) addresses  (which can be created dynamically
>>  or could be static and therefore tied to a particular computer used
>>  by a single individual) should consider the IP address "identified
>>  data" only when an attempt is made to tie the exact addresses to past
>>  records or work with others to identify the specific individual or
>>  computer over a long period of time.  In the more common case, where
>>  data collectors use IP addressing information in the aggregate or
>>  make no attempt to tie the IP address to a specified individual or
>>  computer over a long period of time, IP addresses are not considered
>>  identified even though it is possible for someone (eg, law
>>  enforcement agents with proper subpoena powers) to identify the
>>  individual based on the stored data.
>>
>>
>>  Identity Through Storage ("non-identifiable" and "linked" data)
>>
>>  The working group also felt that data collectors should be able
>>  acknowledge when they make specific attempts to anonymize what would
>>  otherwise be identifiable in its storage.
>>
>>  The term "non-identifiable" data refers to how the information is
>>  stored.  For example, a data collector collecting and storing IP
>>  addresses but not using them should NOT call this data
>>  "non-identifiable" even in the common case where they have no plans
>>  to identify an actual individual or computer. However, if a Web site
>>  collects IP addresses, but actively deletes all but the last four
>>  digits of this information in order to determine short term use, but
>>  insure that a particular individual or computer cannot be
>>  consistently identified, then the data collector can and should call
>>  this information "non-identifiable."  Also, non-identifiable can be
>>  used in cases where no information is being collected at all.  Since
>  > most Web servers are designed to keep Web logs for maintenance, this
>>  would most likely mean that the data collector has taken specific
>>  efforts to ensure the anonymity of users.
>>
>>  Under the above definitions, a lot of information could be
>  > "identifiable" (not specifically made anonymous), but not
>>  "identified" (reasonably able to be tied to an individual or
>>  computer).
>>
>>  Similarly, the term "linked" refers to how information is being
>>  stored in connection with a cookie. All data in a cookie or linked to
>>  a particular user must be disclosed in the cookie's policy. Using the
>>  terminology above, if the data collector collects "identifiable"
>>  information about the user it is generally "linked" data.
>>
>>  Identity Through Information Type
>>
>>  The Working Group felt that different user agent implementations
>>  could be created to focus on different concerns around data type.
>>  Therefore, the working group enabled the creation of a robust data
>>  schema including broad categories of information that may be
>>  considered sensitive by certain user groups.  The Working Group hopes
>>  that a diverse set of user agents will be created to allow users the
>>  ability to make identity decisions based on specific collections and
>>  types of collects if they desire to do so.  For example, a user agent
>>  could allow users to opt to be prompted when medical or financial
>>  identifier is being collected, independent of how that information is
>>  being used.
>>
>>  (1)   More information on the debate and the definitions can be found
>>  in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.
>>
>>
>>
>>  --
>>  ------------------------------------
>>  Ari Schwartz
>>  Associate Director
>>  Center for Democracy and Technology
>>  1634 I Street NW, Suite 1100
>>  Washington, DC 20006
>>  202 637 9800
>>  fax 202 637 0968
>>  ari@cdt.org
>>  http://www.cdt.org
>>  ------------------------------------


-- 
------------------------------------
Ari Schwartz
Associate Director
Center for Democracy and Technology
1634 I Street NW, Suite 1100
Washington, DC 20006
202 637 9800
fax 202 637 0968
ari@cdt.org
http://www.cdt.org
------------------------------------
Received on Friday, 18 July 2003 14:14:58 UTC