[Bug 167] explanation of identified, identifiable, and linked from Ari Schwartz on 2003-08-18 (public-p3p-spec@w3.org from August 2003)

From: Ari Schwartz <ari@cdt.org>
Date: Mon, 18 Aug 2003 13:38:06 -0400
To: <public-p3p-spec@w3.org>
Message-Id: <a05200f06bb6691147ff3@[10.0.1.7]>
Review of what has changed:

1) Discussion of other laws in the context of "identifiable." I'm 
still checking with PIPEDA experts to make sure that the reference to 
Canadian law is correct.
2) Removed headers and all mention of the word "storage"
3) Added a final sentence about "linked" to make options clear to 
data collectors
4) Made small changes in the data by type section to make it clear 
that it is really about "identifiers"

Ari


------


Identity Definitions in the P3P Specification

In privacy regulations, guidelines and papers about privacy a variety 
of terms are used to describe data that identifies an individual to 
varying degrees.

The European Union Directive defines "an identifiable person" as "one 
who can be identified, directly or indirectly, in particular by 
reference to an identification number or to one or more factors 
specific to his physical, physiological, mental, economic, cultural 
or social identity."  The Directive also states that in determining 
whether a person is identifiable "account should be taken of all the 
means likely reasonably to be used either by the controller or by any 
other person to identify the said person; whereas the principles of 
protection shall not apply to data rendered anonymous in such a way 
that the data subject is no longer identifiable."

In Australia, "personal information" is information about an 
individual who can be identified, or whose identity could be 
reasonably ascertained.  In Canada, "personal information" means 
information about an identifiable individual. In the United States, 
different sectors have different standards for identifiability of 
data. Similarly, in many other policy documents, terms such as 
"personally identifiable information (PII)" are often left undefined 
or are the cause of heated debate.

The P3P Specification Working Group has taken the viewpoint that 
most information referring to an individual is "identifiable" in some 
way.  As with other important areas of the specification, the goal of 
the working group was to allow for a wide variety of understandings 
of identity in order to allow data collectors to best express their 
policy and users to make choices based on a definition of identity 
information that is important to them. (1)


"Identified" Data

The most common term in the specification is "identified data," which 
focuses on how the information can be or is being used.

"Identified data" is information that reasonably can be used by the 
data collector to identify an individual.  Admittedly, this is a 
somewhat subjective standard.  For example, a data collector storing 
Internet Protocol (IP) addresses (which can be created dynamically 
or could be static and therefore tied to a particular computer used 
by a single individual) should consider the IP address "identified 
data" only when an attempt is made to tie the exact addresses to past 
records or work with others to identify the specific individual or 
computer over a long period of time.  In the more common case, where 
data collectors use IP addressing information in the aggregate or 
make no attempt to tie the IP address to a specified individual or 
computer over a long period of time, IP addresses are not considered 
identified even though it is possible for someone (e.g., law 
enforcement agents with proper subpoena powers) to identify the 
individual based on the stored data.

As mentioned above, in the P3P context, any data that can be used 
reasonably by a data controller or any other person to identify an 
individual is considered to be identifiable data. The P3P 
specification uses the term "identified" to describe a subset of this 
data that can be reasonably used by a data collector *without 
assistance from other parties* to identify an individual.

"Non-Identifiable" and "Linked" Data

The working group also felt that data collectors should be able to 
acknowledge when they make specific attempts to anonymize information.

The term "non-identifiable" data refers to efforts made specifically 
to de-identify data.  For example, a data collector collecting and 
storing IP addresses but not using them should NOT call this data 
"non-identifiable" even in the common case where they have no plans 
to identify an actual individual or computer. However, if a Web site 
collects IP addresses but actively deletes all but the last four 
digits of this information, in order to determine short-term use 
while ensuring that a particular individual or computer cannot be 
consistently identified, then the data collector can and should call 
this information "non-identifiable."  Also, non-identifiable can be 
used in cases where no information is being collected at all.  Since 
most Web servers are designed to keep Web logs for maintenance, this 
would most likely mean that the data collector has taken specific 
efforts to ensure the anonymity of users.
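
The IP address example above can be sketched in a few lines of 
Python. This is only an illustration under stated assumptions: the 
function name truncate_ip is hypothetical, and "the last four digits" 
is read here as the last four decimal digits of the dotted-quad 
notation. A real deployment would need to decide for itself what 
truncation is sufficient.

```python
def truncate_ip(ip_address: str, keep: int = 4) -> str:
    """Keep only the last `keep` decimal digits of a dotted-quad address.

    Hypothetical helper: the interpretation of "digits" is an assumption,
    not something the P3P specification defines.
    """
    if keep <= 0:
        raise ValueError("keep must be a positive number of digits")
    # Drop the dots and any other non-digit characters.
    digits = "".join(ch for ch in ip_address if ch.isdigit())
    # Discard everything except the trailing digits.
    return digits[-keep:]


# The full address is never stored, only the truncated form.
stored_value = truncate_ip("192.168.10.57")
```

The point of the sketch is that the deletion happens actively, before 
storage, which is what separates this case from simply not using the 
stored addresses.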

Under the above definitions, a lot of information could be 
"identifiable" (not specifically made anonymous), but not 
"identified" (reasonably able to be tied to an individual or 
computer).
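
As a rough sketch of how the three terms relate, the definitions 
above can be written as a small decision function. The names 
Identity and classify, and the three boolean inputs, are 
hypothetical; they are one reading of the text, not part of the P3P 
vocabulary.

```python
from enum import Enum


class Identity(Enum):
    NON_IDENTIFIABLE = "non-identifiable"
    IDENTIFIABLE = "identifiable"  # identifiable, but not identified
    IDENTIFIED = "identified"


def classify(collects_data: bool,
             actively_deidentified: bool,
             collector_can_identify_alone: bool) -> Identity:
    """Classify a data practice using the definitions in the text.

    All three parameters are assumptions made for illustration.
    """
    # Nothing collected, or specific efforts made to de-identify.
    if not collects_data or actively_deidentified:
        return Identity.NON_IDENTIFIABLE
    # The collector can reasonably identify the individual without
    # assistance from other parties.
    if collector_can_identify_alone:
        return Identity.IDENTIFIED
    # Everything else: not made anonymous, but not tied to a person.
    return Identity.IDENTIFIABLE


# IP addresses stored in logs but never tied to an individual.
unused_ip_logs = classify(True, False, False)
```

Under this reading, stored but unused IP addresses come out as 
"identifiable" rather than "non-identifiable," which matches the 
example given earlier.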

Similarly, the term "linked" refers to how information is being used 
in connection with a cookie. All data in a cookie or linked to a 
particular user must be disclosed in the cookie's policy. Using the 
terminology above, if the data collector collects "identifiable" 
information about the user, it is generally "linked" data. For 
example, if the data collector stores a login name in a file 
associated with a persistent cookie and the login name is linked to 
personal data, the cookie is clearly "linked."

In a less clear-cut example, if the data collector ties the cookie to a 
specific order id in a flat file and that order id is tied to 
personal information in a related file, the cookie would be linked to 
all of the relational data unless specific precautions have been 
taken to ensure that a data operator with access to the relational 
data cannot access the flat cookie data and vice versa.
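
One way to picture the order id example is as a chain of references 
that either does or does not lead from the cookie to personal data. 
The sketch below is a minimal model under assumptions: the function 
cookie_is_linked, the store names, and the idea of representing the 
"specific precautions" as a blocked reference are all hypothetical.

```python
from collections import deque


def cookie_is_linked(cookie, references, personal_stores,
                     separated=frozenset()):
    """Return True if personal data is reachable from the cookie.

    references: maps a store name to the stores it refers to.
    personal_stores: names of stores that hold personal information.
    separated: (source, target) pairs that access controls keep apart,
               standing in for the precautions described in the text.
    """
    seen = {cookie}
    queue = deque([cookie])
    while queue:
        node = queue.popleft()
        if node in personal_stores:
            return True
        for target in references.get(node, ()):
            # Skip references that operators cannot follow.
            if (node, target) in separated or target in seen:
                continue
            seen.add(target)
            queue.append(target)
    return False


# cookie -> order id in a flat file -> personal data in a related file
refs = {"cookie": ["order_id"], "order_id": ["customer_record"]}
personal = {"customer_record"}

linked_by_default = cookie_is_linked("cookie", refs, personal)
linked_when_separated = cookie_is_linked(
    "cookie", refs, personal, separated={("cookie", "order_id")})
```

Without any separation the cookie reaches the personal record through 
the order id, so it counts as linked; with the reference blocked, it 
does not.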

In other words, a data collector must do one of the following: 
collect no personal information; take active steps to make sure that 
information referring to a specific individual cannot be identified; 
or use the "linked" tag.


Identifiers

The Working Group decided against an identified or identifiable label 
for particular types of data. However, user agent implementers have 
the option of assigning these or other labels themselves and building 
user interfaces that allow users to make decisions about web sites on 
the basis of how they collect and use certain types of data.

The Working Group felt that different user agent implementations 
could be created to focus on different concerns around data type. 
Therefore, the working group enabled the creation of a robust data 
schema including broad categories of information that may be 
considered sensitive by certain user groups.  The Working Group hopes 
that a diverse set of user agents will be created to give users the 
ability to make identity decisions based on specific collections and 
types of collection if they desire to do so.  For example, a user agent 
could allow users to opt to be prompted when a medical or financial 
identifier is being collected, independent of how that information is 
being used.
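
A user agent rule of this kind might look like the following sketch. 
Everything here is an assumption made for illustration: the 
Statement class, the should_prompt function, and the labels 
"medical" and "financial" are hypothetical and are not taken from 
the P3P data schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    """A simplified stand-in for one statement in a site's policy."""
    data_types: frozenset  # kinds of data the site says it collects
    purpose: str           # how the site says the data is used


def should_prompt(statement: Statement, watched_types: set) -> bool:
    """Prompt whenever a watched type of data is collected.

    The purpose field is deliberately ignored, since the user asked to
    be prompted independent of how the information is being used.
    """
    return bool(statement.data_types & watched_types)


# The user has opted to be prompted for these types of identifier.
user_preferences = {"medical", "financial"}

billing = Statement(frozenset({"financial", "contact"}), "order fulfilment")
browsing = Statement(frozenset({"navigation"}), "site administration")

prompt_for_billing = should_prompt(billing, user_preferences)
prompt_for_browsing = should_prompt(browsing, user_preferences)
```

A different user agent could just as easily weigh the purpose as 
well; the sketch only shows the data-type-only choice described above.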

(1)   More information on the debate and the definitions can be found 
in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.





-- 
------------------------------------
Ari Schwartz
Associate Director
Center for Democracy and Technology
1634 I Street NW, Suite 1100
Washington, DC 20006
202 637 9800
fax 202 637 0968
ari@cdt.org
http://www.cdt.org
------------------------------------
Received on Monday, 18 August 2003 13:41:17 UTC