[Bug 167] explanation of identified, identifiable, and linked from Ari Schwartz on 2003-08-06 (public-p3p-spec@w3.org from August 2003)

From: Ari Schwartz <ari@cdt.org>
Date: Wed, 6 Aug 2003 12:22:35 -0400
To: <public-p3p-spec@w3.org>
Message-Id: <a05200f1fbb56da86738d@[67.208.73.61]>
Updated Draft based on comments from the WG:

----


Identity Definitions in the P3P Specification

In privacy regulations, guidelines, and papers about privacy, a variety
of terms are used to describe data that identifies an individual to 
varying degrees.  The European Union Directive defines "an 
identifiable person" as "one who can be identified, directly or 
indirectly, in particular by reference to an identification number or 
to one or more factors specific to his physical, physiological, 
mental, economic, cultural or social identity."  The Directive also 
states that in determining whether a person is identifiable "account 
should be taken of all the means likely reasonably to be used either 
by the controller or by any other person to identify the said person; 
whereas the principles of protection shall not apply to data rendered 
anonymous in such a way that the data subject is no longer 
identifiable."  In other policy documents, terms such as "personally
identifiable information (PII)" are often left undefined or are the
cause of heated debate.

In different documents, "identity" can be tied to:

1) how the information can be or is being used,
2) how the information is stored, or
3) the type of information.

The P3P Specification Working Group tried to capture all three of 
these ideas so that different implementers and users can make 
decisions based on the importance they place on these various 
definitions of identity. (1)

Identity Through Usage ("identified" data)

The most common term in the specification is "identified data," which
focuses on how the information can be or is being used.

"Identified data" is information that reasonably can be used by the 
data collector to identify an individual.  Admittedly, this is a 
somewhat subjective standard.  For example, a data collector storing 
Internet Protocol (IP) addresses (which can be assigned dynamically
or can be static and therefore tied to a particular computer used
by a single individual) should consider the IP address "identified
data" only when an attempt is made to tie the exact addresses to past
records or to work with others to identify the specific individual or
computer over a long period of time.  In the more common case, where
data collectors use IP addressing information in the aggregate or
make no attempt to tie the IP address to a specific individual or
computer over a long period of time, IP addresses are not considered
identified even though it is possible for someone (e.g., law
enforcement agents with proper subpoena powers) to identify the
individual based on the stored data.

In the P3P Specification, the term "identifiable" is used in much the
same way as it is used in the EU Directive. Thus, in the P3P
context, any data that can be used reasonably by a data controller or 
any other person to identify an individual is considered to be 
identifiable data. The P3P specification uses the term "identified" 
to describe a subset of this data that can be reasonably used by a 
data collector *without assistance from other parties* to identify an 
individual.

Identity Through Storage ("non-identifiable" and "linked" data)

The Working Group also felt that data collectors should be able to
acknowledge when they make specific attempts to anonymize, in
storage, data that would otherwise be identifiable.

The term "non-identifiable" data refers to how the information is 
stored.  For example, a data collector collecting and storing IP 
addresses but not using them should NOT call this data 
"non-identifiable" even in the common case where they have no plans 
to identify an actual individual or computer. However, if a Web site
collects IP addresses but actively deletes all but the last four
digits of this information, in order to determine short-term use
while ensuring that a particular individual or computer cannot be
consistently identified, then the data collector can and should call
this information "non-identifiable."  Also, non-identifiable can be
used in cases where no information is being collected at all.  Since 
most Web servers are designed to keep Web logs for maintenance, this 
would most likely mean that the data collector has taken specific
efforts to ensure the anonymity of users.
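
The truncation step in the example above can be sketched in a few
lines. This is only an illustration under stated assumptions: it reads
"the last four digits" literally as the last four decimal digits of a
dotted-quad address, and the function name is hypothetical rather than
anything defined by the specification.

```python
def keep_last_four_digits(ip_address: str) -> str:
    """Return only the last four decimal digits of an IP address string.

    Assumption: "last four digits" is taken literally, so dots are
    dropped and everything before the final four digits is discarded.
    """
    digits = "".join(ch for ch in ip_address if ch.isdigit())
    return digits[-4:]


# The full address is never written to storage, only the truncated form.
stored_value = keep_last_four_digits("192.168.101.234")
```

Many different addresses map to the same stored value, which is what
keeps a particular individual or computer from being consistently
identified from the log.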

Under the above definitions, a lot of information could be 
"identifiable" (not specifically made anonymous), but not 
"identified" (reasonably able to be tied to an individual or 
computer).
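
One way to read the three terms together is as a small decision
procedure. The sketch below is an interpretation of the definitions
above, not text from the specification; the enum, the function, and
the two boolean inputs are hypothetical names chosen for illustration.

```python
from enum import Enum


class IdentityLabel(Enum):
    NON_IDENTIFIABLE = "non-identifiable"
    IDENTIFIABLE = "identifiable"
    IDENTIFIED = "identified"


def classify(anonymized_in_storage: bool,
             collector_can_identify_unaided: bool) -> IdentityLabel:
    """Return the most specific label that applies to stored data.

    Assumption: "identified" is treated as a subset of "identifiable",
    so data the collector can reasonably tie to an individual without
    help from other parties gets the narrower label.
    """
    if anonymized_in_storage:
        return IdentityLabel.NON_IDENTIFIABLE
    if collector_can_identify_unaided:
        return IdentityLabel.IDENTIFIED
    return IdentityLabel.IDENTIFIABLE


# Raw IP logs that are kept but never tied to individuals:
raw_log_label = classify(anonymized_in_storage=False,
                         collector_can_identify_unaided=False)
```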

Similarly, the term "linked" refers to how information is being 
stored in connection with a cookie. All data in a cookie or linked to 
a particular user must be disclosed in the cookie's policy. Using the 
terminology above, if the data collector collects "identifiable"
information about the user, it is generally "linked" data. For
example, if the data collector stores a login name in a file 
associated with a persistent cookie and the login name is linked to 
personal data, the cookie is clearly "linked."

In a less clear-cut example, if the data collector ties the cookie to a
specific order id in a flat file and that order id is tied to 
personal information in a related file, the cookie would be linked to 
all of the relational data unless specific precautions have been 
taken to ensure that a data operator with access to the relational 
data cannot access the flat cookie data and vice versa.
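
The order id example can be modeled as following references from the
cookie until personal data is reached. The sketch below assumes a
simple dictionary of references between records; the key names and
the function are hypothetical, and the "specific precautions" are
modeled only as the absence of a usable reference from the cookie.

```python
from collections import deque


def cookie_is_linked(cookie_key, references, personal_keys):
    """Return True if personal data is reachable from the cookie.

    references maps a record key to the keys it points at;
    personal_keys is the set of records holding personal information.
    """
    seen = {cookie_key}
    queue = deque([cookie_key])
    while queue:
        key = queue.popleft()
        if key in personal_keys:
            return True
        for next_key in references.get(key, ()):
            if next_key not in seen:
                seen.add(next_key)
                queue.append(next_key)
    return False


# Cookie -> order id in a flat file -> personal record in a related file.
references = {
    "cookie:abc": ["order:1001"],
    "order:1001": ["customer:42"],
}
personal_keys = {"customer:42"}
linked = cookie_is_linked("cookie:abc", references, personal_keys)

# With precautions in place, no operator can follow cookie -> order.
separated = {"order:1001": ["customer:42"]}
linked_after_separation = cookie_is_linked("cookie:abc", separated,
                                           personal_keys)
```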

Identity Through Information Type

The Working Group decided against an identified or identifiable label 
for particular types of data. However, user agent implementers have 
the option of assigning these or other labels themselves and building 
user interfaces that allow users to make decisions about web sites on 
the basis of how they collect and use certain types of data.

The Working Group felt that different user agent implementations 
could be created to focus on different concerns around data type. 
Therefore, the Working Group enabled the creation of a robust data
schema including broad categories of information that may be
considered sensitive by certain user groups.  The Working Group hopes
that a diverse set of user agents will be created to give users the
ability to make identity decisions based on specific collections and
types of data collected, if they desire to do so.  For example, a user
agent could allow users to opt to be prompted when a medical or
financial identifier is being collected, independent of how that
information is being used.
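
A user agent rule of this kind might look like the sketch below. It
assumes the agent has already extracted the category names a policy
declares; "health" and "financial" follow the P3P category
vocabulary, while the function and constant names are hypothetical.

```python
# Categories this particular user wants to be asked about (assumed
# user preference, not a default from the specification).
SENSITIVE_CATEGORIES = frozenset({"health", "financial"})


def categories_to_prompt(declared_categories,
                         sensitive=SENSITIVE_CATEGORIES):
    """Return the declared categories that should trigger a prompt.

    The check looks only at the type of data collected, independent
    of the purposes the policy states for it.
    """
    return sorted(set(declared_categories) & set(sensitive))


prompt_for = categories_to_prompt(["navigation", "health", "computer"])
```

An empty result means the user agent can proceed without prompting
under this rule.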

(1)   More information on the debate and the definitions can be found 
in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.



-- 
------------------------------------
Ari Schwartz
Associate Director
Center for Democracy and Technology
1634 I Street NW, Suite 1100
Washington, DC 20006
202 637 9800
fax 202 637 0968
ari@cdt.org
http://www.cdt.org
------------------------------------
Received on Wednesday, 6 August 2003 12:21:56 UTC