[Bug 167] explanation of identified, identifiable, and linked from Ari Schwartz on 2003-09-03 (public-p3p-spec@w3.org from September 2003)

From: Ari Schwartz <ari@cdt.org>
Date: Wed, 3 Sep 2003 12:17:05 -0400
To: <public-p3p-spec@w3.org>
Message-Id: <a05200f36bb7bc1e38f5c@[10.0.1.7]>
Final Version with Rigo's suggested changes -- 2 weeks to review (Sept. 17)



------

Identity Definitions in the P3P Specification

In privacy regulations, guidelines, and papers about privacy, a 
variety of terms are used to describe data that identifies an 
individual to varying degrees.

The European Union Directive defines "an identifiable person" as "one 
who can be identified, directly or indirectly, in particular by 
reference to an identification number or to one or more factors 
specific to his physical, physiological, mental, economic, cultural 
or social identity."  The Directive also states that in determining 
whether a person is identifiable "account should be taken of all the 
means likely reasonably to be used either by the controller or by any 
other person to identify the said person; whereas the principles of 
protection shall not apply to data rendered anonymous in such a way 
that the data subject is no longer identifiable."

In Australia, "personal information" is information about an 
individual who can be identified, or whose identity could be 
reasonably ascertained.  In Canada, "personal information" means 
information about an "identifiable" individual. In the United States, 
different sectors have different standards for identifiability of 
data. Similarly, in many other policy documents, terms such as 
"personally identifiable information (PII)" are often either left 
undefined or the cause of heated debate.

The P3P Specification Working Group has taken the viewpoint that 
most information referring to an individual is "identifiable" in some 
way.  As with other important areas of the specification, the goal of 
the working group was to allow for a wide variety of understandings 
of identity in order to allow data collectors to best express their 
policy and users to make choices based on a definition of identity 
information that is important to them. (1)


"Identified" Data

The most common term in the specification is "identified data," which 
focuses on whether a service knows the data subject's identity.

"Identified data" is information that is in a record or profile and 
can reasonably be tied to an individual.  Admittedly, this is a somewhat 
subjective standard.  For example, a data collector storing Internet 
Protocol (IP) addresses (which can be created dynamically or could 
be static and therefore tied to a particular computer used by a 
single individual) should consider the IP address "identified data" 
only when this data is added to the record or profile of a specific 
individual.  In the more common case, where data collectors use IP 
addressing information in the aggregate or make no attempt to tie the 
IP address to a specified individual or computer over a long period 
of time, IP addresses are not considered identified even though it is 
possible for someone (e.g., law enforcement agents with proper subpoena 
powers) to identify the individual based on the stored data.

As mentioned above, in the P3P context, any data that can be used 
reasonably by a data controller or any other person to identify an 
individual is considered to be identifiable data. The P3P 
specification uses the term "identified" to describe a subset of this 
data that can reasonably be used by a data collector *without 
assistance from other parties* to identify an individual.

"Non-Identifiable" and "Linked" Data

The working group also felt that data collectors should be able to 
acknowledge when they make specific attempts to anonymize information.

The term "non-identifiable" data refers to efforts made specifically 
to de-identify data.  For example, a data collector collecting and 
storing IP addresses but not using them should NOT call this data 
"non-identifiable" even in the common case where they have no plans 
to identify an actual individual or computer. However, if a Web site 
collects IP addresses, but actively deletes all but the last four 
digits of this information in order to determine short-term use while 
ensuring that a particular individual or computer cannot be 
consistently identified, then the data collector can and should call 
this information "non-identifiable."  Also, non-identifiable can be 
used in cases where no information is being collected at all.  Since 
most Web servers are designed to keep Web logs for maintenance, this 
would most likely mean that the data collector has taken specific 
efforts to ensure the anonymity of users.
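
As a rough illustration of the truncation example above, the following 
Python sketch keeps only the last four digits of an IP address before 
anything is stored.  The function name truncate_ip and the reading of 
"last four digits" as the final four decimal digits of the dotted 
address are assumptions made for illustration; the specification does 
not prescribe a particular method.

```python
def truncate_ip(ip: str, keep: int = 4) -> str:
    """Return only the last `keep` decimal digits of an IP address string.

    Assumption: "digits" means the decimal digits of the dotted form,
    with the dots dropped. Everything else is discarded before storage.
    """
    if keep <= 0:
        raise ValueError("keep must be a positive number of digits")
    digits = "".join(ch for ch in ip if ch.isdigit())
    return digits[-keep:]


if __name__ == "__main__":
    # Only the truncated value would be written to the log.
    print(truncate_ip("192.168.10.57"))
```

Because many different addresses share the same trailing digits, the 
stored value can still separate visits over a short period but cannot 
be used to single out one computer consistently.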

Under the above definitions, a lot of information could be 
"identifiable" (not specifically made anonymous), but not 
"identified" (reasonably able to be tied to an individual or 
computer).
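
The relationship between the three terms can be sketched as a small 
decision function.  This is only one possible reading of the 
definitions above; the names Identity and classify and the three 
boolean inputs are hypothetical and do not appear in the specification.

```python
from enum import Enum


class Identity(Enum):
    NON_IDENTIFIABLE = "non-identifiable"
    IDENTIFIABLE = "identifiable"
    IDENTIFIED = "identified"


def classify(collected: bool, deidentified: bool, tied_to_profile: bool) -> Identity:
    """Classify a piece of data using the definitions in this note.

    collected       -- the data is collected or stored at all
    deidentified    -- the collector took specific steps to anonymize it
    tied_to_profile -- the collector can reasonably tie it to an
                       individual's record without outside assistance
    """
    if not collected or deidentified:
        return Identity.NON_IDENTIFIABLE
    if tied_to_profile:
        # "Identified" data is a subset of "identifiable" data.
        return Identity.IDENTIFIED
    return Identity.IDENTIFIABLE


if __name__ == "__main__":
    # IP addresses kept in ordinary server logs and never tied to a profile.
    print(classify(collected=True, deidentified=False, tied_to_profile=False).value)
```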

Similarly, the term "linked" refers to how information is being used 
in connection with a cookie. All data in a cookie or linked to a 
particular user must be disclosed in the cookie's policy. Using the 
terminology above, if the data collector collects "identifiable" 
information about the user, it is generally "linked" data. For 
example, if the data collector stores a login name in a file 
associated with a persistent cookie and the login name is linked to 
personal data, the cookie is clearly "linked."

In a less clear-cut example, if the data collector ties the cookie to a 
specific order id in a flat file and that order id is tied to 
personal information in a related file, the cookie would be linked to 
all of the relational data unless specific precautions have been 
taken to ensure that a data operator with access to the relational 
data cannot access the flat cookie data and vice versa.
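
The order id example can be modeled as a reachability check: a cookie 
is "linked" if personal data can be reached from it by following keys 
from one data store to the next.  The sketch below is an assumption 
about how a data collector might audit this; the function is_linked, 
the store names, and the walled_off parameter are all hypothetical.

```python
from collections import deque
from typing import AbstractSet, Dict, FrozenSet, Set


def is_linked(
    cookie: str,
    references: Dict[str, Set[str]],
    personal: Set[str],
    walled_off: AbstractSet[FrozenSet[str]] = frozenset(),
) -> bool:
    """Return True if any store holding personal data is reachable from the cookie.

    references -- maps each store to the stores its keys point into
    personal   -- stores that hold personal information
    walled_off -- pairs of stores that no single operator can join
    """
    seen = {cookie}
    queue = deque([cookie])
    while queue:
        store = queue.popleft()
        if store in personal:
            return True
        for nxt in references.get(store, set()):
            if nxt in seen or frozenset((store, nxt)) in walled_off:
                continue
            seen.add(nxt)
            queue.append(nxt)
    return False


if __name__ == "__main__":
    refs = {"cookie": {"order_file"}, "order_file": {"customer_file"}}
    print(is_linked("cookie", refs, personal={"customer_file"}))
```

Passing walled_off={frozenset(("order_file", "customer_file"))} models 
the case where specific precautions keep the two files apart, and the 
same call then reports the cookie as not linked.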

In other words, a data collector must do one of the following: 
collect no personal information; take active steps to make sure that 
information referring to a specific individual cannot be identified; 
or use the "linked" tag.


Identifiers

The Working Group decided against an identified or identifiable label 
for particular types of data. However, user agent implementers have 
the option of assigning these or other labels themselves and building 
user interfaces that allow users to make decisions about web sites on 
the basis of how they collect and use certain types of data.

The Working Group felt that different user agent implementations 
could be created to focus on different concerns around data type. 
Therefore, the working group enabled the creation of a robust data 
schema including broad categories of information that may be 
considered sensitive by certain user groups.  The Working Group hopes 
that a diverse set of user agents will be created to allow users the 
ability to make identity decisions based on specific collections and 
types of data if they desire to do so.  For example, a user agent 
could allow users to opt to be prompted when a medical or financial 
identifier is being collected, independent of how that information is 
being used.
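
A user agent rule of this kind might look like the following sketch. 
It assumes the agent has already extracted the set of data categories 
a site's policy declares; the function should_prompt and the category 
labels used here are illustrative assumptions, not part of any 
particular user agent.

```python
from typing import Iterable

# Categories this hypothetical user has asked to be warned about.
DEFAULT_WATCHLIST = frozenset({"health", "financial"})


def should_prompt(
    declared_categories: Iterable[str],
    watchlist: Iterable[str] = DEFAULT_WATCHLIST,
) -> bool:
    """Return True if the policy declares any category on the user's watchlist.

    The decision ignores purpose and recipient, so the user is prompted
    regardless of how the site says it will use the data.
    """
    return bool(set(declared_categories) & set(watchlist))


if __name__ == "__main__":
    print(should_prompt({"navigation", "financial"}))
```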

(1)   More information on the debate and the definitions can be found 
in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.





-- 
------------------------------------
Ari Schwartz
Associate Director
Center for Democracy and Technology
1634 I Street NW, Suite 1100
Washington, DC 20006
202 637 9800
fax 202 637 0968
ari@cdt.org
http://www.cdt.org
------------------------------------
Received on Wednesday, 3 September 2003 12:19:30 UTC