Re: [Bug 167] explanation of identified, identifiable, and linked from Rigo Wenning on 2003-07-18 (public-p3p-spec@w3.org from July 2003)

From: Rigo Wenning <rigo@w3.org>
Date: Fri, 18 Jul 2003 18:55:27 +0200
To: public-p3p-spec@w3.org
Cc: Ari Schwartz <ari@cdt.org>
Message-ID: <20030718165527.GE10714@rigo.w3.org>
Ari, 

I owe you the response to your very stringent analysis of the word and
definition of 'identifiable' and 'identified' from a US cultural
perspective. 

So if you read the following, keep in mind that I try to do the same
from a european perspective. I try to explain more about the model
behind the european definitions and understanding of identity and
identifiable. Note that 'identified' does not exist in there, but it is an
original take-up by the P3P Specification WG that might make it's way
into data protection (as it makes a lot of sense)

So I risk to be long, but I tried to be short in Bugzilla and was not
understood. So I try from to explain from Adam and Eve on. 

The EU-Directive defines identifiable at two places: 
1/ The scope of data protection as such is limited to identifiable data
Article 2 a states:
 'personal data' shall mean any information relating to an identified or
 identifiable natural person ('data subject'); an identifiable person is
 one who can be identified, directly or indirectly, in particular by
 reference to an identification number or to one or more factors
 specific to his physical, physiological, mental, economic, cultural or
 social identity; 

2/ Identifiable is explained further in consideration 26 of the
Directive:

(26) Whereas the principles of protection must apply to any information
concerning an identified or identifiable person; whereas, to determine
whether a person is identifiable, account should be taken of all the
means likely reasonably to be used either by the controller or by any
other person to identify the said person; whereas the principles of
protection shall not apply to data rendered anonymous in such a way that
the data subject is no longer identifiable; whereas codes of conduct
within the meaning of Article 27 may be a useful instrument for
providing guidance as to the ways in which data may be rendered
anonymous and retained in a form in which identification of the data
subject is no longer possible; 

Those are my assumptions and definitions as I want to keep P3P somewhat
in sync with the Directive (see Art. 10 Taskforce for this goal)

So the EU-Directive, if you look carefully at the definitions, does not
have recurse on the intent of the data controller (or processor) to
process data (use in your case) to identify. To trigger the application
of data protection law (scope) it is sufficient that with likely and
reasonable means, a person is able to say: This is person X (e.g. Ari
Schwartz). The scope is not limited to the data processor. 
For this email, identifying is easy, as your email-address is a unique
identifier making clear that you're the particular Ari Schwartz who sent
the email cited.  So if you would use 12345@w3.org, the W3-Systeam could
still identify you. So it is still identifiable data. (any person has to
be taken into account)

This makes the scope too large to give a basis for a technology tool.
While it might be understandable that a law wants to define a very broad
scope, it is unusable for most of the decisions that P3P wants to help
with. That's why we have invented/taken-up the 'identified' term in 
P3P 1.0, mostly because of the access-issues (access identified and not
all potential identifiable data, which would require a lot of processing
and identify the requester by doing his request)

So despite all the discussions in this world what PII is or what it
isn't, despite the misconceptions floating around the world, we better
ought to have a conscise concept of identity behind P3P. This concept of
identity distinguishes IMHO between non-identifiable, identifiable and
identified. I want to prevent that we lose overview by packing
processing, purpose, disclosure, sharing, retention, sensitive data
types and other things into this one term and definition. Doing so could
only bring us into confusion.

So for the sake of clarity, I propose the following definitions for the
suggested three terms. 

1/ Identifiable
We always operated under the assumption of the definition in whereas 26
of the EU-Directive. If there is a possibility to get to 'who is that
guy' with likely reasonable means by any party, than we have the
assumption of identifiable information, thus information that makes
assertions about a natural person. The likely and reasonable prevent
to take science fiction scenarios into account like invading Iraq to
find out who has joked on the phone about those wappons. (I think you
made this point in our earlier discussions ;)

2/ non-identifiable
This is the flip-side of 1/. This is reflected in the <non-identifiable>
element, which is based directly on the definition. (second phrase
after colon) of whereas 26. P3P gives a hint to render IP-addresses
anonymous. This could be taken up by a code of conduct e.g.

3/ identified
The means (processing of data) described in 1/ have already been
applied. This makes access to assertions about a natural person easier
and faster. This is particularly important for access rights but also
for the overall risk analysis. So identified is data, that does not need
further processing in this direction (I think the technical term is
'crossing')

Those three are all relative to 'data'. Data is an object here. The
attributes to the term data are the three terms above describing the
relationsship between a given natural person (also an object here) and
the object 'data'. This relationsship has different levels and
qualities. We deliberatly choosed the three described (IMHO)

Now, like a cross-roads, there is another concept: The data processing life
cycle:
Data is 
a/ collected
b/ processed
c/ transferred 
d/ deleted

(Retention is a sub-group to collection, transfer is a subgroup to
processing)

Now if you suggest to describe 'identified data' as 'how the information
can be or is being used', or if you describe it as 'reasonably can be
used by the data collector to identify an individual, your definition
mixes up identifiable, the processing of some raw data to identify
an individual. 

In fact you say: We use this raw traffic data to mine it and poke until
we know who you are, where you live and how you shop. 

This mixes up the model as it describes a purpose of processing with the
collection of data and also with the relationsship between our objects
'natural person' and 'data'. That's why I suggest to keep this
relationsship and the purpose of processing separate. This is done in
my definitions. So in non-identifiable, they can't find out, in
identifiable, they could find out and in identified, they found out.

If I collect identifiable information like IP (not domain), we have to
separate the question whether this is used to find out or not. The
question of whether they want to find out or not is an important
question, but it has to be treated separatly from the term identity as
it is a purpose and normally not a primary purpose ;). 

A/You might say: We collect identifiable information. This is for the user
-agent to decide to use an anonymizer. 

B/You might say: We use the data in A to find out who you are. 
(The problem is, today, we can't say that as the purpose section is
still too poor)

It doesn't make sense to say: 

A/ We collect identifiable information
B/ We do not use them to identify you (but we could, so trust us and our
sysadmin ;)

negative assertions just don't fit into the P3P model.

Another cross-roads is so called 'sensitive' data. 

According to Shannon/Weaver, any data can be sensitive, if it carries
enough information. Did Clinten have sex with Lewinsky? A simple 'yes'
from Mister C., a simple bit, could be sufficient.

So the EU-Directive takes some info up that is 'usually' very senitive,
like financial info, health info, religion and politics. Those are just
defined in the law. I don't think we have sufficient technology today to
define whether data is sensitive or not. We can take those from the OECD
and the EU-Directive up as it seems to make sense in front of such a
broad agreement. But outside this narrow scope, it is on the user
himself, confronted with the data, to judge whether data is sensitive or
not. And it should surely not mix into the definition of identity,
identifiable or identified.

That's why I think we should stick with my definitions described in 
1/ 2/ and 3/ and revise the Spec accordingly as far as we can get in 1.1

Best, 

Rigo




On Fri, Jun 20, 2003 at 12:09:27PM -0400, Ari Schwartz wrote:
> 
> Here is my draft text for addressing the confusion around identity 
> terms in the spec.
> 
> 
> 
> >http://www.w3.org/Bugs/Public/show_bug.cgi?id=167
> 
> 
> 
> Identity Definitions in the P3P Specification
> 
> In privacy regulations, guidelines and papers about privacy a variety 
> of terms are used to describe data that identifies an individual to 
> varying degrees.  Some common terms such as "personally identifiable 
> information (PII)" are often not defined or the cause for heated 
> debate.  In different documents, "identity" can be tied to:
> 
> 1) how the information can be or is being used,
> 2) how the information is stored, or
> 3) the type of information.
> 
> The P3P Specification Working Group tried to capture all three of 
> these ideas so that different implementers and users can make 
> decisions based on the importance they place on these various 
> definitions of identity. (1)
> 
> Identity Through Usage ("identified" data)
> 
> The most common term in the specification is "identified data" and 
> focuses on how the information can be or is being used.
> 
> "Identified data" is information that reasonably can be used by the 
> data collector to identify an individual.  Admittedly, this is a 
> somewhat subjective standard.  For example, a data collector storing 
> Internet Protocol (IP) addresses  (which can be created dynamically 
> or could be static and therefore tied to a particular computer used 
> by a single individual) should consider the IP address "identified 
> data" only when an attempt is made to tie the exact addresses to past 
> records or work with others to identify the specific individual or 
> computer over a long period of time.  In the more common case, where 
> data collectors use IP addressing information in the aggregate or 
> make no attempt to tie the IP address to a specified individual or 
> computer over a long period of time, IP addresses are not considered 
> identified even though it is possible for someone (eg, law 
> enforcement agents with proper subpoena powers) to identify the 
> individual based on the stored data.
> 
> 
> Identity Through Storage ("non-identifiable" and "linked" data)
> 
> The working group also felt that data collectors should be able 
> acknowledge when they make specific attempts to anonymize what would 
> otherwise be identifiable in its storage.
> 
> The term "non-identifiable" data refers to how the information is 
> stored.  For example, a data collector collecting and storing IP 
> addresses but not using them should NOT call this data 
> "non-identifiable" even in the common case where they have no plans 
> to identify an actual individual or computer. However, if a Web site 
> collects IP addresses, but actively deletes all but the last four 
> digits of this information in order to determine short term use, but 
> insure that a particular individual or computer cannot be 
> consistently identified, then the data collector can and should call 
> this information "non-identifiable."  Also, non-identifiable can be 
> used in cases where no information is being collected at all.  Since 
> most Web servers are designed to keep Web logs for maintenance, this 
> would most likely mean that the data collector has taken specific 
> efforts to ensure the anonymity of users.
> 
> Under the above definitions, a lot of information could be 
> "identifiable" (not specifically made anonymous), but not 
> "identified" (reasonably able to be tied to an individual or 
> computer).
> 
> Similarly, the term "linked" refers to how information is being 
> stored in connection with a cookie. All data in a cookie or linked to 
> a particular user must be disclosed in the cookie's policy. Using the 
> terminology above, if the data collector collects "identifiable" 
> information about the user it is generally "linked" data.
> 
> Identity Through Information Type
> 
> The Working Group felt that different user agent implementations 
> could be created to focus on different concerns around data type. 
> Therefore, the working group enabled the creation of a robust data 
> schema including broad categories of information that may be 
> considered sensitive by certain user groups.  The Working Group hopes 
> that a diverse set of user agents will be created to allow users the 
> ability to make identity decisions based on specific collections and 
> types of collects if they desire to do so.  For example, a user agent 
> could allow users to opt to be prompted when medical or financial 
> identifier is being collected, independent of how that information is 
> being used.
> 
> (1)   More information on the debate and the definitions can be found 
> in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002.
> 
> 
> 
> -- 
> ------------------------------------
> Ari Schwartz
> Associate Director
> Center for Democracy and Technology
> 1634 I Street NW, Suite 1100
> Washington, DC 20006
> 202 637 9800
> fax 202 637 0968
> ari@cdt.org
> http://www.cdt.org
> ------------------------------------
Received on Friday, 18 July 2003 12:58:09 UTC