- From: Rigo Wenning <rigo@w3.org>
- Date: Fri, 18 Jul 2003 18:55:27 +0200
- To: public-p3p-spec@w3.org
- Cc: Ari Schwartz <ari@cdt.org>
Ari, I owe you the response to your very stringent analysis of the word and definition of 'identifiable' and 'identified' from a US cultural perspective. So if you read the following, keep in mind that I try to do the same from a european perspective. I try to explain more about the model behind the european definitions and understanding of identity and identifiable. Note that 'identified' does not exist in there, but it is an original take-up by the P3P Specification WG that might make it's way into data protection (as it makes a lot of sense) So I risk to be long, but I tried to be short in Bugzilla and was not understood. So I try from to explain from Adam and Eve on. The EU-Directive defines identifiable at two places: 1/ The scope of data protection as such is limited to identifiable data Article 2 a states: 'personal data' shall mean any information relating to an identified or identifiable natural person ('data subject'); an identifiable person is one who can be identified, directly or indirectly, in particular by reference to an identification number or to one or more factors specific to his physical, physiological, mental, economic, cultural or social identity; 2/ Identifiable is explained further in consideration 26 of the Directive: (26) Whereas the principles of protection must apply to any information concerning an identified or identifiable person; whereas, to determine whether a person is identifiable, account should be taken of all the means likely reasonably to be used either by the controller or by any other person to identify the said person; whereas the principles of protection shall not apply to data rendered anonymous in such a way that the data subject is no longer identifiable; whereas codes of conduct within the meaning of Article 27 may be a useful instrument for providing guidance as to the ways in which data may be rendered anonymous and retained in a form in which identification of the data subject is no longer possible; Those are my assumptions and definitions as I want to keep P3P somewhat in sync with the Directive (see Art. 10 Taskforce for this goal) So the EU-Directive, if you look carefully at the definitions, does not have recurse on the intent of the data controller (or processor) to process data (use in your case) to identify. To trigger the application of data protection law (scope) it is sufficient that with likely and reasonable means, a person is able to say: This is person X (e.g. Ari Schwartz). The scope is not limited to the data processor. For this email, identifying is easy, as your email-address is a unique identifier making clear that you're the particular Ari Schwartz who sent the email cited. So if you would use 12345@w3.org, the W3-Systeam could still identify you. So it is still identifiable data. (any person has to be taken into account) This makes the scope too large to give a basis for a technology tool. While it might be understandable that a law wants to define a very broad scope, it is unusable for most of the decisions that P3P wants to help with. That's why we have invented/taken-up the 'identified' term in P3P 1.0, mostly because of the access-issues (access identified and not all potential identifiable data, which would require a lot of processing and identify the requester by doing his request) So despite all the discussions in this world what PII is or what it isn't, despite the misconceptions floating around the world, we better ought to have a conscise concept of identity behind P3P. This concept of identity distinguishes IMHO between non-identifiable, identifiable and identified. I want to prevent that we lose overview by packing processing, purpose, disclosure, sharing, retention, sensitive data types and other things into this one term and definition. Doing so could only bring us into confusion. So for the sake of clarity, I propose the following definitions for the suggested three terms. 1/ Identifiable We always operated under the assumption of the definition in whereas 26 of the EU-Directive. If there is a possibility to get to 'who is that guy' with likely reasonable means by any party, than we have the assumption of identifiable information, thus information that makes assertions about a natural person. The likely and reasonable prevent to take science fiction scenarios into account like invading Iraq to find out who has joked on the phone about those wappons. (I think you made this point in our earlier discussions ;) 2/ non-identifiable This is the flip-side of 1/. This is reflected in the <non-identifiable> element, which is based directly on the definition. (second phrase after colon) of whereas 26. P3P gives a hint to render IP-addresses anonymous. This could be taken up by a code of conduct e.g. 3/ identified The means (processing of data) described in 1/ have already been applied. This makes access to assertions about a natural person easier and faster. This is particularly important for access rights but also for the overall risk analysis. So identified is data, that does not need further processing in this direction (I think the technical term is 'crossing') Those three are all relative to 'data'. Data is an object here. The attributes to the term data are the three terms above describing the relationsship between a given natural person (also an object here) and the object 'data'. This relationsship has different levels and qualities. We deliberatly choosed the three described (IMHO) Now, like a cross-roads, there is another concept: The data processing life cycle: Data is a/ collected b/ processed c/ transferred d/ deleted (Retention is a sub-group to collection, transfer is a subgroup to processing) Now if you suggest to describe 'identified data' as 'how the information can be or is being used', or if you describe it as 'reasonably can be used by the data collector to identify an individual, your definition mixes up identifiable, the processing of some raw data to identify an individual. In fact you say: We use this raw traffic data to mine it and poke until we know who you are, where you live and how you shop. This mixes up the model as it describes a purpose of processing with the collection of data and also with the relationsship between our objects 'natural person' and 'data'. That's why I suggest to keep this relationsship and the purpose of processing separate. This is done in my definitions. So in non-identifiable, they can't find out, in identifiable, they could find out and in identified, they found out. If I collect identifiable information like IP (not domain), we have to separate the question whether this is used to find out or not. The question of whether they want to find out or not is an important question, but it has to be treated separatly from the term identity as it is a purpose and normally not a primary purpose ;). A/You might say: We collect identifiable information. This is for the user -agent to decide to use an anonymizer. B/You might say: We use the data in A to find out who you are. (The problem is, today, we can't say that as the purpose section is still too poor) It doesn't make sense to say: A/ We collect identifiable information B/ We do not use them to identify you (but we could, so trust us and our sysadmin ;) negative assertions just don't fit into the P3P model. Another cross-roads is so called 'sensitive' data. According to Shannon/Weaver, any data can be sensitive, if it carries enough information. Did Clinten have sex with Lewinsky? A simple 'yes' from Mister C., a simple bit, could be sufficient. So the EU-Directive takes some info up that is 'usually' very senitive, like financial info, health info, religion and politics. Those are just defined in the law. I don't think we have sufficient technology today to define whether data is sensitive or not. We can take those from the OECD and the EU-Directive up as it seems to make sense in front of such a broad agreement. But outside this narrow scope, it is on the user himself, confronted with the data, to judge whether data is sensitive or not. And it should surely not mix into the definition of identity, identifiable or identified. That's why I think we should stick with my definitions described in 1/ 2/ and 3/ and revise the Spec accordingly as far as we can get in 1.1 Best, Rigo On Fri, Jun 20, 2003 at 12:09:27PM -0400, Ari Schwartz wrote: > > Here is my draft text for addressing the confusion around identity > terms in the spec. > > > > >http://www.w3.org/Bugs/Public/show_bug.cgi?id=167 > > > > Identity Definitions in the P3P Specification > > In privacy regulations, guidelines and papers about privacy a variety > of terms are used to describe data that identifies an individual to > varying degrees. Some common terms such as "personally identifiable > information (PII)" are often not defined or the cause for heated > debate. In different documents, "identity" can be tied to: > > 1) how the information can be or is being used, > 2) how the information is stored, or > 3) the type of information. > > The P3P Specification Working Group tried to capture all three of > these ideas so that different implementers and users can make > decisions based on the importance they place on these various > definitions of identity. (1) > > Identity Through Usage ("identified" data) > > The most common term in the specification is "identified data" and > focuses on how the information can be or is being used. > > "Identified data" is information that reasonably can be used by the > data collector to identify an individual. Admittedly, this is a > somewhat subjective standard. For example, a data collector storing > Internet Protocol (IP) addresses (which can be created dynamically > or could be static and therefore tied to a particular computer used > by a single individual) should consider the IP address "identified > data" only when an attempt is made to tie the exact addresses to past > records or work with others to identify the specific individual or > computer over a long period of time. In the more common case, where > data collectors use IP addressing information in the aggregate or > make no attempt to tie the IP address to a specified individual or > computer over a long period of time, IP addresses are not considered > identified even though it is possible for someone (eg, law > enforcement agents with proper subpoena powers) to identify the > individual based on the stored data. > > > Identity Through Storage ("non-identifiable" and "linked" data) > > The working group also felt that data collectors should be able > acknowledge when they make specific attempts to anonymize what would > otherwise be identifiable in its storage. > > The term "non-identifiable" data refers to how the information is > stored. For example, a data collector collecting and storing IP > addresses but not using them should NOT call this data > "non-identifiable" even in the common case where they have no plans > to identify an actual individual or computer. However, if a Web site > collects IP addresses, but actively deletes all but the last four > digits of this information in order to determine short term use, but > insure that a particular individual or computer cannot be > consistently identified, then the data collector can and should call > this information "non-identifiable." Also, non-identifiable can be > used in cases where no information is being collected at all. Since > most Web servers are designed to keep Web logs for maintenance, this > would most likely mean that the data collector has taken specific > efforts to ensure the anonymity of users. > > Under the above definitions, a lot of information could be > "identifiable" (not specifically made anonymous), but not > "identified" (reasonably able to be tied to an individual or > computer). > > Similarly, the term "linked" refers to how information is being > stored in connection with a cookie. All data in a cookie or linked to > a particular user must be disclosed in the cookie's policy. Using the > terminology above, if the data collector collects "identifiable" > information about the user it is generally "linked" data. > > Identity Through Information Type > > The Working Group felt that different user agent implementations > could be created to focus on different concerns around data type. > Therefore, the working group enabled the creation of a robust data > schema including broad categories of information that may be > considered sensitive by certain user groups. The Working Group hopes > that a diverse set of user agents will be created to allow users the > ability to make identity decisions based on specific collections and > types of collects if they desire to do so. For example, a user agent > could allow users to opt to be prompted when medical or financial > identifier is being collected, independent of how that information is > being used. > > (1) More information on the debate and the definitions can be found > in Lorrie Faith Cranor's book Web Privacy with P3P, O'Reilly, 2002. > > > > -- > ------------------------------------ > Ari Schwartz > Associate Director > Center for Democracy and Technology > 1634 I Street NW, Suite 1100 > Washington, DC 20006 > 202 637 9800 > fax 202 637 0968 > ari@cdt.org > http://www.cdt.org > ------------------------------------
Received on Friday, 18 July 2003 12:58:09 UTC