- From: Rolf H. Nelson <rnelson@tux.w3.org>
- Date: Sun, 1 Nov 1998 14:48:30 -0500
- To: www-privacy-evaluator@w3.org
http://www.w3.org/Privacy/19981101-evaluator.html
-----------
[1]W3C
Privacy Evaluator
Author:
Rolf Nelson (W3C) <[2]rnelson@w3.org>
Status of this Document:
This document may end up being submitted as a W3C NOTE. This
document
would then be a NOTE made available by W3C for discussion
only. This
indicates no endorsement of its content, nor that W3C has, is, or
will
be allocating any resources to the issues addressed by the NOTE.
Send
comments to [3]www-privacy-evaluator@w3.org. This list is publicly
archived at
[4]http://lists.w3.org/Archives/Public/www-privacy-evaluator/.
Abstract:
Some users are unaware that personal data that they send to Web
sites
is sometimes redistributed without their knowledge or explicit
permission. Negative consequences of this redistribution can range
from the subsequent reception of unwanted junk mail to the
nightmare
of identity theft. To inform the user of what the Web site will do
with the data it requests, Web sites can post privacy disclosures
that
describe what the Web site will do with the data it collects.
These
disclosures can take the form of human-readable natural language
explanations; alternatively, new technologies like P3P [[5]P3P]
will
allow machine-readable privacy disclosures. Unfortunately, some
Web
sites have no privacy policies posted whatsoever. [[6]FTC]
A "privacy critic" [[7]Critic] utility that can warn users of some
possible consequences of sending personal data to a Web site is a
valuable tool. Such a utility could be designed in many different
ways. This document describes one possible design, called Privacy
Evaluator. A defining feature of Privacy Evaluator is its use of
preset heuristics, or "rules of thumb," to determine if a user is
in
the process of submitting personal data through an HTML form. This
document also describes one existing prototype implementation of
Privacy Evaluator. This prototype, called PJPS, is a
proof of concept. A production implementation of Privacy Evaluator
would be more robust and would have a more polished user interface
than PJPS. Preliminary and unscientific tests show that PJPS
correctly detected the transmission of personal data for 28 of the
29 Web sites tested.
Overview:
Privacy Evaluator describes a specific class of Web user agents
(such
as Web browsers) that automatically provide the user with a certain
style of privacy information. P3P is a language Web sites can use
to
disclose their privacy practices in a machine-readable way.
Privacy
Evaluator can warn the user about a site's privacy not only when
P3P-compliant sites are accessed, but even when non-P3P-compliant
sites are accessed. PJPS (Privacy Jigsaw Proxy Server) is the W3C
[[8]W3C] prototype implementation of Privacy Evaluator.
With Privacy Evaluator, when a user submits data through an HTML
form
to a site, an alert may appear warning the user of some possible
consequences of submitting personal data to an unprotected Web
site.
This alert will appear if the following two conditions are both
met:
1. The Web page containing the HTML form does not have an adequate
machine-readable privacy disclosure (such as a P3P disclosure) that
would ensure the user's privacy. PJPS would check that either the
"id" field is "no", or that both the "recpnt" and "purp" fields are
sufficiently low. [[9]P3P]
2. Privacy Evaluator believes that the data being submitted is
"identifiable"; that is, it could be used to identify the user.
PJPS
would consider the data to be identifiable if the following two
sub-conditions were both met:
a. The HTML form looks like it is soliciting the user's name
or
electronic mail address. One way to determine this is to see if an
input field key substring matches "name" or "email". Another is to
see if the Web page contains the phrases "first name" and "last
name".
A third method is to see if the data the user entered looks like an
email address. These heuristics should match a majority of the
English-language sites on the Web that capture personally
identifiable
data.
b. The name of the submit button does not look like the button
for a search engine. The way to determine this is to see if the
submit value equals something other than "search" or "find". If
the
submit button is labeled "search" or "find", it is less likely that
the form is soliciting personally identifiable information about
the
user. This heuristic makes it less likely that search engines will
accidentally trigger a false alert.
Producing this alert for sites without P3P that appear to be
collecting identifiable data has two benefits. First,
inexperienced
users of Privacy Evaluator will get educated about the possible
consequences of submitting personal data on the Web. This will be
especially helpful to non-American users in countries with strong
data
protection norms who do not fully realize that they are visiting a
Web
site located in a different country that does not offer privacy
protection. Second, Web sites will have an additional incentive to
use a machine-readable privacy disclosure language like P3P. A Web
site that uses P3P and has an adequate privacy policy would be more
likely to convince a Privacy Evaluator user to submit data than
would
a site that does not use P3P. With Privacy Evaluator, a Web site
is
never punished and is sometimes rewarded for using P3P. This way,
a
Web site is never worse off for having used P3P.
The arbitrarily chosen goal is that most users who surf the Web
with
Privacy Evaluator should have a "false negative" rate of under 20%
and
a "false positive" rate of under 5%. A false negative is when a
Web
site that does collect identifiable information mistakenly does not
trigger an alert. A false positive is when a Web site that does
not
collect identifiable information mistakenly does trigger an alert.
Privacy Evaluator is not designed to prevent malicious Web
administrators from deliberately preventing the alert from
appearing.
These constraints should be loose enough that a working Privacy
Evaluator implementation is easy to create, but tight enough that
Privacy Evaluator is useful. A Privacy Evaluator implementation
should be tuned to the expected language of the Web sites that its
users are likely to visit. PJPS is designed to work well for
English-language Web sites.
Privacy Evaluator is designed to be privacy-friendly and
non-intrusive. Existing browsers that do not use P3P are
non-intrusive, but not privacy-friendly. A hypothetical user agent
that blocked every non-P3P site on the Web would be privacy-friendly
but intrusive. Privacy Evaluator aims to be privacy-friendly by
keeping the rate of false negatives under 20%, and non-intrusive by
keeping the rate of false positives under 5%.
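As an illustration of the goals above, the following minimal Python sketch computes the two error rates from a manual tally and checks them against the 20% and 5% targets. The function names and the sample tally are hypothetical. The choice of denominators (sites that do collect identifiable data for false negatives, sites that do not for false positives) is an assumption, since this document does not define them.

```python
def error_rates(true_pos, false_neg, true_neg, false_pos):
    """Return (false_negative_rate, false_positive_rate) as fractions.

    Assumed denominators (not defined in this document):
    - false negatives are counted against sites that DO collect
      identifiable data;
    - false positives are counted against sites that do NOT.
    """
    collecting = true_pos + false_neg
    non_collecting = true_neg + false_pos
    fn_rate = false_neg / collecting if collecting else 0.0
    fp_rate = false_pos / non_collecting if non_collecting else 0.0
    return fn_rate, fp_rate


def meets_goals(fn_rate, fp_rate, max_fn=0.20, max_fp=0.05):
    """True if both rates are under the stated design goals."""
    return fn_rate < max_fn and fp_rate < max_fp


# Hypothetical tally: 20 collecting sites (3 missed) and
# 40 non-collecting sites (1 spurious alert).
fn_rate, fp_rate = error_rates(true_pos=17, false_neg=3,
                               true_neg=39, false_pos=1)
print(meets_goals(fn_rate, fp_rate))
```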
Implementation Details:
A Privacy Evaluator implementation can include a parser, a trust
engine, a sniffer, and a user interface. The trust engine has not
yet
been implemented in PJPS as of this writing.
The parser module would need to look for a link in the HTML head to
a
separate document containing a P3P disclosure. It would then need
to
follow this link, retrieve the P3P document, and parse it. The
parser
would need to understand either XML, RDF, P3P, or a relevant subset
of
P3P. Conceivably the parser could be very crude and merely look
for
the P3P <STATEMENT> tag.
The trust engine, which consists of a set of privacy preference
rules,
would take the parsed P3P disclosure and would return a boolean
stating whether the privacy statement is strong enough to suppress
the privacy alert. It produces this boolean by evaluating at least three
fields: the "id" field, the "purp" field, and the "recpnt"
field. One
possible implementation would be a database listing every
acceptable
combination of these enumerated values. A simpler possibility
would
be to hardwire in that only the following proposals are acceptable:
a. proposals with "id" field equal to "no"; or
b. proposals with "purp" fields in the range 0 to 3 and "recpnt"
fields in the range 0 to 1. For example, a "recpnt" field equal to
"0, 3" would be unacceptable to this trust engine.
Alternatively, a very trusting trust engine could search the Web
page
for the mere presence of a P3P proposal or a link to a privacy
policy,
or even for a mention of the word "privacy" in any language
somewhere
in the HTML.
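The simpler hardwired trust engine described above (rules a and b) could be sketched in Python as follows. The data layout is an assumption made for illustration: a parsed proposal is represented as a dictionary whose "id" entry is a string and whose "purp" and "recpnt" entries are lists of integers. This is not the PJPS implementation, which has no trust engine as of this writing.

```python
def is_acceptable(proposal):
    """Return True if the proposal is strong enough to suppress the alert.

    `proposal` is a hypothetical dict such as
    {"id": "yes", "purp": [0, 2], "recpnt": [0]}.
    """
    # Rule a: proposals with "id" equal to "no" are acceptable.
    if str(proposal.get("id", "")).strip().lower() == "no":
        return True

    # Rule b: every "purp" value must be in 0..3 and every
    # "recpnt" value must be in 0..1.
    purp = proposal.get("purp")
    recpnt = proposal.get("recpnt")
    if not purp or not recpnt:
        # Missing fields are treated as inadequate (an assumption).
        return False
    return (all(0 <= p <= 3 for p in purp)
            and all(0 <= r <= 1 for r in recpnt))
```

With this sketch, a proposal whose "recpnt" field is "0, 3" (here `[0, 3]`) is rejected, matching the example given for rule b.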
The sniffer decides whether the information being transmitted looks
identifiable. It can use heuristics that analyze the data being
transmitted. For example, it can check whether one of the key
values
has "name" or "address" as a substring. Given the data being sent
through CGI and the contents of the originating Web page, the
sniffer
returns a boolean stating whether it thinks identifiable
information
is being sent. If the sniffer decides that the data is
identifiable,
Privacy Evaluator should invoke the user interface to bring up an
alert.
The user interface's alert can consist of a dialogue containing a
text
which is read from a configuration file. This text can be a
warning
that no adequate machine-readable privacy disclosure was found, and
that there may be no guarantee that personal data submitted to the
site will not be sold to other parties. The text may also suggest
the
user look for a human-readable privacy disclosure. This dialogue
box
is similar in spirit to the warning issued by many browsers when
sending data through an insecure channel that does not use HTTPS.
The
user can elect to continue the transaction, or cancel. Inside this
dialogue a box can be checked if the user does not want to see this
warning again.
An alternative design decision would have been to produce an alert
when a web page is downloaded rather than when the form is
submitted.
This would have had the disadvantage of bringing up alerts for web
pages that the user has no desire to submit data to anyway.
Therefore the decision was made to alert the user only on that
minority of Web pages where the user has actually filled in the Web
form and is in the process of submitting data. If the user is not
submitting data, then the privacy policy of the Web page is not as
relevant.
PJPS runs as a proxy server and therefore cannot directly produce
an
alert dialogue on the user's computer in the way that a local
client
application like a Web browser can. PJPS could have been designed
to
produce an alert using Java, but this would have required the
user's
Web browser to support Java. PJPS instead embeds the alert
directly
in the HTML document returned by the proxy. Here is an example
transaction where the user begins to send data to a site, PJPS
produces an alert, and the user elects to ignore the alert and
finish
sending data to the Web site.
Browser sends to PJPS proxy: GET /foo.cgi?bar=buz
PJPS proxy sends back a privacy alert embedded in a form:
<FORM ACTION="/foo.cgi">
<INPUT TYPE="hidden" NAME="data" VALUE="/foo.cgi?bar=buz">
<INPUT TYPE="submit" NAME="submit" VALUE="go ahead anyway">
</FORM>
User clicks "go ahead anyway" and browser sends to PJPS proxy:
GET /foo.cgi?data=%2Ffoo.cgi%3Fbar%3Dbuz&submit=go+ahead+anyway
Proxy then sends on to Web server: GET /foo.cgi?bar=buz and returns
the fetched Web document to the user.
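The wrapping and unwrapping steps of this transaction can be sketched with the Python standard library. The function names are hypothetical and this is not the PJPS code; it only illustrates how the original request can ride along in a hidden "data" field and be recovered once the user confirms. The submit button is given a NAME here so that its value is transmitted with the form.

```python
from html import escape
from urllib.parse import urlsplit, parse_qs


def build_alert_form(original_request):
    """Wrap the original request in a confirmation form (HTML text)."""
    path = urlsplit(original_request).path
    return (
        '<FORM ACTION="%s">\n'
        '<INPUT TYPE="hidden" NAME="data" VALUE="%s">\n'
        '<INPUT TYPE="submit" NAME="submit" VALUE="go ahead anyway">\n'
        '</FORM>'
        % (escape(path, quote=True), escape(original_request, quote=True))
    )


def unwrap_confirmed_request(confirmed_request):
    """Recover the original request from the hidden "data" field.

    Returns None if the request carries no "data" parameter, which
    means it was not produced by the confirmation form.
    """
    query = parse_qs(urlsplit(confirmed_request).query)
    values = query.get("data")
    return values[0] if values else None


form = build_alert_form("/foo.cgi?bar=buz")
original = unwrap_confirmed_request(
    "/foo.cgi?data=%2Ffoo.cgi%3Fbar%3Dbuz&submit=go+ahead+anyway")
```

A real proxy would also need to tell a confirmed request apart from a site whose own form happens to use a field named "data"; this sketch does not attempt that.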
With PJPS, if the user checks the box indicating not to show the
dialogue again, a second dialogue may appear explaining that since
this is a prototype, checking the box does not actually do
anything.
In contrast, in a real non-prototype Privacy Evaluator
implementation,
checking the box would have disabled Privacy Evaluator
functionality.
By not implementing this check box, this proxy is saved from having
to
keep state for each user. Besides, PJPS would become very
uninteresting after the box is checked.
The dialogue should also have a help button, and ideally a link to
an
explanation of why exactly this document triggered the alert.
PJPS is layered on top of the W3C Jigsaw [[10]Jigsaw] server and
takes the form of a proxy server. The alternative would have been to
implement PJPS as a browser. Implementation as a proxy server had
two advantages. First, development of PJPS on top of the Jigsaw
proxy server was fast and easy, partly because Jigsaw already has an
XML parser.
Second, a proxy server is more accessible; if an interested
outsider
wishes to see Privacy Evaluator in action, he or she would merely
have
to configure his or her existing browser to use our PJPS proxy at
p3p.w3.org. If this person were instead required to download,
install, and run a browser, that would create a serious obstacle.
The
main disadvantages of this proxy approach are worse response time,
less UI control, and a reduction in user information. The
advantages
of this proxy approach were judged to outweigh the disadvantages
for
the purposes of the prototype. A widely deployed and polished
implementation of Privacy Evaluator would probably need to be
implemented within the browser rather than as a proxy.
Because PJPS runs as a proxy, it cannot directly access the HTML
form
that the user submitted data from. PJPS therefore relies on the
"Referer" field to determine what HTML document produced the
request
so that it can scan that document for "first name" and "last name."
This has two disadvantages. First, in theory, a single URL may map
to
more than one document. For example, posting two different sets of
data to a single URL may yield two different return documents
containing two different HTML forms. Second, PJPS does not work
correctly with browser configurations that do not emit the
"Referer"
field. As of this writing, both Netscape and Microsoft browsers
emit
the "Referer" field by default. A more sophisticated alternative
would have been to keep a database of the "action" fields contained
in
Web pages. For the sake of rapid development, PJPS lacks this
sophisticated database.
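The "action" database that PJPS lacks could look roughly like the Python sketch below. Everything here is an assumption made for illustration: the class names, the in-memory dictionary, and the choice to key entries by the resolved action URL with its query string removed. As the proxy relays each HTML page it would record the page's form actions, and when a submission arrives it would look up the pages that could have produced it instead of trusting "Referer".

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class _FormActionCollector(HTMLParser):
    """Collects the ACTION attribute of every FORM element."""

    def __init__(self):
        super().__init__()
        self.actions = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names.
        if tag == "form":
            self.actions.append(dict(attrs).get("action") or "")


class ActionIndex:
    """Maps a form's target URL to the page texts that contain the form."""

    def __init__(self):
        self._pages_by_action = {}

    def record_page(self, page_url, html_text):
        collector = _FormActionCollector()
        collector.feed(html_text)
        for action in collector.actions:
            # Resolve relative actions; an empty action targets the page.
            target = urljoin(page_url, action).split("?", 1)[0]
            self._pages_by_action.setdefault(target, []).append(html_text)

    def pages_for(self, request_url):
        """Return the recorded pages whose forms submit to request_url."""
        return self._pages_by_action.get(request_url.split("?", 1)[0], [])
```

Several pages may submit to the same URL, so the lookup returns a list. A sniffer could then apply its text heuristics to each candidate page.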
To speed development, several important aspects of P3P have been
omitted in Privacy Evaluator. HTTP support and the transmission of
data solicited through P3P methods are elements that were deemed
desirable but not necessary for Privacy Evaluator. Privacy
Evaluator
also lacks a sophisticated trust engine and a way of downloading
customized privacy preferences over the Web. These are important
items; nevertheless, they are not required for Privacy Evaluator.
The implementation of PJPS will be considered a success if it meets
the stated goals of false positives and false negatives, and does
not
crash, during user tests. User tests could consist of two randomly
chosen individuals who could be asked to browse a series of Web
pages
and submit data to those pages. The pages could be determined
through
analyzing user trace data to find representative sites. A tally
could
manually be kept of false positives and false negatives. In
addition,
multiple people could use PJPS during the course of a week of
normal
Web browsing to verify there are no unexpected problems. See the
section on Implementation Status for information on some
unscientific
manual tests.
The design of Privacy Evaluator will be considered a success if the
following three criteria are met: the implementation of PJPS is a
success as described above; Privacy Evaluator is useful; and
Privacy
Evaluator is usable. Privacy Evaluator is useful if a significant
percent of user agent distributors, including ISPs, make plans to
deploy Privacy Evaluator or a variant of Privacy Evaluator, and if
users of those implementations generally evaluate them as useful.
Privacy Evaluator is sufficiently usable if user tests fail to
produce
any showstopper user interface problems.
Details of Current PJPS Heuristics:
Below is the current process for using the PJPS heuristics for
determining if an attempted data transmission through an HTML form
carries personally identifiable information:
1. (Search Rule) Does the submit button have a value like
"find"
or "search"? If so, the transaction is NOT suspect. If not, go to
step 2.
2. (Key Rule) Does the CGI key in one of the INPUT element
tags
have as a substring "name" or "email"? If so, the transaction is
suspect. If not, go to step 3. See the HTML specification
[[11]HTML]
for the syntax of HTML element tags.
3. (Text Rule) Does the full text of the HTML document (not
just
the tags, not just the form, but the entire HTML document) contain
both the phrase "first name" AND the phrase "last name"? If so,
the
transaction is suspect. If not, go to step 4.
4. (Value Rule) Does one of the values that the user typed in
and
is submitting contain the character "@"? If so, the user is
probably
submitting an email address and the transaction is suspect. If
not,
the transaction is NOT suspect.
The string comparisons in all of these steps must be
case-insensitive.
Rule 3, the Text Rule, could also look for synonyms such as "given
name" and "family name".
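The four steps above can be sketched as a single Python function. This is an illustration under stated assumptions, not the PJPS source: the function name and its parameters are hypothetical, and a submit value "like" "find" or "search" is read here as a case-insensitive exact match after trimming whitespace.

```python
SEARCH_LABELS = ("search", "find")
KEY_SUBSTRINGS = ("name", "email")


def is_suspect(submit_value, field_names, page_text, submitted_values):
    """Apply the Search, Key, Text and Value rules in order.

    submit_value     -- VALUE of the submit button
    field_names      -- NAME attributes (CGI keys) of the INPUT elements
    page_text        -- full text of the HTML document holding the form
    submitted_values -- the values the user is submitting
    All comparisons are case-insensitive.
    """
    # 1. Search Rule: search-style forms are never suspect.
    if (submit_value or "").strip().lower() in SEARCH_LABELS:
        return False

    # 2. Key Rule: a CGI key containing "name" or "email".
    for name in field_names:
        if any(s in name.lower() for s in KEY_SUBSTRINGS):
            return True

    # 3. Text Rule: the page mentions both "first name" and "last name".
    text = page_text.lower()
    if "first name" in text and "last name" in text:
        return True

    # 4. Value Rule: a submitted value contains "@".
    return any("@" in value for value in submitted_values)
```

For example, `is_suspect("Send", ["q"], "", ["Joe@foo.com"])` is flagged by the Value Rule, while the same call with a submit value of "Search" is not suspect because the Search Rule is evaluated first.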
These four heuristics do not exhaust the set of all possible useful
heuristics. Other possible useful heuristics that are not used by
PJPS include a more refined email match, a postal address match, a
search for registration synonyms, and support for languages other
than
English. A more refined email match, rather than looking for the
simple presence of the "@" character, could do a pattern match on
legal RFC822 [[12]RFC822] email addresses, and even try to look up
the
domain name of the entered email address to check for validity. A
postal address match, for users in the United States, could look
for
one of the two-letter state abbreviations. A search through the
Web
page for registration synonyms would flag phrases like "user
registration". Support for non-English languages would involve
developing separate heuristics for each language.
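Two of these additional heuristics are easy to sketch in Python. Both are illustrations with hypothetical names, and PJPS uses neither. The email pattern is a deliberately simplified approximation and does not implement the full RFC822 address grammar; the domain lookup is omitted. The state match is restricted to upper-case tokens, an assumption made here to avoid matching ordinary words such as "in" or "or".

```python
import re

# Simplified pattern: local part, "@", and a domain with at least one dot.
# This is NOT the full RFC822 address syntax.
_EMAIL = re.compile(r"^[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(\.[A-Za-z0-9-]+)+$")

# Two-letter postal abbreviations for the 50 states plus DC.
US_STATES = frozenset("""
    AL AK AZ AR CA CO CT DE FL GA HI ID IL IN IA KS KY LA ME MD MA MI MN
    MS MO MT NE NV NH NJ NM NY NC ND OH OK OR PA RI SC SD TN TX UT VT VA
    WA WV WI WY DC
""".split())


def looks_like_email(value):
    """More refined than a bare "@" test, but still only a heuristic."""
    return _EMAIL.match(value.strip()) is not None


def mentions_us_state(value):
    """True if the value contains an upper-case state abbreviation."""
    tokens = re.findall(r"\b[A-Z]{2}\b", value)
    return any(token in US_STATES for token in tokens)
```

Compared with the Value Rule, `looks_like_email` rejects input such as "see you @ 5", which contains "@" but is not an address.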
If a transaction is suspect, Privacy Evaluator should produce a
warning dialog alerting the user unless Privacy Evaluator has found
an
adequate P3P disclosure protecting the privacy of the transaction.
These heuristics are believed to satisfy the design goals of less
than
5% false positives and less than 20% false negatives. Tests could
be
developed to verify or disprove this belief.
Below are some examples of the heuristics in action.
Suppose Web form A has the following tag:
<INPUT TYPE=submit VALUE="Search">
Transactions produced by form A would NOT be suspect because of
Rule 1, the "Search Rule."
Suppose Web form B includes the following tag:
<INPUT NAME="Your_Name">
Transactions produced by form B would be suspect because of Rule
2,
the "Key Rule." (Unless, of course, Rule 1 about "search" and
"find"
transactions not being suspect contradicted this.)
Suppose Web page 1 includes the following text:
Enter Your First Name: <INPUT NAME="FN">
Enter Your Last Name: <INPUT NAME="LN">
Transactions produced by page 1 would be suspect because of Rule 3,
the "Text Rule." (Unless, of course, this contradicts Rule 1.)
Suppose Web form C does not match any of the first three rules.
Suppose further the user enters into one of the INPUT fields the
data
"Joe@foo.com". When the user clicks the submit button, the
transaction should be flagged as suspect because of Rule 4, the
"Value
Rule." (Unless, of course, this contradicts Rule 1.)
Interoperability with P3P:
Privacy Evaluator implementations should interoperate with P3P
implementations. The simplest way to ensure this is to allow the
trust engine functionality to manually be disabled when the user
also
has a separate P3P utility running a more sophisticated trust
engine.
A more complicated but more powerful solution is to feed the binary
output of the Privacy Evaluator sniffer into a fully implemented
P3P
trust engine.
Implementation Status:
As of Oct 14, 1998, PJPS is up and running at p3p.w3.org:8080. It
has
not been exhaustively tested and is known to work only with POST
and
not with GET CGI queries. An unscientific test of the heuristics
found that 8 out of 9 popular Web sites that collect personally
identifiable information produce PJPS alerts. 20 out of 20
randomly
chosen Web sites of only average popularity that collect personally
identifiable information produce PJPS alerts. This indicates a
satisfyingly low rate of false negatives. No false positives were
found.
Mailing List:
Public comments and discussion about Privacy Evaluator or about
PJPS
should go to www-privacy-evaluator@w3.org. Instructions for
subscribing are available:
<url:[13]http://www19.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0000.html>
Archives of this list are at the following URL:
[14]http://lists.w3.org/Archives/Public/www-privacy-evaluator/
Future Work:
The heuristics suggested in this document should be systematically
tested to determine the rate of false positives and false
negatives.
Usability tests should be conducted to find the best way to
communicate privacy information to users.
PJPS does not work on .shtml, https, or GET CGI transactions. The
percentage of Web sites that collect personal data through such
transactions is believed to be low. This should be verified or
refuted empirically, and if the percentage is sufficiently high
PJPS
should be modified to support these transactions.
A P3P trust engine should be added to PJPS.
PJPS could be made more user-configurable by allowing users to
configure sites that should not produce an alert. For example,
when
an alert is produced, there could be a checkbox that makes PJPS
stop
producing alerts for that Web site. Users should also be able to
totally disable Privacy Evaluator functionality if they desire.
PJPS could be adapted to a natural language other than English; good
first candidates include French and Spanish. Discussion of
internationalization issues is available in the thread starting at
<[15]http://lists.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0001.html>.
Privacy Evaluator could be extended to access third-party
machine-readable information about privacy policies. One method
would
be to use PICS to mark Web sites that a third party judges to have
inadequate privacy protection. A better method would be for P3P to
be
extended to allow third-party label bureaus to serve P3P
disclosures.
For privacy reasons, these bureaus should be as close to the user
as
possible; if the bureau is small and just lists a few popular
sites,
it could be bundled in with Privacy Evaluator and sit on the user's
desktop.
To discourage malicious Web site administrators from tuning their
Web pages to evade Privacy Evaluator's fixed heuristics, the
heuristics could be made variable rather than fixed. They could be
downloaded daily from a central database that changes to counter
common workarounds by malicious site administrators.
It is unclear who would win this arms race between malicious Web
site
administrators and Privacy Evaluator.
Conclusion:
Privacy Evaluator is a design for building a user agent that can
detect the transmission of personally identifiable information
through
HTML forms with what appears to be a large degree of accuracy.
PJPS
is a proof of concept that shows that Privacy Evaluator is
feasible. When a user is in the process of transmitting personally
identifiable information, an implementation of Privacy Evaluator can
warn the user if the Web site does not have an adequate
machine-readable privacy policy.
Versioning and Authorship:
1.4 Nov 1 1998 Rolf Nelson additional input from Martin Duerst
1.3 Oct 25 1998 Rolf Nelson additional input from Haym Hirsh,
Marja-Riitta Koivunen, Eric Prud'hommeaux, Joseph Reagle, Daniel
Veillard.
1.2 Oct 12 1998 Rolf Nelson additional input from Lorrie Cranor
1.1 Sep 20 1998 Rolf Nelson additional input from Jason Catlett and
Massimo Marchiori
1.0 Aug 19 1998 Rolf Nelson original version, with input from Eric
Prud'hommeaux, Joseph Reagle, Janne Saarela, Ralph Swick, Daniel
Veillard. Additional thanks to Dan Connolly, Jim Gettys and
Marja-Riitta Koivunen. Mistakes are mine, brilliant observations
are theirs.
PJPS, the Privacy Evaluator implementation, was coded amazingly
quickly by Janne Saarela.
References:
[Critic]
[16]http://www.ics.uci.edu/~ackerman/pub/98i11/privacy-critics.pdf
[FTC] "Privacy Online: A Report to Congress,"
[17]http://www.ftc.gov/reports/privacy3/toc.htm
[HTML] "HTML 4.0 Specification,"
[18]http://www.w3.org/TR/REC-html40/
[Jigsaw] "Jigsaw Overview," [19]http://www.w3.org/Jigsaw/
[P3P] "Platform for Privacy Preferences P3P Project,"
[20]http://www.w3.org/P3P/
[RFC822] "Standard for the Format of ARPA Internet Text Messages,"
[21]http://info.internet.isi.edu:80/in-notes/rfc/files/rfc822.txt
[W3C] "About the World Wide Web Consortium,"
[22]http://www.w3.org/Consortium/
______________________________________________________________________
To Do: validate as HTML compliant, table of contents
______________________________________________________________________
[23]Copyright © 1998 [24]W3C ([25]MIT, [26]INRIA, [27]Keio), All
Rights Reserved. W3C [28]liability, [29]trademark, [30]document use
and [31]software licensing rules apply.
______________________________________________________________________
[32]Rolf Nelson <[33]rnelson@w3.org>
References
1. http://www.w3.org/
2. mailto:rnelson@w3.org
3. mailto:www-privacy-evaluator@w3.org
4. http://lists.w3.org/Archives/Public/www-privacy-evaluator/
5. http://www.w3.org/Privacy/19981101-evaluator.html#P3P
6. http://www.w3.org/Privacy/19981101-evaluator.html#FTC
7. http://www.w3.org/Privacy/19981101-evaluator.html#Critic
8. http://www.w3.org/Privacy/19981101-evaluator.html#W3C
9. http://www.w3.org/Privacy/19981101-evaluator.html#P3P
10. http://www.w3.org/Privacy/19981101-evaluator.html#Jigsaw
11. http://www.w3.org/Privacy/19981101-evaluator.html#HTML
12. http://www.w3.org/Privacy/19981101-evaluator.html#RFC822
13. http://www19.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0000.html
14. http://lists.w3.org/Archives/Public/www-privacy-evaluator/
15. http://lists.w3.org/Archives/Public/www-privacy-evaluator/1998Oct/0001.html
16. http://www.ics.uci.edu/~ackerman/pub/98i11/privacy-critics.pdf
17. http://www.ftc.gov/reports/privacy3/toc.htm
18. http://www.w3.org/TR/REC-html40/
19. http://www.w3.org/Jigsaw/
20. http://www.w3.org/P3P/
21. http://info.internet.isi.edu/in-notes/rfc/files/rfc822.txt
22. http://www.w3.org/Consortium/
23. http://www.w3.org/Consortium/Legal/ipr-notice.html#Copyright
24. http://www.w3.org/
25. http://www.lcs.mit.edu/
26. http://www.inria.fr/
27. http://www.keio.ac.jp/
28. http://www.w3.org/Consortium/Legal/ipr-notice.html#Legal Disclaimer
29. http://www.w3.org/Consortium/Legal/ipr-notice.html#W3C Trademarks
30. http://www.w3.org/Consortium/Legal/copyright-documents.html
31. http://www.w3.org/Consortium/Legal/copyright-software.html
32. http://www.w3.org/People/#nelson
33. mailto:rnelson@w3.org
--
| Rolf Nelson (rolf@w3.org), Project Manager, W3C at MIT
| "Try to learn something about everything
| and everything about something." --Huxley
Received on Sunday, 1 November 1998 14:48:32 UTC