Demographic Data Collection (was Re: Automatic Entry and Forms)

Brian Behlendorf (brian@organic.com)
Wed, 28 Feb 1996 02:10:20 -0800 (PST)


Date: Wed, 28 Feb 1996 02:10:20 -0800 (PST)
From: Brian Behlendorf <brian@organic.com>
To: "Phillip M. Hallam-Baker" <hallam@w3.org>
Cc: www-html@w3.org
Subject: Demographic Data Collection (was Re: Automatic Entry and Forms)
In-Reply-To: <312D1EE7.3F54@w3.org>
Message-Id: <Pine.SGI.3.91.960228001333.11520p-100000@fully.organic.com>


This is a continuation of a thread which has been raging on www-html over 
the last few days, and since I believe that the answer belongs at a 
different level than HTML, I'm continuing it here.  Feel free to see its 
beginnings at 
<URL:http://www.eit.com/www.lists/www-html.1996q1/index.html> under 
"Automatic Entry and Forms", at least when the archive gets updated since 
it doesn't appear to be automatic.... if someone has a better pointer 
please post it.

A particular user posted a complaint about having to constantly re-enter 
his name and password into HTML forms, and wondered whether some sort of 
automatic form entry system could be developed for common fields.  A 
lively debate ensued as to whether such a system could be built to 
respect privacy, what privacy means, etc.  

I contend that information of a private nature - your name, your email 
address, your zip code, etc. - has real value to the party you would give 
it to, and thus should be incorporated into the payments and 
authentication layer rather than the application layer.

Information is currency; it has value.  Content providers should assume no
right to detect information about their visitors surreptitiously - but
likewise, the expectation that consumers have "eminent domain" over
content on the web is unrealistic.  If we want to create a world-wide *web*
(instead of a world-wide-hash-table, like we have now) we really should 
facilitate the giving of information (as well as payments and credentials) 
from client to server instead of just the other way around.

What the original poster was complaining about, I believe, was that he is 
quite happy and willing to give his personal name and email address to 
most places which ask for it, in exchange for obtaining something of 
value, but having to constantly type it in by hand is a pain. 

Similarly, many content providers have run into usability problems when they
try to ask for more information from their users - today CPs can prevent
access to a site for people who don't "register", but those who have
implemented such a system (like the one I built at hotwired) find that the
process of filling out the form and remembering a name and password can be
daunting and unnatural to the average user, in addition to being unscalable
and encouraging fraudulent information sharing.  In most cases the individual
is happy to give the necessary information; it's the process that is daunting. 

So, we have a set of users who want to be able to always give some 
types of information automatically, and some types of CPs who want to 
make it easy for people to give that information.


I'm specifically interested in enabling the following scenario: a certain
content provider makes their content available for free, advertiser
supported, on the following condition: you give the CP your zip code and
country code, and only your zip code and country code.  The CP uses this to
give their advertiser information about the audience.  Combined with
databases mapping zip codes to demographic data, the CP can easily determine
what age ranges the site appeals to, average income, etc, all of which makes
for a happy advertiser (who can get some assurances that their dollars are
going towards the right market) and a more informed content provider.  
The most important thing is that this information is only usable in 
aggregate form - I'll get into this later.
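
A rough sketch of that aggregate computation, in Python.  The lookup 
table, its figures, and the function name are all made up for 
illustration; a real table would come from census or commercial data 
keyed by zip code.

```python
from collections import Counter

# Hypothetical zip-to-demographics table; the numbers are invented.
ZIP_DEMOGRAPHICS = {
    "94107": {"median_age": 34, "median_income": 52000},
    "10001": {"median_age": 37, "median_income": 48000},
}

def audience_summary(zip_codes, demographics):
    """Return visit-weighted averages over the zip codes we can look up."""
    # Count visits per zip code, ignoring codes missing from the table.
    counts = Counter(z for z in zip_codes if z in demographics)
    total = sum(counts.values())
    if total == 0:
        return {}
    keys = ("median_age", "median_income")
    return {
        k: sum(demographics[z][k] * n for z, n in counts.items()) / total
        for k in keys
    }

# One zip code per visit; "99999" is not in the table and is skipped.
visits = ["94107", "94107", "10001", "99999"]
summary = audience_summary(visits, ZIP_DEMOGRAPHICS)
```

Nothing in the summary refers to an individual visitor, only to 
averages over the whole audience.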


What this suggests is that users have access to a small set of common 
bits of information about themselves, and are able to set a small set of 
different policies regarding the level of "automatic-ness" to which the 
information is given.  For example, to take the list from Dan Connolly's 
revision of his Business-Card Authentication proposal 
<URL:http://www.w3.org/pub/WWW/Demographics/Proposals.html>:

> profile-full-name
> profile-first-name
> profile-last-name
> profile-email-address
> profile-home-url
> profile-affiliation
> profile-affiliation-url
> profile-postal-street
> profile-postal-street-2
> profile-postal-city
> profile-postal-state
> profile-postal-zip
> profile-business-phone
(to which I'd add:)
profile-postal-country
profile-age
profile-age-dec (for those who'd rather say they were in their 50's than 
                 they were 53)

So, let's say the scenario I postulated above is common, and I as a user 
consider my zip code relatively non-private; it's personal information, 
since it's about me, but it's not private information, since it's public 
knowledge.  Thus, I have no problem always giving that information out to 
*anyone* who wants it, if it means that I will be able to obtain 
information I wouldn't be able to otherwise for free.  

However, I might feel differently about my email address - I want to 
give that out on a fairly regular basis (I'm a public person, I enjoy 
meeting new people, etc.), but I don't want it given out to everyone 
who asks.  I want to be prompted when it's asked for by a particular 
entity.  

Thus, this profile is actually a matrix, of variables and policies.  For 
usability reasons it makes sense to keep both dimensions as small as 
possible.  I envision just two policies right now - always give when 
asked, and give with prompting.  "Never give" is the same as not 
filling in the entry in the profile.  This is *purely* a client issue - 
from a protocol perspective, the server simply tells the client it needs 
that info in exchange for a particular resource.  The specification 
should also strongly suggest that clients have this tunable by system- or 
network-wide configuration files, just as Java capabilities and policies 
will hopefully be.
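
As a minimal sketch of that matrix on the client side (Python; the 
class and method names are illustrative and not part of any proposal), 
each entry pairs a value with one of the two policies, and a missing 
entry means "never give":

```python
from enum import Enum

class Policy(Enum):
    ALWAYS = "always"   # give whenever a server asks
    PROMPT = "prompt"   # ask the user first

class Profile:
    def __init__(self):
        self._entries = {}  # field name -> (value, Policy)

    def set(self, name, value, policy):
        self._entries[name] = (value, policy)

    def release(self, name, confirm):
        """Return the value to send, or None to withhold it.

        `confirm` is a callable(name) -> bool standing in for
        whatever prompt the browser shows the user.
        """
        entry = self._entries.get(name)
        if entry is None:          # blank entry means "never give"
            return None
        value, policy = entry
        if policy is Policy.ALWAYS or confirm(name):
            return value
        return None

profile = Profile()
profile.set("profile-postal-zip", "94107", Policy.ALWAYS)
profile.set("profile-email-address", "user@example.com", Policy.PROMPT)
```

Here profile.release("profile-postal-zip", confirm) never calls the 
prompt, while the email address goes out only if confirm returns true.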

So here's how something like that might work: using the Authorization: 
header, much like Dan's anonymous auth proposal (also at 
<URL:http://www.w3.org/pub/WWW/Demographics/Proposals.html>).  I.e.

  WWW-Authenticate: profile profile-postal-zip

to which the client responds

  Authorization: profile profile-postal-zip=94107

However, this proposal defeats caching, since caches can not cache the 
result of an "authenticated" request.  It may make sense to use another 
header for this purpose, one which the proxies may accumulate and 
transfer to the origin server in bulk later on.  Another reason to use a 
different header is that the server might want to be able to ask for it 
on a voluntary basis, not a mandatory one.
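
For the Authorization-based exchange shown above, the client side 
might look roughly like the following Python sketch.  The function 
names are illustrative, and separating multiple fields with spaces is 
an assumption; the example headers only show a single field.

```python
def parse_challenge(header_value):
    """Split 'profile field1 field2' into the requested field names.

    Returns an empty list for any scheme other than "profile".
    """
    scheme, _, rest = header_value.strip().partition(" ")
    if scheme.lower() != "profile":
        return []
    return rest.split()

def build_authorization(requested, released):
    """Build the Authorization value from fields the user released.

    `released` maps field names to values; fields that were requested
    but not released are simply left out.  Returns None when there is
    nothing to send.
    """
    pairs = [f"{name}={released[name]}"
             for name in requested if name in released]
    if not pairs:
        return None
    return "profile " + " ".join(pairs)

wanted = parse_challenge("profile profile-postal-zip")
reply = build_authorization(wanted, {"profile-postal-zip": "94107"})
```

Values containing spaces (a street address, say) would need some 
quoting rule that this sketch does not attempt.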

The privacy implications from a client software perspective are easy to 
address - no information goes out unapproved, just as I would expect a 
browser with integrated "wallet" to not send my cash and CC number to 
every site that asked for it.  In other words, since the UI issues of 
user authorization of release of information are already being addressed, 
this can "piggyback" on top of that, and hopefully use the same 
interface, so users can really feel in control of to whom their 
information flows.

The privacy implications on the server side are admittedly murkier, but I 
contend no murkier than what we currently have.  Today, content providers can 
throw up authentication on their sites and restrict access to only those 
users who give them personal information anyways, through clunky HTML 
forms and names and passwords.  Thus, the use of data in that instance is 
the same as we would have if submission of that information were 
automated - it is up to the ethical policies of the content provider to 
determine what gets done with that information.  I will contend that no 
protocol can enforce policies once a content provider has personal 
information, and that our best hope in this area is to see adoption of 
data privacy laws similar to those in Europe.  Technology can not solve 
all of society's problems, not even cryptography.  :)

Should this database of information be available to applications?  I.e.,
should we have a mechanism for auto-inclusion in forms, for Java apps to
access that information, etc?  I would cede this as a possibility but *only*
if the same policies would apply about when to automatically give certain
information and when to prompt.  For example, a well-designed forms
implementation could detect the presence of the "profile-email-address"
anywhere in the <FORM> tags and handle that as per the policies - same 
thing with a Java app accessing applet.browser.profile.email-address or 
whatever namespace Java apps will use for client-side resources like that.
Certainly there is danger from badly designed applications, but 
not necessarily from malicious network objects, which are the real 
concern, since badly designed applications get CERT warnings and NYTimes 
stories.

So, for HTML forms, there can be a direct mapping between the variable 
names and SGML entities, such that "&profile-email-address;" will insert 
my address.  This will be useful elsewhere, too - imagine being able to 
put a "Good morning, &profile-first-name;!" at the top of a page.  In 
that case security restrictions can be avoided since that content's not 
going back to a server somewhere.
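
A minimal sketch of that display-only substitution, in Python.  The 
function name and the plain dictionary standing in for the profile are 
assumptions for illustration; values are HTML-escaped before insertion 
and unknown entities are left untouched.

```python
import html
import re

# Matches entity references such as &profile-first-name;
ENTITY = re.compile(r"&(profile-[a-z0-9-]+);")

def expand_profile_entities(text, profile):
    """Replace &profile-...; references with values from `profile`."""
    def substitute(match):
        name = match.group(1)
        if name in profile:
            return html.escape(profile[name])
        return match.group(0)   # no such entry: leave the reference alone
    return ENTITY.sub(substitute, text)

greeting = expand_profile_entities(
    "Good morning, &profile-first-name;!",
    {"profile-first-name": "Brian"},
)
```

Since the expanded text is only rendered locally, nothing here needs 
to consult the release policies; the same lookup inside a form that 
gets submitted would.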

Note that the specific application I outlined way back is just one thing 
that could be enabled using this mechanism - if say a user were to make 
their zip code available for all, then a visit to www.bigbook.com could 
place them square at their neighborhood for their first search, instead 
of having to hunt for it like now.

I am interested in holding discussion on this idea if we can keep it
civil and make sure that issues of privacy, security, and functionality 
are discussed at levels respecting the state of things as they stand 
today, and how we can make certain parts of the system better without 
making other parts of the system worse.  I will wear my heart on my sleeve 
and state that yes, I am in the "webvertising" business, though we are 
more interested in creating compelling content, and getting paid for it, 
than we are in selling jeans or toothpaste.  

I sincerely hope people agree that pulling this functionality out of 
the application layer into the payments/authorization layer makes sense.

	Brian

--=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--
brian@organic.com brian@hyperreal.com http://www.[hyperreal,organic].com/