Re: Some studies on the visibility of EV sites.

On Thu, Mar 6, 2008 at 7:18 AM, Hallam-Baker, Phillip <pbaker@verisign.com>
wrote:

>
> All studies are flawed.


I disagree with this.  More accurately stated, studies can only measure what
they are designed to measure.  A study is only "flawed" when it claims to
measure something it does not have the power to measure, or when there are
corrupting influences that were not taken into account.

There is a tendency among computer scientists to think that any study
performed in a lab or with a small number of participants is "flawed"
because it cannot measure human behavior and cannot predict how users will
act in the real world.  This is false.

Let's take another famous "security" experiment- the Stanford Prison
Experiment.  That study took 24 students, paid them $15 and put them in a
simulated prison environment in the basement of the Stanford psychology
building.  12 of the students were assigned to be prisoners and 12 to be
guards.  Would you say that, just because it was a small lab study in an
artificial environment, it was "flawed"?  (You could argue that it had
ethical problems, but I'll skip that for now.)  On the contrary, it
perfectly predicted how young people, when thrown into positions of
dominance over prisoners, would behave in Abu Ghraib.  The researchers
took great pains to
simulate the particular conditions they thought might influence human
behavior, and they measured their effect in a very controlled way, with
opportunity for close observation.

> The shopping cart studies are studies of real world users under field
> conditions. They are very large sample sizes and they demonstrate that
> the users are in fact noticing the green bar and modifying their
> behavior in response.


This study measures how many people notice the green bar and act on it
after they have been educated about it.  The setting and the number of
users do not change what you set out to measure in the study.  This study
does not have the power to predict how people will notice and use EV
indicators without any education.

> Now that is not the same as demonstrating that the users will make the
> right choice when faced with an attack. But I don't know how to measure
> that accurately. I don't think anyone else does either.


What do you want to measure?

Testing how users with no education will use EV indicators isn't that hard.
 Simply study users who are representative of your target population and
observe them in a usability study of "Bizy Bank".  You could do this
remotely or in the lab, or do both and compare.  You could have two groups,
one with EV and one without, and compare which group abandons their
shopping carts, or whatever behavior you are trying to induce with the
presence of the indicator.
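
Here's a minimal sketch of how you would compare the two groups, written in
Python; the abandonment counts are entirely hypothetical:

import math

def two_proportion_ztest(x1, n1, x2, n2):
    # Two-sided z-test for a difference between two proportions.
    # x1/n1 = abandonments/participants in the EV group,
    # x2/n2 = the same for the no-EV control group.
    p1, p2 = float(x1) / n1, float(x2) / n2
    p_pool = float(x1 + x2) / (n1 + n2)         # pooled rate under the null
    se = math.sqrt(p_pool * (1 - p_pool) * (1.0 / n1 + 1.0 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed p-value
    return z, p_value

# Hypothetical outcome: 40 of 200 EV users abandon vs. 62 of 200 controls.
z, p = two_proportion_ztest(40, 200, 62, 200)
print("z = %.2f, p = %.4f" % (z, p))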

> The question is whether these studies are more likely to illustrate the
> conditions that we care about than laboratory subjects with five test
> users.


Again, that study is less likely to illustrate the conditions you care
about.  An in-lab study of 5 users that realistically simulated the
training available to users today would give you more realistic data than a
5,000-person study where participants are told to look for the EV
indicator.  Better still would be a 5,000-person study that simulated
today's training conditions.  The number of participants is important for
showing statistical significance; it doesn't change your preconditions or
what you are measuring!
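
To make the sample-size point concrete, here is a back-of-the-envelope
power calculation using the standard two-proportion formula; the effect
sizes are invented for illustration:

import math

def n_per_group(p1, p2):
    # Rough participants-per-group needed to detect p1 vs p2 with a
    # two-sided two-proportion z-test at alpha = 0.05 and 80% power.
    z_alpha, z_beta = 1.96, 0.84
    var = p1 * (1 - p1) + p2 * (1 - p2)
    return int(math.ceil((z_alpha + z_beta) ** 2 * var / (p1 - p2) ** 2))

# A large effect (20% vs 40% abandonment) needs only ~79 people per group;
# a subtle one (20% vs 25%) needs about 1,090 per group.
print(n_per_group(0.20, 0.40), n_per_group(0.20, 0.25))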


> The scenario I care about is the one in which we have a user who
> receives a phishing bait email in their inbox, is suspicious but follows
> the link. In that scenario the user is primed to be security aware by
> definition.
>
>
> We do have a large number of users who are suspicious when they get
> these emails. The problem is that there are two possible outcomes
> (caught / not caught) and the attacker can deliberately confuse the
> user.
>
> I do not expect to get perfect results. A bank told me that 15% of
> customers who are suspicious enough to call the bank then go ahead and
> give their details AFTER being warned that it was a scam (!)


I have observed this in users immediately after an attack.  Attackers are
good at mimicking security instructions, and when the user is in a
heightened state of panic or alertness, they either trust nothing at all or
are too willing to comply with security advice.


> >From a loss management point of view what we need to do here is to draw
> out the state graph of the user with all the possible transitions,
> assign costs and probabilities to each arc. The user getting phished is
> not necessarily the highest cost outcome. Only a small percentage of
> phished users result in a loss, but every customer who calls the call
> center represents a cost.
>
>
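
That framing is worth making concrete.  Here's a toy sketch of the
expected-cost calculation over such a state graph; every state, probability,
and cost below is invented purely for illustration:

GRAPH = {
    # state: list of (probability, next_state) arcs; empty list = terminal
    "gets_bait":    [(0.7, "deletes"), (0.2, "follows_link"),
                     (0.1, "calls_bank")],
    "follows_link": [(0.6, "backs_out"), (0.4, "phished")],
    "phished":      [(0.8, "no_loss"), (0.2, "loss")],
    "deletes": [], "calls_bank": [], "backs_out": [],
    "no_loss": [], "loss": [],
}
COSTS = {"calls_bank": 10.0, "loss": 500.0}  # dollars per visit to a state

def expected_cost(state):
    # Expected total cost from `state` onward (the graph must be acyclic).
    cost = COSTS.get(state, 0.0)
    for prob, nxt in GRAPH[state]:
        cost += prob * expected_cost(nxt)
    return cost

print("expected cost per bait email: $%.2f" % expected_cost("gets_bait"))

With numbers like these you can see your point directly: whether phishing
losses or call-center volume dominates the expected cost depends entirely
on the arc probabilities, which is exactly what needs measuring.
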
> I think that you draw flawed conclusions from the laboratory tests.
>
>
> Take Rachna's test of a certain bank-to-user authentication technique
> with Doug Tygar (I would prefer we do not refer to it by name as I don't
> want to comment on a competitor's product). The lab tests show that the
> measure is ineffective under lab conditions when the users have nothing
> to lose and are primed to expect that they might be working with a
> prototype that may not function correctly.


Did you read the study?  We used real Bank of America users testing the real
Bank of America website, and all of them had experience using SiteKey and
were therefore pre-educated about it.  There was no "prototype that may not
function correctly" involved.  The study showed that both users with
something to lose (using their own accounts) and those who were playing a
role were not deterred when their SiteKey was missing.  The group with
something to lose behaved a little better, but not much.

The participants thought that they were testing the Bank of America website
as part of a usability study or focus group.  Many were not aware that we
were academics, and some even asked how we, as Bank of America employees,
had convinced Harvard to let us use its classrooms.  That is, even though
they signed a consent agreement, many thought that we were using the
classrooms as a convenience rather than as part of a laboratory study.
 Many users who saw security warnings and error dialogs were not suspicious
at all - in fact some warned us "The Bank of America website is having some
problems today.  You might want to wait until it is working before you
bring in more people".  This study, IMO, did one of the best jobs of not
priming users that they were in a security study.  We did include a
condition where we primed one group of users that the study was about
security, so that we could compare the results.


> We can also be pretty sure
> that the banks who continue to pay to use the measure are seeing an
> increase in consumer confidence.


The banks are definitely seeing an increase in confidence, and this was
something we observed in the study.  Even when other security errors or
signs of attack were present, users felt reassured to see their SiteKey
(even when they should not have).  The importance of the "feeling" of
security is something that we as security designers do not pay enough
attention to, IMO.


> Hypothesis 1: The user is confused about computer security
>
> Support: Numerous, users state that they are confused, attacks are
> predicated on confusion, security measures presented are confusing to
> security specialists.
>
>
> Hypothesis 2: A user that has become confused as a result of an
> inconsistent i/f will not be any less confused after an hour or even a
> day of use when presented with a consistent user i/f.
>
> Support: In the short term the internally consistent user interface
> represents something that is different to their experience and thus
> merely another data point that is inconsistent with the existing data
> points. We should not expect the user to be any less confused.
>
> Support: I am still learning how to use my MacBook after two weeks, and
> that is generally reckoned to be more consistent than the average system.
>
>
> Hypothesis 3: some percentage of confused users will become less
> confused if presented with a consistent interface over a prolonged
> period of time.
>
> Support: None at this time; the reasoning behind the hypothesis is that
> over time a user builds a mental model of the system. If the system is
> inconsistent or too complex, so is the model. The user is unable to
> predict the outcome of a particular interaction and this is reported as
> 'confusion'. If the user has experience of a consistent interface over a
> prolonged period of time, they are able to build a consistent mental model
> which allows them to predict the outcome of interactions. The user gains
> confidence.
>
>
> Note that all our existing data points are essentially consistent with
> these hypotheses. We would expect lab tests over short periods of time
> to result in the user being more confused and more susceptible to
> attack.
>
> Such studies can still be valid. I think that Rachna's test is valid
> because the system under test is essentially a bolt-on extra. It is only
> applied at one Web site, so I don't expect that the user is going to be
> very much better at generating a consistent mental model over time than
> in the lab. There is also some pretty compelling data from phishing
> attacks using some of the proposed vectors.
>
> But even if valid, tests can still be damaging. Within a few hours of
> the results being presented at a closed door meeting we started to see
> live attacks using precisely the vectors described.
>
>
> What I am looking for here is a way to obtain an independent test of
> hypothesis 3. I suspect we are going to need a somewhat significant
> quantity of money to do this and that is going to need to come from an
> independent source.
>
>
> -----Original Message-----
> From: Serge Egelman [mailto:egelman@cs.cmu.edu]
> Sent: Wednesday, March 05, 2008 11:54 AM
> To: Hallam-Baker, Phillip
> Cc: public-wsc-wg@w3.org
> Subject: Re: Some studies on the visibility of EV sites.
>
> This entire study is flawed: participants were primed for security.  Any
> legitimate security study would not begin by telling participants it is
> a security study, and telling them which browser features to look for!
>
> "The simple description participants heard in this study was: "The green
> address bar in Internet Explorer 7 means that this website is an
> Extended Validation website. Extended Validation, or EV, means that the
> website owner has gone through extra, rigorous steps with an authorized
> Certificate Authority to prove they are a secure site."
>
> Of course you're going to get favorable results when you tell them ahead
> of time what to look for!  In every study where participants have *not*
> been told to look for SSL indicators, they *rarely* notice them.  With
> the EV indicator specifically, I had 0 of 60 participants notice it.
>
> This study has not shown that the EV indicators are effective.  It has
> shown that when some schmuck calls people on the phone and tells them
> what to look for, they generally follow those instructions.  Hooray,
> VeriSign just paid to confirm the Hawthorne Effect.  It's too bad this
> has been known for over 50 years.
>
> serge
>
> Hallam-Baker, Phillip wrote:
> > OK, these are vendor studies, but they have much bigger sample sizes
> > and were run under field conditions. The Tec-Ed study is an independent
> > study we commissioned. These are the only studies I am aware of that
> > VeriSign has commissioned.
> >
> >
> > I don't think that the small sample size is the real problem in lab
> > tests. It's the lab itself. I have been using computers for 25 years; I
> > used a Mac every day at MIT. It has taken me over two weeks to get
> > used to my MacBook Air and I am still finding things out now.
> >
> > Nielsen's usability tests seem to me to be exactly right if your
> > objective is to design something in order to sell it. I was in the
> > Apple store for a total of about 30 minutes. I did not intend to buy
> > that particular model going in (I was going to buy a more expensive
> > model but they didn't have it in stock - thankfully).
> >
> > But what matters for stopping Internet crime is the long term user
> > interaction.
> >
> >
> > http://www.verisign.com/static/040655.pdf
> >
> > January 2007, Tec-Ed researched usage and attitudes of 384 online
> > shoppers
> >
> >     o Measured their responses to Web sites with and without green bars
> >         # 100% of participants notice whether a site shows the green
> >           EV bar
> >         # 93% of participants prefer to shop on sites that show the
> >           green bar
> >         # 97% are likely to share their credit card information on
> >           sites with the green EV bar, as opposed to only 63% with
> >           non-EV sites
> >         # 77% of participants report that they would hesitate to shop
> >           at a site that previously showed the green EV bar and no
> >           longer does so
> > *DebtHelp: 11% increase in transactions*
> > http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/debthelp.html
> >
> >
> > *Overstock: 8.6% decrease in abandoned shopping cart rate*
> > http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/overstock.html
> >
> >
> > *Scribendi: 27% increase in transactions*
> > http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/scribendi.html
> >
> >
>
> --
> /*
> PhD Candidate
> Carnegie Mellon University
>
> "Whoever said there's no such thing as a free lunch was never a grad
> student."
>
> All views contained in this message, either expressed or implied, are
> the views of my employer, and not my own.
> */
>
>

Received on Sunday, 9 March 2008 01:03:49 UTC