Re: Some studies on the visibility of EV sites. from Serge Egelman on 2008-03-06 (public-wsc-wg@w3.org from March 2008)

From: Serge Egelman <egelman@cs.cmu.edu>
Date: Thu, 06 Mar 2008 09:29:06 -0800
To: "Hallam-Baker, Phillip" <pbaker@verisign.com>
CC: public-wsc-wg@w3.org
Message-ID: <47D029E2.80605@cs.cmu.edu>
Hallam-Baker, Phillip wrote:
> All studies are flawed.
> 
> The shopping cart studies are studies of real world users under field
> conditions. They are very large sample sizes and they demonstrate that
> the users are in fact noticing the green bar and modifying their
> behavior in response.

I wasn't talking about the shopping cart studies.  The main focus of 
your message was the "field" study performed by Tec-Ed.  I'm sorry if it 
wasn't clear that that was what I was commenting on.

That study was not performed under field conditions.  A field study 
involves observing participants in their natural setting.  Calling them 
up and walking them through a series of artificial tasks is not a field 
study.  That's a laboratory study; it just happens to be a laboratory 
study performed in the home.

Furthermore, a large sample size doesn't automatically make a study 
scientifically sound.  A large sample increases statistical power, which 
makes it easier for a real effect to reach statistical significance, and 
that holds in the laboratory as well as in an observational field study.  
However, since there doesn't appear to be any statistical analysis in any 
of the studies that you have posted, talking about sample size is rather 
moot.
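
To illustrate why sample size means little without an analysis, here is a 
minimal sketch of a two-proportion z-test.  The function name and all of 
the counts are made up for illustration; none of them come from the 
studies under discussion.

```python
import math

def two_proportion_z_test(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference between two independent proportions."""
    p_a = success_a / n_a
    p_b = success_b / n_b
    # Pooled proportion under the null hypothesis of no difference.
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value

# Hypothetical data: the same 60% vs. 50% split at two sample sizes.
_, p_small = two_proportion_z_test(12, 20, 10, 20)
_, p_large = two_proportion_z_test(1200, 2000, 1000, 2000)
print(f"n=20 per group:   p = {p_small:.3f}")
print(f"n=2000 per group: p = {p_large:.2e}")
```

The same observed difference is indistinguishable from noise at n=20 and 
overwhelming at n=2000, but only the test tells you which case you are in.  
And no sample size corrects for a biased procedure such as priming.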

> 
> Now that is not the same as demonstrating that the users will make the
> right choice when faced with an attack. But I don't know how to measure
> that accurately. I don't think anyone else does either.
> 

You could start by attacking the users and observing what they do.  Of 
course, telling them ahead of time that the study is about security and 
that they will be attacked is going to confound your results.  I would 
hope that everyone can agree on this very basic point.

> 
> The question is whether these studies are more likely to illustrate the
> conditions that we care about than laboratory subjects with five test
> users.

I'm unfamiliar with any ecologically valid observational studies that 
have found statistical significance with five participants.
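
As a back-of-the-envelope check, the sketch below computes the smallest 
p-value that an exact two-sided sign test can ever produce with n 
participants.  The sign test is chosen here purely for illustration (it is 
an assumption, not the analysis used in any study mentioned in this 
thread), and the function name is hypothetical.

```python
from math import comb

def min_two_sided_sign_test_p(n):
    """Smallest two-sided p-value an exact sign test can yield with n participants."""
    # The most extreme outcome is all n participants changing in the same
    # direction; under the null, that has probability C(n, n) * 0.5**n per tail.
    return min(1.0, 2 * comb(n, n) * 0.5 ** n)

for n in (5, 6):
    print(n, min_two_sided_sign_test_p(n))
```

Under that assumption, five participants can do no better than p = 0.0625, 
which is above the conventional 0.05 threshold even when every single 
participant behaves the same way; six is the minimum needed to cross it.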

> 
> The scenario I care about is the one in which we have a user who
> receives a phishing bait email in their inbox, is suspicious but follows
> the link. In that scenario the user is primed to be security aware by
> definition.

No, you clearly don't understand the definition of priming.  Priming is 
when the subject has been tipped off as to the purpose of the study and 
is therefore likely to exhibit behaviors that will differ from how they 
would normally behave.  For instance, telling subjects to look for EV 
indicators, and then "observing" whether they notice them does not make 
for an ecologically valid study.  In this case, the subjects have been 
primed for security and the results of the study will be completely 
bogus (assuming the purpose was to examine how many people notice the 
indicators under natural conditions).

The case that you are describing above is not priming.  Just because the 
user has received a phishing message doesn't mean they've been primed 
for security.  It means they've received a phishing message.  There 
might have been priming involved, but what you've described is not priming.

> 
> 
> We do have a large number of users who are suspicious when they get
> these emails. The problem is that there are two possible outcomes
> (caught / not caught) and the attacker can deliberately confuse the
> user.
> 
> I do not expect to get perfect results. A bank told me that 15% of
> customers who are suspicious enough to call the bank then go ahead and
> give their details AFTER being warned that it was a scam (!)

Doesn't that cause great concern regarding the number of users who don't 
call the bank?

> 
> From a loss management point of view what we need to do here is to draw
> out the state graph of the user with all the possible transitions,
> assign costs and probabilities to each arc. The user getting phished is
> not necessarily the highest cost outcome. Only a small percentage of
> phished users results in a loss, every customer who calls the call
> center represents a cost.
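
The state-graph idea quoted above can be sketched as an expected-cost 
calculation over a small acyclic graph.  This is only an illustration: the 
state names, arc costs, and probabilities are all placeholders, except the 
0.15, which is the bank anecdote quoted earlier in this message.

```python
# Each state maps to its outgoing arcs: (probability, arc_cost, next_state).
# Every number is a placeholder except 0.15, which is the bank anecdote
# quoted earlier in this thread.
GRAPH = {
    "suspicious_email": [(0.5, 0.0, "calls_bank"), (0.5, 0.0, "no_call")],
    "calls_bank": [(0.15, 5.0, "phished"), (0.85, 5.0, "safe")],
    "no_call": [(0.30, 0.0, "phished"), (0.70, 0.0, "safe")],
    "phished": [(0.10, 0.0, "loss"), (0.90, 0.0, "safe")],
    "loss": [],
    "safe": [],
}
# Hypothetical cost of ending up in each terminal state.
TERMINAL_COST = {"loss": 1000.0, "safe": 0.0}

def expected_cost(state):
    """Expected total cost from `state`, assuming the graph has no cycles."""
    arcs = GRAPH[state]
    if not arcs:
        return TERMINAL_COST[state]
    # Outgoing probabilities of a non-terminal state must sum to 1.
    assert abs(sum(p for p, _, _ in arcs) - 1.0) < 1e-9
    return sum(p * (cost + expected_cost(nxt)) for p, cost, nxt in arcs)

for state in ("calls_bank", "no_call", "suspicious_email"):
    print(f"{state}: expected cost {expected_cost(state):.2f}")
```

Note that a model like this answers the cost question only.  It says 
nothing about whether a given indicator actually changes the probabilities 
on the arcs, which is what a security study has to measure.
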
> 
> 
> I think that you draw flawed conclusions from the laboratory tests. 

I didn't write this; you're quoting someone else.

> 
> 
> Take Rachma's test on a certain bank to user authentication technique
> with Doug Tygar (I would prefer we do not refer to it by name as I don't
> want to comment on a competitor's product). The lab tests show that the
> measure is ineffective under lab conditions when the users have nothing
> to lose and are primed to expect that they might be working with a
> prototype that may not function correctly. We can also be pretty sure
> that the banks who continue to pay to use the measure are seeing an
> increase in consumer confidence.

I'm not sure which paper you mean.  I assume you're talking about 
SiteKey, but that wasn't with Doug Tygar (see, using proper nouns has 
its uses).  In that particular study, the subjects used their own 
account credentials, so saying they had nothing to lose is a bit of an 
overstatement (if you disagree, feel free to send me your account 
credentials).  Furthermore, if in fact the subjects knew the real 
purpose of the study and only behaved a certain way because of this 
knowledge, all of the psychology literature indicates that the subjects 
should have behaved more securely.

Furthermore, you seem to conflate consumer confidence with actual 
security.  Why does the bank want increased security?  To minimize costs 
and increase consumer confidence.  If faced with the decision to spend a 
small amount of money to increase consumer confidence, or spending a lot 
more money to provide actual security, the smart business decision is to 
do the former.  For a business, security isn't the end goal; the end 
goal is to increase revenues and lower costs.  If a completely ineffective 
system can accomplish this for a small amount of money, of course it's 
going to be adopted.  But please don't interpret this to mean that it's 
effective.

The laboratory study wasn't examining whether this system would improve 
Bank of America's bottom line.  The point of the study was to examine 
how much security it really provided.  Clearly whether it provides real 
security and whether it's a cost effective measure are two completely 
different questions.


serge

> 
> 
> Hypothesis 1: The user is confused about computer security
> 
> Support: Numerous, users state that they are confused, attacks are
> predicated on confusion, security measures presented are confusing to
> security specialists.
> 
> 
> Hypothesis 2: A user that has become confused as a result of an
> inconsistent i/f will not be any less confused after an hour or even a
> day of use when presented with a consistent user i/f .
> 
> Support: In the short term the internally consistent user interface
> represents something that is different to their experience and thus
> merely another data point that is inconsistent with the existing data
> points. We should not expect the user to be any less confused.
> 
> Support: I am still learning how to use my MacBook after two weeks and
> that is generally reckoned to be a more consistent system than average.
> 
> 
> Hypothesis 3: some percentage of confused users will become less
> confused if presented with a consistent interface over a prolonged
> period of time.
> 
> Support: None at this time, the reasoning behind the hypothesis is that
> over time a user builds a mental model of the system. If the system is
> inconsistent or too complex, so is the model. The user is unable to
> predict the outcome of a particular interaction and this is reported as
> 'confusion'. If the user has experience of a consistent interface over a
> prolonged period of time they are able to build a consistent mental model
> which allows them to predict the outcome of interactions. The user gains
> confidence.
> 
> 
> Note that all our existing data points are essentially consistent with
> these hypotheses. We would expect lab tests over short periods of time
> to result in the user being more confused and more susceptible to
> attack.
> 
> Such studies can still be valid. I think that Rachma's test is valid
> because the system under test is essentially a bolt-on extra. It is only
> applied at one Web site, I don't expect that the user is going to be
> very much better at generating a consistent mental model over time than
> in the lab. There is also some pretty compelling data from phishing
> attacks using some of the proposed vectors.
> 
> But even if valid, tests can still be damaging. Within a few hours of
> the results being presented at a closed door meeting we started to see
> live attacks using precisely the vectors described.
> 
> 
> What I am looking for here is a way to obtain an independent test of
> hypothesis 3. I suspect we are going to need a somewhat significant
> quantity of money to do this and that is going to need to come from an
> independent source. 
> 
> 
> -----Original Message-----
> From: Serge Egelman [mailto:egelman@cs.cmu.edu] 
> Sent: Wednesday, March 05, 2008 11:54 AM
> To: Hallam-Baker, Phillip
> Cc: public-wsc-wg@w3.org
> Subject: Re: Some studies on the visibility of EV sites.
> 
> This entire study is flawed: participants were primed for security.  Any
> legitimate security study would not begin by telling participants it is
> a security study, and telling them which browser features to look for!
> 
> "The simple description participants heard in this study was: "The green
> address bar in Internet Explorer 7 means that this website is an
> Extended Validation website. Extended Validation, or EV, means that the
> website owner has gone through extra, rigorous steps with an authorized
> Certificate Authority to prove they are a secure site."
> 
> Of course you're going to get favorable results when you tell them ahead
> of time what to look for!  In every study where participants have *not*
> been told to look for SSL indicators, they *rarely* notice them.  With
> the EV indicator specifically, I had 0 of 60 participants notice it.
> 
> This study has not shown that the EV indicators are effective.  It has
> shown that when some schmuck calls people on the phone and tells them
> what to look for, they generally follow those instructions.  Hooray,
> VeriSign just paid to confirm the Hawthorne Effect.  It's too bad this
> has been known for over 50 years.
> 
> serge
> 
> Hallam-Baker, Phillip wrote:
>> OK, these are vendor studies, but they are much bigger sample sizes 
>> and under field conditions. The Tec-Ed study is an independent study 
>> we commissioned. These are the only studies I am aware of that VeriSign
>> has commissioned.
>>  
>>  
>> I don't think that the small sample size is the real problem in lab 
>> tests. It's the lab itself. I have been using computers for 25 years, I
>> used a Mac every day at MIT. It has taken me over two weeks to get 
>> used to my MacBook Air and I am still finding things out now.
>>  
>> Nielsen's usability tests seem to me to be exactly right if your 
>> objective is to design something in order to sell it. I was in the 
>> Apple store for a total of about 30 minutes. I did not intend to buy 
>> that particular model going in (I was going to buy a more expensive 
>> model but they didn't have it in stock - thankfully).
>>  
>> But what matters for stopping Internet crime is the long term user 
>> interaction.
>>  
>>  
>> http://www.verisign.com/static/040655.pdf
>>
>> January 2007, Tec-Ed researched usage and attitudes of 384 online 
>> shoppers
>>
>>     o Measured their responses to Web sites with and without green bars
>>         # 100% of participants notice whether a site shows the green
>>           EV bar
>>         # 93% of participants prefer to shop on sites that show the
>>           green bar
>>         # 97% are likely to share their credit card information on
>>           sites with the green EV bar, as opposed to only 63% with
>>           non-EV sites
>>         # 77% of participants report that they would hesitate to shop
>>           at a site that previously showed the green EV bar and no
>>           longer does so
>>
>> *DebtHelp: 11% increase in transactions*
>> http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/debthelp.html
>>
>> *Overstock: 8.6% decrease in abandoned shopping cart rate*
>> http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/overstock.html
>>
>> *Scribendi: 27% increase in transactions*
>> http://www.verisign.com/Resources/success-stories/SSL_and_VeriSign_Secured_Seal/scribendi.html
>>
>>
> 
> --
> /*
> PhD Candidate
> Carnegie Mellon University
> 
> "Whoever said there's no such thing as a free lunch was never a grad
> student."
> 
> All views contained in this message, either expressed or implied, are
> the views of my employer, and not my own.
> */
> 

-- 
/*
PhD Candidate
Carnegie Mellon University

"Whoever said there's no such thing as a free lunch was never a grad 
student."

All views contained in this message, either expressed or implied, are 
the views of my employer, and not my own.
*/
Received on Thursday, 6 March 2008 17:30:07 UTC