18

It's often said that you should aim for 5-10 usability testing participants to identify the bulk of the issues in a system. There's plenty of research behind this, and it's the right way to do usability testing: it's about uncovering issues rather than proving anything.

However... stakeholders often demand proof that something really is an issue: that the participants who struggled with it aren't just a handful of stupid people out of a user base of half a million.

I wonder: is there any way maths can be brought into usability testing analysis, on top of the useful work of identifying issues and their likely causes, to put a number on the likelihood of a person encountering an issue?

Sure, if I have 10 participants and 8 have the issue, I can say it's an 80% chance, but that relies entirely on the people I tested with. How can this be scaled up to take the size of the overall user base into account? Would the right direction be a separate calculation on top of my 80% to show the likelihood of my participants actually being representative users?

3
  • 2
    This article should give you better insight into quant user tests: nngroup.com/articles/how-many-test-users . With 5-10 users you cannot say "80% of our users". What you can say is that a large proportion of the users tested experienced issues with the design. Qual research (5-7 users) is used to get insights into your design. Quant user tests are expensive and hurt your ROI; you can get better results with other quant tools.
    – Kevin M.
    Commented Jan 9, 2020 at 11:48
  • I would argue that the people you ask to test something are still more likely to complete the task than the average user of any application aimed at the masses. It's completely different for applications aimed at highly technical people, of course. Commented Jan 10, 2020 at 11:44
  • Showing that your sample is representative of the population of interest is incredibly hard and cannot be done by simple calculations. I like to use election polling as an analogy. You can take all the samples you want, but if your sample doesn't have the same socioeconomic/demographic distribution as the eventual voting population none of the math matters. The biggest 'misses' in prediction usually arise from your sample being skewed from the population of interest in some non-obvious way.
    – eps
    Commented Jan 10, 2020 at 21:13

6 Answers

22

You asked whether there are any statistics that can be used to determine if the 10 people you are testing with are especially dumb. Essentially, what you're asking is:

What is the likelihood that my sample deviates significantly from the population?

Looking at it simply, you could assume all your users (let's say 500,000) are drawn from a normal distribution of intelligence. That tells us roughly 68% of the 500,000 (340,000) are within one standard deviation of the mean. Let's assume (somewhat arbitrarily) that anyone more than one standard deviation below the mean is considered dumb; that means any time you pick a person at random there's a 16% chance they are dumb (100 - 68 = 32, but half of those would be unusually intelligent). 16% is pretty high, until you start getting more and more people involved.

  • 1 person = 16% chance they're dumb

  • 2 people = 2.56% chance they're both dumb

  • 3 people = 0.4% chance they're all dumb

  • ...

  • 5 people = 0.01%

  • ...

  • 10 people = 0.000001%

You get the idea: with each additional person you draw, the odds get smaller and smaller. (The values above are 100 × 0.16^n, where n is the number of people and 0.16 is the decimal form of 16 percent.)
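As a rough sketch of that arithmetic (a hypothetical Python snippet, standard library only, using the same "one standard deviation below the mean" cutoff assumed above):

```python
from statistics import NormalDist

# Chance a randomly drawn person is "dumb", i.e. more than one standard
# deviation below the mean of a normal distribution (~0.1587, rounded to 0.16).
p_dumb = NormalDist().cdf(-1)

# Chance that *all* n randomly drawn participants fall below that cutoff.
for n in (1, 2, 3, 5, 10):
    print(f"{n:2d} people: {100 * p_dumb ** n:.6f}% chance they're all dumb")
```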

EDIT:

I figured I could add a little more detail on the latter part of your question. You asked: if 8 out of 10 participants have an issue, how can this be scaled up to take the whole user base into account? We can't simply say it's 8 dumb users and compute 100 × 0.16^8, because there aren't just 8 dumb users; there are also 2 smart users. There are a few ways to handle this, but I find this the most intuitive:

If we have 10 users, and we assume the distribution discussed above (dumb means more than 1 standard deviation below the mean), then we'd expect between 1 and 2 dumb users per 10 people (16% ≈ 1 or 2 out of 10). What's the chance we get 8 dumb users out of 10? We could first ask for the chance of getting 8 dumb people followed by 2 smart people: 100 × (0.16^8 × 0.84^2), or about 0.00003%. But that's not quite right, because it says we must get the dumb people as the first 8 and the smart people as the last 2. What if we got 1 smart person, then 8 dumb people, then 1 smart person again? That would be 100 × (0.84 × 0.16^8 × 0.84), which again gives 0.00003%, but it's a different arrangement of the possibilities, so these percentages add up. So now the question is: how many arrangements of 2 smart and 8 dumb are there? There are 45. There are a few different ways to work out the 45, but I like the slow way:

  1. ['s', 's', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd']

  2. ['s', 'd', 's', 'd', 'd', 'd', 'd', 'd', 'd', 'd']

  3. ['s', 'd', 'd', 's', 'd', 'd', 'd', 'd', 'd', 'd']

  4. ['s', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 'd']

  5. ['s', 'd', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd']

  6. ['s', 'd', 'd', 'd', 'd', 'd', 's', 'd', 'd', 'd']

  7. ['s', 'd', 'd', 'd', 'd', 'd', 'd', 's', 'd', 'd']

  8. ['s', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 's', 'd']

  9. ['s', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 's']

  10. ['d', 's', 's', 'd', 'd', 'd', 'd', 'd', 'd', 'd']

  11. ['d', 's', 'd', 's', 'd', 'd', 'd', 'd', 'd', 'd']

  12. ['d', 's', 'd', 'd', 's', 'd', 'd', 'd', 'd', 'd']

  13. ['d', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd']

  14. ['d', 's', 'd', 'd', 'd', 'd', 's', 'd', 'd', 'd']

  15. ['d', 's', 'd', 'd', 'd', 'd', 'd', 's', 'd', 'd']

  16. ['d', 's', 'd', 'd', 'd', 'd', 'd', 'd', 's', 'd']

  17. ['d', 's', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 's']

  18. ['d', 'd', 's', 's', 'd', 'd', 'd', 'd', 'd', 'd']

  19. ['d', 'd', 's', 'd', 's', 'd', 'd', 'd', 'd', 'd']

  20. ['d', 'd', 's', 'd', 'd', 's', 'd', 'd', 'd', 'd']

  21. ['d', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd', 'd']

  22. ['d', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd', 'd']

  23. ['d', 'd', 's', 'd', 'd', 'd', 'd', 'd', 's', 'd']

  24. ['d', 'd', 's', 'd', 'd', 'd', 'd', 'd', 'd', 's']

  25. ['d', 'd', 'd', 's', 's', 'd', 'd', 'd', 'd', 'd']

  26. ['d', 'd', 'd', 's', 'd', 's', 'd', 'd', 'd', 'd']

  27. ['d', 'd', 'd', 's', 'd', 'd', 's', 'd', 'd', 'd']

  28. ['d', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd', 'd']

  29. ['d', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's', 'd']

  30. ['d', 'd', 'd', 's', 'd', 'd', 'd', 'd', 'd', 's']

  31. ['d', 'd', 'd', 'd', 's', 's', 'd', 'd', 'd', 'd']

  32. ['d', 'd', 'd', 'd', 's', 'd', 's', 'd', 'd', 'd']

  33. ['d', 'd', 'd', 'd', 's', 'd', 'd', 's', 'd', 'd']

  34. ['d', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's', 'd']

  35. ['d', 'd', 'd', 'd', 's', 'd', 'd', 'd', 'd', 's']

  36. ['d', 'd', 'd', 'd', 'd', 's', 's', 'd', 'd', 'd']

  37. ['d', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd', 'd']

  38. ['d', 'd', 'd', 'd', 'd', 's', 'd', 'd', 's', 'd']

  39. ['d', 'd', 'd', 'd', 'd', 's', 'd', 'd', 'd', 's']

  40. ['d', 'd', 'd', 'd', 'd', 'd', 's', 's', 'd', 'd']

  41. ['d', 'd', 'd', 'd', 'd', 'd', 's', 'd', 's', 'd']

  42. ['d', 'd', 'd', 'd', 'd', 'd', 's', 'd', 'd', 's']

  43. ['d', 'd', 'd', 'd', 'd', 'd', 'd', 's', 's', 'd']

  44. ['d', 'd', 'd', 'd', 'd', 'd', 'd', 's', 'd', 's']

  45. ['d', 'd', 'd', 'd', 'd', 'd', 'd', 'd', 's', 's']

(Where 'd' = dumb and 's' = smart)

Phew. Now that that's done, and since we've already shown the order doesn't affect the probability of a specific outcome, we can say the chance that 8 out of your 10 users got it wrong simply because they are dumb is 100 × (0.84^2 × 0.16^8) × 45, or about 0.00136373801582592%. How low is that? Well... it's roughly as likely as being dealt a straight flush in poker.
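For reference, the 45 arrangements and the final figure fall straight out of the standard binomial formula; here is a minimal sketch of the same calculation in Python (standard library only), rather than listing the arrangements by hand:

```python
from math import comb

p_dumb = 0.16   # chance a random participant is "dumb" (one SD below the mean)
n, k = 10, 8    # 10 participants, 8 of whom had the issue

arrangements = comb(n, k)  # the 45 orderings counted above
probability = arrangements * p_dumb**k * (1 - p_dumb)**(n - k)

print(arrangements)                  # 45
print(f"{100 * probability:.10f}%")  # ~0.0013637380%
```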

11
  • 25
    Of course none of this fancy math will convince your stakeholders/managers if THEY are the dumb ones. Commented Jan 10, 2020 at 11:47
  • 3
    @TomášZato-ReinstateMonica I guess the only way around that is to get enough managers that it's unlikely they could all be dumb! :D Commented Jan 10, 2020 at 16:50
  • 8
    The problem with this answer, as it pertains to the question in the OP, is that it assumes the people chosen for the study are an unbiased sample of the overall population. None of this math matters if the selection method for choosing the 10 people is biased, because you are assuming a binomial distribution, which assumes statistical independence and so on. The real question the OP needs to address before they can use this answer is whether the people used are representative of the population of interest, which is much harder to show.
    – eps
    Commented Jan 10, 2020 at 20:53
  • 1
    It's like doing election polling -- all the math behind it is useless if you are only polling land-line users, because people with landlines are not representative of the overall voting population.
    – eps
    Commented Jan 10, 2020 at 20:55
  • 1
    Please don't use comments to get into extended conversations. There is a User Experience Chat feature for such purposes.
    – JonW
    Commented Jan 13, 2020 at 14:00
18

In one study in the past I found that "5 out of 5 users couldn't complete the task", and my stakeholders simply didn't believe the data. So this is something to consider.

While Nielsen advocates that 5 people will discover your more serious problems, he also goes on to say that you should address those findings and then retest with a different set of 5 users. He explains that if you keep doing this, the number of new problems diminishes very quickly. https://www.nngroup.com/articles/why-you-only-need-to-test-with-5-users/

Elaborate usability tests are a waste of resources. The best results come from testing no more than 5 users and running as many small tests as you can afford

He later followed up on that article with this one: https://www.nngroup.com/articles/how-many-test-users/

Testing with 5 people lets you find almost as many usability problems as you'd find using many more test participants

In general you need to think through your testing (mix quantitative with qualitative), and you need to spend as much time on planning as on analysing results; the testing itself is the smallest part, time-wise.

Remember, you need to present your results to your stakeholders, and they need to see how rigorous you have been in discovering those insights (the insights are your output to them). If your sample size is small, we generally do NOT convert counts into percentages, and don't forget you can use statistical techniques such as calculating a confidence interval, so they can see something like "x plus or minus y users couldn't complete this task."
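As an illustration of that last point, here is a minimal sketch of such an interval (using the simple normal approximation; for samples this small a Wilson or exact interval would be more defensible, but the idea is the same):

```python
from math import sqrt
from statistics import NormalDist

failures, n = 8, 10              # e.g. 8 of 10 participants couldn't complete the task
p_hat = failures / n
z = NormalDist().inv_cdf(0.975)  # ~1.96 for a 95% confidence interval

margin = z * sqrt(p_hat * (1 - p_hat) / n)
print(f"{failures} of {n} users (plus or minus {margin * n:.1f}) couldn't complete the task")
```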

2
  • 3
    Sadly, a lot of our work with things like this is to 'cover our ass' so to speak. You can present all the findings to clients that you want, but it's their money and they can choose how to spend it. If they don't want to pay (money or time) to fix issues then all you can do is let them know you think it's going to cause them issues.
    – JonW
    Commented Jan 9, 2020 at 15:31
  • I consider it a victory if I can discover just one new insight - anything that improves the solution is the ultimate goal. Most of the time I learn something new, and these are often the small things that make the difference between "it's ok" and "it's great".
    – SteveD
    Commented Jan 9, 2020 at 15:36
4

From a statistical perspective, the larger your test group, the more accurate your result: as the number of participants increases, your result approaches the true value in the limit. In practice, however, this is impossible to achieve. In my data science courses I often heard that 10,000 people is a reliable number, but that would also be difficult to implement.

My spontaneous idea would be to model your tests on the way neural networks are trained and validated. A possible approach would be, for example, to select 10 **random** people and divide them into two random groups. With the first 5 people you run your tests as usual. You then turn the findings from this first group into a hypothesis and specifically test that hypothesis with the second group. If the second group confirms your results, there is a good chance that other groups would too. If you repeat this procedure several times, you should be able to achieve fairly precise values with a relatively small group of participants.

You can make the result even more precise if you change the group sizes over time. I personally achieved good results with the following values: Group 1 (let's call it the batch group): 70-80% of the people; Group 2 (the test group): 20-30% of the people; 10 runs. Cheers!
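A minimal sketch of that batch/test split (hypothetical participant names; the point is simply the random partition and the confirm-the-hypothesis step):

```python
import random

participants = [f"P{i}" for i in range(1, 11)]  # 10 randomly recruited people
random.shuffle(participants)

batch_group = participants[:7]  # ~70%: run the usual tests, form a hypothesis
test_group = participants[7:]   # ~30%: check whether the hypothesis holds

print("Form hypothesis with:", batch_group)
print("Validate it with:    ", test_group)
```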

2

Most usability questions can be answered with small-batch, qualitative testing, as mentioned above.

Some questions require large samples and quantitative proof to answer. A/B testing uses large samples of live traffic to answer narrow design questions. Many of the client-side A/B test platforms include a stats engine to determine "statistical significance". Optimizely, a leading A/B test platform, includes this handy calculator to figure out how many people you need for proof.

Another way to figure out what constitutes proof is to calculate the margin of error.
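For example, the sample size needed for a given margin of error on a proportion can be sketched with the standard formula below (a simplified illustration; real A/B-test calculators also factor in baseline conversion rate and statistical power):

```python
from math import ceil
from statistics import NormalDist

def sample_size(margin_of_error, confidence=0.95, p=0.5):
    """Smallest n giving the desired margin of error for a proportion.
    p = 0.5 is the worst case and so gives the largest required sample."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * p * (1 - p) / margin_of_error**2)

print(sample_size(0.05))  # ~385 people for a ±5% margin at 95% confidence
print(sample_size(0.01))  # ~9604 people for a ±1% margin
```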

2

Change the way you communicate with stakeholders

If you need evidence to resolve a deadlock between stakeholders, the battle is already lost. Good design is not about quantitative science; it's about getting stakeholders to communicate effectively.

First off, user tests are an informal procedure. They can be very informative to a designer or a design team, but they are not a scientific procedure and the results are strongly subjective. If everybody on the team shares the same view about what the application is supposed to be, and has the experience to interpret a test, they are a great tool. They inform the design process. However, they are not evidence, and if you have stakeholders with opposing views and different agendas, user tests are not the way to deal with that.

There are tests that are rigorous enough to provide genuine evidence, A/B testing for instance, but using these to make design decisions is also a terrible idea. The reason is that you end up optimizing for some arbitrary goal (like conversion) because that goal can easily be expressed as a number, while your other goals, like customer satisfaction, cognitive load, and rate of return, are much harder to measure.

The article Should designers trust their instinct or the data on the Google Ventures blog discusses this in greater detail. It includes the following image:

An extreme example of data-driven vs instinctive design

The checkout button on the right is great for conversion, but that doesn't make it good design, because design is about balancing different goals, not about blindly optimizing the ones that can easily be measured.

The challenge of designing for different stakeholders is not solved by deciding based on data or user testing. It's solved by taking every stakeholder's concerns seriously, mapping them out, investigating them honestly, and finding creative solutions that make everybody happy. In your case, if viewing a user test doesn't convince stakeholder X that Y is a serious issue, find out what problems they have with fixing the issue (time, money, added complexity, lowered security, etc.). Maybe you can come up with a solution that solves the issue without causing the problems X is worried about. Ideally, you solve it in a way that makes everybody happy.

Sometimes viewing a user test brings people on the same page, but if it doesn't, don't point to the user test as evidence.

0

This seems to me to be more of a statistics question than a UX one. If you have a binary classification (issue did or did not arise), then you have a binomial distribution. If you want a lower bound for the "true" probability of the issue arising, you can enter that lower bound into a binomial distribution to get the probability of getting what you saw. So, for instance, if you saw the issue 8 out of 10 times, you can enter 10 as the number of trials, 8 as the number of successes, and 0.1 as p. That will get you the probability of seeing those results if the issue really were present only 10% of the time. You can then say "If the true rate is less than 10%, the probability of seeing the results we got is less than [give number you got]". You could also find the confidence interval. Google or any statistics package (including Excel) should get you what you need.
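A minimal sketch of that calculation (in Python rather than Excel; computing the tail probability P(X ≥ 8), i.e. the chance of a result at least as extreme as the one observed):

```python
from math import comb

def binom_tail(n, k, p):
    """Probability of seeing k or more 'successes' in n trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# If the true issue rate were only 10%, how likely is seeing it in 8 or more of 10 sessions?
print(binom_tail(10, 8, 0.1))  # ~3.7e-7, i.e. vanishingly unlikely by chance
```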

And as an aside, "this is just stupid people" isn't much of an excuse. If you're doing UX, your job is to make an interface that your actual users can use, not one designed around an ideal user who always acts perfectly.

1
  • As I commented on another answer, there are other things to consider here. In order for the math to work you need to be assured that the test units are independent and that the people you choose for the study are reasonably representative of the population of interest.
    – eps
    Commented Jan 10, 2020 at 21:01
