Understanding Performance Regions

Rico Mariani
8 min read · Jun 24, 2018


One of the first things I try to teach people about performance planning is that it’s important to understand how your customers will react to various levels of performance, and to be mindful that your ability to make changes in different regimes will be limited and will definitely have different costs.

I often ask the question, “Do you need an ‘A’ on this metric or will a ‘C’ do the job?” I think it’s a great question because it leads you to ask “What is an ‘A’ anyway? And what’s a ‘C’ for that matter?”

This is all good thinking.

To help motivate all of this, I’ll refer to this hypothetical performance graph, which I’m going to discuss presently.

A hypothetical time metric and corresponding user engagement

For purposes of this discussion it doesn’t matter what time is being measured here. Actually, it doesn’t even need to be time; any resource would do. In fact, a resource is probably a better choice than time for analysis, but no matter, let’s keep it simple. The engagement metric can be a measure of any user behavior that means they are “doing the thing.” Again, it doesn’t matter what “the thing” is for this discussion, but more of it is better. Let’s dig right in.

Some things to notice

  • There is a number, usually not zero, below which there is no data. This is pretty typical and that limit represents the best you can do on the metric in question with the best hardware and circumstances.
  • A good looking time distribution often looks vaguely “log-normal” but shifted because of that minimum value. Sometimes I refer to that as the “speed of light” for the scenario.
  • There is some point in the PDF where user engagement stops getting better as you make things faster. This generally reflects the reality that at some point it’s fast enough and nobody will notice any further speedup.
  • In this particular chart engagement actually starts to go down below a certain value. This is not a mistake, it reflects the fact that in many cases the people that are getting the very best performance are doing so because they are barely using the product. Maybe they tend to visit only toy web pages, or they have only toy text files to edit, or toy solutions to load, or whatever the case may be. The engagement goes down because at least some of those people aren’t really using the product fully. Those usages are artificially fast.
  • There’s a sort of symmetric point beyond which things have gotten so bad that sensitivity to further changes is very low. This roughly corresponds to the people who are using the product only when they absolutely positively have to and at no other time. These people are basically getting hellish performance and are stuck with it.
  • The most common time experience (which is almost certainly not the mean or P50 or anything like that) likely doesn’t correspond to the place where engagement stops getting better. It doesn’t in my picture. It might but that would be a fluke.

OK, so got all that? For the letter labelling you can do this: find the place where getting better stops helping and call that point an ‘A’; find the place where the slope of the engagement curve is sufficiently small (pick a number) and call that an ‘F’; divide what’s left into thirds and you have your letters. Easy.
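To make that concrete, here is a minimal sketch in Python. The cutoffs are purely hypothetical; in practice you would read them off your own engagement curve as described above.

```python
# Hypothetical letter-grade cutoffs (in ms) read off an engagement curve
# like the one above: 'A' is where getting faster stops helping, 'F' is
# where the slope has flattened out, and 'B'/'C'/'D' split the range between.
GRADE_CUTOFFS = [
    ("A", 250),
    ("B", 500),
    ("C", 1000),
    ("D", 2000),
]

def grade(time_ms):
    """Map one measured time to a letter bucket."""
    for letter, cutoff in GRADE_CUTOFFS:
        if time_ms <= cutoff:
            return letter
    return "F"
```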

Why would we do this anyway?

Well, the gold standard in reporting this kind of thing is to literally show the distribution. Annotate it kind of like the above. Maybe show a couple of curves to compare this week to last week or some such and away you go. That would be just lovely. However, people tend to find this unsatisfying. They want to show how things are going with just a few numbers. They want to show trends for many weeks and distributions are awkward for that.

In the system I’m describing above you could give the present status by talking about the percentage of people that are in any given bucket. So five numbers. This is morally equivalent to showing a 5-bar histogram instead of the true distribution.
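Continuing the sketch from above (same hypothetical grade() function), the five numbers are just the share of sessions landing in each bucket:

```python
from collections import Counter

def bucket_report(times_ms):
    """Percent of sessions in each letter bucket, e.g. {'A': 22.0, ...}."""
    counts = Counter(grade(t) for t in times_ms)
    total = len(times_ms)
    return {letter: 100.0 * counts[letter] / total for letter in "ABCDF"}
```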

Wait what? Why not just one statistic?

Well let’s think about this a second. The goal here is to describe succinctly whether we are doing better or worse in any given week. This gives your engineers something to rally behind and your management a very easy goal to think about. Actually one of the most important things you can do with your reporting is make the goal clear and get people to move the right number. Maybe that’s the most important thing.

Generally any one statistic is really problematic here because one statistic won’t incent the right behavior. Let me pick a few popular ones.

Mean: You can commit any crime of variability and keep the mean constant, but a consistent experience is super important. More on this later.

P90: If you report only P90 you can commit any crime you like before or after the P90; as long as that one point stays fixed, you’re fine. For instance, if the best 50% all got somewhat worse, that wouldn’t affect the P90 at all. Or if you moved the P90 down at the expense of, say, P25 and P50, that isn’t good either.

P50: Ibid mostly… improvements in the top half do not register, nor does worsening of the back half.

Percentage of people in the ‘A’ bucket: Not really any different. You can commit any crimes you like against the ‘B’, ‘C’, ‘D’, and ‘F’ buckets without hurting ‘A’.

In fact, any single statistic is highly problematic, which is why I’m always saying “stop with the shortcuts and just look at the distribution already.” But if you have five buckets, well, OK, it’s still possible to do something bad (or something good) and have it not show up, but it’s much less likely because you’re looking at several places at once.
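Here is a toy illustration of the point, reusing the hypothetical grade()/bucket_report() sketch from above. The numbers are invented: two weeks with the same mean, where one week trades consistency for a mix of very fast and very slow sessions. The mean says nothing changed; the buckets disagree.

```python
# Both "weeks" average 950ms, but week 2 is far less consistent.
week1 = [900, 950, 1000] * 100   # tight cluster around ~1 second
week2 = [200, 650, 2000] * 100   # same mean, wild variability

mean = lambda xs: sum(xs) / len(xs)
print(mean(week1), mean(week2))  # 950.0 950.0 -- "no change"
print(bucket_report(week1))      # everyone in 'C'
print(bucket_report(week2))      # a third each in 'A', 'C', and 'D'
```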

What’s so good about this system anyway?

Well, two things. First, it’s hard to cheat. Second, you can apply ‘A’, ‘B’, ‘C’, ‘D’, ‘F’ to pretty much any metric and any kind of engagement. So if you have 20 dissimilar scenarios you can still report them all the same way. People get used to the idea that migration to better buckets is a good thing, and this is not so hard to show. Most good performance changes will have at least some bucket migration. Plus it’s more visceral, in the sense that you get a much stronger sense of whether or not users are actually going to feel the change. “I moved the P75 by 2ms” just doesn’t feel very satisfying.

“Feel the change?” What the hell is that supposed to mean?

I know, I know, this is supposed to be science, right? OK, let me explain that too.

There are many situations where it is easy to show that even a small change in, say, the mean of a metric (average load time or something) directly correlates with improved engagement. The data doesn’t lie; this stuff really happens. But why? How does that even make sense given what we know about big regions of the curve being not very sensitive?

Let me make it concrete: suppose I have a 100ms load time metric and I make a 1% improvement, reducing the average to 99ms. So 1ms better. Generally it’s possible to measure an engagement improvement associated with that change. Maybe it’s 1% more engagement, though that would be just luck. But some improvement.

Now the question is this: how did I get that 1% time improvement? If I got the 1% by literally moving every user to the left on the distribution by 1ms, that would give me my 1%, but would anyone even notice? I doubt it… I mean, for the right metric 1ms is huge, but for a load time metric, probably not so much. But it’s almost never the case that we get improvements of that ilk.

Typical improvements tend to be more like this: “We found [these users] were having [this problem] and we made [that thing] better.” This basically translates to something more like “a bunch of people who were getting a ‘D’ are getting a ‘C’ now”; the net of that is the average moved by 1%. The affected people got a much more noticeable lift. If something like 10% of your users got a 10% improvement, that might be a very small change in the mean, but it would definitely show up in engagement! In fact, that’s generally a much better strategy for doing good performance work.
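A quick back-of-the-envelope check with hypothetical numbers: one user in ten gets 10% faster and everyone else is unchanged. The mean moves by the advertised 1%, but the people who were actually affected see something they can feel.

```python
baseline_ms = 100.0
affected_fraction = 0.10   # 10% of users hit the problem we fixed
improvement = 0.10         # and they got 10% faster

new_mean = baseline_ms * (1 - affected_fraction * improvement)
print(new_mean)                          # 99.0 -- the mean moved only 1%
print(baseline_ms * (1 - improvement))   # 90.0 -- what affected users see
```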

So I should target some of the people?

Well, yes, but actually we’ve still made one important simplification. You can actually target some of the people some of the time. Yes I just said that.

You see, we’ve been talking about users like they get a consistent experience: “This user gets good load time, that user doesn’t.” It’s actually not like that. If I’m using a complex product my experience can be totally different from run to run depending on environmental factors. How many other applications am I running? How good is my network? How full is my disk? Is it Thursday? I never could get the hang of Thursdays.

The point is that a user’s situation is more likely to be something like “I get a ‘B’ 4 times out of 5 and an ‘F’ the other time.” If you can make it so that they get a ‘B’ 5 times out of 6, that actually means they have an engaging session a lot more often. If you did that for some fraction of users you might get a small improvement in the mean time metric, but it’s not just that you moved the mean, it’s how you moved it.
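Putting hypothetical numbers on that: say the user’s ‘B’ sessions take around 500ms and the occasional ‘F’ session takes around 5000ms.

```python
good_ms, bad_ms = 500.0, 5000.0

mean_before = (4 * good_ms + 1 * bad_ms) / 5   # 'B' 4 times out of 5
mean_after = (5 * good_ms + 1 * bad_ms) / 6    # 'B' 5 times out of 6
print(mean_before, mean_after)                 # 1400.0 -> 1250.0 for this user

# If only a few percent of users are in this situation, the overall mean
# barely moves, yet each of them gets one fewer terrible session out of
# every six -- and that is the part they actually notice.
```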

Buckets help you think in the right way

I’ll end on this note: another valuable thing about buckets is that they make you ask questions like “What’s different about the ‘B’ and ‘D’ buckets?” It’s entirely possible that ‘D’ is where things start getting to be disk bound or something like that. If you can find the correlations to underlying engineering metrics, you will then be able to find ways to target those buckets for improvement. This is a far more enlightened approach than trying to move the mean.
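A sketch of how that investigation might start, assuming (hypothetically) that you log an engineering metric such as disk reads alongside each session’s time, and reusing the grade() function from earlier: group the metric by bucket and look for where it jumps.

```python
from collections import defaultdict
from statistics import median

def metric_by_bucket(sessions):
    """sessions: iterable of (time_ms, disk_reads) pairs from your telemetry."""
    groups = defaultdict(list)
    for time_ms, disk_reads in sessions:
        groups[grade(time_ms)].append(disk_reads)
    return {letter: median(vals) for letter, vals in sorted(groups.items())}

# A big jump between the 'C' and 'D' medians would suggest that 'D' is where
# the scenario starts going disk bound -- a concrete place to aim the fix.
```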

Conclusion

  • use distributions for analysis whenever possible
  • use single statistics like never
  • think about how consumption metrics vary by user and by user-session
  • invest where work will make a difference

Be careful out there.

Post Script: If you really want one number, you could do worse than percent getting (A or B) minus percent getting (C, D, or F). Or some other linear blend…
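If you do want that one number, it’s a one-liner on top of the hypothetical bucket_report() sketch above:

```python
def blend_score(report):
    """Percent getting A or B, minus percent getting C, D, or F."""
    return (report["A"] + report["B"]) - (report["C"] + report["D"] + report["F"])
```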

