Trapped in the Present: How engagement bias in short-run experiments can blind you to long-run insights
Brian Karfunkel | Data Scientist, Experimentation and Metrics Sciences
Congratulations, you’ve just finished building a new feature! But is it good? To find out, you launch an experiment: show half the users your new feature and see how they behave compared to the other half. You run this A/B test for a few days (until you get your predetermined sample size) and the feature performs well! You decide to ship it, but just to be sure you run a long-term follow-up: you put a small fraction of users (say, 5%) into a holdout group who won’t be exposed to the treatment, and track them for a month. When the holdout test is over, you’re surprised to find that the lift is much smaller than what you measured in the original experiment. What happened?
Often, this pattern indicates engagement bias: your treatment doesn’t have the same effect on unengaged users as it does on engaged users, but the engaged users are the ones who show up first and therefore dominate the early experiment results¹. If you trust the short-term results without accounting for and trying to mitigate this bias, you risk being trapped in the present: building a product for the users you’ve already activated instead of the users you want to activate in the future.
Not paying enough attention to unengaged users can have negative consequences: First, even if these users constitute a small fraction of daily active users, over time they can have a large impact in aggregate. Second, it’s often much easier to engage existing users by increasing their product comprehension than it is to acquire new, high-intent users, so unengaged users are a huge growth opportunity. Perhaps most importantly, these users — precisely because they have room to become more active users — are best able to teach you how to make your product more useful to more people.
Like many companies, Pinterest uses experiments to make product decisions multiple times a day. But as Jed Bartlet said on The West Wing, “decisions are made by those who show up”: we make decisions based on the users who actually enter our experiments, and that means we have to understand engagement bias if we want to both make Pinterest work better for existing Pinners and also inspire the next generation of Pinners to create the life they love.
Example: Pinner attribution on close up
When a Pinner comes across a Pin that sparks their interest, they might tap on it to get more information; this is called a close up. Normally, beneath the image on the close up view is a section (which we call the attribution) that indicates who saved that Pin and which board they saved it to, along with any description they gave. If the user scrolls down, they’ll see Related Pins: similar ideas they might be interested in.
When a Pinner saves a Pin, it often signifies that they’ve discovered something interesting and want to be able to use that idea in real life; perhaps they’re saving places they want to visit on their next vacation, or dinner recipes they want to try during the week, or a new way to style their hair on their next date night. We know that the more ideas we can show a Pinner, the more likely they are to find something they want to save for later, so if we could enable users to see the Related Pins on every close up, we might be able to help them discover something truly inspiring.
We could set up an experiment to test this hypothesis: by removing the attribution on the close up view, we can move the Related Pins up onto the screen and therefore get more users to save Pins. Here is what the treatment and control experiences might look like:
After running the experiment for a few days, suppose we see a lift of about 1.6% and decide to ship². In order to quantify the magnitude of the lift more precisely, we run a holdout for a long period of time. Each day, the lift estimate from the holdout gets smaller and smaller.
How engagement bias creeps in
To see how engagement bias might have come into play here, let’s suppose users all fall into one of three levels of engagement:
- Core users are active at least several times a week, and regularly save Pins
- Casual users are active most weeks, and occasionally save Pins
- Marginal users are periodically active, don’t have great product comprehension, and don’t often save Pins
For simplicity, we’ll assume all users in an engagement level are identical, with most registered users in the Marginal level and most of the users who perform a close up on a given day being Core:
We use what might be called exposure triggering: a user assigned to the treatment or control group only joins the experiment sample (“triggers into” the experiment) after they are first exposed to the treatment experience or its corresponding control experience³. For our close up experiment, this means that when a user performs a close up, we check whether they are assigned to the treatment or control group (and thus see the treatment or the control experience) and, if they are not already part of the experiment sample, we then log them as triggered into the experiment. This ensures that we are only comparing Pinners who actually saw the treatment close up view with users who would have seen that experience if they had been randomly assigned into the treatment group.
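The triggering logic described above can be sketched in a few lines. This is a minimal illustration, not Pinterest's implementation: the hash-based 50/50 assignment, the `ExposureTrigger` class, and the experiment name are all hypothetical.

```python
import hashlib


def assigned_group(user_id: str, experiment: str) -> str:
    """Deterministically assign a user to a variant (hypothetical 50/50 hash split)."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return "treatment" if int(digest, 16) % 2 == 0 else "control"


class ExposureTrigger:
    """Logs a user into the experiment sample the first time they are exposed."""

    def __init__(self, experiment: str) -> None:
        self.experiment = experiment
        self.triggered = {}  # user_id -> group, recorded once per user

    def on_exposure(self, user_id: str) -> str:
        """Call when a user performs a close up; returns the variant to render."""
        group = assigned_group(user_id, self.experiment)
        # A user can only trigger in once, no matter how many close ups follow.
        if user_id not in self.triggered:
            self.triggered[user_id] = group
        return group


trigger = ExposureTrigger("closeup_no_attribution")  # hypothetical experiment name
first = trigger.on_exposure("user_1")
second = trigger.on_exposure("user_1")
```

Users who never perform a close up never appear in `trigger.triggered`, which is why the sample fills up at different speeds for different engagement levels.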
Let’s simulate how users will enter our experiment, making the assumption that the visitation rates and close up rates in the above table can be modeled as Bernoulli trials: every day, for every user, we flip a (biased) coin to see if that user both visits Pinterest and performs a close up, and each coin flip is independent across days and across users⁴.
For example, if we ramp up our experiment on January 1, with 50% of users allocated to each variant, then after the first day we’d expect (using the above table) 480 Core users to join the experiment, 240 Casual users, and 60 Marginal users: half get triggered into the control group, and half get triggered into the treatment group, but every user who performs a close up gets triggered into one group or the other. On January 2nd, however, there are only 520 Core users who have not already triggered in — nearly half the Core users have already been exposed to the treatment (or control) variant, and they can only trigger in once. Given the above assumption that a user’s behavior on one day is independent of their behavior on the previous day, this means we’d expect only about 250 Core users to trigger in on the second day of the experiment. The number of users who are in our sample over time thus looks like this:
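Under the independence assumption, the expected number of first-time triggers on day d is population × p × (1 − p)^(d − 1), where p is the daily probability that a user visits and performs a close up. The sketch below applies this to the Core figures implied by the example (1,000 users, p = 0.48); the function and constant names are illustrative.

```python
def expected_new_triggers(population: int, daily_rate: float, day: int) -> float:
    """Expected number of users triggering in for the first time on `day` (1-indexed)."""
    # A user triggers on `day` only if they did not trigger on any of the
    # previous (day - 1) days and then do trigger today.
    return population * daily_rate * (1 - daily_rate) ** (day - 1)


CORE_POPULATION = 1000   # implied by 480 joiners plus 520 remaining on day two
CORE_DAILY_RATE = 0.48   # implied by 480 of 1,000 triggering on day one

for day in range(1, 8):
    new_users = expected_new_triggers(CORE_POPULATION, CORE_DAILY_RATE, day)
    print(f"day {day}: about {new_users:.0f} new Core users")
```

Because each day's pool of untriggered users shrinks by the same factor, the daily inflow of Core users decays geometrically.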
After just a week, nearly all the Core users have joined the experiment, but by that point, just a small fraction of the Marginal users have joined. A week later, it’s clear that the number of Casual users joining each day is getting smaller and smaller, too. Even two weeks after starting the experiment, though, the triggering curve for Marginal users still looks nearly linear.
The implications of this are clearer if we look at the share of our experiment’s sample in each of the three engagement levels over time:
Core users are 62% of the sample after one day, but if we run the experiment for two weeks they would be just a third of the sample. Our Core users join quickly, so they are overrepresented in our sample early on, while Casual and Marginal users are underrepresented — this is the heart of engagement bias.
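The shift in sample composition can be reproduced with a short calculation. This is a sketch under stated assumptions: the Core figures (1,000 users, 0.48 daily trigger rate) follow from the example, but the Casual and Marginal populations and rates below are hypothetical values chosen only so that day one yields 240 and 60 joiners. Later-day shares will therefore not match the original figures exactly.

```python
# name: (population, daily probability of visiting and performing a close up)
LEVELS = {
    "Core": (1000, 0.48),      # implied by the text
    "Casual": (2000, 0.12),    # assumption: 2000 * 0.12 = 240 on day one
    "Marginal": (6000, 0.01),  # assumption: 6000 * 0.01 = 60 on day one
}


def expected_triggered(population: int, daily_rate: float, days: int) -> float:
    """Expected cumulative number of users triggered in after `days` days."""
    return population * (1 - (1 - daily_rate) ** days)


def sample_shares(days: int) -> dict:
    """Share of the experiment sample in each engagement level after `days` days."""
    counts = {
        name: expected_triggered(population, rate, days)
        for name, (population, rate) in LEVELS.items()
    }
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()}


for days in (1, 7, 14, 28):
    shares = sample_shares(days)
    print(days, {name: f"{share:.0%}" for name, share in shares.items()})
```

Whatever the exact inputs, the Core share starts far above its share of the user base and drifts down toward it as the slower levels trickle in.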
Why does engagement bias matter?
Let’s suppose our experiment has different effects for users in different engagement levels. Perhaps for Core users, who often use Related Pins, the attribution isn’t really needed. Maybe for Marginal users, however, the attribution block helps them understand what a Pin is: they’re able to see that users can save Pins to their own boards, and in so doing create their very own copy of that Pin, perhaps with their own description. Thus, for these users, removing the attribution block might substantially reduce their save propensity (especially since they already save much less frequently than Core users) because the drop in comprehension negates the positive effects of being exposed to more ideas in the Related Pins.
Let’s model what we would see if this were true, where treatment effects are positive for Core users, flat for Casual users, and negative for Marginal users (and, as noted above, we assume all users within the same engagement level are identical⁵):
Every day we run our experiment, the sample gets closer and closer to being representative of our whole user base: the share of Core users goes down and the share of Marginal users goes up. This means that the estimate for the treatment effect also becomes more and more representative of the effect across all users:
At first, the treatment looks positive because we are mostly looking at Core users, for whom we’re removing barriers to finding Pins they want to save. But as we slowly add Marginal users, the fact that we are hurting their comprehension, and therefore propensity to save, gathers more and more weight. What if we kept running the experiment until nearly everyone was triggered in?
In the short term, the experiment shows a positive lift, but that’s only because our early sample is skewed to our more engaged users. In the long term, the lift eventually becomes negative, since we are now observing the negative effects on our less-engaged users. In other words, if we care about our unengaged users⁶, then looking only at the initial experiment results would lead to making the wrong decision and would hamper efforts to turn users who are Marginal today into more engaged users in the future.
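The sign flip can be illustrated with a small model of the blended lift. Every number here other than the Core population and trigger rate is a hypothetical assumption (populations, trigger rates, control save rates, and relative treatment effects), so the output shows the shape of the effect rather than the article's actual figures. The sketch also simplifies by letting each triggered user contribute their level's save rate regardless of how long they have been in the experiment.

```python
# name: (population, daily trigger rate, control save rate, relative treatment effect)
LEVELS = {
    "Core": (1000, 0.48, 0.30, 0.02),       # save rate and effect are assumptions
    "Casual": (2000, 0.12, 0.15, 0.00),     # all values are assumptions
    "Marginal": (6000, 0.01, 0.05, -0.05),  # all values are assumptions
}


def blended_lift(days: int) -> float:
    """Expected relative lift measured on the sample triggered in after `days` days."""
    control_saves = 0.0
    treatment_saves = 0.0
    for population, trigger_rate, save_rate, effect in LEVELS.values():
        # Expected number of this level's users in the sample so far.
        triggered = population * (1 - (1 - trigger_rate) ** days)
        control_saves += triggered * save_rate
        treatment_saves += triggered * save_rate * (1 + effect)
    return treatment_saves / control_saves - 1


for days in (1, 7, 14, 30, 90, 365):
    print(f"after {days:>3} days: lift = {blended_lift(days):+.2%}")
```

With these assumed inputs the lift starts positive, shrinks as Marginal users accumulate, and ends up negative once most of the user base has triggered in.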
How to mitigate engagement bias
The simplest way to limit engagement bias is to just run experiments longer: if you run experiments for a month, they will be less biased than if you run them for a week. But how long can you afford to run your experiments given the detrimental impact on experiment velocity? That depends on the population you care about. If you’re trying to build features that deepen the engagement of already-engaged users, you might not care enough about Marginal users to make the slower iteration worthwhile. On the other hand, if you are focusing on growth and trying to attract more users, then you might be willing to trade off speed for a better understanding of those users you are trying to reach.
Another method is to run separate experiments for each engagement level whenever you have reason to believe that the treatment effect will be correlated with engagement. Instead of trying to make a single decision that works for everyone, you can make a series of more nuanced decisions so that you can ship the best variant for each group. For Core users, who enter experiments rapidly, you can gather a sufficient sample quite quickly, and thus iterate quickly as well. Experiments targeting Marginal users might take longer to reach reasonable power, but waiting for signal from these users no longer blocks development on new features for more engaged users.
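To see why a per-level readout can be more informative than a pooled one, consider the following sketch. The counts are invented for illustration, and the helper names are hypothetical; the point is only that a small pooled lift can hide effects that differ in sign across engagement levels.

```python
def lift(control: tuple, treatment: tuple) -> float:
    """Relative lift in save rate; each argument is (savers, users)."""
    control_rate = control[0] / control[1]
    treatment_rate = treatment[0] / treatment[1]
    return treatment_rate / control_rate - 1


def lift_by_level(counts: dict) -> dict:
    """Lift computed separately within each engagement level."""
    return {
        level: lift(groups["control"], groups["treatment"])
        for level, groups in counts.items()
    }


def pooled_lift(counts: dict) -> float:
    """Lift computed after summing savers and users across all levels."""
    control = tuple(sum(g["control"][i] for g in counts.values()) for i in (0, 1))
    treatment = tuple(sum(g["treatment"][i] for g in counts.values()) for i in (0, 1))
    return lift(control, treatment)


# Hypothetical (savers, users) counts per level and variant.
EXAMPLE_COUNTS = {
    "Core": {"control": (300, 1000), "treatment": (306, 1000)},
    "Casual": {"control": (150, 1000), "treatment": (150, 1000)},
    "Marginal": {"control": (50, 1000), "treatment": (45, 1000)},
}

for level, value in lift_by_level(EXAMPLE_COUNTS).items():
    print(f"{level}: {value:+.1%}")
print(f"Pooled: {pooled_lift(EXAMPLE_COUNTS):+.1%}")
```

A real analysis would also need confidence intervals for each level, since the Marginal estimate rests on far fewer savers than the Core one.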
If your product is stable enough to develop reasonable priors about how treatment effects differ for users in different engagement levels, you might be able to use Bayesian methods to make reasonable corrections for engagement bias by modeling each engagement level separately. Or, you could use a method like Heckman correction to account for the bias in the experiment sample. For most teams running experiments, though, the most important thing is not trying to model away engagement bias, but to understand what it is and how it can impact decision making.
Engagement bias occurs when a treatment affects more-engaged users differently from less-engaged users, and when users enter an experiment sample at different rates based on their level of engagement. It means that early experiment results generally reflect the treatment’s effects on the most active users, and, over time, the estimate of the lift caused by the treatment will approach the overall effect you’d expect to see after shipping. While running experiments long enough to truly eliminate engagement bias is often not feasible, merely being aware of it can help teams think more intelligently about the effects on unengaged users and understand why experiments that seem to have large treatment effects, when shipped and evaluated over the long run, don’t produce the anticipated impact.
Acknowledgments: Thanks to Bo Zhao for his help developing the methodology, and to Pamela Clevenger, Ashim Datta, Phoebe Liu, Aidan Crook, Sarthak Shah, Mia Sandler, and Malorie Lucich for their feedback and editing assistance.
¹ This is generally only a problem when you are examining a per user metric, such as share of users performing a certain action (e.g. conversion rate) or an average across a set of users (e.g. average time spent on site per monthly active user). For volume metrics, like total revenue or number of conversions, engagement bias is usually not a problem, at least when it comes to predicting an experiment’s impact once it’s shipped, because the metric is fungible: if seven unengaged users each create $1 less in revenue once a week but one engaged user creates $1 more in revenue every day, the changes in weekly revenue offset even though there are more unengaged users affected.
² Even if you don’t typically run long-term holdout experiments to confirm results, you may experience a similarly puzzling pattern of lift that starts high and decreases over time if you’re peeking at experiment results while running them. Looking at the lift before achieving the predetermined sample size, although common, is, of course, a big no-no if you use those results to decide to end experiments early (or run them longer). You can use techniques like sequential testing or various Bayesian methods to mitigate (although not eliminate) the problems of peeking if you want to allow early results to influence your decision making.
³ The alternative might be called intent-to-treat triggering: every user is triggered in as soon as they are allocated to one group or another, even if they never actually see the experimental feature change. Intent-to-treat triggering makes it easier to extrapolate the effects for the whole population if the treatment were to be shipped, but it requires more care when performing analysis; otherwise, the treatment effect’s signal can be diluted by all the users whose experience is identical regardless of their group assignment, making it difficult to detect any effect at all.
⁴ For example, if a Core user doesn’t visit on Tuesday, it doesn’t make them more likely to visit Wednesday, and vice versa.
⁵ This assumption is, of course, not true in practice: engagement is a spectrum rather than a set of discrete levels. Thus, even if we had an experiment specifically targeting Marginal users, we would, in the first few days, get a sample where the most active Marginal users are overrepresented and the least active Marginal users are underrepresented.
⁶ While growth (reaching new users and helping them find value in your product) is often a key goal for product-driven companies, it is not clear that all companies should value making lots of Marginal users slightly more engaged as much as they value making a smaller number of Core users much more engaged. In practice, you can’t build a product that does everything for everyone, and so you have to root product development in your mission and a coherent set of priorities and values.