This is a series of articles born after reading “Trustworthy Online Controlled Experiments” by Ron Kohavi, Diane Tang, Ya Xu.
In this three-part series, I delve into key takeaways from the book, exploring several topics:
- Organizational Metrics
- Complementary and Alternative Techniques for Controlled Experiments
- Advanced Topics for Analyzing Experiments
Complementary and Alternative Techniques to Controlled Experiments
Complementary Techniques
To illustrate the use of complementary methods, consider how you might assess user satisfaction—a notably challenging concept to quantify. You could conduct a survey to collect self-reported data on user satisfaction, then examine the data from instrumented logs to identify which large-scale observational metrics correlate with the survey findings. This approach could be further refined by conducting controlled experiments to confirm the accuracy of the suggested proxy metrics.
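To make this concrete, here is a minimal Python sketch (with made-up data and hypothetical column names) of the correlation step: joining survey responses with log-based metrics and seeing which candidate metrics track self-reported satisfaction well enough to be worth validating as proxies.

```python
# A minimal sketch with hypothetical columns: correlate self-reported
# satisfaction from a survey with candidate log-based metrics to find
# proxy-metric candidates worth validating in a controlled experiment.
import pandas as pd

# Assume one row per surveyed user, joined with their logged behavior.
df = pd.DataFrame({
    "satisfaction": [5, 3, 4, 2, 5, 1, 4, 3],         # survey answer (1-5)
    "sessions_per_week": [9, 4, 7, 2, 10, 1, 6, 3],   # from instrumented logs
    "error_rate": [0.01, 0.08, 0.02, 0.12, 0.01, 0.20, 0.03, 0.09],
})

# A rank-based correlation is robust to the ordinal survey scale.
for metric in ["sessions_per_week", "error_rate"]:
    rho = df["satisfaction"].corr(df[metric], method="spearman")
    print(f"{metric}: Spearman rho = {rho:.2f}")
```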
You also need to keep in mind that the opinions expressed by customers in surveys or focus group discussions may not always reflect their actual preferences. This discrepancy was famously highlighted by Philips Electronics during a focus group aimed at understanding what features teenagers valued in boom boxes. Participants in the focus group overwhelmingly favored yellow boom boxes, labeling the black ones as “conservative”. However, when given the opportunity to choose a boom box to take home as a thank you for their participation, the majority opted for black.
Causal Studies
The term observational (causal) study usually refers to research where there is no manipulation of units, while quasi-experimental design refers to studies in which units are allocated to different variants, but the allocation is not randomized.
We make a clear distinction between observational causal studies and general observational or retrospective data analyses, though both rely on historical log data. The primary aim of an observational causal study is to approximate a causal outcome as accurately as possible. On the other hand, retrospective data analyses serve a variety of purposes. These include summarizing data distributions, assessing the frequency of certain behavioral patterns, exploring potential metrics, and identifying patterns that could indicate hypotheses for future testing in controlled experiments.
Interrupted Time Series
A more straightforward version of the Interrupted Time Series (ITS) involves adding a Treatment, then removing it, and possibly repeating this process several times. For instance, to estimate how police helicopter surveillance impacts home burglaries, researchers applied and removed surveillance multiple times over a few months. They observed that burglaries decreased whenever the helicopter surveillance was in place and increased when it was removed.
Sophisticated methods, like Bayesian Structural Time Series analysis, are often needed to understand the effects in an ITS. Another important aspect to consider in ITS is how it affects the user experience. If users notice their experience constantly changing, this inconsistency might annoy or frustrate them. In such cases, any observed effects might be due to the inconsistency itself, rather than the changes being tested.
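As a rough sketch of the ITS idea (not the book's exact method), one can fit a time-series model to the pre-intervention period, forecast a counterfactual for the post period, and read the effect as the gap between observed and forecast values. The example below uses the frequentist UnobservedComponents model from statsmodels as a stand-in for a full Bayesian structural model, on synthetic data.

```python
# A rough ITS sketch on synthetic data: fit a structural time-series model
# to the pre-intervention period, forecast a counterfactual for the post
# period, and estimate the effect as observed minus forecast.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
pre = 60 + rng.normal(0, 3, size=90)    # 90 days before the change
post = 50 + rng.normal(0, 3, size=30)   # 30 days after (true drop of ~10)

# Local-level model as a simple stand-in for a Bayesian structural model.
model = sm.tsa.UnobservedComponents(pre, level="local level")
fit = model.fit(disp=False)

counterfactual = fit.get_forecast(steps=len(post)).predicted_mean
effect = post - counterfactual
print(f"Estimated average effect: {effect.mean():.1f}")
```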
Interleaved Experiment Design
The interleaved experiment design is often used to test changes in ranking algorithms, like those in search engines or website searches. In this design, you compare two algorithms, X and Y. Suppose algorithm X displays results as x1, x2, …, xn, and Y displays them as y1, y2, …, yn. In an interleaved experiment, these results are mixed together in a sequence like x1, y1, x2, y2, …, xn, yn, making sure to remove any duplicates. To evaluate which algorithm is better, you could look at which set of results gets more clicks.
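The sketch below is a deliberately simplified interleaving (production systems use variants such as team-draft interleaving to handle ties and position bias more carefully): it alternates results from the two rankers, skips duplicates, and credits each ranker with the clicks its results receive.

```python
# A simplified interleaving sketch: merge two ranked lists x1, y1, x2, y2, ...
# without duplicates, then credit each ranker with the clicks its results got.
def interleave(results_x, results_y):
    """Merge two ranked lists, alternating items and skipping duplicates."""
    merged, owner, seen = [], {}, set()
    for x, y in zip(results_x, results_y):
        for item, ranker in ((x, "X"), (y, "Y")):
            if item not in seen:
                seen.add(item)
                merged.append(item)
                owner[item] = ranker
    return merged, owner

def score_clicks(clicked_items, owner):
    """Count how many clicked results came from each ranker."""
    counts = {"X": 0, "Y": 0}
    for item in clicked_items:
        if item in owner:
            counts[owner[item]] += 1
    return counts

merged, owner = interleave(["a", "b", "c"], ["b", "d", "c"])
print(merged)                            # ['a', 'b', 'd', 'c']
print(score_clicks(["a", "d"], owner))   # {'X': 1, 'Y': 1}
```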
Instrumental Variables
Instrumental Variables (IV) is a method used to approximate random assignment in studies. For instance, to compare earnings between veterans and non-veterans, the Vietnam War draft lottery, which randomly determined who would be drafted into the military, can serve as an instrument. Similarly, since charter school seats are often assigned by lottery, the lottery can be a useful IV in certain studies. In these cases, the lottery does not guarantee participation, but it strongly influences it. To estimate the effect, researchers typically use a two-stage least-squares regression model.
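Below is a hedged numpy sketch of two-stage least squares with a binary instrument (think of a draft-lottery indicator), on synthetic data; a real analysis would use a dedicated econometrics package and include covariates.

```python
# Two-stage least squares (2SLS) on synthetic data with a binary instrument.
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
instrument = rng.integers(0, 2, n)       # e.g. selected by the lottery
confounder = rng.normal(size=n)          # unobserved in practice
# The instrument shifts participation but is independent of the confounder.
treatment = (0.8 * instrument + 0.5 * confounder + rng.normal(size=n) > 0.6).astype(float)
earnings = 2.0 * treatment + 1.5 * confounder + rng.normal(size=n)

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

X1 = np.column_stack([np.ones(n), instrument])
treatment_hat = X1 @ ols(X1, treatment)            # stage 1: predict treatment
X2 = np.column_stack([np.ones(n), treatment_hat])
beta = ols(X2, earnings)                           # stage 2: effect estimate
print(f"2SLS estimate of the treatment effect: {beta[1]:.2f}")  # close to 2.0

# Naive OLS on the actual treatment is biased by the confounder:
X_naive = np.column_stack([np.ones(n), treatment])
print(f"Naive OLS estimate: {ols(X_naive, earnings)[1]:.2f}")
```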
In some cases, natural experiments arise that come close to random assignment. A good example is in medicine, where studies of monozygotic twins can be treated as natural experiments.
Propensity Score Matching
Another method involves constructing similar Control and Treatment groups. This is often done by dividing users into segments based on common factors, similar to stratified sampling. The approach can be refined with propensity score matching (PSM). PSM differs in that it matches units on a single constructed number, the propensity score (the estimated probability of a unit being in the Treatment given its observed covariates), instead of matching them on each characteristic separately.
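A minimal scikit-learn sketch of the matching step, on synthetic data: estimate each user's probability of being in the Treatment from observed covariates, then pair every treated user with the control user whose propensity score is closest.

```python
# A minimal propensity-score-matching sketch on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
n = 2_000
covariates = rng.normal(size=(n, 3))     # e.g. age, tenure, prior usage
treated = rng.binomial(1, 1 / (1 + np.exp(-covariates @ [0.8, -0.5, 0.3])))
outcome = 1.0 * treated + covariates @ [0.5, 0.2, 0.1] + rng.normal(size=n)

# Propensity score: P(treated | covariates), estimated by logistic regression.
score = LogisticRegression().fit(covariates, treated).predict_proba(covariates)[:, 1]

treat_idx, control_idx = np.where(treated == 1)[0], np.where(treated == 0)[0]
nn = NearestNeighbors(n_neighbors=1).fit(score[control_idx].reshape(-1, 1))
_, match = nn.kneighbors(score[treat_idx].reshape(-1, 1))
matched_controls = control_idx[match.ravel()]

effect = outcome[treat_idx].mean() - outcome[matched_controls].mean()
print(f"Matched estimate of the Treatment effect: {effect:.2f}")  # true effect is 1.0
```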
Difference in Differences
A method called difference in differences (DD or DID) is used to assess the impact of a Treatment by comparing how a metric changes over time in the Treatment group versus a Control group. It rests on the parallel-trends assumption: even though the groups may differ in level without the Treatment, their trends would have moved in parallel, so any divergence in trends after the Treatment is attributed to it. This technique is often used in experiments based on geographic regions, as in the sketch below.
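With made-up numbers, the back-of-the-envelope calculation looks like this:

```python
# A difference-in-differences sketch with made-up numbers: compare how the
# Treatment region's metric changed before vs. after launch against the
# change in a comparable control region.
treatment_pre, treatment_post = 100.0, 130.0   # e.g. weekly orders per region
control_pre, control_post = 90.0, 105.0

treatment_change = treatment_post - treatment_pre   # 30
control_change = control_post - control_pre         # 15 (the common trend)
did_estimate = treatment_change - control_change    # 15 attributed to the launch

print(f"Difference-in-differences estimate: {did_estimate:.1f}")
```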
Pitfalls
A common issue in research is an unrecognized common cause. Take Microsoft Office 365 as an example: it seems users who encounter more errors are less likely to stop using it. However, this doesn’t mean that increasing errors will reduce customer churn. The observed correlation is due to a common factor: usage.
Drawing causal conclusions from observational (uncontrolled) data requires multiple assumptions that are hard to test and can be easily broken. Stanley Young and Alan Karr (2011) compared findings from medical research based on observational causal studies (uncontrolled) with results from more reliable randomized clinical trials. They looked at 52 claims from 12 papers and found that none of these claims were replicated in the randomized trials. In fact, in 5 out of 52 cases, the results were significantly opposite to those of the observational studies. Their conclusion was clear: “Any claim coming from an observational study is most likely to be wrong”.
Therefore, controlled experiments remain the scientific gold standard for establishing causality.