Organizational Metrics for Online Controlled Experiments

This series of articles grew out of reading “Trustworthy Online Controlled Experiments” by Ron Kohavi, Diane Tang, and Ya Xu.

In this three-part series, I delve into key takeaways from the book, exploring several topics:

Organizational Metrics
Complementary and Alternative Techniques for Controlled Experiments
Advanced Topics for Analyzing Experiments

If you can’t measure it, you can’t improve it − Peter Drucker

When an organization reaches the run and fly phases of experimentation maturity, it often has to manage a multitude of metrics, sometimes numbering in the thousands. At that point it helps to group metrics by tier (company-wide, product-specific, feature-specific) or by function (OEC, short for Overall Evaluation Criterion, goal, guardrail, quality, debug).

Metrics Taxonomy

Goal metrics

Goal metrics, often referred to as success metrics or true north metrics, represent the core priorities of the organization.

Goal metrics typically consist of a single metric or a very small set of metrics that most accurately capture the ultimate success the organization is striving for. They may be hard to move in the short term, either because each initiative has only a minimal impact on them or because their effects take a considerable amount of time to become evident.

When defining a goal metric, start by expressing your objectives in plain language. Why does your product or service exist, and what does success mean for your organization? The organization’s leaders should take an active part in answering these questions, and the answers are often closely linked to the organization’s mission statement.

Example: Long-term Revenue, Capped Revenue, CTR

Driver metrics

Driver metrics, also known as signpost or surrogate metrics, are typically shorter-term and more sensitive than goal metrics. They represent hypotheses about the factors driving organizational success, rather than just its definition.

For instance, the bounce rate, the share of visits that end almost immediately, can signal dissatisfaction once a threshold defines what counts as a bounce (e.g., 1 pageview or 20 seconds). Another example is Netflix, which uses bucketized watch hours as a driver metric because the buckets are interpretable and linked to long-term user retention.
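As a rough illustration of the bounce-rate example, the sketch below computes the metric from a list of sessions. The Session type, its field names, and the choice to treat a session as a bounce when it falls under either threshold are assumptions made for this example, not definitions from the book.

```python
from dataclasses import dataclass


@dataclass
class Session:
    # Hypothetical session record; field names are illustrative.
    pageviews: int
    duration_seconds: float


def is_bounce(session, max_pageviews=1, max_seconds=20.0):
    """Assumed rule: a bounce is a session at or under either threshold."""
    return (session.pageviews <= max_pageviews
            or session.duration_seconds <= max_seconds)


def bounce_rate(sessions, max_pageviews=1, max_seconds=20.0):
    """Share of sessions classified as bounces."""
    if not sessions:
        raise ValueError("bounce rate is undefined for zero sessions")
    bounces = sum(is_bounce(s, max_pageviews, max_seconds) for s in sessions)
    return bounces / len(sessions)


sessions = [
    Session(pageviews=1, duration_seconds=5),    # bounce: single pageview
    Session(pageviews=3, duration_seconds=120),  # engaged visit
    Session(pageviews=2, duration_seconds=10),   # bounce: under 20 seconds
    Session(pageviews=5, duration_seconds=300),  # engaged visit
]
print(bounce_rate(sessions))
```

Keeping the thresholds as parameters makes it easy to check how sensitive the metric is to the chosen definition.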

Frameworks like HEART (Happiness, Engagement, Adoption, Retention, Task Success) and PIRATE (Acquisition, Activation, Retention, Referral, Revenue) help to derive driver metrics.

Ensure that driver metrics are:

Aligned with the goal: Confirm that these metrics genuinely drive success. Experimentation can help validate this alignment.
Actionable and relevant: Teams must feel that they can influence metrics through actions like product feature improvements.
Sensitive: Driver metrics should serve as early indicators of goal metrics, detecting the impact of most initiatives effectively.
Resistant to manipulation: Design driver and goal metrics to be resistant to manipulation, considering the potential incentives and behavioral effects.

In selecting metrics, prioritize those that reflect user value and user actions over vanity metrics. Avoid counting your own actions that users often ignore (for example, the number of banners shown), and focus instead on indicators of genuine user interest (for example, clicks).

Example: Bounce Rate, Bucketized Watch Hours, Session Success Rate

Guardrail metrics

Guardrail metrics guard against violated assumptions and come in two types: metrics that protect the business and metrics that assess the trustworthiness and internal validity of experiment results.

Defining these metrics is crucial for determining what aspects of the organization remain unchanged, as strategies necessitate trade-offs and decisions about what not to alter.

Revenue-per-user has high statistical variability, which makes it a poor guardrail. More sensitive alternatives include revenue indicator-per-user (did the user generate any revenue, yes/no), capped revenue-per-user (revenue above a certain threshold is capped at that threshold), and revenue-per-page (there are more page units than user units). With revenue-per-page, take care to compute the variance correctly, since pages are typically not the unit of randomization.
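The two per-user alternatives can be sketched in a few lines. The revenue figures and the cap of 100 below are made up for illustration; in practice the cap would be chosen from your own revenue distribution.

```python
from statistics import variance


def revenue_indicator(revenue_per_user):
    """1 if the user generated any revenue, 0 otherwise."""
    return [1 if r > 0 else 0 for r in revenue_per_user]


def capped_revenue(revenue_per_user, cap):
    """Revenue per user with values above `cap` replaced by `cap`."""
    return [min(r, cap) for r in revenue_per_user]


# Hypothetical per-user revenue with one extreme purchaser.
revenue = [0, 0, 0, 12, 0, 25, 0, 0, 4000, 18]

indicator = revenue_indicator(revenue)
capped = capped_revenue(revenue, cap=100)

# A single outlier dominates the raw variance; capping removes most of it.
print(variance(revenue), variance(capped), variance(indicator))
```

Lower variance means a smaller effect can be detected with the same number of users, which is what makes these variants more sensitive.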

Example: Page Load Time, Crash Rate, Abandonment Rate

Data quality metrics

Data quality metrics ensure the internal validity and trustworthiness of the underlying experiments.

An example data quality metric is Sample Ratio Mismatch (SRM). SRMs often lead to incorrect results and are well documented: at Microsoft, for example, about 6% of experiments exhibited an SRM.
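A common way to detect an SRM is a chi-squared goodness-of-fit test on the number of users in each variant. The sketch below covers the two-variant case using only the standard library; the function name, the user counts, and the 0.001 alert threshold are assumptions for illustration.

```python
import math


def srm_p_value(control_users, treatment_users, expected_control_share=0.5):
    """Chi-squared goodness-of-fit p-value for a two-variant split.

    With two variants the test has one degree of freedom, so the
    p-value can be written with the complementary error function.
    """
    total = control_users + treatment_users
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = ((control_users - expected_control) ** 2 / expected_control
            + (treatment_users - expected_treatment) ** 2 / expected_treatment)
    return math.erfc(math.sqrt(chi2 / 2))


# Hypothetical counts for an experiment designed as a 50/50 split.
p_value = srm_p_value(control_users=821_588, treatment_users=815_482)

# Assumed alerting rule: a very small p-value suggests a sample ratio mismatch.
if p_value < 0.001:
    print("Possible SRM: investigate before trusting the results")
```

Even a split that looks close to 50/50 can fail this check when the sample is large, which is why the test is worth automating.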

Example: Sample Ratio Mismatch (SRM), Error Rates, Data Loss Rates

Debug metrics

Diagnosis or debug metrics are essential for troubleshooting issues indicated by goal, driver, or guardrail metrics.

For example, if click-through rate (CTR) is a key metric, you might need numerous metrics to track clicks in specific page areas. Similarly, you can break revenue down into a revenue indicator (0/1, indicating whether the user purchased) and conditional revenue (actual revenue for purchasers, null otherwise). The two metrics provide distinct insights into revenue, even though average revenue per user is the product of the purchase rate and the average conditional revenue.
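The revenue breakdown can be checked with a small example. The function name and the revenue values are hypothetical; the point is only that the purchase rate multiplied by the average revenue among purchasers recovers the overall average.

```python
import math


def revenue_breakdown(revenue_per_user):
    """Return (purchase_rate, conditional_revenue, average_revenue).

    conditional_revenue averages over purchasers only and is None
    when nobody purchased.
    """
    n = len(revenue_per_user)
    if n == 0:
        raise ValueError("need at least one user")
    purchasers = [r for r in revenue_per_user if r > 0]
    purchase_rate = len(purchasers) / n
    conditional = sum(purchasers) / len(purchasers) if purchasers else None
    average = sum(revenue_per_user) / n
    return purchase_rate, conditional, average


# Hypothetical per-user revenue: two purchasers out of five users.
rate, conditional, average = revenue_breakdown([0, 0, 10, 30, 0])

# Average revenue per user equals purchase rate times conditional revenue.
assert math.isclose(rate * conditional, average)
```

When overall revenue moves, looking at the two factors separately shows whether more users purchased or whether purchasers spent more.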

Example: Number of Purchases, Average Purchase Amount

Useful tips

Utilize a trusted historical experiment dataset to assess new metrics, focusing on sensitivity and causal alignment.
Engage in metric discussions to ensure clear goal articulation and alignment, recognizing that metrics may evolve over time with organizational growth and improved understanding.
Teams may adapt their goal, driver, and guardrail metrics, with some, like infrastructure teams, opting for performance or organizational guardrails.
Consider using “success” metrics, such as purchase, and time-to-success as sensitive key metrics for experimentation.
Be cautious about optimizing metrics such as time-on-site without specifying successful sessions: a slower site can increase time-on-site in the short term while driving users to abandon it in the long term.
Consider practical significance when analyzing metric results. If a result is statistically significant but not practically significant, you can be confident about the magnitude of the change, but that magnitude may not be large enough to outweigh other factors such as cost, so the change may not be worth launching.
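One way to make the last tip concrete is to compare the confidence interval of the observed difference with a practical significance threshold. The sketch below assumes a normal approximation, a known standard error, and a symmetric threshold; the function name, labels, and numbers are illustrative.

```python
from statistics import NormalDist


def classify_result(delta, std_error, practical_threshold, alpha=0.05):
    """Label a treatment effect using its confidence interval.

    delta: observed difference between treatment and control.
    std_error: standard error of that difference.
    practical_threshold: smallest absolute change considered worth launching.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    low, high = delta - z * std_error, delta + z * std_error

    if low <= 0 <= high:
        return "not statistically significant"
    if low >= practical_threshold or high <= -practical_threshold:
        return "statistically and practically significant"
    if -practical_threshold < low and high < practical_threshold:
        return "statistically but not practically significant"
    return "statistically significant, practical significance unclear"


# A precisely measured but small lift, below an assumed threshold of 1.0.
print(classify_result(delta=0.2, std_error=0.05, practical_threshold=1.0))
```

The last label covers intervals that straddle the threshold, where running the experiment longer may be more useful than deciding immediately.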