Auditing the Public — Work with John Holbein

In Block, Crabtree, Holbein, and Monson (2021, PNAS), we emailed 250,000 Americans drawn from public voter registration lists with a simple, low-stakes request: help with a short survey. Each recipient could receive a request from an ostensibly white or an ostensibly Black sender. The public was systematically less likely to reply to the Black sender. The gap appeared among nearly every racial and ethnic group and in every region of the country. This is everyday discrimination, measured directly in the mass public rather than at an institutional gate. Each dot below stands for about 250 recipients, so the full grid is roughly the whole sample.

[Interactive figure: dot grid of recipients, showing the sender signal (white-sounding or Black-sounding name), the reply count, and the reply rate.]

These are the real reply rates from the study: 1.6% for white-sounding senders and 1.4% for Black-sounding senders. The per-email gap is deliberately small here because it is small in the data. In odds terms, though, a Black-sounding sender's odds of getting any reply were 15.5% lower, and applied across a quarter of a million contacts, that subtle bias accumulates.

Treatment effect heterogeneity

The average effect hides as much as it shows

A single number, the average callback gap, is the least interesting thing an audit can produce. The harder and more useful question is where the effect is large, where it vanishes, and where it reverses. In Gaddis, Crabtree, Holbein, and Pfaff, we ran a correspondence audit of 52,792 US public school principals across 33 states and found exactly this kind of variation. Pooled across every minority family, the gap would look modest, but the average is a fiction. Hispanic families faced a 3.7-point reply penalty and Chinese American families a 10.7-point penalty, while Black families on average faced almost none: a near-zero 1.0 points, not statistically distinguishable from no gap at all.

The catch is statistical. Detecting that variation takes far more data than detecting the average. An interaction or subgroup effect is estimated with much less precision than a main effect, so the sample you need to find heterogeneity is several times larger than the sample you need to find the average gap. This is also why the Black null is not a clean bill of health: in the same study, Black families did face significant discrimination once they signaled high resource needs, a subgroup the average buries. The bars below are the real per-group effects.

[Interactive figure: reply-probability gap vs. an identical White family, by group (points), with the largest and smallest gaps labeled.]

Real estimates from Gaddis, Crabtree, Holbein, and Pfaff's audit of 52,792 principals. One average across these three groups would land near −5 points and describe none of them: Chinese American families carry a 10.7-point penalty while the Black-family average is a non-significant 1.0. A study powered only for the pooled mean cannot recover this.
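The "near −5 points" figure is plain arithmetic on the three estimates quoted above. A minimal sketch, assuming an unweighted mean across groups (a real pooled estimate would weight by each group's sample size, which is not given here); the variable names are illustrative only:

```python
# Per-group reply-probability gaps in percentage points, as quoted in the text.
gaps = {"Hispanic": -3.7, "Chinese American": -10.7, "Black": -1.0}

# Unweighted mean across the three groups (an assumption; see lead-in).
pooled = sum(gaps.values()) / len(gaps)

# The most negative gap is the largest penalty; the least negative is the smallest.
largest = min(gaps, key=gaps.get)
smallest = max(gaps, key=gaps.get)

print(f"pooled gap: {pooled:.1f} points")
print(f"largest penalty: {largest} ({gaps[largest]} points)")
print(f"smallest penalty: {smallest} ({gaps[smallest]} points)")
```

The pooled value sits far from every one of the three group estimates, which is the point of the caption.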

Try it

How big a sample do you actually need?

A worked demonstration, not study data: the effect sizes are fixed at plausible audit magnitudes and the intervals are computed live from the standard-error formula as you drag. Watch two confidence intervals: one for the average gap, one for a subgroup difference (heterogeneity). A bar turns solid when its interval clears zero, meaning the effect is detectable. Notice how much later the heterogeneity bar gets there — that lag is the real lesson, and it is arithmetic, not an estimate.

[Interactive slider: applications per arm, shown at 2,000.]
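The same arithmetic can be sketched offline. This is not the page's own code: it assumes a 10% control callback rate, a 3-point average gap, and two equal-sized subgroups whose gaps differ by 3 points, all chosen for illustration, and it uses the usual normal-approximation standard error for a difference in proportions. The function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

Z95 = NormalDist().inv_cdf(0.975)  # two-sided 95% critical value, about 1.96


def gap_se(p_control, p_treated, n_per_arm):
    """Standard error of a difference between two independent proportions."""
    return sqrt(p_control * (1 - p_control) / n_per_arm
                + p_treated * (1 - p_treated) / n_per_arm)


def intervals(n_per_arm, base=0.10, avg_gap=0.03, het_gap=0.03):
    """Return (average-gap CI, subgroup-difference CI) as (low, high) pairs.

    Illustrative assumptions: a 10% control callback rate, a 3-point average
    gap, and two equal-sized subgroups whose gaps differ by 3 points.
    """
    half = n_per_arm / 2  # each subgroup holds half of every arm
    gap_a = avg_gap - het_gap / 2
    gap_b = avg_gap + het_gap / 2
    se_avg = gap_se(base, base - avg_gap, n_per_arm)
    # The subgroup difference is a difference of two gaps, so variances add.
    se_het = sqrt(gap_se(base, base - gap_a, half) ** 2
                  + gap_se(base, base - gap_b, half) ** 2)
    avg_ci = (avg_gap - Z95 * se_avg, avg_gap + Z95 * se_avg)
    het_ci = (het_gap - Z95 * se_het, het_gap + Z95 * se_het)
    return avg_ci, het_ci


def first_detectable(which, start=50, stop=100_000):
    """Smallest n per arm at which the chosen interval clears zero.

    which=0 selects the average gap, which=1 the subgroup difference.
    """
    for n in range(start, stop):
        if intervals(n)[which][0] > 0:
            return n
    return None


if __name__ == "__main__":
    for n in (500, 2000, 5000):
        avg_ci, het_ci = intervals(n)
        print(n, "average clears zero:", avg_ci[0] > 0,
              "| heterogeneity clears zero:", het_ci[0] > 0)
    print("n needed:", first_detectable(0), "vs", first_detectable(1))
```

Under these assumed effect sizes, the subgroup difference needs roughly four times the sample per arm that the average gap needs, even though both effects are the same size. Splitting each arm in half doubles the variance of each subgroup gap, and differencing two gaps doubles it again.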

Calculator

A real power calculator for your audit

This one runs the actual two-proportion power test. Audit outcomes are binary (a callback or none), so that is the right test for most correspondence designs. Enter a baseline callback rate, the gap you want to detect, the sample per arm, and the significance level; it returns the power you have, the sample per arm you would need for 80% power, and the smallest gap this sample can detect with 80% power. Defaults are Bertrand and Mullainathan's (2004) numbers.

Inputs: baseline callback rate (control), %; gap to detect (control − treated), points; sample per arm (n); significance level α (two-sided).

Outputs: power at n = 2,435 / arm; n / arm for 80% power; smallest gap detectable at this n (80%).
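For readers who want the three outputs without the page, here is a minimal sketch of a two-proportion power calculation using the normal approximation. It is not necessarily the exact routine behind the calculator. The function names are hypothetical, and the 9.65% baseline and 3.2-point gap in the usage lines are the callback rates commonly reported from Bertrand and Mullainathan (2004), assumed here because the text above gives only n = 2,435 per arm.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()


def power(p_control, gap, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    p_treated = p_control - gap
    p_bar = (p_control + p_treated) / 2
    # Pooled standard error under the null, unpooled under the alternative.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_arm)
    se_alt = sqrt((p_control * (1 - p_control)
                   + p_treated * (1 - p_treated)) / n_per_arm)
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    upper = STD_NORMAL.cdf((gap - z_crit * se_null) / se_alt)
    lower = STD_NORMAL.cdf((-gap - z_crit * se_null) / se_alt)
    return upper + lower


def n_for_power(p_control, gap, target=0.80, alpha=0.05):
    """Sample per arm for the target power (ignores the negligible far tail)."""
    p_treated = p_control - gap
    p_bar = (p_control + p_treated) / 2
    z_alpha = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_beta = STD_NORMAL.inv_cdf(target)
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p_control * (1 - p_control)
                                 + p_treated * (1 - p_treated))) ** 2
    return ceil(numerator / gap ** 2)


def min_detectable_gap(p_control, n_per_arm, target=0.80, alpha=0.05):
    """Smallest gap reaching the target power at this n, found by bisection."""
    lo, hi = 0.0, p_control
    if power(p_control, hi, n_per_arm, alpha) < target:
        return None  # even a gap down to a zero treated rate is undetectable
    for _ in range(60):
        mid = (lo + hi) / 2
        if power(p_control, mid, n_per_arm, alpha) < target:
            lo = mid
        else:
            hi = mid
    return hi


if __name__ == "__main__":
    # Assumed defaults: 9.65% baseline, 3.2-point gap, 2,435 per arm.
    print("power:", round(power(0.0965, 0.032, 2435), 3))
    print("n per arm for 80% power:", n_for_power(0.0965, 0.032))
    print("smallest detectable gap:", round(min_detectable_gap(0.0965, 2435), 4))
```

The pooled-null and unpooled-alternative standard errors follow the textbook z-test. An arcsine-transformed effect size, as some power packages use, would give slightly different numbers.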

Accumulation

From a finding to a fact about a society

Gaddis, Larsen, Crabtree & Holbein · meta-analysis

Pooling decades of audits

With Gaddis and Larsen, John and I meta-analyzed the correspondence-audit record and found discrimination against Black and Hispanic Americans concentrated in hiring and housing. One audit is an anecdote with a p-value. The accumulation of comparable audits is what lets you say something durable about a society, and where its gates are tightest.

Pfaff, Crabtree, Kern & Holbein · Public Administration Review

Street-level bureaucrats and religion

In related work we emailed public school principals posing as parents, varying a religious cue. Principals were less responsive to senders signaling Muslim or atheist identity. The same audit logic, the same everyday-discrimination construct, applied to the officials families actually deal with.

Protestant / Catholic cue: ≈0 pts · n.s.

Muslim cue: −4.6 pts

Atheist cue: −4.7 pts

Real estimates from Pfaff, Crabtree, Kern & Holbein (2021): change in the probability a principal replies, relative to an identical email giving no religious information. Just signing the email as Muslim or atheist costs 4.6 and 4.7 points; Protestants and Catholics are indistinguishable from the no-information baseline. The penalty grows to −8.7 (Muslim) and −13.8 (atheist) when the family also asks about the school's compatibility with their beliefs. Bar lengths are scaled to the −4.6/−4.7-point low-intensity penalties.

The same collaboration produced the validated names dataset in this guide, which John coauthored, and a published exchange on what name-based designs can and cannot identify. Scale is what turns a callback gap into a claim about how a country treats its own people.