Eight Decisions Every Audit Must Get Right

Define the estimand

Decide what causal quantity you are after before anything else. A name-based audit identifies the effect of the perceived signal on gatekeeper behavior among the units you contact. That is not the effect of race writ large, and pretending otherwise invites trouble. Butler and Crabtree (2020) lay out what audit designs can and cannot identify; our PNAS reply to Mitterer works through what name manipulations specifically license.

Choose the treatment mode

Names, photos, and explicit labels all signal group membership, and all of them bundle other attributes. As John and I argued in the Mitterer reply, photos may bundle even more than names. The advantage of validated names is that the bundle is measured. The 600-name dataset in this guide exists so that the choice of mode and the choice of stimuli are evidence-based, not a matter of taste.

Set treatment delivery and dosage

How many applications does each employer receive? Does each unit see one signal or both? Within-subjects delivery, where every unit can receive both treatments, buys power and lets you locate bias within individuals, at the cost of a higher risk of detection. The 250,000-person design in Block, Crabtree, Holbein, and Monson (2021) used this structure. Spacing, ordering, and timing rules must be fixed in advance, not improvised mid-field.
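One way to fix ordering and spacing in advance is to generate the full contact schedule from a seeded random number generator before fielding. The sketch below is a minimal illustration, not the procedure used in any study cited here; the function name, the seed, the "A"/"B" signal labels, and the `min_gap_days` parameter are all hypothetical.

```python
import random

def assign_within_unit(unit_ids, seed=20260101, min_gap_days=7):
    """Build a fixed schedule in which every unit receives both signals.

    The order of the two signals is randomized per unit. The seed and the
    gap between contacts are illustrative assumptions, chosen before fielding.
    """
    rng = random.Random(seed)  # fixed seed makes the schedule reproducible
    schedule = []
    for uid in unit_ids:
        order = ["A", "B"]     # hypothetical labels for the two signals
        rng.shuffle(order)     # randomize which signal arrives first
        schedule.append({
            "unit": uid,
            "first": order[0],
            "second": order[1],
            "gap_days": min_gap_days,  # spacing rule fixed in advance
        })
    return schedule

units = [f"unit_{i:03d}" for i in range(6)]
plan = assign_within_unit(units)
for row in plan:
    print(row)
```

Because the schedule depends only on the unit list and the seed, it can be archived with the preregistration and regenerated later to verify that nothing was changed mid-field.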

Power for what you actually want to find

Callback gaps are small, often only a few percentage points. Underpowered audits do not just miss effects; when they do find something, the estimate is exaggerated in magnitude and sometimes wrong in sign. And if you want heterogeneity, the kind we found across schools in Gaddis, Crabtree, Holbein, and Pfaff (2024), the bar is far higher. Subgroup and interaction effects are estimated with much less precision than the average effect, so detecting them can take several times the sample. Decide the smallest effect that would matter, power for it, and fund that. This is why the public audit went to a quarter of a million people.
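The two claims above, that small gaps need large samples and that underpowered studies exaggerate what they find, can be checked with a rough calculation and a small simulation. This is a sketch under stated assumptions: the callback rates (10% versus 8%), the per-arm sample of 300, and the function names are invented for illustration, and the sample-size formula is the standard normal approximation for comparing two independent proportions.

```python
import math
import random
from statistics import NormalDist

def n_per_arm(p_control, p_treat, alpha=0.05, power=0.80):
    """Approximate n per arm for a two-sided two-proportion z-test."""
    z_a = NormalDist().inv_cdf(1 - alpha / 2)
    z_b = NormalDist().inv_cdf(power)
    var = p_control * (1 - p_control) + p_treat * (1 - p_treat)
    return math.ceil((z_a + z_b) ** 2 * var / (p_control - p_treat) ** 2)

def exaggeration(p_control, p_treat, n, sims=2000, alpha=0.05, seed=1):
    """Simulate many small audits and inspect only the significant ones.

    Returns (type_m, type_s): the average ratio of the significant estimate
    to the true gap, and the share of significant estimates with the wrong
    sign. Returns None if no simulated study reaches significance.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    true_gap = p_control - p_treat
    significant = []
    for _ in range(sims):
        c = sum(rng.random() < p_control for _ in range(n)) / n
        t = sum(rng.random() < p_treat for _ in range(n)) / n
        se = math.sqrt(c * (1 - c) / n + t * (1 - t) / n)
        if se > 0 and abs(c - t) / se > z_crit:
            significant.append(c - t)
    if not significant:
        return None
    type_m = sum(abs(g) for g in significant) / len(significant) / abs(true_gap)
    type_s = sum(1 for g in significant if g * true_gap < 0) / len(significant)
    return type_m, type_s

# Hypothetical callback rates: 10% versus 8%, a two-point gap.
required = n_per_arm(0.10, 0.08)
result = exaggeration(0.10, 0.08, n=300)
print("n per arm for 80% power:", required)
print("exaggeration ratio, wrong-sign share at n=300:", result)
```

Under these assumed rates the required sample runs to a few thousand per arm, and the studies that reach significance with only 300 per arm overstate the true gap severalfold. Powering for a subgroup difference would push the requirement higher still.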

Code outcomes simply

Use objective, binary, preregistered outcomes: any reply, a callback within a fixed window, an appointment offered. Do not use measures of response quality. Quality codings are subjective and coder-dependent, and they condition on having received a response at all, which the treatment itself affects. That makes them post-treatment quantities, and comparisons built on them are no longer protected by randomization.

[Interactive figure: Why response quality breaks the randomization. Panels: White-coded arm; Black-coded arm.]
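The post-treatment problem with response quality can be made concrete with a toy simulation. In the sketch below, the signal has no effect on how any gatekeeper writes; it only changes who replies. Every number is an assumption chosen for illustration: two latent gatekeeper types, their reply probabilities, and a fixed quality score per type.

```python
import random

def simulate(n=20000, seed=7):
    """Toy audit in which the signal changes WHO replies, not HOW they reply."""
    rng = random.Random(seed)
    sent = {"white": 0, "black": 0}
    replies = {"white": 0, "black": 0}
    quality = {"white": [], "black": []}
    for _ in range(n):
        arm = rng.choice(["white", "black"])  # randomized signal
        warm = rng.random() < 0.5             # latent gatekeeper type
        # Assumption: warm gatekeepers reply to both signals at the same rate;
        # the others reply less often to the Black-coded signal.
        if warm:
            p_reply = 0.6
        else:
            p_reply = 0.4 if arm == "white" else 0.1
        # Assumption: reply quality depends on gatekeeper type only.
        q = 0.8 if warm else 0.4
        sent[arm] += 1
        if rng.random() < p_reply:
            replies[arm] += 1
            quality[arm].append(q)
    return {
        arm: {
            "reply_rate": replies[arm] / sent[arm],
            "mean_quality_among_replies": sum(quality[arm]) / len(quality[arm]),
        }
        for arm in sent
    }

summary = simulate()
for arm, stats in summary.items():
    print(arm, stats)
```

The reply-rate comparison recovers the discrimination built into the simulation. The quality comparison does not: replies to the Black-coded arm look better on average, only because the less friendly gatekeepers dropped out of that arm. The two groups of responders are no longer comparable, which is what conditioning on a post-treatment outcome does.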

Probe mechanisms separately

A callback gap establishes differential treatment. It does not tell you whether the mechanism is animus or statistical inference from the signal. Adjudicating between them takes design work: varying applicant quality the way Bertrand and Mullainathan did, varying the information available to the gatekeeper, or using the perceived class and citizenship data attached to each validated name to test what the signal is actually carrying. In Hughes, Gell-Redman, Crabtree, and coauthors (2020, JEPS), we added a quieter instrument to an audit of local election officials: whether the email was even opened. Open rates tap a more implicit layer than replies do, and there the bias fell on Arab/Muslim senders.

Plan the logistics

Email domains, inboxes, phone numbers, monitoring schedules, deduplication, response harvesting. Crabtree (2018) walks through running email audits end to end in the Gaddis volume. Treat infrastructure as a threat to inference, not a nuisance. The modern version of this problem has teeth, and it gets its own section below.

Clear the ethics bar

Audits deceive people who never consented and consume their time. Crabtree and Dhima (2022) propose a cost-benefit framework: minimize the burden per subject, cap the sample at what your power analysis actually requires, and be able to state plainly why the knowledge is worth the imposition. The institutional review questions that follow from this get their own section too.

Logistics · the modern era

Running an audit in 2026

The infrastructure has changed more than the method. An email audit today runs through systems that did not exist when the classic studies were fielded, and each one is a place where your treatment and control conditions can diverge before a human ever sees them. If they diverge, you have differential attrition baked into the delivery layer, and randomization no longer saves you.

Deliverability and spam filtering

Filters score senders on reputation, content, and recipient behavior. If messages from your treatment names land in spam at different rates than control names, the gap you measure is partly a filter artifact. Warm the domains, send from matched infrastructure, and log delivery, not just replies.
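If delivery is logged by arm, a differential-delivery check takes a few lines. The sketch below applies a pooled two-proportion z-test to inbox-placement counts; the counts are invented for illustration, and the function name is hypothetical. A non-significant result does not prove the filter is neutral, so the raw gap deserves reporting alongside the test.

```python
import math
from statistics import NormalDist

def delivery_gap_test(delivered_a, sent_a, delivered_b, sent_b):
    """Two-sided pooled z-test for a difference in delivery rates by arm."""
    p_a = delivered_a / sent_a
    p_b = delivered_b / sent_b
    pooled = (delivered_a + delivered_b) / (sent_a + sent_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / sent_a + 1 / sent_b))
    if se == 0:
        # Every message delivered (or none): no variation to test.
        return p_a - p_b, 0.0, 1.0
    z = (p_a - p_b) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_a - p_b, z, p_value

# Hypothetical pilot counts: messages reaching the inbox, by arm.
gap, z, p_value = delivery_gap_test(940, 1000, 900, 1000)
print(f"delivery gap = {gap:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

Running this on pilot data before the main wave gives time to fix the sending infrastructure rather than discovering the artifact in the outcome data.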

Automated screening

Many employers now route applications through tracking systems that parse and rank before a recruiter looks. The relevant gatekeeper may be an algorithm. That is worth measuring, but it changes what your estimand means and demands you record the screening layer explicitly.

Detection and authenticity

CAPTCHAs, phone verification, and duplicate detection can block fictional applicants unevenly. AI-generated text is now flagged by the same systems you might use to scale message production. Pilot heavily and confirm your stimuli actually arrive intact and look human.

Platform terms and IP reputation

Job boards, listing sites, and email providers have terms of service that audits routinely strain. Shared IP reputation can get a whole batch throttled. Build in redundancy, stagger volume, and assume any single channel can be cut off mid-field.

The principle is old and the failure modes are new. Every automated layer between you and the gatekeeper is a candidate for differential treatment of your conditions, so instrument all of them and pilot until the pipeline is boringly reliable.
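Instrumenting every layer can be as simple as recording the furthest stage each message reached and tabulating the funnel by condition. The sketch below is one possible shape for such a log; the stage names, the `MessageLog` record, and the assumption that stages are strictly sequential are simplifications made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical pipeline stages, in order. Real pipelines may not be strictly
# sequential (a reply can arrive without a recorded open, for example).
STAGES = ("sent", "delivered", "inbox", "opened", "replied")

@dataclass
class MessageLog:
    message_id: str
    arm: str         # experimental condition
    last_stage: str  # furthest stage this message is known to have reached

def funnel_by_arm(logs):
    """Count how many messages in each arm reached each stage."""
    counts = defaultdict(lambda: {stage: 0 for stage in STAGES})
    for log in logs:
        reached = STAGES.index(log.last_stage)
        for stage in STAGES[: reached + 1]:
            counts[log.arm][stage] += 1
    return {arm: dict(stage_counts) for arm, stage_counts in counts.items()}

logs = [
    MessageLog("m1", "treatment", "replied"),
    MessageLog("m2", "treatment", "sent"),
    MessageLog("m3", "control", "inbox"),
    MessageLog("m4", "control", "opened"),
]
funnel = funnel_by_arm(logs)
for arm, stage_counts in funnel.items():
    print(arm, stage_counts)
```

Comparing the two funnels stage by stage shows where the conditions start to diverge, which separates a gatekeeper's decision from a filter's.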

Ethics · answering the board

The IRB questions you will get, and how to meet them

Audits sit awkwardly inside human-subjects regulation, because the people burdened, the employers and officials, are often not the people the rules were written to protect. Boards know this and ask hard questions. The cost-benefit framework in Crabtree and Dhima (2022) is built to answer them. Here are the ones that come up most.

Who is the human subject here?

Often the gatekeeper, who is studied without consent, and sometimes third parties whose time is consumed. Name the burdened parties explicitly rather than hiding behind the fact that you are studying institutions. Boards respond well to candor about who actually bears the cost.

Why is deception necessary?

Because the behavior of interest disappears the moment subjects know they are observed. That is the entire rationale for the design. State it plainly, and show there is no non-deceptive way to measure everyday discrimination at a decision point.

Is the risk minimal, and is the burden justified?

Quantify the per-subject burden: minutes of attention, an email read, a callback placed. Multiply by the sample. Then weigh that total against the social value of the evidence. If the sample your power analysis requires imposes more burden than the knowledge can justify, the honest answer is to redesign or not to field.
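The burden arithmetic is simple enough to write down and include in the protocol. The figures below are placeholders, not estimates from any study: a hypothetical sample of 3,000 gatekeepers, two contacts each, and 1.5 minutes of attention per contact.

```python
def total_burden_hours(n_subjects, contacts_per_subject, minutes_per_contact):
    """Aggregate time imposed on subjects, in hours."""
    return n_subjects * contacts_per_subject * minutes_per_contact / 60

# Hypothetical design: 3,000 gatekeepers, 2 contacts each, 1.5 minutes apiece.
burden = total_burden_hours(3000, 2, 1.5)
print(f"Total imposed burden: {burden:.0f} hours")
```

Recomputing this figure for each candidate design makes the trade-off explicit: a within-subjects design doubles contacts per subject but may need fewer subjects, and the total is what the board should see.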

Can consent be waived, and should you debrief?

Audits typically seek a waiver of informed consent on minimal-risk grounds. Debriefing is contested: it can impose more burden than the study itself and tip off future subjects. Address it directly, justify your choice, and document the data-handling plan for any real identities incidentally collected.

The framework does not make the tension disappear. It forces the design to minimize burden, cap scale at what the question needs, and put the justification on the record. That is also what a good board wants to see.
