The design is simple to describe and easy to run badly. These are the decisions that determine whether the study identifies anything, in the order you should make them.
Decide what causal quantity you are after before anything else. A name-based audit identifies the effect of the perceived signal on gatekeeper behavior among the units you contact. That is not the effect of race writ large, and pretending otherwise invites trouble. Butler and Crabtree (2020) lay out what audit designs can and cannot identify; our PNAS reply to Mitterer works through what name manipulations specifically license.
Names, photos, and explicit labels all signal group membership, and all of them bundle other attributes. As John and I argued in the Mitterer reply, photos may bundle even more than names. The advantage of validated names is that the bundle is measured. The 600-name dataset in this guide exists so that the choice of mode and the choice of stimuli are evidence-based, not a matter of taste.
How many applications per employer? Does each unit see one signal or both? Within-subjects delivery, where every unit can receive both treatments, buys power and lets you locate bias within individuals, at the cost of detection risk. The 250,000-person design in Block, Crabtree, Holbein, and Monson (2021) used this structure. Spacing, ordering, and timing rules must be fixed in advance, not improvised mid-field.
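One way to make "fixed in advance" concrete is to generate the whole delivery schedule from a recorded seed before fielding begins. The sketch below is illustrative only: the function name `build_schedule`, the arm labels, the 7-to-21-day gap, and the start date are all assumptions, not the rules used in any of the studies cited here.

```python
import random
from datetime import datetime, timedelta

def build_schedule(unit_ids, seed, start, min_gap_days=7, max_gap_days=21):
    """Assign each unit both arms in a random order with a random gap.

    The seed is recorded up front (e.g. in the preregistration), so the
    schedule can be regenerated exactly and nothing is decided mid-field.
    """
    rng = random.Random(seed)          # private generator, reproducible
    schedule = []
    for unit in sorted(unit_ids):      # sort so input order cannot matter
        order = ["treatment", "control"]
        rng.shuffle(order)             # randomize which signal arrives first
        gap = rng.randint(min_gap_days, max_gap_days)  # inclusive bounds
        schedule.append({
            "unit": unit,
            "first_arm": order[0],
            "first_send": start,
            "second_arm": order[1],
            "second_send": start + timedelta(days=gap),
        })
    return schedule

# Hypothetical usage: three units, seed chosen and logged before launch.
plan = build_schedule(["u1", "u2", "u3"], seed=20240101,
                      start=datetime(2025, 1, 6, 9, 0))
```

Because order and spacing come from the seeded generator, anyone holding the seed and the unit list can audit the schedule after the fact.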
Callback gaps are small, often a few percentage points. Underpowered audits do not just miss effects; when they do find something, the estimate is exaggerated in magnitude and sometimes wrong in sign. And if you want heterogeneity, the kind we found across schools in Gaddis, Crabtree, Holbein, and Pfaff (2024), the bar is far higher. Subgroup and interaction effects are estimated with much less precision than the average, so detecting them can take several times the sample. Decide the smallest effect that would matter, power for it, and fund that. This is why the public audit went to a quarter of a million people.
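To see how quickly small gaps drive up sample size, here is a minimal sketch of the standard normal-approximation formula for comparing two independent proportions. The 10% and 8% callback rates are made-up inputs, not estimates from any study cited here, and a paired (within-subjects) design would call for a different calculation.

```python
from math import ceil
from statistics import NormalDist

def n_per_arm(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate units per arm for a two-sided test of two proportions.

    Uses n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p1 - p2)^2,
    which assumes independent samples and a large-sample normal approximation.
    """
    z = NormalDist()                         # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)       # critical value for the test
    z_power = z.inv_cdf(power)               # quantile for the target power
    variance = (p_control * (1 - p_control)
                + p_treatment * (1 - p_treatment))
    effect = p_control - p_treatment
    return ceil((z_alpha + z_power) ** 2 * variance / effect ** 2)

# Illustrative inputs only: a 2-point gap on a 10% base rate.
n_small_gap = n_per_arm(0.10, 0.08)
# Doubling the gap cuts the requirement to roughly a quarter.
n_large_gap = n_per_arm(0.10, 0.06)
```

As a back-of-envelope check on the heterogeneity point: if the sample splits into two equal subgroups with similar variances, each subgroup gap has twice the variance of the pooled gap, and the difference between the two subgroup gaps has four times that variance. Detecting an interaction as large as the main effect would therefore take roughly four times the sample, and interactions are usually smaller than main effects.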
Use objective, binary, preregistered outcomes: any reply, a callback within a fixed window, an appointment offered. Do not use measures of response quality. Quality codings are subjective and coder-dependent, and they condition on having received a response at all, which the treatment itself affects. That makes them post-treatment quantities, and comparisons built on them are no longer protected by randomization.
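A fixed-window binary outcome can be coded mechanically, which is part of its appeal. The sketch below assumes a hypothetical 14-day window and hypothetical field names; the key property is that every contacted unit gets a 0 or 1, so nothing is conditioned on having received a reply.

```python
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical preregistered window; set this before fielding, not after.
WINDOW = timedelta(days=14)

def replied_within_window(sent_at: datetime,
                          first_reply_at: Optional[datetime],
                          window: timedelta = WINDOW) -> int:
    """Return 1 if a reply arrived within the window after sending, else 0.

    Units that never reply are coded 0 rather than dropped, so the outcome
    is defined for the full randomized sample.
    """
    if first_reply_at is None:
        return 0
    return int(sent_at <= first_reply_at <= sent_at + window)

# Illustrative records: (sent_at, first_reply_at)
sent = datetime(2025, 3, 3, 9, 0)
outcomes = [
    replied_within_window(sent, datetime(2025, 3, 5, 14, 30)),  # inside window
    replied_within_window(sent, datetime(2025, 4, 1, 10, 0)),   # too late
    replied_within_window(sent, None),                          # no reply
]
```

Late replies count as 0 here by construction. If that rule would feel wrong in your setting, change the window in the preregistration, not in the analysis.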
A callback gap establishes differential treatment. It does not tell you whether the mechanism is animus or statistical inference from the signal. Adjudicating between them takes design work: varying applicant quality the way Bertrand and Mullainathan did, varying the information available to the gatekeeper, or using the perceived class and citizenship data attached to each validated name to test what the signal is actually carrying. In Hughes, Gell-Redman, Crabtree, and coauthors (2020, JEPS), we added a quieter instrument to an audit of local election officials: whether the email was even opened. Open rates tap a more implicit layer than replies do, and there the bias fell on Arab/Muslim senders.
The infrastructure covers email domains, inboxes, phone numbers, monitoring schedules, deduplication, and response harvesting. Crabtree (2018) walks through running email audits end to end in the Gaddis volume. Treat infrastructure as a threat to inference, not a nuisance. The modern version of this problem has teeth, and it gets its own section below.
Audits deceive people who never consented and consume their time. Crabtree and Dhima (2022) propose a cost-benefit framework: minimize the burden per subject, cap the sample at what your power analysis actually requires, and be able to state plainly why the knowledge is worth the imposition. The institutional review questions that follow from this get their own section too.
The infrastructure has changed more than the method. An email audit today runs through systems that did not exist when the classic studies were fielded, and each one is a place where your treatment and control conditions can diverge before a human ever sees them. If they diverge, you have differential attrition baked into the delivery layer, and randomization no longer saves you.
Filters score senders on reputation, content, and recipient behavior. If messages from your treatment names land in spam at different rates than control names, the gap you measure is partly a filter artifact. Warm the domains, send from matched infrastructure, and log delivery, not just replies.
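If delivery is logged by arm, a quick check for a filter artifact is to compare delivery rates across conditions before looking at replies. The sketch below uses a pooled two-proportion z-test. The counts are invented, and it assumes you have some way to observe delivery (for example bounce logs or seeded test inboxes), which is itself an assumption about your setup.

```python
from math import sqrt
from statistics import NormalDist

def delivery_gap_test(delivered_t, sent_t, delivered_c, sent_c):
    """Compare delivery rates between treatment and control messages.

    Returns (gap, z, p_value) from a pooled two-proportion z-test, where
    gap = treatment delivery rate minus control delivery rate.
    """
    p_t = delivered_t / sent_t
    p_c = delivered_c / sent_c
    pooled = (delivered_t + delivered_c) / (sent_t + sent_c)
    se = sqrt(pooled * (1 - pooled) * (1 / sent_t + 1 / sent_c))
    if se == 0:                      # every message delivered, or none
        return p_t - p_c, 0.0, 1.0
    z = (p_t - p_c) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))   # two-sided
    return p_t - p_c, z, p_value

# Invented counts: 900 of 1000 treatment messages delivered vs 950 of 1000.
gap, z_stat, p_val = delivery_gap_test(900, 1000, 950, 1000)
```

A gap like this one would be a warning that part of any reply difference is produced upstream of the gatekeeper. A non-significant result in a small pilot is weak reassurance, so keep logging delivery through the full field period.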
Many employers now route applications through tracking systems that parse and rank before a recruiter looks. The relevant gatekeeper may be an algorithm. That is worth measuring, but it changes what your estimand means and demands that you record the screening layer explicitly.
CAPTCHAs, phone verification, and duplicate detection can block fictional applicants unevenly. AI-generated text is now flagged by the same kinds of systems you might use to scale message production. Pilot heavily and confirm your stimuli actually arrive intact and look human.
Job boards, listing sites, and email providers have terms of service that audits routinely strain. Shared IP reputation can get a whole batch throttled. Build in redundancy, stagger volume, and assume any single channel can be cut off mid-field.
The principle is old and the failure modes are new. Every automated layer between you and the gatekeeper is a candidate for differential treatment of your conditions, so instrument all of them and pilot until the pipeline is boringly reliable.
Audits sit awkwardly inside human-subjects regulation, because the people burdened, the employers and officials, are often not the people the rules were written to protect. Boards know this and ask hard questions. The cost-benefit framework in Crabtree and Dhima (2022) is built to answer them. Here are the ones that come up most.
The framework does not make the tension disappear. It forces the design to minimize burden, cap scale at what the question needs, and put the justification on the record. That is also what a good board wants to see.