Noise:
A Flaw in Human Judgment
by Daniel Kahneman, Oliver Sibony, and Cass Sunstein
After finishing this book in December of 2021, I wrote,
"Daniel Kahneman, author of 'Thinking Fast and Slow,' doesn't disappoint in addressing an entirely new area of how we humans, often unknowingly, err in our judgments. This book is so rich that I don't think I did it full justice in my notes below. Regardless, you'll find them rewarding."
My clippings below collapse a 386-page book into about 15 pages, measured in 12-point type in Microsoft Word.
See all my book recommendations.
Here are the selections I made:
Bias and noise—systematic deviation and random scatter—are different components of error.
Some judgments are biased; they are systematically off target. Other judgments are noisy, as people who are expected to agree end up at very different points around the target. Many organizations, unfortunately, are afflicted by both bias and noise.
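To make the distinction concrete, here is a minimal Python sketch (my own illustration, not from the book) of what the book later calls the error equation: for a set of judgments of the same case, mean squared error splits into the square of the bias (the average error) plus the square of the noise (the scatter of the judgments).

```python
import statistics

def error_components(judgments, true_value):
    """Split mean squared error into bias^2 + noise^2 for one case.

    bias  = average error (systematic deviation from the target)
    noise = standard deviation of the judgments (their scatter)
    """
    errors = [j - true_value for j in judgments]
    bias = statistics.mean(errors)
    noise = statistics.pstdev(judgments)                  # population SD
    mse = statistics.mean(e ** 2 for e in errors)
    assert abs(mse - (bias ** 2 + noise ** 2)) < 1e-6     # the "error equation"
    return bias, noise, mse

# Example: five hypothetical quotes for a risk whose ideal premium is 10,000.
print(error_components([9_500, 10_500, 12_000, 8_000, 11_000], 10_000))
```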
Two men, neither of whom had a criminal record, were convicted for cashing counterfeit checks in the amounts of $58.40 and $35.20, respectively. The first man was sentenced to fifteen years, the second to thirty days.
For embezzlement actions that were similar to one another, one man was sentenced to 117 days in prison, while another was sentenced to 20 years.
Fifty judges from various districts were asked to set sentences for defendants in hypothetical cases summarized in identical pre-sentence reports. The basic finding was that “absence of consensus was the norm” and that the variations across punishments were “astounding.” A heroin dealer could be incarcerated for one to ten years, depending on the judge. Punishments for a bank robber ranged from five to eighteen years in prison. The study found that in an extortion case, sentences varied from a whopping twenty years imprisonment and a $65,000 fine to a mere three years imprisonment and no fine. Most startling of all, in sixteen of twenty cases, there was no unanimity on whether any incarceration was appropriate. This study was followed by a series of others, all of which found similarly shocking levels of noise. In 1977, for example, William Austin and Thomas Williams conducted a survey of forty-seven judges, asking them to respond to the same five cases, each involving low-level offenses. All the descriptions of the cases included summaries of the information used by judges in actual sentencing, such as the charge, the testimony, the previous criminal record (if any), social background, and evidence relating to character. The key finding was “substantial disparity.” In a case involving burglary, for example, the recommended sentences ranged from five years in prison to a mere thirty days (alongside a fine of $100). In a case involving possession of marijuana, some judges recommended prison terms; others recommended probation.
Every large branch of the company has several qualified underwriters. When a quote is requested, anyone who happens to be available may be assigned to prepare it. In effect, the particular underwriter who will determine a quote is selected by a lottery.
The exact value of the quote has significant consequences for the company. A high premium is advantageous if the quote is accepted, but such a premium risks losing the business to a competitor. A low premium is more likely to be accepted, but it is less advantageous to the company. For any risk, there is a Goldilocks price that is just right—neither too high nor too low—and there is a good chance that the average judgment of a large group of professionals is not too far from this Goldilocks number. Prices that are higher or lower than this number are costly—this is how the variability of noisy judgments hurts the bottom line.
An adjuster is assigned to the claim—just as the underwriter was assigned, because she happens to be available. The adjuster gathers the facts of the case and provides an estimate of its ultimate cost to the company. The same adjuster then takes charge of negotiating with the claimant’s representative to ensure that the claimant receives the benefits promised in the policy while also protecting the company from making excessive payments.
The early estimate matters because it sets an implicit goal for the adjuster in future negotiations with the claimant. The insurance company is also legally obligated to reserve the predicted cost of each claim (i.e., to have enough cash to be able to pay it). Here again, there is a Goldilocks value from the perspective of the company. A settlement is not guaranteed, as there is an attorney for the claimant on the other side, who may choose to go to court if the offer is miserly. On the other hand, an overly generous reserve may allow the adjuster too much latitude to agree to frivolous demands. The adjuster’s judgment is consequential for the company—and even more consequential for the claimant.
When we asked 828 CEOs and senior executives from a variety of industries how much variation they expected to find in similar expert judgments, the median answer was 10%, which was also the most frequent one (the second most popular was 15%). A 10% difference would mean, for instance, that one of the two underwriters set a premium of $9,500 while the other quoted $10,500. Not a negligible difference, but one that an organization can be expected to tolerate.
Our noise audit found much greater differences. By our measure, the median difference in underwriting was 55%, about five times as large as was expected by most people, including the company’s executives. This result means, for instance, that when one underwriter sets a premium at $9,500, the other does not set it at $10,500—but instead quotes $16,700. For claims adjusters, the median ratio was 43%. We stress that these results are medians: in half the pairs of cases, the difference between the two judgments was even larger.
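For readers who want to see the arithmetic: the noise measure here, as I understand it, is the difference between two judgments divided by their average. A quick sketch (my own hypothetical helper, not the book's code):

```python
def relative_difference(a, b):
    """Relative difference between two judgments of the same case:
    absolute difference divided by the average of the two."""
    return abs(a - b) / ((a + b) / 2)

# The underwriting example above: premiums of $9,500 and $16,700.
print(f"{relative_difference(9_500, 16_700):.0%}")   # ~55%
```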
Variability in judgments is also expected and welcome in a competitive situation in which the best judgments will be rewarded.
The same is true when multiple teams of researchers attack a scientific problem, such as the development of a vaccine: we very much want them to look at it from different angles.
But our focus is on judgments in which variability is undesirable. System noise is a problem of systems, which are organizations, not markets.
They asked forty-two experienced investors in the firm to estimate the fair value of a stock (the price at which the investors would be indifferent to buying or selling). The investors based their analysis on a one-page description of the business; the data included simplified profit and loss, balance sheet, and cash flow statements for the past three years and projections for the next two. Median noise, measured in the same way as in the insurance company, was 41%. Such large differences among investors in the same firm, using the same valuation methods, cannot be good news.
A large literature going back several decades has documented noise in professional judgment. Because we were aware of this literature, the results of the insurance company’s noise audit did not surprise us. What did surprise us, however, was the reaction of the executives to whom we reported our findings: no one at the company had expected anything like the amount of noise we had observed. No one questioned the validity of the audit, and no one claimed that the observed amount of noise was acceptable. Yet the problem of noise—and its large cost—seemed like a new one for the organization. Noise was like a leak in the basement. It was tolerated not because it was thought acceptable but because it had remained unnoticed.
Most of us, most of the time, live with the unquestioned belief that the world looks as it does because that’s the way it is. There is one small step from this belief to another: “Other people view the world much the way I do.” These beliefs, which have been called naive realism, are essential to the sense of a reality we share with other people. We rarely question these beliefs. We hold a single interpretation of the world around us at any one time, and we normally invest little effort in generating plausible alternatives to it. One interpretation is enough, and we experience it as true. We do not go through life imagining alternative ways of seeing what we see.
For the insurance company, the illusion of agreement was shattered only by the noise audit. How had the leaders of the company remained unaware of their noise problem? There are several possible answers here, but one that seems to play a large role in many settings is simply the discomfort of disagreement.
“Wherever there is judgment, there is noise—and more of it than we think.”
From the perspective of noise reduction, a singular decision is a recurrent decision that happens only once. Whether you make a decision only once or a hundred times, your goal should be to make it in a way that reduces both bias and noise. And practices that reduce error should be just as effective in your one-of-a-kind decisions as in your repeated ones.
A matter of judgment is one with some uncertainty about the answer and where we allow for the possibility that reasonable and competent people might disagree.
Matters of judgment differ from matters of opinion or taste, in which unresolved differences are entirely acceptable. The insurance executives who were shocked by the result of the noise audit would have no problem if claims adjusters were sharply divided over the relative merits of the Beatles and the Rolling Stones, or of salmon and tuna.
Matters of judgment, including professional judgments, occupy a space between questions of fact or computation on the one hand and matters of taste or opinion on the other. They are defined by the expectation of bounded disagreement.
If an event that was assigned a probability of 90% fails to happen, the judgment of probability was not necessarily a bad one. After all, outcomes that are just 10% likely to happen end up happening 10% of the time. The Gambardi exercise is an example of a nonverifiable predictive judgment, for two separate reasons: Gambardi is fictitious and the answer is probabilistic.
Many professional judgments are nonverifiable. Barring egregious errors, underwriters will never know, for instance, whether a particular policy was overpriced or underpriced. Other forecasts may be nonverifiable because they are conditional. “If we go to war, we will be crushed” is an important prediction, but it is likely to remain untested (we hope). Or forecasts may be too long term for the professionals who make them to be brought to account—like, for instance, an estimate of mean temperatures by the end of the twenty-first century.
The essential feature of this internal signal is that the sense of coherence is part of the experience of judgment. It is not contingent on a real outcome. As a result, the internal signal is just as available for nonverifiable judgments as it is for real, verifiable ones. This explains why making a judgment about a fictitious character like Gambardi feels very much the same as does making a judgment about the real world.
If a weather forecaster said today’s high temperature would be seventy degrees Fahrenheit and it is sixty-five degrees, the forecaster made an error of plus five degrees. Evidently, this approach does not work for nonverifiable judgments like the Gambardi problem, which have no true outcome. How, then, are we to decide what constitutes good judgment? The answer is that there is a second way to evaluate judgments. This approach applies both to verifiable and nonverifiable ones. It consists in evaluating the process of judgment. When we speak of good or bad judgments, we may be speaking either about the output (e.g., the number you produced in the Gambardi case) or about the process—what you did to arrive at that number.
Focusing on the process of judgment, rather than its outcome, makes it possible to evaluate the quality of judgments that are not verifiable, such as judgments about fictitious problems or long-term forecasts. We may not be able to compare them to a known outcome, but we can still tell whether they have been made incorrectly. And when we turn to the question of improving judgments rather than just evaluating them, we will focus on process, too. All the procedures we recommend in this book to reduce bias and noise aim to adopt the judgment process that would minimize error over an ensemble of similar cases.
We have contrasted two ways of evaluating a judgment: by comparing it to an outcome and by assessing the quality of the process that led to it. Note that when the judgment is verifiable, the two ways of evaluating it may reach different conclusions in a single case. A skilled and careful forecaster using the best possible tools and techniques will often miss the correct number in making a quarterly inflation forecast. Meanwhile, in a single quarter, a dart-throwing chimpanzee will sometimes be right.
So far in this chapter, we have focused on predictive judgment tasks, and most of the judgments we will discuss are of that type.
Sentencing a felon is not a prediction. It is an evaluative judgment that seeks to match the sentence to the severity of the crime.
You may have noticed that the decomposition of system noise into level noise and pattern noise follows the same logic as the error equation in the previous chapter, which decomposed error into bias and noise. This time, the equation can be written as follows: (System Noise)² = (Level Noise)² + (Pattern Noise)²
Level noise is variability in the average level of judgments by different judges. Pattern noise is variability in judges’ responses to particular cases.
“Level noise is when judges show different levels of severity. Pattern noise is when they disagree with one another on which defendants deserve more severe or more lenient treatment. And part of pattern noise is occasion noise—when judges disagree with themselves.”
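Here is a rough sketch of how such a decomposition can be computed from a noise-audit table (judges in rows, cases in columns). The formulas are my reading of the decomposition described above, using population variances; the book does not spell them out in these clippings.

```python
import numpy as np

def noise_decomposition(x):
    """x[j, c]: judgment of judge j on case c (all judges see the same cases).

    Returns (system_noise, level_noise, pattern_noise), where
    system_noise^2 = level_noise^2 + pattern_noise^2.
    """
    x = np.asarray(x, dtype=float)
    grand = x.mean()
    judge_means = x.mean(axis=1)                 # each judge's average level
    case_means = x.mean(axis=0)
    level_var = ((judge_means - grand) ** 2).mean()
    residual = x - judge_means[:, None] - case_means[None, :] + grand
    pattern_var = (residual ** 2).mean()         # judge-by-case interaction
    system_var = ((x - case_means[None, :]) ** 2).mean()  # scatter within each case
    return np.sqrt(system_var), np.sqrt(level_var), np.sqrt(pattern_var)

# Three judges sentencing the same four cases (months of prison, made-up numbers):
sentences = [[12, 36, 6, 60],
             [18, 30, 12, 48],
             [ 6, 48, 3, 72]]
s, l, p = noise_decomposition(sentences)
print(f"system {s:.1f}, level {l:.1f}, pattern {p:.1f}")  # s^2 = l^2 + p^2
```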
For instance, an ongoing quarterly survey asks the chief financial officers of US companies to estimate the annual return of the S&P 500 index for the next year. The CFOs provide two numbers: a minimum, below which they think there is a one-in-ten chance the actual return will be, and a maximum, which they believe the actual return has a one-in-ten chance of exceeding. Thus the two numbers are the bounds of an 80% confidence interval. Yet the realized returns fall in that interval only 36% of the time. The CFOs are far too confident in the precision of their forecasts.
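A calibration check like the one behind that 36% figure is easy to sketch. The helper and the toy numbers below are mine; the CFO survey data are not reproduced here.

```python
def interval_hit_rate(forecasts, outcomes):
    """Share of realized outcomes that fall inside each forecaster's
    stated 80% confidence interval (low, high). Well-calibrated
    forecasters should land near 0.80; the CFOs hit about 0.36.
    """
    hits = sum(low <= actual <= high
               for (low, high), actual in zip(forecasts, outcomes))
    return hits / len(outcomes)

# Toy data: (min, max) bounds for next-year S&P 500 return, in percent.
forecasts = [(2, 8), (0, 6), (4, 10), (-1, 5)]
outcomes = [9.5, 3.0, 12.0, 11.0]
print(interval_hit_rate(forecasts, outcomes))  # 0.25 here, far below 0.80
```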
One method to produce aggregate forecasts is to use prediction markets, in which individuals bet on likely outcomes and are thus incentivized to make the right forecasts.
(A well-known response to this criticism, sometimes attributed to John Maynard Keynes, is, “When the facts change, I change my mind. What do you do?”)
In short, what distinguishes the superforecasters isn’t their sheer intelligence; it’s how they apply it. The skills they bring to bear reflect the sort of cognitive style we described in chapter 18 as likely to result in better judgments, particularly a high level of “active open-mindedness.” Recall the test for actively open-minded thinking: it includes such statements as “People should take into consideration evidence that goes against their beliefs” and “It is more useful to pay attention to people who disagree with you than to pay attention to those who agree.” Clearly, people who score high on this test are not shy about updating their judgments (without overreacting) when new information becomes available.
They like a particular cycle of thinking: “try, fail, analyze, adjust, try again.”
Speaking of Selection and Aggregation
“Let’s take the average of four independent judgments—this is guaranteed to reduce noise by half.”
“We should strive to be in perpetual beta, like the superforecasters.”
“Before we discuss this situation, what is the relevant base rate?”
“We have a good team, but how can we ensure more diversity of opinions?”
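The first quote above is worth unpacking: if four judgments are independent and equally noisy, the standard deviation of their average is half that of a single judgment (SD/√n with n = 4). A small Monte Carlo sketch with made-up numbers:

```python
import random, statistics

# Check the claim: averaging four independent, equally noisy judgments
# cuts noise roughly in half, since SD(mean of n) = SD / sqrt(n).
random.seed(0)
true_value, sd = 100.0, 20.0

single = [random.gauss(true_value, sd) for _ in range(100_000)]
averaged = [statistics.mean(random.gauss(true_value, sd) for _ in range(4))
            for _ in range(100_000)]

print(statistics.stdev(single))    # ~20
print(statistics.stdev(averaged))  # ~10
```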
Speaking of Structure in Hiring
“In traditional, informal interviews, we often have an irresistible, intuitive feeling of understanding the candidate and knowing whether the person fits the bill. We must learn to distrust that feeling.”
“Traditional interviews are dangerous not only because of biases but also because of noise.”
“We must add structure to our interviews and, more broadly, to our selection processes. Let’s start by defining much more clearly and specifically what we are looking for in candidates, and let’s make sure we evaluate the candidates independently on each of these dimensions.”
Overall, he explained, the goal was to make evaluations as comparative as possible, because relative judgments are better than absolute ones.
Table 4: Main steps of the mediating assessments protocol
1. At the beginning of the process, structure the decision into mediating assessments. (For recurring judgments, this is done only once.)
2. Ensure that whenever possible, mediating assessments use an outside view. (For recurring judgments: use relative judgments, with a case scale if possible.)
3. In the analytical phase, keep the assessments as independent of one another as possible.
4. In the decision meeting, review each assessment separately.
5. On each assessment, ensure that participants make their judgments individually; then use the estimate-talk-estimate method.
6. To make the final decision, delay intuition, but don’t ban it.
“We have a structured process to make hiring decisions. Why don’t we have one for strategic decisions? After all, options are like candidates.”
“This is a difficult decision. What are the mediating assessments it should be based on?”
“Our intuitive, holistic judgment about this plan is very important—but let’s not discuss it yet. Our intuition will serve us much better once it is informed by the separate assessments we have asked for.”
Speaking of Dignity
“People value and even need face-to-face interactions. They want a real human being to listen to their concerns and complaints and to have the power to make things better. Sure, those interactions will inevitably produce noise. But human dignity is priceless.”
“Moral values are constantly evolving. If we lock everything down, we won’t make space for changing values. Some efforts to reduce noise are just too rigid; they would prevent moral change.”
“If you want to deter misconduct, you should tolerate some noise. If students are left wondering about the penalty for plagiarism, great—they will avoid plagiarizing. A little uncertainty in the form of noise can magnify deterrence.”
“If we eliminate noise, we might end up with clear rules, which wrongdoers will find ways to avoid. Noise can be a price worth paying if it is a way of preventing strategic or opportunistic behavior.”
“Creative people need space. People aren’t robots. Whatever your job, you deserve some room to maneuver. If you’re hemmed in, you might not be noisy, but you won’t have much fun and you won’t be able to bring your original ideas to bear.”
“In the end, most of the efforts to defend noise aren’t convincing. We can respect people’s dignity, make plenty of space for moral evolution, and allow for human creativity without tolerating the unfairness and cost of noise.”
Still, it’s complicated. Rules may be straightforward to apply once they are in place, but before a rule is put in place, someone has to decide what it is. Producing a rule can be hard. Sometimes it is prohibitively costly. Legal systems and private companies therefore often use words such as reasonable, prudent, and feasible. This is also why terms like these play an equally important role in fields such as medicine and engineering. The costs of errors refer to the number and the magnitude of mistakes. A pervasive question is whether agents are knowledgeable and reliable, and whether they practice decision hygiene. If they are, and if they do, then a standard might work just fine—and there might be little noise. Principals need to impose rules when they have reason to distrust their agents. If agents are incompetent or biased and if they cannot feasibly implement decision hygiene, then they should be constrained by rules. Sensible organizations well understand that the amount of discretion they grant is closely connected with the level of trust they have in their agents.
Speaking of Rules and Standards
“Rules simplify life, and reduce noise. But standards allow people to adjust to the particulars of the situations.”
“Rules or standards? First, ask which produces more mistakes. Then, ask which is easier or more burdensome to produce or work with.”
“We often use standards when we should embrace rules—simply because we don’t pay attention to noise.”
“Noise reduction shouldn’t be part of the Universal Declaration of Human Rights—at least not yet. Still, noise can be horribly unfair. All over the world, legal systems should consider taking strong steps to reduce it.”
Errors: Bias and Noise
We say that bias exists when most errors in a set of judgments are in the same direction. Bias is the average error, as, for example, when a team of shooters consistently hits below and to the left of the target; when executives are too optimistic about sales, year after year; or when a company keeps reinvesting money in failing projects that it should write off. Eliminating bias from a set of judgments will not eliminate all error. The errors that remain when bias is removed are not shared. They are the unwanted divergence of judgments, the unreliability of the measuring instrument we apply to reality. They are noise. Noise is variability in judgments that should be identical. We use the term system noise for the noise observed in organizations that employ interchangeable professionals to make decisions, such as physicians in an emergency room, judges imposing criminal penalties, and underwriters in an insurance company. Much of this book has been concerned with system noise.
Noise in a system can be assessed by a noise audit, an experiment in which several professionals make independent judgments of the same cases (real or fictitious). We can measure noise without knowing a true value, just as we can see, from the back of the target, the scatter of a set of shots.
Noise Is a Problem
Variability as such is unproblematic in some judgments, even welcome. Diversity of opinions is essential for generating ideas and options. Contrarian thinking is essential to innovation. A plurality of opinions among movie critics is a feature, not a bug. Disagreements among traders make markets. Strategy differences among competing start-ups enable markets to select the fittest.
In what we call matters of judgment, however, system noise is always a problem. If two doctors give you different diagnoses, at least one of them is wrong.
The surprises that motivated this book are the sheer magnitude of system noise and the amount of damage that it does. Both of these far exceed common expectations. We have given examples from many fields, including business, medicine, criminal justice, fingerprint analysis, forecasting, personnel ratings, and politics. Hence our...
System noise can be broken down into level noise and pattern noise. Some judges are generally more severe than others, and others are more lenient; some forecasters are generally bullish and others bearish about market prospects; some doctors prescribe more antibiotics than others do. Level noise is the variability of the average judgments made by different individuals. The ambiguity of judgment scales is one of the sources of level noise.
System noise includes another, generally larger component. Regardless of the average level of their judgments, two judges may differ in their views of which crimes deserve the harsher sentences. Their sentencing decisions will produce a different ranking of cases. We call this variability pattern noise (the technical term is statistical interaction).
This stable pattern noise reflects the uniqueness of judges: their response to cases is as individual as their personality. The subtle differences among people are often enjoyable and interesting, but the differences become problematic when professionals operate within a system that assumes consistency.
Still, judges’ distinctive attitudes to particular cases are not perfectly stable. Pattern noise also has a transient component, called occasion noise. We detect this kind of noise if a radiologist assigns different diagnoses to the same image on different days or if a fingerprint examiner identifies two prints as a match on one occasion but not on another. As these examples illustrate, occasion noise is most easily measured when the judge does not recognize the case as one seen before. Another way to demonstrate occasion noise is to show the effect of an irrelevant feature of the context on judgments, such as when judges are more lenient after their favorite football team won, or when doctors prescribe more opioids in the afternoon.
There is a limit to the accuracy of our predictions, and this limit is often quite low. Nevertheless, we are generally comfortable with our judgments. What gives us this satisfying confidence is an internal signal, a self-generated reward for fitting the facts and the judgment into a coherent story. Our subjective confidence in our judgments is not necessarily related to their objective accuracy.
As we subjectively experience it, judgment is a subtle and complex process; we have no indication that the subtlety may be mostly noise. It is difficult for us to imagine that mindless adherence to simple rules will often achieve higher accuracy than we can—but this is by now a well-established fact.
Large individual differences emerge when a judgment requires the weighting of multiple, conflicting cues.
Noise is not a prominent problem. It is rarely discussed, and it is certainly less salient than bias. You probably had not given it much thought. Given its importance, the obscurity of noise is an interesting phenomenon in and of itself.
Bias has a kind of explanatory charisma, which noise lacks. If we try to explain, in hindsight, why a particular decision was wrong, we will easily find bias and never find noise. Only a statistical view of the world enables us to see noise, but that view does not come naturally—we prefer causal stories. The absence of statistical thinking from our intuitions is one reason that noise receives so much less attention than bias does.
Another reason is that professionals seldom see a need to confront noise in their own judgments and in those of their colleagues. After a period of training, professionals often make judgments on their own. Fingerprint experts, experienced underwriters, and veteran patent officers rarely take time to imagine how colleagues might disagree with them—and they spend even less time imagining how they might disagree with themselves.
One strategy for error reduction is debiasing. Typically, people attempt to remove bias from their judgments either by correcting judgments after the fact or by taming biases before they affect judgments. We propose a third option, which is particularly applicable to decisions made in a group setting: detect biases in real time, by designating a decision observer to identify signs of bias.
A noise-reduction effort in an organization should always begin with a noise audit (see appendix A). An important function of the audit is to obtain a commitment of the organization to take noise seriously. An essential benefit is the assessment of separate types of noise.
The goal of judgment is accuracy, not individual expression. This statement is our candidate for the first principle of decision hygiene in judgment.
A radical application of this principle is the replacement of judgment with rules or algorithms.
Think statistically, and take the outside view of the case. We say that a judge takes the outside view of a case when she considers it as a member of a reference class of similar cases rather than as a unique problem. This approach diverges from the default mode of thinking, which focuses firmly on the case at hand and embeds it in a causal story. When people apply their unique experiences to form a unique view of the case, the result is pattern noise. The outside view is a remedy for this problem: professionals who share the same reference class will be less noisy. In addition, the outside view often yields valuable insights.
People cannot be faulted for failing to predict the unpredictable, but they can be blamed for a lack of predictive humility.
Structure judgments into several independent tasks. This divide-and-conquer principle is made necessary by the psychological mechanism we have described as excessive coherence, which causes people to distort or ignore information that does not fit a preexisting or emerging story. Overall accuracy suffers when impressions of distinct aspects of a case contaminate each other. For an analogy, think of what happens to the evidentiary value of a set of witnesses when they are allowed to communicate.
Resist premature intuitions. We have described the internal signal of judgment completion that gives decision makers confidence in their judgment. The unwillingness of decision makers to give up this rewarding signal is a key reason for the resistance to the use of guidelines and algorithms and other rules that tie their hands. Decision makers clearly need to be comfortable with their eventual choice and to attain the rewarding sense of intuitive confidence. But they should not grant themselves this reward prematurely. An intuitive choice that is informed by a balanced and careful consideration of the evidence is far superior to a snap judgment. Intuition need not be banned, but it should be informed, disciplined, and delayed.
This principle inspires our recommendation to sequence the information: professionals who make judgments should not be given information that they don’t need and that could bias them, even if that information is accurate. In forensic science, for example, it is good practice to keep examiners unaware of other information about a suspect. Control of discussion agendas, a key element of the mediating assessments protocol, also belongs here. An efficient agenda will ensure that different aspects of the problem are considered separately and that the formation of a holistic judgment is delayed until the profile of assessments is complete.
Obtain independent judgments from multiple judges, then consider aggregating those judgments. The requirement of independence is routinely violated in the procedures of organizations, notably in meetings in which participants’ opinions are shaped by those of others. Because of cascade effects and group polarization, group discussions often increase noise. The simple procedure of collecting participants’ judgments bef...
Favor relative judgments and relative scales. Relative judgments are less noisy than absolute ones, because our ability to categorize objects on a scale is limited, while our ability to make pairwise comparisons is much better. Judgment scales that call for comparisons will be less noisy than scales that require absolute judgments. For example, a case scale requires judges to locate a case on a scale that is defined by instances familiar to everyone.
The decision hygiene principles we have just listed are applicable not only to recurrent judgments but also to one-off major decisions, or what we call singular decisions.
Enforcing decision hygiene can be thankless. Noise is an invisible enemy, and a victory against an invisible enemy can only be an invisible victory. But like physical health hygiene, decision hygiene is vital. After a successful operation, you like to believe that it is the surgeon’s skill that saved your life—and it did, of course—but if the surgeon and all the personnel in the operating room had not washed their hands, you might be dead.
Of course, the battle against noise is not the only consideration for decision makers and organizations. Noise may be too costly to reduce: a high school could eliminate noise in grading by having five teachers read each and every paper, but that burden is hardly justified.
Perhaps most importantly, noise-reduction strategies may have unacceptable downsides. Many concerns about algorithms are overblown, but some are legitimate. Algorithms may produce stupid mistakes that a human would never make, and therefore lose credibility even if they also succeed in preventing many errors that humans do make.
The problem is that in the absence of noise audits, people are unaware of how much noise there is in their judgments. When that is the case, invoking the difficulty of reducing noise is nothing but an excuse not to measure it.
Bias leads to errors and unfairness. Noise does too—and yet, we do a lot less about it. Judgment error may seem more tolerable when it is random than when we attribute it to a cause; but it is no less damaging. If we want better decisions about things that matter, we should take noise reduction seriously.
Once the executives accept the design of the noise audit, the project team should ask them to state their expectations about the results of the study. They should discuss questions such as:
• “What level of disagreement do you expect between a randomly selected pair of answers to each case?”
• “What is the maximum level of disagreement that would be acceptable from a business perspective?”
• “What is the estimated cost of getting an evaluation wrong in either direction (too high or low) by a specified amount (e.g., 15%)?”
The answers to these questions should be documented to ensure that they are remembered and believed when the actual results of the audit come in.
The managers of the audited unit should be, from the beginning, informed in general terms that their unit has been selected for special study. However, it is important that the term noise audit not be used to describe the project. The words noise and noisy should be avoided, especially as descriptions of people. A neutral term such as decision-making study should be used instead. The managers of the unit will be immediately in charge of the data collection and responsible for briefing the participants about the task, with the participation of the project manager and members of the project team. The intent of the exercise should be described to the participants in general terms, as in “The organization is interested in how [decision makers] reach their conclusions.” It is essential to reassure the professionals who participate in the study that individual answers will not be known to anyone in the organization, including the project team. If necessary, an outside firm may be hired to anonymize the data. It is also important to stress that there will be no specific consequences for the unit, which was merely selected as representative of units that perform judgment tasks on behalf of the organization.
Bias Observation Checklist
1. APPROACH TO JUDGMENT
1a. Substitution
“Did the group’s choice of evidence and the focus of their discussion indicate substitution of an easier question for the difficult one they were assigned?”
“Did the group neglect an important factor (or appear to give weight to an irrelevant one)?”
1b. Inside view
“Did the group adopt the outside view for part of its deliberations and seriously attempt to apply comparative rather than absolute judgment?”
1c. Diversity of views
“Is there any reason to suspect that members of the group share biases, which could lead their errors to be correlated? Conversely, can you think of a relevant point of view or expertise that is not represented in this group?”
2. PREJUDGMENTS AND PREMATURE CLOSURE
2a. Initial prejudgments
“Do (any of) the decision makers stand to gain more from one conclusion than another?”
“Was anyone already committed to a conclusion? Is there any reason to suspect prejudice?”
“Did dissenters express their views?”
“Is there a risk of escalating commitment to a losing course of action?”
2b. Premature closure; excessive coherence
“Was there accidental bias in the choice of considerations that were discussed early?”
“Were alternatives fully considered, and was evidence that would support them actively sought?”
“Were uncomfortable data or opinions suppressed or neglected?”
3. INFORMATION PROCESSING
3a. Availability and salience
“Are the participants exaggerating the relevance of an event because of its recency, its dramatic quality, or its personal relevance, even if it is not diagnostic?”
3b. Inattention to quality of information
“Did the judgment rely heavily on anecdotes, stories, or analogies? Did the data confirm them?”
3c. Anchoring
“Did numbers of uncertain accuracy or relevance play an important role in the final judgment?”
3d. Nonregressive prediction
“Did the participants make nonregressive extrapolations, estimates, or forecasts?”
4. DECISION
4a. Planning fallacy
“When forecasts were used, did people question their sources and validity? Was the outside view used to challenge the forecasts?”
“Were confidence intervals used for uncertain numbers? Are they wide enough?”
4b. Loss aversion
“Is the risk appetite of the decision makers aligned with that of the organization? Is the decision team overly cautious?”
4c. Present bias
“Do the calculations (including the discount rate used) reflect the organization’s balance of short- and long-term priorities?”
Averaging two guesses by the same person does not improve judgments as much as does seeking out an independent second opinion. As Vul and Pashler put it, “You can gain about 1/10th as much from asking yourself the same question twice as you can from getting a second opinion from someone else.” This is not a large improvement. But you can make the effect much larger by waiting to make a second guess.
When Vul and Pashler let three weeks pass before asking their subjects the same question again, the benefit rose to one-third the value of a second opinion.
This request required the subjects to think actively of information they had not considered the first time. The instructions to participants read as follows: First, assume that your first estimate is off the mark. Second, think about a few reasons why that could be. Which assumptions and considerations could have been wrong? Third, what do these new considerations imply? Was the first estimate rather too high or too low? Fourth, based on this new perspective, make a second, alternative estimate.
Herzog and Hertwig then averaged the two estimates thus produced. Their technique, which they named dialectical bootstrapping, produced larger improvements in accuracy than did a simple request for a second estimate immediately following the first. Because the participants forced themselves to consider the question in a new light, they sampled another, more different version of themselves—two “members” of the “crowd within” who were further apart. As a result, their average produced a more accurate estimate of the truth. The gain in accuracy with two immediately consecutive “dialectical” estimates was about half the value of a second opinion.
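One way to see why the “crowd within” helps less than a real second opinion: the gain from averaging two guesses depends on how correlated their errors are. The toy simulation below is my own illustration, not the studies’ data; it treats an immediate second guess as highly correlated with the first, a delayed or “dialectical” guess as less correlated, and an independent second opinion as uncorrelated.

```python
import random, statistics

def noise_after_averaging(error_correlation, n=50_000, sd=10.0):
    """Noise (SD of error) after averaging two guesses whose errors share
    a common component with the given correlation. Toy model only."""
    random.seed(0)
    shared_sd = sd * error_correlation ** 0.5
    own_sd = sd * (1 - error_correlation) ** 0.5
    errors = []
    for _ in range(n):
        shared = random.gauss(0, shared_sd)      # error component both guesses share
        g1 = shared + random.gauss(0, own_sd)    # first guess's error
        g2 = shared + random.gauss(0, own_sd)    # second guess's error
        errors.append((g1 + g2) / 2)
    return statistics.stdev(errors)

# rho = 1.0: asking yourself again right away (almost no gain, ~10)
# rho = 0.0: a genuinely independent second opinion (full gain, ~7.1)
for rho in (1.0, 0.8, 0.5, 0.0):
    print(rho, round(noise_after_averaging(rho), 2))
```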
And here, the effects are not those you might imagine. Being in a good mood is a mixed blessing, and bad moods have a silver lining. The costs and benefits of different moods are situation-specific.
In a negotiation situation, for instance, good mood helps. People in a good mood are more cooperative and elicit reciprocation. They tend to end up with better results than do unhappy negotiators.
On the other hand, a good mood makes us more likely to accept our first impressions as true without challenging them.
As you can guess, this is a test of the readers’ vulnerability to stereotypes: do people rate the essay more favorably when it is attributed to a middle-aged man than they do when they believe that a young woman wrote it? They do, of course. But importantly, the difference is larger in the good-mood condition. People who are in a good mood are more likely to let their biases affect their thinking.
Inducing good moods makes people more receptive to bullshit and more gullible in general; they are less apt to detect deception or identify misleading information. Conversely, eyewitnesses who are exposed to misleading information are better able to disregard it—and to avoid false testimony—when they are in a bad mood.
However, when the subjects were placed in a positive mood—induced by watching a five-minute video segment—they became three times more likely to say that they would push the man off the bridge. Whether we regard “Thou shalt not kill” as an absolute principle or are willing to kill one stranger to save five should reflect our deepest values. Yet our choice seems to depend on what video clip we have just watched.
A study of nearly seven hundred thousand primary care visits, for instance, showed that physicians are significantly more likely to prescribe opioids at the end of a long day.
Other studies showed that, toward the end of the day, physicians are more likely to prescribe antibiotics and less likely to prescribe flu shots.
Bad weather is associated with improved memory; judicial sentences tend to be more severe when it is hot outside; and stock market performance is affected by sunshine.
Uri Simonsohn showed that college admissions officers pay more attention to the academic attributes of candidates on cloudier days and are more sensitive to nonacademic attributes on sunnier days. The title of the article in which he reported these findings is memorable enough: “Clouds Make Nerds Look Good.”
After a streak, or a series of decisions that go in the same direction, decision makers are more likely to decide in the opposite direction than would be strictly justified. As a result, errors (and unfairness) are inevitable. Asylum judges in the United States, for instance, are 19% less likely to grant asylum to an applicant when the previous two cases were approved. A person might be approved for a loan if the previous two applications were denied, but the same person might have been rejected if the previous two applications had been granted. This behavior reflects a cognitive bias known as the gambler’s fallacy: we tend to underestimate the likelihood that streaks will occur by chance.
How large is occasion noise relative to total system noise?
As noted, for instance, the chance that an asylum applicant will be admitted in the United States drops by 19% if the hearing follows two successful ones by the same judge. This variability is certainly troubling. But it pales in comparison with the variability between judges: in one Miami courthouse, Jaya Ramji-Nogales and her co-authors found that one judge would grant asylum to 88% of applicants and another to only 5%.
Similarly, fingerprint examiners and physicians sometimes disagree with themselves, but they do so less often than they disagree with others. In every case we reviewed in which the share of occasion noise in total system noise could be measured, occasion noise was a smaller contributor than were differences among individuals.
It is very likely that intrinsic variability in the functioning of the brain also affects the quality of our judgments in ways that we cannot possibly hope to control.
Speaking of Occasion Noise
“Judgment is like a free throw: however hard we try to repeat it precisely, it is never exactly identical.”
“Your judgment depends on what mood you are in, what cases you have just discussed, and even what the weather is. You are not the same person at all times.”
“Although you may not be the same person you were last week, you are less different from the ‘you’ of last week than you are from someone else today. Occasion noise is not the largest source of system noise.”
How Groups Amplify Noise
Groups can go in all sorts of directions, depending in part on factors that should be irrelevant. Who speaks first, who speaks last, who speaks with confidence, who is wearing black, who is seated next to whom, who smiles or frowns or gestures at the right moment—all these factors, and many more, affect outcomes.
It might seem odd to emphasize this point, since we noted in the previous chapter that aggregating the judgments of multiple individuals reduces noise. But because of group dynamics, groups can add noise, too. There are “wise crowds,” whose mean judgment is close to the correct answer, but there are also crowds that follow tyrants, that fuel market bubbles, that believe in magic, or that are under the sway of a shared illusion.
They were testing for a particular driver of noise: social influence. The key finding was that group rankings were wildly disparate: across different groups, there was a great deal of noise. In one group, “Best Mistakes” could be a spectacular success, while “I Am Error” could flop. In another group, “I Am Error” could do exceedingly well, and “Best Mistakes” could be a disaster. If a song benefited from early popularity, it could do really well. If it did not get that benefit, the outcome could be very different. To be sure, the very worst songs (as established by the control group) never ended up at the very top, and the very best songs never ended up at the very bottom.
Remarkably, this effect persisted over time. After five months, a single positive initial vote artificially increased the mean rating of comments by 25%. The effect of a single positive early vote is a recipe for noise. Whatever the reason for that vote, it can produce a large-scale shift in overall popularity.
Research has revealed exactly that problem. In simple estimation tasks—the number of crimes in a city, population increases over specified periods, the length of a border between nations—crowds were indeed wise as long as they registered their views independently. But if they learned the estimates of other people—for example, the average estimate of a group of twelve—the crowd did worse. As the authors put it, social influences are a problem because they reduce “group diversity without diminishing the collective error.”
The irony is that while multiple independent opinions, properly aggregated, can be strikingly accurate, even a little social influence can produce a kind of herding that undermines the wisdom of crowds.
To find out, we followed up the first experiment with another, this one involving more than three thousand jury-eligible citizens and more than five hundred six-person juries. The results were straightforward. Looking at the same case, deliberating juries were far noisier than statistical juries—a clear reflection of social influence noise. Deliberation had the effect of increasing noise.
There was another intriguing finding. When the median member of a six-person group was only moderately outraged and favored a lenient punishment, the verdict of the deliberating jury typically ended up more lenient still. When, on the contrary, the median member of a six-person group was quite outraged and expressed a severe punitive intent, the deliberating jury typically ended up more outraged and more severe still.
Indeed, 27% of juries chose an award as high as, or even higher than, that of their most severe member. Not only were deliberating juries noisier than statistical juries, but they also accentuated the opinions of the individuals composing them.
The explanations for group polarization are, in turn, similar to the explanations for cascade effects. Information plays a major role. If most people favor a severe punishment, then the group will hear many arguments in favor of severe punishment—and fewer arguments the other way. If group members are listening to one another, they will shift in the direction of the dominant tendency, rendering the group more unified, more confident, and more extreme. And if people care about their reputation within the group, they will shift in the direction of the dominant tendency, which will also produce polarization.
The study we used for these cases was a clear example of Meehl’s pattern. As we noted, clinical predictions achieved a .15 correlation (PC = 55%) with job performance, but mechanical prediction achieved a correlation of .32 (PC = 60%). Think about the confidence that you experienced in the relative merits of the cases of Monica and Nathalie. Meehl’s results strongly suggest that any satisfaction you felt with the quality of your judgment was an illusion: the illusion of validity.
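“PC” here is the percentage of concordant pairs: the chance that, of two randomly chosen cases, the one the prediction ranks higher is also higher on the outcome. My understanding is that the book converts correlations to PC with the standard bivariate-normal formula, which reproduces the numbers quoted in these notes; the formula itself is not stated in the clippings.

```python
import math

def percent_concordant(r):
    """Convert a correlation into PC (percent concordant pairs),
    assuming bivariate normality: PC = 50% + arcsin(r) / pi."""
    return 0.5 + math.asin(r) / math.pi

for r in (0.15, 0.20, 0.32):
    print(r, f"{percent_concordant(r):.0%}")   # ~55%, ~56%, ~60%
```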
“Massive and consistent” is a fair description. A 2000 review of 136 studies confirmed unambiguously that mechanical aggregation outperforms clinical judgment.
Since the model is a crude approximation of the judge, we could sensibly assume that it cannot perform as well. How much accuracy is lost when the model replaces the judge? The answer may surprise you. Predictions did not lose accuracy when the model generated predictions. They improved. In most cases, the model out-predicted the professional on which it was based. The ersatz was better than the original product.
The model-of-the-judge studies reinforce Meehl’s conclusion that the subtlety is largely wasted. Complexity and richness do not generally lead to more accurate predictions.
In addition, a simple model of you will not represent the pattern noise in your judgments. It cannot replicate the positive and negative errors that arise from arbitrary reactions you may have to a particular case. Neither will the model capture the influences of the momentary context and of your mental state when you make a particular judgment. Most likely, these noisy errors of judgment are not systematically correlated with anything, which means that for most purposes, they can be considered random.
In short, replacing you with a model of you does two things: it eliminates your subtlety, and it eliminates your pattern noise. The robust finding that the model of the judge is more valid than the judge conveys an important message: the gains from subtle rules in human judgment—when they exist—are generally not sufficient to compensate for the detrimental effects of noise. You may believe that you are subtler, more insightful, and more nuanced than the linear caricature of your thinking. But in fact, you are mostly noisier.
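The “model of the judge” is easy to demonstrate. The sketch below uses toy data, not the original studies: it fits an ordinary least-squares model to a judge’s own ratings and then checks whether that noise-free linear stand-in correlates better with the outcome than the judge does.

```python
import numpy as np

def model_of_the_judge(cues, judgments):
    """Fit a linear model of the judge: predict the judge's own ratings
    from the case cues (ordinary least squares with an intercept), and
    return the model's predictions for the same cases."""
    X = np.column_stack([np.ones(len(cues)), cues])
    weights, *_ = np.linalg.lstsq(X, judgments, rcond=None)
    return X @ weights

def corr(a, b):
    return float(np.corrcoef(a, b)[0, 1])

# Toy data: cues, the judge's noisy holistic ratings, and true outcomes.
rng = np.random.default_rng(1)
cues = rng.normal(size=(200, 3))
signal = cues @ np.array([0.5, 0.3, 0.2])
outcomes = signal + rng.normal(scale=1.0, size=200)
judge = signal + rng.normal(scale=1.0, size=200)   # the judge adds noise of their own

model = model_of_the_judge(cues, judge)
print("judge vs outcome:", round(corr(judge, outcomes), 2))
print("model vs outcome:", round(corr(model, outcomes), 2))   # typically higher
```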
Still, the fact that mechanical adherence to a simple rule (Yu and Kuncel call it “mindless consistency”) could significantly improve judgment in a difficult problem illustrates the massive effect of noise on the validity of clinical predictions.
Speaking of Judgments and Models
“People believe they capture complexity and add subtlety when they make judgments. But the complexity and the subtlety are mostly wasted—usually they do not add to the accuracy of simple models.”
“More than sixty years after the publication of Paul Meehl’s book, the idea that mechanical prediction is superior to people is still shocking.”
“There is so much noise in judgment that a noise-free model of a judge achieves more accurate predictions than the actual judge does.”
The loss of accuracy in cross-validation is worst when the original sample is small, because flukes loom larger in small samples. The problem Dawes pointed out is that the samples used in social science research are generally so small that the advantage of so-called optimal weighting disappears. As statistician Howard Wainer memorably put it in the subtitle of a scholarly article on the estimation of proper weights, “It Don’t Make No Nevermind.” Or, in Dawes’s words, “we do not need models more precise than our measurements.” Equal-weight models do well because they are not susceptible to accidents of sampling.
The immediate implication of Dawes’s work deserves to be widely known: you can make valid statistical predictions without prior data about the outcome that you are trying to predict. All you need is a collection of predictors that you can trust to be correlated with the outcome.
The final sentence of the seminal article that introduced the idea offered another pithy summary: “The whole trick is to decide what variables to look at and then to know how to add.”
Because, in real life, predictors are almost always correlated to one another, this statistical fact supports the use of frugal approaches to prediction, which use a small number of predictors.
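A minimal equal-weight (“improper”) model in that spirit is sketched below, under the assumption that all you can supply is the direction in which each predictor should count; the helper name and the data are hypothetical.

```python
import statistics

def equal_weight_score(cases, signs):
    """Equal-weight linear model: standardize each predictor, give every
    one the same weight, and add them up. `signs` says whether more of a
    predictor should raise (+1) or lower (-1) the score."""
    columns = list(zip(*cases))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.pstdev(col) for col in columns]
    return [sum(sign * (value - m) / sd
                for value, sign, m, sd in zip(case, signs, means, sds))
            for case in cases]

# Toy example: candidates described by (test score, years of experience, absences).
candidates = [(82, 4, 1), (90, 2, 3), (75, 8, 0)]
print(equal_weight_score(candidates, signs=(+1, +1, -1)))
```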
The model that the researchers built uses just two inputs known to be highly predictive of a defendant’s likelihood to jump bail: the defendant’s age (older people are lower flight risks) and the number of past court dates missed (people who have failed to appear before tend to do so again). The model translates these two inputs into a number of points, which can be used as a risk score. The calculation of risk for a defendant does not require a computer—in fact, not even a calculator. When tested against a real data set, this frugal model performed as well as statistical models that used a much larger number of variables. The frugal model did better than virtually all human bail judges did in predicting flight risk.
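To illustrate what such a point-based score can look like, here is a back-of-the-envelope sketch. The two predictors match the ones described above, but the cutoffs and point values are invented for illustration and are not the researchers’ actual rule.

```python
def flight_risk_points(age, prior_failures_to_appear):
    """Toy two-variable risk score; higher means higher predicted risk
    of failing to appear. All weights below are assumptions."""
    points = 0
    if age < 25:
        points += 2          # younger defendants: higher flight risk (assumed)
    elif age < 40:
        points += 1
    points += min(prior_failures_to_appear, 4)   # each missed court date adds a point, capped
    return points

print(flight_risk_points(age=22, prior_failures_to_appear=3))  # 5
print(flight_risk_points(age=55, prior_failures_to_appear=0))  # 0
```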
As humans, we are keenly aware that we make mistakes, but that is a privilege we are not prepared to share. We expect machines to be perfect. If this expectation is violated, we discard them.
Speaking of Rules and Algorithms
“When there is a lot of data, machine-learning algorithms will do better than humans and better than simple models. But even the simplest rules and algorithms have big advantages over human judges: they are free of noise, and they do not attempt to apply complex, usually invalid insights about the predictors.”
“Since we lack data about the outcome we must predict, why don’t we use an equal-weight model? It will do almost as well as a proper model, and will surely do better than case-by-case human judgment.”
“You disagree with the model’s forecast. I get it. But is there a broken leg here, or do you just dislike the prediction?”
“The algorithm makes mistakes, of course. But if human judges make even more mistakes, whom should we trust?”
Both intractable uncertainty (what cannot possibly be known) and imperfect information (what could be known but isn’t) make perfect prediction impossible. These unknowns are not problems of bias or noise in your judgment; they are objective characteristics of the task.
In general, however, you can safely expect that people who engage in predictive tasks will underestimate their objective ignorance. Overconfidence is one of the best-documented cognitive biases.
Giving up the emotional reward of intuitive certainty is not easy. Tellingly, leaders say they are especially likely to resort to intuitive decision making in situations that they perceive as highly uncertain. When the facts deny them the sense of understanding and confidence they crave, they turn to their intuition to provide it. The denial of ignorance is all the more tempting when ignorance is vast.
When you explain an unexpected but unsurprising outcome in this way, the destination that is eventually reached always makes sense. This is what we mean by understanding a story, and this is what makes reality appear predictable—in hindsight. Because the event explains itself as it occurs, we are under the illusion that it could have been anticipated.
The search for causes is almost always successful because causes can be drawn from an unlimited reservoir of facts and beliefs about the world.
This continuous causal interpretation of reality is how we “understand” the world. Our sense of understanding life as it unfolds consists of the steady flow of hindsight in the valley of the normal.
As we know from classic research on hindsight, even when subjective uncertainty does exist for a while, memories of it are largely erased when the uncertainty is resolved.
At this point, all we need to emphasize is that the causal mode comes much more naturally to us. Even explanations that should properly be treated as statistical are easily turned into causal narratives.
As we will see, the preference for causal thinking also contributes to the neglect of noise as a source of error, because noise is a fundamentally statistical notion.
Speaking of the Limits of Understanding
“Correlations of about .20 (PC = 56%) are quite common in human affairs.”
“Correlation does not imply causation, but causation does imply correlation.”
“Most normal events are neither expected nor surprising, and they require no explanation.”
“In the valley of the normal, events are neither expected nor surprising—they just explain themselves.”
“We think we understand what is going on here, but could we have predicted it?”
The answers should be clearly different, but they are not, suggesting that a factor that should influence judgments is ignored. (This psychological bias is called scope insensitivity.)
(This error is base-rate neglect. Thinking of base rates is no more automatic for the authors of this book than for anyone else.)
Substituting a judgment of how easily examples come to mind for an assessment of frequency is known as the availability heuristic. The substitution of an easy judgment for a hard one is not limited to these examples. In fact, it is very common. Answering an easier question can be thought of as a general-purpose procedure for answering a question that could stump you. Consider how we tend to answer each of the following questions by using its easier substitute:
Do I think this surgeon is competent? → Does this individual speak with confidence and authority?
Will the project be completed on schedule? → Is it on schedule now?
Is nuclear energy necessary? → Do I recoil at the word nuclear?
Am I satisfied with my life as a whole? → What is my mood right now?
Regardless of the question, substituting one question for another will lead to an answer that does not give different aspects of the evidence their appropriate weights, and incorrect weighting of the evidence inevitably results in error.
This example illustrates a different type of bias, which we call conclusion bias, or prejudgment. Like Lucas, we often start the process of judgment with an inclination to reach a particular conclusion. When we do that, we let our fast, intuitive System 1 thinking suggest a conclusion. Either we jump to that conclusion and simply bypass the process of gathering and integrating information, or we mobilize System 2 thinking—engaging in deliberate thought—to come up with arguments that support our prejudgment. In that case, the evidence will be selective and distorted: because of confirmation bias and desirability bias, we will tend to collect and interpret evidence selectively to favor a judgment that, respectively, we already believe or wish to be true.
This experiment illustrates excessive coherence: we form coherent impressions quickly and are slow to change them.
Speaking of Heuristics, Biases, and Noise
“We know we have psychological biases, but we should resist the urge to blame every error on unspecified ‘biases.’”
“When we substitute an easier question for the one we should be answering, errors are bound to occur. For instance, we will ignore the base rate when we judge probability by similarity.”
“Prejudgments and other conclusion biases lead people to distort evidence in favor of their initial position.”
“We form impressions quickly and hold on to them even when contradictory information comes in. This tendency is called excessive coherence.”
“Psychological biases cause statistical bias if many people share the same biases. In many cases, however, people differ in their biases. In those cases, psychological biases create system noise.”
It appears that people are much more sensitive to the relative value of comparable goods than to their absolute value.
The law explicitly prohibits any communication to the jury of the size of punitive awards in other cases. The assumption implicit in the law is that jurors’ sense of justice will lead them directly from a consideration of an offense to the correct punishment. This assumption is psychological nonsense—it assumes an ability that humans do not have. The institutions of justice should acknowledge the limitations of the people who administer it.
First, the choice of a scale can make a large difference in the amount of noise in judgments, because ambiguous scales are noisy. Second, replacing absolute judgments with relative ones, when feasible, is likely to reduce noise.
Noise, on the other hand, is unpredictable error that we cannot easily see or explain. That is why we so often neglect it—even when it causes grave damage. For this reason, strategies for noise reduction are to debiasing what preventive hygiene measures are to medical treatment: the goal is to prevent an unspecified range of potential errors before they occur.
Noise is an invisible enemy, and preventing the assault of an invisible enemy can yield only an invisible victory.
Speaking of Debiasing and Decision Hygiene
“Do you know what specific bias you’re fighting and in what direction it affects the outcome? If not, there are probably several biases at work, and it is hard to predict which one will dominate.”
“Before we start discussing this decision, let’s designate a decision observer.”
“We have kept good decision hygiene in this decision process; chances are the decision is as good as it can be.”
Even the FBI, in its internal investigation of the Mayfield case, noted that “latent print examiners routinely conduct verifications in which they know the previous examiners’ results and yet those results do not influence the examiner’s conclusions.” These remarks essentially amount to a denial of the existence of confirmation bias. Even when they are aware of the risk of bias, forensic scientists are not immune to the bias blind spot: the tendency to acknowledge the presence of bias in others, but not in oneself. In a survey of four hundred professional forensic scientists in twenty-one countries, 71% agreed that “cognitive bias is a cause for concern in the forensic sciences as a whole,” but only 26% thought that their “own judgments are influenced by cognitive bias.” In other words, about half of these forensic professionals believe that their colleagues’ judgments are affected by cognitive bias but that their own are not.
Speaking of Sequencing Information
“Wherever there is judgment, there is noise—and that includes reading fingerprints.”
“We have more information about this case, but let’s not tell the experts everything we know before they make their judgment, so as not to bias them. In fact, let’s tell them only what they absolutely need to know.”
“The second opinion is not independent if the person giving it knows what the first opinion was. And the third one, even less so: there can be a bias cascade.”
“To fight noise, they first have to admit that it exists.”