Articles and Analysis


A Surrender of Judgment? (Conclusion)

Topics: 2006, The 2006 Race

[This post concludes my comments started yesterday in response to a column by Washington Post polling director Jon Cohen.]

We chose to average poll results here on Pollster -- even for dissimilar surveys that might show "house effects" due to differences in methodology -- because we believed it would help lessen the confusion that results from polling's inherent variability. We had seen how the simple averaging used by the site RealClearPolitics had worked in 2004, in particular the way their final averages in battleground states proved to be a better indicator of the leader in each state than the leaked exit poll estimates that got everyone so excited on Election Day.

As Carl Bialik's "Numbers Guy" column on Wall Street Journal Online shows, that approach proved itself again this year:

Taking an average of the five most recent polls for a given state, regardless of the author -- a measure compiled by Pollster.com -- yielded a higher accuracy rate than most individual pollsters.

And in fairness, while I have not crunched the numbers that Bialik has, I am assuming that the RealClearPolitics averages performed similarly this year.

Readers have often suggested more elaborate or esoteric alternatives and we considered many. But given the constraints of time and budget and the need to automate the process of generating the charts, maps and tables, we ultimately opted to stick with a relatively simple approach.

Regardless, our approach reflected our judgment about how best to aggregate many different polls while also minimizing the potential shortcomings of averaging. The important statistical issues are fairly straightforward. If a set of polls uses an identical methodology, averaging those polls will effectively pool their sample sizes and reduce random error, assuming no trend changes attitudes over the period in which those polls were fielded.
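As a rough illustration of the pooling point, here is a minimal sketch in Python. It assumes simple random samples and uses the textbook 95% margin-of-error formula for a proportion. The sample sizes are invented for illustration and do not come from any actual poll.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical sample sizes for five polls that share one methodology.
sample_sizes = [600, 600, 600, 600, 600]

single = margin_of_error(sample_sizes[0])
pooled = margin_of_error(sum(sample_sizes))

print(f"one poll   (n={sample_sizes[0]}): +/- {single * 100:.1f} points")
print(f"five polls (n={sum(sample_sizes)}): +/- {pooled * 100:.1f} points")
```

With equal sample sizes, pooling five polls shrinks the random error by a factor of the square root of five. That gain only holds if the polls really are measuring the same thing at the same time, which is the caveat about identical methodology and stable attitudes.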

In reality, of course, all polls are different and those differences sometimes produce house effects in the results. In theory, if we knew for certain that Pollsters A, B, C and D always produce "good" and accurate results, and Pollster E always produces skewed or biased results, then an average of all five would be less accurate than looking at any of the first four alone. The problem is that things are rarely that simple or obvious in the real world. In practice, house effects are usually only evident in retrospect. And in most cases, it is not obvious either before or after the election whether a particular effect -- such as a consistently higher or lower percentage of undecided voters -- automatically qualifies as inherently "bad."

So one reason we opted to average five polls (rather than a smaller number) is that any one odd poll would have a relatively small contribution to the average. Also, looking at the pace of polling in 2002, five polls seemed to be the right number to assure a narrow range of field dates toward the end of the campaign.
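The trade-off behind the choice of five can be sketched with a few lines of Python. The margins below are made up for illustration, and `last_n_average` is a hypothetical helper, not the code that actually generated the site's averages.

```python
def last_n_average(margins, n=5):
    """Average the n most recent margins (list ordered oldest to newest)."""
    recent = margins[-n:]
    return sum(recent) / len(recent)

# Hypothetical margins (leader minus trailer, in points), oldest first.
# The most recent poll is the odd one out.
margins = [5.0, 6.0, 4.0, 5.0, 15.0]
typical = margins[:-1] + [5.0]  # same list with the odd poll replaced

for n in (3, 5):
    shift = last_n_average(margins, n) - last_n_average(typical, n)
    print(f"last {n} polls: the odd poll moves the average by {shift:.1f} points")
```

A single poll's weight in the average is one over the number of polls, so the same outlier moves a five-poll average less than a three-poll average. The cost of a larger window is a wider span of field dates.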

We also decided from the beginning that the averages used to classify races (as toss-up, lean Democrat, etc.) would not include Internet surveys drawn from non-random panels. This judgment was based on our analysis of the Internet panel polls in 2004, which had shown a consistent statistical bias in favor of the Democrats. One consequence was that our averages excluded the surveys conducted by Polimetrix, Pollster.com's primary sponsor, a decision that did not exactly delight the folks who pay our bills and keep our site running smoothly. The fact that we made that call under those circumstances is one big reason why the "surrender judgment" comment irks me as much as it does.

Again, as many comments have already noted, we put a lot of effort into identifying and charting pollster house effects as they appeared in the data. On the Sunday before the election, we posted pollster comparison charts for each Senate race with at least 10 polls (22 in all). On that day, my blog post gave special attention to the fairly clear "house effect" involving SurveyUSA:

A good example is the Maryland Senate race (copied below). Note that the three automated polls by SurveyUSA have all shown the race virtually tied, while other polls (including the automated surveys from Rasmussen Reports) show a narrowing race, with Democrat Ben Cardin typically leading by roughly five percentage points.



Which brings me to Maryland. Jon Cohen is certainly right to point out that the Washington Post's survey ultimately provided a more accurate depiction of voters' likely preferences than the average of surveys released at about the same time. Democrat Ben Cardin won by ten percentage points (54% to 44%). The Post survey, conducted October 22-26, had Cardin ahead by 11 (54% to 43% with just 1% undecided and 1% choosing Green party candidate Kevin Zeese). Our final "last five poll average" had Cardin ahead by just three points (48.4% to 45.2%), a margin narrow enough to merit a "toss-up" rating.
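For readers who want to check the arithmetic, here is a small Python sketch that compares the two estimates on the margin, using only the percentages quoted in the paragraph above. The helper names are illustrative.

```python
# Percentages quoted above: (Cardin, Republican opponent).
actual = (54.0, 44.0)
post_poll = (54.0, 43.0)
last_five_average = (48.4, 45.2)

def margin(result):
    """Cardin's lead in percentage points."""
    return result[0] - result[1]

def margin_error(estimate, outcome):
    """Estimated lead minus actual lead; negative means the lead was understated."""
    return margin(estimate) - margin(outcome)

for name, estimate in [("Post poll", post_poll),
                       ("last-five average", last_five_average)]:
    error = margin_error(estimate, actual)
    print(f"{name}: lead {margin(estimate):.1f}, error {error:+.1f} points")
```

On this measure the Post poll overstated Cardin's lead by about one point, while the five-poll average understated it by roughly seven.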

So why were the averages of all the polls less accurate than one poll by the Washington Post? Unfortunately, in this case, one contributing factor was the mechanism we used to calculate the averages. As it happened, two of the "last 5" polls came from SurveyUSA, whose polls showed a consistently closer race than any of the other surveys. Had we simply omitted the two SurveyUSA polls and averaged the other three, we would have shown Cardin leading by four points, enough to classify the race as "lean Democrat." Had we added in the two previous survey releases from the Baltimore Sun/Potomac Research and the Washington Post, the average would have shown Cardin leading by six.

Jon Cohen seems to imply that no one would have considered the Maryland races competitive had they adhered to polling's "gold standard.... interviewers making telephone calls to people randomly selected from a sample of a definable, reachable population." That standard would have omitted the Internet surveys, the automated surveys, and possibly the Baltimore Sun/Potomac Research poll (because it sampled from a list of registered voters rather than using a "random digit dial" sample). But it would have left the Mason-Dixon surveys standing, and they showed Cardin's lead narrowing to just three points (47% to 44%) just days before the election.

We are hoping to take a closer look at how the pollsters did in Maryland and across the country over the next month or so, and especially at cases where the results differed from the final poll averages. I suspect that the story will have less to do with the methods of sampling or interviewing and more to do with the classic questions of how hard to push uncertain voters and what it means to be "undecided" on the final survey.


Gary Kilbride:

Perhaps Jon Cohen could have done us a big favor by including a notation with the final Post poll: "Pssst. This is a deep-blue state. Democratic year. Nothing will change from summer to ballot box. Ignore all subsequent polls. The most accurate polling is always in the October 20-25 range."

Cohen may not have been polling director in 2002, but I thought I remembered the Post and its gold standard missing the Ehrlich advantage in the gov race vs. Kathleen Kennedy Townsend. The final Post poll in '02 had it tied 47-47 with likely voters 48-46 toward Townsend. That poll was October 22-25, almost identical to the two-weeks out approach the Post used for its final poll this year. Actual result: Ehrlich 51 Townsend 48.

SurveyUSA polled twice after the Post packed up in 2002. Their first one, days after the Post poll, gave Townsend a 1 point lead, but their final poll provided Ehrlich a 51-46 edge, apparently catching the late move toward the GOP and Ehrlich. A late Maryland Poll for the Baltimore Sun similarly gave Ehrlich a 4 point edge, 48-44. The final SurveyUSA poll was by far the latest of the cycle and was the only one with Ehrlich at 50% or above.

How are we supposed to know the final two weeks are irrelevant? That was hardly the case in Maryland 2002, with the Beltway sniper story still in progress when the Post did its final polling. Gun control had been a major theme in that campaign. The suspects were arrested about ten days before election day yet the Post never polled the gov race again, perhaps missing a changing dynamic in the race with a small but detectable, and decisive, move toward Ehrlich.

Occasional polling with the final one two weeks out, regardless of changing circumstance. Marvelous. That's apparently the Post's polling approach and this year it proved accurate, enabling Jon Cohen to become a red board player. That's an old derisive horse racing term for someone who has all the answers, once the official results are in red on the tote board. Congratulations, Red Board Jon.



In their RCP average, Real Clear Politics omitted both internet surveys and surveys from partisan polling organizations, which sounds reasonable to me given that (1) the partisan polling organizations rarely release their data unless the data are more favorable to their candidate than that of other public polls, and (2) often the other party's polling organization did not release their survey results which, if they did, might counter-balance any skewed results.

Question: why did Pollster.com elect to include partisan surveys in their 5 day averages?



Disclaimer: Republican conservative from Maryland

I've read your site for 2+ years, and I've always appreciated your dispassionate data-based posts. Although I disagreed with several of your conclusions this election season, I didn't find fault w/ your methods. (I.e., I know that my desired results were influencing my perception of the validity of the conclusions you were drawing. This, of course, is always a danger for both blues and reds :)).

Another significant factor concerning Cohen's argument is that WaPo was *ONE* poll. Did they beat pollster.com across the board, or only in the one race? It's easy to cherry pick results, finding the one situation that makes a particular pollster look good. I'm more interested in a pollster's consistency. If "Q" is right on 17 of 17 races one year, but only 1 of 16 another, I'm not going to pay much attention to them. If "R" is right on 16 of 17 races, but w/ avg error of 9%, I'm not going to trust them as much as "S" who was only right on 11 of 17 but w/ avg error of 2%.


Mark Blumenthal:


Good question. The issue of whether to include partisan polls in the averages is one Franklin and I debated. It was a very close call. We were initially focused on the statewide contests and decided to include partisan polls because the alternative would have left us with very few to average in early September.

We also believed that any bias in the Democratic and Republican surveys would largely cancel out. I watched variation in the classifications very closely in the Senate races, and the occasional partisan poll typically made no difference.

It helped that Strategic Vision, a Republican-affiliated firm, released surveys in many states that essentially countered the many polls released by Democrats over the summer.

At the House level, once we started averaging I was far more concerned, because the number of Democratic polls far exceeded those from Republicans. I watched this effect closely, but it made remarkably little difference in the classifications.

Obviously, this is an issue we want to examine carefully in our ongoing post election review.



The Wisdom Of Crowds.

Each pollster is following a methodology with the goal of finding a "true" answer. They cannot all be right all the time, but they each can be right at different times, and they can all be wrong. As each emphasizes different variables and weights, their collective judgment is superior to that of any individual.


Rick Brady:

repeat comment... didn't know there was a more current post...

MP, I tend to agree that averaging polls shouldn't be the tool for evaluating pollster performance. It's useful to help understand things leading up to an election, but all polls (and pollsters) are not created equal.

OTOH, I don't think it's appropriate to evaluate an individual pollster's performance with a snapshot comparison of one poll taken shortly before the election (that darn sampling error thing...).

So, how do you: A) evaluate the performance of a single pollster?; and B) evaluate the performance of the industry as a whole?

A) Using appropriate statistical techniques (I'd argue the recent Martin-Traugott-Kennedy measure is the best thing on the market for this type of analysis right now), look at the history of a pollster's performance. Snapshots are no good.

B) Generate a top 5 or perhaps top 10 list of pollsters based on historical performance (again using the appropriate statistical tools). An argument could be made that these pollsters' polls are basically "alike" and there are other statistical tools that can be applied to evaluate the combined performance of all 5 or 10 of these polls for a single election. Beats the averaging.


college kid:

What exactly accounted for the difference b/w the other polls and the final result? I know WP assumed higher turnout from African Americans.

