November 12, 2006 - November 18, 2006
[This post concludes my comments started yesterday in response to a column by Washington Post polling director Jon Cohen.]
We chose to average poll results here on Pollster -- even for dissimilar surveys that might show "house effects" due to differences in methodology -- because we believed it would help lessen the confusion that results from polling's inherent variability. We had seen the way the simple averaging used by the site RealClearPolitics had worked in 2004, in particular the way their final averages in battleground states proved to be a better indicator of the leader in each state than the leaked exit poll estimates that got everyone so excited on Election Day.
As Carl Bialik's "Numbers Guy" column on Wall Street Journal Online shows, that approach proved itself again this year:
Taking an average of the five most recent polls for a given state, regardless of the author -- a measure compiled by Pollster.com -- yielded a higher accuracy rate than most individual pollsters.
And in fairness, while I have not crunched the numbers that Bialik has, I am assuming that the RealClearPolitics averages performed similarly this year.
Readers have often suggested more elaborate or esoteric alternatives and we considered many. But given the constraints of time and budget and the need to automate the process of generating the charts, maps and tables, we ultimately opted to stick with a relatively simple approach.
Regardless, our approach reflected our judgment about how best to aggregate many different polls while minimizing the potential shortcomings of averaging. The important statistical issues are fairly straightforward. If a set of polls uses an identical methodology, averaging those polls effectively pools their sample sizes and reduces random error, assuming no trend changes attitudes over the time period in which those polls were fielded.
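The arithmetic behind that pooling claim is easy to sketch. The snippet below (a minimal illustration, not Pollster's actual code) computes the standard 95% margin of error for a proportion and shows how averaging five identical polls of 600 interviews effectively pools them into one 3,000-interview sample:

```python
import math

def moe(p, n, z=1.96):
    """95% margin of error for a proportion p from a simple random sample of n."""
    return z * math.sqrt(p * (1 - p) / n)

# One poll of 600 likely voters with a candidate at 50%:
single = moe(0.50, 600)

# Five such polls, identical methodology, stable opinion:
# averaging pools the effective sample to 3,000 interviews.
pooled = moe(0.50, 5 * 600)

print(f"single poll:       +/- {single * 100:.1f} points")  # about +/- 4.0
print(f"five-poll average: +/- {pooled * 100:.1f} points")  # about +/- 1.8
```

The error shrinks with the square root of the pooled sample size, which is why five polls together are meaningfully, but not five times, more precise than one.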
In reality, of course, all polls are different and those differences sometimes produce house effects in the results. In theory, if we knew for certain that Pollsters A, B, C and D always produce "good" and accurate results, and Pollster E always produces skewed or biased results, then an average of all five would be less accurate than looking at any of the first four alone. The problem is that things are rarely that simple or obvious in the real world. In practice, house effects are usually only evident in retrospect. And in most cases, it is not obvious either before or after the election whether a particular effect -- such as a consistently higher or lower percentage of undecided voters -- automatically qualifies as inherently "bad."
So one reason we opted to average five polls (rather than a smaller number) is that any one odd poll would have a relatively small contribution to the average. Also, looking at the pace of polling in 2002, five polls seemed to be the right number to assure a narrow range of field dates toward the end of the campaign.
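A "last five polls" average of the sort described above can be sketched in a few lines. The function and the sample figures below are hypothetical illustrations, not the code or data Pollster actually used:

```python
def last_five_average(polls):
    """
    polls: list of (end_date, dem_pct, rep_pct) tuples, in any order.
    Returns the simple mean of the five most recently completed polls.
    """
    recent = sorted(polls, key=lambda p: p[0], reverse=True)[:5]
    dem = sum(p[1] for p in recent) / len(recent)
    rep = sum(p[2] for p in recent) / len(recent)
    return dem, rep

# Hypothetical releases; the stale September poll falls out of the window.
polls = [
    ("2006-11-01", 48, 45),
    ("2006-10-30", 50, 44),
    ("2006-10-28", 47, 46),
    ("2006-10-25", 49, 45),
    ("2006-10-22", 46, 47),
    ("2006-09-15", 55, 40),
]
print(last_five_average(polls))  # averages only the five most recent
```

Because the window is keyed to field dates rather than a fixed calendar span, a late flurry of polling automatically tightens the range of dates the average covers.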
We also decided from the beginning that the averages used to classify races (as toss-up, lean Democrat, etc.) would not include Internet surveys drawn from non-random panels. This judgment was based on our analysis of the Internet panel polls in 2004, which had shown a consistent statistical bias in favor of the Democrats. One consequence was that our averages excluded the surveys conducted by Polimetrix, Pollster.com's primary sponsor, a decision that did not exactly delight the folks who pay our bills and keep our site running smoothly. The fact that we made that call under those circumstances is one big reason why the "surrender judgment" comment irks me as much as it does.
Again, as many comments have already noted, we put a lot of effort into identifying and charting pollster house effects as they appeared in the data. On the Sunday before the election, we posted pollster comparison charts for every Senate race with at least 10 polls (22 in all). On that day, my blog post gave special attention to a fairly clear "house effect" involving SurveyUSA:
A good example is the Maryland Senate race (copied below). Note that the three automated polls by SurveyUSA have all shown the race virtually tied, while other polls (including the automated surveys from Rasmussen Reports) show a narrowing race, with Democrat Ben Cardin typically leading by roughly five percentage points.
Which brings me to Maryland. Jon Cohen is certainly right to point out that the Washington Post's survey ultimately provided a more accurate depiction of voters' likely preferences than the average of surveys released at about the same time. Democrat Ben Cardin won by ten percentage points (54% to 44%). The Post survey, conducted October 22-26, had Cardin ahead by 11 (54% to 43% with just 1% undecided and 1% choosing Green party candidate Kevin Zeese). Our final "last five poll average" had Cardin ahead by just three points (48.4% to 45.2%), a margin narrow enough to merit a "toss-up" rating.
So why were the averages of all the polls less accurate than one poll by the Washington Post? Unfortunately, in this case, one contributing factor was the mechanism we used to calculate the averages. As it happened, two of the "last 5" polls came from SurveyUSA, whose polls showed a consistently closer race than any of the other surveys. Had we simply omitted the two SurveyUSA polls and averaged the other three, we would have shown Cardin leading by four points, enough to classify the race as "lean Democrat." Had we added in the two previous survey releases from the Baltimore Sun/Potomac Research and the Washington Post, the average would have shown Cardin leading by six.
Jon Cohen seems to imply that no one would have considered the Maryland races competitive had they adhered to polling's "gold standard" of "interviewers making telephone calls to people randomly selected from a sample of a definable, reachable population." That standard would have omitted the Internet surveys, the automated surveys, and possibly the Baltimore Sun/Potomac Research poll (because it sampled from a list of registered voters rather than using a "random digit dial" sample). But it would have left the Mason-Dixon surveys standing, and they showed Cardin's lead narrowing to just three points (47% to 44%) just days before the election.
We are hoping to take a closer look at how the pollsters did in Maryland and across the country over the next month or so, and especially at cases where the results differed from the final poll averages. I suspect that the story will have less to do with the methods of sampling or interviewing and more to do with more classic questions of how hard to push uncertain voters and what it means to be "undecided" on the final survey.
We interrupt the previous post still in progress to bring you a feature Pollster readers will definitely want to read in full. Carl Bialik, the "Numbers Guy" from Wall Street Journal Interactive, compared the performance of four polling organizations that were particularly active in statewide elections: Rasmussen Reports, SurveyUSA, Mason-Dixon and Zogby International (counted twice, once for its telephone surveys and once for its Internet panel surveys).
The most important lesson in Bialik's piece is his appropriate reluctance to "crown a winner." As he puts it, "the science of evaluating polls remains very much a work in progress." That's one reason why we have not rushed to do our own evaluation of how the polls did in 2006. Bialik provides a concise but remarkably accessible review of the history of efforts to measure polling error (including a quote from Professor Franklin) and a clear explanation of his own calculations.
Again, the column -- which is free to all -- is worth reading in full, but I have to share what is for us, the "money graph:"
There were some interesting trends: Phone polls tended to be better than online surveys, and companies that used recorded voices rather than live humans in their surveys were standouts. Nearly everyone had some big misses, though, such as predicting that races would be too close to call when in fact they were won by healthy margins. Also, I found that being loyal to a particular polling outfit may not be wise. Taking an average of the five most recent polls for a given state, regardless of the author -- a measure compiled by Pollster.com -- yielded a higher accuracy rate than most individual pollsters.
Thanks, Carl. We needed that today. Now do keep in mind the one obvious limitation of Bialik's approach. He only looked at polls by four organizations, including just one online pollster (Zogby) and just two that used live interviewers (Mason-Dixon and Zogby). There were obviously many more "conventional pollsters," although few conducted anywhere near as many surveys as the four he looked at.
Another worthy excerpt involves Bialik's conclusions about the Zogby Interactive online surveys, especially since nearly all of those surveys were conducted by Zogby on behalf of the Wall Street Journal Interactive -- Bialik's employer.
But the performance of Zogby Interactive, the unit that conducts surveys online, demonstrates the dubious value of judging polls only by whether they pick winners correctly. As Zogby noted in a press release, its online polls identified 18 of 19 Senate winners correctly. But its predictions missed by an average of 8.6 percentage points in those polls -- at least twice the average miss of four other polling operations I examined. Zogby predicted a nine-point win for Democrat Herb Kohl in Wisconsin; he won by 37 points. Democrat Maria Cantwell was expected to win by four points in Washington; she won by 17.
Again...go read it all.
I had an unhappy experience yesterday morning while still down for the count with a persistent fever (it has broken finally, and thanks to all for the kind get-well wishes). As I lay shivering, achy and generally miserable, my wife kindly ventured outside to find me some distraction in the form of our dead-tree copy of the morning's Washington Post. It took me only a minute or two to discover that Jon Cohen, the new polling director at the Post, had penned a column that mounted a veiled but clear attack on this site and others like it:
One vogue approach to the glut of polls this year was to surrender judgment, assume all polls were equal and average their findings. Political junkies bookmarked Web sites that aggregated polls and posted five- and 10-poll averages.
But, perhaps unsurprisingly, averages work only "on average." For example, the posted averages on the Maryland governor's and Senate races showed them as closely competitive; they were not. Polls from The Post and Gallup showed those races as solidly Democratic in June, September and October, just as they were on Election Day.
These polls were not magically predictive; rather, they captured the main themes of the election that were set months before Nov. 7. Describing those Maryland contests as tight races in a deep-blue state, in what national pre-election polls rightly showed to be a Democratic year, misled election-watchers and voters, although cable news networks welcomed the fodder.
More fundamentally, averaging polls encourages the already excessive attention paid to horse-race numbers. Preelection polls are not meant to be crystal balls. Putting a number on the status of the race is a necessary part of preelection polls, but much is lost if it's the only one.
We need standards, not averages. There's certainly a place for averages. My investment portfolio, for example, would be in better shape today if I had invested in broad indexes of securities instead of fancying myself a stock-picker. At the same time, I'd be in a much tighter financial position if I took investment advice from spam e-mails as seriously as that from accredited financial experts.
This last point exaggerates the disparities among pollsters. But there are differences among pollsters, and they matter.
Pollsters sometimes disagree about how to conduct surveys, but the high-quality polling we should pay attention to is based on an established method undergirded by statistical theory.
The gold standard in news polling remains interviewers making telephone calls to people randomly selected from a sample of a definable, reachable population. To be sure, the luster on the method is not as shiny as it once was, but I'd always choose tarnished precious metals over fool's gold.
I want to say upfront that I find the charge that our approach was "to surrender judgment," "assume all polls were equal" and blindly peddle "fool's gold" to be both inaccurate and deeply offensive. While it is tempting to go all "blogger" and fire off an angry response in kind, I am going to try to assume that Mr. Cohen -- whom I do not know personally -- wrote his column with the best of intentions. At the same time, it is important to spell out why I fundamentally disagree with his broader conclusions about the value of examining and averaging different kinds of polls.
[Unfortunately, having lost a few days to the flu, I need to pay a few bills and attend to a few other details here at Pollster. I should be back to complete this post later this afternoon. Meanwhile, please feel free to post your own thoughts in the comments section].
Update (11/16): Since I dawdled, the second half of this post appears as a second entry.
Today's Guest Pollster Corner Contribution comes from Alan Reifman of Texas Tech University, who takes a closer look at this fall's pre-election polls.
In the months leading up to the 2000 and 2004 general elections, presidential election polls showed considerable variation -- both across different pollsters and within the same pollster at different times -- in the percentages of self-identified Democrats, Republicans, and Independents comprising the samples. Sample composition itself probably would not concern many people, but when these sampling variations seemed to affect the polls' candidate vs. candidate "horse race" numbers, people got agitated.
Discrepancies in polls' partisan compositions almost inevitably raise the issue of whether survey samples should be weighted (i.e., post-stratified) to match party ID figures from sources such as previous elections' exit polls, much like polls are weighted to match gender and other demographic parameters from the U.S. Census. Underlying the question of whether pollsters should weight by party ID lies another question: How fixed and enduring are voters' identifications with a party? Again, experts differ. Zogby was the first major pollster to weight on party ID, with Rasmussen following suit later. Most, if not all, of the remaining pollsters do not weight by party.
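Mechanically, weighting by party ID is a simple post-stratification step: each respondent in a party group receives a weight equal to the group's assumed population share divided by its share of the raw sample. The counts and target shares below are made up for illustration; this is a minimal sketch, not any pollster's actual weighting procedure:

```python
def party_id_weights(sample_counts, target_shares):
    """
    Post-stratification on party ID.
    sample_counts: raw respondent counts by party, e.g. {"D": 380, "R": 290, "I": 330}
    target_shares: assumed population shares summing to 1.0
    Returns one weight per party group; applying it makes the weighted
    sample match the target distribution.
    """
    n = sum(sample_counts.values())
    return {party: target_shares[party] / (sample_counts[party] / n)
            for party in sample_counts}

weights = party_id_weights(
    {"D": 380, "R": 290, "I": 330},          # hypothetical raw sample
    {"D": 0.37, "R": 0.32, "I": 0.31},       # hypothetical target shares
)
print(weights)  # Republicans here get weights above 1, Democrats below 1
```

The controversy is not the arithmetic but the targets: weighting to exit polls or to a rolling benchmark assumes party ID is a stable trait rather than an attitude that moves with the campaign, which is precisely where the experts differ.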
I track polls' partisan compositions at my sample weighting website. I am neither a pollster nor a political scientist, but I am a social scientist who teaches research methods and statistics, and I've spent much time studying and collecting data on party identification. I also took a graduate statistics class many years ago from Pollster.com contributor Charles Franklin.
If I had to summarize developments on the sample composition/weighting front for 2006 (where the main point of interest was the Generic Congressional Ballot), I would identify two trends:
1. Thanks to the efforts of the Mystery Pollster himself and others who raised the issue over the past few years, full "topline" documents (also known as polls' "internals"), which included party ID numbers, were freely accessible via the web for most of the national pollsters during this past election season.
2. The margin between the percentages of self-identified Democrats and Republicans (D minus R) comprising most national polls over the final two months of the campaign season was pretty stable. As a consequence, questioning of polls' partisan breakdowns was relatively rare this year.
On my website, I used Rasmussen's party ID readings as my benchmark for comparison, due to the large numbers of interviews involved (500 daily interviews, aggregated over the 90 days preceding the start of each new month). Most of the time, Rasmussen had the D-R margin at roughly 4.5 percentage points. As shown in the major chart on my website, when multiple independent polls that were in the field during roughly the same time frame (and which released the necessary party ID numbers) were available, I averaged their partisan percentages. Four polls (not including any from Rasmussen) taken from October 18-22 inclusive showed averages of D 34.4 and R 29.8, well in line with Rasmussen's margin. (The average of five polls from an earlier period, October 5-8 inclusive, had a wider Democratic margin: D 36.9, R 29.6.)
In the chart, I also provided brief verbal commentary on how each poll's partisan breakdown matched up with Rasmussen's. As can be seen, polls' D-R margins were sometimes described as "about right," with instances of "D edge understated" and "D edge overstated" almost perfectly balancing out over the final two weeks.
In the end, the New York Times exit poll (N = 13,251) showed the national electorate for U.S. House races to consist of 39% Democrats and 36% Republicans. This 3-point difference is slightly smaller than would have been anticipated from some of the late polls, but only slightly. It should also be noted that, even with its huge sample size, the Times exit poll still is a sample survey and thus carries a small margin of error (about +/- 1).
One final note: As animated as I get by party ID percentages, I must acknowledge that they are not the whole story. For example, among the final batch of polls, FOX, Pew, and Time all had Democratic respondents outnumbering their Republican counterparts by either 3 or 4 percent. Yet these polls differed widely in their Generic Ballot readings, with FOX and Time having Democrats up 13-15 percent (with FOX's sample explicitly described as consisting of "likely" voters), whereas Pew had them up only 8 (among registered voters) or 4 (among likely voters). Other traditional issues of survey methodology -- such as question wording and order effects -- thus have to be examined for their possible role in these polls' varying D-R margins on the Generic Ballot.
A quick follow-up on Karl Rove's contention in his now well-known interview with NPR's Robert Siegel:
I'm looking at 68 polls a week . . . and adding them up. I add up to a Republican Senate and Republican House. You may end up with a different math but you are entitled to your math and I'm entitled to THE math.
Obviously, it didn't work out that way. I discussed the topic here and in a subsequent interview on NPR's On the Media. But now, thanks to Newsweek (via Kaus), we have the details on just what Rove meant by "THE math:"
The polls and pundits pointed to a Democratic sweep, but Rove dismissed them all ... He wasn't just trying to psych out the media and the opposition. He believed his "metrics" were far superior to plain old polls. Two weeks before the elections, Rove showed NEWSWEEK his magic numbers: a series of graphs and bar charts that tallied early voting and voter outreach. Both were running far higher than in 2004. In fact, Rove thought the polls were obsolete because they relied on home telephones in an age of do-not-call lists and cell phones. Based on his models, he forecast a loss of 12 to 14 seats in the House -- enough to hang on to the majority. Rove placed so much faith in his figures that, after the elections, he planned to convene a panel of Republican political scientists -- to study just how wrong the polls were.
So there you have it. Two plus two always adds to four, but sometimes our models and assumptions don't add up as well as we think they will.
Update: Adam Berinsky, an associate professor of political science at MIT, asks a good question in the comments:
Who were these Republican political scientists that were going to attend Rove's conference? I assume they were lined up before the election. If any of them are MP readers, it would be interesting to get their perspective?
I do not hear that as a rhetorical question. If any political scientists want to chime in on this issue, our "Guest Pollster Corner" is open and your comments would very much be welcome. Who knows, could Karl Rove himself be an MP reader?
The adorable little Petri dishes I call children gave me a post-election gift of a nasty little flu bug that has been dogging me since late last week. I tried to write up something from home today, but sitting up and typing only raised my fever to 102.3. So I'm calling it a day. See you -- hopefully -- tomorrow.