Regression Analysis of the Democratic Race

Topics: Barack Obama, Exit Polls, Hillary Clinton, Jay Cost, NY Times

Over the last few days, a number of political scientist bloggers have turned their statistical firepower on the Democratic presidential race, producing some analyses that are both tantalizing in their implications and confusing for those unfamiliar with multiple regression analysis. The most interesting posts come from Brendan Nyhan, Jay Cost and DailyKos diarist Poblano. Other than pointing you to their efforts, here are a few thoughts.

In many ways, the Democratic contest is the perfect problem for multiple regression analysis. Many different important variables appear to be strongly related to candidate support: race and ethnicity, gender, age, socio-economic status and whether voters participate in a primary or caucus (to name just the most obvious). We are really interested in understanding the independent effects of each of these factors. You can see crude efforts along these lines in the exit poll tabulations: How does vote preference vary by gender or age, for example, once we control for race? The promise of multiple regression is the ability to estimate the independent effects for a large number of different variables on vote choice while controlling for all of them simultaneously.

Another tempting feature of multiple regression analysis -- at least in theory -- is the ability to take a model that does a good job predicting the Obama-Clinton vote looking backwards, plug values for the upcoming contests for each of the variables into the model (race, gender, age, etc) and attempt to predict the outcomes. The lure of predicting "what might happen at the end of an active campaign" (as Poblano put it), is what led Bill Kristol to cite Poblano in his New York Times column. Obviously, if it were possible, we would all like to use hard data to anticipate what might happen in Ohio, Texas or Pennsylvania.

At the same time, the efforts by the aforementioned bloggers also demonstrate just how complex and challenging multiple regression analysis can be when applied to real world problems using real world data. Here are three reasons to be cautious about interpreting the models linked to above:

1) The data are imperfect. As Jay Cost explains, we have a choice between two kinds of data: "micro-level" exit poll data and "macro-level" data from statewide results. Exit polls collect data on the vote preferences and characteristics of individual voters. That level of data is ideal, since we want to understand how individuals vote (not states or counties). Unfortunately, for now, only the subgroup exit poll tabulations are available, and not for all states. The networks have not conducted "entrance polls" for most of the smaller caucus states.

Data are plentiful at the aggregate level (mostly states) but far less precise. One problem is that Census data (on race, age, religion or socio-economic status) are based on the total population rather than those who participated in the Democratic primaries or caucuses. We also have a relatively small number of states to consider, and we have to deal with the statistical problem that population sizes vary considerably from state to state.

2) The models are poor predictors of the future. The limitations of the data are one reason why these sorts of regression models make for poor predictors of future outcomes. Consider the predictive accuracy of Poblano's model. He says it explained 95% of the variation in 26 states that voted through February 5 and reports estimates that predicted Obama's actual share of the vote within these states "within an average of two points." However, as TNR's Josh Patashnik points out, the model overestimated Obama's support in Louisiana (+11 points) and Nebraska (+8) and understated it in Washington (-14) and Maine (-7). The reason is something statisticians call "overfitting." The number of variables in Poblano's model (9) was large relative to the number of cases involved (26 states). So the "fit" of Poblano's model to the past data is deceiving because it is, in essence, too good. The 95% of variance explained is unique to those 26 states and thus does not generalize to predict the results in other states with anywhere near as much precision.
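To make the overfitting point concrete, here is a minimal simulation sketch. It does not use Poblano's data or variables. It assumes 26 hypothetical "states," 9 predictors that are pure random noise, and an outcome that is also pure noise, so any apparent "fit" comes from chance alone. The function names (ols_fit, r_squared, noise_data, simulate) are invented for this illustration, and the solver is a plain normal-equations routine written only so the example runs on the Python standard library.

```python
import random


def ols_fit(X, y):
    """Least-squares coefficients from the normal equations (X'X)b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(k)]
         + [sum(row[i] * yi for row, yi in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


def r_squared(X, y, beta):
    """Share of variance explained: 1 - SSE/SST, with SST around the mean of y."""
    mean_y = sum(y) / len(y)
    sse = sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
              for row, yi in zip(X, y))
    sst = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - sse / sst


def noise_data(rng, n, k):
    """n cases: an intercept, k pure-noise predictors, and an unrelated outcome."""
    X = [[1.0] + [rng.gauss(0, 1) for _ in range(k)] for _ in range(n)]
    y = [rng.gauss(0, 1) for _ in range(n)]
    return X, y


def simulate(trials=200, n=26, k=9, seed=0):
    """Average R-squared on the fitted cases and on fresh cases the model never saw."""
    rng = random.Random(seed)
    in_sample, out_sample = [], []
    for _ in range(trials):
        X, y = noise_data(rng, n, k)
        beta = ols_fit(X, y)
        in_sample.append(r_squared(X, y, beta))
        X_new, y_new = noise_data(rng, n, k)  # a new set of "states"
        out_sample.append(r_squared(X_new, y_new, beta))
    return sum(in_sample) / trials, sum(out_sample) / trials


if __name__ == "__main__":
    fit, holdout = simulate()
    print(f"average R-squared on the fitted cases: {fit:.2f}")
    print(f"average R-squared on new cases:        {holdout:.2f}")
```

With 9 noise predictors and 26 cases, statistical theory puts the expected in-sample R-squared at about 9/25, or roughly a third, even though nothing real is being explained. On fresh cases the same coefficients do worse than simply guessing the average. A real model contains genuine signal too, so this sketch only isolates the part of the fit that chance can supply.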

Reducing the number of variables does not solve the problem; it just makes the "fit" of the model to the existing data less predictive (though more realistic). Jay Cost explains why his own model is a decent vehicle for explaining the existing data but a poor predictor of future outcomes:

The model's predictive power (69%) is very high from a certain perspective. From another perspective, though, its accuracy is not great enough to [allow for] "publishable" predictions - not when candidates are often separated by tiny margins.

3) Demography is not always destiny. Or to put it another way, campaigns matter. At least that is the underlying assumption behind all the personal campaigning, field organizing and paid advertising that both campaigns are doing. The one thing these models lack is a better measurement of the influence of the various means of campaigning. Once again, a lack of decent data is the primary culprit. For example, we do not yet have FEC reports providing decent breakdowns of how much the candidates spent in each state. Also, the University of Wisconsin's Advertising Project will ultimately have breakdowns of what each candidate spent on television advertising in each media market, but those data are not yet in the public domain. The impact of campaigning so far is important. Will it matter, for example, that the campaigns will now slow down enough so that the candidates can devote significantly more time and paid advertising to states like Ohio, Texas and Pennsylvania than they did the Super Tuesday states?

Jay Cost's model does include "number of candidate visits" as a variable meant to "measure campaign effects per state." He reports that:

Clinton does better as the number of candidate visits increases. This was a bit of a surprise, but it is good news for her. Campaign effects seem to incline the electorate to her.

This finding is intriguing, but I wonder how the results might differ if Cost had used separate variables for the visits of each candidate rather than just the total number of visits for all candidates.

Mechanical issues of this sort help illustrate one of the practical limitations of regression modeling. It is a very powerful tool, but it is also sensitive to decisions the analyst makes about what data to use and what variables to include. We will no doubt see more attempts to model the primary campaign in the future. Do not be surprised if reasonable people disagree about what data is most appropriate, what model best "fits" the data and about which conclusions are best supported.

Update: Just want to underline a point that may have been unclear. Neither Brendan Nyhan nor Jay Cost used their regression models to try to predict future outcomes.



Great piece. Very helpful. Would like to see how the model projects the March 4 primaries.


In terms of number of candidate visits to a state, we should be careful not to assign causality here. A candidate may visit a state in which they are already doing well, so people shouldn't treat this as causal (actually, none of the factors should be treated as causal).


Also, here's a link where I try to see if Missouri can help predict the outcome in Ohio.



This is very interesting. Brian Schaffner has been using statistical models to try to predict how superdelegates might vote...


Would this be more or less prone to error?



Very interesting post! I don't think regression analysis will be an accurate predictor of future elections, but it is interesting to go back once an election is over.

Did anyone use a "momentum" variable representing the number of states won in the previous X number of primaries? I always wondered how much "momentum" matters.

Regarding number of candidate visits, I don't know how useful that result is. Excluding the pre Feb. 5 states, I assume the states with the greatest number of candidate visits were also the biggest states. Therefore, although CA may have had a lot of candidate visits, the average voter there had less of a chance to meet a candidate than a voter in a small state with one or two visits. If number of visits is designed to measure whether contact with the candidate helps, then a better variable might be number of visits by each candidate divided by each state's population.
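The per-capita adjustment suggested in this comment is simple to sketch. The figures below are made up for illustration and are not actual visit counts or populations for any state, and the function name visits_per_million is likewise hypothetical.

```python
def visits_per_million(visits, population):
    """Candidate visits per one million residents."""
    if population <= 0:
        raise ValueError("population must be positive")
    return visits / (population / 1_000_000)


# Hypothetical states: (candidate visits, population). Not real data.
examples = {
    "large state": (20, 40_000_000),
    "small state": (4, 1_000_000),
}

for name, (visits, population) in examples.items():
    rate = visits_per_million(visits, population)
    print(f"{name}: {visits} visits, {rate:.1f} per million residents")
```

In this made-up case the large state has five times as many visits but one-eighth the visits per resident, which is the distinction the comment is drawing.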



Setting aside for a moment the predictive value of regression analysis, I'm happy to see Poblano's attempt to tease out the effects of education and income from each other. Up to now they have been conflated. According to Poblano, though, those with higher education plus low income lean to Obama (including white women and Latinos), which means he performs better among low-income workers than previously thought, even among Latinos. Of course, this may be a case of "milady doth protest too much" but still it makes one think.



I don't see how these can be of much value without decent micro data. Seems to me with so few data points the macro models are going to be pretty flimsy once you add more than a couple of right hand side variables.


Steve Hendricks:

Just a comment to reinforce Mark's comment about the imperfect nature of the available data. Trying to analyze individual behavior from aggregate results runs smack-dab into what's called the "ecological fallacy."

For example, back in the 1960's George Wallace drew his greatest support in Alabama's "black belt" counties. Although the area was named for its soil, not its ethnicity, it so happened that the counties also included the largest concentration of African Americans in the state.

Despite that correlation, one cannot reasonably conclude that Wallace received more than a minuscule number of African American votes.

Regression analyses based on state level results (or even county-level results) suffer from the same problem. Although aggregate results can set the min/max parameters for individual voting choices, except in almost unanimous splits one way or the other, the range of possible results is so wide as to be virtually useless.

Unfortunately, the data needed to perform truly useful analyses exist in the form of the exit polls from the states. If the data were available to independent researchers, the many questions about the stable and shifting patterns of support for Clinton and Obama could be addressed.
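The "min/max parameters" mentioned in this comment can be sketched with simple arithmetic, sometimes called the method of bounds. The sketch assumes we know only two aggregate numbers for an area: the share of voters who belong to some group and the share of all votes a candidate received. The function name support_bounds and the example shares are hypothetical. Note that the calculation needs the group's share of actual voters, not of the population, which is exactly the gap in the Census-based data discussed in the post.

```python
def support_bounds(group_share, candidate_share):
    """Lowest and highest possible fraction of a group that backed a candidate.

    group_share:     the group's fraction of all voters in the area (0 < x <= 1)
    candidate_share: the candidate's fraction of all votes in the area (0 to 1)
    """
    if not 0 < group_share <= 1:
        raise ValueError("group_share must be in (0, 1]")
    if not 0 <= candidate_share <= 1:
        raise ValueError("candidate_share must be in [0, 1]")
    # Lowest case: everyone outside the group voted for the candidate.
    low = max(0.0, (candidate_share - (1.0 - group_share)) / group_share)
    # Highest case: every one of the candidate's votes came from the group.
    high = min(1.0, candidate_share / group_share)
    return low, high


# Hypothetical areas where the group is 60% of voters. Not real results.
for candidate_share in (0.50, 0.95):
    low, high = support_bounds(0.60, candidate_share)
    print(f"candidate at {candidate_share:.0%}: "
          f"group support between {low:.0%} and {high:.0%}")
```

With an even split, the aggregate numbers allow group support anywhere from about 17% to 83%, which tells us almost nothing. Only when the result is nearly unanimous do the bounds tighten (about 92% to 100% in the second case).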



First, thanks again to Mark and the commentators above for a very illuminating discussion.

Second, I just want to briefly suggest one beneficial purpose these efforts are serving, despite their inherent limitations. Namely, I think they are helping to undermine some of the more strongly worded demographic claims I have seen many people (including many in the media) asserting on the basis of the results so far. Of course for the reasons discussed above, the available data doesn't really allow for these claims to be entirely disproven. But it seems to me that merely helping people see that many of these superficially plausible claims are not in fact well-supported by the available data is a worthy result.



Do any of the blogs provide a table of coefficients and standard errors for their models?


Mark Blumenthal:

DTM: No disagreement with your second point.

ADG: Not as far as I know, but if you follow the links above, you will know as much as I do.

