Guest Pollster | November 2, 2006
Topics: 2006, The 2006 Race
Today's Guest Pollster Corner contribution comes from Jacob Eisenstein. While not technically a pollster -- Eisenstein is a PhD candidate in computer science at MIT -- he recently posted an intriguing U.S. Senate projection (and some familiar looking charts) based on a statistical technique called "Kalman filtering" that he applied to the Senate polls. He explains the technique and its benefits in the post below.
Polls are inexact measurements, and they become irrelevant quickly as events overtake them. But the good news about polls is that we're always getting new ones. Because polls are inexact, we can't just throw out all our old polling data and accept the latest poll results. Instead, we check to see how well our poll coheres with what we already believe; if a poll result is too surprising, we take it with a grain of salt, and reserve judgment until more data is available.
This can be difficult for the casual political observer. Fortunately, there are statistical techniques that allow this type of "intuitive" analysis to be quantified. One specific technique, the Kalman Filter, gives the best possible estimate of the true state of an election, based on all prior polling data. It does this by weighing recent polls more heavily than old ones, and by subtracting out polling biases. In addition, the Kalman Filter gives a more realistic margin-of-error that reflects not only the sample sizes of the polls, but also how recent those polls are, and how many different polling results are available.
The Kalman Filter assumes that there are two sources of randomness in polling: the true level of support for a candidate, which changes on a day-to-day basis by some unknown amount; and the error in polling, which is also unknown. If the true level of support for a candidate never changed, we could just average together all available polls. If the polls never had errors, we could simply take the most recent poll and throw out the rest. But in real life, both sources of randomness must be accounted for. The Kalman Filter provides a way to do this.
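To make this concrete, here is a minimal sketch (in Python, and not the actual code behind my projections) of a one-dimensional Kalman filter tracking a single candidate's support. The process-noise parameter q, which controls how much the true support is assumed to drift per day, and the starting estimate and uncertainty are illustrative assumptions, not fitted values.

```python
def kalman_update(mean, var, poll_value, poll_var):
    """Combine the current estimate with a new poll, weighting by variance."""
    gain = var / (var + poll_var)        # trust the poll more when our own estimate is uncertain
    new_mean = mean + gain * (poll_value - mean)
    new_var = (1.0 - gain) * var
    return new_mean, new_var

def track_support(polls, q=0.05, initial_mean=50.0, initial_var=100.0):
    """polls: list of (day, poll_value, poll_var), sorted by day.

    q is the assumed day-to-day variance in the candidate's true support.
    """
    mean, var, last_day = initial_mean, initial_var, None
    for day, value, poll_var in polls:
        if last_day is not None:
            var += q * (day - last_day)  # predict step: uncertainty grows while no polling happens
        mean, var = kalman_update(mean, var, value, poll_var)
        last_day = day
    return mean, var

# Hypothetical example: three polls on days 1, 5, and 9 with different variances.
estimate, uncertainty = track_support([(1, 48.0, 4.0), (5, 51.0, 6.0), (9, 50.0, 3.0)])
```

Note how the two sources of randomness show up directly in the code: poll_var discounts noisy polls, while q makes old polls count for less the longer the gap since they were taken.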
Pollsters are happy to tell you about margin-of-error, which is a measure of the variance of a poll; this reflects the fact that you can't poll everybody, so your sample might be too small. What pollsters don't like to talk about is the other source of error: bias. Bias occurs when a polling sample is not representative of the population as a whole. For example, maybe Republicans just aren't home when the pollsters like to call -- then that poll contains bias error that will favor the Democratic candidates.
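For readers who want to feed a published margin of error into a model like the sketch above, here is one way to turn it into a measurement variance. It assumes the usual normal approximation behind a reported 95% margin of error; the example numbers are made up.

```python
def poll_variance(margin_of_error):
    """Convert a 95% margin of error (in points) into a variance (points squared)."""
    sigma = margin_of_error / 1.96   # one standard deviation under the normal approximation
    return sigma ** 2

# A poll reporting +/- 3 points has a measurement variance of roughly 2.3.
print(poll_variance(3.0))
```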
We can detect bias when a poll is different from other polls in a consistent way. After repeated runs of the hypothetical biased poll that I just described, careful observers will notice that it rates Democratic candidates more highly than other polls do, and they'll take this into account when considering new results from this poll. My model considers bias as a third source of randomness; it models the bias of each pollster, and subtracts it out when considering their poll results.
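The sketch below shows the basic idea, though it is a simplification of what the full model does: estimate each pollster's average deviation from the consensus of other polls (its "house effect"), and subtract that deviation from its numbers before the Kalman update. The pollster names, the consensus input, and the function names are all hypothetical.

```python
from collections import defaultdict

def estimate_house_effects(polls):
    """polls: list of (pollster, poll_value, consensus_value) tuples."""
    deviations = defaultdict(list)
    for pollster, value, consensus in polls:
        deviations[pollster].append(value - consensus)
    # Each pollster's bias is estimated as its average deviation from the consensus.
    return {p: sum(d) / len(d) for p, d in deviations.items()}

def debias(pollster, poll_value, house_effects):
    """Subtract a pollster's estimated bias before passing the poll to the filter."""
    return poll_value - house_effects.get(pollster, 0.0)
```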
The Kalman Filter can be mathematically proven to be the optimal way to combine noisy data, but only under a set of assumptions that are rarely true (these assumptions are listed at my own site). However, the Kalman Filter is used in many engineering applications in the physical world -- for example, the inertial guidance of rockets -- and is generally robust to violations of these assumptions. In the specific case of politics, I think the biggest weakness of this method is that elections are fundamentally different from polls: my model does not account for the difference between who gets polled and who actually shows up to vote. I think this can be accounted for, but only by looking at the results of past elections.