Transparency and Pollster Ratings: Update

Topics: Clifford Young, Disclosure, Gary Langer, Joel David Bloom, Nate Silver, poll accuracy, Taegan Goddard

[Update: On Friday night, I linked to my column for this week, which appeared earlier than usual. It covers the controversy over Nate Silver's pollster ratings, and an exchange last week between Silver, Political Wire's Taegan Goddard and Research 2000's Del Ali over the transparency in the FiveThirtyEight pollster ratings. In linking to the column I also posted additional details on the polls that Ali claimed Silver had missed and promised more on the subject of transparency that I did not have a chance to include in the column. That discussion follows below.]

Although my column discusses the transparency of the database Nate Silver created to rate pollster accuracy, it did not address transparency regarding the details of the statistical models used to generate the ratings.

When Taegan Goddard challenged the transparency of the ratings, Silver shot back that the transparency is "here in an article that contains 4,807 words and 18 footnotes," and explains "literally every detail of how the pollster ratings are calculated."

Granted, Nate goes into great detail describing how his rating system works, but several pollsters and academics I talked to last week wanted to see more details of the model and the statistical output in order to better evaluate whether the ratings perform as advertised.

For example, Joel David Bloom, a survey researcher at the University at Albany who has done a similar regression analysis of pollster accuracy, said he "would need to see the full regression table" for Silver's initial model that produces the "raw scores," a table that would include the standard error and level of significance for each coefficient (or score). He also says he "would like to see the results of statistical tests showing whether the addition of large blocks of variables (e.g., all the pollster variables, or all the election-specific variables) added significantly to the model's explanatory power."

Similarly, Clifford Young, pollster and senior vice president at IPSOS Public Affairs, said that in order to evaluate Silver's scores, he would "need to see the fit of the model and whether the model violates or respects the underlying assumptions of the model," and more specifically, "what's the equation, what are all the variables, are they significant or aren't they significant."

I should stress that no one quoted above doubts Silver's motives or questions the integrity of his work. They are, however, trying to understand and assess his methods.

I emailed Silver to ask both about estimates of the statistical uncertainty associated with his error scores and about his decision not to provide more complete statistical output. On the "margin of error" of the accuracy scores, he wrote:

Estimating the errors on the PIE [pollster-introduced error] terms is not quite as straightforward as it might seem, but the standard errors generally seem to be on the order of +/- .2, so the 95% confidence intervals would be on the order of +/- .4. We can say with a fair amount of confidence that the pollsters at the top dozen or so positions in the chart are skilled, and the bottom dozen or so are unskilled i.e. "bad". Beyond that, I don't think people should be sweating every detail down to the tenth-of-a-point level.
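To put Silver's figures in context: under a normal approximation, a 95% confidence interval extends about 1.96 standard errors on each side, which is how a standard error of about .2 becomes an interval of roughly +/- .4. A back-of-the-envelope check (this is just the standard arithmetic, not Silver's actual computation):

```python
# A standard error of ~0.2 implies a 95% confidence interval of
# roughly +/- 0.4 under a normal approximation.
se = 0.2
z_95 = 1.96  # two-sided 95% normal critical value
half_width = z_95 * se
print(round(half_width, 2))  # 0.39, i.e. "on the order of +/- .4"
```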

In a future post, I'm hoping to discuss the ratings themselves and whether it is appropriate to interpret differences in the scores as indicative of "skill" (short version: I'm dubious). Today's post, however, is about transparency. Here is what Silver had to say about not providing full statistical output:

Keep in mind that we're a commercial site with a fairly wide audience. I don't know that we're going to be in the habit of publishing our raw regression output. If people really want to pick things apart, I'd be much more inclined to appoint a couple of people to vet or referee the model like a Bob Erikson. I'm sure that there are things that can be improved and we have a history of treating everything that we do as an ongoing work-in-progress. With that said, a lot of the reason that we're able to turn out the volume of academic-quality work that we do is probably because (ironically) we're not in academia, and that allows us to avoid a certain amount of debates over methodological esoterica, in which my view very little value tends to be added.

To be clear, no one I talked to is urging FiveThirtyEight to start regularly publishing raw regression output. Even in this case, I can understand why Silver would not want to clutter up his already lengthy discussion with the output of a model featuring literally hundreds of independent variables. However, a link to an appendix in the form of a PDF file would have added no clutter.

I'm also not sure I understand why this particular scoring system requires a hand-picked referee or vetting committee. We are not talking about issues of national security or executive privilege.

That said, the pollster ratings are not the fodder of a typical blog post. Many in the worlds of journalism and polling are taking these ratings very seriously. They have already played a major role in getting one pollster fired. Soon these ratings will appear under the imprimatur of the New York Times. So with due respect, these ratings deserve a higher degree of transparency than FiveThirtyEight's typical work.

Perhaps Silver sees his models as proprietary and prefers to shield the details from the prying eyes of potential competitors (like, say, us). Such an urge would be understandable but, as Taegan Goddard pointed out last week, also ironic. Silver's scoring system gives bonus accuracy points to pollsters "that have made a public commitment to disclosure and transparency" through membership in the National Council on Public Polls (NCPP) or through commitment to the Transparency Initiative launched this month by the American Association for Public Opinion Research (AAPOR), because, he says, his data show that those firms produce more accurate results.

The irony is that Silver's reluctance to share details of his models may stem from some of the same instincts that have made many pollsters, including AAPOR members, reluctant to disclose more about their methods or even to support the Transparency Initiative itself. Those are the instincts AAPOR's leadership hopes the Initiative will change.

Last month, AAPOR's annual conference included a plenary session that discussed the Initiative (I was one of six speakers on the panel). The very last audience comment came from a pollster who said he conducts surveys for a small midwestern newspaper. "I do not see what the issue is," he said, referring to the reluctance of his colleagues to disclose more about their work, "other than the mere fact that maybe we're just so afraid that our work will be scrutinized." He recalled an episode where he had been ready to disclose methodological data to someone who had emailed with a request but was stopped by the newspaper's editors who were fearful "that somebody would find something to be critical of and embarrass the newspaper."

Gary Langer, the director of polling at ABC News, replied to the comment. His response is a good place to conclude this post:

You're either going to be criticized for your disclosure or you're going to be criticized for not disclosing, so you might as well be on the right side of it and be criticized for disclosure. Our work, if we do it with integrity and care, will and can stand the light of day, and we speak well of ourselves, of our own work and of our own efforts by undertaking the disclosure we are discussing tonight.



"Keep in mind that we're a commercial site with a fairly wide audience. I don't know that we're going to be in the habit of publishing our raw regression output...A lot of the reason that we're able to turn out the volume of academic-quality work that we do is probably because (ironically) we're not in academia, and that allows us to avoid a certain amount of debates over methodological esoterica, in which my view very little value tends to be added."

so all of a sudden, nate silver is just an entertainer with little interest in "methodological esoterica". i guess all those graphs and numbers and 4000+ word methodological statements with 18 footnotes are just a scientific kabuki for the statistically illiterate.

"Estimating the errors on the PIE [pollster-introduced error] terms is not quite as straightforward as it might seem..."

this sounds like silver never actually estimated the error. hey, why bother if you can publish estimates alone and you will still be hailed as a statistical genius.



So Silver actually did estimate SEs for his PIE values, but then _chose not to mention them_ when he released his rankings?

All I can say to that is: ?

Silver does some great stuff and I'll continue to read and enjoy his work, but man, that's kind of absurd.



"So Silver actually did estimate SEs for his PIE values, but then _chose not to mention them_ when he released his rankings?"

i don't think he did. "standard errors generally seem to be on the order of +/- .2" probably refers to errors reported in the regression output for rawscore. personally, i would expect some scores to have a much larger error, so i would still prefer to see actual errors. the argument that adding one column to the pollsters' rankings table is all of a sudden "methodological esoterica" that is going to scare off his readers is not very convincing.

on the other hand, PIE is a function of several random variables, some of which are nested, so he needs a bootstrap to get SEs and confidence intervals, and a non-trivial one at that. honestly, i don't think he has done it.
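[For readers unfamiliar with the technique this commenter mentions: a bootstrap estimates a standard error by resampling the observed data with replacement and measuring how much the statistic varies across resamples. The sketch below is a generic illustration only; Silver's model and data are not public, and the per-poll error values are made up. The "nested" structure the commenter refers to would additionally require resampling at the election level (a clustered bootstrap), which this simple version does not do.]

```python
import random
import statistics

def bootstrap_se(scores, n_boot=2000, seed=0):
    """Bootstrap standard error of a pollster's mean error score.

    `scores` are per-poll error values for one pollster; we resample
    polls with replacement and measure the spread of the resampled means.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_boot):
        resample = [rng.choice(scores) for _ in scores]
        means.append(statistics.mean(resample))
    return statistics.stdev(means)

# Hypothetical per-poll errors for a single pollster:
errors = [0.5, -1.2, 0.8, 2.1, -0.3, 1.4, -0.9, 0.2]
print(round(bootstrap_se(errors), 2))
```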

which is just one more reason the rankings should have never been based on PIE to begin with. according to silver, the main advantage of PIE is that it is (supposedly) better for election predictions. but pollsters' ratings are for the most part a separate project, for which that argument is irrelevant.



My biggest issue with Silver is that he made up a formula to apply a huge weighting factor for a pollster that is a member of NCPP or AAPOR but has yet to give any evidence of his claims that those pollsters are somehow more accurate because of that membership. In fact, his raw scores show just the opposite. Some of those in his "top dozen" have horrible results based on raw scores.

He seems to have published this list of rankings and is angry that anyone dare question his methods for arriving at his conclusions.

Without showing his formulas and methods, we can only assume that what he published is based on his own opinions and conjecture.



@Cata: I agree, that makes more sense. That's partly why I was surprised. Calculating errors for the PIE measure seemed like it could be horribly complex. Even if he didn't have SE values to produce, the fact that he didn't provide _any_ context, even as basic as what he emailed to Mark above, I find somewhat galling.

@hoosier_gary: My general impression from reading Silver and other polling aggregators/researchers is that he tends to produce models that are considerably more complex than others, but rarely provides concrete evidence that his added complexity actually contributes significantly to his models. Quite frankly, that's perfectly fine if you're just messing around on a blog, but I think the concern is that Silver's notoriety has progressed (deservedly or not) to a point that his models deserve a more "academic" level of criticism.

What I'd like to see:

(a) Some measure of error of the rawscore/PIE values, even if it's just his best guess.
(b) Some measure of overall model fit, preferably based on predictive ability
(c) Breakdown of which parts of his model contribute the most to its predictive power


N P:

On a more elementary level, Silver's calculations include all polls by a pollster over the 21 days leading up to election day. His justification is an example of a pollster who fudged his final poll numbers, which is very rare.

Error is based on a pollster's multiple poll results compared with the election result, with earlier polls given less weight. Deviations from the final result are due to a far more important element, campaign effectiveness, especially during the three-week run-up to the election.

Nick Panagakis



@hoosier_gary - It's not really some sort of arbitrary formula... If I'm interpreting his description right, I think of it as if he treated all the AAPOR/NCPP pollsters as a giant mega-pollster, and then scored individual pollsters as variations from that group, scaling their individual scores down by how many polls they did. (And the same for the other categories.) And he accounts for possible big errors on the rawscore values (like if there are not enough polls) by regressing toward that group value to come up with PIE. That seems generically valid when you don't have that much individual pollster data to work with, and when you think the groups should have similar skill (and you can certainly make a case that willingness to be transparent is a proxy for some facets of skill, like having the resources to be paying attention or such).
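[The "regressing toward the group value" idea this commenter describes is a standard shrinkage estimator. Silver's actual formula is not published, so the following is only a generic sketch of the concept; the function name, the prior weight `k`, and the sample numbers are all hypothetical.]

```python
def shrink_toward_group(raw_score, n_polls, group_mean, k=30):
    """Shrink a pollster's raw score toward its group's mean score.

    Pollsters with few polls get pulled heavily toward the group mean;
    pollsters with many polls keep most of their own raw score.
    `k` is a hypothetical prior weight (a number of "pseudo-polls").
    """
    w = n_polls / (n_polls + k)
    return w * raw_score + (1 - w) * group_mean

# A pollster with only 5 polls ends up mostly at the group mean;
# one with 300 polls keeps nearly all of its raw score.
print(shrink_toward_group(1.0, 5, 0.0))    # roughly 0.14
print(shrink_toward_group(1.0, 300, 0.0))  # roughly 0.91
```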

I don't know where you're getting that "top dozen have horrible results" bit from, all those top 12 raw scores seem average to better-than-average to me (i.e. near zero or negative).

But I do think he might need to beware of shakier pollsters trying to join AAPOR just to ride on the coattails - you start to need to worry about what causes what as soon as the pollsters start to pay attention to what helps improve the ratings. If the causation goes "better pollster -> support transparency", then pollsters just joining to hop onto the bandwagon might break that part of the model.


I too would like to see some more evaluation and support, though: evaluate quantitatively how simpler models are enhanced by adding this or that feature. I would hope he doesn't keep taking it as just academic jostling for attention, and instead takes it as an opportunity to really show off how his model comes together. It can increase the confidence people have in everything he's put together, and can also help increase the elegance of the model by removing any pieces that end up extraneous.



The more I think about this the more irritated I get. Probably means I should stop thinking about it...

My issue remains the fact that error estimation and model validation are the pieces of _any_ modeling exercise that actually tell you how to use and interpret a model.

Silver has provided, in essence, just the raw output, with no context. If his email to Mark reflects what he actually thinks, it would seem to imply that his model is really only useful for classifying pollsters into three groups: Probably good, probably bad and everyone else.

If that's the case, and Silver actually believes this about his own model, why report a list of raw values to two decimal places with no explanation? Isn't part of the point of his site to explain and help people use his statistical analyses?

The question of whether his model is "right" or "wrong" is, to me, somewhat beside the point. My problem is that if you take your job as a data analyst seriously at all, you explain things in a manner such that most people can readily make informed decisions themselves on these questions.

As presented, Silver's rankings confuse more than they explain, and as such strike me as a step backwards in polling data analysis.

