Rivers: Random Samples and Research 2000

Topics: Daily Kos , Nate Silver , Research2000 , Sampling

Douglas Rivers is president and CEO of YouGov/Polimetrix and a professor of political science and senior fellow at Stanford University's Hoover Institution. Full disclosure: YouGov/Polimetrix is the owner and principal sponsor of Pollster.com.

I am, like most in the polling community, shocked by the recent accusations of fraud against Research 2000. Marc Grebner, Michael Weissman, and Jonathan Weissman convincingly demonstrate that something is seriously amiss with the research reported by Research 2000, which may well be due to fraud.

But some of the claims by the critics, such as Nate Silver's post this morning on FiveThirtyEight.com (as well as part of the Grebner et al. analysis), exhibit a common misunderstanding about survey sampling: "random sampling" does not necessarily mean "simple random sampling." I do not know what Research 2000 did (or claimed to do), but very few surveys actually use simple random sampling.

To recapitulate Nate's argument: if you draw a simple random sample of size 360 from a population of 50% Obama voters and 50% McCain voters, the day to day variation in the Obama vote percentage in the sample should be approximately normal, with mean 50% and standard deviation 2.7%. (Nate gets this by simulating 30,000 polls and rounding the results, but most students in introductory statistics would just calculate the square root of 0.5 x 0.5 / 360, which is about 2.6%.) This would give you the blue line in Nate's first graph, reproduced below.
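Both routes to that figure can be sketched in a few lines of Python. This is a minimal illustration, not Nate's actual code: the function names, the seed, and the choice of 2,000 simulated polls (rather than 30,000) are assumptions made to keep the example quick, and the simulation does not round the results.

```python
import math
import random

def srs_sd(p, n):
    """Standard deviation of a sample proportion under simple random sampling."""
    return math.sqrt(p * (1 - p) / n)

def simulate_srs_sd(p, n, polls, seed=0):
    """Simulate `polls` simple random samples of size n; return the SD of the shares."""
    rng = random.Random(seed)
    shares = []
    for _ in range(polls):
        hits = sum(1 for _ in range(n) if rng.random() < p)
        shares.append(hits / n)
    mean = sum(shares) / polls
    return math.sqrt(sum((s - mean) ** 2 for s in shares) / (polls - 1))

analytic = srs_sd(0.5, 360)                        # the textbook formula
simulated = simulate_srs_sd(0.5, 360, polls=2000)  # the brute-force route
print(f"analytic SD:  {analytic:.4f}")
print(f"simulated SD: {simulated:.4f}")
```

The two numbers should agree to within simulation noise, which is the point of the parenthetical above: the formula gives the answer directly.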


However, what happens if the poll is not a simple random sample? Suppose (and this is entirely hypothetical) that you polled off of a registration list composed of 50% Democrats and 50% Republicans (to keep things simple, let's pretend there are no independents). Further, suppose that 90% of the Democrats support Obama and 90% of the Republicans support McCain, so it's still 50/50 for Obama and McCain in the population. Instead of drawing a simple random sample, we draw a "stratified random sample" with 180 Democrats and 180 Republicans each day. That is, we draw a simple random sample of 180 Democrats and a simple random sample of 180 Republicans and combine them. What should the distribution of daily poll results look like?

I should caution that there is a little math in what follows, but nothing hard. The variance (the square of the standard deviation) of each subsample is 0.90 x 0.10 / 180 = 0.0005. The combined sample mean is just the average of these two independent subsamples, so its variance is 0.0005/2 or 0.00025, so the standard deviation is the square root of 0.00025 or approximately 1.6%, not the 2.6% that Nate thought it should be. This distribution is shown in the figure below as a green line, which is a lot closer to the suspicious red line in Nate's graph, showing the Research 2000 results.
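The same arithmetic, written as a short Python sketch. The helper name `stratified_sd` and the tuple layout are illustrative choices; the inputs are the hypothetical 50/50 registration list described above.

```python
import math

def stratified_sd(strata):
    """SD of a stratified estimate.

    strata: list of (population_share, support_rate, sample_size) tuples.
    Each stratum contributes share^2 * p * (1 - p) / n to the variance.
    """
    variance = sum(w ** 2 * p * (1 - p) / n for w, p, n in strata)
    return math.sqrt(variance)

# Hypothetical design: 180 Democrats (90% Obama) and 180 Republicans (10% Obama).
strata = [(0.5, 0.90, 180), (0.5, 0.10, 180)]

sd_stratified = stratified_sd(strata)       # about 0.0158
sd_srs = math.sqrt(0.5 * 0.5 / 360)         # about 0.0264
print(f"stratified SD: {sd_stratified:.4f}")
print(f"SRS SD:        {sd_srs:.4f}")
```

The stratified figure is smaller because the party split is fixed by design, so day-to-day variation in the share of Democrats in the sample no longer feeds into the topline.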


Does this absolve Research 2000 of fraud? Of course not. There are other factors (such as weighting) that usually increase the variability, so Nate is right that the Research 2000 results look suspicious. But we should be a little more cautious before convicting upon the basis of this sort of evidence.


Matthias Winkelmann:

Good analysis, Nate needs some competent critics. I'm really wondering why he always goes for the simulation even when simple calculations would give analytical results.



I agree. Some serious reservations before I will climb on the FRAUD train.

Nate et al may be over-reaching a bit on the claims of inherent intentional fraudulent manipulation of the 'data' as this exercise points out.

A cautionary tale...


Michael Weissman:

Doug- Read our (GWW) article again. The underdispersion analysis was of the difference between two PARTY ID categories (IND and OTH) which could not, therefore, have been stabilized by party ID. The topline analysis, which could have been, was mentioned only in passing precisely because we didn't want to explain this issue. We did, however, do the stabilization calculation using sample crosstabs for the top-line analysis.

Other crosstabs could give a very minor stabilization, quite irrelevant to the extremely dramatic underdispersion.

You owe us a closer reading, and an Emily Litella moment.


Douglas Rivers:

Michael Weissman:

You use the V(phat) = p(1-p)/n formula repeatedly. This assumes a SRS. The chi-square tail probability is based on this variance calculation. I reread your article and did not find any mention that this variance calculation is incorrect (since it ignores any stratification, unequal probabilities of selection, or weighting), so I'm not sure what I missed.

As I indicated in my posting, however, I find your overall analysis convincing and a terrific piece of detective work.
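To make the role of the V(phat) = p(1-p)/n formula concrete, here is a rough Python sketch of a dispersion check. It is not the calculation from the Grebner, Weissman, and Weissman article: the function name and the daily figures are invented for illustration, and the sketch assumes the true proportion is constant across days and that each day is a simple random sample, which is exactly the assumption under discussion.

```python
def dispersion_ratio(shares, n):
    """Observed variance of daily shares divided by the SRS variance p(1-p)/n.

    A ratio well below 1 means the series moves less than simple random
    sampling alone would predict.
    """
    k = len(shares)
    p = sum(shares) / k
    observed = sum((s - p) ** 2 for s in shares) / (k - 1)
    expected = p * (1 - p) / n
    return observed / expected

# Made-up daily shares, each assumed to come from 360 interviews.
daily_shares = [0.50, 0.51, 0.49, 0.50, 0.51, 0.49, 0.50]
ratio = dispersion_ratio(daily_shares, n=360)
print(f"observed / expected variance: {ratio:.3f}")
```

If the design is stratified or weighted, the denominator should be replaced by the variance appropriate to that design, so the ratio is only as good as the sampling model behind it.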



It seems Nate is a little quick to jump to conclusions (he claims he only spent a few hours on his pollster rankings) and he builds incredibly complex models for things that are pretty simple.

I have to wonder why so many people are declaring him the prophet of poll analysis. Does he have any actual training or is he just someone who used to write articles for Daily Kos?

In my line of business, there is a saying, "If you can't dazzle them with graphics then bury them with bulls**t." These attacks from Nate seem to be both.

Research 2000 might have been completely making their numbers up. I don't know. But I am more likely to trust just about anyone before I trust Daily Kos.


Michael Weissman:

Doug- Which of the other demographic subcategories could possibly be used to stratify the little subcategories of party ID = IND or OTH in any significant way? Remember, the variance was low by a factor of, IIRC, 8. The week-to-week jitter was low by more than a factor of 100. We were writing for a lay readership.

When it counted, in the top-line analysis, we did look at party ID effects. E.g. on May 13, my notes say that without party ID, averaged over the 14 top-lines was 0.834 of the simple 50-50 value. With party ID, it dropped to 0.570. Those numbers have not been checked enough to formally publish. You can check yourself.

Again- you seriously think we should have been crudding up the blog with that sort of analysis, when we were talking about IND-OTH missing enormous factors in ?



Gary, I'm unclear as to why you would write something that is demonstrably false:

"It seems Nate is a little quick to jump to conclusions (he claims he only spent a few hours on his pollster rankings)..."

Actually, in the blog post about pollster rankings from June 20th, he states, "I'd guess that, from start to finish, the pollster ratings required something like 100 hours of work." That equates to a full-time employee working on one project for two-and-a-half straight weeks.

You also asked, "Does he have any actual training or is he just someone who used to write articles for Daily Kos?" In the time you took to write that sentence, you could have found his Wikipedia entry stating that he studied at the University of Chicago and the London School of Economics.

Then you went to, "If you can't dazzle them with graphics then bury them with bulls**t." Were you being ironic?


Matt Sheldon:

This analysis is dead on.

This is an example of how overcomplicating simple analysis can produce misleading analysis.

Nate Silver has a history of making unwarranted assumptions regarding the behavior of numbers. He made the same unwarranted assumption in his original analysis of Strategic Vision polling data.

Nate assumed that the trailing digits on topline polling results should be random and normally distributed. His whole analysis rested on this assumption.

The assumption was demonstrably false. Given methodological differences in classifying non-responders and leaners, it should be quite expected to see non-random patterns in trailing digits.

I linked to the following dataset:
Bush Approval Rating Polls
This dataset clearly shows that many pollsters have non-random patterns in the trailing digits, even given the fact that Bush's approval ratings varied from 92% to 19% with a mean of around 50%. His ratings were normally distributed over time.

Yet, reputable pollsters like Gallup and ARG have tremendously skewed patterns in their trailing digits. In each case, the sample is large enough to show the bias is systematic for very legitimate methodological reasons.

Nate's assumption of randomness where it should not necessarily be is a huge oversight for someone who considers themself to be a professional researcher.

He has replicated the same error in this case.

Likewise, he only compared Strategic Vision to one other pollster (Quinnipiac), despite repeated requests that he repeat the analysis for other pollsters.

Nate never did that, nor did he provide his underlying data for scrutiny.

Note: Strategic Vision contributed to the appearance of guilt by not responding forcefully to Nate's claims with the underlying detail required.

Nate's assertion of fraud may very well have been correct, but his mathematical analysis was not evidence of it.

His methods produce far too many false positives to be credible.

***One quibble with the commentary...
Weighting the data should not increase the variability of the results, it should reduce the variability.

We are removing sources of random variation by weighting according to Party ID, Demographics, etc.

Essentially weighting the sample by Party ID is statistically identical to the 50/50 a priori sampling plan outlined in the above analysis.

Now, consider the possibility of weighting by multiple dimensions (such as Age x Gender x PartyID)...this would reduce variability even further.
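The claim that weighting by party ID behaves like the stratified design can be checked with a small simulation. The Python sketch below is an illustration under assumed conditions, not a description of Research 2000's procedure: it reuses the hypothetical 50/50 population from the post, draws simple random samples of 360, and compares the unweighted Obama share with a post-stratified share that weights each party to 50%. The function names, the seed, and the number of simulated polls are illustrative.

```python
import math
import random

def sd(values):
    """Sample standard deviation."""
    m = sum(values) / len(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def one_poll(rng, n=360):
    """Draw one simple random sample; return (unweighted, post-stratified) Obama share."""
    dem_n = dem_obama = rep_n = rep_obama = 0
    for _ in range(n):
        if rng.random() < 0.5:                 # respondent is a Democrat
            dem_n += 1
            dem_obama += rng.random() < 0.9    # 90% of Democrats back Obama
        else:                                  # respondent is a Republican
            rep_n += 1
            rep_obama += rng.random() < 0.1    # 10% of Republicans back Obama
    raw = (dem_obama + rep_obama) / n
    # Weight each party to its assumed 50% population share.
    # (An empty party group is ignored here; at n=360 it is vanishingly unlikely.)
    post = 0.5 * dem_obama / dem_n + 0.5 * rep_obama / rep_n
    return raw, post

rng = random.Random(1)
results = [one_poll(rng) for _ in range(2000)]
raw_sd = sd([r for r, _ in results])
post_sd = sd([p for _, p in results])
print(f"unweighted SD:      {raw_sd:.4f}")
print(f"post-stratified SD: {post_sd:.4f}")
```

In this idealized setup the weight variable is strongly related to the answer and there is no differential nonresponse, so the post-stratified series is steadier. Weights that are unrelated to the answer behave differently.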


Matt Sheldon:

@jamesautomatic -

While I am not one to base credibility on academic credentials, Nate's experience in statistics is not overwhelming.

The more troubling aspect of this is that Nate is a frequent commentator on polling and pollster quality, yet he has never actually conducted polling research.

There is no evidence that he has designed it either.

The fact that Nate seemed to entirely look past the possibility of weighting or stratified sampling as an explanatory factor here indicates that he actually understands very little about how actual polling is done.

Given the stature bestowed on him by his adoring fans, this is a tremendously amateurish oversight.

That Nate could be referred to as a "Prophet of Polling" yet be so inexperienced in the design and execution of polls is shocking.

That is the troubling aspect of this.


Douglas Rivers:

Matt Sheldon:

While in principle weighting can either increase or decrease the variance of estimates, in practice it tends to increase it. The reason for this is that most of the weighting occurs because of stable nonresponse patterns (phone samples tend to underrepresent men, young people, minorities, etc., but the day-to-day variation in sample demographics is small). If the weights are independent of the response variable, then variation in the weights necessarily increases the variance of the estimates. But it can go either way (Rod Little had an interesting article on this a few years ago, though I don't think his argument has much relevance to phone polls.)
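One common way to put a number on this is Kish's approximate design effect, which the comment above does not name and which is brought in here only as an illustration. It applies when the weights are unrelated to the response variable. The Python sketch below uses invented weights (half the sample at 0.5 and half at 1.5); the function name is an assumption as well.

```python
import math

def kish_design_effect(weights):
    """Kish's approximation: n * sum(w^2) / (sum(w))^2.

    Equals 1 for equal weights and grows with the spread of the weights.
    """
    n = len(weights)
    total = sum(weights)
    return n * sum(w * w for w in weights) / (total * total)

equal_weights = [1.0] * 360
unequal_weights = [0.5] * 180 + [1.5] * 180    # hypothetical weights

deff_equal = kish_design_effect(equal_weights)
deff_unequal = kish_design_effect(unequal_weights)
effective_n = len(unequal_weights) / deff_unequal

# SD of a 50% proportion with and without the weighting penalty.
sd_unweighted = math.sqrt(0.25 / 360)
sd_weighted = math.sqrt(0.25 / effective_n)
print(f"design effect: {deff_unequal:.2f}, effective n: {effective_n:.0f}")
print(f"SD without weights: {sd_unweighted:.4f}, with weights: {sd_weighted:.4f}")
```

Under these assumptions the weighted estimate behaves like an unweighted sample of 288 rather than 360, so its standard deviation is larger, not smaller.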

Michael Weissman:

I think you're missing my point. I think both Nate and you have identified real anomalies in the Research 2000 data. But you both casually assume simple random sampling without ever mentioning it. In Nate's case, the whole analysis was premised on what the variance should be with simple random sampling--which they're not. Usually there are corrections for gender, race, education, household size, number of phone lines, region, etc. Real world sampling does not look anything like the version in Statistics 1.



@Matt Sheldon:

Weighting should decrease topline variability, but should also increase crosstab variability.

I think Nate made a mistake comparing Gallup's variability in presidential tracking to R2k. Gallup's response-based likely voter model and unweighted party affiliation is going to produce much greater day to day swing than polling done with a set likely voter model (or just registered voters, I'm not sure which R2K used) and weighting by party.

Overall there are three big red flags for me:

1) Refusal to turn over the raw data - If a client I polled for ever asked for the raw data it would be no problem for me to turn it over.

2) The cleanliness of the crosstabs - Anyone who has seen lots of crosstabs knows they often produce odd and even contradictory results due to their small sample sizes.

3) The lack of movement in day to day tracking - Similar to crosstabs, single day results can produce head scratching results due to the sample size (even with heavy weighting) especially over a diverse population.


Michael Weissman:

Doug- After running up to work a while and returning with fresh eyes, two points:

First: Whoops E(var) is missing at multiple spots in my last note because I used pointy brackets.

Second: Oy, do we sound like a couple of academics! We agree on
1. the possibility of subcategory stabilization reducing Var.
2. and (I think you agree) that by far the most dramatic instance of that for R2K, per their cross-tabs, is party ID.
3. and Party ID is useless for stabilization within Party ID categories.
4. Nonetheless, in principle other stabilization, e.g. via gender, could have a very subtle effect on E(Var) within party ID categories.
5. No such effect is remotely close to removing the huge anomaly for the R2K IND-OTH.
6. In an academic paper, such issues ought to be mentioned.
So we're doing the academic fight thing over the question, I guess, of whether we should have mentioned that in the blog? Truce?


Douglas Rivers:

Michael Weissman:

Yes. Yes. Yes. Yes. Yes. Yes. Yes. I'm closing my browser now.


Michael Weissman:

For anyone disappointed with that non-dramatic conclusion, here's more entertainment:

the R2K tracking data set, in convenient format.


You can do your own calculations instead of listening to our boring agreement.

Thanks to Jonathan for doing this and Markos for giving permission.



Doug and Michael,
As a student of statistics, I am truly enjoying your "academic quibbling." Please, don't stop. It's very helpful to see qualified, thorough, and most of all, clear and specific analysis.



Since we know next to nothing about how Research 2000 does its sampling, and there is reason to wonder about their case management (based on Mark Blumenthal's link to strange final dispositions), I think this is another reason why the discussion above may have been "academic."

What if Research 2000 is using a rotating panel sample like the UMichigan Consumer Sentiment Survey? (This has been proposed by a commenter at 538.com.) That sample uses respondents in more than one wave, and then rotates them out. There may be a lag, i.e., the respondents aren't used in successive waves but perhaps they are reinterviewed a couple of rounds later? (This is a monthly survey.)

In this way the sample for each round consists of people who were interviewed before (about 40% of the R's) and fresh RDD respondents (about 60%). Because of this, there are two sets of weights -- one for each segment of the sample. And then the two segments have to be combined to estimate the sampling error for the various statistics for the total N of respondents. (There's a discussion of this on the UMich website.)

This is a "cheap" way to build a sample because previous R's have been located, agreed to be recontacted, and so forth. But there's probably something of a selection effect (something that can't simply be eliminated by weighting). I would guess that they also often remember their previous answers, and if there is some kind of strain toward consistency or some motivation behind their willingness to be interviewed again the "reinterviewed R's" are likely to respond to the current questions differently from the fresh RDD group (even adjusting for SES, age, gender, party, etc.).

I have no idea whether Research 2000 uses such an approach as a way to reduce costs, but if they do it could be a factor in the stability of responses over time. Not that this would account at all for the even-number miracle. But it could be relevant to the tests that both Silver and Rivers have done.


Michael Weissman:

We also do not know if R2K does that. What we do know (via Markos) is that, when Markos asked, Del Ali specifically and emphatically denied that he did anything like that.


Matt Sheldon:

Doug -

When I said that weights reduce variability, I was not talking about reducing the standard error measurements. Weighting can lead to a larger confidence interval among undersampled subpopulations. Rather, I was speaking of the week-to-week stability in the topline results of the poll.

It is hard for me to imagine a situation where weighting increases the volatility of the topline results from week-to-week. The weighting procedure effectively reduces a random sample to a balanced stratified sample. Ceteris Paribus this will eliminate sources of random variation due to sampling. If it reduces random noise, it should reduce variability unless the weights are badly designed and implemented.

To your point, in situations where you have extremely high weights (2-3x) for a small subpopulation (i.e. Asian-Americans in the South) this could magnify results so much that the ability of few respondents to drive big changes outweighs the reduction in random sampling issues.

In practice, having such large weights for any subpopulation would be counterproductive.

However, the kinds of weights we are considering (Party ID, Gender, Age) are so broad, and the resulting weights so close to 1 (.5-1.5x) that this scenario is unrealistic.

In my experience with executing brand tracking research (which is lengthy) our unweighted results were full of noise that the weighted results did not reflect.


Matt Sheldon:

Doug -

Please comment on the following possibility regarding the gender even/odd pattern:

This may explain the matching even/odd patterns on gender...

Below is a commonly used rounding algorithm in Java, PHP, and other object-oriented languages:

public static final int ROUND_HALF_EVEN
Rounding mode to round towards the "nearest neighbor" unless both neighbors are equidistant, in which case, round towards the even neighbor. Behaves as for ROUND_HALF_UP if the digit to the left of the discarded fraction is odd; behaves as for ROUND_HALF_DOWN if it's even. Note that this is the rounding mode that minimizes cumulative error when applied repeatedly over a sequence of calculations.
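The quoted description matches the documentation for Java's BigDecimal rounding constant. Python's standard decimal module exposes a rounding mode with the same name, so the behavior can be sketched as follows. The helper names are illustrative, the inputs are the figures from the example in this comment plus one extra tie, and nothing here is known to reflect the software Research 2000 actually used.

```python
from decimal import Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP

def half_even(value):
    """Round a decimal string to a whole number, sending ties to the even neighbor."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_EVEN))

def half_up(value):
    """Round a decimal string to a whole number, sending ties away from zero."""
    return int(Decimal(value).quantize(Decimal("1"), rounding=ROUND_HALF_UP))

for raw in ["34.4", "32.5", "33.5"]:
    print(raw, "half-even ->", half_even(raw), "| half-up ->", half_up(raw))
```

Only exact ties are affected: 32.5 goes to 32 under half-even and to 33 under half-up, while 34.4 rounds to 34 either way. Note that 33.5 goes to 34 under both modes, so half-even rounding changes the result only for ties that sit just above an even number.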

Now, if you look at the R2K crosstabs, you will see that every row totals to 100% and is rounded. This is certainly intentional and totally legitimate.

What this implies is the existence of a rounding algorithm which must decide how to round the numbers in a way that totals 100%, yet minimizes the total induced rounding error.

In other words, I do NOT want a crosstab like this:

M 34.4
F 32.5
GAP 1.9%

to look like...

M 34
F 33
GAP 1.0%

I would rather it be...

M 34
F 32

The gap of 2.0% is closer to the actual gap of 1.9%.

This implies that the numbers would be rounded as a group rather than individually.

If the above coding is part of that algorithm, is it not plausible that this pattern is simply an artifact of that artificial rounding method?

Perhaps the coding breaks down only in binomial categories (male vs. female)?

There was no observed pattern like this in ANY of the other crosstabs which had 3 and 4 categories.

I think it is vastly more plausible that this was sloppy rounding code that works for multinomial crosstabs, but breaks down for binomial crosstabs.

Sloppy rounding is nothing remotely close to fraud.

It is not clear that either the authors of the analysis, or Nate Silver, ever entertained this possibility.

It is further proof that they really have no experience in how this research gets produced and the ins-and-outs of crosstab generation.


This module, as part of a comprehensive rounding algorithm, could produce the observed patterns on gender.


Matt Sheldon:

Chris@PTS -

I agree with all of your points.

With regards to the failure to turn over data, this is troubling, but has several mitigating factors...

1. You have just been fired in a very public way by said client
2. Said client is trashing you in the media
3. Said client has not paid recent invoices (as R2K claims)
4. Said client is investigating you and hinting at lawsuits

Given that, I can see a lack of full cooperation and not assume guilt.

Markos is the guy who lawyered up first.


Matt Sheldon:

From a comment on 538.com....


A regression on the individual gender approvals and overall approval rating suggests that R2K weights its sample to be 50% female / 50% male.

Regression Statistics
Multiple R 0.999977223
R Square 0.999954447
Adjusted R Square 0.999954334
Standard Error 0.095356729
Observations 813

Intercept -0.004174906
MEN 0.499635141
WOMEN 0.500651122

t Stat
Intercept -0.38877702
MEN 1160.7836
WOMEN 2143.945047


Given that the weighting is 50/50, this will introduce some new properties on the numbers.

If I say that the average of M/F approval MUST equal the overall rounded approval EXACTLY, then the emerging pattern MUST be that both are even or that both are odd.


Overall Approval = 51% Both MUST be the same.

52/50 or 53/49 or 54/48, etc.

Overall Approval = 52% Again, both MUST be the same.

53/51 or 54/50 or 55/49, etc.

The average of M/F should EXACTLY equal the overall approval.

They do. This is a property of the 50/50 weighting.
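The parity argument can be verified by brute force. The Python sketch below assumes, as this comment does, that the rounded male and female figures are forced to average exactly to the rounded topline; whether Research 2000 enforced such a constraint is not established. The function names are illustrative.

```python
def same_parity(m, f):
    """True when both whole-number percentages are even or both are odd."""
    return m % 2 == f % 2

def averages_to_whole_number(m, f):
    """True when (m + f) / 2 is itself a whole number."""
    return (m + f) % 2 == 0

def pairs_for_topline(topline, max_gap):
    """Whole-number (higher, lower) pairs whose average is exactly `topline`."""
    return [(topline + d, topline - d) for d in range(1, max_gap + 1)]

# Check every pair of whole-number percentages from 0 to 100.
all_pairs = [(m, f) for m in range(101) for f in range(101)]
claim_holds = all(
    same_parity(m, f) == averages_to_whole_number(m, f) for m, f in all_pairs
)
print("parity matches exactly when the average is whole:", claim_holds)
print("pairs averaging to 51:", pairs_for_topline(51, 3))
```

The check confirms the arithmetic only under the stated constraint. If the three published numbers are instead rounded independently from unrounded values, nothing forces the two gender figures to share parity.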

IF, this is the answer to the riddle, then there is lots of egg on everyone's face.



@ Matt Sheldon. The ratio of females to males is closer to 55/45, so there would be no reason to weight them 50/50.


Matt Sheldon:


I am not arguing that a 50/50 weight is what is appropriate for accuracy, although you are wrong on the 55/45 split.

The 15-64 population is exactly 50/50.

The total population is 49.2% Male and 50.8% Female. Go check for yourself.

Rather, I am simply reporting what a regression on the crosstabs suggests.

It suggests that he used a 50/50 weight.

Doing this would then make his rounding procedure totally legitimate.

If the topline number is 55% then the male and female % must average to 55%.

By this logic, it makes total sense that a QA check would enforce that.

The result is the even/odd pattern.



The problem I have is that it would not be legitimate to weight 50/50 like that if the relevant population (all RV) is not 50/50. My understanding was that RV is 55/45, although I just learned that I was incorrect: census.gov indicates that it is approximately 52/48. My two mistakes (referring to all population and simply being wrong as to the percentage breakdown) notwithstanding, the point is the same, though -- because the population is not 50/50, it would not be legitimate for him to round to 50/50, which is what you indicate it seems that he did.

It is obviously possible that there is some sound explanation for what he was doing (i.e., not fraud or simply incredibly shoddy polling), but possible is not the same as likely. In my opinion, all of this (by all of this, I mean everything that Blumenthal, et al. have discussed in various posts) looks very bad for him.

