Doug Rivers: Second Thoughts About Internet Surveys

Topics: Douglas Rivers, Gary Langer, Internet Polls, Jon Krosnick, Probability samples, Sampling, Weighting

Douglas Rivers is president and CEO of YouGov/Polimetrix and a professor of political science and senior fellow at Stanford University's Hoover Institution. Full disclosure: YouGov/Polimetrix is the owner and principal sponsor of Pollster.com.

I woke up on Tuesday morning to find several emails pointing me to Gary Langer's blog posting, which quoted extensively from a supposedly new paper by Jon Krosnick. These data and results appeared previously in a paper, "Web Survey Methodologies: A Comparison of Survey Accuracy," Krosnick coauthored with me and presented at AAPOR in 2005. The "new" paper has added some standard error calculations, some late arriving data, and a new set of weights, but the biggest changes in this version are a different list of authors and conclusions.

The 2005 study compared estimates from identical questionnaires fielded to a random digit dial (RDD) sample by telephone, an Internet-based probability sample, and a set of opt-in panels. Of these, the Internet probability sample had the smallest average absolute error, followed closely by the RDD telephone survey, and the opt-in Internet panels were around 2% worse. In his presentation of our paper at AAPOR in 2005, Krosnick described the results of all the surveys, both probability and non-probability, as being "broadly similar." My own interpretation of the 2004 data, similar to James Murphy's comment on AAPORnet, was that although the opt-in samples were worse than the two probability samples, the differences were small enough--and the cost advantage large enough--to merit further investigation. Even if it were impossible to eliminate the extra 2% of error from opt-in samples, they could still be a better choice for many purposes than an RDD sample that cost several times as much.

Krosnick now concludes that "Non-probability sample surveys done via the Internet were always less accurate, on average, than probability sample surveys" and, tendentiously, criticizes "some firms that sell such data" who "sometimes say they have developed effective, proprietary methods" to correct selection bias in opt-in panels.

In fact, the data provide little support for Krosnick's argument. The samples from the opt-in panels were, as we noted in 2005, unrepresentative on basic demographics such as race and education because the vendors failed to balance their samples on these variables, while the two probability samples were balanced on race, education, and other demographics. This is not a result of probability sampling, but of non-probabilistic response adjustments. It is too late to re-collect the data, but the solution (invite more minorities and lower educated respondents) doesn't involve rocket science.

Instead, Krosnick tries to fix the problem by weighting, and concludes that weighting doesn't work. A more careful analysis indicates, however, that despite the large sample imbalances in the opt-in samples, weighting appears to remove most or all selection bias in these samples. Because the samples were poorly selected, heavy weighting is needed and this results in estimates with large variances, but no apparent bias. In fact, if we combine the opt-in samples, we can obtain an estimate with equal accuracy to the two probability samples.

First, consider the RDD telephone sample. The data were collected by SRBI, which used advance letters, up to 12 call attempts, $10 incentives for non-respondents, and a field period of almost five months. Nonetheless, the unweighted sample was significantly different from the population on ten of the 19 benchmarks. RDD samples, like this one, consistently underrepresent male, minority, young, and low-education respondents. These biases are reasonably well understood and, for the most part, can be removed by weighting the sample to match Census demographics.

Next, consider the Probability Sample Internet Survey, conducted by Knowledge Networks (KN). The unweighted sample does not exhibit the skews typical of RDD. How is this possible, since the KN panel is also recruited using RDD? Buried in a footnote is an explanation of how KN managed to hit the primary demographic targets more closely than SRBI (which had a much better response rate). The answer is that "The probability of selection was also adjusted to eliminate discrepancies between the full panel and the population in terms of sex, race, age, education, and Census region (as gauged by comparison with the Current Population Survey). Therefore, no additional weighting was needed to correct for unequal probabilities of selection during the recruitment phase of building the panel." That is, the selection probabilities that are supposedly so important to probability sampling were not used because they would have generated an unrepresentative sample!

The opt-in panels, for the most part, were not balanced on race and education. Only one of the opt-in samples, Non-Probability Sample Internet Survey #6, actually used a race quota. Another, the odd Non-Probability Sample Internet Survey #7, claims to have sent invitations proportionally by race and ended up with 46% of the sample white, despite a 51% response rate. (This survey will be excluded from subsequent comparisons.) Non-Probability Sample Internet Survey #1 involved large over-samples of African Americans and Hispanics. I could find no explanation of how Krosnick dealt with the oversamples in the 2009 paper, but the racial composition of that sample should either match the benchmark exactly (if the conventional stratified estimator is used) or be far off (if the data are not weighted). In fact, the proportion of whites and Hispanics is off by 1% to 2%.

The selection of a subsample of panelists for a study is critical to the accuracy of opt-in samples. Regardless of how the panel was recruited, the combination of nonresponse or self-selection at the initial stage with subsequent panel attrition will tend to make the panel unrepresentative. In 2004, we instructed the panel vendors to use their normal procedures to produce a sample representative of U.S. adults. The practice then (and perhaps now for some vendors) was to use a limited set of quotas. If you didn't ask most opt-in panels to use race or education quotas, they wouldn't use them.

Even without correcting these obvious imbalances, the opt-in samples provided what most people would consider usable estimates for most of the measures. For example, the percentage married (unweighted) was between 53.7% and 61.5% (vs. a benchmark of 56.5%). The percentage who worked last week (unweighted) was between 53.6% and 63.1% (vs. a benchmark of 60.8%). The percentage with 3 bedrooms (unweighted) was between 41.2% and 46.1% (vs. a benchmark of 43.4%). The percentage with two vehicles (unweighted) was between 40.1% and 46.9% (vs. a benchmark of 41.5%). Home ownership (unweighted) was between 64.8% and 72.8% (vs. a benchmark of 72.5%). The percentage who have one drink on average (unweighted) was between 33.8% and 40.2% (vs. a benchmark of 37.7%). The KN and phone samples were better, but the difference was much less than I expected. (Before doing this study, I thought the opt-in samples would all look like Non-Probability Sample Internet Survey #7.)

The 2009 paper attempts to correct these imbalances by weighting, but the weighted results do not show what Krosnick claims. He uses raking (also called "rim weighting") to compute a set of weights that range from .03 to 70, which he then trims at 5. The fact that the raking model wants to weight a cell at 70 is a sign that something has gone wrong and can't be cured by arbitrarily trimming the weight. If there really are cells underrepresented by a factor of 70, then trimming causes severe bias for variables correlated with the weight and not trimming causes the estimates to have large variances. In either case, the effect is to increase the mean absolute error of estimates.
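To make the mechanics concrete, here is a minimal sketch of raking followed by trimming. It is not the procedure used in the 2009 paper: the sample of 100 respondents, the population targets, and the helper names (`rake`, `trim`, `weighted_share`) are all invented for illustration, and only the cap of 5 comes from the discussion above.

```python
from collections import defaultdict

def rake(rows, targets, n_iter=100):
    """Iteratively rescale weights so each variable's weighted margins match
    the target shares. Returns weights normalized to a mean of 1."""
    weights = [1.0] * len(rows)
    for _ in range(n_iter):
        for var, shares in targets.items():
            total = sum(weights)
            current = defaultdict(float)
            for row, w in zip(rows, weights):
                current[row[var]] += w
            # Scale every respondent in a category by target / current total.
            for i, row in enumerate(rows):
                cat = row[var]
                weights[i] *= shares[cat] * total / current[cat]
    mean_w = sum(weights) / len(weights)
    return [w / mean_w for w in weights]

def trim(weights, cap=5.0):
    """Cap every weight at `cap` (no renormalization, for simplicity)."""
    return [min(w, cap) for w in weights]

def weighted_share(rows, weights, var, cat):
    """Weighted proportion of respondents whose `var` equals `cat`."""
    hit = sum(w for r, w in zip(rows, weights) if r[var] == cat)
    return hit / sum(weights)

# Invented sample: heavily white and college-educated.
rows = (
    [{"race": "white", "educ": "college"}] * 90
    + [{"race": "white", "educ": "no_college"}] * 5
    + [{"race": "nonwhite", "educ": "college"}] * 4
    + [{"race": "nonwhite", "educ": "no_college"}] * 1
)
# Invented population margins.
targets = {
    "race": {"white": 0.7, "nonwhite": 0.3},
    "educ": {"college": 0.3, "no_college": 0.7},
}

raked = rake(rows, targets)
trimmed = trim(raked, cap=5.0)
print("largest raked weight:", round(max(raked), 1))
print("nonwhite share, raked:", round(weighted_share(rows, raked, "race", "nonwhite"), 3))
print("nonwhite share, trimmed:", round(weighted_share(rows, trimmed, "race", "nonwhite"), 3))
```

In this toy sample the raked weights reproduce the target margins but lean on a single respondent with a very large weight. Capping the weights pulls the weighted margins back toward the unbalanced sample, which is the bias-versus-variance trade described above.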

The fact that the trimmed and untrimmed weights have about the same average absolute error does not mean that weighting is unable to remove self-selection bias from the sample. The mean absolute error is a measure of accuracy. It is driven by two factors: bias (the difference between the expected value of the estimate and what it is trying to estimate) and variance (the variation in an estimate around its expected value from sample to sample). The usual complaint about self-selected samples is that you can never know whether they will be biased or the size of the bias. Inaccuracy due to sampling variation can be reduced by just taking a larger sample. Bias, on the other hand, doesn't decrease when the sample size is increased.
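The distinction can be illustrated with a short calculation. The sketch below uses root mean squared error rather than mean absolute error because it splits cleanly into bias and variance, and it assumes an unweighted simple random sample of a proportion. The function name and the 3-point bias are hypothetical.

```python
import math

def rmse_of_proportion(bias, p, n):
    """Root mean squared error of an estimated proportion.

    bias: fixed bias of the estimator (as a fraction, e.g. 0.03)
    p:    true population proportion
    n:    sample size, assuming simple random sampling
    """
    variance = p * (1.0 - p) / n
    return math.sqrt(bias ** 2 + variance)

# Error in percentage points for an unbiased and a biased estimator.
for n in (500, 2000, 8000, 32000):
    unbiased = 100 * rmse_of_proportion(0.0, 0.5, n)
    biased = 100 * rmse_of_proportion(0.03, 0.5, n)
    print(n, round(unbiased, 2), round(biased, 2))
```

The unbiased column keeps shrinking as the sample grows, while the biased column levels off near the size of the bias.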

Obviously, unweighted estimates from these opt-in samples will be biased because the vendors ignored race and education when selecting respondents. This wouldn't have been difficult to fix, but it wasn't done. Apparently very large weights are needed to correct demographic imbalances in these samples, but the large weights give estimates with large variances and, hence, a high level of inaccuracy. If one tries to control the variance, as Krosnick does, by trimming the weights, then the variance is reduced at the expense of increased bias. The result, again, is inaccuracy. We are asking the weighting to do too much.

A simple calculation shows that all of Krosnick's results are consistent with the weighting removing all of the bias from the opt-in samples. One way to combat increased variability is to combine the six opt-in samples. Without returning to the original data, a simple expedient is to just average the estimates. Since the samples are independent and of the same size, the average of 6 means or proportions should have a variance about 1/6 as large as the single sample variances. The variance is approximately equal to the square of the mean absolute error which, after weighting, was about 5 for the opt-in samples, implying a variance of about 25. If there is no bias after weighting, then the variance of the average of the estimates should be 25/6 or approximately 4, implying a mean absolute error of about 2%.
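The arithmetic in the preceding paragraph can be written out as follows. This is only a back-of-the-envelope sketch: like the text, it treats the mean absolute error as if it were a standard deviation and assumes independent, equally sized, unbiased samples. The function name is hypothetical.

```python
def predicted_error_after_averaging(single_sample_error, n_samples):
    """Predicted error of the average of n independent, unbiased estimates,
    treating the per-sample error as a standard deviation."""
    variance_single = single_sample_error ** 2
    variance_of_average = variance_single / n_samples
    return variance_of_average ** 0.5

# About 5 points of error per opt-in sample, six samples averaged.
print(round(predicted_error_after_averaging(5.0, 6), 1))  # about 2 points
```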

How does this prediction pan out? If we average each of the weighted estimates and compute the error for each item using the difference between the average estimate and the benchmark, the mean absolute error for the opt-in samples is 1.4% -- almost identical to the mean absolute error for each of the weighted probability samples. That is, the amount of error reduction that comes from averaging the estimates is about what would be predicted if all of the bias had been removed by weighting. Thus, the combination of these six opt-in samples gives an estimate with about the same accuracy as a fairly expensive probability sample (which also required weighting, though not as much).

There is no reason, however, why you should need six opt-in samples to achieve the same accuracy as a single probability sample of the same size. If the samples were selected appropriately, then we could avoid the need for massive weighting. It is still an open question what variables should be used to select samples from opt-in panels or what the method of selection should be. In the past few years, we have accumulated quite a bit of data on the effectiveness of these methods, so there is no need to focus on a set of poorly selected samples from 2004.

Probability sampling is a great invention, but rhetoric has overtaken reality here. Both of the probability samples in this study had large amounts of nonresponse, so that the real selection probability--i.e., the probability of being selected by the surveyor and the respondent choosing to participate--is not known. Usually a fairly simple nonresponse model is adequate, but the accuracy of the estimates depends on the validity of the model, as it does for non-probability samples. Nonresponse is a form of self-selection. All of us who work with non-probability samples should spend our efforts trying to improve the modeling and methods for dealing with the problem, instead of pretending it doesn't exist.


Gian Fulgoni:

Hi Doug:

Thanks for the explanation and clarification. You set the record straight and it was needed.



"The opt-in panels, for the most part, were not balanced on race and education."

What do you mean by the word "balanced"? Do you mean stratifying from the panel based on demographics?

I think the point of the Krosnick analysis was that when the results for probability samples were weighted based on primary demographics, the results for other questions got closer to the accepted government estimates, while such weighting failed to bring opt-in surveys in line with the govt estimates. This implies that the weight strata are not representative of their respective groups in the general population. What is your response to this critique?



This is a good post and I appreciate the attention to detail.

Weights of 70 are truly a bad sign. To get an idea how bad, you can do a simple effective sample size calculation using the sum of the weights and the sum of the weights squared. This is often sobering.
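A minimal sketch of that calculation, using the approximation commonly attributed to Kish (the squared sum of the weights divided by the sum of the squared weights). The function name and the example weights are invented for illustration.

```python
def effective_sample_size(weights):
    """Approximate effective sample size: (sum of w)^2 / (sum of w^2)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq

# Hypothetical sample of 100: one respondent weighted at 70, the rest at 1.
weights = [70.0] + [1.0] * 99
print(round(effective_sample_size(weights), 1))
```

With equal weights the function returns the actual sample size, and a single very large weight pulls the result far below it.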

