Most of the surveys we analyze and report on are ones we planned and conducted ourselves. But what about when you are assessing surveys conducted by others? How do you evaluate their quality and filter the information they provide accordingly?
I decided to put that question to someone who evaluates others’ surveys for a living. In my recent interview with Huffington Post’s Mark Blumenthal, I asked him how they evaluate the quality of the polls they report on. His answers are instructive for any researcher evaluating survey quality.
Dana Stanley: How do you evaluate the quality of different polls?
Mark Blumenthal: That is the toughest, hardest ongoing question that we have, because I think there’s an assumption among ordinary people, based in reality, that there are higher quality and lower quality surveys. And the assumption, which I think is the tougher one, is that somehow we ought to be able to sort them into two categories: quality, reliable, worthy, accurate research; and crap. I think there’s a little bit too much of an assumption that they’re all either one or the other. What I find in reality is that, for better or for worse, they’re all pretty ugly.
In one sense, if what you use as your yardstick is the accuracy of the poll’s final estimate compared to the election result, it’s very hard, as a statistical matter, to tease out much variance between polling organizations. Among the people who try to do this, you can see that only one or two organizations have been, in a statistically meaningful way, less accurate almost every year. Only a very small handful. And it’s really tough, because the more polls you do, the less error you have and the more accurate you look. So the most prolific pollsters usually look like the most accurate. The reality is that those accuracy measures don’t really help us sort out good from bad.
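As an aside not from the interview: one way to see why a prolific pollster can look accurate is ordinary margin-of-error arithmetic. The sketch below assumes simple random sampling, which real polls only approximate, and the sample sizes are illustrative only:

```python
import math

# Illustrative sketch (assumes simple random sampling, which real polls
# only approximate): purely random error shrinks when many poll results
# are averaged, so prolific pollsters can look accurate regardless of rigor.

def margin_of_error(p, n, z=1.96):
    """95% margin of error for an estimated proportion p with sample size n."""
    return z * math.sqrt(p * (1 - p) / n)

one_poll = margin_of_error(0.5, 1000)       # a single poll of 1,000 respondents
sixteen_polls = one_poll / math.sqrt(16)    # average of 16 independent such polls

print(round(one_poll, 3))       # roughly 0.031, i.e. about +/- 3 points
print(round(sixteen_polls, 3))  # roughly 0.008, under +/- 1 point
```

The same arithmetic cuts the other way: with only a handful of elections per cycle to check against, it takes a lot of polls before one organization’s error is statistically distinguishable from another’s, which is Blumenthal’s point.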
And that has two implications. Most of the time, it means you can use less rigorous methods and still predict election outcomes reasonably well. That’s the unfortunate, inconvenient truth about looking at “accuracy,” quote unquote. But then there’s the other way to judge survey quality, which is to look at how the surveys were done. And even there, one of the controversies, one of the sources of ongoing debate, is about taking the interviewer out of the survey and doing an automated robocall, or IVR, poll.
And so there’s this huge debate: should we measure them or not measure them? But if you look more closely at the individual practitioners, the three biggest, most prolific conductors of automated surveys have very little in common with each other in terms of how they sample and how they weight data. There’s one that has long done call-backs, runs a survey over a field period of two or three days, and is very disciplined about not doing surveys of more than three or four minutes in length, while the others run much longer questionnaires. One of the three uses registered voter files, the other two do not; one weights everything by party, the other two don’t.
And I think all those things I just rattled off are probably of greater consequence than the missing survey interviewer in thinking about whether they have house effects, or how they consistently differ from the others. Having a self-administered telephone survey is just one property. So here’s the reality I see, and let’s throw in one other thing about the rigorous, expensive, high-quality surveys I know of: just last week the Pew Research Center released the third wave of a non-response study that they’ve run, the first one in, I think, 1997. Using their methods, which I think are more expensive and more rigorous than 90-something percent of the other media surveys out there, they’re getting a typical response rate of 9%.
Now we know from the very work that they released last week that, for the most part, they’re not seeing bias, they’re not seeing non-response error in those results, to the extent that that’s measurable. But it’s hard to hold up the RDD sample, even with cell phones, even with lots and lots of money spent on being persistent and getting someone on the phone, as a perfect benchmark that cannot be questioned. Because any survey now has some random error, or response bias, or coverage problems that should make you question the results, regardless of the methods used to collect the data.
Stanley: You mentioned the recent Pew Research Center study looking at the long-term decline in telephone survey response rates, and the 9% figure you quoted was from their telephone survey work in 2012 so far. I think a lot of people, and you wrote a post about that, were shocked that the response rate is now into the single digits. Yet there was some encouraging–
Blumenthal: If the Pew Research Center, with a six-day field period, and I think they’re up to something like eight attempts over those six days, is getting a 9% response rate, the reality of most commercial data is that it’s a whole lot lower than that. And that’s the world we live in, for telephone stuff anyway.
Stanley: How low can it go before the data start to really–
Blumenthal: Well, I think the point here is not to say that the sky is falling and that we can’t do research anymore. It’s clear we’re still doing this kind of research. The numbers have fallen a fair amount in two years, but not that far. In 2004, 2008, and 2010, most surveys got the Presidential, Senate, and governor elections right, with a couple of prominent misses.
It’s almost miraculous to me how accurate they continue to be, considering all the challenges we’re up against. But I think the important thing for anybody consuming, paying for, or otherwise reading about survey numbers to understand is that the notion that what makes for accurate, representative data is the magic of random sampling, those days are long gone. What makes it accurate is some sort of process that takes biased, potentially skewed data as collected and turns it into something representative. And those processes are all different.
For the reigning RDD cell phone and landline sampling that Pew Research, ABC/Washington Post, CBS/New York Times, CNN, AP-GfK, and all the other news organizations do, that involves calling cell phones and landlines and getting interviews that, unweighted, are going to be skewed. They’re going to have a much harder time interviewing people in urban areas, a much harder time interviewing non-whites, people who are less well educated, and the like. And they’re going to statistically adjust and weight those interviews to match the higher-quality numbers we have from the Census. That process of adjusting the numbers after you’ve collected them is what makes the data representative and removes most of the bias that seems to come up.
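The adjustment step Blumenthal describes can be sketched as simple post-stratification weighting. The groups, shares, and support figures below are entirely hypothetical, chosen only to show the mechanics, not drawn from any real poll:

```python
# Hypothetical post-stratification sketch: each group's weight is its
# population (Census) share divided by its share of the raw sample.
# All figures below are made up for illustration.

def poststratify(sample_shares, population_shares):
    """Return a weight per group so the weighted sample matches the population."""
    return {g: population_shares[g] / sample_shares[g] for g in sample_shares}

# Raw sample over-represents group A (say, easier-to-reach landline households).
sample_shares = {"A": 0.7, "B": 0.3}
population_shares = {"A": 0.5, "B": 0.5}   # benchmark, e.g., Census figures

weights = poststratify(sample_shares, population_shares)

# Candidate support within each group (again, invented numbers).
support = {"A": 0.4, "B": 0.6}

unweighted = sum(sample_shares[g] * support[g] for g in support)
weighted = sum(sample_shares[g] * weights[g] * support[g] for g in support)

print(round(unweighted, 2))  # 0.46, skewed toward group A's preference
print(round(weighted, 2))    # 0.5, matching the true population mix
```

In practice pollsters adjust on many crossed cells (age, race, education, region, and so on), often with iterative raking rather than a single division, but the arithmetic idea is the same.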
And so the harder question is how much of the rigor is really necessary. As the response rates get lower and lower, at what point (and this is the ultimate question we’re all working on) can you start with a panel? Can you start with a pool of non-random respondents and apply that kind of weighting or stratification or selection or whatever, and get representative data? We’re starting to see that in some applications we’re doing pretty well.