Online Survey Sample Is Not Clean Enough – Clean It Yourself

This information is useful for people who use panel sample for online surveys, and who want to make sure their survey data is truly clean.

Online Survey Panels Tell Us Their Panelists Are Clean

It’s hard to open a marketing magazine without seeing an ad from an online survey panel company proclaiming how clean and high quality its panel is.  A few years ago, this claim was a big deal – it was the Wild West of online survey panels, and buyers of sample had to be very careful about whom they worked with.  Today, however, most major online survey sample companies have adopted measures to get rid of professional respondents, prevent over-surveying, and make sure that respondents are who they say they are.  So, whether the pitch is “true” sample, “pure” sample, or “attention to detail”, most reputable panel companies are doing a decent job of giving those of us who field surveys a good product.

But Survey Data is Still Dirty

However, and here’s a big however, the data from most online surveys using panel sample still comes in with some dirty responses.  My research shows that between 1 and 5% of survey data from panel sample is garbage.  Garbage – throw it out; don’t bring it into your final dataset to analyze.  Sure, one can blame some of these dirty responses on frustrated respondents dealing with poor survey writing (bad questions, too long, etc.), but the fact remains that you had better clean that survey data before it goes in for analysis.

So, How Do I Clean the Data?

Here’s a plan you can use to clean your data.

When I say “flag” below, I mean that you create a new variable in your dataset next to the variable you are examining, and you place a “1” in that column for any respondent whose case is flagged.  (A minimal code sketch of this flag-and-sum workflow follows the list.)

  1. Flag speeders. Look at time to completion and flag those respondents who took the survey in an unrealistically short time.  Check the median time to completion and establish rules that you feel comfortable with – I often flag those taking < 1/3 of the median time with a “1” (“speeder”), and those taking < 1/4 of the median time with a “2” (“super speeder”).  You might consider removing outliers (at the slow end) before calculating your median.
  2. Flag straightliners. If you have any grid/matrix questions, flag those respondents who gave the same response to every item (unless it makes sense that they could do so).
  3. Flag gibberish or garbage responses. If you have any open-ended responses, look for text such as “asdf” or “…..”; flag these responses, and any other “colorful, yet meaningless” responses you find.
  4. Flag incongruent combinations. If a respondent says their company size is 1000 and the number of PCs in the company is 5, something’s wrong here.  Flag it.
  5. Trap questions. Did you include any questions such as “Please choose the third response below” or “Please type the word ‘attention’ below”?  If you did, check them, and flag those respondents who didn’t follow the directions.
  6. Sum up your flags. Compute a new variable that sums all the flags.
  7. Sort your dataset by the summed variable. This brings the cases with suspicious answers on a number of your checks to the top.
  8. Inspect and delete cases with flags. Review the flagged cases and delete those that are too “dirty” to be included.  Agree on the deletions with key stakeholders.
  9. Notify your vendor of any bogus respondents. All the vendors I work with do not charge for any respondents I have flagged for deletion.  Show them the IDs of the respondents you threw out, and they’ll take action on their side to warn and/or remove these panelists from their database.
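
Here’s a minimal sketch of the flag-and-sum workflow above in Python/pandas. The column names (“seconds”, “q10_1”–“q10_5”, “open_end”, “company_size”, “num_pcs”, “trap”) and the cutoffs are hypothetical stand-ins – adapt them to your own survey export.

```python
import pandas as pd

# Hypothetical export; column names below are illustrative, not prescriptive.
df = pd.read_csv("survey_export.csv")

# 1. Flag speeders against the median completion time.
#    (Consider dropping slow outliers before taking the median.)
median_time = df["seconds"].median()
df["flag_speeder"] = 0
df.loc[df["seconds"] < median_time / 3, "flag_speeder"] = 1  # speeder
df.loc[df["seconds"] < median_time / 4, "flag_speeder"] = 2  # super speeder

# 2. Flag straightliners: the same answer to every item in a grid question.
grid_cols = ["q10_1", "q10_2", "q10_3", "q10_4", "q10_5"]
df["flag_straightline"] = (df[grid_cols].nunique(axis=1) == 1).astype(int)

# 3. Flag gibberish open ends (a rough heuristic; review them by hand too).
text = df["open_end"].fillna("").str.strip().str.lower()
df["flag_gibberish"] = (
    text.isin(["asdf", "n/a"]) | text.str.fullmatch(r"[.\s]*")
).astype(int)

# 4. Flag incongruent combinations, e.g. 1,000 employees but only 5 PCs.
df["flag_incongruent"] = (
    (df["company_size"] >= 1000) & (df["num_pcs"] < 10)
).astype(int)

# 5. Flag failed trap questions (here, option 3 is the "correct" answer).
df["flag_trap"] = (df["trap"] != 3).astype(int)

# 6. Sum the flags, then 7. sort so the dirtiest cases come to the top.
flag_cols = [c for c in df.columns if c.startswith("flag_")]
df["flag_total"] = df[flag_cols].sum(axis=1)
df = df.sort_values("flag_total", ascending=False)

# 8. Inspect the worst cases by hand before deleting anything.
print(df.loc[df["flag_total"] > 0, ["flag_total"] + flag_cols].head(20))
```

Note that the “super speeder” flag counts as 2 in the sum, so the fastest respondents sort even higher – a handy side effect of the 1/2 coding.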

Following the steps above will ensure that the data you analyze is as clean as possible.  Yes, it takes a bit of time, but the effort is clearly worth it compared to making decisions based on the analysis of data that includes bogus responses.

One last note: if you really need your final sample size to hit a specific number, and you can’t go below that number, you can over-sample, in anticipation of throwing out some respondents.
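
As a rough illustration of the over-sampling math – assuming (my assumption, based on the 1-5% range above) that you expect to throw out up to 5% of completes:

```python
import math

target_n = 400        # final sample size you must hit (example value)
expected_loss = 0.05  # worst-case share of completes you expect to delete

field_n = math.ceil(target_n / (1 - expected_loss))
print(field_n)  # 422 – field about 422 completes to be safe
```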

Feel free to contact me for more details about some of the specific techniques I have found useful to clean data, or follow me on Twitter @NicoPeruzziPhD to hear about other marketing research topics.

  • Jackie Anderson

    Love the ideas here. Can you discuss a little more about the research you’ve done to come up with the 1-5% figure you reference above?

    Thanks!

  • http://researchaccess.com/nico Nico Peruzzi

    The 1-5% figure comes from studies I’ve done using online research panels over the past 5 or so years – since I’ve had a formal cleaning program in place. These studies used all the name-brand online panels, both B2B and B2C, covered lots of tech and consumer electronics, and were primarily North American. I have seen the figure go higher with international studies.

  • http://www.genroe.com Adam Ramshaw

    Nico,

    Good piece that brings some rigor to the data cleaning process. This is applicable not just to the use of panels but to all on-line surveys.

    Thanks for posting it.

  • http://www.markettools.com/truesample Emily Morris

    So true, Nico! We see similar statistics across all of the surveys that use TrueSample for ensuring data quality. For any given survey, TrueSample typically flags 3-5% of the respondents as “unengaged”, meaning they sped and/or straight-lined on a significant percentage of the questions — and this is after the sample has already been cleansed of “fake” or “duplicate” respondents.

    I think it’s very important that research buyers implement their own quality assurance mechanisms and audits on their surveys, because while using clean sample certainly helps, it isn’t enough.

  • http://www.twitter.com/duey23 Brian LoCicero

    Agree with everything you’ve written; however, we really shouldn’t be limiting these cleaning techniques to JUST online. I have seen just as much garbage data on the telephone and with f2f interviewing.

    I DO have to point out though, as you did touch upon, that we as an industry have done little to nothing to improve how we converse with respondents in the last 30+ years.

    If we continue to use “inside speak” words and terminology in surveys, speak AT them and not WITH them, and continue to use horrible “form-based” interfaces, all the internal data cleaning steps in the world will not make up for a disengaged human being who is being forced through our surveys.


  • Katrin

    Dear Mr. Peruzzi,

    I am a student from Germany and I’m currently writing my diploma thesis (psychology).

    I ran an online questionnaire at a German insurance company and am now looking for ways to clean my dataset. I am especially interested in cleaning the data based on the time people spend answering my questions.

    I found this article and I would like to use your approach to data cleansing based on time (excluding those who answered faster than 25-30% of the median time), and I have two questions about this approach:

    1. What is your reason for defining 25-30% of median time as a cutoff?

    2. Do you have any publications or papers about your approach to data cleansing? Could you recommend some? If I use your approach, I would like to cite it properly in my thesis to justify my procedure and to show that I didn’t do this cleaning procedure at random.

    That would help me very much.

    Thank you and kind regards,

    Katrin