Meet the Data Triplets: Data, Metadata and Paradata

There are three sorts of data, and very often you need all three to understand and use the data you collect from your survey.

Here are the three sorts:

1. The Data

The data is the data: the actual numbers, codes or open-ended text that the respondent enters into the survey. There should probably be a better way of describing this than “the data”; maybe “raw data” is a better term to use.

2. The Metadata

Metadata describes the raw data.

For instance, the raw data for question 2 may be a “1” or a “2”. The metadata would say that the value “1” means male and the value “2” means female. Or the metadata may say that the values for question 3 must fall within a range, such as 1-100 or 32-78.

Metadata gives meaning to the raw data, so it is vital that the metadata is present during analysis of the raw data. Otherwise the raw data is just a collection of numbers with no meaning.
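To make this concrete, here is a minimal Python sketch of metadata being applied to raw values. The `METADATA` layout, the `describe` function and the labels “Gender” and “Q3” are assumptions made up for illustration; only the codes for question 2 and the 1-100 range for question 3 come from the example above.

```python
# Hypothetical metadata for two survey questions.
# The structure and the labels are illustrative assumptions.
METADATA = {
    "q2": {"label": "Gender", "values": {1: "male", 2: "female"}},
    "q3": {"label": "Q3", "range": (1, 100)},
}


def describe(question, raw_value):
    """Turn a raw value into something readable, using the metadata."""
    meta = METADATA[question]
    if "values" in meta:
        # Coded question: look the code up in the value labels.
        if raw_value not in meta["values"]:
            raise ValueError(f"{raw_value!r} is not a valid code for {question}")
        return f"{meta['label']}: {meta['values'][raw_value]}"
    # Numeric question: check the value against the permitted range.
    low, high = meta["range"]
    if not low <= raw_value <= high:
        raise ValueError(f"{raw_value!r} is outside {low}-{high} for {question}")
    return f"{meta['label']}: {raw_value}"


print(describe("q2", 1))   # Gender: male
print(describe("q3", 45))  # Q3: 45
```

Without the `METADATA` dictionary, the `1` and the `45` are just numbers; with it, they can be both interpreted and checked.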

One of the problems with metadata is keeping it connected to the right raw data. Pairing raw data with the wrong metadata can be a disaster.

Over the years data formats have got more complex, and one of the big reasons is the need to keep the metadata with the data. More recent data exchange formats such as JSON (JavaScript Object Notation, for the technically minded) make it straightforward to carry metadata alongside the data in the same document, which is a very good thing.
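As a rough illustration of keeping the two together, the sketch below packs metadata and raw data into one JSON document using Python's standard `json` module. The `metadata` and `data` keys, the respondent records and the `label_for` helper are assumptions invented for this example, not a standard survey format.

```python
import json

# One document carrying both the metadata and the raw data.
# The layout and the respondent values are illustrative assumptions.
record = {
    "metadata": {
        "q2": {"values": {"1": "male", "2": "female"}},
        "q3": {"range": [1, 100]},
    },
    "data": [
        {"respondent": 1, "q2": 1, "q3": 45},
        {"respondent": 2, "q2": 2, "q3": 78},
    ],
}

# Serialise to text (as it would be sent or stored), then read it back.
payload = json.dumps(record, indent=2)
restored = json.loads(payload)


def label_for(doc, question, raw_value):
    """Look up the label for a coded answer inside the same document."""
    # JSON object keys are always strings, so the code is converted first.
    return doc["metadata"][question]["values"][str(raw_value)]


labels = [label_for(restored, "q2", row["q2"]) for row in restored["data"]]
print(labels)  # ['male', 'female']
```

Because the value labels travel inside the same document as the answers, whoever receives the data cannot easily end up with the wrong metadata attached to it.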

Raw data with no metadata is just a load of junk.

3. The Paradata

Paradata is the least well known of the data triplets. In the past decade or so it has become much more important for the survey research world.

Paradata is data which describes something about the way the raw data was collected.

It is data about the process of collecting data, rather than about the answers themselves.

The most commonly used form of paradata at the moment is data about questionnaire and question timings, that is, the time a respondent takes to complete a question or questionnaire.

This type of data is now one of the cornerstones of quality measurements for web surveys.

Obviously there can be many different sorts of paradata. For open-ended text questions, the length of text entered by the respondent can be measured, as well as the “level of vocabulary” contained in the text.
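A simple way to picture this kind of paradata is sketched below in Python. The `text_paradata` function is hypothetical, and the ratio of distinct words to total words is only a crude stand-in for “level of vocabulary”, chosen here as an assumption; real measures would be more sophisticated.

```python
import re


def text_paradata(answer):
    """Compute simple measures for one open-ended answer."""
    # Split into lower-case words; apostrophes are kept so "don't" is one word.
    words = re.findall(r"[a-z']+", answer.lower())
    distinct = set(words)
    return {
        "characters": len(answer),
        "words": len(words),
        "distinct_words": len(distinct),
        # Crude vocabulary proxy (an assumption): share of words that are distinct.
        "distinct_ratio": len(distinct) / len(words) if words else 0.0,
    }


print(text_paradata("Good good service"))
```

An answer that is very short, or that repeats the same word over and over, would stand out on these measures.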

One metric used for web surveys is that of “speeders,” that is, the number of people who complete the survey extremely quickly. The paradata for time taken to complete the questionnaire is used here.
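Here is one way the speeder count could be derived from timing paradata, sketched in Python. The cutoff rule (anyone faster than half the median completion time) is an assumption picked for illustration, as are the `find_speeders` function and the sample timings; in practice the threshold is a judgment call for each survey.

```python
from statistics import median


def find_speeders(durations, fraction=0.5):
    """Return ids of respondents who finished faster than fraction * median.

    durations maps a respondent id to completion time in seconds.
    """
    if not durations:
        return []
    cutoff = median(durations.values()) * fraction
    return sorted(rid for rid, secs in durations.items() if secs < cutoff)


# Made-up completion times in seconds, for illustration only.
durations = {"r1": 410, "r2": 95, "r3": 380, "r4": 450, "r5": 120}

speeders = find_speeders(durations)
rate = len(speeders) / len(durations)

print(speeders)       # ['r2', 'r5']
print(f"{rate:.0%}")  # 40%
```

The median here is 380 seconds, so the cutoff is 190 seconds and two of the five respondents are flagged.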

Paradata can also be useful in revealing hidden biases, and using paradata in the gamification of surveys is a rising trend. The time taken to do something in a gamified survey, as well as the action itself, can have a great deal of meaning. Some researchers claim that hidden racism, sometimes unknown to the subjects themselves, can be revealed by measuring someone’s reaction time to specific questions.

In a future post we will delve more into exactly how paradata can be used for quality control of web surveys.

About Andrew Jeavons

With over 25 years in the market research industry, Andrew is a frequent writer and speaker for various publications and events around the world. He has a background in psychology, statistics and software development. Andrew is President of Survey Analytics.


