ENTAILab Wiki

Survey Quality

In theory, this is quite simple: you have a couple of questions you want to ask, you get your respondents to answer them, and there you have your data. However, to make any meaningful statements based on the collected data, you need to ensure that the data is of high quality.

There are a few general things to consider before diving into more specific potential issues. First of all, the survey sample needs to be big enough. If you want to be able to make causal inferences, you will need to collect data multiple times across waves. In that case, it is important to ensure that the survey runs for long enough (i.e. enough waves) and that there is sufficient time between the waves. The measurements you use to capture your data need to be precise and consistent: they must capture the intended construct and do so reliably in repeated applications of the measurement (e.g. across waves). Finally, the sample of respondents needs to fit the purpose of your survey. Depending on your goal, there are a variety of sampling techniques (see below) that might be more or less suited for your purpose and for making claims about the more general population you drew your sample from.

1.1.       In short:

  • The survey in general needs to run long enough and include enough people
  • Measurements need to be precise and consistent
  • Sampling needs to fit the purpose (see the sketch after this list):
    • Simple Random sampling: Equal chance for everybody
    • Stratified sampling: Randomly selecting from predefined subgroups
    • Cluster sampling: Divide population into clusters and randomly sample entire clusters
    • Systematic sampling: Starting from a random point and going in fixed intervals
    • Convenience sampling: Easiest to reach or readily available sample
    • Quota sampling: Non-random sampling from strata to get fixed quotas
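
To make the differences concrete, here is a minimal sketch in Python (pandas) of three of these schemes. The population frame and its column names are made up purely for illustration.

```python
import random

import pandas as pd

# Hypothetical population frame; the column names are made up.
population = pd.DataFrame({
    "person_id": range(1000),
    "region": ["north", "south", "east", "west"] * 250,
})

# Simple random sampling: every person has the same chance of selection.
srs = population.sample(n=100, random_state=42)

# Stratified sampling: random draws within each predefined subgroup,
# here 25 people per region.
stratified = population.groupby("region").sample(n=25, random_state=42)

# Systematic sampling: random starting point, then fixed intervals.
k = len(population) // 100          # sampling interval
start = random.randrange(k)         # random start within the first interval
systematic = population.iloc[start::k]
```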

Even after settling the general framework of the survey just laid out, more issues can arise once you start collecting data. These issues can impact the quality of the data if not handled correctly.

First of all, successfully contacting and interviewing everybody listed in the intended sample can be challenging. Many of the people contacted might not reply at all or only after substantial additional effort (multiple contact attempts, additional incentives, etc.). If certain groups systematically fail to respond to invitations, this can lead to an unbalanced sample, which in turn can impact the generalizability of results.

Once you are fielding your questionnaire, there can be mode effects. A survey can be filled out in different ways (e.g. face-to-face with an interviewer, online alone by the respondent, etc.). While careful sampling might have ensured a representative starting pool of respondents, the mode can still cause differences in the final results. If respondents get to choose the mode themselves, the demographics of those who complete a survey online can differ substantially from those of respondents who answer in person (e.g. older people might be less inclined to use a computer). Moreover, modes can influence aspects such as social desirability or satisficing differently (more on those two below).

Even if all respondents are presented with the survey in the same way, there can still be problems. There could be interviewer effects: differences in data attributed to the person conducting the interview. These might be caused, for example, by the interview topic. If a survey is conducted on gender inequality, the gender of the interviewer could make respondents react differently. The literature has shown, however, that interviewer effects can occur regardless of the topic and should thus be kept in mind.

The survey questionnaire itself might pose problems as well. If the survey is provided in multiple languages, translating the questionnaire can introduce differences in its own right: literal translations might mean something different in another cultural context. If the survey is provided in one language, it needs to be ascertained that the survey population is able to understand it well. Complicated language or ambiguous questions can further impact data quality if respondents understand questions differently because of different interpretations.

Another factor here is Differential Item Functioning (DIF): respondents might interpret response categories differently. Especially for social science questions such as “How interested are you in XYZ?”, one respondent might have a different definition of “very” than another.
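
One common way to screen for DIF is a logistic regression test: regress the item response on the underlying trait (approximated by the total scale score), group membership, and their interaction. Below is a minimal sketch assuming a binary item; the input file and all column names are made up.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical item-level data: binary item response, total scale score,
# and a group indicator (e.g. two language versions of the questionnaire).
df = pd.read_csv("item_responses.csv")  # columns: item, total, group (made up)

# Uniform DIF: does group membership predict the item response
# even after controlling for the underlying trait?
uniform = smf.logit("item ~ total + C(group)", data=df).fit()

# Non-uniform DIF: does the item relate to the trait differently per group?
nonuniform = smf.logit("item ~ total * C(group)", data=df).fit()

print(uniform.summary())
print(nonuniform.summary())
```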

Respondents might also experience fatigue effects if the survey is too long or features too many questions of the same type. They might stop paying full attention, once more leading to a decline in the quality of the answers provided.

Similarly, satisficing can lower the quality of the obtained data. Respondents might go through the survey answering in whatever way costs them the least effort (e.g. always selecting the first response, straight-lining answers for questions presented in a grid, defaulting to “don’t know”, etc.), even if that means their answers are completely detached from their true opinions.

Respondents might also be inclined to answer certain questions in a way they think they should instead of being truthful. They might provide more socially acceptable answers even if they do not reflect their actual opinion – this problem might arise more in non-anonymous settings such as face-to-face interviews.

In addition to providing answers that are outright wrong, respondents might choose not to answer some questions at all. This could be due to questions touching on a sensitive topic, or be related to previously mentioned issues such as fatigue or complicated wording. A high number of missing items or “don’t know/don’t want to answer” responses on a specific topic can further impact the quality of the gathered data, as it can skew the results. Certain demographics might be more likely to answer specific questions, while other, homogeneous groups are more likely to avoid answering them.

In panel surveys, other issues can arise from respondents participating in survey research over longer periods of time. The mere fact that they are familiar with the survey and its specific questions can affect the results (learning effects). Respondents who drop out of the survey and fail to participate in future waves pose another big problem, as they too can lead to an unbalanced sample. Panel attrition therefore needs to be carefully monitored.

 

Many of the mentioned issues can influence each other. For example, an overly long questionnaire that fatigues the respondents could lead to them engaging in satisficing. Similarly, many questions of the same type (e.g. 10 questions in a grid that all ask for a rating from 0 to 10) might cause the same issue. Mode effects can relate to different demographic features in the sub-sample but could also influence social desirability. It is clear that it is important to pay close attention to the possibility of these problems occurring and to monitor the data in order to reach a high level of quality.

 

How to handle some of the problems will be discussed in the next section.

2.1.       In short:

  • Non-responses
  • Mode effects
    • Different selection process depending on modes
  • Interviewer effects
  • Translation influence
  • Ambiguous/complicated questions
  • Differential Item Functioning (DIF)
  • Fatigue effects
    • Survey is too long
    • Too much of the same question type
  • Social desirability
  • Satisficing
  • Many missing items
  • Learning effect in surveys for panel members
  • Panel attrition

The first step in getting high quality data is to ensure that the potential respondents in the sample actually participate in the survey.

It is important to consider when respondents are contacted (e.g. how long before the survey takes place), in what manner (via e-mail, telephone, letter, etc.), and when/how often one attempts to follow up when they do not react to initial attempts at contact. Additionally, reminders shortly before the survey (e.g. the agreed upon interview date) might be helpful.

Oftentimes, incentives are used to encourage participation (e.g. monetary compensation).

Another important consideration is which respondents get which survey mode. As previously discussed, potential differences can arise, so researchers should be mindful when assigning different modes or when letting respondents choose a survey mode. Researchers should monitor the sample for each mode to check for clustering of demographics (e.g. only younger people choosing to respond to the survey online) or whether any other systematic differences in responses emerge.
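
A minimal monitoring sketch in Python (pandas/SciPy): cross-tabulate mode against a demographic and test whether the two are independent. The input file and column names are made up.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical respondent-level data; file and column names are made up.
df = pd.read_csv("responses.csv")  # columns: mode, age_group

# Share of each age group within each mode: large differences hint at
# mode-specific self-selection (e.g. mostly younger people choosing web).
print(pd.crosstab(df["mode"], df["age_group"], normalize="index").round(2))

# Formal check: chi-square test of independence between mode and age group.
table = pd.crosstab(df["mode"], df["age_group"])
chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2={chi2:.1f}, p={p:.3f}")  # small p -> demographics cluster by mode
```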

Moreover, the survey mode should match the survey topic. For example, when dealing with a highly sensitive topic, a survey mode without an interviewer might make respondents feel more at ease and more inclined to provide answers. For certain populations, face-to-face interviews might be necessary as they do not have the technological capabilities for participating in an online survey (e.g. spotty internet penetration in rural areas). These factors need to be considered to ensure the final sample is not skewed.

For those surveys that are conducted by interviewers – be it face-to-face or online – it is important that they are trained. While interviewer effects might occur regardless of any training, training ensures that all interviewers introduce themselves and the survey in the same way, ask questions in the same way, and are briefed to handle potential follow-up questions or response inconsistencies consistently. Ensuring consistent interviewer behavior is crucial for obtaining high quality data.

Moving on to the survey itself, it is, first of all, paramount that the instrument has been pretested. Especially when including new constructs, researchers should make sure that these capture what they were intended to capture: respondents might not interpret a question the way researchers had intended, the wording might be too complex, etc. Researchers should also consider the length of their surveys, as an overly long questionnaire might induce fatigue among respondents. As previously mentioned, certain question types, especially if asked consecutively, can have negative effects. This also applies to sensitive questions, which can lead to respondents shutting down and being less inclined to reply to the following items, even if those are of a less sensitive nature. To combat the systematic effect of a certain question always being asked first or last in a cluster (e.g. when asking people about their satisfaction with different things, or when presenting them with a battery of knowledge questions), the order of these questions is often randomized among participants.
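
Randomizing question order is easy to implement; here is a small Python sketch. Seeding the shuffle with the respondent ID (a made-up convention, not a requirement) keeps each respondent’s order stable if they reload the page.

```python
import random

# Hypothetical battery of satisfaction items; the IDs are made up.
battery = ["satisfaction_job", "satisfaction_income", "satisfaction_health"]

def randomized_order(respondent_id: int, items: list[str]) -> list[str]:
    """Return a per-respondent shuffle of the battery, reproducible
    across page reloads because the RNG is seeded with the respondent ID."""
    rng = random.Random(respondent_id)
    shuffled = items.copy()
    rng.shuffle(shuffled)
    return shuffled

print(randomized_order(4711, battery))
```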

In order to evaluate whether people are engaging in any form of satisficing or have stopped paying close attention, it can be advisable to include attention checks in the survey. For example, in a battery of questions (e.g. how often do you…? 0=never, 10=all the time), one might include a question along the lines of “How often do you swim across the Atlantic Ocean?”. Moreover, monitoring data outliers can be helpful. Respondents who complete a survey in extremely little time or take extremely long might not provide the highest quality of answers. Similarly, somebody who answers “10=all the time” across the board when the overall survey mean is around 3 might not have been giving honest answers, thus compromising the quality of the data obtained.
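
A minimal screening sketch in Python (pandas) that flags the three patterns just mentioned: straight-lining, extreme completion times, and failed attention checks. The input file and all column names are made up; the thresholds are illustrative, not established cut-offs.

```python
import pandas as pd

# Hypothetical survey data; file and column names are made up.
df = pd.read_csv("survey.csv")  # grid items q1..q5 (0-10), duration_sec, attention_check

grid = ["q1", "q2", "q3", "q4", "q5"]
flags = pd.DataFrame(index=df.index)

# Straight-lining: zero variance across a grid of same-scale items.
flags["straightliner"] = df[grid].std(axis=1) == 0

# Speeders and extremely slow respondents: outside the 5th-95th percentile.
lo, hi = df["duration_sec"].quantile([0.05, 0.95])
flags["duration_outlier"] = ~df["duration_sec"].between(lo, hi)

# Failed attention check ("How often do you swim across the Atlantic Ocean?",
# 0=never ... 10=all the time): anything above 0 is suspicious.
flags["failed_check"] = df["attention_check"] > 0

print(flags.sum())  # number of flagged respondents per criterion
```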

After the survey is completed, it can be helpful to ask respondents for feedback. If multiple respondents mention that certain questions were confusing or that the survey was too long, this feedback should be considered. Especially in panel surveys, this could be vital to keep respondents from dropping out.

 

For many of the mentioned issues, the only remedy may be dropping certain responses after the survey was conducted. However, applying a responsive or adaptive design while collecting data allows for interventions during the collection period. By monitoring some key elements, researchers can make adjustments to their design. For example, keeping track of which contact times or forms are most effective can save money and time. If it becomes apparent that respondents are not receptive to being contacted by telephone, this mode might be dropped in favor of letters and e-mails. Similarly, these designs can provide indicators on follow-up times or on providing incentives. This way, researchers might be able to reduce costs and non-response bias.
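
As a rough illustration of such monitoring, the Python sketch below computes success rates per contact mode from hypothetical paradata and flags persistently unproductive modes. The file name, columns, and the 10% threshold are all made up.

```python
import pandas as pd

# Hypothetical contact-attempt paradata; file and column names are made up.
attempts = pd.read_csv("contact_attempts.csv")  # columns: case_id, mode, success (0/1)

# Success rate and attempt count per contact mode.
rates = attempts.groupby("mode")["success"].agg(["mean", "count"])
print(rates.sort_values("mean"))

# Simple intervention rule: flag modes below an (illustrative) 10% success
# rate once at least 100 attempts were made, e.g. dropping telephone
# in favor of letters and e-mails.
flagged = rates[(rates["mean"] < 0.10) & (rates["count"] >= 100)]
print("Candidates for dropping:", list(flagged.index))
```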

3.1.       In short:

  • Contact timing
    • Who to contact when?
    • And how?
  • Mode
    • Who gets what mode?
  • What mode to use
    • self-administered
      • web (CAWI)
      • offline (on paper)
    • interviewer
      • in person
      • video
      • avatar
      • chatbot
  • Incentives for participants
  • Interviewer training
  • Extensive instrument pretesting
  • Survey length
  • Question complexity
  • Question order
  • Attention checks in the survey
  • Monitoring of data outliers
  • Response from respondents on design
  • Use of paradata
    • Responsive and adaptive designs
    • Interventions
    • Potential use of machine learning

As you can see, there are many factors involved in creating a survey that will result in high quality data. There are some specific indicators that you can look at to get a sense of whether your gathered data is good.

First of all, you can test the measurements you plan to use. There are different types of reliability and validity to consider. For example, it is advisable to measure constructs with more than one item to increase internal consistency. This is not needed for all items (e.g. something like age should be easy to capture accurately with a single item), but for more ambiguous measures (e.g. life satisfaction – you would ask multiple questions to capture that, not just “How satisfied are you with your life?”), more items are normally used. Whether those items do indeed measure the same construct and correlate can be tested statistically. Another important factor is testing and retesting measurements: they should reliably produce the same results when applied multiple times. Especially in social science, however, this can be tricky. Asking somebody how interested they are in something on a scale of 1 to 10 might result in different answers despite no actual change occurring, simply because there is no objective definition of what counts as an 8 or a 9. Here, too, multiple items help to accurately capture the construct, as taking more than one item into account alleviates this concern a little. As reliability and validity are already discussed in the “Introduction to Panel Studies” in Teams, I will just list the important aspects once more with a few bullet points rather than talking about everything in detail again (see below).
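
As a minimal sketch of the two checks just described, here is Python code computing Cronbach’s alpha (internal consistency) and a test-retest correlation for a multi-item life-satisfaction scale. The input file and column names are made up.

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for items measuring one construct:
    alpha = k/(k-1) * (1 - sum of item variances / variance of sum score)."""
    k = items.shape[1]
    item_var = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_var / total_var)

# Hypothetical panel data: three life-satisfaction items in two waves.
df = pd.read_csv("panel.csv")  # columns ls1_w1..ls3_w1, ls1_w2..ls3_w2 (made up)

print("alpha, wave 1:", cronbach_alpha(df[["ls1_w1", "ls2_w1", "ls3_w1"]]))

# Test-retest reliability: correlation of the scale score across waves.
score_w1 = df[["ls1_w1", "ls2_w1", "ls3_w1"]].mean(axis=1)
score_w2 = df[["ls1_w2", "ls2_w2", "ls3_w2"]].mean(axis=1)
print("test-retest r:", score_w1.corr(score_w2))
```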

 

As mentioned above, it is important to consider your sample. Who is drafted into the sample and why? I listed different types of sampling before. The thing to consider is whether your sample is representative of the population you wish to make claims about based on your data. To check representativeness, the composition of the sample can, for example, be compared against the census to determine whether certain demographics are over- or underrepresented.
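
A minimal version of such a census check in Python (SciPy): compare the sample’s age composition against census shares with a chi-square goodness-of-fit test. All numbers below are made up.

```python
import pandas as pd
from scipy.stats import chisquare

# Hypothetical sample counts and census shares for one demographic.
sample_counts = pd.Series({"18-34": 180, "35-54": 420, "55+": 400})
census_shares = pd.Series({"18-34": 0.30, "35-54": 0.35, "55+": 0.35})

expected = census_shares * sample_counts.sum()
chi2, p = chisquare(f_obs=sample_counts, f_exp=expected)

print((sample_counts / sample_counts.sum()).round(2))  # observed shares
print(f"chi2={chi2:.1f}, p={p:.3f}")  # small p -> sample deviates from census
```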

 

Once the survey is fielded, the response rate is another factor to consider. There are benchmarks for what is considered an acceptable percentage of responses to contacts, and these can vary between fields of study. In general, however, you want as many of the people you contact as possible to provide responses to the survey. If there is an unusually high non-response rate, there might, for example, be issues with how respondents are contacted. It is crucial to monitor those who do not respond in order to detect whether a certain demographic is systematically missing.
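
As a back-of-the-envelope example, here is a simplified response rate along the lines of AAPOR’s RR1 (completed interviews over all eligible cases; the full AAPOR definitions also handle cases of unknown eligibility, which are omitted here). All numbers are made up.

```python
# Case counts from a hypothetical fielding period.
completes = 620
partials = 45
refusals = 180
noncontacts = 155

# Simplified RR1: completes divided by all (known) eligible cases.
rr1 = completes / (completes + partials + refusals + noncontacts)
print(f"RR1 = {rr1:.1%}")  # 62.0% here; acceptable benchmarks vary by field
```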

When respondents start filling out the survey, you want to achieve as many complete questionnaires as possible. That means you wish to avoid respondents a) dropping out of the survey at some point, b) skipping questions, and c) providing many “don’t know” or “don’t want to answer” replies. Especially if you find that only a certain demographic skips certain questions, this can be problematic for the representativeness of the data. If a great number of respondents drops out after the same question, or a very high number of respondents provides “don’t know” answers to a specific question, this might indicate issues with the question wording.
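
The sketch below (Python/pandas) computes per-item missing rates and a crude breakoff indicator: the first unanswered question per incomplete case. File and column names are made up; questions are assumed to be stored in questionnaire order.

```python
import pandas as pd

# Hypothetical response data with NaN for skipped items; names are made up.
df = pd.read_csv("responses.csv")  # one column per question, in questionnaire order

# Per-item missing rate: one question with an unusually high rate may
# point to confusing wording or a sensitive topic.
print(df.isna().mean().sort_values(ascending=False).head(10))

def first_unanswered(row: pd.Series):
    """First question without an answer (a crude breakoff proxy)."""
    answered = row.notna()
    return None if answered.all() else answered.idxmin()

# A spike at one question suggests the problem sits right there.
print(df.apply(first_unanswered, axis=1).value_counts())
```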

Finally, in the specific case of panel surveys, you do not just want respondents to complete one questionnaire. You want them to participate in future waves of the survey as well. If you find there is an unusually high number of respondents dropping out of the panel after a specific wave, there might be an issue with the survey design. Was anything changed or added? For example, transitioning to a new data collection mode or new sets of questions might not work for all respondents. You want to achieve the highest retention rate possible between waves as this is what allows you to establish causal relations over time.
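
A small Python sketch for monitoring wave-to-wave retention, using hypothetical per-wave participation flags (file and column names are made up).

```python
import pandas as pd

# Hypothetical participation flags per respondent; names are made up.
df = pd.read_csv("panel_participation.csv")  # columns wave1..wave4, values 0/1

waves = ["wave1", "wave2", "wave3", "wave4"]
for prev, curr in zip(waves, waves[1:]):
    stayers = df[df[prev] == 1]           # participants of the earlier wave
    retention = stayers[curr].mean()      # share of them returning next wave
    print(f"{prev} -> {curr}: {retention:.1%}")
# A sharp drop after one specific wave points at a change in that wave's design.
```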

4.1.       In short:

  • Measurement/design reliability and validity
    • test-retest (can be problematic in survey research – e.g. scale of 1-10 of how interested you are in…?)
    • internal consistency (multiple items for a construct that capture the same thing)
    • internal validity
      • are there other influences (e.g. confounding variables, selection bias, maturation, history, testing effects, instrumentation)?
    • external validity 
      • population validity: sample representativeness
      • ecological validity: real world conditions (especially for survey experiments)
      • temporal validity: are findings applicable over time?
      • interaction effects: caused by setting, respondent, interviewer, etc.
    • construct validity 
      • convergent validity: correlation with other measures that capture the same idea
      • discriminant validity: no correlation with unrelated measures
      • factorial validity: see factor analysis, underlying structure, does it align with theoretical construct?
      • content validity: measurement covers the whole range/all relevant aspects of the construct
      • criterion-related validity: relation to/predictions of other relevant outcomes
  • Sample representativeness 
  • High response rate 
  • Few missing values/incomplete answers 
  • For panel studies: high retention of respondents 

5.  Important Factors to Consider for Panel Surveys

Panel surveys are somewhat special. You want to achieve consistency across panel waves, which means changing as few things as possible. As we have already discussed, any change (e.g. the introduction of a new mode) might impact the responses obtained, which can in turn make it more difficult to compare data across time. Nevertheless, it is important for panel studies to reflect new developments. This can mean adding new items to the questionnaire that reflect current events. It could also mean integrating new types of data collection, like data donation. One of the biggest additions can be a variety of new modes in addition to or as a replacement of traditional face-to-face interviews.

In light of everything discussed previously, it is obvious that any change or addition needs to be carefully considered and planned. Moreover, after adaptations are made to a panel study, researchers should closely monitor the obtained data along the guidelines and indicators for survey quality laid out before.

 

In this section I’ve gathered an assortment of papers and other resources that contain helpful information and new findings regarding all the issues I’ve talked about.

 

6.1.       Principles and Guidelines

  • FAIR principles for data management
    • The FAIR Guiding Principles for scientific data management and stewardship - PubMed

  • Total Survey Error
    • The SAGE Handbook of Survey Methodology - Google Books

  • Cross Cultural Survey Guidelines
    • https://ccsg.isr.umich.edu/

  • Guidelines for data harmonization
    • https://doi.org/10.1093/ije/dyw075

  • Guidelines for survey data documentation
    • https://doi.org/10.3390/data3040045

 

6.2.       Tools

  • Survey Data Recycling toolbox
    • SDR 1.0 Master Box - Survey Data Recycling Dataverse

  • Data harmonization
    • See https://doi.org/10.1080/00207659.2016.1130424 for an application

 

6.3.       Mode related effects

  • Mode Effects in Mixed-mode Surveys
    • https://doi.org/10.1093/poq/nfq059
    • Method for evaluation
    • Differences in samples
    • Evidence for social desirability

  • Differences caused by survey modes
    • https://www.econstor.eu/handle/10419/65907
    • Protocol effects
    • Affects mostly attitude scales (could be due to social desirability)
    • Potential differences in effects

  • Distractions during web surveys and data quality
    • https://doi.org/10.1177/0894439319851503
    • Tested music and the presence of other people/loud conversations
    • No significant impact on data quality or attentiveness

  • Household surveys with multiple visits and how they affect quality
    • https://doi.org/10.1111/obes.12196
    • High-frequency visits have conditioning effects
    • Data from later visits is of lower quality

  • Survey quality depending on screen size/smartphone use
    • Completing web surveys on mobile devices: Does screen size affect data quality?
    • Smaller smartphones → more likely to drop out, give shorter responses, straight-line

  • Survey mode and data quality
    • https://doi.org/10.1016/j.chieco.2024.102271
    • CAPI and CAWI
    • China
    • No difference found in data quality
    • CAWI tends to take a bit longer

  • Interviewer Effects in Face-to-Face Surveys
    • https://doi.org/10.1177/0049124120926215
    • Effects hinge on how the interviewer is seen by the respondent
    • Respondents adjust their answers to what they assume is the interviewer’s opinion

  • Interviewer Effects
    • https://doi.org/10.1093/jssam/smw024
    • Literature review
    • Application of TSE
    • How to investigate and explain interviewer effects across data collection

 

6.4.       New measurements and measurement related effects

  • Measurement invariance in large-scale international surveys
    • https://doi.org/10.1177/0013164413498257
    • Discussion of which indicators are suitable, and recommendations

  • Measurement invariance
    • https://doi.org/10.18148/srm/2020.v14i4.7655
    • How to test for it
    • And how to correct for it

  • Different ways to measure reliability in surveys
    • https://doi.org/10.1093/jssam/smaa021
    • Discussion of correlation across methods
    • More sophisticated measures were not found to be better than traditional ones

  • New indicator to measure data quality
    • https://doi.org/10.3233/SJI-20071
    • SCAI
    • For both probability and nonprobability samples

 

6.5.       Data donation, linking, harmonization

  • Data donation and selection bias
    • https://doi.org/10.5117/CCR2024.2.3.STRY
    • Found differences in age and in digital and algorithmic efficacy
    • No influence of privacy concerns or lack of trust

  • Survey data quality assessment via survey data recycling
    • https://doi.org/10.1177/00027642211021623
    • Focus on protest behaviour and indicators thereof
    • Intersurvey variance of over 5% due to:
      • measures of survey documentation
      • data processing
      • computer records

  • Dissertation on consent to data linking
    • https://doi.org/10.4119/unibi/2992206
    • Repeated requests for linkage consent improve the consent rate
    • Presenting different arguments over time improves the consent rate

 

6.6.       Miscellaneous data quality

  • Literature review on adaptive and responsive survey designs
    • https://doi.org/10.1111/rssa.12186

  • Improving survey quality through paradata
    • http://dx.doi.org/10.2139/ssrn.4114585
    • Providing fieldworkers with feedback to adapt their behavior
    • Discussion of what paradata to monitor, how to monitor it, and how often to intervene
    • In this case: completed interviews, time per completed interview, ratio of initiated vs. completed interviews

  • Influence of incentives on data quality in panel waves
    • https://doi.org/10.1177/1525822X209018
    • Larger incentives lead to higher completion rates
    • Almost no impact on data quality or bias

  • Transitioning panel surveys from in-person to web
    • https://doi.org/10.1111/rssa.12750
    • Transition without data quality loss is possible
    • Better response rates through model-guided incentives
    • Splitting the questionnaire into separate parts did not help to increase response rates

  • Data quality with Amazon’s MTurk
    • https://doi.org/10.1017/psrm.2021.57
    • Decrease in users who complete the survey more than once (checked via IP address)
    • More suspicious IP addresses on MTurk than on other platforms
    • Many humorous or insincere answers
    • Affects treatment effects by up to 30%

  • Burdensome questions and data quality
    • https://doi.org/10.2478/jos-2022-0044
    • Can lead to more nonresponse in subsequent questions
    • No effect on answer quality, though
    • No differences depending on personal characteristics observed

  • Thesis on improving survey quality
    • https://eprints.soton.ac.uk/444046/
    • Predicting survey nonresponse via propensity models
      • only led to slight improvements
    • Interviewers and monetary incentives
      • significant and substantial variability in how effective incentives are between interviewers, but no specific interviewer characteristics found
    • Face-to-face vs. online surveys
      • measurement effects cause most of the mode differences
      • sample differences only play a secondary role

  • Language proficiency and survey quality
    • https://doi.org/10.1093/jssam/smz045
    • Non-native speakers tend to provide lower quality data

  • Impact of personality traits on data quality
    • 10.24874/IJQR18.02-13
    • Higher openness and lower conscientiousness lead to more social desirability
    • More extroverted and less conscientious people participate more in self-selected panels

 

6.7.       Special Issue of Survey Research Methods

  • Vol. 17 No. 3 (2023): Recent Methodological Advances in Panel Data Collection, Analysis, and Application | Survey Research Methods

  • Consideration of reciprocity in getting people to participate
    • https://doi.org/10.18148/srm/2023.v17i3.7975
    • Strong preference for reciprocity leads to quick participation after getting an invitation
    • Likelihood of reciprocity declines the longer the time between incentive and survey
    • Numerous reminders were not helpful
    • Prepaid monetary incentives worked best
      • *but* not enough to enhance response rates on their own
      • need to be combined with other strategies

  • Responsive survey designs to decrease attrition with the help of machine learning
    • https://doi.org/10.18148/srm/2023.v17i3.7988
    • Paradata, survey data, and interviewer observations used to predict future outcomes
    • Adaptive design framework with the help of machine learning algorithms
    • Use of past panel waves to predict future behavior
    • Interventions found for low-propensity cases (increasing monetary incentives)

  • Satisficing in panel surveys and its effect on data quality
    • https://doi.org/10.18148/srm/2023.v17i3.7986
    • Satisficing was found in all waves tested
    • No change across waves, and no moderation by any other influences
    • Affects the distribution of respondents’ estimates by 15+%
    • Can affect associations across variables

  • Memory effects/response recall across panel waves
    • https://doi.org/10.18148/srm/2023.v17i3.7991
    • Recall lower than in cross-sectional designs with multiple measures
    • Those who don’t remember their prior answer are mostly only off by one point
    • Extreme responses are more likely to be remembered
    • Female respondents were more likely to remember
    • Potential for measurement error across panel waves is low after four+ months

  • Panel conditioning effects based on survey frequency
    • https://doi.org/10.18148/srm/2023.v17i3.7990
    • Conditioning turned out to be only a minor issue
    • And only for some specific questions

  • Panel conditioning and question order
    • https://doi.org/10.18148/srm/2023.v17i3.7993
    • Based on self-reported data
    • Use of anchored vignettes
    • No priming effect across waves

  • Panel survey measures for employment precarity
    • https://doi.org/10.18148/srm/2023.v17i3.7989
    • For cross-country research
    • Use of national panel survey data
    • Development of the cross-national precarity index (CNPI)
    • Proposes a methodological framework for ex-post harmonization of career data

  • Neighbourhood research via dwelling panel
    • https://doi.org/10.18148/srm/2023.v17i3.7987