The Socio-Economic Panel (SOEP) study is a wide-ranging, nationally representative longitudinal study of private households across Germany that was launched in 1984. It is based at DIW Berlin. Every year, nearly 15,000 households and more than 25,000 individuals are surveyed for the SOEP-Core study by the fieldwork organization Kantar Public (TNS Infratest up to 2017).
Since its inception, the SOEP’s goal has been to collect and provide representative microdata that allow researchers to study stability and change in living conditions. Its approach is micro-econometric, with added variables from sociology and political science (influenced by the social indicator movement). The data provide information on all members of households in both the former East and West, including foreigners and recent immigrants to Germany.
Some of the many topics include household composition, occupational biographies, employment, earnings, health, and satisfaction indicators. As early as June 1990, even before the Economic, Social and Monetary Union, SOEP-Core expanded to include the states of the former German Democratic Republic (GDR), thus seizing the rare opportunity to observe the transformation of an entire society. Immigrant and refugee samples were added in 1994/95, 2013, 2015, and 2016 to account for changes in German society. Further new samples were added in 1998, 2000, 2002, 2006, 2009, 2011, and 2012.
Since Version 31 (10.5684/soep.v31), the SOEP has included the complete data from “Familien in Deutschland” (Families in Germany, FiD), which has been integrated retrospectively into the SOEP and made available for analysis in user-friendly form. The FiD survey was carried out in parallel to the SOEP from 2010 to 2013 as a "SOEP-related study". The most recent version of the SOEP-Core data contains data from the migration and refugee samples, which have been integrated into SOEP-Core. The SOEP survey is constantly being adapted and developed in response to current social developments. The international version of the SOEP-Core data contains 95% of all cases surveyed.
Collection period: 1984-2017
Publication date: 2019-03-04
Principal investigators: Stefan Liebig, Jan Goebel, Martin Kroh, Carsten Schröder, Jürgen Schupp, Charlotte Bartels, Alexandra Fedorets, Andreas Franken, Marco Giesselmann, Markus Grabka, Jannes Jacobsen, Selin Kara, Peter Krause, Hannes Kröger, Maria Metzing, Janine Napieraj, Jana Nebelin, David Richter, Diana Schacht, Paul Schmelzer, Christian Schmitt, Daniel Schnitzlein, Rainer Siegers, Knut Wenzig, Stefan Zimmermann
Data collector: Kantar Deutschland GmbH
Population: Persons living in private households in Germany
Data format: Stata, SPSS, SAS, CSV
Selection method: All samples of SOEP are multi-stage random samples which are regionally clustered. The respondents (households) are selected by random-walk.
Collection mode: The interview methodology of the SOEP is based on a set of pre-tested questionnaires for households and individuals. As a rule, an interviewer tries to obtain face-to-face interviews with all members of a given survey household aged 16 years and over. In addition, one person (the head of household) is asked to answer a household questionnaire covering housing, housing costs, and different sources of income. This questionnaire also includes some questions on children in the household up to 16 years of age, mainly concerning attendance at institutions (kindergarten, elementary school, etc.).
Users outside the European Economic Area (EEA) are only permitted to use a reduced version of the dataset (Scientific Use File) due to data protection regulations.
For complete information, see DOI https://doi.org/10.5684/soep.v34i
Known Bugs/Fixes for wave BH (1984-2017):
Please note that the fixes described below were not applied in the current data transfer. You must apply them manually.
Overview (May 2019):
Values for the variables plb0186_v2 and plb0186_h for the East sample in 1990 are too small by a factor of 10.
The names assigned to the raw variables bhh_37_01 “electricity included in rent” and bhh_37_02 “assessed burden of housing expenses (rent and additional expenses)” do not correspond to the standard SOEP concept for naming variables. Both variables will be renamed in the new version.
The previous version of the migspell dataset was delivered instead of the current one.
The new identifiers were not filled in and have to be filled in from the old identifiers.
1. Dataset: pl
Variables: plb0186_v2, plb0186_h
Values for the variables plb0186_v2 “Actual working time with overtime (1990-2017)” and plb0186_h “Actual working time with overtime (harmonized)” have the wrong values for the East sample in 1990.
The variable plb0186_h is made up of the variables plb0186_v1 (1984-1989) and plb0186_v2 (1990-2017). We included all of the values for plb0186_v1 as they were, and divided all of the valid values for plb0186_v2 by 10. This harmonization is necessary because the two raw variables for 1990 were provided in different formats:
gpost: gp3601e (two-digit, no decimal separator)
gp: gp39 (three-digit, no decimal separator)
The raw variable gp3601e from gpost was assigned to the variable plb0186_v2 although it does not have to be divided by 10. As a result, all values for the East German population for the year 1990 were mistakenly divided by 10. The simplest way of solving this problem is to multiply the valid values for the East German population by 10.
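For users working with the CSV files, the manual correction described above could be sketched in Python roughly as follows. This is an illustration under stated assumptions, not an official SOEP fix: the `is_east` flag is a hypothetical stand-in for whatever variable identifies the East sample in your extract, and valid values are assumed to be non-negative, so negative values are treated as missing codes and left untouched.

```python
def fix_plb0186_east_1990(rows):
    """Return copies of the rows in which plb0186_v2 and plb0186_h are
    multiplied by 10 for East-sample observations from survey year 1990."""
    fixed = []
    for row in rows:
        row = dict(row)  # work on a copy so the input stays unchanged
        # "is_east" is a hypothetical flag, not a SOEP variable name.
        if row["syear"] == 1990 and row["is_east"]:
            for var in ("plb0186_v2", "plb0186_h"):
                value = row.get(var)
                # Assumption: negative values are missing codes; skip them.
                if value is not None and value >= 0:
                    row[var] = value * 10
        fixed.append(row)
    return fixed


# Made-up example rows (not real SOEP data).
sample = [
    {"pid": 1, "syear": 1990, "is_east": True, "plb0186_v2": 4.5, "plb0186_h": 4.5},
    {"pid": 2, "syear": 1990, "is_east": False, "plb0186_v2": 38.0, "plb0186_h": 38.0},
    {"pid": 3, "syear": 1990, "is_east": True, "plb0186_v2": -2, "plb0186_h": -2},
]
corrected = fix_plb0186_east_1990(sample)
```

Applying the function twice would multiply the values twice, so it should be run exactly once on uncorrected data.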
2. Dataset: bhh
Variables: bhh_37_01, bhh_37_02
The names assigned to the raw variables bhh_37_01 “electricity included in rent” and bhh_37_02 “assessed burden of housing expenses (rent and additional expenses)” do not correspond to the standard SOEP concept for naming variables. Both variables had to be renamed:
bhh_37_01 “Electricity included in rent” → bhh_33
bhh_37_02 “Assessed burden of housing expenses (rent and additional costs)” → bhh_37
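If you want to apply the new names yourself before the corrected release arrives, a minimal Python sketch for rows read from the CSV version of bhh might look like this. It assumes that the target names bhh_33 and bhh_37 are not already present in your copy of the dataset; the example row is made up.

```python
# Old raw variable names mapped to the corrected names given above.
RENAMES = {"bhh_37_01": "bhh_33", "bhh_37_02": "bhh_37"}


def rename_columns(row, renames=RENAMES):
    """Return a copy of one row (a dict) with the affected columns renamed."""
    return {renames.get(name, name): value for name, value in row.items()}


# Made-up example row (not real SOEP data).
example = {"hid": 10, "bhh_37_01": 1, "bhh_37_02": 2}
renamed = rename_columns(example)
```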
To find out more about how raw variables are named in the SOEP, see the SOEPcompanion:
Naming conventions of Variables and Datasets
3. Dataset: migspell
Unfortunately the previous version of the migspell dataset was delivered. For the current version, please contact the SOEPhotline or write an email to soepmail.
4. Dataset: biobirth, bioimmig, biojob, bioparen, bioresid, biosib, biosoc, biotwin, pflege
Variables: pid, cid, hid
In the process of “merging” SOEP-Long and SOEP-Core, all of the SOEP-Long ID variables (pid, hid, cid) were also included in the raw datasets to make merging easier for users. In some datasets, only the ID variables were created but not filled in with the corresponding IDs.
Empty pid: biobirth, bioimmig, biojob, bioparen, bioresid, biosib, biosoc, biotwin, pflege
Empty hid: bioimmig, bioresid, biosoc
Empty cid: biobirth, bioimmig, biojob, bioparen, bioresid, biosib, biosoc, biotwin, pflege
With these datasets, please continue to use persnr, hhnrakt, and hhnr, or copy their content into the corresponding new ID variable, for example in Stata:
clonevar pid = persnr
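The Stata command above copies persnr into pid. For users of the CSV files, the same idea can be sketched in Python. The pairing of hhnrakt with hid and of hhnr with cid is an assumption here and should be checked against the SOEP identifier documentation before use.

```python
# New long-format ID -> old raw ID. Only pid/persnr is taken from the
# Stata example above; the other two pairings are assumptions.
ID_SOURCES = {"pid": "persnr", "hid": "hhnrakt", "cid": "hhnr"}


def fill_empty_ids(row, id_sources=ID_SOURCES):
    """Return a copy of one row with empty new IDs filled from the old IDs."""
    filled = dict(row)
    for new_id, old_id in id_sources.items():
        is_empty = filled.get(new_id) in (None, "")
        if is_empty and old_id in filled:
            filled[new_id] = filled[old_id]
    return filled


# Made-up example row (not real SOEP data): pid and cid exist but are empty.
example = {"persnr": 101, "hhnr": 7, "pid": "", "cid": None}
completed = fill_empty_ids(example)
```

IDs that are already filled in are left as they are, so the function can be applied safely to datasets that are only partly affected.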
Further information on SOEP identifiers can be found here:
The following changes have been made to the current data distribution:
The new wave of the SOEP-Core study incorporates our “wide” and “long” data formats, which used to be provided to users separately. Our aim is to eliminate any confusion about what is available in which format and to make data use easier overall. After several years of testing SOEPlong as an additional service designed to facilitate analysis for both experienced and new users, we will now be providing all datasets in the “long” format as a standard part of our SOEP data release. This means that you will find the different SOEP data formats listed below in your data file, some of which will be contained in separate subdirectories.
Please make sure that you unpack the entire directory structure when unpacking your data.
1.1. SOEP in “long” format on the top level
In the top-level (or root) directory, you will find all of the datasets provided up to now with SOEPlong (pl, ppfadl, etc.) as well as all of the additional datasets formerly provided only in our classic “wide” format (biographical or spell data such as bioparen, artkalen, etc.). All of the data in the main SOEP-Core study are therefore contained in the datasets in the top-level directory.
Feedback from experienced and beginning users over the past several years shows that the “long” data offer significant advantages in ease of use, particularly for beginners. We have therefore decided to use this as our primary data format in future data releases.
All available individual year-specific datasets are pooled into a single dataset (e.g., all $P datasets are integrated into the PL dataset). In some cases, this means that we have to harmonize variables in order to be able to define them consistently over time. For instance, income information for the years up to 2001 is given in euros rather than deutschmarks, and where questionnaires have changed, the categories are adjusted to be consistent over time. All changes are presented to users in a clear and understandable way, and if harmonization is necessary, all input variables are provided in their original form (see the _v* variables below). SOEPlong thus significantly reduces the number of datasets and the number of variables.
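The pooling idea can be illustrated with a small Python sketch. It only shows the general principle of stacking year-specific datasets and adding a survey year; it is not the SOEP group's actual procedure, and all variable names in the example (`a_inc`, `b_inc`, `inc`) are invented for illustration.

```python
def pool_to_long(yearly, name_maps):
    """Stack year-specific datasets into one long dataset.

    yearly:    {survey_year: [row_dict, ...]}
    name_maps: {survey_year: {year_specific_name: common_name}}
    """
    pooled = []
    for syear in sorted(yearly):
        mapping = name_maps.get(syear, {})
        for row in yearly[syear]:
            long_row = {"syear": syear}  # survey year identifies the wave
            for name, value in row.items():
                # Names without a mapping (e.g. IDs) are kept unchanged.
                long_row[mapping.get(name, name)] = value
            pooled.append(long_row)
    return pooled


# Made-up example with invented variable names (not real SOEP data).
yearly = {
    1984: [{"persnr": 1, "a_inc": 1000}],
    1985: [{"persnr": 1, "b_inc": 1100}],
}
name_maps = {1984: {"a_inc": "inc"}, 1985: {"b_inc": "inc"}}
long_data = pool_to_long(yearly, name_maps)
```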
A more detailed description of the format of our SOEP-Core data release can be found in our new SOEPcompanion.
1.1.1. Most important changes to v33 in the long format
1.2. Classic format in the subdirectory raw
Since we know that many users have existing scripts that are based on the original data format, and to enable users to understand the process of generating the “long” data, we provide all of the datasets in their original SOEP format in the directory raw.
Users who want to continue using the old format simply need to switch to the subdirectory raw and use the datasets there.
The only change is that all of the datasets in the raw directory now contain additional identifiers named as in the long format (PID alongside PERSNR, HID alongside $HHNRAKT) and a survey year variable (SYEAR), so that users can easily merge variables from the two data formats.
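As a rough illustration of such a merge for CSV users, the following Python sketch joins rows from a long-format dataset and a raw dataset on pid and syear. It assumes that each combination of pid and syear occurs at most once in the raw rows; `raw_item` is a hypothetical variable name.

```python
def merge_on_ids(long_rows, raw_rows, keys=("pid", "syear")):
    """Inner join of long-format rows with raw rows on the shared identifiers."""
    # Index the raw rows by their key tuple (assumes keys are unique there).
    index = {tuple(row[k] for k in keys): row for row in raw_rows}
    merged = []
    for row in long_rows:
        match = index.get(tuple(row[k] for k in keys))
        if match is not None:
            # On name conflicts, the long-format value takes precedence.
            merged.append({**match, **row})
    return merged


# Made-up example rows (not real SOEP data); "raw_item" is hypothetical.
long_rows = [{"pid": 1, "syear": 2017, "plb0186_h": 40.0}]
raw_rows = [{"pid": 1, "syear": 2017, "persnr": 1, "raw_item": 3}]
merged_rows = merge_on_ids(long_rows, raw_rows)
```

Rows without a partner in the other dataset are dropped in this sketch. Keep them instead if you need an outer join.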
1.3. New EU-SILC clone in the subdirectory eu-silc-clone
Many users are undoubtedly aware that the SOEP supports cross-national analysis with CNEF through the dataset PEQUIV. We have now produced a data product that allows you to use the SOEP data in comparative analyses with the EU-SILC (European Union Statistics on Income and Living Conditions) data. EU-SILC, which is provided by Eurostat upon request, offers cross-sectional and longitudinal information for many European countries. Up to now, only cross-sectional information has been available for Germany. The EU-SILC clone offers longitudinal information on private households in Germany based on the SOEP data. All of the information contained in it can be directly compared with the EU-SILC longitudinal information on other European countries.
The EU-SILC clone is integrated into the standard SOEP data release (in subdirectory eu-silc-clone).
Documentation on the 2005-2016 EU-SILC clone can be found here (PDF, 3.01 MB).
The new SOEP data release (v34) will be the first to contain data from the IAB-BAMF-SOEP Survey of Refugees in Germany as Sample M5, as well as the continuation of the PIAAC-L Survey, as Sample N.
2.1. IAB-BAMF-SOEP Survey of Refugees (M5)
The SOEP, in cooperation with the Institute for Employment Research (IAB) and the Federal Office for Migration and Refugees (BAMF), has succeeded in integrating a third sample of refugee households (M5) into the SOEP study. The survey was launched in 2017. The population of M5 covers adult refugees who have applied for asylum in Germany since January 1, 2013, and are currently living in Germany. M5 added another 1,519 households of refugees who have migrated to Germany since 2013 to the SOEP framework.
2.2. Integration of respondents from PIAAC-L as Subsample N
Sample N integrated 2,314 households of former participants of the Program for the International Assessment of Adult Competencies (PIAAC and PIAAC-L) in 2017. This is the most recent addition to the SOEP-Core samples. Fieldwork in sample N was conducted between mid-March and mid-August and thus slightly later than the majority of samples A–L1. More information on the PIAAC-L project can be found on the project homepage.
In the IAB-BAMF-SOEP Survey of Refugees (M3-M5), there were translation errors in some of the questions on income components in translated versions of the household questionnaire. Answers for these variables are therefore not comparable with other answers. The corresponding variables were set to -3.
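When analyzing these variables, values of -3 must not be treated as amounts. A minimal Python sketch of one way to handle this is shown below. It assumes that valid values of the income components are non-negative, so every negative value is treated as a missing code and excluded.

```python
def valid_mean(values):
    """Mean of the valid values, ignoring None and negative (missing) codes.

    Returns None if no valid value remains.
    """
    # Assumption: valid income values are >= 0; codes such as -3 are missing.
    valid = [v for v in values if v is not None and v >= 0]
    if not valid:
        return None
    return sum(valid) / len(valid)


# Made-up example values (not real SOEP data).
average = valid_mean([500, -3, 700])
```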
In the process of data preparation, three interviewers were identified who had not conducted interviews in line with the standards of the IAB-BAMF-SOEP group (more information here). The interviewers in question were responsible for 88 households in 2016 and 112 households in 2017. The households affected in the first wave of the survey (2016) were completely removed from the dataset. The households affected in 2017, who were supposed to be interviewed for the second time, were deleted for 2017 but left in the dataset for 2016. There are no indications that the first interviews (by a different interviewer) were not conducted in line with IAB-BAMF-SOEP standards. The interviews and cases deleted from the data release may be accessed upon request from a guest work station at the SOEP-RDC for survey methodological analysis. After these lines were deleted from all datasets, the following adjustments were made:
The extended variable naming convention applies only to datasets from wave BH onwards, and only to the datasets $P, $H, and $KIND. We added underscores between the unit of analysis, the question identifier, and the item identifier to separate them visually. In addition, a questionnaire identifier was introduced, which is also separated from the item by an underscore. This new naming scheme is used only if the survey instrument differs from the “original” SOEP-Core instrument.
Due to our different samples in the SOEP, there are some respondents that receive sample-specific questions, such as the refugee sample that started in 2016. For that specific group, we created an extended individual questionnaire with some specific questions along with the standard SOEP questions that are asked every year. For the specific questions, you can use the instrument variable to see the source of the variables.
Examples and more detailed descriptions can be found in the chapter on this subject in the SOEP Companion.
6.1. Dataset PPATH / PPATHL (in raw: PPFAD)
6.1.3. Migration information
6.1.4. Asylum-Seekers and Refugees
6.2. Dataset PGEN
6.2.1. Partner pointer
6.2.2. Volunteer work and side jobs
6.2.3. Educational degrees
6.3. Dataset PEQUIV
6.4. Dataset BIOAGEL and BIOPUPIL
6.5. Dataset HGEN
A number of changes have taken place in recent years in questions on home rental. The first change took place in the household questionnaire of wave BF (2014). The question asked about the costs of utilities in such detail that respondents were not able to provide correct answers. This led to underestimation of both base rent and utilities.
It emerged that this led to a slight break in the time series. Rent has increased continuously over the years since 1984. In 2014 and 2015, however, rental costs fell and have been increasing again sharply since 2016. This break can be explained by the change in the questionnaire.
Starting with wave BH, respondents are being asked about rent in the same way as in wave BG (2016) and in wave BD (2013) in order to maintain long-term comparability. In addition, with wave BH, the new migration sample M5 and the new refresher sample N are part of the SOEP. Since Sample M5 was not surveyed on utility costs in a comparable way and since many of these respondents probably live in group housing or receive subsidies to cover living costs, no rent variable was generated for them.
v33 - rent
v34 - rent
6.6. Dataset BIOIMMIG
6.7. Dataset HHRF/PHRF
6.7.1. Revisions and Bugfixes
1. Citing the data source
To improve our documentation on data use, we ask that you include a reference to the dataset used and institution providing the data in all future publications (ideally in a footnote at the beginning of the paper or in the foreword to books). This should read as follows:
The Socio-Economic Panel (SOEP) data used in this [publication, paper, book, etc.] were provided by DIW Berlin.
2. Citing the data version
Precise citation of data sources is becoming ever more important in the research context. The SOEP group recommends that you cite the data according to the following (fictitious) example:
Socio-Economic Panel (SOEP), data for years 1984-2019, Version 36, SOEP, 2020, doi:10.5684/soep.v36.
Sozio-oekonomisches Panel (SOEP), Daten für die Jahre 1984-2019, Version 36, SOEP, 2020, doi:10.5684/soep.v36.
In your references, please cite one of the following publications:
If you need a publication that describes the SOEP Public Use File, the 95% version distributed to international users, please cite:
When using PanelWhiz, a collection of Stata/SE® add-on programs that automatically extract data from large panel datasets including the SOEP, please cite:
For more information on citation of the various versions of the SOEP data, please see Digital Object Identifier.
MD5 fingerprints are checksums computed from a file's contents. To check that a download is complete and unaltered, compute the MD5 value of your local file and compare it with the value listed below.
|Stata bilingual||dfe399ba3879874dbdd0096b58cbd90f||(TXT, 19.29 KB)|
|Stata German||9cbe419645ee17bdb5265df5a5662802||(TXT, 19.29 KB)|
|Stata English||3a195c128e21b732d8b1f0ff64316b35||(TXT, 19.29 KB)|
|SPSS German||33763b1f68c54f790d9826b4923ac276||(TXT, 19.29 KB)|
|SPSS English||0454017269b9f5601d3fe30ace13211f||(TXT, 19.29 KB)|
|SAS German||84c5124b696a552340b1d7bca79c8c15||(TXT, 21.53 KB)|
|SAS English||e6cb205a9d2abec3a37872f1dbf2a6e8||(TXT, 21.53 KB)|
|CSV||df524ba26e46b42ff77dd6991046485d||(TXT, 19.29 KB)|
|GGKBOU||1fd60d2f3f1a405d508cf472ff916cc9||(TXT, 140 bytes)|
|GGKBOU English||67c43e2e72aab736e6c6dafb75da57f5||(TXT, 140 bytes)|
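One way to compute an MD5 value locally is sketched below in Python using only the standard library. The table does not state which file each listed value refers to, so the file name in the usage comment is a placeholder, not an actual SOEP file name.

```python
import hashlib


def md5_fingerprint(path, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large data files need little memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_fingerprint(path, expected):
    """True if the file's MD5 value equals the expected fingerprint."""
    return md5_fingerprint(path) == expected.strip().lower()


# Usage (placeholder file name):
# matches_fingerprint("my_download.zip", "df524ba26e46b42ff77dd6991046485d")
```

A mismatch usually points to an incomplete or corrupted download. In that case, download the file again before contacting support.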
The SOEP allows you to link data from a wide variety of other studies to SOEP-Core data. For instance, you can analyze data at the regional level or match micro-marketing indicators with survey data. For a more detailed description and further data linkage possibilities, see here.