SOEP-Core v33 - Changes in the Dataset


Dataset Information

SOEP-Core soep.v33.1

1 Deletion of incorrectly conducted interviews in the IAB-BAMF-SOEP Survey of Refugees

While preparing for the next wave of the IAB-BAMF-SOEP Survey of Refugees, the survey institute determined that an interviewer had not conducted interviews correctly, affecting six percent of the household interviews in the sample. These households were removed from the dataset, but are available upon request for survey methodological analysis at a guest work station at the SOEP Research Data Center. In addition to deleting the corresponding rows from all affected datasets, we also made the following modifications:

  • Due to the deletion of household and individual interviews, the weights had to be updated (dataset HHRF and PHRF) to take the slightly reduced number of cases in the 2016 survey year into account.
  • The new weights were updated or included in the dataset BGPEQUIV.
  • Imputation of monthly household net income (I[1-5]HINC16) was redone for this sample in BGHGEN and in the dataset MIHINC.

2 Update INTID in BG files

Datasets from the current BG wave contained errors in the assignment of interviewer IDs. These were corrected.

3 Corrected number of entries in $$KIND (2014-2016)

Inconsistencies between key variables on population assignment in the PPFAD and $$KIND datasets were corrected. There was an error of one year in the definition of the target population in the $$KIND datasets from 2014 to 2016. In some cases, this led to a lack of information on the year of birth in files on children:

    • bekgjahr: 1998 for all samples
    • bfkgjahr: 1999 for all samples
    • bgkgjahr: 1999 only for samples M3 and M4 in 2016

These corrections also affect the number of cases in the file KIDLONG, which was corrected correspondingly.
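The wave-specific file names behind the $$KIND placeholder can be derived from the survey year. The following is a minimal sketch, not part of the SOEP tooling: it assumes that wave prefixes run from A (1984) to Z and then continue with BA, BB, and so on, which is consistent with the waves named in these notes (Q = 2000, BD = 2013, BG = 2016). The helper names wave_prefix and resolve are hypothetical.

```python
import string

FIRST_SURVEY_YEAR = 1984  # wave A (assumption consistent with Q = 2000, BG = 2016)


def wave_prefix(year: int) -> str:
    """Return the wave prefix (A..Z, then BA, BB, ...) for a survey year."""
    offset = year - FIRST_SURVEY_YEAR
    letters = string.ascii_uppercase
    if offset < 0 or offset >= 52:
        raise ValueError(f"survey year {year} is outside the range covered here")
    if offset < 26:
        return letters[offset]
    return "B" + letters[offset - 26]


def resolve(template: str, year: int) -> str:
    """Replace the leading $$ placeholder of a dataset name with the wave prefix."""
    return template.replace("$$", wave_prefix(year))


# The three $$KIND files affected by the correction (2014-2016).
affected = [resolve("$$KIND", year) for year in (2014, 2015, 2016)]
print(affected)
```

For the years affected here, this yields BEKIND, BFKIND, and BGKIND, matching the variable prefixes be, bf, and bg in the list above.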

3.1 Change in the $$NETTO codes in 96 cases (children) in the years 2014-2016

In the process of data checks, the $$NETTO codes in PPFAD were also compared and corrected. In survey years 2014 to 2016, some children had been incorrectly assigned the code 20 instead of 30 on the variable $$NETTO in the PPFAD dataset. This error has been corrected in v33.1. The update also made it necessary to correct the person weights for the affected survey years (dataset PHRF), because the variable $$NETTO determines which individuals in interviewed households are assigned a valid weight. The updated weights are also contained in v33.1.


In BIOPAREN, a number of missing values in the flag variables for parental (professional) education and the years of death of the parents were filled in.


The algorithm for imputing missing dates in the spells was optimized. As a result, the imputed variables and the variables derived from them changed in v33.1, specifically all variables with the suffix _imp and the variable staytime. The changes affected a total of 349 of 15,640 spells.
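Users who want to see which of their own spells were affected can compare the two releases directly. The sketch below is only an illustration under stated assumptions: spells are held as dictionaries keyed by a hypothetical (person ID, spell number) pair, and begin_imp is a made-up variable name standing in for any variable with the suffix _imp. Only the suffix rule and the variable staytime come from the text above.

```python
def changed_spells(old, new, extra_vars=("staytime",)):
    """Return the keys of spells whose *_imp variables or extra_vars differ."""
    changed = []
    for key, old_row in old.items():
        new_row = new.get(key)
        if new_row is None:
            continue  # spell not present in the newer release
        watched = [v for v in old_row if v.endswith("_imp") or v in extra_vars]
        if any(old_row[v] != new_row.get(v) for v in watched):
            changed.append(key)
    return changed


# Toy data; keys and the name "begin_imp" are hypothetical.
v33 = {
    (101, 1): {"begin_imp": 2014, "staytime": 3, "other": 1},
    (101, 2): {"begin_imp": 2015, "staytime": 1, "other": 1},
}
v33_1 = {
    (101, 1): {"begin_imp": 2013, "staytime": 4, "other": 1},
    (101, 2): {"begin_imp": 2015, "staytime": 1, "other": 9},
}

print(changed_spells(v33, v33_1))

# Share of spells affected according to the release notes.
share_percent = round(100 * 349 / 15640, 1)
print(share_percent)
```

In the toy data only the first spell is reported, because the second differs solely in a variable that is neither imputed nor staytime. The figures in the text correspond to roughly 2.2 percent of all spells.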

6 Update AUSB16 in BGPGEN

The variable AUSB16 (“profession requires vocational training”) in BGPGEN was updated. The correction substantially decreased the number of missing values [-1].

SOEP-Core soep.v33

The new data distribution (1984–2016) “SOEP v33” provides, for the most recent survey year 2016, the usual wave-specific data files BGPBRUTTO, BGP, BGPKAL, BGPGEN, BGHBRUTTO, BGH, BGHGEN, BGKIND, and BFPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). Additional new samples, datasets, or variables are listed below:

1 New Sample: IAB-BAMF-SOEP Survey of Refugees (Samples M3/M4)

With version 33 of the SOEP data, we are also releasing the integrated data from the 2016 IAB-BAMF-SOEP Survey of Refugees in Germany as two supplementary samples to the SOEP. The samples are based on the overall population of refugees, independent of residency status, who arrived in Germany between January 1, 2013, and January 31, 2016. The samples were surveyed with additional funding from BA/IAB/BMAS in the case of M3 and from BMBF in the case of M4. Sample M4 contains a higher number of refugee families with children and teenagers.

The German Central Registry of Foreigners (AZR) provided the data basis from which the sample was drawn. In this sample, 4,816 adults in 3,554 households were surveyed in 2016, and basic data was collected on 5,717 minors living in the same households. To conduct the survey, the questionnaire was translated into seven languages, and in some cases interpreters were available. To address the specific situation of the target group, audio-assisted versions of the survey instruments were developed.

The survey comprises an integrated individual and life-course questionnaire, a household questionnaire, and a questionnaire for interviewers. As was the case with samples M1 and M2 (IAB-SOEP Migration Surveys), participants were first asked for consent to linking their survey data with the IAB Integrated Employment Biography data. The Research Data Centre of the Federal Employment Agency at IAB will be providing the linked data to researchers as of Spring 2018.

The study design is described in detail in Martin Kroh et al. (2016), Das Studiendesign der IAB-BAMF-SOEP-Befragung von Geflüchteten, SOEP Survey Papers 365: Series C, Berlin: DIW Berlin / SOEP.

1.1 Integration into SOEP and original data (BGP and BGP_REF)

The original data from the survey instruments used in Samples M3 and M4 can be found in original format in the dataset BGPREF, where the individual and the biographical questionnaires are combined. The variables are also integrated into the other standard or generated datasets:

  • Variables equivalent to those in the individual questionnaire in other samples are included in the dataset BGP. BGP also includes all variables that are specific to the refugee questionnaire but will be asked more than once.
  • Variables equivalent to those in the biographical questionnaires in other samples are included in the respective biographical datasets (e.g., BIOMARSM).
  • The comprehensively surveyed migration biography can be found in the new dataset REFUGSPELL.

2 New datasets / variables

2.1 Datasets directly based on survey instrument (like $P and $H)


        With the integration of the new migration samples since 2013, the $P and $H datasets include data from more survey instruments. The basis remains the paper version of the questionnaire from samples A-L1, supplemented by data from sample-specific survey instruments. To make the source of each record clear, $H and $P now contain a variable identifying the particular instrument in all waves starting with BD (2013).

Additional variables on occupational codes

      There are now many more variables containing coded occupational information in the different questionnaire-specific datasets ($P, $JUGEND, $LUECKE, also $P_MIG and $P_REFUGEES). The variables can be identified by the suffixes denoting the classification used. ISCO-88 and KldB92 are available for all occupations; older $P files contain ISCO-68, and newer files contain ISCO-08 and KldB2010.


    Since 2000 (wave Q), first-time respondents between the ages of 16 and 17 have received a separate biographical questionnaire with additional age-group-specific questions, for instance, about their relationship to their parents or about what they do in their free time. Up to now, only some of the data collected from this survey have been processed and provided to users in dataset BIOAGE17. Starting with the current data release, the complete data will be provided in individual $JUGEND datasets.

2.2 New variables in PPFAD


      The variable SEXOR combines information on the sexual orientation of respondents from various sources in the SOEP. In 2016 (wave BG), for the first time, the SOEP included a direct question about sexual orientation (self-reported).


    This variable indicates the federal state of birth for respondents who were interviewed after 2012 and who reside in Germany’s current federal states. Data users interested in more specific information on place of birth at the level of the municipality can access this data on a guest visit to the SOEP Research Data Center.


BIOAGEL now contains information from the new questionnaire for 13- to 14-year-olds that was introduced in 2016 (v33). The questionnaire contains items on personality, leisure time activities, personal networks, educational aspirations, and family life, and is completed by the young people themselves. Many of the questions included here stem from the questionnaire for 11- to 12-year-olds introduced in 2014 (v31). For those respondents who already completed this questionnaire, we now provide longitudinal information on development in many areas such as personality and educational aspirations.


For migration biographies in the refugee samples, we created the new spell dataset REFUGSPELL. The variables in MIGSPELL and REFUGSPELL are derived from different instruments and only partially overlap. The data structure allows the two datasets to be linked if desired. Detailed documentation will be provided in the biographical data documentation of the SOEP.
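One possible way to combine two spell datasets whose variables only partially overlap is to stack them and keep the union of all columns. The sketch below shows this reading; it is not the documented SOEP procedure. All column names (pid, spellnr, begin, end, route) and the added marker column source are hypothetical, and values missing in one dataset are simply left as None.

```python
def stack_spells(migspell, refugspell):
    """Stack two lists of spell records, keeping the union of all columns."""
    columns = sorted({col for row in migspell + refugspell for col in row})
    stacked = []
    for source, rows in (("MIGSPELL", migspell), ("REFUGSPELL", refugspell)):
        for row in rows:
            # Columns absent from one dataset are filled with None.
            full = {col: row.get(col) for col in columns}
            full["source"] = source  # remember where each spell came from
            stacked.append(full)
    return stacked


# Toy records with hypothetical column names.
mig = [{"pid": 1, "spellnr": 1, "begin": 1995, "end": 1999}]
ref = [{"pid": 2, "spellnr": 1, "begin": 2014, "end": 2015, "route": "land"}]

combined = stack_spells(mig, ref)
for record in combined:
    print(record)
```

Keeping a source column makes it possible to restrict later analyses to the variables that are actually available for each group of respondents.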

2.5 New variables in $PGEN

$P_RELIGION (Religious affiliation)

      An integrated version of the religious affiliation variable for all respondents in 2016, needed because additional differentiations were used to survey the migration samples.

PICORIG[A-C]$$ (Party Identification in Country of Origin)

    • PICORIGA16 - Party Identification in Country of Origin
    • PICORIGB16 - Party Identified with in Country of Origin
    • PICORIGC16 - Party Family of PI in Country of Origin
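Several variable names in these notes are written as patterns, such as PICORIG[A-C]$$, TRAIN[A-D]$$, or I[1-5]HINC16. A minimal sketch of how such a pattern could be expanded into concrete names follows. It assumes that a $$ suffix in a variable name stands for the two-digit survey year (consistent with PICORIGA16 for 2016) and that each pattern contains at most one bracket range. The function name expand is hypothetical. Note that in dataset names such as $$KIND the placeholder denotes the wave prefix instead, which this sketch does not handle.

```python
import re
from typing import List


def expand(pattern: str, year: int) -> List[str]:
    """Expand one bracket range ([A-C], [1-5]) and a $$ year placeholder."""
    name = pattern.replace("$$", f"{year % 100:02d}")
    match = re.search(r"\[(\w)-(\w)\]", name)
    if match is None:
        return [name]
    first, last = ord(match.group(1)), ord(match.group(2))
    head, tail = name[:match.start()], name[match.end():]
    return [head + chr(code) + tail for code in range(first, last + 1)]


print(expand("PICORIG[A-C]$$", 2016))
print(expand("I[1-5]HINC16", 2016))
print(expand("KIDY$$", 2016))
```

The first call reproduces the three variable names listed above.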


Current residency status

    This variable will be a harmonized version of the current residency status variable for all immigrants in the corresponding samples, including the new refugee samples.


2.6 New variables in $PEQUIV

  • KIDY$$ Income of Children in Household
  • FKIDY$$ Imputation Flag for Income of Children in Household
  • IWITH$$ Profit Withdrawal
  • FWITH$$ Imputation Flag for Profit Withdrawal

2.7 Grip strength data for 2016

GRIPSTR update: The data on grip strength from the survey year 2016 is now included in the GRIPSTR dataset.

3 Revisions and bug fixes

3.1 Variables in PPFAD


    Information on the country of birth (GERMBORN, CORIGIN) and the year of immigration (IMMIYEAR) is no longer compared with and coded according to the previous year’s information in PPFAD. Instead, all information available on a respondent in the SOEP is collected and compared to code these variables. Efforts have been made (1) to give information on all respondents, dramatically reducing the number of missing values, and (2) to avoid group categories for the country of birth such as Eastern Europe (now, e.g., Poland). Three new variables, GERMBORNINFO, CORIGININFO, and IMMIYEARINFO, are introduced in v33 to indicate the quality of information given in GERMBORN, CORIGIN, and IMMIYEAR.



    The changes in GERMBORN also influence MIGBACK and MIGINFO, resulting in some value changes and a stronger focus of MIGINFO on the availability of parental information.



    Different proxies were used to code the respondents' place of residence in 1989 (variable LOC1989), resulting in some value changes. A new variable, LOCINFO, has been introduced in v33 to indicate the quality of the information in LOC1989.


The target population and sample size of the BIOIMMIG dataset have changed. The dataset is no longer limited to respondents who were born abroad and had non-German citizenship. In addition, cases without valid BIOIMMIG information in any wave, or with valid information only on BISCGCF, are no longer included in the dataset.
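The new inclusion rule can be expressed as a simple filter. The following sketch is an illustration only: it assumes that negative codes mark missing values (as with the code -1 mentioned elsewhere in these notes) and that each case is represented as a list of per-wave records. Apart from BISCGCF, the variable name used in the toy data is hypothetical.

```python
def is_valid(value):
    """Treat None and negative codes as missing (assumed convention)."""
    return value is not None and value >= 0


def keep_case(waves, ignore=("BISCGCF",)):
    """Keep a case if any variable other than those in `ignore` is valid in any wave."""
    return any(
        is_valid(value)
        for record in waves
        for name, value in record.items()
        if name not in ignore
    )


# Toy cases; "HYPOTHETICAL_VAR" stands in for any other BIOIMMIG variable.
only_biscgcf = [{"BISCGCF": 1, "HYPOTHETICAL_VAR": -1}]
other_valid = [{"BISCGCF": -1, "HYPOTHETICAL_VAR": -1},
               {"BISCGCF": -1, "HYPOTHETICAL_VAR": 2}]

print(keep_case(only_biscgcf))
print(keep_case(other_valid))
```

The first toy case would be dropped because its only valid value is on BISCGCF, while the second is kept because another variable is valid in at least one wave.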

3.3 Update of PWEALTH and HWEALTH

Up to now, the former FiD samples were not integrated into the data for 2012 even though these samples also received wealth questionnaires. With the current data release, these cases have now been integrated.


BIOPAREN was rebuilt from scratch. We excluded redundant variables and changed the variable names to English for consistency (which means that VNR is now FNR). Please see the documentation for a full list of changes and an overview of the new variable names.

3.5 $PGEN

With variables DEGREE$$, FIELD$$, and TRAIN[A-D]$$, discrepancies had appeared in the retrospectively reported data. The classification was improved to deal with inconsistencies in multiple answers. The conversion key for occupational information based on the ISCO-88 classification for TRAIN[A-D]$$ in KldB92 was extended from two to four digits, and the conversion key itself was revised.

3.6 $HGEN

In Wave BF, a major change took place in the way respondents were asked about rent, and respondents were asked to provide more detailed information on their rental expenses. It became clear from ex post analysis that the way these questions were asked was too complicated for some respondents and that it had resulted in a discontinuity in the time series. As a result, the average rent according to the SOEP, which had otherwise risen systematically over time as in the comparative statistics, did not rise in 2014 or 2015. This discontinuity can be explained by the change in the questionnaire. Starting with wave BG, the questions on rent were therefore changed back to those in wave BD. In wave BG, migration samples M3 and M4 are also part of the SOEP. Since these respondents were not asked what they paid for utilities, no rental variable was generated for this group.