Changes in the Dataset

Data Updates

Data distribution v35 (Wave BI)

SOEP-Core soep.v35

1. New sample in the main SOEP study

The new refresher sample, Subsample O, contains 1,000 new households. These were selected in cooperation with BBSR using a new sampling design based on regional data in areas where the “Soziale Stadt” (social city) urban development project is being carried out. Based on the digital data available on the boundaries of the “Soziale Stadt” areas, it was possible to create a new variable going back to the year 2000 that shows whether or not a household’s address is within an area covered by the project (see Variable Description below under 4.4).

2. Modifications in our new main data format, SOEPlong

We have made the following important changes over and above our normal annual updates:

  • PKAL: Integration of the $PKALOST datasets
  • PL/PKAL: Calendar strings all now stored in PL and monthly variables in PKAL
  • PLUECKEL: Introduction of RYEAR and correction of SYEAR, which was RYEAR up to now
  • PBRUTTO: If a variable was not part of the year-specific gross file, the missing code has been changed to -8 and is no longer -2.
  • VPL: The case numbers for past years have increased since cases without a SOEP respondent are no longer deleted
  • KIDLONG: The harmonization concept has been adapted to the concept used with other datasets; more variables from $KIND datasets have been included (more information under 4)
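As a rough illustration of working with the missing codes mentioned in the list above, the following Python sketch treats negative values as missing. It assumes that every negative code (such as -2 and -8) marks missing information; the function name and the sample values are hypothetical and not part of the data release.

```python
from typing import Optional


def mask_missing(value: int) -> Optional[int]:
    """Return None for negative codes (assumed to be missing codes), else the value."""
    return None if value < 0 else value


# Hypothetical values; -8 and -2 are the missing codes named in the text.
raw_values = [3, -8, 1, -2]
cleaned = [mask_missing(v) for v in raw_values]
print(cleaned)  # [3, None, 1, None]
```

Scripts that previously filtered on -2 in PBRUTTO would need to account for -8 as well; masking all negative codes is one simple way to stay robust to such changes.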

3. New in SOEPhelp

  • SOEPhelp now includes links between topics and variables from the metadata. The data overview (command: soephelp (without variable)) lists all the topics in the dataset and tells which variables belong to which topic.
  • The variable overview (command: soephelp [variable]) lists the topics covered by the variable (and the relationships among topics and subtopics). The topic labels are linked to
  • SOEPhelp now has a search tool! If you type in the command: soephelp, search (SEARCH TERM) [verbose], you will get a list of the variables for which your SEARCH TERM is contained either in the question or one of the answer options. The variables are provided in list form and saved in r (for returns). The option “verbose” describes the variables in more detail.
  • More information on SOEPhelp

4. New Datasets and Variables

4.1. Early Childhood

  • New dataset BCBFK “Early Childhood” with geographically detailed information about the places where the respondents grew up. Because of the detailed regional data, the dataset is only available at the RDC SOEP. The corresponding field report and questionnaire are available as SOEP Survey Paper 766 | PDF, 1.28 MB (in German).

4.2. Your Life in the GDR

  • New dataset DDR18 “Your Life in the GDR”. The corresponding questionnaire is available as Survey Paper 676.

4.3. Biography follow-up survey

  • The variables from the biography follow-up survey on migration status have been integrated into the dataset BILELA or BIOL.

4.4. New variable SOCURBAN in dataset HBRUTTO

  • SOCURBAN: Household address is in an area where the “Soziale Stadt” (social city) urban development project is being carried out (as of July 2017) (Yes/No)

4.5. New variables in dataset EQUIV

  • ILIB1$$: pensions for liberal professions
  • ILIB2$$: widow / orphans pensions for liberal professions

4.6. New variables in dataset BIOJOB

  • In 2018, respondents received new survey instruments concerning job classifications and prestige scores. This information is provided in the following new variables: STBA10, ISCO08, EGP08, ISEI08, MPS08, and SIOPS08. The corresponding variables STBA, EGP, ISEI, MPS, and SIOPS from older versions of BIOJOB have been renamed to STBA92, EGP88, ISEI88, MPS92, and SIOPS88.

5. Changes to datasets or individual variables

5.1. Weighting variable PHRF in the dataset PPATHL

  • There are slight changes concerning the poststratification of the weighting variables starting in 2013. The changes relate to the year of immigration. Previously, respondents who immigrated before 1955 were treated as migrants; they now constitute a distinct category of their own, along with recent immigrants and German-born respondents. The reason is that it is not possible to define ethnic Germans consistently between the Mikrozensus and the SOEP.

5.2. Variables representing occupational codes

  • Since 2013, open-ended questions on occupations have been coded in ISCO-08 and KldB 2010. This is the first year in which the old classifications ISCO-88 and KldB 92 are no longer available. We have therefore introduced new prestige scores based on the new classifications and discontinued the old scores.
  • Calendar strings have been moved from $PKAL to $P or standardized.

5.3. Educational variables

  • Up to soep.v34, the basic generated educational variables were generated annually and were cumulated over time. Due to the availability of SOEPlong, we have substantially revised the tools used for generating variables to always consider all available educational variables for each year.
  • In addition to the fact that all variables are now generated based entirely on SOEPlong files, we have also made two additional modifications:
  • First, the main educational variables now also take into account inconsistencies over time, in contrast to the educational variables in PGEN prior to soep.v34.
  • Second, variable “Amount of Education or Training in Years” ($$BILZEIT) has been slightly modified. To consider occupational training (for non-university degrees), we have adjusted the years of education for “civil servants” and “others” slightly.

5.4. Dataset KIDLONG

  • Errors in the integration of variables were corrected, variables were split up into versioned variables, and harmonized variables were constructed. As a result, the number of variables has increased: 110 variables (v.34); 267 variables (v.35).
  • Missing variables from the $KIND datasets were incorporated into KIDLONG.
  • Corrected version of BHKIND was incorporated into KIDLONG.
  • KIDLONG now adheres to the classic harmonization concept.

5.5. Dataset BHKIND

  • Flag variable to identify child questionnaires that were not completed (BHKFLAG)
  • Missing observations were added: 15,032 (v.34) to 15,504 (v.35).
  • Errors in the integration of variables were corrected and missing variables were incorporated into BHKIND: 85 variables (v.34); 129 variables (v.35).
  • All variables were renamed and now follow the SOEP naming conventions.

5.6. Dataset BIKIND

  • Flag variable added to identify child questionnaires that were not completed (BIKFLAG)
  • All variables now follow the SOEP naming conventions.

5.7. Variable PARID in the dataset PPATHL

  • Partnerships of respondents with net codes between 40 and 49 were dissolved and will be coded -2 “does not apply” in the future.

5.8. Variable HGOWNER in the dataset HGEN

  • In samples M3-M5 in 2017, several missing values in the variable HGOWNER were replaced with the information that a household is living in a shelter or housing for refugees.

5.9. Dataset INTERVIEWER

  • The year 2016 now contains information from Samples L2-M4.
  • The variable on the length of the interview (LENGTHINT) was eliminated and replaced by three variables, which each just give the average length of one questionnaire (LENGTHINT- H / P / J).
  • The youth surveys, which were previously counted in the number of interviews per person (AMOUNTINTP) now have their own variable (AMOUNTINTJ).

5.10. Dataset BIOAGE17

  • Previous versions of BIOAGE17 contained the identifier of the respondent’s mother (BYMNR) and father (BYVNR). The identifiers of the parents are found in BIOPAREN (MNR and VNR) and can be easily merged with BIOAGE17.
  • Desired occupation variables ISCO88 have been replaced by ISCO08. The same is true for BYKLAS: The old 1992 version has been replaced by the 2010 version.
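The merge of parent identifiers mentioned above could look roughly like the following Python sketch. It assumes that BIOAGE17 and BIOPAREN share a person identifier, here called pid, and uses lowercase column names and invented rows; only MNR and VNR are taken from the text.

```python
# Hypothetical rows; real data would be read from the BIOAGE17 and BIOPAREN files.
bioage17 = [{"pid": 101}, {"pid": 102}, {"pid": 103}]
bioparen = [
    {"pid": 101, "mnr": 11, "vnr": 12},
    {"pid": 102, "mnr": 21, "vnr": 22},
]

# Index BIOPAREN by the assumed shared person identifier.
parents_by_pid = {row["pid"]: row for row in bioparen}

# Left join: keep every BIOAGE17 row, add parent identifiers where available.
merged = []
for row in bioage17:
    parents = parents_by_pid.get(row["pid"], {})
    merged.append({**row, "mnr": parents.get("mnr"), "vnr": parents.get("vnr")})

print(merged[0])  # {'pid': 101, 'mnr': 11, 'vnr': 12}
```

A left join is used so that respondents without an entry in BIOPAREN are kept, with the parent identifiers left empty.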

5.11. Dataset BIOAGEL

  • The internal distinction between BIOAGE 8a and 8b, or between 81 and 82, has been eliminated, meaning that the dataset BIOAGEL now contains one line per child and respondent for questionnaires about 7-8-year-old children. As a result, when each parent completed a questionnaire on a child in a given year, there are two lines for that child (one line per parent). These can be identified by the different PIDE (PID of the respondent).
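To illustrate the one-line-per-child-and-respondent structure, the following Python sketch finds children for whom both parents completed a questionnaire in the same year. It assumes that the child is identified by a column called pid and the survey year by syear; PIDE is taken from the text, and the rows are invented.

```python
from collections import defaultdict

# Hypothetical BIOAGEL-style rows: one line per child (pid) and respondent (pide).
rows = [
    {"pid": 501, "syear": 2018, "pide": 11},
    {"pid": 501, "syear": 2018, "pide": 12},
    {"pid": 502, "syear": 2018, "pide": 21},
]

# Collect the distinct responding parents for each child and survey year.
respondents = defaultdict(set)
for row in rows:
    respondents[(row["pid"], row["syear"])].add(row["pide"])

# Children with more than one responding parent in a given year.
both_parents = sorted(key for key, pides in respondents.items() if len(pides) > 1)
print(both_parents)  # [(501, 2018)]
```

Analyses that need exactly one line per child and year would have to choose one of the two parent reports for these cases.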

Data distribution v34 (Wave BH)

SOEP-Core soep.v34

1. New, user-friendly integrated data format

The new wave of the SOEP-Core study incorporates our “wide” and “long” data formats, which used to be provided to users separately. Our aim is to eliminate any confusion about what is available in which format and to make data use easier overall. After several years of testing SOEPlong as an additional service designed to facilitate analysis for both experienced and new users, we will now be providing all datasets in the “long” format as a standard part of our SOEP data release. This means that you will find the different SOEP data formats listed below in your data file, some of which will be contained in separate subdirectories.

Please make sure that you unpack the entire directory structure when unpacking your data.

1.1. SOEP in “long” format on the top level

In the top-level (or root) directory, you will find all of the datasets provided up to now with SOEPlong (pl, ppfadl, etc.) as well as all of the additional datasets formerly provided only in our classic “wide” format (biographical or spell data such as bioparen, artkalen, etc.). All of the data in the main SOEP-Core study are therefore contained in the datasets in the top-level directory.

Feedback from experienced and beginning users over the past several years shows that the “long” data offer significant advantages in ease of use, particularly for beginners. We have therefore decided to use this as our primary data format in future data releases.

All available individual year-specific datasets are pooled into a single dataset (e.g., all $P datasets are integrated into the PL dataset). In some cases, this means that we have to harmonize variables in order to be able to define them consistently over time. For instance, income information is given in euros up to 2001 and not in deutschmarks, and in cases where questionnaires have changed, the categories are modified over time. All changes are presented to users in a clear and understandable way, and if harmonization is necessary, all input variables are provided in their original form (see below _v*-variables). SOEPlong thus significantly reduces the number of datasets and the number of variables.
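The currency harmonization mentioned above can be sketched in Python as follows. The sketch uses the official fixed conversion rate of 1 euro = 1.95583 deutschmarks; the function name is hypothetical, and this is not necessarily how the SOEP generates its harmonized variables.

```python
DM_PER_EURO = 1.95583  # official fixed conversion rate


def dm_to_euro(amount_dm: float) -> float:
    """Convert an amount in deutschmarks to euros, rounded to cents."""
    return round(amount_dm / DM_PER_EURO, 2)


print(dm_to_euro(1955.83))  # 1000.0
```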

A more detailed description of the format of our SOEP-Core data release can be found in our new SOEPcompanion.

1.1.1. Most important changes to v33 in the long format

  • The following new files have been added:
    • HBRUTT: long file of the HBRUTT$$ files
    • PLUECKEL: long file of the $PLUECKE files
    • VPL: long files of the $VP files
  • Data sets PL and PL2 are being provided again in one combined file (PL).
  • The variable scheme with c-variables (cross-sectional) and l-variables (longitudinal) has been modified as follows:
    • If the variables on which a variable in the long format is based changed in the cross-section, then corresponding _v*-variables will be created for each version. A harmonized _h-variable is provided as well. Further information can be found in the SOEPcompanion (general description, examples)
  • All of the long datasets generated from the various cross-sectional datasets contain the new variable: INPUTDATASET.
  • Due to adjustments to the new joint data release format, some files with “long”-specific names are no longer included in the data release: CDESIGN, CSAMP, CSAMPFID, KIDL, PBREXIT.
  • The following datasets have been renamed to avoid conflicts with the data names in the raw directory:
    • PPATH replaces PPFAD
    • PPATHL replaces PPFADL
    • HPATH replaces HPFAD
    • HPATHL replaces HPFADL
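For existing scripts that still refer to the old dataset names, a small lookup along the following lines may help. The mapping reflects the renamings listed above; the helper function is hypothetical.

```python
# Old dataset name -> new dataset name, as listed above.
RENAMED = {
    "PPFAD": "PPATH",
    "PPFADL": "PPATHL",
    "HPFAD": "HPATH",
    "HPFADL": "HPATHL",
}


def current_name(dataset: str) -> str:
    """Return the new name of a renamed dataset; other names are returned unchanged (uppercased)."""
    name = dataset.upper()
    return RENAMED.get(name, name)


print(current_name("ppfad"))  # PPATH
```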

1.2. Classic format in the subdirectory raw

Since we know that many users have existing scripts that are based on the original data format, and to enable users to understand the process of generating the “long” data, we provide all of the datasets in their original SOEP format in the directory raw.

Users who want to continue using the old format simply need to switch into the subdirectory raw and use the datasets there.

The only change is that there are now additional identifiers in all of the datasets in the raw directory with the name in the long format (PID and PERSNR or HID and $HHRNAKT) and a survey year variable (SYEAR) so that users can easily merge variables from the two data formats.
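A minimal Python sketch of such a merge across the two formats, using the person identifier and survey year as the key, might look like this. The column names are written in lowercase, and the variables raw_var and long_var as well as the rows are invented for illustration.

```python
# Hypothetical rows from a dataset in raw and from a dataset in the long format.
raw_rows = [{"pid": 1, "syear": 2017, "raw_var": 4}]
long_rows = [
    {"pid": 1, "syear": 2016, "long_var": 10},
    {"pid": 1, "syear": 2017, "long_var": 11},
]

# Index the long rows by the shared key (pid, syear).
long_index = {(r["pid"], r["syear"]): r for r in long_rows}

# Inner join: keep raw rows that have a matching person-year in the long data.
combined = [
    {**r, **long_index[(r["pid"], r["syear"])]}
    for r in raw_rows
    if (r["pid"], r["syear"]) in long_index
]
print(combined)  # [{'pid': 1, 'syear': 2017, 'raw_var': 4, 'long_var': 11}]
```

For household-level datasets, the same pattern would apply with HID and SYEAR as the key.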

1.3. New EU-SILC clone in the subdirectory eu-silc-clone

Many users are undoubtedly aware that the SOEP supports cross-national analysis with CNEF through the dataset PEQUIV. We have now produced a data product that allows you to use the SOEP data in comparative analyses with the EU-SILC (European Union Statistics on Income and Living Conditions) data. EU-SILC, which is provided by Eurostat upon request, offers cross-sectional and longitudinal information for many European countries. Up to now, only cross-sectional information has been available for Germany. The EU-SILC clone offers longitudinal information on private households in Germany based on the SOEP data. All of the information contained in it can be directly compared with the EU-SILC longitudinal information on other European countries.

The EU-SILC clone is integrated into the standard SOEP data release (in subdirectory eu-silc-clone).

Documentation on the 2005-2016 EU-SILC clone can be found here | PDF, 3.01 MB.

2. New samples in the main SOEP study

The new SOEP data release (v34) will be the first to contain data from the IAB-BAMF-SOEP Survey of Refugees in Germany as Sample M5, as well as the continuation of the PIAAC-L Survey, as Sample N.

2.1. IAB-BAMF-SOEP Survey of Refugees (M5)

The SOEP, in cooperation with the Institute for Employment Research (IAB) and the Federal Office for Migration and Refugees (BAMF), has succeeded in integrating a third sample of refugee households (M5) into the SOEP study. The survey was launched in 2017. The population of M5 covers adult refugees who have applied for asylum in Germany since January 1, 2013, and are currently living in Germany. M5 added another 1,519 households of refugees who have migrated to Germany since 2013 to the SOEP framework.

2.2. Integration of respondents from PIAAC-L as Subsample N

Sample N integrated 2,314 households of former participants of the Program for the International Assessment of Adult Competencies (PIAAC and PIAAC-L) in 2017. This is the most recent addition to the SOEP-Core samples. Fieldwork in sample N was conducted between mid-March and mid-August and thus slightly later than the majority of samples A–L1. More information on the PIAAC-L project can be found on the project homepage.

3. Translation errors in some questionnaire languages

In the IAB-BAMF-SOEP Survey of Refugees (M3-M5), there were translation errors in some of the questions on income components in translated versions of the household questionnaire. Answers for these variables are therefore not comparable with other answers. The corresponding variables were set to -3.

4. Deletion of interviews not conducted in line with the standards of the IAB-BAMF-SOEP group in the IAB-BAMF-SOEP Survey of Refugees (M3/M4)

In the process of data preparation, three interviewers were identified who had not conducted interviews in line with the standards of the IAB-BAMF-SOEP group (more information here). The interviewers in question were responsible for 88 households in 2016 and 112 households in 2017. The households affected in the first wave of the survey (2016) were completely removed from the dataset. The households affected in 2017, who were supposed to be interviewed for the second time, were deleted for 2017 but left in the dataset for 2016. There are no indications that the first interviews (by a different interviewer) were not conducted in line with IAB-BAMF-SOEP standards. The interviews and cases deleted from the data release may be accessed upon request from a guest work station at the SOEP-RDC for survey methodological analysis. After these lines were deleted from all datasets, the following adjustments were made:

  • The deletion of the household and individual interviews required an update of the weights (dataset HHRF and PHRF), which now take account of the slightly reduced case numbers in survey years 2016 and 2017.
  • Update / inclusion of the new weights in the datasets BGPEQUIV and BHPEQUIV.

5. Extended variable naming convention

The extended variable naming convention applies only to datasets from wave BH onwards, and only to the datasets $P, $H, and $KIND. We added underscores between unit of analysis, question identifier, and item identifier to clearly separate the analysis unit, question, and item visually. In addition, a questionnaire identifier was introduced, which is also separated by an underscore from the item. This new version of naming variables is only used if the survey instrument differs from the “original” SOEP-Core instrument.

Due to our different samples in the SOEP, there are some respondents that receive sample-specific questions, such as the refugee sample that started in 2016. For that specific group, we created an extended individual questionnaire with some specific questions along with the standard SOEP questions that are asked every year. For the specific questions, you can use the instrument variable to see the source of the variables.

Examples and more detailed descriptions can be found in the chapter on this subject in the SOEP Companion.

6. Changes in specific variables

  • New variables for interview year: HIYEAR in HGEN, HPATHL, and PIYEAR in PGEN, PPATHL. These new variables indicate, for all survey years, the household and individual interviews that were finalized after (or before) the survey year (variable syear), which is the reference year for the questionnaires and for data collection.

6.1. Dataset PPATH / PPATHL (in raw: PPFAD)

6.1.1. SEXOR

  • The previous data release was the first to include the variables SEXOR (sexual orientation) and SEXORINFO (source of information on sexual orientation). The value -1 “insufficient information” has been changed to 2 “insufficient information”.


  • The value -1 “unclear” has been changed to 5 “unclear”.

6.1.3. Migration information

  • The coding of GERMBORN, CORIGIN, IMMIYEAR, and MIGBACK was changed for inconsistent cases (for more information, see the PPATH/PPFAD documentation).

6.1.4. Asylum-Seekers and Refugees

  • The variables for asylum-seekers and refugees [AREBACK, AREFINFO] have been renamed (in v33: REFBACK, REFINFO) and revised. The variable AREFINFO now also allows identification of specific subgroups (more information is available in the documentation).

6.2. Dataset PGEN

6.2.1. Partner pointer

  • For the variable PGPARTZ (PARTZ$), the value -1 (“no answer”) has been replaced by the correct value 5 (“unclear”).
  • Starting with wave BH, the new quality control processes implemented in generating the partner indicator have improved the quality of data from previous waves:
    • Contradictory answers between partners regarding their relationship have been identified and corrected.
    • Partnerships with differing partner indicators (1 “spouse” or 2 “life partner”) within a relationship have been identified and corrected.
    • Errors in the assignment of PARTZ values (1 “spouse” and 2 “life partner”) due to different filter routing in the different survey instruments have been corrected. Marriages were asked differently in the individual biography questionnaire for Sample J+K and in the individual questionnaire for Samples A-I. Separating out samples J and K played a key role in this correction, since these two led to errors due to their different filter routing.
    • Partnerships with recently deceased individuals were identified and deleted.
    • Respondents’ data on divorce, separation, or the death of a life partner within the past year have been taken into account for the first time in the process of generating the data.
    • For the first time, family status, civil status, partner’s first name (permanent partner number), and place of residence of the partner in the case of refugees’ partnerships (Samples M3-M5) have been taken into account in addition to the relationships between the different household members reported by the interviewer.

6.2.2. Volunteer work and side jobs

  • The PGEN (raw: $$PGEN) files contain nine new variables. In 2017, the SOEP fundamentally revised how respondents were surveyed about side jobs. Now, for the first time, respondents can provide answers on three different side jobs. For each side job, they can also indicate whether it is volunteer work (variables HONOR1, HONOR2, HONOR3) and whether they are working for an employer or working freelance (SNDTYP1, SNDTYP2, SNDTYP3). The amount of gross additional income from side jobs is provided as imputed information (SNDJOB1, SNDJOB2, SNDJOB3).
    • SNDTYP117 : First side job occupational status
    • SNDTYP217 : Second side job occupational status
    • SNDTYP317 : Third side job occupational status
    • SNDJOB117 : Current gross additional income from side job 1 (gen.) in euros
    • SNDJOB217 : Current gross additional income from side job 2 (gen.) in euros
    • SNDJOB317 : Current gross additional income from side job 3 (gen.) in euros
    • HONOR117 : Volunteer work 1
    • HONOR217 : Volunteer work 2
    • HONOR317 : Volunteer work 3

6.2.3. Educational degrees

  • In v34, CASMIN and ISCED are based on additional information on educational degrees obtained abroad. Hence, some individuals with degrees from abroad display higher ranks in v33 than in v34.
  • The error in the CASMIN variable in v33 is fixed: In v33, individuals with 2c_voc (vocational maturity certificate) were mistakenly categorized as 2c_gen (general maturity certificate).

6.2.4. AUTONO

  • The generation of AUTONO was discontinued in 2017 due to the difficulty in comparing this variable with the usual models of autonomy. Work is currently underway to introduce comparable definitions of autonomy.

6.3. Dataset PEQUIV

  • The PEQUIV (raw: $$PEQUIV) files contain six new variables. These are:
    • IAUS117 : Pensions from another country
    • AUS217 : Widows / orphans pension from another country
    • ASYL17 : Asylum-seeker benefit
    • FASYL17 : Imputation flag: Asylum-seeker benefit
    • EDUPAC17 : Benefits from the educational package
    • FEDUPAC17 : Imputation flag: Benefits from the educational package
    For more details, see the SOEP Survey Paper: Codebook for the $PEQUIV File 1984-2017.

6.4. Dataset BIOAGEL and BIOPUPIL

  • Variables from questionnaires given to 12-year-olds and 14-year-olds are now provided in BIOPUPIL dataset to reflect the differences in survey mode (parents being asked questions about their children vs. children being surveyed directly).
  • Variables from additional questions in refugee samples are integrated in BIOAGEL and BIOPUPIL datasets.

6.5. Dataset HGEN

A number of changes have taken place in recent years in questions on home rental. The first change took place in the household questionnaire of wave BF (2014). The question asked about the costs of utilities in such detail that respondents were not able to provide correct answers. This led to underestimation of both base rent and utilities.

It emerged that this led to a slight break in the time series. Rent has increased continuously over the years since 1984. In 2014 and 2015, however, rental costs fell and have been increasing again sharply since 2016. This break can be explained by the change in the questionnaire.

Starting with wave BH, respondents are being asked about rent in the same way as in wave BG (2016) and in wave BD (2013) in order to maintain long-term comparability. In addition, with wave BH, the new migration sample M5 and the new refresher sample N are part of the SOEP. Since Sample M5 was not surveyed on utility costs in a comparable way and since many of these respondents probably live in group housing or receive subsidies to cover living costs, no rent variable was generated for them.

Year    v33 - rent    v34 - rent
2010    486.25        486.21
2011    484.93        485.64
2012    491.01        490.75
2013    505.00        505.59
2014    470.95        473.74
2015    507.06        508.57
2016    545.53        541.90
2017    -             550.67
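The break in the time series described above can be made visible by computing year-over-year changes from the v34 rent figures listed above. The following Python sketch does this; the variable names are illustrative.

```python
# v34 rent figures as listed above.
rent_v34 = {
    2010: 486.21,
    2011: 485.64,
    2012: 490.75,
    2013: 505.59,
    2014: 473.74,
    2015: 508.57,
    2016: 541.90,
    2017: 550.67,
}

years = sorted(rent_v34)

# Change relative to the previous year, rounded to two decimals.
changes = {
    year: round(rent_v34[year] - rent_v34[prev], 2)
    for prev, year in zip(years, years[1:])
}

# Year with the largest decline relative to the previous year.
largest_drop_year = min(changes, key=changes.get)
print(largest_drop_year)  # 2014
```

The largest decline falls in 2014, the year in which the questionnaire changed, followed by strong increases in 2015 and 2016.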

6.6. Dataset BIOIMMIG

  • The population of BIOIMMIG shrank due to a change in the coding of BIIMGRP (for more information, see the BIOIMMIG documentation).

6.7. Dataset HHRF/PHRF

  • New variables in PHRF (and ENUMHRF, available on request): BHPHRF, BHPBLEIB, BHPHRFAM4, BHPHRFM5, BHPHRFN
  • Please note that with our new integrated data format, you’ll find all weighting variables now directly in PPATHL or HPATHL.
  • On request, we provide stand-alone weighting variables (BHPHRFM35, BHHHRFM35) for the refugee samples M3, M4, and M5.

6.7.1. Revisions and Bugfixes

  • Due to confusion in the country codes for Iran and Russia in the sampling frame (Central Register of Foreign Nationals, AZR), design weights for Samples M3 and M4 as well as their cross-sectional weights for wave BG had to be updated.
    In Wave BG, we interpreted the population of samples M3 and M4 as refugees who immigrated to Germany between January 2013 and January 2016. In fact, only those refugees whose registration at the Central Register of Foreign Nationals (AZR) took place by April 2016 were included in those samples. Sample M5 includes, among others, refugees who immigrated in the same period but were registered later. For this reason, the total for the post-stratification of the second wave of M3 and M4 has been reduced by the number of refugees with a later registration date.

Data distribution v33 (Wave BG)

SOEP-Core soep.v33.1

1 Deletion of incorrectly conducted interviews in the IAB-BAMF-SOEP Survey of Refugees

In the process of preparations for the next wave of the IAB-BAMF-SOEP Survey of Refugees, the survey institute determined that an interviewer had not conducted interviews correctly, affecting six percent of the household interviews in the sample. These households were removed from the dataset, but are available upon request for survey methodological analysis at a guest work station at the SOEP Research Data Center. In addition to deleting these lines from all affected datasets, we also made the following modifications:

  • Due to the deletion of household and individual interviews, the weights had to be updated (dataset HHRF and PHRF) to take the slightly reduced number of cases in the 2016 survey year into account.
  • The new weights were updated or included in the dataset BGPEQUIV.
  • Imputation of monthly household net income (I[1-5]HINC16) was redone for this sample in BGHGEN and in the dataset MIHINC.

2 Update INTID in BG files

Datasets from the current BG wave contained errors in the assignment of interviewer IDs. These were corrected.

3 Corrected number of entries in $$KIND (2014-2016)

Inconsistencies between key variables on population assignment in the PPFAD and $$KIND datasets were corrected. There was an error of one year in the definition of the target population in the $$KIND datasets from 2014 to 2016. In some cases, this led to a lack of information on the year of birth in files on children:

    • bekgjahr: 1998 for all samples
    • bfkgjahr: 1999 for all samples
    • bgkgjahr: 1999 only for samples M3 and M4 in 2016

These corrections also affect the number of cases in the file KIDLONG, which was corrected correspondingly.

3.1 Change in the $$NETTO codes in 96 cases (children) in the years 2014-2016

In the process of data checks, the $$NETTO codes in PPFAD were also compared and corrected. In survey years 2014 to 2016, some children had been incorrectly assigned the code 20 instead of 30 on the variable $$NETTO in the PPFAD dataset. This error has been corrected in v33.1. The update also made it necessary to correct person weights in the affected survey years (dataset PHRF), because the determination of which individuals in interviewed households should be assigned a valid weight is based on the variable $$NETTO. The updated weight is also contained in v33.1.


In BIOPAREN, a number of missing values in the flag variables for parental (professional) education and the years of death of the parents were updated and filled in.


The algorithm for imputing missing dates in the spells was optimized. As a result, in v33.1, the imputed variables and the variables derived from them were changed, specifically all variables with the suffix _imp and the variable staytime. The changes affected a total of 349 of 15,640 spells.

6 Update AUSB16 in BGPGEN

The variable AUSB16 (“profession requires vocational training”) in BGPGEN was updated. The correction substantially decreased the number of missing values [-1].

SOEP-Core soep.v33

The new data distribution (1984–2016) “SOEP v33” provides, for the most recent survey year 2016, the usual wave-specific data files BGPBRUTTO, BGP, BGPKAL, BGPGEN, BGHBRUTTO, BGH, BGHGEN, BGKIND, and BFPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). Additional new samples, datasets, or variables are listed below:

1 New Sample: IAB-BAMF-SOEP Refugee Sample (Samples M3/M4)

With version 33 of the SOEP data, we will also be releasing the integrated data from the 2016 IAB-BAMF-SOEP Survey of Refugees in Germany as two supplementary samples to the SOEP. The samples are based on the overall population of refugees, independent of residency status, who arrived in Germany between January 1, 2013, and January 31, 2016. The samples were surveyed through additional funding from BA/IAB/BMAS in the case of M3 and from BMBF in the case of M4. Sample M4 contains a higher number of refugee families containing children and teenagers.

The German Central Registry of Foreigners (AZR) provided the data basis from which the sample was drawn. In this sample, 4,816 adults in 3,554 households were surveyed in 2016, and basic data was collected on 5,717 minors living in the same households. To conduct the survey, the questionnaire was translated into seven languages, and in some cases interpreters were available. To address the specific situation of the target group, audio-assisted versions of the survey instruments were developed.

The survey comprises an integrated individual and life-course questionnaire, a household questionnaire, and a questionnaire for interviewers. As was the case with samples M1 and M2 (IAB-SOEP Migration Surveys), participants were first asked for consent to linking their survey data with the IAB Integrated Employment Biography data. The Research Data Centre of the Federal Employment Agency at IAB will be providing the linked data to researchers as of Spring 2018.

The study design is described in detail in Martin Kroh et al. 2016. Das Studiendesign der IAB-BAMF-SOEP-Befragung von Geflüchteten. SOEP Survey Papers 365: Series C. Berlin: DIW Berlin / SOEP

1.1 Integration into SOEP and original data (BGP and BGP_REF)

The original data from the survey instruments used in Samples M3 and M4 can be found in original format in the dataset BGPREF, where the individual and the biographical questionnaires are combined. The variables are also integrated into the other standard or generated datasets:

  • Variables equivalent to those in the individual questionnaire in other samples are included in the dataset BGP. BGP also includes all variables that are specific to the refugee questionnaire but will be asked more than once.
  • Variables equivalent to those in the biographical questionnaires in other samples are included in the respective biographical datasets (e.g., BIOMARSM).
  • The comprehensively surveyed migration biography can be found in the new dataset REFUGSPELL.

2 New datasets / variables

2.1 Datasets directly based on survey instrument (like $P and $H)


        With the integration of the new migration samples since 2013, the $P and $H datasets include data from more survey instruments. The basis remains the paper version of the questionnaire from samples A-L1, but it is supplemented by data from sample-specific survey instruments. To make the source of each record easy to identify, $H and $P now contain a variable identifying the particular instrument for all waves starting with BD (2013).

Additional variables on occupational codes

      There are now many more variables containing coded occupational information in the different questionnaire-specific datasets ($P, $JUGEND, $LUECKE, also $P_MIG and $P_REFUGEES). The variables can be identified by the suffixes denoting the classification used. ISCO-88 and KldB92 are available for all occupations; older $P files contain ISCO-68, and newer files contain ISCO-08 and KldB2010.


    Since 2000 (wave Q), first-time respondents between the ages of 16 and 17 have received a separate biographical questionnaire with additional age-group-specific questions, for instance, about their relationship to their parents or about what they do in their free time. Up to now, only some of the data collected from this survey have been processed and provided to users in dataset BIOAGE17. Starting with the current data release, the complete data will be provided in individual $JUGEND datasets.

2.2 New variables in PPFAD


      The variable SEXOR combines information on the sexual orientation of respondents from various sources in the SOEP. In 2016 (wave BG), for the first time, the SOEP included a direct question about sexual orientation (self-reported).


    This variable indicates the federal state in which the respondent was born. It is available for respondents who were interviewed after 2012 and who reside in Germany’s current federal states. Data users interested in obtaining more specific information on place of birth at the level of the municipality can access this data on a guest visit to the SOEP Data Research Center (contact for details).


Bioagel now contains information from the new questionnaire for 13- to 14-year-olds that was introduced in 2016 (v33). The questionnaire contains items on personality, leisure time activities, personal networks, educational aspirations, and family life, and is completed by the young people themselves. Many of the questions included here stem from the questionnaire for 11- to 12-year-olds introduced in 2014 (v31). For those respondents who already completed this questionnaire, we now provide longitudinal information on development in many areas such as personality and educational aspirations.


For migration biographies in the refugee samples, we created the new spell data set REFUGSPELL. The variables in MIGSPELL and REFUGSPELL are derived from different instruments and only partially overlap. The data structure allows the two data sets to be linked if desired. Detailed documentation will be provided in the biographical data documentation of the SOEP.

2.5 New variables in $PGEN

$P_RELIGION (Religious affiliation)

      An integrated version of the religious affiliation variable for all respondents in 2016. Integration was necessary because additional differentiations were used to survey the migration samples.

PICORIG[A-C]$$ (Party Identification in Country of Origin)

    • PICORIGA16 - Party Identification in Country of Origin
    • PICORIGB16 - Party Identified with in Country of Origin
    • PICORIGC16 - Party Family of PI in Country of Origin


Current residency status

    This variable will be a harmonized version of the current residency status variable for all immigrants in the corresponding samples, including the new refugee samples.


2.6. New variables in $PEQUIV

  • KIDY$$ Income of Children in Household
  • FKIDY$$ Imputation Flag for Income of Children in Household
  • IWITH$$ Profit Withdrawal
  • FWITH$$ Imputation Flag for Profit Withdrawal

2.7 Grip strength data for 2016

GRIPSTR update: The data on grip strength from the survey year 2016 is now included in the GRIPSTR dataset.

3. Revisions and bug fixes

3.1 Variables in PPFAD


    Information on the country of birth (GERMBORN, CORIGIN) and the year of immigration (IMMIYEAR) is no longer compared with and coded according to the previous year’s information in PPFAD. Instead, all information available on a respondent in the SOEP is collected and compared to code these variables. Efforts have been made (1) to give information on all respondents, dramatically reducing the number of missing values, and (2) to avoid group categories for the country of birth such as Eastern Europe (now, e.g., Poland). Three new variables, GERMBORNINFO, CORIGININFO, and IMMIYEARINFO, are introduced in v33 to indicate the quality of information given in GERMBORN, CORIGIN, and IMMIYEAR.



    The changes in GERMBORN also influence MIGBACK and MIGINFO, resulting in some value changes and a stronger focus of MIGINFO on the availability of parental information.



    Different proxies were used to code the respondents' place of residence in 1989 (variable LOC1989), resulting in some value changes. A new variable, LOCINFO, has been introduced in v33 to indicate the quality of the information in LOC1989.

    3.2 BIOIMMIG

    The target population and sample size of the BIOIMMIG dataset have changed. The dataset is no longer limited to respondents who were born abroad and had non-German citizenship. In addition, cases without valid BIOIMMIG information in any wave, or with valid information only on BISCGCF, are no longer included in the dataset.

    3.3 Update of PWEALTH and HWEALTH

    Up to now, the former FiD samples were not integrated into the data for 2012 even though these samples also received wealth questionnaires. With the current data release, these cases have now been integrated.

    3.4 BIOPAREN

    BIOPAREN was rebuilt from scratch. We excluded redundant variables and changed the variable names to English for consistency (which means that VNR is now FNR). Please see the documentation for a full list of changes and an overview of the new variable names.

    3.5 $PGEN

    With variables DEGREE$$, FIELD$$, and TRAIN[A-D]$$, discrepancies had appeared in the retrospectively reported data. The classification was improved to deal with inconsistencies in multiple answers. The conversion key for occupational information based on the ISCO-88 classification for TRAIN[A-D]$$ in KldB92 was extended from two to four digits, and the conversion key itself was revised.

    3.6 $HGEN

    In Wave BF, a major change took place in the way respondents were asked about rent: respondents were asked to provide more detailed information on their rental expenses. It became clear from ex post analysis that the way these questions were asked was too complicated for some respondents and that it had resulted in a discontinuity in the time series. As a result, according to SOEP (as well as in the comparative statistics), the average rent had risen systematically over time, but not in 2014 or 2015. This discontinuity can be explained by the change in the questionnaire. Starting with wave BG, the questions on rent were therefore changed back to those in wave BD. In wave BG, migration samples M3 and M4 are also part of the SOEP. Since these respondents were not asked what they paid for utilities, no rental variable was generated for this group.

    Data distribution v32 (Wave BF)

    SOEP-Core soep.v32.1

    • BIOCOUPLY and BIOMARSY: In the first version of the data delivery, the wrong data were uploaded for these two datasets by mistake. This version contains the correct datasets.
    • NACE in BFP and BFPGEN: A user reported implausible values for the variables BFP55_NACE and NACE15, which contain information on the industry of the current job. In this version, the information has been updated after a bug in the script was fixed.
    • Scale shift in BFP: In the v32 data release, the scales in BFP on the probability of specific events occurring in working life, which in previous years had been coded from 0-100 at 10-point intervals, were given on a scale from 0-10 for the CAPI and CAWI interviews. This inconsistency was corrected in the update by adapting the scales to the previously used coding: scales from bfp4201, bfp4202, bfp4203, bfp7201, bfp7202, and bfp7203 were multiplied by 10 where bfpinta = 9 or 10; also, one case in bfp7201 was changed from 4 to 40 where bfpinta = 8.
    • einstieg_artk and einstieg_pbio: Since data version 32, SOEP has offered two additional labor market entry variables as part of the BIOJOB file. They were constructed on the basis of employment history information to the exact year and month. They refer to a generic uniform definition of the first survey period after the transition from the educational system to the labor market. The construction of these variables is documented in detail in SOEP Survey Paper 429; a short version of the description is also available in the BIOJOB documentation. (SOEP Survey Paper 418)
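The scale correction described for BFP above can be sketched in Python as follows. The variable names and the interview-mode values 9 and 10 are taken from the text. The record layout (one dictionary per respondent) and the decision to leave negative missing codes unscaled are assumptions made for illustration. The single-case correction for bfpinta = 8 is not reproduced because the case is not identified in the text.

```python
# Scale variables and interview modes named in the release note.
SCALE_VARS = ["bfp4201", "bfp4202", "bfp4203", "bfp7201", "bfp7202", "bfp7203"]
RESCALED_MODES = {9, 10}  # values of bfpinta for which the 0-10 scale was used

def rescale_record(record):
    """Return a copy of the record with 0-10 answers mapped back to 0-100."""
    out = dict(record)
    if out.get("bfpinta") in RESCALED_MODES:
        for var in SCALE_VARS:
            value = out.get(var)
            # Assumption: negative values are missing codes and stay unchanged.
            if value is not None and value >= 0:
                out[var] = value * 10
    return out

# Hypothetical example record, not real SOEP data.
example = {"bfpinta": 9, "bfp4201": 4, "bfp7201": -1}
print(rescale_record(example))
```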

    SOEP-Core soep.v32

    The new data release (1984–2015) "SOEP.v32" provides, for the most recent survey year 2015, the usual wave-specific data files BFPBRUTTO, BFP, BFPEQUIV, BFP_MIG, BFPKAL, BFPGEN, BFPAGE17, BFHBRUTTO, BFH, BFHGEN, BFKIND, and BEPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors).

    1. New migrant subsample (M2)

    In 2013, we conducted the first IAB-SOEP Migration Sample in partnership with the Institute for Employment Research (IAB) in Nuremberg (for an overview of M1, see SOEP Survey Paper 216). The households from the second IAB-SOEP Migration Sample surveyed in 2015 are now also included in the SOEP data. The target population of the second IAB-SOEP Migration Sample consists of immigrants who arrived in Germany between 2010 and 2013. Migrants from the new EU member states in Eastern Europe dominate this group. This focus will make it possible to better describe the dynamic recent evolution of immigration to Germany. Sample M2 consists of 1,096 households and was, like sample M1, drawn from register data from the Federal Employment Agency.

    Record Linkage

    Please note that data from both samples can be linked with administrative employment and income data: Survey respondents are asked to provide explicit consent to record linkage. But since this linked dataset contains social data, these weakly anonymized data are only accessible on site at the Research Data Center of the German Federal Employment Agency at the IAB (FDZ IAB). Researchers can access FDZ IAB data through a guest visit to the IAB or through remote data processing, also arranged with the IAB. The linked data will soon be available to external researchers. Requests for data access should be directed to FDZ IAB, since a contract with IAB for data use is required.

    For more information, see the FDZ IAB website.

    2. Weighting

    • In version v32 of the SOEP data, the new migrant subsample, M2, has been integrated into the SOEP weighting framework. As is our usual practice when a new sample is integrated into the SOEP, we make different weighting factors available for the first wave. The standard weights (bfhhrf/bfphrf) allow researchers to draw inferences about the underlying population of residents in Germany based on all SOEP samples. The variables bfhhrfam1/bfphrfam1 allow the same inferences, but only using data from the old Samples A to M1. Comparisons between both sets of weights thus enable researchers to gauge the influence of the recent enlargement of the SOEP on population estimates. Weights specific to the recent enlargement M2, bfhhrfm2/bfphrfm2, allow researchers to draw inferences about the target population of immigrants to Germany between 2010 and 2013.
    • The adjustment of weights to census margins on the individual level has been updated since 1984 so that now the number of women and men in each age group (five-year categories) is given as the margin. Up to now, two separate margins were used for sex and age group.
    • Upon request, we now provide weighting factors for survey years 2010 to 2013 (waves BA to BD) excluding Samples L1 to L3. Due to differences in survey instruments used with Samples L1 to L3 in the corresponding waves as part of the "Familien in Deutschland" (Families in Germany) survey, a need for weighting may arise when variables are to be analyzed that were not surveyed in the other samples.
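One way to gauge the influence of the enlargement, as suggested above, is to compute the same weighted estimate once with the integrated weights and once with the weights for Samples A to M1. The Python sketch below illustrates this with made-up numbers. The weight names bfphrf and bfphrfam1 come from the text. The outcome variable `y` and the convention that cases outside a weight's sample carry a weight of 0 are assumptions.

```python
def weighted_mean(rows, value_key, weight_key):
    """Weighted mean over all rows with a positive weight."""
    used = [row for row in rows if row[weight_key] > 0]
    total_weight = sum(row[weight_key] for row in used)
    if total_weight == 0:
        raise ValueError("no cases with positive weight")
    return sum(row[value_key] * row[weight_key] for row in used) / total_weight

# Made-up illustration data: the third person belongs to the new sample M2
# and therefore has no weight in the A-to-M1 weighting scheme.
persons = [
    {"y": 10, "bfphrf": 100, "bfphrfam1": 120},
    {"y": 20, "bfphrf": 100, "bfphrfam1": 80},
    {"y": 40, "bfphrf": 50, "bfphrfam1": 0},
]

all_samples = weighted_mean(persons, "y", "bfphrf")
old_samples = weighted_mean(persons, "y", "bfphrfam1")
print(all_samples, old_samples, all_samples - old_samples)
```

A large difference between the two estimates suggests that the result is sensitive to the inclusion of the new sample.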

    3. Changed datasets or variables

    • MIGSPELL: With the integration of the data from 2013 (BD) to 2015 (BF), larger changes in the number and coding of the MIGSPELL variables were necessary, since in particular the status upon entry to Germany was surveyed in the individual waves with differing degrees of specificity. In addition, an improved procedure was introduced for imputation of missing data. A detailed description of the new version of MIGSPELL can be found in the SOEP 2015 documentation on Biography and Life History Data (coming soon).
    • Variables connected to occupations:
      - The variable names have changed and should now be more informative; the name of the coding scheme is now part of the variable name, e.g., isco88.
      - The occupational codes (KldB92, ISCO-88) now comply better with official standards (e.g., variables with suffixes _kldb92 or _isco88 in $P files).
      - In $PGEN there are now also variables using the coding schemes for KldB2010 and ISCO-08.
      - The code for generating the derived prestige scales has been redesigned, e.g., egp88_12 for egp class based on ISCO-88 in the year 2012.
    • BIOIMMIG:  The variable biwfam ("Already Had Family In Country") was recoded incorrectly in the generated dataset for the migration samples in 2013 and 2014. This was corrected in the current data release.
    • Survey Year: With Version 32, variables referring to the survey year are referred to consistently as syear. Previously there were a few variables with names like erhebj and svyyear.

    4. New datasets or variables

    • BIOIMMIG: Additional variable for the main reason for migrating to Germany (only available since 2014).
    • PFLEGE: A new variable, appraisal, with the label “officially assessed as in need of care”
    • $PEQUIV: six new variables:
          -  ichsu$$ Child support, caregiver alimony
          -  fchsu$$ Imputation flag child support, caregiver alimony
          -  ispou$$ Divorce alimony
          -  fspou$$ Imputation flag Divorce alimony
          -  irie1$$ Riester pension plan
          -  irie2$$ Riester widow pension plan


    • PPFAD: Person-related meta dataset
      -  Some immigration variables (GERMBORN, CORIGIN and IMMIYEAR) previously contained a -3 for all respondents in Sample G who were not asked to state their country of birth and year of immigration. Since respondents from other samples (e.g. A) were also not directly asked to provide this information and were coded -2, the coding of missing values was not consistent across samples. This inconsistency was corrected in the new update (v32).
      -   Respondents who immigrated in the year 1949 (when the Federal Republic of Germany was founded) were previously considered not to have been born in Germany due to a coding error. This has been fixed in the updated version, and now, in accordance with the German Microcensus, all persons who immigrated before 1950 (that is, up to and including 1949) are considered to have been born in Germany. This also led to a change in the value label of IMMIYEAR.
      -   More information was considered in the updated version of MIGINFO, leading to changes in the values.
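The cutoff rule for the year of immigration described above can be stated compactly. This is a minimal Python sketch of the rule only, not of the actual GERMBORN coding, whose value scheme is not given here. The function name is hypothetical, and treating negative values as missing codes is an assumption.

```python
def treated_as_born_in_germany(immiyear):
    """Apply the Microcensus-style cutoff: immigration before 1950 counts as born in Germany.

    Returns None when the year of immigration is missing (None or a negative code).
    """
    if immiyear is None or immiyear < 0:
        return None
    return immiyear < 1950

for year in (1948, 1949, 1950, -2):
    print(year, treated_as_born_in_germany(year))
```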

    Data distribution v31 (Wave BE)

    1. Integration of the FiD study (data from 2010 ongoing)

    We are pleased to announce that the data release v31 will include the data from “Familien in Deutschland” (Families in Germany, FiD), which is being retrospectively integrated into the SOEP and made available in user-friendly form to all SOEP users. The survey was carried out in parallel to the SOEP as a so-called “SOEP-related study” from 2010 to 2013.

    The original SOEP-related study FiD

    The idea of FiD was to evaluate the full range of public benefits in Germany for married people and families on behalf of the Federal Ministry for Family Affairs. The datasets available—including the SOEP—were not sufficient for differentiated analysis of the segments of the population targeted by family policies. Particularly problematic were the very small percentages of single parents, families with more than two children, low-income families, and families with very young children in the German population. These groups are of course included in the SOEP, but the number of observations is too small for sound statistical analysis.

    Since 2010, the SOEP Research Infrastructure at DIW Berlin has been working in collaboration with TNS Infratest Sozialforschung to survey more than 4,500 households every year. The FiD sample consists of the following subsamples:

    • A sample of families in “critical income brackets”
    • A sample of single parents
    • A sample of families with more than two children
    • “Cohort samples” of the 2007, 2008, 2009, and 2010 (first quarter) birth cohorts.

    A description of the original FiD study can be found in the article “Familien in Deutschland – FiD” by Mathis Schröder, Rainer Siegers, and C. Katharina Spieß, Schmollers Jahrbuch 133 (4), 2013, 595-606 (pre-published 2013 as SOEPpapers 556, Berlin: DIW Berlin).

    Integration into SOEP-Core

    Starting with Version 31 of the data, the FiD sample will be integrated completely into the SOEP-Core data, that is, as if it were a new sample drawn as part of SOEP-Core in 2010 and 2011. The integration of the FiD sample will result in a significant increase by almost one-third in the number of cases in SOEP-Core since 2010. The figure shows how the new FiD samples L1 to L3 have affected cross-sectional sample size since 2010. The retrospective integration meant that the sample variables had to be adjusted as other subsamples have been added to SOEP-Core since 2010 (see adjustment of the sample variables).


    In total, 14,166 variables from 64 datasets have been integrated into the various SOEP datasets, and the generated data sets or variables have been adjusted. Variables in the FiD survey instruments that were not contained in the corresponding SOEP survey instruments have been included in the respective datasets as additional variables (with the original FiD variable names starting with “fyy”, where “yy” is a two-digit year identifier). The table below gives an overview of the number of variables in each of the two main questionnaires that could be integrated.

    Number of variables integrated, by questionnaire:

    Year    Individual questionnaire (–p)    Household questionnaire (–h)
    2010    314                              274
    2011    472                              172
    2012    350                              188
    2013    363                              169

    This means that from 2010 on, SOEP users automatically have more cases in their study population, without having to make any changes in scripts. Of course, it may be that certain variables were not collected in FiD and are therefore unavailable for these cases. Here, please refer to our conventional coding of missing values, which makes this easy to see at the variable level:

    Code Meaning
    -1 no answer / don’t know
    -2 does not apply
    -3 implausible value
    -4 Inadmissible multiple response
    -5 Not included in this version of the questionnaire
    -6 Version of questionnaire with modified filtering
    -8 Question not part of the survey program this year*

    *Only applicable for datasets in long format.
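When working with these codes, it is often useful to separate valid answers from missing values and to see why values are missing. The Python sketch below uses the codes and meanings from the table above. The helper names `to_valid` and `missing_summary` and the example answers are hypothetical.

```python
# Missing codes and their meanings, as listed in the table above.
MISSING_CODES = {
    -1: "no answer / don't know",
    -2: "does not apply",
    -3: "implausible value",
    -4: "inadmissible multiple response",
    -5: "not included in this version of the questionnaire",
    -6: "version of questionnaire with modified filtering",
    -8: "question not part of the survey program this year",
}

def to_valid(value):
    """Return None for a documented missing code, otherwise the value itself."""
    return None if value in MISSING_CODES else value

def missing_summary(values):
    """Count how often each missing code occurs, keyed by its meaning."""
    counts = {}
    for value in values:
        if value in MISSING_CODES:
            label = MISSING_CODES[value]
            counts[label] = counts.get(label, 0) + 1
    return counts

# Made-up answers for one variable; -5 would be typical for a question
# that was not part of the FiD version of the questionnaire.
answers = [3, -1, 7, -5, -5, 0]
print([to_valid(value) for value in answers])
print(missing_summary(answers))
```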

    2. Cross-sectional weights 2014

    The Federal Statistical Office plans to adjust the already-released Microcensus data from 2011 and 2012 based on the 2011 census data. This means that in the present SOEP data release (v31), the weights for waves BB and BC will change due to the adjustment to the 2011 census data.

    Because v31 will include the data from the SOEP-related study FiD, the integration of these households into the SOEP will increase the overall case number by around one-third and it will also affect the integrated weighting variables. This is due to the additional households as well as to the differentiated consideration of official information on family types in the weighting process. To allow users to test how a new sample may affect their research using the SOEP data, we provide both integrated weights and also separate weights for the old and new samples in the year when a refresher sample was integrated into the SOEP.

    3. Other changes

    3.1 Adjustment of the psample / hsample variables

    Due to the retrospective integration of the FiD sample, the psample variable in ppfad and the corresponding hsample variable in hpfad had to be adjusted.

    Sample variables

    Value   Old label (v30)             New label (v31)
    1       A German West               A Original Sample (DE-West)
    2       B Foreigner West            B Migration (up to 1983, DE-West)
    3       C German East               C Original Sample (DE-East)
    4       D 84-93 Immigrant (West)    D 1994/5 Migration (1984-92/94 DE-West)
    5       E Refreshment 1998          E 1998 Refreshment
    6       F ISOEP 2000                F 2000 Refreshment
    7       G High-Income Test 2002     G 2002 High-Income
    8       H Refreshment 2006          H 2006 Refreshment
    9       I Incentives 2009           I 2009 Incentivization
    10      J Refreshment 2011          J 2011 Refreshment
    11      K Refreshment 2012          K 2012 Refreshment
    12      (none)                      L1 2010 Birth Cohorts (2007-2009)
    13      M Migration 2013            L2 2010 Family Types
    14      (none)                      L3 2011 Family Types
    15      (none)                      M1 2013 Migration (1995-2010)

    3.2. Biographical data sets

    The following datasets with biographical information were pooled to keep the number of life-course-related datasets to a reasonable level:

    biobirth and biobirthm -> biobirth
    Women’s (biobirth) and men’s (biobirthm) childbirth biographies are merged into the dataset biobirth as of v31, of course along with a gender variable.

    bioage01 to bioage12 -> bioagel
    Starting with data distribution v31, the age-specific data from the mother/parent-child questionnaires are provided only in the user-friendly “long” format: Rather than as age-specific individual files (e.g., bioage01, bioage03, ...), all mother-child and parent-child questionnaires are now pooled in the bioagel dataset. Consequently, all information on children can now easily be found in one dataset. The documentation on the biographical data includes a syntax to generate the age-specific individual files for those who do need them and information on how to use the new bioagel “long” data set most efficiently with SPSS and Stata.

    The dataset bioage17 derived from the youth questionnaire is not included in this bioagel dataset.
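For users who still need age-specific files, the general idea of splitting a pooled "long" file can be sketched as follows. This Python example is an illustration only and is not the syntax provided in the biographical data documentation. The key `questionnaire`, the identifier `child_id`, and the example records are hypothetical and do not reflect the actual variable names in bioagel.

```python
from collections import defaultdict

def split_long_file(records, type_key="questionnaire"):
    """Group records of a pooled long-format file by questionnaire type."""
    groups = defaultdict(list)
    for record in records:
        groups[record[type_key]].append(record)
    return dict(groups)

# Made-up records standing in for rows of the pooled file.
bioagel = [
    {"child_id": 1, "questionnaire": "bioage01", "syear": 2012},
    {"child_id": 1, "questionnaire": "bioage03", "syear": 2014},
    {"child_id": 2, "questionnaire": "bioage01", "syear": 2013},
]

files = split_long_file(bioagel)
for name in sorted(files):
    print(name, len(files[name]))
```

Keeping the data pooled and filtering on the questionnaire type only when needed is usually simpler than maintaining separate files, which is the motivation for the long format.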

    3.3 Changes in $HGEN

    The file HGEN v31.1 now contains the variable gas$$, which states the household’s gas costs starting in 2014. The variables $$eqplif and $$eqpnrj have now been carried forward from the last two years if a household did not provide a response in a given year.

    3.4 Other changes in SOEP v31.1

    The updates in v31.1 only affected the values of various variables. For detailed information please see doi soep.v31.1

    Data distribution v30 (Wave BD)

    The new data distribution (1984–2013) “SOEP v30” provides, for the most recent survey year 2013, the usual wave-specific data files BDPBRUTTO, BDP, BDPKAL, BDPGEN, BDPAGE17, BDHBRUTTO, BDH, BDHGEN, BDKIND, and BCPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). Additional new samples, datasets, or variables are listed below:

    1. Cross-Sectional Weights 2013

    With the figures now available from the official statistical agencies, we are pleased to be able to provide the finalized weighting variables in this version of the data (doi:10.5684/soep.v30). As is always the case in years of refresher and enlargement samples, we are providing weights for the old and new samples, both separately and together. These different sets of weights are designed to make it easier for users to study how the integration of a new sample affects the analysis of specific research topics.

    Please also note that the government census carried out in 2011 replaced the projected population figures, which had been regularly updated based on the last census in 1987, with current population figures from the Federal Statistical Office. This means that the post-stratification of SOEP weights from wave BD in data release v30 is based on a version of the Microcensus from 2013 that considers the 2011 census for the first time. It is therefore possible that changes in weighted analyses of the SOEP between 2012 (BC) and 2013 (BD) are the result of the government statistics switching over to the more recent census. The correction is evident in the fact that the estimated total number of individuals living in private households in Germany fell from 81 million in 2012 to less than 80 million in 2013.

    Given the retrospective revision of the 2011 and 2012 Microcensus data to account for the census results, our next data release (soep.v31) will include retrospectively revised weighting variables for the 2011 and 2012 survey data.

    If you have any comments on the weighting variables, we would be happy to hear from you.

    2. New IAB-SOEP Migration Sample (Sample M)

    The new IAB-SOEP Migration Sample (Sample M) is a joint project with the Institute for Employment Research (IAB). It is therefore provided not only as part of the normal SOEP distribution (see, for example, variable psample in dataset ppfad), but also as a separate study including only Sample M households (doi:10.5684/soep.iab-soep-mig.2013).

    The new sample takes into account changes in the structure of migration to Germany since 1995. It covers not only direct immigration but also the “second generation,” the children of immigrants. The new sample opens up new perspectives for migration research and provides insights into the lives of new immigrants to Germany. The new sample has the following key features:

    1. The IAB-SOEP Migration Sample substantially increases the sample size for research on migration and the lives of immigrants in Germany: 4,964 persons residing in 2,723 households participated in the first wave of the survey. Moreover, since the survey is included in the regular SOEP as subsample “M”, including migrants from the other SOEP samples in analyses may increase the number of observations further.
    2. The questionnaire used with the new migration sample covers respondents’ entire migration biography. Migration episodes to other countries than Germany are covered as well. This is an important extension over previous SOEP surveys of immigrants’ personal biographies. For the first time, we can now track whether important events in individual biographies occurred in the respondent’s home country, in Germany, or in other destination countries. This also takes into account that migration is no longer a one-time event that lasts for a lifetime but that individual biographies are becoming increasingly “transnational,” often with several migration episodes taking place during an individual’s lifetime and involving personal ties in different countries. We created a user-friendly spell data set, called MIGSPELL, for the use of this data.
    3. Following recent advances in the research on migration and immigration, the IAB-SOEP Migration Sample considers numerous new sets of questions that were not previously considered in the SOEP or other household surveys in Germany, at least not in the necessary depth. Examples of such question blocks are: earnings and the labor force and occupational status before migration; migration decisions in the family and partnership context; and the purposes and channels of transferring remittances.

    3. New datasets / variables

    3.1. MIGSPELL

    For the comprehensively surveyed migration biography, we have created a user-friendly spell data set. Detailed documentation will be available in the biographical data documentation of the SOEP.

    3.2. BDP_MIG

    The original data from the Sample M-specific survey instrument is included in the dataset BDP_MIG, combining the individual and the biographical questionnaire. The variables are also included in the other standard or generated datasets:
    • Variables equivalent to variables in the individual questionnaire of other samples are included in the dataset BDP.
    • Variables equivalent to variables in the biography questionnaire of other samples are included in the respective biography dataset (e.g., BIOMARSM).
    • The comprehensively surveyed migration biography can be found in the new dataset MIGSPELL.

    3.3. JOBEND$$

    Since a number of changes occurred in the categories for reasons for job dismissal, a new longitudinally consistent variable (JOBEND$$) is now offered in the $PGEN data sets.

    3.4. New additional occupations codes

    The data on occupations in the individual questionnaire are now additionally coded using KldB2010 and partly also ISCO-08. The following variables are included in the dataset BDP (listed by variable label):

    • Current Occupational Classification (KldB2010)
    • Current Occupational Classification (ISCO-08)
    • Current Occupational Classification Secondary Employment (KldB2010)
    • Current Occupational Classification Secondary Employment (ISCO-08)
    • Vocational Training / Education Degree Prev. Yr. (KldB2010)

    However, variables of derived scales (e.g., prestige scores in $PGEN) are still based on ISCO-88.

    3.5. Grip strength data for 2012

    GRIPSTR update: The data on grip strength from the survey year 2012 is now included in the GRIPSTR dataset.

    3.6. Wealth data for 2012

    PWEALTH and HWEALTH updated: In the year 2012, all individuals aged 17 and over were again surveyed on wealth, just as they were in 2002 and 2007. These “raw” data were already part of the standard data distribution for Wave 29 and will be included in the upcoming data distribution in a file containing the data for 2002, 2007, and 2012 in “long format”—the file PWEALTH for individual data, HWEALTH with data aggregated according to household context. Values that are missing due to item or partial unit non-response (e.g., missing interviews with individual household members in interviewed households) will be subjected to multiple imputations in complex procedures taking longitudinal information into account.

    3.7. BIOEDU now part of the regular data distribution

    After it became impossible to update the beta version of this data set in version 29, the data have now been updated and incorporated into the regular data distribution. The information from the new IAB-SOEP Migration Sample was also integrated.

    3.8. INTERVIEWER dataset

    The dataset comprises demographic and employment information about interviewers, aggregated data on the interviewers’ fieldwork in each wave, as well as personal details that they provided in the two interviewer surveys of 2006 and 2012. In the process of creating the INTERVIEWER dataset, all interviewer indicators (INTID) in all of the SOEP datasets were checked thoroughly and in some cases revised.

    4. Revisions and Bug fixes

    4.1. Corrections in BILZTCH$$ and BILZTEV$$

    The variables BILZTCH$$ and BILZTEV$$ lacked information on a number of waves up to now. As a result, false values were ascribed to variables in a number of cases: in total, 638 cases previously treated as consistent proved to involve inconsistent increases in educational levels, and 2,582 cases previously treated as inconsistent proved to be consistent.

    4.2. Corrections in DUEBSTD

    In addition to the overtime work generated for 1984 and 1985, overtime work has now been generated for 1987 as well. For these years, overtime hours result from the difference between contractually agreed working hours and the number of hours actually worked per week.
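    The difference described above can be sketched as follows. This is only an illustration, not the procedure used to generate DUEBSTD: the function name is hypothetical, and both the pass-through of negative missing codes and the floor at zero are assumptions that are not stated in these notes.

```python
def weekly_overtime(actual_hours, agreed_hours):
    """Overtime as actual minus contractually agreed weekly hours.

    Assumptions (not from the documentation): a negative input is treated
    as a SOEP missing code and returned unchanged, and a negative
    difference is floored at zero.
    """
    for value in (actual_hours, agreed_hours):
        if value < 0:
            return value  # pass the missing code through
    return max(0.0, round(actual_hours - agreed_hours, 1))


print(weekly_overtime(42.5, 38.0))
print(weekly_overtime(35.0, 38.0))
```

    Flooring at zero keeps respondents who worked less than the agreed hours from receiving negative overtime; whether the generated variable does the same should be checked against the $PGEN documentation.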

    4.3. Revisions of marital and relationship status

    $FAMSTD: As a result of a new process for generating BIOMARSM/Y and BIOCOUPLM/Y, two changes occurred in $FAMSTD: Since 2010 the question on marital status has included the categories “registered same-sex partnership, living together” and “registered same-sex partnership, not living together”. These two categories are also included in $FAMSTD as values “7” and “8”. Furthermore, all spells of BIOMARSM/Y in the category “widowed or divorced” have been set to “not valid” in $FAMSTD. These changes were also applied to previous waves. The variable $FAMSTD is set to -3 if information is implausible, to -5 if persons were not interviewed, and to -1 if persons did not answer the question.

    BIOCOUPLM/Y: For the process of generating BIOCOUPLM, the current relationship status and reported changes in the family situation are taken into account. Although the questionnaire asks for such events on a monthly basis, numerous changes in the relationship status are not reported as events. So in the new version of BIOCOUPLM, we have included a censor variable called “events” which gives you information on whether the exact month of an event is known or whether the beginning or end of a spell reflects the month of the interview due to the lack of reported events. Finally, a new category “added spell” has been introduced into the variable remark, which lets you distinguish between spells that have been edited (value 2) and spells that have been added (value 3). For further information, please see the new documentation on BIOMARSM/Y. The variable SPELLTYP is set to -3 if information is implausible.

    BIOMARSM/Y: Because BIOMARSM is derived from the new version of BIOCOUPLM, we have copied the category “married, separated” from BIOCOUPLM. It reflects the time between a reported separation and divorce or the death of the spouse. Most of these spells of BIOCOUPLM were set to “married” in BIOMARSM, but for those spells without a reported end, event spells were set to “married, separated” and the end of the spells to missing. Parallel spells from the category “divorced or widowed” were added, whereas the outset of those spells was set to missing. Finally, a new category “added spell” has been introduced into the variable remark, which lets you distinguish between spells that have been edited (value 2) and spells that have been added (value 3). For further information, please see the new documentation on BIOCOUPLM/Y. The variable SPELLTYP is set to -3 if information is implausible.

    4.4 $regtyp: conversion to urban / rural area

    The new BBSR typology describes the settlement structure and allows for categorization into four types of regions. However, using these four categories would make it possible to identify specific administrative districts (Landkreise) in the federal states of Saxony, Mecklenburg-Western Pomerania, and Baden-Württemberg. We therefore use a condensed two-category classification: urban and rural areas.

    Data distribution v29 (Wave BC)

    The new data distribution (1984-2012) "SOEP v29" provides, for the most recent survey year 2012, the usual wave-specific data files BCPBRUTTO, BCP, BCPKAL, BCPGEN, BCPAGE17, BCHBRUTTO, BCH, BCHGEN, BCKIND, and BBPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors).

    1. New subsample K

    In 2012, we added a new refreshment sample with 1,526 new households (Sample K). In total, 12,322 households were interviewed as part of the 2012 fieldwork. As with previous general population samples, refreshment sample K was realized using a multi-stage stratified sampling design. It achieved a response rate of 34.7%, very similar to that of our previous refreshment sample J. Thus, the general downward trend in participation was successfully stopped through a range of measures including centralized face-to-face interviewer training, better pay for interviewers, and more attractive incentives for respondents.

    In the current refreshment samples, fieldwork is conducted exclusively by CAPI, as it was with the previous refreshments H (2006), I (2009), and J (2011). Similarly to our other refreshment samples, data collection is focused on three main questionnaires: the household, the individual, and the youth questionnaire. Thus, no supplementary questionnaires were used with respondents in wave 1. The reason for focusing on the key questionnaires is to avoid "overburdening" respondents with a lengthy wave 1 interview.

    2. Revision of the weighting and estimation procedure

    In version SOEP v29 of the SOEP data, the data from subsamples J and K (first collected in 2011 and 2012, respectively) have been adjusted to the German Microcensus for the number of employed people in households of different sizes as well as for the number of private households receiving Unemployment Benefit II (ALG II). This correction prevents an overestimation of households receiving ALG II in the unweighted samples J and K.

    Also, for all newly drawn samples since 1998, a minor adjustment has been made to the definition of households containing foreign nationals. The criterion is no longer the household head but the presence of at least one person of foreign nationality in the household. The revision was made due to a slightly increasing discrepancy between the reference person chosen in the German Microcensus and the household head in the SOEP.

    3. New datasets / variables

    • In 2012, the SOEP replicated its wealth module for the fourth time after 1988, 2002, and 2007. Due to the higher response burden in first-wave respondents, we did not survey wealth in the most recent refreshment sample K (N=1506 households). For the estimation of totals, we therefore recommend using the cross-sectional household and person weights covering "old" samples A through J only and excluding wave 1 units emanating from Sample K, i.e. BCHRFAJ and BCPHRFAJ.
    • COGNIT: For the short cognitive tests implemented in the survey year 2006 we can now provide the first repeat, including an additional word knowledge test. The name of the dataset changed from COGNIT06 to COGNIT, because both survey years are now included in long format. A detailed documentation of the first test can be found in Schupp et al. (2008) Erfassung kognitiver Leistungspotentiale Erwachsener im Sozio-oekonomischen Panel (SOEP), DIW Berlin, Data Documentation 32 (PDF, 447.63 KB).
    • Two new variables in $PGEN: The variable SNDJOB$$ represents the imputed current gross labor income from a second job, generated for all SOEP respondents who are employed in each respective wave. Information about gross income from the second job was first asked in 1995 (wave L). The respective imputation flag is the variable IMPSND$$.
    • For the first time, respondents were asked their place of birth. This information including the coordinates of the respective municipality is available at our guest workstations at the Research Data Center SOEP.
    • A new dataset HCONSUM with generated data from the consumption module used in the SOEP in the year 2010. Detailed documentation (PDF, 1.5 MB) is available online.


    4. Improvements and Bug Fixes

    • Revision of the $STELL codes (relationship to the head of household) to differentiate between biological child, stepchild, adoptive child, etc.:
    Value  Wave BB (2011)             Wave BC (2012)
        0  Head Of Household          Head Of Household
        1  Spouse Of HH Head
        2  Life Partner
        3  Son, Daughter
        4  Foster Child
        5  Son, Daughter-In-Law
        6  Father, Mother
        7  Parent-In-Law
        8  Brother, Sister,-In Law
        9  Grandchild
       10  Other Relative
       11  Non-Relative               Spouse Of HH Head
       12  Child of HH-Heads Partner  Same-Sex Spouse
       13  Same-Sex Spouse            Life Partner
       21                             Son, Daughter
       22                             Stepchild (Child of the Partner)
       23                             Adoptive Child
       24                             Foster Child
       25                             Grandchild
       26                             Great-Grandchild
       27                             Son, Daughter-In-Law
       31                             Father, Mother
       32                             Step Father / Step Mother / Spouse of Father or Mother
       33                             Adoptive Father or Mother
       34                             Foster Father or Mother
       35                             Parent-In-Law
       36                             Grandparents
       41                             Brother, Sister
       42                             Half-Brother, Half-sister
       43                             Stepbrother, Stepsister
       44                             Adoptive Brother/Sister
       45                             Foster Brother/Sister
       51                             Brother, Sister -in Law (spouse of brother/sister)
       52                             Brother, Sister -in Law (brother/sister of spouse)
       61                             Aunt, Uncle
       62                             Niece/ Nephew
       63                             Cousin/Cousine
       64                             Other Relative
       71                             Others
       99  Unknown                    Unknown

    Please note that this also affects the corresponding variables in the dataset $KIND (and KIDLONG) and BIOPAREN.

    • The dataset KIDLONG no longer contains additional variables on birth date (GEBJAHR and GEBMONAT) and sex (SEX). Please use the more intensively checked versions in PPFAD instead.
    • Last year, we already provided the interviewer data with a new variable, INTID, which is unified across all waves and takes the place of the respective file-specific variables ($INTNR). The new variable was determined through one-time generation of a random number; it is therefore fixed and remains consistent in an integrated master file (not contained in the data distribution) for SOEPcore as well as for FiD (Families in Germany) and SOEP-IS (innovation panel). In addition to generating the INTIDs and updating the interviewer characteristics in INTVIEW, we have also made the following revisions:
      • The dataset INTVIEW no longer contains just the interviewer with interviewer characteristics but also all available interviewer numbers. To provide this information, we extracted all interviewer numbers from all available datasets. Flag variables in INTVIEW show whether other interviewer characteristics are available for this particular INTID or not.
      • A total of 181 INTIDs were newly assigned in the updated data, allowing these to be directly linked with the respective interviewer characteristics. This is due to the assignment of numbers by Infratest in East Germany from 1990 to 1995, when there were still some independent interviewers (IBB-numbers) for the East sample whose numbers were assigned according to a different system. These had to be harmonized with the interviewer numbers that were merged later.
    • BIOAGE03: The codes for personality were changed from 1-11 to 0-10 and are now consistent with the codes for personality in BIOAGE06.
    • BIOAGE06: in 2008, for personality, the value zero was mistakenly coded -2. This mistake was corrected. This resulted in up to 65 additional valid cases for some traits in the survey year 2008.
    • $FAMSTD: In generating current marital status, current and previous year were switched for some cases in 2011 in v28.
    • In 2012, the questionnaire provides one-time-only information on the size of the local establishment in addition to the size of the entire company (BETR$$). The enriched questionnaire revealed that in previous interviews, some individuals mistakenly provided information on the local establishment size instead of the entire company size, especially if their entire company had 2000 or more employees. Due to the importance of longitudinal consistency, these persons were identified, and their 2012 original value of the entire company size BETR12 was replaced by their value of the local establishment size. These modifications also affected the variable ALLBET12. Please see the data documentation for further details.
    • The variable RUEBSTD ("overtime hours during last month" in 2001) had cases with incorrect non-response missings (-1), since respondents without overtime mistakenly were assigned to this category. In the corrected version, the value for these respondents is correctly coded as zero overtime hours.
    • With the variable vh4601 and the equivalent variables in the following years, the label "contributions over 2,500 euros" was used, but actually the questionnaire asked for "contributions over 500 euros". The label was corrected.
    • The variables ZERWZEIT and BAERWZEIT ("length of time with firm" in 2009 and 2010) had to be corrected for respondents in sample I who did not have their wave 2009 interview and wave 2010 interview in the respective year but at the beginning of the following year (2010 and 2011). Due to the longitudinal consistency check, these individuals mistakenly received an implausible value (-3) for BAERWZEIT. In the corrected version, the non-missing values of these respondents are considered to be valid and not set to missing.
    • LOC1989: In generating the data, persons who never participated are now included. As a result, -2 now means "does not apply, born before 1989", as planned for this variable. Persons who never participated and for whom no information could be gathered from other sources were set to -1 ("no answer").
    • The variables EXPFT$$, EXPPT$$, and EXPUE$$ (experience in full-time employment, part-time employment, and unemployment) have been improved. The variables now reflect the total length of full-time/part-time/unemployment in the respondent's career up to the point of the interview in a given year (instead of only up to December of the previous year). Since monthly employment activities are asked retrospectively in the following year, the variables cannot be updated for the most current wave.
    • The variable AHINC$$ in dataset $HGEN is no longer part of the data distribution. We recommend using the completely (multiple) imputed monthly net household income from variables I$HINC$$ (or dataset MIHINC in long format over all years).
    • The variables ATATZEIT, AVEBZEIT, AUEBSTD and AERWZEIT were mixed up in the data distribution v28 and had to be corrected:
      • The correct values of ATATZEIT were found in the variable AERWZEIT.
      • The correct values of AVEBZEIT were found in the variable ATATZEIT.
      • The correct values of AUEBSTD were found in the variable AVEBZEIT.
      • The correct values of AERWZEIT were found in the variable AERWZEIT of the data distribution v27.
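    For users who still hold v28 and v27 files, the mapping in the list above can be written down as a small lookup table. This is a hedged sketch only: records are represented as plain dictionaries, and the names V28_FIX and repair_record are hypothetical. Users of v29 do not need it, since the variables have been corrected there.

```python
# Where the correct values for each variable are located, as stated in the
# list above: target variable -> (data distribution, source variable).
V28_FIX = {
    "ATATZEIT": ("v28", "AERWZEIT"),
    "AVEBZEIT": ("v28", "ATATZEIT"),
    "AUEBSTD": ("v28", "AVEBZEIT"),
    "AERWZEIT": ("v27", "AERWZEIT"),
}


def repair_record(v28_row, v27_row):
    """Return the four variables with their correct values.

    v28_row and v27_row are dictionaries holding the same person's
    values from the v28 and v27 distributions (illustrative layout).
    """
    sources = {"v28": v28_row, "v27": v27_row}
    return {
        target: sources[dist][source]
        for target, (dist, source) in V28_FIX.items()
    }


v28 = {"ATATZEIT": 38.0, "AVEBZEIT": 2.0, "AUEBSTD": 99, "AERWZEIT": 40.0}
v27 = {"AERWZEIT": 5.5}
print(repair_record(v28, v27))
```

    Building a new record instead of overwriting in place avoids reading a variable after it has already been replaced.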

    Data distribution v28 (Wave BB)

    1. New additional missing codes

    With the integration of sample J in 2011, the biographical questionnaire was moved from the second to the first wave and combined with the individual questionnaire in an integrated survey. This means that there are some slight differences in the survey instrument between the old samples A-H and the supplementary sample J.

    The following additional missing codes have been introduced to the survey data to document these possible differences:

    -4 "Inadmissible multiple response"
    -5 "Not included in this version of the questionnaire"
    -6 "Version of questionnaire with modified filtering"
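    When analyzing the data, these codes usually have to be separated from valid answers. The following is a minimal sketch: the labels are copied from the list above, the function names are hypothetical, and treating every negative value as missing is an assumption that will not hold for variables with legitimately negative values.

```python
# Labels for the three codes introduced in v28, copied from the list above.
NEW_MISSING_CODES = {
    -4: "Inadmissible multiple response",
    -5: "Not included in this version of the questionnaire",
    -6: "Version of questionnaire with modified filtering",
}


def missing_label(value):
    """Return the v28 label for a missing code, or None for any other value."""
    return NEW_MISSING_CODES.get(value)


def mask_missing(values):
    """Replace every negative value with None.

    Assumption: all negative values are missing codes and valid answers
    are non-negative.
    """
    return [None if v < 0 else v for v in values]


print(missing_label(-5))
print(mask_missing([2, -4, 0, -6, 7]))
```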

    2. Sample I now part of our new Innovation Sample

    The SOEP Innovation Sample has been launched now and includes, inter alia, sample I. Sample I is therefore no longer part of the main survey as of 2011. See SOEP-IS on our website for further information about the Innovation Sample and the possibility of including your own questions.

    3. New and renamed datasets

    3.1 BIOCOUPLM
    BIOCOUPLM provides spell data on partnership histories from the first to last personal interview of a respondent. Spells are measured on a monthly basis.

    3.2 BIOCOUPLY
    BIOCOUPLY provides spell data on partnership histories. It contains annual information on partnership status since the respondent’s year of birth, including available retrospective data and annually updated information.

    3.3 BIOSIB (beta version)
    The new file BIOSIB provides information on siblings living in the SOEP households. The dataset contains the person numbers of all siblings in an observed family. It includes information on their gender, their year of birth, and on the relationship between the observed siblings.
    BIOSIB is included as a beta version in the current data release. Please do not hesitate to send both positive and negative feedback or suggestions to Daniel Schnitzlein.

    3.4 BIOEDU
    The BIOEDU dataset contains details on educational transitions beginning with entry into childcare up to tertiary education in a consistently structured form.

    3.5 BIOAGE long
    In the new integrated bioage long dataset (BIOAGEL), data are presented in “long” format, i.e. this dataset will contain information from BIOAGE01, BIOAGE03, BIOAGE06, as well as BIOAGE08a and BIOAGE08b.

    3.6 TRUST
    Dataset on the Economic Behavior Experiment on Trust and Trustworthiness in the 2003, 2004, & 2005 SOEP Survey

    This experiment to measure trust is based on the investment game introduced by Berg et al. (1995), a one-shot game for two players or movers who anonymously interact with each other. The first mover receives an endowment of 10 points and can transfer zero to ten points to the second mover. Every point that is transferred is doubled by the experimenters. The second mover is also given an endowment of ten points. After receiving points from the first mover, he/she decides on how much of the endowment to transfer back to the first mover (zero to ten points). As with the first mover's transfer, the back-transfer by the second mover is doubled by the experimenters. After the second mover's decision, the game ends and the subjects are paid their income in euros (one point equals one euro) by check sent a few days later.

    A fundamental component of the game is that the participants actually receive money in accordance with the fixed payout function, i.e., all the decisions always have monetary consequences. This version of the game was developed by Fehr, Fischbacher, Schupp, von Rosenbladt & Wagner (2002).
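    The payout rule described above can be written out as a short sketch. The function and constant names are hypothetical, and restricting transfers to whole points is an assumption; the endowment, the transfer range, and the doubling are taken from the description.

```python
ENDOWMENT = 10   # points given to each mover
MULTIPLIER = 2   # every transferred point is doubled by the experimenters


def payoffs(first_transfer, back_transfer):
    """Return (first mover points, second mover points).

    Each mover keeps what they did not transfer and receives twice
    what the other mover transferred. One point equals one euro.
    """
    for transfer in (first_transfer, back_transfer):
        if not 0 <= transfer <= ENDOWMENT:
            raise ValueError("transfers must be between 0 and 10 points")
    first = ENDOWMENT - first_transfer + MULTIPLIER * back_transfer
    second = ENDOWMENT - back_transfer + MULTIPLIER * first_transfer
    return first, second


print(payoffs(10, 10))
print(payoffs(10, 0))
```

    If both movers transfer everything, each ends up with 20 points. If the first mover transfers everything and the second mover returns nothing, the first mover receives 0 and the second mover 30.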

    The combination of representative survey and behavioral experiment was used in the SOEP main surveys in 2003, 2004, and 2005, with only minor modifications. Of the 1,432 original participants in 2003, 1,202 also took part in the experiment in 2004 and 2005.

    The data are available in long format in the "TRUST" dataset. Consequently, this dataset contains information from each of the three waves in which the behavioral experiment was conducted.

    3.7 TIMEPREF
    Dataset on the Economic Behavior Experiment on Time Preferences in the 2006 SOEP Survey

    In this experiment on economic behavior, respondents were asked to decide how they would like to receive €200 in prize money: if they would rather receive it immediately by check, or if they would prefer to wait and receive a larger amount later—that is, with interest. By splitting the sample (N = 1,503 persons) into random subsamples (splits), it was possible to vary both the time horizon and the implied interest rate to test possible incentive effects on the choice between a low payoff in the short term and a high payoff in the long term. The scientific director of the project was Prof. Dr. Armin Falk, CENs, University of Bonn.

    4. New or revised variables

    4.1 $HBRUTTO dataset

    The $HBRUTTO dataset will include a new variable to distinguish between urban, suburban and rural regions. This is based on the spatial categories of counties (as of December 31, 2009) used by the Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR). The following spatial structure characteristics are used to define the categories:

    • Share of county’s population in large or medium-sized cities
    • Population density of the county
    • Population density of the county without taking large or medium-sized cities into consideration

    Thus, three categories can be defined:

    1. Urban regions (Cities with at least 100,000 inhabitants and counties with at least 50% of the population living in large or medium-sized cities and with a population density of at least 150 inhabitants/km²; and counties with a population density not including large or medium-sized cities of at least 150 inhabitants/km²)
    2. Regions undergoing urbanization (Counties with at least 50% of the population living in large or medium-sized cities but a population density of below 150 inhabitants/km², and counties with less than 50% of the population living in large or medium-sized cities, and with a population density (excluding large or medium-sized cities) of at least 100 inhabitants/km²)
    3. Rural regions (counties with less than 50% of the population living in large or medium-sized cities and population density (excluding large or medium-sized cities) of below 100 inhabitants/km²).
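    The three definitions above can be read as a decision rule. The sketch below is one possible reading, not the BBSR's own implementation: the function and parameter names are hypothetical, and the flag for cities with at least 100,000 inhabitants is assumed to be known in advance.

```python
def region_type(city_share, density, density_excl_cities, is_large_city=False):
    """Classify a county as 1 (urban), 2 (urbanizing), or 3 (rural).

    city_share: share of the population living in large or medium-sized
        cities, as a fraction between 0 and 1
    density: inhabitants per square kilometer
    density_excl_cities: density without large or medium-sized cities
    is_large_city: True for cities with at least 100,000 inhabitants
    """
    if is_large_city:
        return 1
    if city_share >= 0.5 and density >= 150:
        return 1
    if density_excl_cities >= 150:
        return 1
    if city_share >= 0.5:
        return 2  # majority in cities, but overall density below 150
    if density_excl_cities >= 100:
        return 2
    return 3


print(region_type(0.6, 200, 120))
print(region_type(0.3, 130, 110))
print(region_type(0.3, 80, 70))
```

    The urban conditions are checked first, so a county that meets any of them is never assigned to one of the other two categories.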


    4.2 $PGEN dataset

    BILZTCH$$ indicates whether the respondents’ answers suggest a downward shift in years of education or training ($BILZEIT) since the last observation or an upward change since the last year which is inconsistent with additional information on education or training recently completed.
    The accompanying flag variable indicates whether the respondent showed an inconsistent change in $BILZEIT, either upwards or downwards, at any point over the entire observation period.


    To be consistent with the FiD dataset, the missing values of the variables $VEBZEIT and $UEBSTD were slightly recoded: the missing value –2 is now assigned to self-employed individuals. In previous waves, self-employed persons had the missing value –3 (implausible answer).

    For $UEBSTD, the value –3 (implausible answer) is assigned to all individuals with more than ten hours of weekly overtime AND who also had an agreed working time of over 80 weekly hours ($VEBZEIT is implausible, value –3) or actual weekly working time of more than 80 hours a week ($TATZEIT is implausible, value –3).
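    The rule for $UEBSTD can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, and the inputs are taken to be the reported hours before any values are set to missing.

```python
IMPLAUSIBLE = -3  # missing code for an implausible answer


def checked_overtime(overtime, agreed_hours, actual_hours):
    """Return weekly overtime, or -3 if the rule above marks it implausible.

    The code is assigned only when overtime exceeds ten hours AND either
    the agreed or the actual weekly working time exceeds 80 hours.
    """
    if overtime > 10 and (agreed_hours > 80 or actual_hours > 80):
        return IMPLAUSIBLE
    return overtime


print(checked_overtime(12, 85, 97))
print(checked_overtime(12, 40, 52))
```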

    4.3 BIOPAREN dataset

    Seven new variables have been added to BIOPAREN:
    VAORT11 and MAORT11 indicate the mother's and father’s current place of residence.
    The remaining five variables provide information on siblings. The variable GESCHW indicates whether the respondent ever had any siblings at the time of the interview. GESCHWUP gives information about the year the sibling information was collected. NUMB and NUMS provide information on the number of brothers and sisters the respondent reports, and TWIN indicates whether any of these are twin siblings of the respondent (and of which type).

    Data distribution v27 (Wave BA)

    The release of the 1984-2010 SOEP data (waves A-BA) will contain the usual year-specific data files (BAP, BAH, BAPGEN, BAHGEN, BAPKAL, BAPBRUTTO, BAHBRUTTO, BAKIND and ZPLUECKE) and the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). The respondents of Sample I (Incentives Sample) answered the biographical background questionnaire for the very first time in 2010.
    Since minor changes have been made to many of the older datasets as well, we strongly recommend reinstalling all of the datasets from the new DVD.

    1. New two-letter prefix (BA)

    This SOEP data release (v27) will include, for the first time in the survey's 27 years, a two-letter rather than a single-letter wave prefix. Since we came to the end of the Latin alphabet with the letter Z in our last data release, we decided to use the wave prefix BA for the cross-sectional data format.

    2. Updated beta version in "long format"

    The SOEP data are now also available in "long format" as a beta version in addition to the usual data format. SOEPlong refers to a compressed form of the SOEP data. Rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format). The data are available on the second DVD. For details, see SOEPnewsletter No. 90/2010 (PDF, 3.53 MB).

    3. Elimination of fakes

    When the data for the second wave of our newest sample I were checked, 36 households were identified as faked interviews and will therefore no longer be included in this data release.

    4. New and renamed datasets

    4.1 BIOAGE08[A|B]

    The BIOAGE08 dataset contains data from the new "parent questionnaire" which is given to the mothers and fathers of seven- to eight-year-old children. Thus, data are now available on the 2002/2003 birth cohorts that were first observed with the "newborn questionnaire." The new "parent questionnaire" is given to both mothers and fathers and thus provides two sets of responses on many of the children in the sample. Therefore, the file was split into two on the basis of the parent's gender and the household type the respondent is living in. BIOAGE08A includes mothers, plus fathers in cases where no information from the mother was available. BIOAGE08B includes fathers only. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).


    4.2 LIFESPELL

    The LIFESPELL dataset contains data from the follow-up studies of SOEP dropouts (1992, 2001, 2006, and 2008), which were not previously included in the regular data release. The follow-up studies, based on information from public registers, serve to identify the current residence of former SOEP respondents, and thus allow studies of life expectancy and decisions to emigrate for a large percentage of SOEP respondents, even long after they have dropped out of the study. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).

    4.3 BIOEDU (beta version)

    The BIOEDU dataset, which is being released this year in provisional form (beta release), contains details on educational transitions beginning with entrance into childcare up to tertiary education in consistently structured form. Users who work with these data are requested to report on their experiences (especially any problems they might have), so that a final version can be released next year.
    Detailed documentation is in the DIW Data Documentation 58 (PDF, 383.03 KB).

    Because of its provisional form, this dataset is not part of the normal distribution and you will find the data on the DVD in an extra archive (link available in the NEWS file on the DVD).

    5. New Variables

    5.1 $PEQUIV dataset

    In the $PEQUIV files, there will be a new additional variable on support payments. With the 2010 questionnaire, the SOEP has split the item "support payments" into two separate items. The first one now collects information on "alimony from legal spousal support, child support, and child care support" (ALIM$$), while the second item asks about "advance child maintenance payments" (IACHM$$). More information about the $PEQUIV files and the new variables is available in the DIW Data Documentation 57 (PDF, 0.54 MB).

    5.2 PFLEGE dataset

    The PFLEGE file now includes two new additional variables. "FURTHER" gives the number of further persons requiring help in the household. This question has been asked since 2009. "CARECOST" represents the regular monthly costs for care that a household normally spends. This question has been asked since 2010.

    5.3 Dataset $PGEN

    We are now providing detailed data on educational degrees and training qualifications prior to joining the panel: life course questionnaires have been distributed since 2001 to collect data on apprenticeship occupation, type of qualification (e.g., diploma), and the field of study for those who have obtained a degree. Up to now, data from these open-answer questions were not included in the data release. From now on, however, these data will be released in coded form. The classifications used for the data from the individual questionnaire have been slightly modified in the process of these revisions. A more detailed description is available within the PGEN documentation.
    The new variables are

    FIELD$$   Field of tertiary education
    DEGREE$$  Type of tertiary degree
    TRAINA$$  Apprenticeship, two-digit occupation KldB92
    TRAINB$$  Vocational school, two-digit occupation KldB92
    TRAINC$$  Higher vocational school, two-digit occupation KldB92
    TRAIND$$  Civil servant training, two-digit occupation KldB92
    FDT_F$$   Data source FIELD, DEGREE, TRAIN

    6. Revised Variables

    6.1 $P dataset

    Name changes to the variables in the different classifications for occupation and sector in $P: the variables contained in the $P datasets are collected in alternate years from all respondents and from those individuals who changed occupations. Simultaneously, we also generate and distribute all the information on all years and all individuals in the $PGEN datasets. To more clearly distinguish the generated variables from the originally surveyed variables, and to establish a clear connection to the question number in the respective questionnaire, we have renamed the variables according to the following system:


    Old variable name   New variable name
    $IS88               $pXX_IS88

    Thus, ZIS88, for example, is now ZP29_IS88. The corresponding variables of all datasets from the individual questionnaire ($P) have been renamed, but not the variables recommended for use from the generated datasets ($PGEN, e.g., IS8809 from ZP).

    6.2 $HBRUTTO dataset

    The coding of the variables identifying the federal state ($BULA, Bundesland) in which a household was included is now consistent with the coding used in official statistics.

    $bula (old codings)        | $bula (new codings)
    0 Berlin                   |
    1 Schleswig - Holstein     | 1 Schleswig-Holstein
    2 Hamburg                  | 2 Hamburg
    3 Niedersachsen            | 3 Niedersachsen
    4 Bremen                   | 4 Bremen
    5 Nordrhein-Westfalen      | 5 Nordrhein-Westfalen
    6 Hessen                   | 6 Hessen
    7 Rheinl.-Pfalz, Saarl.    | 7 Rheinland-Pfalz
    8 Baden-Wuerttemberg       | 8 Baden-Wuerttemberg
    9 Bayern                   | 9 Bayern
                               | 10 Saarland
    11 Berlin (Ost)            | 11 Berlin
    12 Mecklenburg-Vorpommern  | 12 Brandenburg
    13 Brandenburg             | 13 Mecklenburg-Vorpommern
    14 Sachsen-Anhalt          | 14 Sachsen
    15 Thueringen              | 15 Sachsen-Anhalt
    16 Sachsen                 | 16 Thueringen
     A differentiation between East and West Berlin can still be achieved by a combination with $SAMPREG (Sample Region in $PPFAD).
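
    For users who need to harmonize older extracts, the recoding in the table above could be expressed as a lookup like the following Python sketch. The dictionary and function names are hypothetical. Two points are assumptions on our part: old code 0 (Berlin) is mapped to the new pooled Berlin code 11, and old code 7, which pooled Rheinland-Pfalz and Saarland, is left unmapped because it cannot be split into the new codes 7 and 10 without additional information.

```python
# Old $BULA code -> new $BULA code, read off the table above.
OLD_TO_NEW_BULA = {
    0: 11,    # Berlin (old code 0) -> Berlin (assumption: pooled with Berlin (Ost))
    1: 1, 2: 2, 3: 3, 4: 4, 5: 5, 6: 6,
    7: None,  # Rheinl.-Pfalz and Saarl. were pooled; no unique new code
    8: 8, 9: 9,
    11: 11,   # Berlin (Ost) -> Berlin
    12: 13,   # Mecklenburg-Vorpommern
    13: 12,   # Brandenburg
    14: 15,   # Sachsen-Anhalt
    15: 16,   # Thueringen
    16: 14,   # Sachsen
}


def recode_bula(old_code):
    """Return the new $BULA code, or None where the old code has no unique match."""
    if old_code < 0:
        # Negative values are missing codes and are passed through unchanged.
        return old_code
    # Raises KeyError for values that were not part of the old scheme (e.g. 10).
    return OLD_TO_NEW_BULA[old_code]


print([recode_bula(code) for code in (0, 7, 12, 16, -1)])
```

    Cases that return None would need a second source, such as regional information at a finer level, before they can be assigned to Rheinland-Pfalz or Saarland.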

    6.3. Minor bug fixes

    • In the BIOAGE17 file from data release v26, the variables classifying the preferred job reported in the youth questionnaire (byklas, bymps, byisco88, byegp, byisei, bysiops) were incorrectly coded and contained too many missing values. This bug has been fixed in the new data release v27.
    • Some households in Berlin (only in 2006) were wrongly classified according to the variable of settlement pattern ($GTYP) in the dataset GGKBOU. This bug has also been fixed.


    Data distribution v26 (Wave Z)

    The 2010 data distribution (data for years 1984-2009) has comprehensive improvements, additions, and modifications. For the most recent survey year 2009, it also provides the usual wave-specific data ZPBRUTTO, ZP, ZPKAL, ZPGEN, ZPAGE17, ZHBRUTTO, ZH, ZHGEN, ZKIND and YPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

    1. Beta version in 'long format'

    The SOEP data are being provided for the first time ever as a beta version in "long format" in addition to the standard data format. SOEPlong refers to a compressed form of the SOEP data: rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format).
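
    The idea of pooling wave-specific files into one long file can be illustrated with a small Python sketch. It is not the procedure used to build SOEPlong. The function names, the identifier column "persnr", and the variable names YSAT/ZSAT are hypothetical, and the mapping of single wave letters to survey years (A = 1984, ..., Z = 2009) is inferred from the wave/year pairs mentioned in these notes.

```python
def wave_to_year(wave: str) -> int:
    """Map a single wave letter to its survey year (A = 1984, ..., Z = 2009)."""
    wave = wave.upper()
    if len(wave) != 1 or not "A" <= wave <= "Z":
        raise ValueError("this sketch only handles single-letter waves A-Z")
    return 1984 + ord(wave) - ord("A")


def pool_to_long(wave_files, id_columns=("persnr",)):
    """Stack wave-specific rows into one list with a survey-year column.

    wave_files maps a wave letter to a list of row dicts. Identifier columns
    are kept as they are; all other names are assumed to carry the wave
    letter as their first character, which is stripped.
    """
    long_rows = []
    for wave in sorted(wave_files, key=wave_to_year):
        for row in wave_files[wave]:
            long_row = {"syear": wave_to_year(wave)}
            for name, value in row.items():
                if name in id_columns:
                    long_row[name] = value
                else:
                    long_row[name[1:]] = value
            long_rows.append(long_row)
    return long_rows


# Two hypothetical wave-specific files with one respondent each.
wide = {
    "Y": [{"persnr": 101, "YSAT": 7}],
    "Z": [{"persnr": 101, "ZSAT": 8}],
}
long_rows = pool_to_long(wide)
for row in long_rows:
    print(row)
```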

    2. New Subsample I
    As part of the recent SOEP innovations, fieldwork began in fall 2009 on a new subsample (Sample I). The subsample is currently being used to test the effect of different incentive strategies on participation in the SOEP, and it will become part of the innovation sample. See SOEPnewsletter 89 for more on this new sample.
    In four randomly assigned groups, the following strategies were used:

    1. SOEP standard incentives (one lottery ticket per respondent),
    2. Choice of either a lottery ticket or five euros per individual interview,
    3. Five euros per individual interview,
    4. Ten euros per individual interview.

    The data from the new Sample I has been included in the new release of SOEP data (SOEP, v26), but due to the particular features of the subsample, it does not have an integrated weighting framework with the rest of the SOEP samples. For subsample I we are conducting a mail survey of all non-participants in the four groups. Since this is the first wave of subsample I, we were not able to integrate the biographical information from Sample I into the existing biography files. The same applies to the biographical information in the dataset PPFAD, e.g. the variable MIGBACK is completely set to -2.

    Apart from that, the following additions and modifications have been made:

    3. New and Renamed Datasets 

    • Data on cognitive tests (COGDJ)
      For the first time, all available data on the cognitive tests of young people ("Denksport Jugend", DJ) are included in the SOEP data release. Since 2006, these tests have been given to young respondents (aged 16) the first time they participate in the SOEP survey.

    4. New Variables

    4.1 Dataset $HGEN
    Two new variables describing the quality of the dwelling:

    • EQPLIF$$ "Dwelling has an elevator"
    • EQPNRJ$$ "Dwelling has alternative energy source"

    4.2 Dataset $PEQUIV:

    • There is a new variable on additional child benefits together with the corresponding imputation flag variable (ADCHB$$ and FADCHB$$)

    4.3 Dataset $HBRUTTO - Calendar Year of Interview

    • For the first time, we now distribute a variable (ZDATUMY) describing the calendar year of the interview. Because of the additional Sample I (and the resulting longer fieldwork period), a few interviews were successfully completed in 2010.

    5. Revised Variables

    5.1 Datasets $HGEN  

    • The variables on household type TYP1HH$$ and TYP2HH$$ were completely revised and tested for intertemporal consistency.

    5.2 Datasets $KIND – KIDLONG

    • The variables were also completely revised and are now also provided in longitudinal form (KIDLONG) as well as in cross-sectional form in $KIND. This made it necessary to change the variable names in KIDLONG to be consistent over time.

    5.3 Datasets BIOMARSM/BIOMARSY  

    • The biographical data set on marital status was revised.

    5.4 Dataset BIOTWIN

    The dataset BIOTWIN contains 100 additional cases since wave Z. This considerable increase in case numbers is due to an adjustment in the data generation procedure: in contrast to the previous procedure, all siblings with an identical year of birth are considered twins if the information on the month of birth remains missing. This less restrictive procedure is based on the assumption that two separate births in a single calendar year remain rare occurrences. Nevertheless, the number of false positives in this group with a missing month of birth is likely to exceed the BIOTWIN average. Hence, a new value label was introduced for the variable INFOTWIN in order to flag these twin groups for the user (Code "6": Coverage since 2007, congruent year of birth, missing month; compare code "5": Coverage since 2007, congruent year & month of birth).
    In its current state (wave Z) the dataset BIOTWIN covers 250 sets of twins and 5 sets of triplets.

    [1] Twins - Not in 2006 (gen.)
    [2] Twins - 2006 (Answer Not Verifiable)
    [3] Twins - 2006 (Answer Refused)
    [4] Twins - 2006 (Answer Validated)
    [5] Twins - since 2007 (gen.)
    [6] Twins - since 2007 (gen.)
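
    The distinction between INFOTWIN codes 5 and 6 can be sketched as follows. This Python example is a simplified illustration under our own assumptions, not the actual generation code: the function name and the record layout (keys "family", "person", "birth_year", "birth_month") are hypothetical, and a sibling pair is given code 6 as soon as at least one month of birth is missing.

```python
from itertools import combinations


def infotwin_codes(siblings):
    """Flag sibling pairs born in the same year.

    Returns {person: code} with code 5 (same year and same month) or
    code 6 (same year, month of birth missing for at least one sibling).
    """
    codes = {}
    for a, b in combinations(siblings, 2):
        if a["family"] != b["family"] or a["birth_year"] != b["birth_year"]:
            continue
        months = (a["birth_month"], b["birth_month"])
        if None in months:
            code = 6  # less restrictive rule: month missing, year congruent
        elif months[0] == months[1]:
            code = 5  # year and month congruent
        else:
            continue  # same year but different known months: not flagged
        for person in (a["person"], b["person"]):
            # Keep the stricter code 5 if a person qualifies under both rules.
            codes[person] = min(code, codes.get(person, code))
    return codes


siblings = [
    {"family": 1, "person": 1, "birth_year": 1990, "birth_month": 3},
    {"family": 1, "person": 2, "birth_year": 1990, "birth_month": 3},
    {"family": 2, "person": 3, "birth_year": 1991, "birth_month": None},
    {"family": 2, "person": 4, "birth_year": 1991, "birth_month": 7},
    {"family": 3, "person": 5, "birth_year": 1992, "birth_month": 1},
    {"family": 3, "person": 6, "birth_year": 1992, "birth_month": 11},
]
print(infotwin_codes(siblings))
```

    The third family shows why code 6 carries a higher risk of false positives: had either month been missing, the two siblings born in January and November would have been flagged as twins.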


    5.5 Minor bugs fixed

    • Correction of MONTH08
    • Correction of a few cases in IMMIYEAR
    • Change in the variable names for questions 25 and 26 in YH and ZH

    Data distribution 2008 (Wave Y)

    The new dataset (Waves 1-25, 1984-2008) contains extensive improvements, additions, and modifications. It includes the usual wave-specific data YPBRUTTO, YP, YPKAL, YPGEN, YHBRUTTO, YH, YHGEN, YKIND, and XPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors).
    We now also provide, as a beta release, the data in a more user-friendly format called "SOEPlong". We announced this in SOEPnewsletter 80/2008 and thank all those who provided input on this issue. This new and preliminary version of the SOEP data in long format can be obtained upon request. It contains all data and thus can essentially already be used for final analyses. However, we suggest that only "power users" who would like to work with us to improve data management order this version, and we do not recommend the new format for inexperienced users. New SOEP users who want to work with the new format should at least be familiar with other panel datasets.

    The most important improvements in the new data distribution are listed below:

    1. New Datasets

    1.1 Dataset BIOAGE06
    The new data distribution contains the new file BIOAGE06. For the first time in 2008, it includes the information collected using the special mother-child questionnaire, usually from mothers of five- to six-year-old pre-schoolers. The data are thus on a birth cohort that was first “surveyed” in the year 2002/2003 with a special Newborn Questionnaire. The new data on pre-school-age children contain children’s height and weight, health, care situation, activities with and without the mother, and media usage. Detailed questions address the care situation. Furthermore, valid information is collected for the first time on the child’s personality (based on the “Big Five” personal traits indicator in the main questionnaire for adults) and socio-emotional behavior (surveyed with a modified version of the Strengths and Difficulties Questionnaire).

    1.2 Dataset MOVEDIST
    We provide a new dataset on changes of residence. Based on the geo-coordinates at block level, we provide information about the distance (in meters) between the former and the present residence. However, the information is only available for moves since 2000 and is NOT available on this DVD! We distribute these data together with data on the spatial planning regions (ROR) on an extra CD-ROM. You need an extended data distribution contract including a data protection concept if you want to use this kind of data. After signing your contract extension, you will receive these data on CD-ROM (at no additional cost).

    2 New Variables 

    2.1 Dataset PPFAD

    • MIGBACK / MIGINFO: MIGBACK provides time-invariant information on an individual’s migration background resulting from own and parental data. MIGINFO indicates the sources of the information used in order to provide users with highest possible transparency. A detailed description is available in the extensive biography documentation (see chapter on PPFAD).

    2.2 Dataset PFLEGE

    • PAY / STUFE: two new variables on paid care (PAY) and the care level (STUFE) according to the German compulsory long-term care insurance.

    2.3 Dataset PBIOSPE

    The data generation process has been updated completely but without changing the basic principles. Therefore, there are only a few barely discernible deviations in the main variables (due to slight changes in the consistency checks of the data). But there are a number of visible changes in the form of additional variables or additional values in already existing variables. A detailed description is available in our documentation on biography and life history data.

    2.4 Dataset BIOPAREN

    • BIO: origin of information is $LELA or $JUGEND
    • ALTER / VALTER / MALTER: age of respondent / father / mother, all at the time of the biography interview.
    • Attention: A bug was discovered in the dataset shortly after the DVD was completed. To update the information on parental religious affiliation, please see our site Known Bugs/Fixes.

    3 Revised Variables

    3.1 Dataset PWEALTH and HWEALTH
    In the year 2007, all individuals aged 17 and up were again surveyed on wealth, just as they were for the first time in 2002. These “raw” data were already part of the standard data distribution for Wave 24 and will be distributed with the upcoming data distribution in a file containing the data for 2002 and 2007 in “long format”: the file PWEALTH for individual data, and HWEALTH with data aggregated according to household context. Missing values due to item or partial unit non-response (e.g., missing interviews with individual household members in interviewed households) will be subjected to multiple imputations in complex procedures taking longitudinal information into account. Documentation on this is under preparation. An initial analysis of the new wealth data for 2002 and 2007 is provided in: Joachim R. Frick and Markus M. Grabka. 2009. Wealth Inequality on the Rise in Germany. Weekly Report 5 (10), 62-73.

    3.2 Dataset $PEQUIV

    3.3 Dataset HHRF/PHRF

    3.4 Dataset $PGEN

    • EMPLST$$: A new category has been added to this variable ("Employment status"). From 1998 on, the SOEP data contain information on working in a sheltered workshop for the disabled. Since these persons do not provide information on whether they work full-time, part-time, or on an irregular basis, the new category "sheltered workshop" has been included.

    3.5 Dataset $HGEN
    The domicile-related variables in the wave-specific $HGEN files have been completely revised. New additions include the full imputation of missing values (due to item non-response) for the housing-related variables number of rooms, heating costs, and gross rent excluding heating, as well as the newly generated variable on utility costs in addition to rent. Finally, “flag variables” show the imputation status, if relevant. Experienced SOEP users may also note the changes to various variable names in the file $HGEN.

    3.6 Dataset PPFAD

    • TODJAHR / TODINFO: To separate panel mortality from demographic reasons for dropping out of the SOEP sample, TNS Infratest carried out several studies to determine the current residence of panel dropouts, i.e., earlier respondents who no longer take part in the SOEP. This entailed locating 17,195 persons. These investigations identified 981 cases in which the dropout had died. In total, however, 3,791 deaths had been identified in the SOEP by 2008 (see also the documentation on the variables TODJAHR and TODINFO in the file PPFAD). Additional documentation is available in German from our fieldwork organization TNS Infratest (“Wiederbefragung von Panelausfällen”), along with an English-language summary.

    Data distribution 2007 (Wave X)

    The 2008 data distribution (1984-2007) provides, for the year 2007, the usual wave-specific data XPBRUTTO, XP, XPKAL, XPGEN, XHBRUTTO, XH, XHGEN, XKIND and WPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

    In the survey year 2006, a representative supplementary sample for all of Germany was added: refreshment sample H. Biographical background information was collected from respondents in sample H for the first time in 2007. These data have been fully integrated into all relevant biography files (BIOxxxx).

    As part of the SOEP innovation projects, TNS Infratest Sozialforschung conducted a postal survey in December 2006 among former SOEP panel members from households that had been classified as final refusals in 2001-2004. As a byproduct, we were able to change the information on year of birth from missing to a valid value for 21 of these persons (more information can be found in the executive summary of the TNS Infratest Methodenbericht).

    Furthermore the following additions and modifications have been made:

    A. New and Renamed Datasets

    In the 2006 survey year, for the first time, short cognitive tests were carried out with a subsample of the SOEP. The goal was to employ a robust set of instruments that could be administered easily by trained interviewers in just a few minutes. Close to 80% of all persons chosen for participation in the cognitive test provided valid answers. Thus, for the first time, the SOEP now contains indicators of cognitive potential for more than 5,500 persons, along with diverse educational information based on degrees and certifications. The first repeat of the test is planned for the 2010 survey year. Detailed documentation and selection analyses can be found in Schupp et al. (2008), Erfassung kognitiver Leistungspotentiale Erwachsener im Sozio-oekonomischen Panel (SOEP), DIW Berlin, Data Documentation 32.

    These two datasets replace the former dataset YPBRUTTO; however, this year both variants are available.

    Multiply imputed dataset on monthly net household income for the years 1996 to 2007. The dataset is stored in long format (long format: hhnrakt, svyyear, mj, also called mim format within Stata). Each item non-response on net household income was imputed 10 times. More information can be found in HGEN.pdf.

    B. New Variables 

    B.1 Dataset XPBRUTTO

    • XEWSTATU: Proxy information on non-responding persons regarding their labor force status in households with partial unit non-response.


    B.2 Dataset $PEQUIV

    • P11101$$: Copy of the wave specific variables on overall life satisfaction.

    B.3 Dataset $HGEN

    • I_HINC$$: Multiply imputed version of HINC$$, the monthly net household income. Imputations 1-5 are available in wide format in $HGEN (only 1996-2007); all 10 generated imputations are available in long format in an extra dataset called MIHINC. Additional information can be found in HGEN.pdf.
    • FHINC$$: Imputation flag for I_HINC$$, 0 means not imputed and 1 otherwise.

    C. Revised Variables

    C.1 In the Dataset $PKAL

    • $P2D03 + $P2E03: In the waves U-W (years 2004-2006), an incorrect "does not apply" missing (-2) was corrected to a "no answer" missing (-1) for some cases.

    C.2 In the Dataset HHRF/PHRF

    • WPHRF*: All weighting factors for the year 2006 are now based on microcensus benchmark data from 2006.

      However, the weighting factors for the year 2007 are also based on the (newest available) microcensus benchmark data from 2006; they are therefore only provisional with regard to the figures given for households and individuals in Germany.

    • VHHRF + VHHRF1: One household from sample G was corrected and set to 0.


    C.3 In the Dataset $PGEN

    • LFS$$: The variable "labor force status" has been improved over all waves with respect to the accuracy of classifying individuals as "non-working and older than 65" (category 2). Now, the information on the month of birth of a person is used in order to determine whether the person was older than 65 at the time of the interview.
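
    Why the month of birth matters can be seen in a small Python sketch. It is only an illustration: the function names are hypothetical, and treating "older than 65" as strictly more than 65 * 12 elapsed months (with no day-of-month information) is our assumption, not the documented rule.

```python
def months_between(birth_year, birth_month, interview_year, interview_month):
    """Number of calendar months elapsed between birth and interview."""
    return (interview_year - birth_year) * 12 + (interview_month - birth_month)


def older_than(age_years, birth_year, birth_month, interview_year, interview_month):
    """True if strictly more than age_years * 12 months have elapsed."""
    elapsed = months_between(birth_year, birth_month, interview_year, interview_month)
    return elapsed > age_years * 12


# A person born in June 1941 and interviewed in 2006: the difference in
# calendar years is 65 in both cases, but the month changes the result.
print(older_than(65, 1941, 6, 2006, 3))  # interviewed in March: False
print(older_than(65, 1941, 6, 2006, 9))  # interviewed in September: True
```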


    D. Error Updates

    D.1 In the Dataset VH and WH

    • We have corrected the value labels for the variables indicating the owner of the dwelling (VH27 and WH27). Please note the relevant corrections in the table below.

      Variable Label: Owner Of The Dwelling

      -2 | Does not apply           | Does not apply
      -1 | No answer                | No answer
       1 | Self Owned Res. Property | Local Govt. Apt.
       2 | Local Govt. Apt.         | Co-Operative Apt.
       3 | Co-Operative Apt.        | Company Apt.
       4 | Company Apt.             | Private Owner
       5 | Private Owner            | Do Not Know

    D.2 In the dataset $PGEN

    • EGP$$: The variable "Erikson and Goldthorpe Class Category" (international socio-economic index of occupational status) has been corrected with respect to the assignment of individuals to category (18) "not working - pensioner". Up to now, all pension recipients, i.e. recipients of retirement pension and recipients of widow's/orphan's pension have been erroneously classified as "not working - pensioner" if none of the other categories applied. In the corrected generation of the EGP$$ variable, which applies to all waves, non-working persons are only assigned to this category if they are recipients of a retirement pension or if they are recipients of orphan's/widow's pension AND are older than 60 years. Moreover, if there is missing information on pension receipt, additional information from ARTKALEN (retrospective information from the activity calendar for the previous year) is used in the generation process to determine if a person was in retirement or early retirement ("Vorruhestand") at the time of the interview. All other non-working persons are assigned to category (-2) "does not apply" as long as they are not registered as unemployed (category 15).
    • STIB$$: The same problem of misclassification of individuals to the category "pensioner" (13) applied to the variable for the "Occupational position", and has been corrected for all waves in the same way as for EGP$$ .
    • NACE$$: The variable for the "two-digit NACE Industry - Sector" had several inconsistencies with respect to the labeling. In particular, the labels for code (90) "Sewage And Refuse Disposal, Sanitation And Related" and code (95) "Private Households With Employed Persons" had to be swapped. Some other labels were not accurate, and have been stated more precisely for all waves.
    • IS88$$, ISEI$$, MPS$$, SIOPS$$, KLAS$$, EGP$$: The questions underlying these variables are not asked of all employed persons every year. In the survey years 1985, 1986, 1987, 1988, 1990 (West), 1992 (West), 1994, 1996, 1999, 2001, 2003, 2005, and 2006, only those employed persons who changed jobs and first-time respondents were asked to provide up-to-date information. Hence, in years with a partial survey, these variables should contain the previous year's information for all employed persons without a job change who did not update the information on their current occupation. However, for some individuals, the previous year's data were mistakenly not used. This mistake was corrected by regenerating these variables for all waves in an accurate and consistent way.
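
    The carry-forward logic described in the last item can be sketched as follows. This Python example is a simplified illustration, not the actual generation code. The function name, the record layout, and the occupation codes in the example are hypothetical, and the use of -2 for non-employed persons and -1 where no information is available is an assumption borrowed from the missing codes used elsewhere in these notes.

```python
NOT_APPLICABLE = -2  # assumption: person not employed
NO_ANSWER = -1       # assumption: employed, but no usable information


def fill_occupation(history):
    """Carry the previous year's code forward for employed persons
    without a job change who did not report an update.

    history: list of dicts sorted by year with keys 'year', 'employed',
    'job_change', and 'reported' (a code, or None if not asked/answered).
    """
    filled = {}
    previous = None
    for obs in history:
        if not obs["employed"]:
            value = NOT_APPLICABLE
            previous = None
        elif obs["reported"] is not None:
            value = obs["reported"]
            previous = value
        elif not obs["job_change"] and previous is not None:
            value = previous  # partial-survey year: reuse last year's code
        else:
            value = NO_ANSWER
            previous = None
        filled[obs["year"]] = value
    return filled


history = [
    {"year": 2004, "employed": True, "job_change": False, "reported": 2411},
    {"year": 2005, "employed": True, "job_change": False, "reported": None},
    {"year": 2006, "employed": True, "job_change": True, "reported": 3433},
    {"year": 2007, "employed": False, "job_change": False, "reported": None},
]
print(fill_occupation(history))
```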

    Data distribution 2006 (Wave W)

    The 2007 data distribution (1984-2006) provides, for the year 2006, the usual wave-specific data WPBRUTTO, WP, WPKAL, WPGEN, WHBRUTTO, WH, WHGEN, WKIND and VPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

    In the survey year 2006, a representative supplementary sample for all of Germany was added: refreshment sample H. Detailed information on the integration of this sample and additional changes in both files using weighting and extrapolation factors can be found below (see item 2).

    A further important change is the introduction of a new survey instrument for first-time respondents at the age of 17 years. These persons now receive an expanded youth questionnaire, which provides current information as a supplement to the biographical data already collected, thus rendering the previous individual questionnaire used for this group obsolete. This also means that the survey population for the standard individual questionnaire (stored in the files $P) changed slightly: 17-year-olds have not been included since survey year 2006. The first-time surveying of sample H constitutes an exception: here the 17-year-olds were again surveyed with the individual questionnaire, since the biographical survey in new subsamples starts only with the second wave. The revised $NETTO variables and the file $PAGE17 are of interest in this context (see below).

    The educational variables in the generated datasets ($PGEN) have been revised: the integration of vocational qualifications attained abroad has been improved, and the corresponding variables have been subjected to extensive testing for consistency. These variables will be described in greater detail below.

    The information on twins in SOEP was validated by a special survey of "potential" twins, and is integrated into the dataset BIOTWIN.

    This year as in all previous years, the variables contained in the file WPEQUIV (wave 2006) relating to previous year's income take into account the various structural changes in the tax and transfer system, using these as part of the basic informational framework for generating and simulating annual income. Not only do the changes in the 2005 tax rate (reduction of the top tax rate, personal exemption) play an important role here but also the new guidelines contained in the Old Age Income Act (Alterseinkünftegesetz). The introduction of Unemployment Benefit II (Arbeitslosengeld II) also plays an important role, along with the extensive changes in the transfer system it entails (Social Security, Rent Subsidy, etc.). The generated information on (previous) year's income from SOEP survey year 2006 has thus been subject to thorough testing for internal and external consistency.

    This year, the data is being distributed for the first time on DVD. This means that the language of variable and value labels can be chosen even more easily: right in the SOEP data installation program. If you install the data in Windows Vista using our setup program, please follow the installation instructions on DVD.

    The following additions and modifications have been made:

    New and Renamed Datasets 2006  

    From 2007 on, persons who have reached the age of their first individual SOEP interview (17 years) are not given the usual individual questionnaire but a special youth questionnaire. Wave-specific information not contained in the biographical data or other generated datasets (like $PGEN, HEALTH) are given in the dataset $PAGE17. Youth questionnaire respondents are identifiable with the help of the new $NETTO code "17" (see also the changes in the $NETTO variables in PPFAD). More information can be found in the biography documentation on our homepage and on the new DVD.

    Starting in 2007, the information on SOEP sample design previously compiled in the dataset VARIANZ (Spiess 2001) is being disseminated in a revised and amended dataset DESIGN. Preliminary documentation can be found in designdoku.pdf on our homepage and on the new DVD.

    Starting with 2002, the SOEP health module in the individual questionnaire has been revised and put on a two-year replication period. In the HEALTH file, users find the generated SF-12 variables (measuring health-related quality of life) as well as variables on height and weight with imputation flags and a user-friendly, longitudinally checked generated variable of the Body Mass Index (BMI). More information can be found in health.pdf on the SOEP homepage or on the new DVD.

    The wealth data collected in 2002 were thoroughly revised and checked for inconsistencies. The data are now provided in two (multiply) imputed datasets for the individual and the household level, with the corresponding flag variables for identification of the imputed values. The two datasets also each contain a generated variable on "net wealth" (see SOEPpapers No. 18).

    Interviewer Survey
    The interviewer dataset, available up to 2006 only as a "stand-alone" version, is now integrated into the standard data distribution under the name INTVIEW and thus provided in the different software formats (SAS, SPSS, Stata).

    Cross-Sectional Weighting Scheme 2006  

    With the 2006 data distribution, important changes have been made in the cross-sectional weights. They are described in detail (in German) in the DIW Data Documentation 22.

    1. Types of Weighting Factors Redefined
    Each cross-sectional weight is designated $xHRFy. Here, $ represents the wave identifier, x the differentiation between households (x = H) and persons (x = P) and y an additional identifier that describes the type of weighting factor.

    • $xHRF are the weighting factors that have been used since the beginning. They contain all samples with the exception of high-income sample G.
    • $xHRF1 are the standard weighting factors, where, in addition to the exclusion of sample G, the weights of new subsamples have been set to zero. Why? Using a complex survey construct, respondents in the first waves showed "worse" answering behavior than respondents in later waves (for example regarding life satisfaction and annual income). Sample C is an exception: respondents in the former GDR in 1990 did not exhibit the typical problems of first-time respondents (that is, GxHRF and GxHRF1 are identical).
      For standard cross-sectional analyses, we recommend the use of the $xHRF1 as a standard weighting factor. In this way, the information from the first waves of the different subsamples is automatically left out.
    • $xHRFALL include all available samples.
    • $xHRFD, $xHRFF and $xHRFG designate the isolated weights for immigrant sample D, for refreshment sample F and for high-income sample G.
    • The variable $PHRFXX in PHRF and HHRF has been deleted.
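
    The naming convention $xHRFy can be turned into a small helper, sketched here in Python. The function and dictionary names are hypothetical; the type suffixes and their meanings are taken from the list above.

```python
# Type suffix y -> short description, following the list above.
WEIGHT_TYPES = {
    "": "original weights, all samples except high-income sample G",
    "1": "standard weights, sample G excluded and new subsamples set to zero",
    "ALL": "all available samples",
    "D": "immigrant sample D only",
    "F": "refreshment sample F only",
    "G": "high-income sample G only",
}


def weight_name(wave: str, unit: str, kind: str = "1") -> str:
    """Build a cross-sectional weight name of the form $xHRFy.

    wave: wave identifier, e.g. 'W'; unit: 'H' (households) or 'P' (persons);
    kind: one of the keys of WEIGHT_TYPES ('1' is the recommended standard).
    """
    unit = unit.upper()
    kind = kind.upper()
    if unit not in ("H", "P"):
        raise ValueError("unit must be 'H' or 'P'")
    if kind not in WEIGHT_TYPES:
        raise ValueError(f"unknown weight type: {kind!r}")
    return f"{wave.upper()}{unit}HRF{kind}"


print(weight_name("W", "P"))         # WPHRF1
print(weight_name("W", "H", "ALL"))  # WHHRFALL
```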

    2. Modifications to the External Information Used in the Weighting Scheme
    Since the year 2005, the Federal Statistical Office no longer provides data differentiating between East and West Berlin. This has led to minor retrospective changes in the external information on the number of households since survey year 2005.

    3. New Refreshment Sample H
    In 2006, SOEP expanded to include another sample: refreshment sample H. The new households, which are representative for Germany as a whole, were also included in the weighting scheme. The integration of sample H is currently underway. Tests are still being conducted to determine if and how sample H can be further adapted to the external information. This is not a serious problem since in any case, for descriptive analyses we recommend the use of the weighting factors WxHRF1, which exclude sample H.

    4. Weighting Factors are Based on Benchmark Data from the 2005 Microcensus
    The weighting factors for the year 2006 are based on microcensus benchmark data from 2005; they are therefore only provisional with regard to the figures given for households and individuals in Germany. Please address any questions to .

    BIOAGE01 and BIOAGE17 2006  

    1. BIOAGE01
    Four new variables on pregnancy status have been generated, based essentially on the month of the interview from $P and the month and year of the child's birth, as well as the duration of pregnancy in weeks from BIOAGE01.

    BCPREGY 'Mother: pregnant at the time of individual interview wave ($)?'
    Value Labels:
    2002 | Pregnant at Time of Personal Interview 2002
    2003 | Pregnant at Time of Personal Interview 2003
    2004 | Pregnant at Time of Personal Interview 2004
    2005 | Pregnant at Time of Personal Interview 2005
    2006 | Pregnant at Time of Personal Interview 2006
    2007 | Pregnant at Time of Personal Interview 2007

    BCPREGMO 'Mother: estimated month of pregnancy at the time of individual interview, wave($)'
    Value Labels:
    1 | First Month of Pregnancy
    2 | Second Month of Pregnancy
    3 | Third Month of Pregnancy
    4 | Fourth Month of Pregnancy
    5 | Fifth Month of Pregnancy
    6 | Sixth Month of Pregnancy
    7 | Seventh Month of Pregnancy
    8 | Eighth Month of Pregnancy
    9 | Ninth Month of Pregnancy
    10 | Last Month of Pregnancy or after Birth

    Furthermore the beginning and end of pregnancy are also available as spell data. Analogously to BIOMARSM, for example, we start counting with month 1 (January 1983), such that December 2007 is month 300. The data are generated based on month of birth and duration of pregnancy in weeks from BIOAGE01.

    PREGBEGM 'Spell - Month beginning of pregnancy / conception (1 = Jan 1983)'

    PREGENDM 'Spell - Month end of pregnancy / Birth (1 = Jan 1983)'
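
    The month counting used for these spell variables can be reproduced with a short Python sketch. The function names are hypothetical; the only rule used is the one stated above, namely that January 1983 is month 1.

```python
def spell_month(year: int, month: int) -> int:
    """Convert a calendar year and month to the spell month index (1 = Jan 1983)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - 1983) * 12 + month


def calendar_month(index: int):
    """Convert a spell month index back to (year, month)."""
    if index < 1:
        raise ValueError("index must be 1 or greater")
    return 1983 + (index - 1) // 12, (index - 1) % 12 + 1


print(spell_month(1983, 1))   # 1
print(spell_month(2007, 12))  # 300
print(calendar_month(300))    # (2007, 12)
```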

    2. BIOAGE17
    You will find detailed information on the structure and the content of the dataset in the documentation of the biographical data on our SOEP homepage or the DVD.

    $HGEN 2006

    In addition to the Bundesland (federal state) variable, starting this year, the corresponding NUTS (Nomenclature of Territorial Units for Statistics) Level 1 variable is also provided. This variable is generally identical with $BULA in $HBRUTTO but without pooling Rheinland-Pfalz/Saarland (from 2000 on) and without differentiating between East and West Berlin.

    $PGEN 2006 

    1. New Variables

    A variable for identification of job change was generated to supplement ERWTYP$$ (and eventually to replace it). The categories for this variable are independent of whether the information was obtained in a first-time or a subsequent interview. For respondents to a subsequent interview, JOBCH$$ refers to job changes since the last interview, and for first-time respondents, it refers to job changes since the beginning of the previous year. Respondents who started their first job and respondents who made a job change are reported separately. In contrast to ERWTYP$$, JOBCH$$ has been subjected to a check for longitudinal consistency. Cases showing inconsistencies, such as duplicate entries of the same job change in two subsequent interviews, have been corrected.
    Value Labels:
    1 | Not Employed
    2 | Employed No Change
    3 | Employed No Info If Change
    4 | Employed With Change
    5 | First-Time Employed

    2. Revised Variables
    For the years 1990 and 1991, values for job tenure are now provided for sample C (East) as well. Given the potentially limited comparability due to the East German transformation process, this data should be handled with particular care.

    Job tenure has been tested for longitudinal consistency due to repeated evidence of inconsistencies. Cases that proved longitudinally inconsistent were corrected using the following procedure:

    1. Start of employment at current job as stated in the respondent's first survey is generally given precedence, and is carried on in subsequent years if no change of job occurred or the respondent did not take a new job after a break in employment.
    2. In the case of a change of job (change of employer / change to self-employment) current data on the time of job change is used and carried on in subsequent years.
    3. In the case where a respondent has taken up a new job after a break in employment, we assume that he or she returned to the old employer if the current data show a start of employment prior to the last survey year. In this case, we do not use the start of employment provided in the current survey but the start of employment from the last survey. If the current data show a start of employment since the last survey year, however, we assume that the respondent changed employer since the previous survey, and update the start of employment using the data from the current survey.

    From the longitudinally consistent start of employment with current employer, we determine the duration of job tenure. When a respondent who started working again after a break can be assumed to have returned to his or her former employer, the full duration of job tenure is taken. The period of the break in employment is then not subtracted, potentially resulting in an implicit overestimation of firm-specific human capital.
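    The three correction rules above can be sketched as follows. This is a simplified illustration, not the SOEP generation code: it works with years only, and the class, field, and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Interview:
    survey_year: int
    reported_start: Optional[int]      # start year of current job as reported
    job_change: bool = False           # change of employer / to self-employment
    new_job_after_break: bool = False  # took up a job after a break in employment


def consistent_start_years(history: List[Interview]) -> List[Optional[int]]:
    """Return one longitudinally consistent start year per interview."""
    result: List[Optional[int]] = []
    start: Optional[int] = None
    previous_survey_year: Optional[int] = None
    for obs in history:
        reported = obs.reported_start
        if start is None:
            # Rule 1: the first available report takes precedence.
            start = reported
        elif obs.job_change and reported is not None:
            # Rule 2: a job change resets the start to the current report.
            start = reported
        elif obs.new_job_after_break and reported is not None:
            # Rule 3: only a start since the last survey year counts as a new
            # employer; an earlier start is read as a return to the old one.
            if previous_survey_year is not None and reported >= previous_survey_year:
                start = reported
        # Otherwise the established start is carried forward unchanged.
        result.append(start)
        previous_survey_year = obs.survey_year
    return result


def tenure_years(history: List[Interview]) -> List[Optional[int]]:
    """Tenure = survey year minus consistent start; breaks are not subtracted."""
    starts = consistent_start_years(history)
    return [None if s is None else obs.survey_year - s
            for obs, s in zip(history, starts)]


example = [
    Interview(2000, 1995),
    Interview(2001, 1996),                            # inconsistent report
    Interview(2002, 2002, job_change=True),
    Interview(2004, 2001, new_job_after_break=True),  # return to old employer
    Interview(2005, 2005, new_job_after_break=True),  # new employer
]
print(consistent_start_years(example))
print(tenure_years(example))
```

    In the example, the inconsistent report in 2001 is overridden by the first report, and the 2004 interview keeps the 2002 start because the reported start lies before the previous survey year.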

    Since 1999, the required job training variable has distinguished between studies at universities and at technical colleges. Different categories are therefore now used for the years up to 1998 and for the years since 1999: up to 1998 the two qualifications share a single category, whereas since 1999 each has a category of its own. Furthermore, technical colleges and technical schools are now designated separately.
    AUSB$$ 'required job training'
    Value Labels:
    1 | No Training
    2 | Introduction to Job
    3 | On-The-Job Training
    4 | Courses
    5 | Vocational Training
    6 | Technical School, Engineering (East) 1990-96
    7 | Technical College or University, up to 1998
    8 | Technical College, since 1999
    9 | University, since 1999

    For waves U, V, and W, values for Wegener's Magnitude Prestige Scale have been added for respondents without a household interview ($NETTO=19).

    For the employment type variable, the old categories have been maintained but due to the common value for first-time job holders and those who have made a job change, the label for this category has been changed. Thus, the label 'employed, with change or first time employed' is now applied to the value 6.

    3. Update Educational Variables
    Thanks to our users, an error was identified in the generation of the educational variables in $PGEN, which had crept in some time ago in the process of retrospective generation for the years 2000 and 2001 and continued on since then. The error was in the variable $PBBIL02, and consisted in assigning foreign university degrees too high a value. The error came about through the integration of the variables $PBBILA and $PBBIL02 in these two years. All educational degrees have therefore now been generated again retrospectively for the years 2000 to 2006. The resulting variables $BILZEIT, ISCED$$ and CASMIN$$ have also been updated retrospectively from 2000 on.

    PPFAD 2006 

    Revision of the $NETTO Codes

    With this year's wave W (23rd survey wave), 2006, the compilation of data on the survey population has changed fundamentally. Previously, an individual interview was carried out with all household members above the age of 16. As of 2006, the regular individual interviews based on the standard adult questionnaire are introduced one year later when household members reach the age of 18. Seventeen-year-olds instead receive an expanded youth questionnaire in their first year as SOEP respondents. (This applies to the old samples A-G; for the new sample H, distribution of this youth questionnaire will start next year, while this year's 17-year-olds have received the regular individual questionnaire, in line with the old system).
    This means that we now have two instruments instead of one to obtain data on respondents: the individual and the youth questionnaire. To ensure a consistent differentiation over time, it will therefore be necessary either to include the youth population of the current year or to increase the age limit for all previous years.
    The newly revised $NETTO variable assists retrospectively in both differentiations for the entire survey period. The connection between survey population and survey instrument can be retraced with the help of the variable $NETTO in PPFAD or $HNETTO in HPFAD. As a result of the change in the survey population as well as the expansion of the survey instrument to include detailed information on biographical contexts, the corresponding variable $NETTO in PPFAD has been fundamentally revised and is now provided as a two-digit variable. To ease the transition to the new variable, the old one-digit variable is still provided as well under a different name $NETOLD; the variable $HNETTO in HPFAD is unaffected by this and remains unchanged.
    Value Labels:
    10 | Respondent Completed Interview
    11 | Individual Questionnaire
    12 | Individual Questionnaire and Biography
    13 | Individual and Youth Questionnaire
    14 | Individual and other Questionnaires
    15 | Individual Questionnaire and Experiments, Tests
    16 | Individual Questionnaire, First-Time Respondent, Age 17
    17 | Youth Questionnaire, First-Time Respondent, Age 17
    19 | Individual Questionnaire without Household Interview

    20 | Children in Household Interviewed ($KIND)
    21 | Children with Mother-Child Questionnaire I, Age 0-1
    22 | Children with Mother-Child Questionnaire II, Age 2-3

    30 | Persons in successfully interviewed household without Individual Interview
    31 | Completed Gap Interview ($LUECKE)
    32 | Completed Biography Questionnaire
    33 | Successful Youth Questionnaire
    34 | Successful Tests and Experiments

    60 | Only Questionnaire without Individual or Household Interview
    61 | Gap Interview without household reference
    62 | Gap Interview with drop out
    70 | Only Participation in Tests, Experiments, etc.

    80 | Individual did not withdraw from panel population
    81 | Previous respondent lacking current information
    89 | Repatriate - (was Drop Out)

    90 | Individual Dropouts $YPBRUTTO
    91 | Moved abroad
    99 | Died

    In the new variable $NETOLD, the old $NETTO code can still be used. Seventeen-year-olds who filled out either a youth questionnaire (n=307) or an individual questionnaire (sample H, n=31) are each coded with the value 1. As a result, the selection on (WNETTO == 1 | WNETTO == 5) is not identical with the population in WP.
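    Because the revised $NETTO codes are two-digit, the first digit can be used to separate broad groups of cases. The sketch below is an illustration only: the group descriptions are informal summaries of the value labels listed above, and the function names are hypothetical.

```python
# Informal summaries of the value-label blocks above (not official labels).
NETTO_GROUPS = {
    1: "respondent with completed interview",
    2: "child in interviewed household",
    3: "household member without individual interview",
    6: "questionnaire without individual or household interview",
    7: "tests or experiments only",
    8: "in panel population without current interview",
    9: "dropout, moved abroad, or deceased",
}


def netto_group(code: int) -> str:
    """Map a two-digit $NETTO code to its broad group via the tens digit."""
    return NETTO_GROUPS.get(code // 10, "unknown code")


def is_respondent(code: int, include_youth: bool = True) -> bool:
    """True for codes 10-19; code 17 (youth questionnaire only) is optional."""
    if not 10 <= code <= 19:
        return False
    return include_youth or code != 17


print(netto_group(12))                          # respondent with completed interview
print(is_respondent(17, include_youth=False))   # False
```

    The include_youth switch mirrors the two differentiations described above: either the 17-year-olds with a youth questionnaire are counted as respondents, or they are excluded.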

    $PEQUIV 2006 

    1. New Variables
    ALG2$$: Sum of all transfers from Unemployment Benefit II (Arbeitslosengeld II) received by the household.
    FALG2$$: Flag to identify the imputation of Unemployment Benefit II (ALG2$$).
    IDEMY$$: Sum of indemnity payments received in the previous year.
    FDEMY$$: Flag to identify the imputation of indemnity payments (IDEMY$$).
    ITRAY$$: Sum of commuting and travel grants received in the previous year.
    FTRAY$$: Flag to identify the imputation of commuting and travel grants (ITRAY$$).

    2. Revised Variables
    The variable (rental value of personally used living space = imputed rent) has been generated up to now only for persons living in owner-occupied housing. In line with recent research findings and also European Community guidelines for the generation of imputed rent in EU-SILC, this fictitious income advantage is now generated for persons in rental households as well who claim to pay below-market rental prices. These include people in rent-free housing, in socially subsidized housing, and in rental properties offered at a special rate (company dwellings, apartments provided by relatives at reduced rent, etc.).

    W11101$$ and W11102$$
    Due to the changes to the weighting factors in the files PHRF and HHRF, the variable W11101$$ now contains the individual weighting factor $PHRF1 (from the file PHRF) and the variable W11102$$ now contains the household weighting factor $HHRF1 (from the file HHRF).
    First-time SOEP respondents show a significantly higher rate of item non-response in their first wave, which cannot be corrected adequately through imputation. For this reason, these two weights do not take into account the first wave of each new SOEP subsample. Furthermore, high-income subsample G has been excluded from the weighting scheme in order to prevent structural breaks in the analysis of income with vs. without this subsample. These two weighting variables are thus particularly well suited to a consistent time series of income inequality analysis.

    The variable W11105$$ now contains the individual weighting factor $PHRFALL (from the file PHRF). This weighting variable takes into account all SOEP subsamples.

    The content of the variable E11105$$ is now based on the ISCO88 International Standard Classification of Occupations.

    E11106$$ and E11107$$
    The variables E11106$$ and E11107$$ now provide information on sector affiliation in the form of a one- or two-digit number according to NACE, the Statistical Classification of Economic Activities in the European Community.

    3. Deleted Variable
    W11106$$ 'HH-Weight immigrant sample'


    With wave W, the birth biographies of men (BIOBRTHM), like those of women (BIOBIRTH), include not only the year of birth (KIDGEB[n], with n = 1...15) but also the month of birth for each child (KIDMON[n]). This birth month is identical with the child's birth month given in PPFAD.


    In 2006, a separate survey was carried out in all households with twins. This twin survey had the goal of validating the data on all twins in SOEP and gaining new information. The following variables have been changed or added in BIOTWIN as a result:

    The variable BIOMONOZ differentiates between identical and fraternal twins based on a question asked to first-time respondents. This information used to be obtained through a question asking whether the twins were of the same or different sexes. New codes have been introduced for the variable BIOMONOZ to reflect the improved information available. The values are thus no longer compatible with those from prior to wave W contained in variable BIOMONOZ in the dataset BIOTWIN.

    The variable INFOTWIN has been introduced. This variable tells whether information on twins was given in the 2006 twin survey, whether the information was derived from previously existing SOEP data, and whether previously existing data on the twins coincides with the results of the twin survey.


    The variable "Erikson and Goldthorpe Class Category" (International Socio-Economic Index of Occupational Status) has been corrected with regard to the categorization of freelance academics, who were previously grouped together with the self-employed (values of 5 or 6). The corrected generation process assigns academic freelancers to the upper service class, which corresponds to a value of 1.


    Data distribution 2005 (Wave V)

    The 2006 SOEP data distribution (1984-2005, Waves A-V) includes the usual wave-specific data VPBRUTTO, VP, VPKAL, VPGEN, VHBRUTTO, VH, VHGEN, VKIND and UPLUECKE, as well as updated versions of all datasets with a longitudinal component (spell data, biographical data, and weights).

    The first CD-ROM contains, as usual, all SOEP data with variable labels and value labels in German, and the second contains all SOEP data with variable labels and value labels in English.

    Please also note the following improvements and changes:

    New and renamed datasets 2005 

    With the current data distribution, we renamed all SOEP datasets based on age-specific biographical questionnaires (e.g., "Mother and Child") in a more consistent manner. Since all these datasets are saved in long format, the names now start with "BIOAGE" and a two-digit suffix. This suffix gives the maximum age of the individuals in question during the survey year.

    New name for the dataset previously known as BIOCHILD (based on the questionnaire for mothers with a newborn child below the age of 15 months).

    New dataset based on mother-and-child questionnaire for mothers with a child between the ages of 2 and 3 years. For further information, please see the biographical data documentation.

    New name for the dataset previously known as BIOYOUTH (based on a survey of adolescents between 16 and 17 years old).

    Weighting 2005

    The 2005 cross-sectional weights are provisional; an update of VPHRF and VHHRF will be released in fall 2006.

    The wave-specific projection and weighting variables will be adjusted annually to external official data to ensure the accuracy of marginal distributions on age, sex, household size and nationality. The source of the data is the German Federal Statistical Office's official microcensus. From 2005 on, the data on Berlin will no longer be reported separately for the areas comprising former West Berlin / East Berlin; rather, Berlin will be considered part of East Germany. As a consequence, the data required to adjust our weights to the official marginal distributions will not be available before fall 2006.

    To prevent this from causing a delay in the distribution of the SOEP data up to Wave V (2005), the weights (VPHRF* and VHHRF*) have been adjusted to the data used for Wave U (2004).

    From our experience, there is a very low deviation in the benchmark data over the years (the new definition for West Berlin / East Berlin being one exception). Please keep in mind the provisional nature of the weighting scheme, and indicate this explicitly in any publications using the weights for Wave V. We will inform you as soon as the final version, based on the 2005 microcensus data, becomes available via the SOEP NEWSLETTER and listserver.

    $HGEN 2005 

    The adjusted screener (AHINC$$) is now available for all waves (Exception: Sample C in 1990/1991).  

    $PGEN 2005 

    ALLBET$$ (new)
    Raw categories for the size of the company. A consistent variable over all waves for the size of the company ("least common denominator" of the variable BETR$$).


    1.  "less than 20"
    2. "20 to 200"
    3. "200 to 2000"
    4. "2000 and above"
    5. "Self-employed with no other employees"

    BETR$$ (revised):

    The variable BETR$$ now has eleven instead of nine categories. The reason is the more detailed questions from Wave V onwards. The old category "5 to 20 employees" is now split into two categories ("5 to 10 employees" and "11 to 20 employees").

    The new categories are:

    1. "less than 5"
    2. "GE 5 LE 10"
    3. "11 LT 20"
    4. "up to 1990: LT 20"
    5. "1991-2004: 5 LT 20"
    6. "GE 20 LT 100"
    7. "GE 100 LT 200"
    8. "up to 1998: GE 20 LT 200"
    9. "GE 200 LT 2000"
    10. "GE 2000"
    11. "Self-employed without employees"

    TIP: The variable ALLBET$$ in the dataset $PGEN offers consistent data on company size throughout all waves of the SOEP, although with fewer categories in a less detailed classification.
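    The relationship between the two variables can be illustrated with a lookup table. The mapping below is inferred from the category labels listed above and is an assumption, not the documented SOEP generation rule; the function name is hypothetical, and passing negative (missing) codes through unchanged is likewise an assumption.

```python
# Assumed mapping from the eleven BETR$$ categories to the five ALLBET$$
# categories, inferred from the labels only.
BETR_TO_ALLBET = {
    1: 1, 2: 1, 3: 1, 4: 1, 5: 1,  # all variants of "less than 20"
    6: 2, 7: 2, 8: 2,              # all variants of "20 to 200"
    9: 3,                          # "200 to 2000"
    10: 4,                         # "2000 and above"
    11: 5,                         # self-employed without employees
}


def collapse_company_size(betr: int) -> int:
    """Collapse a BETR$$ code to the coarser ALLBET$$ scheme."""
    if betr < 0:
        return betr  # assumption: negative missing codes are kept as they are
    return BETR_TO_ALLBET[betr]  # raises KeyError for undefined codes


print([collapse_company_size(c) for c in range(1, 12)])
# [1, 1, 1, 1, 1, 2, 2, 2, 3, 4, 5]
```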

    EMPLST$$ (new):
    Employment Status. A consistent variable over all waves to differentiate employment status (in addition to the variable LFS$$, which differentiates non-employed persons).


    1. "Full-time employment"
    2. "Regular part-time employment"
    3. "Vocational training"
    4. "Marginal, irregular part-time employment"
    5. "Not employed"

    EXPFT$$ (new):
    Working experience full-time employment. Coverage of complete working experience in full-time employment (in years, one digit after the decimal point).

    EXPPT$$ (new):
    Working experience part-time employment. Coverage of complete working experience in part-time employment (in years, one digit after the decimal point).

    EXPUE$$ (new):
    Unemployment experience. Coverage of unemployment experience throughout the entire period of working life (in years, one digit after the decimal point).


    $PEQUIV 2005  

    SSOLD$$ (new):
    Social assistance to the elderly ("Grundsicherung im Alter").

    FSSOLD$$ (new):
    Imputation flag: Social assistance to the elderly.

    LOSSR$$ (new):
    Losses from renting and leasing.

    FLOSSR$$ (new):
    Imputation flag: losses from renting and leasing.

    LOSSC$$ (new):
    Losses from capital investment.

    FLOSSC$$ (new):
    Imputation flag: losses from capital investment.

    D11112LL (new):
    Race of individual

    D11110$$ (erased):
    data already included in the variables M11124$$.

    D11111$$ (erased):
    data already included in the variables M11125$$.


    Bug fixes  

    Correction of [T-U]HPOP in HPFAD.
    Correction of some individual and household weights for the years 2003 and 2004 (THHRF, UPHRF, and UHHRF). 


    Data Distribution 2004 (Wave U)

    PPFAD 2004

    The basic demographic information in PPFAD has been expanded to include location of residence in 1989, i.e., where an individual lived when the Berlin wall fell (variable LOC1989). This information is differentiated into the categories "East Germany", "West Germany", and "Abroad" and is available for all respondents (adults and children, see further documentation in Biography and Life History Data).

    PGEN 2004 

    LABGRO$$ and LABNET$$
    New variables have been generated for all waves (A-U) providing information on monthly gross and net labor income (LABGRO$$ and LABNET$$), consistently declared in euro. Missing values in case of item non-response are imputed as indicated by the corresponding imputation flag variables IMPGRO$$ and IMPNET$$ respectively (see also additional documentation in PGEN.PDF).

    HGEN 2004  

    $HGEN now includes the monthly net household income consistently named (HINC$$) and declared in euro over all waves (A-U).

    A new variable has been generated for waves L-U (1995-2004) providing information on monthly net household income adjusted for possible underreporting (AHINC$$), also consistently declared in euro. Possible underreporting is checked with the help of the current individual incomes of all household members (see also additional documentation in HGEN.PDF).

    $PEQUIV or SOEP-CNEF 2004  

    The files $PEQUIV now also include a set of cross-nationally harmonized health-related variables M11101$$-M11127$$ (see also the additional documentation in the Codebook for the $PEQUIV File 1984 - 2004).


    Data Distribution 2003 (Wave T)

    The data of the German SOEP (100% version) are distributed on three CD-ROMs covering the years 1984-2003. New data sets for the survey year 2003 are the usual wave-specific data TPBRUTTO, TP, TPKAL, TPGEN, THBRUTTO, TH, THGEN, TKIND and SPLUECKE. There are also updates of data sets with a longitudinal component (biographical data and weights). The information collected for the first time in 2003 in the biographical questionnaire for sample G ("high-income sample") has been completely integrated into the user-friendly biographical data sets.

    As of this year, the data on CD-ROM #2 also contains all SOEP data with variable labels and value labels in English (including the data from the 1988 financial statement in file EV).

    In addition, we have made the following additions and changes:

    Sample G "High Income Sample" (Start 2002)  

    The revised sampling design, using a higher income threshold, results in a smaller number of observations in wave 2.

    HHRF and PHRF 2003 

    The standard weighting variables for waves S and T (SPHRF, TPHRF or SHHRF, THHRF) are based on sub-samples A-F, that is, without considering high-income sample G. In addition, we now offer a new integrated weighting variable for all sub-samples A-G (variables $PHRFAG or $HHRFAG; see also the documentation on the integrated weights for A-G vs. A-F).

    Rectypes 2003

    1. BIOCHILD: Information from the 'Mother and Child Questionnaire'
    In this new file, information on newborns in the SOEP will be collected each year from now on (see further documentation in Biography Data).

    2. BIORESID: Information on second residence in the first interview
    The data set BIORESID includes information on length of residency, and on second residence. The information comes from the biographical questionnaire, which has consistently contained questions on this since 1994 (see further documentation in Biography Data).
    Contact: Thorsten Schneider

    3. BIOBRTHM: Birth biography information for men - from 2001 on
    This new data set includes information on the birth biographies of men interviewed with this modified questionnaire since 2001. BIOBRTHM is structured analogously to BIOBIRTH, based on a question formerly only answered by women (see further documentation in Biography Data).

    4. BIOTWIN: data for identifying births of twins, triplets, etc.
    BIOTWIN includes all identifiable births of twins, triplets, etc. in the SOEP. Identifiers (PERSNR) for the mother and siblings are included (see further documentation in Biography Data).

    5. HBRUTT98:
    This new file contains the complete gross population of sample E in the year 1998. It is useful in attrition analysis of the first wave of this sample.

    BIOPAREN 2003

    Variables on the nationalities of parents have been corrected (see further documentation in Biography Data).

    PGEN 2003  

    MODE$$ and MONTH$$
    Two new variables have been generated for all previous waves to describe interview method and month (MODE$$ and MONTH$$, respectively). See also the additional documentation.

    Update of $PSBIL: For foreigners, the category "leave without graduating" [code 6] had to be updated in 2000, which in turn made it necessary to update $BILZEIT, ISCED$$ and CASMIN$$.
    Contact: Bettina Isengard

    The variable for marital status has been updated.

    HGEN 2003  

    HMODE$$ and HMONTH$$
    Two new variables were generated for all previous waves to describe interview method and month (HMODE$$ and HMONTH$$, respectively). See also the additional documentation.

    PPFAD 2003  

    The central demographic information in PPFAD has been expanded to include the month of birth (variable GEBMONAT). This information is now collected for all adults and children as well (see further documentation in Biography Data).

    Update of EINTRITT, ERSTBEFR, AUSTRITT, LETZTBEF (see further documentation).

    BIOBIRTH 2003  

    The information on women's birth biographies was expanded to include information from the Youth Questionnaire, which is given to 16-17 year-olds being interviewed for the first time instead of the standard biographical questionnaire (see further documentation in Biography Data).

    BIOIMMIG 2003  

    This data was corrected to fix a case of miscoding in past years that occurred due to a reversal of the item sequence. This applies to the variables BIEXPRLV, BIEXPRAC and BIEXPRAN (see further documentation in Biography Data).

    PFLEGE 2003

    The new variable PNRCARE is now available for the years since 1999, that is, for waves P - T. PNRCARE is an invariant number identifying the primary caregiver in a household. In three cases, the person identified as caregiver was identical with the person being cared for. In these cases, PNRCARE was set to -3 (implausible value). For the waves prior to 1999, PNRCARE has been assigned the value -2.
    Contact: Rainer Pischner 
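    When working with PNRCARE, the two special codes need to be separated from valid person numbers. A minimal sketch, assuming that every negative value is a special code; the names used here are hypothetical:

```python
from typing import Optional

# Special codes for PNRCARE as described above.
PNRCARE_SPECIAL = {
    -2: "not available (wave prior to 1999)",
    -3: "implausible (caregiver identical with person cared for)",
}


def caregiver_id(pnrcare: int) -> Optional[int]:
    """Return the caregiver's person number, or None for a special code."""
    return None if pnrcare < 0 else pnrcare


print(caregiver_id(-3))      # None
print(PNRCARE_SPECIAL[-3])
```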

    YPBRUTTO 2003  

    Revision of HHNRAKT and HHNROLD for persons listed twice while living in a previous household.

    $EQUIV 2003

    All income data since 1984 is coded in EURO.

    As a supplement to the annual income aggregates offered thus far, we now add the individual income components (sum of all income earned by all household members, variables I111xx$$) with consistent variable names over time.

    All information missing due to item non-response was imputed and marked using flag variables.

    All income variables are also included for sample G, but standard weights were used on the basis of sub-samples A-F (see also the additional documentation).



    Data Distribution 2002 (Wave S)

    Rectypes 2002

    1. HBRUTT02
    In addition to the continuous, wave-specific gross information on fieldwork progress (SPBRUTTO, SHBRUTTO), the file HBRUTT02 also includes the households of the new subsample G that were not surveyed. HBRUTT02 therefore contains all the households selected for subsample G, while the information on the households that were surveyed is also to be found in the continuous household gross file SHBRUTTO. This matches the approach used for samples A (HBRUTT84), E (HBRUTT98) and F (HBRUTT00).

    2. BIOSOC
    The new data set BIOSOC contains youth information on everybody who has completed the biography questionnaire since 2000. This includes information such as arguments with parents, leisure activities, school grades and the federal state where they last attended school.
    Contact: Thorsten Schneider

    BIOJOB 2002

    The data set BIOJOB contains detailed information on first jobs. As of now this also includes ISCO88 data, occupational scales, classification schemes (ISEI, SIOPS, EGP, MPS) as well as information about the sector (BRANCHE). Information regarding last jobs is a new addition and can be found in BIOJOB.
    Contact: Thorsten Schneider 

    BIOPAREN 2002 

    The Prestige scores for parents have been updated.

    PGEN 2002 

    This new variable is based on the answers to 'Occupational Status' and represents the degree of autonomy in a person's occupation.

    This variable unifies the answers to 'Occupational Status' over all waves.

    The wave-specific files $PGEN have been retroactively (from 1984 onwards) expanded to include two further education variables ($ISCED and $CASMIN), which are respectively based on the international classification schemes ISCED (International Standard Classification of Education) and CASMIN (Comparative Analysis of Social Mobility in Industrial Nations). This will help improve comparisons of education-related analyses based on SOEP data.
    Contact: Bettina Isengard

    $EQUIV 2002 

    Compared to the last data release, there have been fundamental changes to the handling of item non-response for annual income information and the aggregated income information contained in $PEQUIV. The established longitudinal procedure used for the imputation of item non-response has been supplemented by a purely cross-sectional imputation for all income variables, which, however, is only used when no longitudinal information is available for the individual. As a result, all missing income data in the $PEQUIV files have been completely replaced (for further information on the imputation method, see Frick, J.R. and Grabka, M. (2003): Missing Income Data in the GSOEP: Incidence, Imputation and its Impact on the Income Distribution).

    As a result, all imputation flags have been revised. They now report the share of imputed income in the respective income aggregate: if all information is present, the value is 0; if item non-response occurred, the value may be anything up to 100.
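    The logic of such a share-based flag can be sketched as follows. This is only an illustration of the idea of "share of imputed income in the aggregate"; how the SOEP rounds the value and which components enter each aggregate are not specified here, so the rounding and the function name are assumptions.

```python
from typing import Iterable, Tuple


def imputation_flag(components: Iterable[Tuple[float, bool]]) -> int:
    """Percentage (0-100) of an income aggregate that stems from imputed values.

    Each component is (amount, was_imputed). Rounding to whole percent is an
    assumption made for this sketch.
    """
    components = list(components)
    total = sum(amount for amount, _ in components)
    if total == 0:
        return 0  # nothing to attribute; treated as "no imputed share" here
    imputed = sum(amount for amount, was_imputed in components if was_imputed)
    return round(100 * imputed / total)


# One observed and one imputed component.
print(imputation_flag([(20000.0, False), (5000.0, True)]))  # 20
```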

    In addition, complete income information for the new sample F for the years 2000 to 2002 is now also available.

    The CNEF data is not yet available for the first wave of sample G, as the methodologically demanding imputation algorithms applied by the SOEP require longitudinal data.

    DM-EURO conversion

    The income in $PEQUIV always refers to the previous year; this means that data collected in 2002 for the 2001 income year are still in DM. All $PEQUIV information will be converted to euros in the next data distribution. Apart from that, all the data contained in the $P files correspond to the information collected with the original questionnaire, i.e., data collected in euros in 2002 and data collected in DM in 2001 are each stored in the currency used in the questionnaire.
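    Users who need to compare DM and euro amounts in the meantime can convert them themselves. The sketch below uses the official fixed conversion rate of 1 euro = 1.95583 DM; rounding to cents and the function name are assumptions of this example, not a description of how the SOEP files are converted.

```python
from decimal import Decimal, ROUND_HALF_UP

DM_PER_EURO = Decimal("1.95583")  # official fixed conversion rate


def dm_to_euro(amount_dm: str) -> Decimal:
    """Convert a DM amount (given as a string) to euros, rounded to cents."""
    euros = Decimal(amount_dm) / DM_PER_EURO
    return euros.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


print(dm_to_euro("1955.83"))  # 1000.00
print(dm_to_euro("100"))      # 51.13
```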



    Data Distribution 2001 (Wave R)

    With the 18th wave of the SOEP, the concept for the construction of cross-sectional weights has been changed slightly. This change affects neither the derivation of the staying probabilities nor the construction of the weights for Subsample D units.

    For more details, please see the Newsletter 60, April 2003.

    With the current release of SOEP data (survey years 1984-2001), the coding frame for industry and occupation (first and second job) has been changed to the international standard of NACE and ISCO88, respectively. Especially "old friends" among SOEP users should be aware that the variables ISCO$$, ISCOU$$, ISCOH$$ and $BRANCHE are no longer available. The respective new variables in the files $P and $PGEN are described in detail in the documentation of the generated variables at person-level (see file pgen.pdf).

    Other than that, the SOEP group at DIW is currently fixing some minor bugs and deficiencies in the current data release. Firstly, the variables TODJAHR and TODINFO in the file PPFAD, which give year of death and the source of death information, will include all mortality information as given by a recent follow-up study ("Verbleibstudie 2001") carried out by Infratest. Secondly, the variable $ERWZEIT in the file $PGEN will be updated so that there is valid information on the number of years with the current employer for all employed respondents in subsample C. Thirdly, the variable RP4002 in the file RP (occupational status: self-employed) and the variables RHHTAGIN, RHHMONIN, RHINTNR in RH (day and month of the interview as well as the interviewer's ID) had not been defined properly. All these problems will be fixed with the next release of data. However, users who need to use these variables should subscribe to our listserver so they will receive information about these updates sooner.


    Data Distribution 2000 (Wave Q)

    Rectypes 2000

    1. VARIANZ
    In addition to the household indicator this file contains the variables STRAT1, STRAT2, SAMPOINT and INTNR. Some software packages (such as STATA, SUDAAN) are able to use these to estimate variances. All four variables provide information on the respective subsample for the start of each first wave, i.e. they are saved at the case-level (variable HHNR).
    STRAT1 identifies the strata that were relevant for drawing the Primary Sampling Units (PSUs) of the respective sample. For subsample B, these were the five nationalities. Therefore, "artificial" strata corresponding to those of the other subsamples were created for subsample B and stored in STRAT2.
    The variable SAMPOINT identifies the respective PSU (e.g. in subsample A voting constituencies, in Subsample D not present).
    Due to data protection laws the various values of the variables STRAT1, STRAT2 and SAMPOINT were given transformed values, in order to prevent regional units from being identified.
    The variable INTNR is a variable to which every interviewer assigns a number, so that clusters of households that were surveyed by the same interviewer can be identified.
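
    To illustrate how stratum and PSU identifiers of this kind enter a variance estimate, here is a minimal Python sketch of the common with-replacement ("ultimate cluster") estimator for the variance of a weighted total. It is not the SOEP group's own procedure. The tuple layout, the function name and the toy numbers are hypothetical; using STRAT1 or STRAT2 as the stratum and SAMPOINT as the PSU is an assumption that should be checked against the sample documentation.

```python
from collections import defaultdict


def stratified_cluster_variance(records):
    """Estimate the variance of a weighted total of y.

    records: iterable of (stratum, psu, weight, y) tuples. In SOEP terms the
    stratum could be STRAT1/STRAT2 and the psu SAMPOINT (an assumption).
    PSUs are treated as sampled with replacement within each stratum.
    """
    # Weighted total of y for each PSU, grouped by stratum.
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, y in records:
        psu_totals[stratum][psu] += weight * y

    variance = 0.0
    for totals in psu_totals.values():
        n_h = len(totals)
        if n_h < 2:
            # A stratum with a single PSU has no estimable between-PSU
            # variance; this sketch simply skips it (a simplification).
            continue
        mean_h = sum(totals.values()) / n_h
        sum_sq = sum((t - mean_h) ** 2 for t in totals.values())
        variance += n_h / (n_h - 1) * sum_sq
    return variance


# Toy data (invented for illustration): two strata with two PSUs each.
records = [
    (1, 10, 2.0, 1.0),
    (1, 10, 2.0, 3.0),
    (1, 11, 1.0, 4.0),
    (2, 20, 3.0, 2.0),
    (2, 21, 3.0, 4.0),
]
print(stratified_cluster_variance(records))  # 52.0
```

    In the toy data, the PSU totals are 8 and 4 in the first stratum and 6 and 12 in the second, which gives contributions of 16 and 36. Survey packages apply the same idea with additional refinements, for example for means and ratios or for strata with a single PSU.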

    2. HBRUTT00
    As with the supplementary sample of 1998 (sample E), this file contains all gross ("Brutto") information for all households in the Innovation Sample of the year 2000 that were recently drawn using the random-route method, regardless of whether these households were successfully surveyed. Such information can be used for methodological studies of household participation in (SOEP) surveys.

    3. QJUGEND
    In 2000, a youth questionnaire was introduced to replace the biography questionnaire. It is aimed at all "new" participants who have reached the minimum age of 16 and are therefore able to take part in the SOEP survey. The 232 records available so far supplement the information from the person questionnaire, which these respondents also answered for the first time, with retrospective details on education and basic indicators of educational success. In 2001, the youth questionnaire indicators were thoroughly revised and supplemented, and the youth participants of sample F answered the new questionnaire for the first time. As a result, the dataset QJUGEND is, so to speak, a kind of pretest for the recently prepared biography dataset BIOYOUTH (available from 2001 onwards).

    Reworking of labels  

    The VAR LABELS and VALUE LABELS have been completely reworked for all previous years (up to and including 1999). Missing labels were added where applicable and the labelling scheme was standardised (for instance for sub-items or variables with just one answer category). Furthermore, the labels were made consistent over time. At the same time, the reworked label text was transferred to the English labels, so that these are now also fully identical to the German scheme for all past years.

    $PGEN 2000

    For the current data distribution, extensive revisions were made to the variables from earlier waves. For instance, note that there are far fewer missing values -1 (k.A., no answer) for many of the occupation-related variables. The education variables in all $PGEN files were reworked and supplemented. New variables include a differentiated labour force status for all participants and education information generated from data first collected in 2000 on the highest level of education and employment achieved to date. The existing generated education variables were retrospectively reworked, extrapolated and supplemented: data are now available on temporarily absent respondents, as well as on current school attendance, apprenticeship or studies. Furthermore, the variable BETR$$ in $PGEN was recoded (the data on firm size, and therefore the codes in SOEP, have changed over time). Please take this into account when updating your programs.

    $PEQUIV 2000  

    The $PEQUIV files were updated. This affects:

    • the extension of the population
    • the reworking of the variable IMPUTED RENT
    • new variables used to generate equivalence scales
    • a reworking of the variables related to ANNUAL WORKING HOURS



    Data Distribution 1999 (Wave P)

    Rectype 1999

    This interviewer data set contains information about the sex, age, education, occupation and marital status of 1,048 interviewers who worked on samples A, B, C and D in survey waves 1 to 12 (Documentation | PDF, 75.75 KB).

    For more information concerning the data distributions back to 1995 please refer to our German Site.