SOEP-Core v27 - Changes in the Dataset

The release of the 1984-2010 SOEP data (waves A-BA) will contain the usual year-specific data files (BAP, BAH, BAPGEN, BAHGEN, BAPKAL, BAPBRUTTO, BAHBRUTTO, BAKIND and ZPLUECKE) and the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). The respondents of Sample I (Incentives Sample) answered the biographical background questionnaire for the very first time in 2010.
Since minor changes have been made to many of the older datasets as well, we strongly recommend reinstalling all of the datasets from the new DVD.

1. New two-letter prefix (BA)

This SOEP data release (v27) will include, for the first time in the survey's 27 years, a two-letter rather than a single-letter wave prefix. Since we came to the end of the Latin alphabet with the letter Z in our last data release, we decided to use the wave prefix BA for the cross-sectional data format.
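The prefix scheme can be sketched as a small helper. This is only an illustration: the function name is hypothetical, and the continuation after BA (BB, BC, ...) is an assumption, since these notes only state that the 2010 wave uses BA.

```python
import string

FIRST_SURVEY_YEAR = 1984  # wave A; the release covers 1984-2010 as waves A-BA


def wave_prefix(year: int) -> str:
    """Return the wave prefix for a survey year (hypothetical helper)."""
    index = year - FIRST_SURVEY_YEAR
    if index < 0:
        raise ValueError(f"no SOEP wave before {FIRST_SURVEY_YEAR}")
    letters = string.ascii_uppercase
    if index < 26:
        # 1984-2009: single-letter prefixes A..Z
        return letters[index]
    second = index - 26
    if second >= 26:
        raise ValueError("prefix scheme beyond BZ is not covered by this sketch")
    # 2010 onwards: two-letter prefix; BA is documented, later ones are assumed
    return "B" + letters[second]


print(wave_prefix(2009), wave_prefix(2010))
```

With such a helper, year-specific file names such as BAP or BAPGEN can be built as `wave_prefix(2010) + "P"` and `wave_prefix(2010) + "PGEN"`.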

2. Updated beta version in "long format"

The SOEP data are now also available in "long format" as a beta version in addition to the usual data format. SOEPlong refers to a compressed form of the SOEP data. Rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format). The data are available on the second DVD. For details, see SOEPnewsletter No. 90/2010.
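The idea of pooling wave-specific files can be illustrated with a toy example. It is not the actual SOEPlong layout: the function name, the PERSNR identifier column, and the assumption that a variable's name ends in the two-digit survey year (as in IS8809) are chosen here purely for illustration.

```python
def pool_waves(waves):
    """Stack per-year rows into one long list with an explicit 'year' key.

    waves: dict mapping survey year -> list of row dicts.
    Assumption: year-specific variable names end in the two-digit year.
    """
    long_rows = []
    for year in sorted(waves):
        suffix = f"{year % 100:02d}"
        for row in waves[year]:
            long_row = {"year": year}
            for name, value in row.items():
                # Strip the year suffix so a variable keeps one name in all years
                if name.endswith(suffix):
                    name = name[: -len(suffix)]
                long_row[name] = value
            long_rows.append(long_row)
    return long_rows


# Invented values for one person observed in two waves
waves = {
    2009: [{"PERSNR": 101, "IS8809": 2310}],
    2010: [{"PERSNR": 101, "IS8810": 2320}],
}
for row in pool_waves(waves):
    print(row)
```

Each person-year becomes one row, so an analysis across years no longer requires merging wave-specific files.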

3. Elimination of fakes

When the data for the second wave of our newest sample (Sample I) were checked, the interviews from 36 households were identified as faked. These households are therefore no longer included in this data release.

4. New and renamed datasets

4.1 BIOAGE08[A|B]

The BIOAGE08 dataset contains data from the new "parent questionnaire," which is given to the mothers and fathers of seven- to eight-year-old children. Thus, data are now available on the 2002/2003 birth cohorts that were first observed with the "newborn questionnaire." Because the new questionnaire is given to both mothers and fathers, it provides two sets of responses on many of the children in the sample. The file was therefore split into two on the basis of the parent's gender and the household type the respondent is living in. BIOAGE08A includes mothers and, where no information from the mother was available, some fathers. BIOAGE08B includes fathers only. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).


4.2 LIFESPELL

The LIFESPELL dataset contains data from the follow-up studies of SOEP dropouts (1992, 2001, 2006, and 2008), which were not previously included in the regular data release. The follow-up studies, based on information from public registers, serve to identify the current residence of former SOEP respondents, and thus allow studies of life expectancy and decisions to emigrate for a large percentage of SOEP respondents, even long after they have dropped out of the study. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).

4.3 BIOEDU (beta version)

The BIOEDU dataset, which is being released this year in provisional form (beta release), contains details on educational transitions beginning with entrance into childcare up to tertiary education in consistently structured form. Users who work with these data are requested to report on their experiences (especially any problems they might have), so that a final version can be released next year.
Detailed documentation is in the DIW Data Documentation 58.

Because of its provisional form, this dataset is not part of the normal distribution and you will find the data on the DVD in an extra archive (link available in the NEWS file on the DVD).

5. New Variables

5.1 $PEQUIV dataset

The $PEQUIV files contain a new variable on support payments. With the 2010 questionnaire, the SOEP split the item "support payments" into two separate items. The first now collects information on "alimony from legal spousal support, child support, and child care support" (ALIM$$), while the second asks about "advance child maintenance payments" (IACHM$$). More information about the $PEQUIV files and the new variables is available in the DIW Data Documentation 57.

5.2 PFLEGE dataset

The PFLEGE file now includes two new variables. "FURTHER" gives the number of further persons in the household who require help; this question has been asked since 2009. "CARECOST" gives the regular monthly amount that a household spends on care; this question has been asked since 2010.

5.3 Dataset $PGEN

We are now providing detailed data on educational degrees and training qualifications prior to joining the panel: life course questionnaires have been distributed since 2001 to collect data on apprenticeship occupation, type of qualification (e.g., diploma), and the field of study for those who have obtained a degree. Up to now, data from these open-answer questions were not included in the data release. From now on, however, these data will be released in coded form. The classifications used for the data from the individual questionnaire have been slightly modified in the process of these revisions. A more detailed description is available within the PGEN documentation.
The new variables are:

FIELD$$     Field of tertiary education
DEGREE$$    Type of tertiary degree
TRAINA$$    Apprenticeship: two-digit occupation (KldB92)
TRAINB$$    Vocational school: two-digit occupation (KldB92)
TRAINC$$    Higher vocational school: two-digit occupation (KldB92)
TRAIND$$    Civil servant training: two-digit occupation (KldB92)

6. Revised Variables

6.1 $P dataset

Name changes to the variables in the different classifications for occupation and sector in $P: the variables contained in the $P datasets are collected in alternate years from all respondents and from those individuals who changed occupations. Simultaneously, we also generate and distribute all the information on all years and all individuals in the $PGEN datasets. To more clearly distinguish the generated variables from the originally surveyed variables, and to establish a clear connection to the question number in the respective questionnaire, we have renamed the variables according to the following system:


Old variable name    New variable name
$IS88                $pXX_IS88

Thus, ZIS88, for example, is now ZP29_IS88. The corresponding variables of all datasets from the individual questionnaire ($P) have been renamed, but not the variables recommended for use from the generated datasets ($PGEN, e.g., IS8809 from ZP).
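The renaming rule can be written as a short helper. This is a sketch under stated assumptions: the function name and signature are hypothetical, the question number is assumed to be zero-padded to two digits (following the "XX" placeholder), and the uppercase "P" follows the ZP29_IS88 example.

```python
def renamed_variable(old_name: str, wave_prefix: str, question_number: int) -> str:
    """Build the new $P variable name from the old one (hypothetical helper).

    old_name:        old variable name, e.g. "ZIS88"
    wave_prefix:     wave prefix the name starts with, e.g. "Z"
    question_number: number of the question in that wave's questionnaire
    """
    if not old_name.startswith(wave_prefix):
        raise ValueError(f"{old_name!r} does not start with prefix {wave_prefix!r}")
    # The part after the wave prefix names the classification, e.g. "IS88"
    classification = old_name[len(wave_prefix):]
    # Assumption: two-digit question number, as suggested by "$pXX_..."
    return f"{wave_prefix}P{question_number:02d}_{classification}"


print(renamed_variable("ZIS88", "Z", 29))
```

The question number differs between waves, so it has to be looked up in the respective questionnaire; it cannot be derived from the old name.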

6.2 $HBRUTTO dataset

The coding of the variables identifying the federal state ($BULA, Bundesland) in which a household is located is now consistent with the coding used in official statistics.

$bula (old codings)             $bula (new codings)
 0 Berlin
 1 Schleswig-Holstein            1 Schleswig-Holstein
 2 Hamburg                       2 Hamburg
 3 Niedersachsen                 3 Niedersachsen
 4 Bremen                        4 Bremen
 5 Nordrhein-Westfalen           5 Nordrhein-Westfalen
 6 Hessen                        6 Hessen
 7 Rheinl.-Pfalz, Saarl.         7 Rheinland-Pfalz
 8 Baden-Wuerttemberg            8 Baden-Wuerttemberg
 9 Bayern                        9 Bayern
                                10 Saarland
11 Berlin (Ost)                 11 Berlin
12 Mecklenburg-Vorpommern       12 Brandenburg
13 Brandenburg                  13 Mecklenburg-Vorpommern
14 Sachsen-Anhalt               14 Sachsen
15 Thueringen                   15 Sachsen-Anhalt
16 Sachsen                      16 Thueringen

 A differentiation between East and West Berlin can still be achieved by a combination with $SAMPREG (Sample Region in $PPFAD).
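For users who still hold extracts from earlier releases, the table above can be turned into a recoding map by matching state names. The following is a minimal sketch, not an official conversion tool: the names are hypothetical, and old code 7 is treated as ambiguous because it combined Rheinland-Pfalz and Saarland, which cannot be separated from the old code alone.

```python
# Old -> new $BULA codes, derived by matching state names in the table above
OLD_TO_NEW_BULA = {
    0: 11,   # Berlin -> Berlin
    1: 1,    # Schleswig-Holstein
    2: 2,    # Hamburg
    3: 3,    # Niedersachsen
    4: 4,    # Bremen
    5: 5,    # Nordrhein-Westfalen
    6: 6,    # Hessen
    8: 8,    # Baden-Wuerttemberg
    9: 9,    # Bayern
    11: 11,  # Berlin (Ost) -> Berlin
    12: 13,  # Mecklenburg-Vorpommern
    13: 12,  # Brandenburg
    14: 15,  # Sachsen-Anhalt
    15: 16,  # Thueringen
    16: 14,  # Sachsen
}

# Old code 7 covered two states (new codes 7 and 10), so it has no unique target
AMBIGUOUS_OLD_CODES = {7}


def recode_bula(old_code: int) -> int:
    """Map an old $BULA code to the new coding (hypothetical helper)."""
    if old_code in AMBIGUOUS_OLD_CODES:
        raise ValueError("old code 7 combined Rheinland-Pfalz and Saarland")
    if old_code not in OLD_TO_NEW_BULA:
        raise ValueError(f"unknown old $BULA code: {old_code}")
    return OLD_TO_NEW_BULA[old_code]


print([recode_bula(code) for code in (0, 12, 16)])
```

Raising an error for code 7 is a deliberate choice in this sketch: silently mapping it to 7 would misclassify Saarland households.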

6.3 Minor bug fixes

  • In the BIOAGE17 file from data release v26, the variables classifying the preferred job reported in the youth questionnaire (byklas, bymps, byisco88, byegp, byisei, bysiops) were incorrectly coded and contained too many missing values. This bug has been fixed in data release v27.
  • Some households in Berlin (in 2006 only) were wrongly classified on the settlement pattern variable ($GTYP) in the dataset GGKBOU. This bug has also been fixed.