SOEP-Core v26 - Changes in the Dataset

Dataset Information

The 2010 data distribution (data for the years 1984-2009) contains comprehensive improvements, additions, and modifications. For the most recent survey year, 2009, it also provides the usual wave-specific datasets ZPBRUTTO, ZP, ZPKAL, ZPGEN, ZPAGE17, ZHBRUTTO, ZH, ZHGEN, ZKIND and YPLUECKE, as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

1. Beta version in 'long format'

The SOEP data are being provided for the first time ever as a beta version in "long format" in addition to the standard data format. SOEPlong refers to a compressed form of the SOEP data: rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format).
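
The difference between the two layouts can be illustrated with a minimal Python sketch. The records, the key `persnr`, the column `year` and the variable names `yp01`, `zp01` and `p01` are hypothetical and chosen only for illustration; they are not taken from the actual SOEPlong files.

```python
# Hypothetical wave-specific files: one table per survey year, each with
# its own wave-specific variable names (all values are made up).
wave_files = {
    2008: [{"persnr": 101, "yp01": 7}, {"persnr": 102, "yp01": 5}],
    2009: [{"persnr": 101, "zp01": 8}, {"persnr": 103, "zp01": 6}],
}

# Hypothetical mapping from wave-specific names to one harmonized name.
harmonized = {"yp01": "p01", "zp01": "p01"}


def to_long(files, name_map):
    """Pool wave-specific records into one list with a survey-year column."""
    rows = []
    for year in sorted(files):
        for record in files[year]:
            row = {"persnr": record["persnr"], "year": year}
            for name, value in record.items():
                if name != "persnr":
                    # Fall back to the original name if no mapping exists.
                    row[name_map.get(name, name)] = value
            rows.append(row)
    return rows


long_rows = to_long(wave_files, harmonized)
for row in long_rows:
    print(row)
```

In the pooled result, each row is identified by the person key together with the survey year, so one variable name serves all waves.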

2. New Subsample I
As part of the recent SOEP innovations, fieldwork began in fall 2009 on a new subsample (Sample I). The subsample is currently being used to test the effect of different incentive strategies on participation in the SOEP and it will become part of the innovation sample. See SOEPnewsletter 89 for more on this new sample.
In four randomly assigned groups, the following strategies were used:

  1. SOEP standard incentives (one lottery ticket per respondent),
  2. Choice of either a lottery ticket or five euros per individual interview,
  3. Five euros per individual interview,
  4. Ten euros per individual interview.

The data from the new Sample I have been included in the new release of SOEP data (SOEP, v26), but due to the particular features of the subsample, it does not share an integrated weighting framework with the rest of the SOEP samples. For Sample I, we are conducting a mail survey of all non-participants in the four groups. Since this is the first wave of Sample I, we were not able to integrate its biographical information into the existing biography files. The same applies to the biographical information in the dataset PPFAD; for example, the variable MIGBACK is set to -2 throughout.

Apart from that, the following additions and modifications have been made:

3. New and Renamed Datasets 

  • Data on cognitive tests (COGDJ)
    For the first time, all available data on the cognitive tests of young people ("Denksport Jugend", DJ) are included in the SOEP data release. Since 2006, these tests have been given to young respondents (aged 16) the first time they participate in the SOEP survey.

4. New Variables

4.1 Dataset $HGEN
Two new variables describing the quality of the dwelling:

  • EQPLIF$$ "Dwelling has an elevator"
  • EQPNRJ$$ "Dwelling has alternative energy source"
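
In dataset and variable names such as these, `$` stands for the wave letter and `$$` for the two-digit survey year (compare ZHGEN for 2009). A minimal sketch of this expansion, assuming that waves are lettered consecutively from A (1984) to Z (2009); the helper `expand_name` is hypothetical and not part of any SOEP tool, and the sketch does not check whether a variable actually exists in a given wave:

```python
FIRST_YEAR = 1984  # assumption: wave A corresponds to survey year 1984
LAST_YEAR = 2009   # wave Z, the most recent survey year in this release


def expand_name(template, year):
    """Replace $$ with the two-digit year and $ with the wave letter."""
    if not FIRST_YEAR <= year <= LAST_YEAR:
        raise ValueError(f"year {year} is outside {FIRST_YEAR}-{LAST_YEAR}")
    wave_letter = chr(ord("A") + year - FIRST_YEAR)
    two_digit_year = f"{year % 100:02d}"
    # Replace the longer placeholder first so "$$" is not read as two "$".
    return template.replace("$$", two_digit_year).replace("$", wave_letter)


print(expand_name("$HGEN", 2009))     # ZHGEN
print(expand_name("EQPLIF$$", 2009))  # EQPLIF09
```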

4.2 Dataset $PEQUIV

  • There is a new variable on additional child benefits together with the corresponding imputation flag variable (ADCHB$$ and FADCHB$$)

4.3 Dataset $HBRUTTO - Calendar Year of Interview

  • For the first time, we now distribute a variable (ZDATUMY) describing the calendar year of the interview. Because of the additional Sample I (and the resulting longer fieldwork period), a few interviews were successfully completed in 2010.

5. Revised Variables

5.1 Dataset $HGEN

  • The variables on household type TYP1HH$$ and TYP2HH$$ were completely revised and tested for intertemporal consistency.

5.2 Datasets $KIND – KIDLONG

  • The variables were completely revised and are now provided in longitudinal form (KIDLONG) as well as in cross-sectional form ($KIND). This made it necessary to change the variable names in KIDLONG so that they are consistent over time.


  • The biographical data set on marital status was revised.

5.4 Dataset BIOTWIN

The dataset BIOTWIN contains 100 additional cases since wave Z. This considerable increase in case numbers is due to an adjustment in the data generation procedure: in contrast to the previous procedure, all siblings with an identical year of birth are now considered twins if the information on the month of birth is missing. This less restrictive procedure is based on the assumption that two separate births in a single calendar year remain rare occurrences. Nevertheless, the number of false positives in this group with a missing month of birth is likely to exceed the BIOTWIN average. Hence a new value label was introduced for the variable INFOTWIN in order to flag these twin groups for the user (code "6": coverage since 2007, congruent year of birth, missing month; in contrast, code "5": coverage since 2007, congruent year and month of birth).
In its current state (wave Z) the dataset BIOTWIN covers 250 sets of twins and 5 sets of triplets.

[1] Twins - Not in 2006 (gen.)
[2] Twins - 2006 (Answer Not Verifiable)
[3] Twins - 2006 (Answer Refused)
[4] Twins - 2006 (Answer Validated)
[5] Twins - since 2007 (gen.)
[6] Twins - since 2007 (gen.)
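
The distinction between codes 5 and 6 can be sketched as follows. This is a simplified illustration, not the actual SOEP generation procedure: the record layout and the function `classify_sibling_pair` are hypothetical, and the sketch assumes that the month counts as missing if it is missing for either sibling, which the description above does not specify.

```python
def classify_sibling_pair(sib_a, sib_b):
    """Return an INFOTWIN-style code (5 or 6) for two siblings, or None.

    5: same year and same known month of birth
    6: same year of birth, month of birth missing (assumed: for either one)
    """
    if sib_a["birth_year"] != sib_b["birth_year"]:
        return None
    month_a, month_b = sib_a["birth_month"], sib_b["birth_month"]
    if month_a is None or month_b is None:
        return 6  # less restrictive rule, higher risk of false positives
    if month_a == month_b:
        return 5
    return None  # same year but different known months: separate births


# Made-up sibling pairs for illustration only.
same_month = classify_sibling_pair(
    {"birth_year": 1990, "birth_month": 4},
    {"birth_year": 1990, "birth_month": 4},
)
missing_month = classify_sibling_pair(
    {"birth_year": 1990, "birth_month": None},
    {"birth_year": 1990, "birth_month": 4},
)
different_month = classify_sibling_pair(
    {"birth_year": 1990, "birth_month": 1},
    {"birth_year": 1990, "birth_month": 12},
)
print(same_month, missing_month, different_month)
```

Keeping code 6 separate from code 5 lets users exclude the less certain twin groups from an analysis.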


5.5 Minor bugs fixed

  • Correction of MONTH08
  • Correction of a small number of cases in IMMIYEAR
  • Change in the variable names for questions 25 and 26 in YH and ZH