SOEP-Core v24 - Changes in the Dataset

Dataset Information

The 2008 data distribution (1984-2007) provides, for the year 2007, the usual wave-specific data XPBRUTTO, XP, XPKAL, XPGEN, XHBRUTTO, XH, XHGEN, XKIND and WPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

In the 2006 survey year, a representative supplementary sample for all of Germany was added: refreshment sample H. Biographical background information was collected from respondents in sample H for the first time in 2007. These data have been fully integrated into all relevant biography files (BIOxxxx).

As part of the SOEP innovations projects, TNS Infratest Sozialforschung conducted a postal survey in December 2006 among former SOEP panel members from households that had been classified as final refusals in 2001-2004. As a byproduct, the year of birth could be changed from missing to a valid value for 21 of these persons (more information can be found in the executive summary of the TNS Infratest Methodenbericht).

Furthermore, the following additions and modifications have been made:

A. New and Renamed Datasets

In the 2006 survey year, short cognitive tests were carried out for the first time with a subsample of the SOEP. The goal was to employ a robust set of instruments that could be administered easily by trained interviewers in just a few minutes. Close to 80% of all persons chosen for participation in the cognitive test provided valid answers. Thus, for the first time, the SOEP now contains indicators of cognitive potential for more than 5,500 persons, along with diverse educational information based on degrees and certifications. The first repeat of the test is planned for the 2010 survey year. Detailed documentation and selection analyses can be found in Schupp et al. (2008), Erfassung kognitiver Leistungspotentiale Erwachsener im Sozio-oekonomischen Panel (SOEP), DIW Berlin, Data Documentation 32.

These two datasets replace the former dataset YPBRUTTO; however, both variants are available this year.

Multiply imputed dataset on monthly net household income for the years 1996 to 2007. The dataset is stored in long format with the variables hhnrakt, svyyear and mj (also called mim format in Stata). Each case of item non-response on net household income was imputed 10 times. More information can be found in HGEN.pdf.
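
Analyses of a long-format multiple-imputation file usually compute an estimate once per imputation and then pool the results. The following is a minimal sketch of such pooling for an unweighted mean using Rubin's rules. It assumes that mj indexes the imputation (1 to 10) and that rows with mj = 0, if present, hold the original unimputed data. The column name hinc, the helper pool_mean and the toy records are illustrative assumptions, not part of the SOEP distribution.

```python
from collections import defaultdict
from statistics import mean, variance

# Toy long-format records: one row per household, survey year and imputation.
# "hinc" is a hypothetical column name and all values are made up.
rows = [
    {"hhnrakt": 1, "svyyear": 2007, "mj": 1, "hinc": 2000.0},
    {"hhnrakt": 2, "svyyear": 2007, "mj": 1, "hinc": 3000.0},
    {"hhnrakt": 3, "svyyear": 2007, "mj": 1, "hinc": 2500.0},
    {"hhnrakt": 1, "svyyear": 2007, "mj": 2, "hinc": 2000.0},
    {"hhnrakt": 2, "svyyear": 2007, "mj": 2, "hinc": 3400.0},
    {"hhnrakt": 3, "svyyear": 2007, "mj": 2, "hinc": 2500.0},
]


def pool_mean(records, year):
    """Pool the mean of hinc over imputations with Rubin's rules."""
    by_imputation = defaultdict(list)
    for rec in records:
        # Assumption: mj >= 1 marks imputed datasets; mj == 0 is skipped.
        if rec["svyyear"] == year and rec["mj"] >= 1:
            by_imputation[rec["mj"]].append(rec["hinc"])

    m = len(by_imputation)
    estimates = [mean(values) for values in by_imputation.values()]
    # Within-imputation variance: average squared standard error of the mean.
    within = mean(variance(v) / len(v) for v in by_imputation.values())
    # Between-imputation variance of the per-imputation estimates.
    between = variance(estimates) if m > 1 else 0.0

    pooled_estimate = mean(estimates)
    total_variance = within + (1 + 1 / m) * between
    return pooled_estimate, total_variance


pooled, total_var = pool_mean(rows, 2007)
print(pooled, total_var)
```

The sketch ignores weighting factors; an analysis meant to be representative would apply the appropriate weights within each imputation before pooling.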

B. New Variables 

B.1 Dataset XPBRUTTO

  • XEWSTATU: Proxy information on non-responding persons regarding their labor force status in households with partial unit non-response.


B.2 Dataset $PEQUIV

  • P11101$$: Copy of the wave specific variables on overall life satisfaction.

B.3 Dataset $HGEN

  • I_HINC$$: Multiply imputed version of HINC$$, the monthly net household income. Imputations 1-5 are available in wide format in $HGEN (1996-2007 only). All 10 generated imputations are available in long format in a separate dataset called MIHINC. Additional information can be found in HGEN.pdf.
  • FHINC$$: Imputation flag for I_HINC$$ (0 = not imputed, 1 = imputed).
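
To illustrate how the wide and long layouts relate, the sketch below reshapes five wide-format imputations into long-format rows and uses the imputation flag to compute the share of imputed households. The field names imp, flag and hinc, the helper functions and the values are hypothetical placeholders and do not reproduce the actual variable names in $HGEN or MIHINC. The sketch also assumes that an observed income simply repeats unchanged across imputations.

```python
# Hypothetical wide-format rows: five imputed values plus the imputation flag.
# Field names and values are placeholders, not actual $HGEN variables.
wide_rows = [
    {"hhnrakt": 1, "svyyear": 2007, "flag": 0,
     "imp": [2000.0, 2000.0, 2000.0, 2000.0, 2000.0]},
    {"hhnrakt": 2, "svyyear": 2007, "flag": 1,
     "imp": [3000.0, 3400.0, 2900.0, 3100.0, 3200.0]},
]


def wide_to_long(rows):
    """Return one long-format row per household, survey year and imputation."""
    long_rows = []
    for row in rows:
        # mj numbers the imputations starting at 1 (an assumption).
        for mj, value in enumerate(row["imp"], start=1):
            long_rows.append({
                "hhnrakt": row["hhnrakt"],
                "svyyear": row["svyyear"],
                "mj": mj,
                "hinc": value,
                "flag": row["flag"],
            })
    return long_rows


def share_imputed(rows):
    """Share of households whose income was imputed (flag == 1)."""
    return sum(row["flag"] for row in rows) / len(rows)


long_rows = wide_to_long(wide_rows)
print(len(long_rows), share_imputed(wide_rows))
```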

C. Revised Variables

C.1 In the Dataset $PKAL

  • $P2D03 + $P2E03: In waves U-W (years 2004-2006), an incorrect "does not apply" missing code (-2) was corrected to a "no answer" missing code (-1) for some cases.

C.2 In the Dataset HHRF/PHRF

  • WPHRF*: All weighting factors for the year 2006 are now based on microcensus benchmark data from 2006.

    However, the weighting factors for the year 2007 are also based on the newest available microcensus benchmark data, which are from 2006; they are therefore only provisional with regard to the figures given for households and individuals in Germany.

  • VHHRF + VHHRF1: The weighting factor of one household from sample G was corrected and set to 0.


C.3 In the Dataset $PGEN

  • LFS$$: The variable "labor force status" has been improved over all waves with respect to the accuracy of classifying individuals as "non-working and older than 65" (category 2). The month of birth is now used to determine whether a person was older than 65 at the time of the interview.
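
The effect of using the month of birth can be shown with a small sketch. It compares an age check based on calendar years alone with one based on year and month. The function names are hypothetical, and the reading of "older than 65" as "has reached the month of the 65th birthday" is an assumption; the exact rule used to generate LFS$$ is not stated here.

```python
def age_in_months(birth_year, birth_month, interview_year, interview_month):
    """Completed age in months at the time of the interview."""
    return (interview_year - birth_year) * 12 + (interview_month - birth_month)


def reached_age(birth_year, birth_month, interview_year, interview_month,
                years=65):
    """True if the person has reached the given age in the interview month.

    Assumption: the threshold counts as reached from the birthday month on.
    """
    months = age_in_months(birth_year, birth_month,
                           interview_year, interview_month)
    return months >= years * 12


# Person born in December 1942 and interviewed in March 2007.
year_only_age = 2007 - 1942                  # calendar-year difference
month_based = reached_age(1942, 12, 2007, 3)
print(year_only_age, month_based)
```

In this example the year-only calculation yields 65, while the month-based check shows that the person is still 64 years and 3 months old at the interview.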


D. Error Updates

D.1 In the Dataset VH and WH

  • We have corrected the value labels for the variables indicating the owner of the dwelling (VH27 and WH27). Please note the relevant corrections in the table below.

    Variable Label: Owner Of The Dwelling

    -2   Does not apply             Does not apply
    -1   No answer                  No answer
     1   Self Owned Res. Property   Local Govt. Apt.
     2   Local Govt. Apt.           Co-Operative Apt.
     3   Co-Operative Apt.          Company Apt.
     4   Company Apt.               Private Owner
     5   Private Owner              Do Not Know

D.2 In the dataset $PGEN

  • EGP$$: The variable "Erikson and Goldthorpe Class Category" (international socio-economic index of occupational status) has been corrected with respect to the assignment of individuals to category (18) "not working - pensioner". Up to now, all pension recipients, i.e. recipients of retirement pension and recipients of widow's/orphan's pension have been erroneously classified as "not working - pensioner" if none of the other categories applied. In the corrected generation of the EGP$$ variable, which applies to all waves, non-working persons are only assigned to this category if they are recipients of a retirement pension or if they are recipients of orphan's/widow's pension AND are older than 60 years. Moreover, if there is missing information on pension receipt, additional information from ARTKALEN (retrospective information from the activity calendar for the previous year) is used in the generation process to determine if a person was in retirement or early retirement ("Vorruhestand") at the time of the interview. All other non-working persons are assigned to category (-2) "does not apply" as long as they are not registered as unemployed (category 15).
  • STIB$$: The same misclassification of individuals into the category "pensioner" (13) affected the variable "Occupational position" and has been corrected for all waves in the same way as for EGP$$.
  • NACE$$: The variable for the "two-digit NACE Industry - Sector" had several inconsistencies with respect to the labeling. In particular, the labels for code (90) "Sewage And Refuse Disposal, Sanitation And Related" and code (95) "Private Households With Employed Persons" had to be swapped. Some other labels were not accurate, and have been stated more precisely for all waves.
  • IS88$$, ISEI$$, MPS$$, SIOPS$$, KLAS$$, EGP$$: The questions underlying these variables are not asked of all employed persons every year. In the survey years 1985, 1986, 1987, 1988, 1990 (West), 1992 (West), 1994, 1996, 1999, 2001, 2003, 2005, and 2006, only employed persons who changed jobs and first-time respondents were asked to provide up-to-date information. Hence, in years with a partial survey, these variables should contain the previous year's information for all employed persons without a job change who did not update the information on their current occupation. For some individuals, however, the previous year's data were mistakenly not used. This mistake was corrected by regenerating these variables for all waves in an accurate and consistent way.
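
The carry-forward logic described in the last item can be sketched as follows. This is a simplified illustration, not the actual generation routine: the record fields employed, job_change and value are hypothetical, the West-only restriction for 1990 and 1992 is ignored, and the sketch assumes that a value carried forward in one partial-survey year may be carried forward again in the next.

```python
# Survey years with a partial survey, as listed above
# (1990 and 1992 applied to West only; ignored in this sketch).
PARTIAL_YEARS = {1985, 1986, 1987, 1988, 1990, 1992, 1994,
                 1996, 1999, 2001, 2003, 2005, 2006}


def fill_from_previous_year(records):
    """Fill missing occupation values for one person from the previous year.

    Each record is a dict with the hypothetical keys
    "year", "employed", "job_change" and "value" (None if not surveyed).
    """
    filled = []
    previous = None
    for rec in sorted(records, key=lambda r: r["year"]):
        value = rec["value"]
        if (value is None
                and rec["year"] in PARTIAL_YEARS
                and rec["employed"]
                and not rec["job_change"]
                and previous is not None
                and previous["year"] == rec["year"] - 1):
            # No job change in a partial-survey year: reuse last year's value.
            value = previous["value"]
        current = dict(rec, value=value)
        filled.append(current)
        previous = current
    return filled


person = [
    {"year": 2004, "employed": True, "job_change": False, "value": 43},
    {"year": 2005, "employed": True, "job_change": False, "value": None},
    {"year": 2006, "employed": True, "job_change": False, "value": None},
    {"year": 2007, "employed": True, "job_change": True, "value": 51},
]
result = fill_from_previous_year(person)
print([rec["value"] for rec in result])
```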