SOEP-Core v23 - Changes in the Dataset

Dataset Information

The 2007 data distribution (1984-2006) provides, for the year 2006, the usual wave-specific data WPBRUTTO, WP, WPKAL, WPGEN, WHBRUTTO, WH, WHGEN, WKIND and VPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).

In the survey year 2006, a representative supplementary sample for all of Germany was added: refreshment sample H. Detailed information on the integration of this sample and additional changes in both files using weighting and extrapolation factors can be found below (see item 2).

A further important change is the introduction of a new survey instrument for first-time respondents at the age of 17. These persons now receive an expanded youth questionnaire, which provides current information as a supplement to the biographical data already collected, thus rendering the previous individual questionnaire used for this group obsolete. This also means that the survey population for the standard individual questionnaire (stored in the files $P) has changed slightly: since survey year 2006, 17-year-olds are no longer included. The first-time surveying of sample H is an exception: here the 17-year-olds were again surveyed with the individual questionnaire, since the biographical survey in new subsamples starts only with the second wave. The revised $NETTO variables and the file $PAGE17 are of interest in this context (see below).

The educational variables in the generated datasets ($PGEN) have been revised: the integration of vocational qualifications attained abroad has been improved, and the corresponding variables have been subjected to extensive testing for consistency. These variables will be described in greater detail below.

The information on twins in SOEP was validated by a special survey of "potential" twins, and is integrated into the dataset BIOTWIN.

This year as in all previous years, the variables contained in the file WPEQUIV (wave 2006) relating to previous year's income take into account the various structural changes in the tax and transfer system, using these as part of the basic informational framework for generating and simulating annual income. Not only do the changes in the 2005 tax rate (reduction of the top tax rate, personal exemption) play an important role here but also the new guidelines contained in the Old Age Income Act (Alterseinkünftegesetz). The introduction of Unemployment Benefit II (Arbeitslosengeld II) also plays an important role, along with the extensive changes in the transfer system it entails (Social Security, Rent Subsidy, etc.). The generated information on (previous) year's income from SOEP survey year 2006 has thus been subject to thorough testing for internal and external consistency.

This year, the data are being distributed on DVD for the first time. This means that the language of variable and value labels can be chosen even more easily, directly in the SOEP data installation program. If you install the data in Windows Vista using our setup program, please follow the installation instructions on the DVD.

The following additions and modifications have been made:

New and Renamed Datasets 2006  

From 2007 on, persons who have reached the age of their first individual SOEP interview (17 years) are not given the usual individual questionnaire but a special youth questionnaire. Wave-specific information not contained in the biographical data or other generated datasets (like $PGEN, HEALTH) is given in the dataset $PAGE17. Youth questionnaire respondents can be identified with the help of the new $NETTO code "17" (see also the changes in the $NETTO variables in PPFAD). More information can be found in the biography documentation on our homepage and on the new DVD.

Starting in 2007, the information on SOEP sample design previously compiled in the dataset VARIANZ (Spiess 2001) is now being disseminated in a revised and amended dataset DESIGN. Preliminary documentation can be found in designdoku.pdf on our homepage and on the new DVD.

Starting with 2002, the SOEP health module in the individual questionnaire has been revised and put on a two-year replication period. In the HEALTH file, users find the generated SF-12 variables (measuring health-related quality of life) as well as variables on height and weight with imputation flags and a user-friendly, longitudinally checked generated variable of the Body Mass Index (BMI). More information can be found in health.pdf on the SOEP homepage or on the new DVD.

The wealth data collected in 2002 were thoroughly revised and checked for inconsistencies. The data are now provided in two (multiply) imputed datasets for the individual and the household level, with the corresponding flag variables for identification of the imputed values. The two datasets also each contain a generated variable on "net wealth" (see SOEPpapers No. 18).

Interviewer Survey
The interviewer dataset, available up to 2006 only as a "stand-alone" version, is now integrated into the standard data distribution under the name INTVIEW and thus provided in the different software formats (SAS, SPSS, STATA).

Cross-Sectional Weighting Scheme 2006  

With the 2006 data distribution, important changes have been made in the cross-sectional weights. They are described in detail (in German) in the DIW Data Documentation 22.

1. Types of Weighting Factors Redefined
Each cross-sectional weight is designated $xHRFy. Here, $ represents the wave identifier, x the differentiation between households (x = H) and persons (x = P) and y an additional identifier that describes the type of weighting factor.

  • $xHRF are the weighting factors that have been used since the beginning. They contain all samples with the exception of high-income sample G.
  • $xHRF1 are the standard weighting factors, where, in addition to the exclusion of sample G, the weights of new subsamples have been set to zero in their first wave. The reason: for complex survey constructs (for example, life satisfaction and annual income), respondents in their first wave showed "worse" response behavior than respondents in later waves. Sample C is an exception: respondents in the former GDR in 1990 did not exhibit the typical problems of first-time respondents (that is, GxHRF and GxHRF1 are identical).
    For standard cross-sectional analyses, we recommend the use of the $xHRF1 as a standard weighting factor. In this way, the information from the first waves of the different subsamples is automatically left out.
  • $xHRFALL include all available samples.
  • $xHRFD, $xHRFF and $xHRFG designate the isolated weights for immigrant sample D, for refreshment sample F and for high-income sample G.
  • The variable $PHRFXX in PHRF and HHRF has been deleted.
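
The naming scheme above can be illustrated with a short sketch. The helper names `wave_letter` and `weight_name` are hypothetical and not part of the SOEP distribution. The sketch assumes that wave letters run consecutively from A (1984) to W (2006), which is consistent with wave G for 1990 and wave W for 2006 as mentioned in this document.

```python
# Hypothetical helpers: they only assemble variable names following the $xHRFy pattern.
VALID_UNITS = {"H", "P"}                         # H = households, P = persons
VALID_TYPES = {"", "1", "ALL", "D", "F", "G"}    # the y suffixes listed above


def wave_letter(year: int) -> str:
    """Return the wave identifier for a survey year (assumed: A = 1984 ... W = 2006)."""
    if not 1984 <= year <= 2006:
        raise ValueError("this sketch only covers survey years 1984-2006")
    return chr(ord("A") + year - 1984)


def weight_name(year: int, unit: str, kind: str = "1") -> str:
    """Build a cross-sectional weight name such as WPHRF1."""
    if unit not in VALID_UNITS:
        raise ValueError("unit must be 'H' or 'P'")
    if kind not in VALID_TYPES:
        raise ValueError(f"unknown weight type: {kind!r}")
    return f"{wave_letter(year)}{unit}HRF{kind}"


print(weight_name(2006, "P"))         # recommended standard person weight for 2006
print(weight_name(1990, "H", ""))     # original household weight for 1990
print(weight_name(2006, "H", "ALL"))  # household weight including all samples
```

Defaulting `kind` to `"1"` mirrors the recommendation to use $xHRF1 for standard cross-sectional analyses.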

2. Modifications to the External Information Used in the Weighting Scheme
Since 2005, the Federal Statistical Office has no longer provided data differentiating between East and West Berlin. This has led to minor retrospective changes in the external information on the number of households since survey year 2005.

3. New Refreshment Sample H
In 2006, SOEP expanded to include another sample: refreshment sample H. The new households, which are representative for Germany as a whole, were also included in the weighting scheme. The integration of sample H is currently underway. Tests are still being conducted to determine if and how sample H can be further adapted to the external information. This is not a serious problem, since for descriptive analyses we recommend in any case the use of the weighting factors WxHRF1, which exclude sample H.

4. Weighting Factors are Based on Benchmark Data from the 2005 Microcensus
The weighting factors for the year 2006 are based on microcensus benchmark data from 2005; they are therefore only provisional with regard to the figures given for households and individuals in Germany. Please address any questions to Martin Kroh.

BIOAGE01 and BIOAGE17 2006  

Four new variables on pregnancy status have been generated, based essentially on the month of the interview from $P and the month and year of the child's birth, as well as the duration of pregnancy in weeks from BIOAGE01.

BCPREGY 'Mother: pregnant at the time of individual interview wave ($)?'
Value Labels:
2002 | Pregnant at Time of Personal Interview 2002
2003 | Pregnant at Time of Personal Interview 2003
2004 | Pregnant at Time of Personal Interview 2004
2005 | Pregnant at Time of Personal Interview 2005
2006 | Pregnant at Time of Personal Interview 2006
2007 | Pregnant at Time of Personal Interview 2007

BCPREGMO 'Mother: estimated month of pregnancy at the time of individual interview, wave($)'
Value Labels:
1 | First Month of Pregnancy
2 | Second Month of Pregnancy
3 | Third Month of Pregnancy
4 | Fourth Month of Pregnancy
5 | Fifth Month of Pregnancy
6 | Sixth Month of Pregnancy
7 | Seventh Month of Pregnancy
8 | Eighth Month of Pregnancy
9 | Ninth Month of Pregnancy
10 | Last Month of Pregnancy or after Birth

Furthermore the beginning and end of pregnancy are also available as spell data. Analogously to BIOMARSM, for example, we start counting with month 1 (January 1983), such that December 2007 is month 300. The data are generated based on month of birth and duration of pregnancy in weeks from BIOAGE01.

PREGBEGM 'Spell - Month beginning of pregnancy / conception (1 = Jan 1983)'

PREGENDM 'Spell - Month end of pregnancy / Birth (1 = Jan 1983)'
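
The month counting used for the spell variables can be sketched as follows. The function names are hypothetical; only the convention itself (month 1 = January 1983) comes from the description above.

```python
BASE_YEAR = 1983  # month 1 corresponds to January 1983


def to_spell_month(year: int, month: int) -> int:
    """Convert a calendar year and month (1-12) into the running spell month."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - BASE_YEAR) * 12 + month


def from_spell_month(index: int) -> tuple:
    """Convert a running spell month back into (year, month)."""
    year_offset, month_zero_based = divmod(index - 1, 12)
    return BASE_YEAR + year_offset, month_zero_based + 1


print(to_spell_month(1983, 1))    # 1
print(to_spell_month(2007, 12))   # 300
print(from_spell_month(300))      # (2007, 12)
```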

You will find detailed information on the structure and the content of the dataset in the documentation of the biographical data on our SOEP homepage or the DVD.

$HGEN 2006

In addition to the Bundesland (federal state) variable, starting this year, the corresponding NUTS (Nomenclature of Territorial Units for Statistics) Level 1 Variable is also provided. This variable is generally identical with $BULA in $HBRUTTO but without pooling Rheinland-Pfalz/Saarland (from 2000 on) and without differentiating between East and West Berlin.

$PGEN 2006 

1. New Variables

A variable for identification of job change, JOBCH$$, was generated to supplement ERWTYP$$ (and eventually to replace it). The categories for this variable are independent of whether the information was obtained in a first-time or a subsequent interview. For respondents to a subsequent interview, JOBCH$$ refers to job changes since the last interview, and for first-time respondents, it refers to job changes since the beginning of the previous year. Respondents who started their first job and respondents who made a job change are reported separately. In contrast to ERWTYP$$, JOBCH$$ has been subjected to a check for longitudinal consistency. Cases showing inconsistencies, such as duplicate entries of the same job change in two subsequent interviews, have been corrected.
Value Labels:
1 | Not Employed
2 | Employed No Change
3 | Employed No Info If Change
4 | Employed With Change
5 | First-Time Employed

2. Revised Variables
For the years 1990 and 1991, values for job tenure are now provided for sample C (East) as well. Given the potentially limited comparability due to the East German transformation process, this data should be handled with particular care.

Job tenure has been tested for longitudinal consistency due to repeated evidence of inconsistencies. Cases that proved longitudinally inconsistent were corrected using the following procedure:

  1. Start of employment at current job as stated in the respondent's first survey is generally given precedence, and is carried on in subsequent years if no change of job occurred or the respondent did not take a new job after a break in employment.
  2. In the case of a change of job (change of employer / change to self-employment) current data on the time of job change is used and carried on in subsequent years.
  3. In the case where a respondent has taken up a new job after a break in employment, we assume that he or she returned to the old employer if the current data show a start of employment prior to the last survey year. In this case, we do not use the start of employment provided in the current survey but the start of employment from the last survey. If the current data show a start of employment since the last survey year, however, we assume that the respondent changed employer since the previous survey, and update the start of employment using the data from the current survey.

From the longitudinally consistent start of employment with current employer, we determine the duration of job tenure. When a respondent who started working again after a break can be assumed to have returned to his or her former employer, the full duration of job tenure is taken. The period of the break in employment is then not subtracted, potentially resulting in an implicit overestimation of firm-specific human capital.
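
The correction rules above can be sketched roughly as follows. This is an illustration under stated assumptions, not the generation code used for $PGEN: the class `Wave`, its field names and the function names are hypothetical, start dates are simplified to calendar years, and the input is assumed to hold one respondent's interviews in chronological order.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Wave:
    survey_year: int
    reported_start: int        # start of current job as stated in this interview
    job_change: bool = False   # change of employer or change to self-employment
    after_break: bool = False  # new job taken up after a break in employment


def consistent_starts(waves: List[Wave]) -> List[int]:
    """Apply the three precedence rules and return one start year per wave."""
    starts: List[int] = []
    for i, wave in enumerate(waves):
        if i == 0:
            # Rule 1: the first survey's statement is given precedence.
            start = wave.reported_start
        elif wave.job_change:
            # Rule 2: a job change brings in the current statement.
            start = wave.reported_start
        elif wave.after_break:
            last_survey_year = waves[i - 1].survey_year
            if wave.reported_start < last_survey_year:
                # Rule 3a: assumed return to the old employer, keep the old start.
                start = starts[-1]
            else:
                # Rule 3b: assumed change of employer, use the current statement.
                start = wave.reported_start
        else:
            # Rule 1 (continued): carry the start forward without change.
            start = starts[-1]
        starts.append(start)
    return starts


def tenure_years(waves: List[Wave]) -> List[int]:
    """Job tenure in whole years; breaks in employment are not subtracted."""
    return [w.survey_year - s for w, s in zip(waves, consistent_starts(waves))]


history = [
    Wave(2000, 1995),
    Wave(2001, 1996),                    # inconsistent statement, overridden
    Wave(2002, 2002, job_change=True),
    Wave(2003, 2001),                    # inconsistent statement, overridden
    Wave(2005, 2002, after_break=True),  # start before 2003: old employer assumed
    Wave(2006, 2006, after_break=True),  # start since 2005: new employer assumed
]
print(consistent_starts(history))
print(tenure_years(history))
```

In the fifth record the break in employment is not subtracted from tenure, which is the potential overestimation of firm-specific human capital noted above.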

Since 1999, the required job training variable has distinguished between studies at universities and studies at technical colleges. Separate categories have therefore now been created for the years prior to 1999 and for the years since 1999, with the latter explicitly differentiating between these two kinds of educational qualification. Furthermore, technical colleges and technical schools are now designated separately.
AUSB$$ 'required job training'
Value Labels:
1 | No Training
2 | Introduction to Job
3 | On-The-Job Training
4 | Courses
5 | Vocational Training
6 | Technical School, Engineering (East) 1990-96
7 | Technical College or University, up to 1998
8 | Technical College, since 1999
9 | University, since 1999

For waves U, V, and W, values for Wegener's Magnitude Prestige Scale have been added for respondents without a household interview ($NETTO=19).

For the employment type variable, the old categories have been maintained but due to the common value for first-time job holders and those who have made a job change, the label for this category has been changed. Thus, the label 'employed, with change or first time employed' is now applied to the value 6.

3. Update Educational Variables
Thanks to our users, an error was identified in the generation of the educational variables in $PGEN, which had crept in some time ago in the process of retrospective generation for the years 2000 and 2001 and continued on since then. The error was in the variable $PBBIL02, and consisted in assigning foreign university degrees too high a value. The error came about through the integration of the variables $PBBILA and $PBBIL02 in these two years. All educational degrees have therefore now been generated again retrospectively for the years 2000 to 2006. The resulting variables $BILZEIT, ISCED$$ and CASMIN$$ have also been updated retrospectively from 2000 on.

PPFAD 2006 

Revision of the $NETTO Codes

With this year's wave W (23rd survey wave), 2006, the compilation of data on the survey population has changed fundamentally. Previously, an individual interview was carried out with all household members above the age of 16. As of 2006, the regular individual interviews based on the standard adult questionnaire are introduced one year later when household members reach the age of 18. Seventeen-year-olds instead receive an expanded youth questionnaire in their first year as SOEP respondents. (This applies to the old samples A-G; for the new sample H, distribution of this youth questionnaire will start next year, while this year's 17-year-olds have received the regular individual questionnaire, in line with the old system).
This means that we now have two instruments instead of one to obtain data on respondents: the individual and the youth questionnaire. To ensure a consistent differentiation over time, it will therefore be necessary either to include the youth population of the current year or to increase the age limit for all previous years.
The newly revised $NETTO variable assists retrospectively in both differentiations for the entire survey period. The connection between survey population and survey instrument can be retraced with the help of the variable $NETTO in PPFAD or $HNETTO in HPFAD. As a result of the change in the survey population as well as the expansion of the survey instrument to include detailed information on biographical contexts, the corresponding variable $NETTO in PPFAD has been fundamentally revised and is now provided as a two-digit variable. To ease the transition to the new variable, the old one-digit variable is still provided as well under a different name, $NETOLD; the variable $HNETTO in HPFAD is unaffected by this and remains unchanged.
Value Labels:
10 | Respondent Completed Interview
11 | Individual Questionnaire
12 | Individual Questionnaire and Biography
13 | Individual and Youth Questionnaire
14 | Individual and other Questionnaires
15 | Individual Questionnaire and Experiments, Tests
16 | Individual Questionnaire, First-Time Respondent, Age 17
17 | Youth Questionnaire, First-Time Respondent, Age 17
19 | Individual Questionnaire without Household Interview

20 | Children in Household Interviewed ($KIND)
21 | Children with Mother-Child Questionnaire I, Age 0-1
22 | Children with Mother-Child Questionnaire II, Age 2-3

30 | Persons in successfully interviewed household without Individual Interview
31 | Completed Gap Interview ($LUECKE)
32 | Completed Biography Questionnaire
33 | Successful Youth Questionnaire
34 | Successful Tests and Experiments

60 | Only Questionnaire without Individual or Household Interview
61 | Gap Interview without household reference
62 | Gap Interview with drop out
70 | Only Participation in Tests, Experiments, etc.

80 | Individual did not withdraw from panel population
81 | Previous respondent lacking current information
89 | Repatriate - (was Drop Out)

90 | Individual Dropouts $YPBRUTTO
91 | Moved abroad
99 | Died

In the new variable $NETOLD, the old $NETTO code can still be used. Persons at the age of 17 who filled out either a youth questionnaire (n=307) or an individual questionnaire (sample H, n=31) are each coded with the value of 1. As a result, the selection on (WNETTO == 1 | WNETTO == 5) is not identical with the population in WP.
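
As an illustration of how the two-digit codes can be used to separate survey instruments, the following minimal sketch selects 17-year-old first-time respondents from a few made-up records. The records, the identifier field `id` and the helper `select_by_netto` are hypothetical; only the code values 16 and 17 and the variable name WNETTO are taken from the list above.

```python
YOUTH_QUESTIONNAIRE_AGE_17 = 17       # youth questionnaire, first-time respondent
INDIVIDUAL_QUESTIONNAIRE_AGE_17 = 16  # individual questionnaire, first-time respondent

# Made-up example records, not real SOEP data.
persons = [
    {"id": 1, "WNETTO": 11},
    {"id": 2, "WNETTO": 17},
    {"id": 3, "WNETTO": 16},
    {"id": 4, "WNETTO": 30},
]


def select_by_netto(rows, codes, var="WNETTO"):
    """Return the rows whose $NETTO value is one of the given codes."""
    return [row for row in rows if row[var] in codes]


# Respondents who received the youth questionnaire.
youth = select_by_netto(persons, {YOUTH_QUESTIONNAIRE_AGE_17})

# All first-time respondents aged 17, regardless of instrument.
first_time_17 = select_by_netto(
    persons, {INDIVIDUAL_QUESTIONNAIRE_AGE_17, YOUTH_QUESTIONNAIRE_AGE_17}
)

print([row["id"] for row in youth])
print([row["id"] for row in first_time_17])
```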

$PEQUIV 2006 

1. New Variables
ALG2$$: Sum of all transfers from Unemployment Benefit II (Arbeitslosengeld II) received by the household.
FALG2$$: Flag to identify the imputation of Unemployment Benefit II (ALG2$$).
IDEMY$$: Sum of indemnity payments received in the previous year.
FDEMY$$: Flag to identify the imputation of indemnity payments (IDEMY$$).
ITRAY$$: Sum of commuting and travel grants received in the previous year.
FTRAY$$: Flag to identify the imputation of commuting and travel grants (ITRAY$$).

2. Revised Variables
The variable (rental value of personally used living space = imputed rent) has been generated up to now only for persons living in owner-occupied housing. In line with recent research findings and also European Community guidelines for the generation of imputed rent in EU-SILC, this fictitious income advantage is now generated for persons in rental households as well who claim to pay below-market rental prices. These include people in rent-free housing, in socially subsidized housing, and in rental properties offered at a special rate (company dwellings, apartments provided by relatives at reduced rent, etc.).

W11101$$ and W11102$$
Due to the changes to the weighting factors in the files PHRF and HHRF, the variable W11101$$ now contains the individual weighting factor $PHRF1 (from the file PHRF) and the variable W11102$$ now contains the household weighting factor $HHRF1 (from the file HHRF).
First-time SOEP respondents show a significantly higher rate of item non-response in their first wave, which cannot be corrected adequately through imputation. For this reason, these two weights do not take into account the first wave of each new SOEP subsample. Furthermore, high-income subsample G has been excluded from the weighting scheme in order to prevent structural breaks in the analysis of income with vs. without this subsample. These two weighting variables are thus particularly well suited to a consistent time series of income inequality analysis.

The variable W11105$$ now contains the individual weighting factor $PHRFALL (from the file PHRF). This weighting variable takes into account all SOEP subsamples.

The content of the variable E11105$$ is now based on the ISCO88 International Standard Classification of Occupations.

E11106$$ and E11107$$
The variables E11106$$ and E11107$$ now provide information on sector affiliation in the form of a one or two-digit number according to the NACE scheme, the International Standard Industrial Classification of all Economic Activities.

3. Deleted Variable
W11106$$ 'HH-Weight immigrant sample'


With wave W, the birth biographies of men (BIOBRTHM), like those of women (BIOBIRTH), include not only the year of birth (KIDGEB[n], with n = 1...15) but also the month of birth for each child (KIDMON[n]). This birth month is identical with the child's birth month given in PPFAD.


In 2006, a separate survey was carried out in all households with twins. This twin survey had the goal of validating the data on all twins in SOEP and gaining new information. The following variables have been changed or added in BIOTWIN as a result:

The variable BIOMONOZ differentiates between identical and fraternal twins based on a question asked to first-time respondents. This information used to be obtained through a question asking whether the twins were of the same or different sexes. New codes have been introduced for the variable BIOMONOZ to reflect the improved information available. The values are thus no longer compatible with those from prior to wave W contained in variable BIOMONOZ in the dataset BIOTWIN.

The variable INFOTWIN has been introduced. This variable indicates whether information on twins was given in the 2006 twin survey, whether the information was derived from previously existing SOEP data, and whether previously existing data on the twins coincide with the results of the twin survey.


The variable "Erikson and Goldthorpe Class Category" (International Socio-Economic Index of Occupational Status) has been corrected with regard to the categorization of freelance academics, who were previously grouped together with the self-employed (values of 5 or 6). The corrected generation process assigns academic freelancers to the upper service class, which corresponds to a value of 1.