The new refresher sample, Subsample O, contains 1,000 new households. These were selected in cooperation with BBSR using a new sampling design based on regional data in areas where the “Soziale Stadt” (social city) urban development project is being carried out. Based on the digital data available on the boundaries of the “Soziale Stadt” areas, it was possible to create a new variable going back to the year 2000 that shows whether or not a household’s address is within an area covered by the project (see Variable Description below under 4.4).
We have made the following important changes over and above our normal annual updates:
4.1. Early Childhood
4.2. Your Life in the GDR
4.3. Biography follow-up survey
4.4. New variable SOCURBAN in dataset HBRUTTO
4.5. New variables in dataset EQUIV
4.6. New variables in dataset BIOJOB
5.1. Weighting variable PHRF in the dataset PPATHL
5.2. Variables representing occupational codes
5.3. Educational variables
5.4. Dataset KIDLONG
5.5. Dataset BHKIND
5.6. Dataset BIKIND
5.7. Variable PARID in the dataset PPATHL
5.8. Variable HGOWNER in the dataset HGEN
5.9. Dataset INTERVIEWER
5.10. Dataset BIOAGE17
5.11. Dataset BIOAGEL
The new wave of the SOEP-Core study incorporates our “wide” and “long” data formats, which used to be provided to users separately. Our aim is to eliminate any confusion about what is available in which format and to make data use easier overall. After several years of testing SOEPlong as an additional service designed to facilitate analysis for both experienced and new users, we will now be providing all datasets in the “long” format as a standard part of our SOEP data release. This means that you will find the different SOEP data formats listed below in your data file, some of which will be contained in separate subdirectories.
Please make sure that you unpack the entire directory structure when unpacking your data.
1.1. SOEP in “long” format on the top level
In the top-level (or root) directory, you will find all of the datasets provided up to now with SOEPlong (pl, ppfadl, etc.) as well as all of the additional datasets formerly provided only in our classic “wide” format (biographical or spell data such as bioparen, artkalen, etc.). All of the data in the main SOEP-Core study are therefore contained in the datasets in the top-level directory.
Feedback from experienced and beginning users over the past several years shows that the “long” data offer significant advantages in ease of use, particularly for beginners. We have therefore decided to use this as our primary data format in future data releases.
All available individual year-specific datasets are pooled into a single dataset (e.g., all $P datasets are integrated into the PL dataset). In some cases, this means that we have to harmonize variables in order to be able to define them consistently over time. For instance, income information is given in euros up to 2001 and not in deutschmarks, and in cases where questionnaires have changed, the categories are modified over time. All changes are presented to users in a clear and understandable way, and if harmonization is necessary, all input variables are provided in their original form (see below _v*-variables). SOEPlong thus significantly reduces the number of datasets and the number of variables.
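The pooling and harmonization described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names (`pid`, `syear`, `income`, `income_v1`, `currency`) and the sample values are hypothetical and are not actual SOEP variables. The only real-world constant is the official fixed conversion rate of 1 EUR = 1.95583 DM.

```python
DM_PER_EUR = 1.95583  # official fixed conversion rate


def pool_waves(waves):
    """Stack year-specific records into one long list.

    `waves` maps a survey year to a list of dicts with the hypothetical
    keys 'pid', 'income' and 'currency' ('DM' or 'EUR').
    """
    long_rows = []
    for syear in sorted(waves):
        for rec in waves[syear]:
            original = rec["income"]
            if rec["currency"] == "DM":
                harmonized = round(original / DM_PER_EUR, 2)
            else:
                harmonized = original
            long_rows.append({
                "pid": rec["pid"],
                "syear": syear,
                "income": harmonized,   # harmonized value, always in euros
                "income_v1": original,  # input value as originally recorded
            })
    return long_rows


# Two hypothetical year-specific files for the same person
waves = {
    2000: [{"pid": 101, "income": 3000.0, "currency": "DM"}],
    2002: [{"pid": 101, "income": 1600.0, "currency": "EUR"}],
}
pooled = pool_waves(waves)
```

Keeping the unharmonized input next to the harmonized value mirrors the idea that users can always trace a harmonized variable back to its original form.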
A more detailed description of the format of our SOEP-Core data release can be found in our new SOEPcompanion.
1.1.1. Most important changes to v33 in the long format
1.2. Classic format in the subdirectory raw
Since we know that many users have existing scripts that are based on the original data format, and to enable users to understand the process of generating the “long” data, we provide all of the datasets in their original SOEP format in the directory raw.
Users who want to continue using the old format simply need to switch to the subdirectory raw and use the datasets there.
The only change is that there are now additional identifiers in all of the datasets in the raw directory with the name in the long format (PID and PERSNR or HID and $HHRNAKT) and a survey year variable (SYEAR) so that users can easily merge variables from the two data formats.
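A minimal sketch of such a merge, assuming both sources have already been read into lists of records. The key names `pid` and `syear` are lowercase stand-ins for the identifiers named above, and `raw_var` and `long_var` are hypothetical placeholder variables.

```python
def merge_on_keys(left, right, keys=("pid", "syear")):
    """Left-join two lists of dicts on the given key columns."""
    index = {tuple(row[k] for k in keys): row for row in right}
    merged = []
    for row in left:
        match = index.get(tuple(row[k] for k in keys), {})
        # Values from `left` win if both sides carry the same column
        merged.append({**match, **row})
    return merged


# Hypothetical records from a dataset in raw and one in the long format
raw_rows = [
    {"pid": 101, "syear": 2017, "raw_var": 5},
    {"pid": 102, "syear": 2017, "raw_var": 7},
]
long_rows = [{"pid": 101, "syear": 2017, "long_var": 1}]

merged = merge_on_keys(raw_rows, long_rows)
```

Records without a match on the right-hand side are kept unchanged, so no raw cases are lost in the merge.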
1.3. New EU-SILC clone in the subdirectory eu-silc-clone
Many users are undoubtedly aware that the SOEP supports cross-national analysis with CNEF through the dataset PEQUIV. We have now produced a data product that allows you to use the SOEP data in comparative analyses with the EU-SILC (European Union Statistics on Income and Living Conditions) data. EU-SILC, which is provided by Eurostat upon request, offers cross-sectional and longitudinal information for many European countries. Up to now, only cross-sectional information has been available for Germany. The EU-SILC clone offers longitudinal information on private households in Germany based on the SOEP data. All of the information contained in it can be directly compared with the EU-SILC longitudinal information on other European countries.
The EU-SILC clone is integrated into the standard SOEP data release (in subdirectory eu-silc-clone).
Documentation on the 2005-2016 EU-SILC clone can be found here (PDF, 3.01 MB).
The new SOEP data release (v34) will be the first to contain data from the IAB-BAMF-SOEP Survey of Refugees in Germany as Sample M5, as well as the continuation of the PIAAC-L Survey, as Sample N.
2.1. IAB-BAMF-SOEP Survey of Refugees (M5)
The SOEP, in cooperation with the Institute for Employment Research (IAB) and the Federal Office for Migration and Refugees (BAMF), has succeeded in integrating a third sample of refugee households (M5) into the SOEP study. The survey was launched in 2017. The population of M5 covers adult refugees who have applied for asylum in Germany since January 1, 2013, and are currently living in Germany. M5 added another 1,519 households of refugees who have migrated to Germany since 2013 to the SOEP framework.
2.2. Integration of respondents from PIAAC-L as Subsample N
Sample N integrated 2,314 households of former participants of the Program for the International Assessment of Adult Competencies (PIAAC and PIAAC-L) in 2017. This is the most recent addition to the SOEP-Core samples. Fieldwork in sample N was conducted between mid-March and mid-August and thus slightly later than the majority of samples A–L1. More information on the PIAAC-L project can be found on the project homepage.
In the IAB-BAMF-SOEP Survey of Refugees (M3-M5), there were translation errors in some of the questions on income components in translated versions of the household questionnaire. Answers for these variables are therefore not comparable with other answers. The corresponding variables were set to -3.
In the process of data preparation, three interviewers were identified who had not conducted interviews in line with the standards of the IAB-BAMF-SOEP group (more information here). The interviewers in question were responsible for 88 households in 2016 and 112 households in 2017. The households affected in the first wave of the survey (2016) were completely removed from the dataset. The households affected in 2017, who were supposed to be interviewed for the second time, were deleted for 2017 but left in the dataset for 2016. There are no indications that the first interviews (by a different interviewer) were not conducted in line with IAB-BAMF-SOEP standards. The interviews and cases deleted from the data release may be accessed upon request from a guest work station at the SOEP-RDC for survey methodological analysis. After these lines were deleted from all datasets, the following adjustments were made:
The extended variable naming convention is applied only to data sets from wave BH onwards and only applicable for the datasets $P, $H, $KIND. We added underscores between unit of analysis, question identifier, and item identifier to clearly separate the analysis unit, question, and item visually. In addition, a questionnaire identifier was introduced, which is also separated by an underscore from the item. This new version of naming variables is only used if the survey instrument differs from the “original” SOEP-Core instrument.
Due to our different samples in the SOEP, there are some respondents that receive sample-specific questions, such as the refugee sample that started in 2016. For that specific group, we created an extended individual questionnaire with some specific questions along with the standard SOEP questions that are asked every year. For the specific questions, you can use the instrument variable to see the source of the variables.
Examples and more detailed descriptions can be found in the chapter on this subject in the SOEP Companion.
6.1. Dataset PPATH / PPATHL (in raw: PPFAD)
6.1.1. SEXOR
6.1.2 PARINFO
6.1.3 Migration information
6.1.4. Asylum-Seekers and Refugees
6.2. Dataset PGEN
6.2.1 Partner pointer
6.2.2. Volunteer work and side jobs
6.2.3. Educational degrees
6.2.4. AUTONO
6.3 Dataset PEQUIV
6.4. Dataset BIOAGEL and BIOPUPIL
6.5. Dataset HGEN
A number of changes have taken place in recent years in questions on home rental. The first change took place in the household questionnaire of wave BF (2014). The question asked about the costs of utilities in such detail that respondents were not able to provide correct answers. This led to underestimation of both base rent and utilities.
It emerged that this led to a slight break in the time series. Rent has increased continuously over the years since 1984. In 2014 and 2015, however, rental costs fell and have been increasing again sharply since 2016. This break can be explained by the change in the questionnaire.
Starting with wave BH, respondents are being asked about rent in the same way as in wave BG (2016) and in wave BD (2013) in order to maintain long-term comparability. In addition, with wave BH, the new migration sample M5 and the new refresher sample N are part of the SOEP. Since Sample M5 was not surveyed on utility costs in a comparable way and since many of these respondents probably live in group housing or receive subsidies to cover living costs, no rent variable was generated for them.
| Year | v33 - rent | v34 - rent |
| 2010 | 486.25 | 486.21 |
| 2011 | 484.93 | 485.64 |
| 2012 | 491.01 | 490.75 |
| 2013 | 505.00 | 505.59 |
| 2014 | 470.95 | 473.74 |
| 2015 | 507.06 | 508.57 |
| 2016 | 545.53 | 541.90 |
| 2017 | | 550.67 |
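To make the break in the series easier to see, the year-on-year changes can be computed from the v34 figures listed above. This is a small worked example and not part of the SOEP tooling; the function name is an arbitrary choice.

```python
# v34 rent figures as listed in the table above
rent_v34 = {
    2010: 486.21, 2011: 485.64, 2012: 490.75, 2013: 505.59,
    2014: 473.74, 2015: 508.57, 2016: 541.90, 2017: 550.67,
}


def yearly_change(series):
    """Return the change from the previous listed year for each year."""
    years = sorted(series)
    return {
        year: round(series[year] - series[prev], 2)
        for prev, year in zip(years, years[1:])
    }


changes = yearly_change(rent_v34)
largest_drop = min(changes, key=changes.get)  # year with the biggest decline
```

In these figures the largest decline falls in 2014, the first wave with the changed rent question.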
6.6. Dataset BIOIMMIG
6.7. Dataset HHRF/PHRF
6.7.1. Revisions and Bugfixes
In the process of preparations for the next wave of the IAB-BAMF-SOEP Survey of Refugees, the survey institute determined that an interviewer had not conducted interviews correctly, affecting six percent of the household interviews in the sample. These households were removed from the dataset, but are available upon request for survey methodological analysis at a guest work station at the SOEP Research Data Center. In addition to deleting these lines from all affected datasets, we also made the following modifications:
Datasets from the current BG wave contained errors in the assignment of interviewer IDs. These were corrected.
Inconsistencies between key variables on population assignment in the PPFAD and $$KIND datasets were corrected. There was an error of one year in the definition of the target population in the $$KIND datasets from 2014 to 2016. In some cases, this led to a lack of information on the year of birth in files on children:
These corrections also affect the number of cases in the file KIDLONG, which was corrected correspondingly.
3.1 Change in the $$NETTO codes in 96 cases (children) in the years 2014-2016
In the process of data checks, the $$NETTO codes in PPFAD were also compared and corrected. In survey years 2014 to 2016, some children had been incorrectly assigned the code 20 instead of 30 on the variable $$NETTO in the PPFAD dataset. This error has been corrected in v33.1. The update also made it necessary to correct person weights in the affected survey years (dataset PHRF), because the determination of which individuals in interviewed households should be assigned a valid weight is based on the variable $$NETTO. The updated weight is also contained in v33.1.
In BIOAPREN, a number of missing values in the flag variables for parental (professional) education and the years of death of the parents were updated and filled in.
The algorithm for imputation of missing dates in the spells was optimized. As a result, in v33.1, the imputed variables and the variables derived from these were changed, specifically all variables with the suffix _imp and the variable staytime. The changes affected a total of 349 of 15,640 spells.
The variable AUSB16 (“profession requires vocational training”) from BGPGEN was updated. The correction substantially decreased the number of missings [-1].
The new data distribution (1984–2016) “SOEP v33” provides, for the most recent survey year 2016, the usual wave-specific data files BGPBRUTTO, BGP, BGPKAL, BGPGEN, BGHBRUTTO, BGH, BGHGEN, BGKIND, and BFPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). Additional new samples, datasets, or variables are listed below:
1 New Sample IAB-BAMF-SOEP Geflüchtetenstichprobe (Samples M3/M4)
With version 33 of the SOEP data, we will also be releasing the integrated data from the 2016 IAB-BAMF-SOEP Survey of Refugees in Germany as two supplementary samples to the SOEP. The samples are based on the overall population of refugees, independent of residency status, who arrived in Germany between January 1, 2013, and January 31, 2016. The samples were surveyed through additional funding from BA/IAB/BMAS in the case of M3 and from BMBF in the case of M4. Sample M4 contains a higher number of refugee families containing children and teenagers.
The German Central Registry of Foreigners (AZR) provided the data basis from which the sample was drawn. In this sample, 4,816 adults in 3,554 households were surveyed in 2016, and basic data was collected on 5,717 minors living in the same households. To conduct the survey, the questionnaire was translated into seven languages, and in some cases interpreters were available. To address the specific situation of the target group, audio-assisted versions of the survey instruments were developed.
The survey is comprised of an integrated individual—life course questionnaire, a household questionnaire, and a questionnaire for interviewers. As was the case with samples M1 and M2 (IAB-SOEP Migration Surveys), participants were first asked for consent to linking their survey data with the IAB Integrated Employment Biography data. The Research Data Centre of the Federal Employment Agency at IAB will be providing the linked data to researchers as of Spring 2018.
The study design is described in detail in Martin Kroh et al. 2016. Das Studiendesign der IAB-BAMF-SOEP-Befragung von Geflüchteten. SOEP Survey Papers 365: Series C. Berlin: DIW Berlin / SOEP
1.1 Integration into SOEP and original data (BGP and BGP_REF)
The original data from the survey instruments used in Samples M3 and M4 can be found in original format in the dataset BGPREF, where the individual and the biographical questionnaires are combined. The variables are also integrated into the other standard or generated datasets:
2.1 Datasets directly based on survey instrument (like $P and $H)
$PINSTRUMENT and $HINSTRUMENT
Additional variables on occupational codes
$JUGEND
2.2 New variables in PPFAD
SEXOR
BIRTHREGION
2.3 BIOAGEL
Bioagel now contains information from the new questionnaire for 13- to 14-year-olds that was introduced in 2016 (v33). The questionnaire contains items on personality, leisure time activities, personal networks, educational aspirations, and family life, and is completed by the young people themselves. Many of the questions included here stem from the questionnaire for 11- to 12-year-olds introduced in 2014 (v31). For those respondents who already completed this questionnaire, we now provide longitudinal information on development in many areas such as personality and educational aspirations.
2.4 REFUGSPELL
For migration biographies in the refugee samples, we created the new spell data set REFUGSPELL. The variables in MIGSPELL and REFUGSPELL are derived from different instruments and only partially overlap. The data structure allows the two data sets to be linked if desired. Detailed documentation will be provided by the biographical data documentation of the SOEP.
2.5 New variables in $PGEN
$P_RELIGION (Religious affiliation)
PICORIG[A-C]$$ (Party Identification in Country of Origin)
Current residency status
2.6. New variables in $PEQUIV
2.7 Grip strength data for 2016
GRIPSTR update: The data on grip strength from the survey year 2016 is now included in the GRIPSTR dataset.
3.1 Variables in PPFAD
GERMBORN, CORIGIN and IMMIYEAR
MIGBACK and MIGINFO
LOC1989
3.2 BIOIMMIG
The target population and sample size of the BIOIMMIG dataset have changed. The dataset is no longer limited to respondents who were born abroad and had non-German citizenship. In addition, cases without valid BIOIMMIG information in any wave, or with valid information only on BISCGCF, are no longer included in the dataset.
3.3 Update of PWEALTH and HWEALTH
Up to now, the former FiD samples were not integrated into the data for 2012 even though these samples also received wealth questionnaires. With the current data release, these cases have now been integrated.
3.4 BIOPAREN
BIOPAREN was rebuilt from scratch. We excluded redundant variables and changed the variable names to English for consistency (which means that VNR is now FNR). Please see the documentation for a full list of changes and an overview of the new variable names.
3.5 $PGEN
With variables DEGREE$$, FIELD$$, and TRAIN[A-D]$$, discrepancies had appeared in the retrospectively reported data. The classification was improved to deal with inconsistencies in multiple answers. The conversion key for occupational information based on the ISCO-88 classification for TRAIN[A-D]$$ in KldB92 was extended from two to four digits, and the conversion key itself was revised.
3.6 $HGEN
In Wave BF, a major change took place in the way respondents were asked about rent, and respondents were asked to provide more detailed information on their rental expenses. It became clear from ex post analysis that the way these questions were asked was too complicated for some respondents and that it had resulted in a discontinuity in the time series. As a result, according to the SOEP (as well as in the comparative statistics), the average rent had risen systematically over time, but not in 2014 or 2015. This discontinuity can be explained by the change in the questionnaire. Starting with wave BG, the questions on rent were therefore changed back to those in wave BD. In wave BG, migration samples M3 and M4 were also part of the SOEP. Since these respondents were not asked what they paid for utilities, no rental variable was generated for this group.
SOEP-Core soep.v32.1
einstieg_artk and einstieg_pbio: Since data version 32, SOEP has offered two additional labor market entry variables as part of the BIOJOB file. They were constructed on the basis of employment history information to the exact year and month. They refer to a generic uniform definition of the first survey period after the transition from the educational system to the labor market. The construction of these variables is documented in detail in the SOEP Survey Paper 429; a short version of the description is also available in the BIOJOB documentation. (SOEP Survey Paper 418)
SOEP-Core soep.v32
The new data release (1984–2015) "SOEP.v32" provides, for the most recent survey year 2015, the usual wave-specific data files BFPBRUTTO, BFP, BFPEQUIV, BFP_MIG, BFPKAL, BFPGEN, BFPAGE17, BFHBRUTTO, BFH, BFHGEN, BFKIND, and BEPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors).
1. New migrant subsample (M2)
In 2013, we conducted the first IAB-SOEP Migration Sample in partnership with the Institute for Employment Research (IAB) in Nuremberg (for an overview of M1, see SOEP Survey Paper 216). The households from the second IAB-SOEP Migration Sample, surveyed in 2015, are now also included in the SOEP data. The target population of the second IAB-SOEP Migration Sample consists of immigrants who arrived in Germany between 2010 and 2013. Migrants from the new EU member states in Eastern Europe dominate this group. This focus will make it possible to better describe the dynamic recent evolution of immigration to Germany. Sample M2 consists of 1,096 households and was, like sample M1, drawn from register data from the Federal Employment Agency.
Record Linkage
Please note that data from both samples can be linked with administrative employment and income data: Survey respondents are asked to provide explicit consent to record linkage. But since this linked dataset contains social data, these weakly anonymized data are only accessible on site at the Research Data Center of the German Federal Employment Agency at the IAB (FDZ IAB). Researchers can access FDZ IAB data through a guest visit to the IAB or through remote data processing, also arranged with the IAB. The linked data will soon be available to external researchers. Requests for data access should be directed to FDZ IAB, since a contract with IAB for data use is required.
For more information, see the FDZ IAB website.
2. Weighting
3. Changed datasets or variables
4. New datasets or variables
1. Integration of the FiD study (data from 2010 ongoing)
We are pleased to announce that the data release v31 will include the data from “Familien in Deutschland” (Families in Germany, FiD), which is being retrospectively integrated into the SOEP and made available in user-friendly form to all SOEP users. The survey was carried out in parallel to the SOEP as a so-called “SOEP-related study” from 2010 to 2013.
The original SOEP-related study FiD
The idea of FiD was to evaluate the full range of public benefits in Germany for married people and families on behalf of the Federal Ministry for Family Affairs. The datasets available—including the SOEP—were not sufficient for differentiated analysis of the segments of the population targeted by family policies. Particularly problematic were the very small percentages of single parents, families with more than two children, low-income families, and families with very young children in the German population. These groups are of course included in the SOEP, but the number of observations is too small for sound statistical analysis.
Since 2010, the SOEP Research Infrastructure at DIW Berlin has been working in collaboration with TNS Infratest Sozialforschung to survey more than 4,500 households every year. The FiD sample consists of the following subsamples:
A description of the original FiD study can be found in the article “Familien in Deutschland – FiD” by Mathis Schröder, Rainer Siegers, and C. Katharina Spieß, Schmollers Jahrbuch 133 (4), 2013, 595-606 (http://dx.doi.org/10.3790/schm.133.4.595). Pre-published 2013 as SOEPpapers 556 (PDF, 160.15 KB), Berlin: DIW Berlin.
Integration into SOEP-Core
Starting with Version 31 of the data, the FiD sample will be integrated completely into the SOEP-Core data, that is, as if it were a new sample drawn as part of SOEP-Core in 2010 and 2011. The integration of the FiD sample will result in a significant increase of almost one-third in the number of cases in SOEP-Core since 2010. The figure shows how the new FiD samples L1 to L3 have affected cross-sectional sample size since 2010. The retrospective integration meant that the sample variables had to be adjusted, as other subsamples have been added to SOEP-Core since 2010 (see adjustment of the sample variables).
In total, 14,166 variables from 64 datasets have been integrated into the various SOEP datasets, and the generated data sets or variables have been adjusted. Variables in the FiD survey instruments that were not contained in the corresponding SOEP survey instruments have been included in the respective datasets as additional variables (with the original FiD variable names starting with “fyy”, where “yy” is a two-digit year identifier). The table below gives an overview of the number of variables in each of the two main questionnaires that could be integrated.
| Year | Individual questionnaire (–p): number of variables integrated | Household questionnaire (–h): number of variables integrated |
| 2010 | 314 | 274 |
| 2011 | 472 | 172 |
| 2012 | 350 | 188 |
| 2013 | 363 | 169 |
This means that from 2010 on, SOEP users have more cases in their study population—automatically, as it were—without having to make any changes in scripts. Of course, it may be that certain variables were not collected in FiD and are therefore unavailable for these cases. Here, please refer to our conventional approach to missings, which makes this easy to see on the variable level:
| Code | Meaning |
| -1 | no answer / don’t know |
| -2 | does not apply |
| -3 | implausible value |
| -4 | Inadmissible multiple response |
| -5 | Not included in this version of the questionnaire |
| -6 | Version of questionnaire with modified filtering |
| -8 | Question not part of the survey program this year* |
*Only applicable for datasets in long format.
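As an illustration of how these codes might be handled in an analysis script, the sketch below treats every code in the table as missing before computing a mean. The function names and the sample values are hypothetical, and the approach assumes that none of the listed negative codes is a substantively valid value for the variable at hand.

```python
# Missing codes as listed in the table above
MISSING_CODES = {
    -1: "no answer / don't know",
    -2: "does not apply",
    -3: "implausible value",
    -4: "inadmissible multiple response",
    -5: "not included in this version of the questionnaire",
    -6: "version of questionnaire with modified filtering",
    -8: "question not part of the survey program this year",
}


def to_missing(value):
    """Return None for any documented missing code, else the value itself."""
    return None if value in MISSING_CODES else value


def valid_mean(values):
    """Mean over valid values only; None if no valid value remains."""
    valid = [v for v in map(to_missing, values) if v is not None]
    return sum(valid) / len(valid) if valid else None


# Hypothetical responses, two of them carrying missing codes
example_mean = valid_mean([1200, -1, 1800, -5])
```

Cases from samples that did not receive a question then drop out of the calculation rather than distorting it with negative codes.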
The Federal Statistical Office plans to adjust the already-released Microcensus data from 2011 and 2012 based on the 2011 census data. This means that in the present SOEP data release (v31), the weights for waves BB and BC will change due to the adjustment to the 2011 census data.
Because v31 will include the data from the SOEP-related study FiD, the integration of these households into the SOEP will increase the overall case number by around one-third and it will also affect the integrated weighting variables. This is due to the additional households as well as to the differentiated consideration of official information on family types in the weighting process. To allow users to test how a new sample may affect their research using the SOEP data, we provide both integrated weights and also separate weights for the old and new samples in the year when a refresher sample was integrated into the SOEP.
3.1 Adjustment of the psample / hsample variables
Due to the retrospective integration of the FiD sample, the psample variable in ppfad and the corresponding hsample variable in hpfad had to be adjusted.
sample variables
| Value | Old Labels (v30) | New Label (v31) |
| 1 | A German West | A Original Sample (DE-West) |
| 2 | B Foreigner West | B Migration (up to 1983, DE-West) |
| 3 | C German East | C Original Sample (DE-East) |
| 4 | D 84-93 Immigrant (West) | D 1994/5 Migration (1984-92/94 DE-West) |
| 5 | E Refreshment 1998 | E 1998 Refreshment |
| 6 | F ISOEP 2000 | F 2000 Refreshment |
| 7 | G High-Income Test 2002 | G 2002 High-Income |
| 8 | H Refreshment 2006 | H 2006 Refreshment |
| 9 | I Incentives 2009 | I 2009 Incentivization |
| 10 | J Refreshment 2011 | J 2011 Refreshment |
| 11 | K Refreshment 2012 | K 2012 Refreshment |
| 12 | | L1 2010 Birth Cohorts (2007-2009) |
| 13 | M Migration 2013 | L2 2010 Family Types |
| 14 | | L3 2011 Family Types |
| 15 | | M1 2013 Migration (1995-2010) |
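Scripts written against v30 that select the migration sample by its numeric code may need adjusting. The sketch below assumes the table is read as follows: value 13 denoted Sample M in v30 and denotes L2 in v31, the migration sample now has value 15, and values 1 to 11 are unchanged. The mapping and function name are illustrative only.

```python
# Assumed translation of v30 sample codes to v31 codes (see table above);
# only the migration sample changes its numeric value.
PSAMPLE_V30_TO_V31 = {13: 15}


def recode_psample(v30_value):
    """Translate a v30 psample/hsample code into its v31 equivalent."""
    return PSAMPLE_V30_TO_V31.get(v30_value, v30_value)


migration_code_v31 = recode_psample(13)
refreshment_1998_code = recode_psample(5)
```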
3.2. Biographical data sets
The following datasets with biographical information were pooled to keep the number of life-course-related datasets to a reasonable level:
biobirth and biobirthm -> biobirth
Women’s (biobirth) and men’s (biobirthm) childbirth biographies are merged into the dataset biobirth as of v31, of course along with a gender variable.
bioage01 to bioage12 -> bioagel
Starting with data distribution v31, the age-specific data from the mother/parent-child questionnaires are provided only in the user-friendly “long” format: Rather than as age-specific individual files (e.g., bioage01, bioage03, ...), all mother-child and parent-child questionnaires are now pooled in the bioagel dataset. Consequently, all information on children can now easily be found in one dataset. The documentation on the biographical data includes a syntax to generate the age-specific individual files for those who do need them and information on how to use the new bioagel “long” data set most efficiently with SPSS and Stata.
The dataset bioage17 derived from the youth questionnaire is not included in this bioagel dataset.
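For users who still need age-specific subsets, the idea of splitting a pooled dataset can be sketched as below. This is not the syntax shipped with the biographical documentation; the key name `questionnaire` and the record contents are hypothetical.

```python
from collections import defaultdict


def split_by_key(rows, key):
    """Group pooled records into separate lists by the value of `key`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


# Hypothetical pooled records; 'questionnaire' marks the source instrument
pooled_rows = [
    {"pid": 201, "syear": 2012, "questionnaire": "bioage01"},
    {"pid": 202, "syear": 2012, "questionnaire": "bioage03"},
    {"pid": 203, "syear": 2013, "questionnaire": "bioage01"},
]
by_instrument = split_by_key(pooled_rows, "questionnaire")
```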
3.3 Changes in $HGEN
The file HGEN v31.1 now contains the variable gas$$, which states the household’s gas costs starting in 2014. The variables $$eqplif and $$eqpnrj have now been carried forward from the last two years if a household did not provide a response in a given year.
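One possible reading of this carry-forward rule is sketched below: a missing year is filled with the most recent reported value, provided that value is at most two years old. This is an assumption about the procedure and not the actual generation code, and the sample values are hypothetical.

```python
def carry_forward(values_by_year, max_gap=2):
    """Fill missing years (None) with the latest reported value,
    as long as that value is no more than `max_gap` years old."""
    filled = {}
    last_year = None
    last_value = None
    for year in sorted(values_by_year):
        value = values_by_year[year]
        if value is not None:
            last_year, last_value = year, value
            filled[year] = value
        elif last_year is not None and year - last_year <= max_gap:
            filled[year] = last_value
        else:
            filled[year] = None
    return filled


# Hypothetical household: reported in 2010 and 2014, no response in between
reported = {2010: 1, 2011: None, 2012: None, 2013: None, 2014: 0}
filled = carry_forward(reported)
```

Here 2011 and 2012 take over the 2010 value, while 2013 stays missing because the last report is three years old.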
3.4 Other changes in SOEP v31.1
The updates in v31.1 only affected the values of various variables. For detailed information please see doi soep.v31.1
The new data distribution (1984–2013) “SOEP v30” provides, for the most recent survey year 2013, the usual wave-specific data files BDPBRUTTO, BDP, BDPKAL, BDPGEN, BDPAGE17, BDHBRUTTO, BDH, BDHGEN, BDKIND, and BCPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). Additional new samples, datasets, or variables are listed below:
1. Cross-sectional weights 2013
We are pleased that with the figures now available from the official statistical agencies, we are now able to provide you the finalized weighting variables in this version of the data (doi:10.5684/soep.v30). As is always the case in years of refresher and enlargement samples, we are providing weights for the old and new samples, both separately and together. These different sets of weights are designed to make it easier for users to study how the integration of a new sample affects the analysis of specific research topics.
Please also note that the government census carried out in 2011 replaced the projected population figures, which had been regularly updated based on the last census in 1987, with current population figures from the Federal Statistical Office. This means that the post-stratification of SOEP weights from wave BD in data release v30 is based on a version of the Microcensus from 2013 that considers the 2011 census for the first time. It is therefore possible that changes in weighted analyses of the SOEP between 2012 (BC) and 2013 (BD) are the result of the government statistics switching over to the more recent census. The correction is evident in the fact that the estimated total number of individuals living in private households in Germany fell from 81 million in 2012 to less than 80 million in 2013.
Given the retrospective revision of the 2011 and 2012 Microcensus data to account for the census results, our next data release (soep.v31) will include retrospectively revised weighting variables for the 2011 and 2012 survey data.
If you have any comments on the weighting variables, we would be happy to hear from you (mkroh@diw.de).
2. New IAB-SOEP Migration Sample (Sample M)
The new IAB-SOEP Migration Sample (Sample M) is a joint project with the Institute for Employment Research (IAB). It is therefore provided as part of the normal SOEP distribution (see, for example, variable psample in dataset ppfad), but also as a separate study including only Sample M households (10.5684/soep.iab-soep-mig.2013).
The new sample takes into account changes in the structure of migration to Germany since 1995. It covers not only direct immigration but also the “second generation,” the children of immigrants. The new sample opens up new perspectives for migration research and provides insights into the lives of new immigrants to Germany. The new sample has the following key features:
3. New datasets / variables
3.1. MIGSPELL
3.2. BDP_MIG
The original data from the Sample M specific survey instrument is included in the dataset BDPMIG, combining the individual and the biographical questionnaire. The variables are also included in the other standard or generated datasets:3.3. JOBEND$$
Since a number of changes occurred in the categories for reasons for job dismissal, a new longitudinally consistent variable (JOBEND$$) is now offered in the $PGEN data sets.
3.4. New additional occupation codes
The data on occupations in the individual questionnaire are now additionally coded using KldB2010 and partly also ISCO-08. The following variables are included in the dataset BDP:
| Varname | Variable Label |
| bdp38_kldb2010 | Current Occupational Classification (KldB2010) |
| bdp38_isco08 | Current Occupational Classification (ISCO-08) |
| bdp81_kldb2010 | Current Occupational Classification Secondary Employment (KldB2010) |
| bdp81_isco08 | Current Occupational Classification Secondary Employment (ISCO-08) |
| bdp9005_trainkldb2010 | Vocational Training / Education Degree Prev. Yr. (KldB2010) |
However, variables of derived scales (e.g. prestige scores in $PGEN) are still based on ISCO-88.
3.5. Grip strength data for 2012
GRIPSTR update: The data on grip strength from the survey year 2012 is now included in the GRIPSTR dataset.
3.6. Wealth data for 2012
PWEALTH and HWEALTH updated: In the year 2012, all individuals aged 17 and over were again surveyed on wealth, just as they were in 2002 and 2007. These “raw” data were already part of the standard data distribution for Wave 29 and will be included in the upcoming data distribution in a file containing the data for 2002, 2007, and 2012 in “long format”—the file PWEALTH for individual data, HWEALTH with data aggregated according to household context. Values that are missing due to item or partial unit non-response (e.g., missing interviews with individual household members in interviewed households) will be subjected to multiple imputations in complex procedures taking longitudinal information into account.
3.7. BIOEDU now part of the regular data distribution
After it became impossible to update the beta version of this data set in version 29, the data have now been updated and incorporated into the regular data distribution. The information from the new IAB-SOEP Migration Sample was also integrated.
3.8. INTERVIEWER dataset
The dataset comprises demographic and employment information about interviewers, aggregated data on the interviewers’ fieldwork in each wave, as well as personal details that they provided in the two interviewer surveys of 2006 and 2012. In the process of creating the INTERVIEWER dataset, all interviewer indicators (INTID) in all of the SOEP datasets were checked thoroughly and in some cases revised.
4. Revisions and Bug fixes
4.1. Corrections in BILZTCH$$ and BILZTEV$$
The variables BILZTCH$$ and BILZTEV$$ lacked information on a number of waves up to now. As a result, false values were ascribed to variables in a number of cases: a total of 638 cases previously classified as consistent proved to be inconsistent increases in educational level, and 2,582 cases previously classified as inconsistent proved to be consistent.
4.2. Corrections in DUEBSTD
Overtime work had already been generated for 1984 and 1985; it has now been generated for 1987 as well. For these years, overtime hours result from the difference between contractually agreed working hours and the number of hours actually worked per week.
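The difference rule described above can be sketched as follows. This is only an illustration, not the SOEP generation code: the function name is hypothetical, and two details are assumptions not stated in the text, namely that negative inputs are treated as missing codes and that negative differences are floored at zero.

```python
def overtime_hours(agreed_hours, actual_hours):
    """Weekly overtime as actual minus contractually agreed hours.

    Assumption: negative inputs are SOEP-style missing codes, so no
    overtime is derived for them. Assumption: a negative difference
    is reported as zero overtime.
    """
    if agreed_hours < 0 or actual_hours < 0:
        return None
    return max(actual_hours - agreed_hours, 0.0)


print(overtime_hours(38.5, 42.0))
```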
4.3. Revisions of marital and relationship status
$FAMSTD: As a result of a new process for generating BIOMARSM/Y and BIOCOUPLM/Y, two changes occurred in $FAMSTD. First, since 2010 the question on marital status has included the categories “registered same-sex partnership, living together” and “registered same-sex partnership, not living together”; these two categories are also included in $FAMSTD as values “7” and “8”. Second, all spells of BIOMARSM/Y in the category “widowed or divorced” have been set to “not valid” in $FAMSTD. These changes were also applied to previous waves. The variable $FAMSTD is set to -3 if information is implausible, to -5 if persons were not interviewed, and to -1 if persons did not answer the question.
BIOMARSM/Y: Because BIOMARSM is derived from the new version of BIOCOUPLM, we have copied the category “married, separated” from BIOCOUPLM. It reflects the time between a reported separation and divorce or the death of the spouse. Most of these spells of BIOCOUPLM were set to “married” in BIOMARSM, but for those spells without a reported end, event spells were set to “married, separated” and the end of the spells to missing. Parallel spells from the category “divorced or widowed” were added, with the start of those spells set to missing. Finally, a new category “added spell” has been introduced into the variable remark, which lets you distinguish between spells that have been edited (value 2) and spells that have been added (value 3). For further information, please see the new documentation on BIOCOUPLM/Y. The variable SPELLTYP is set to -3 if information is implausible.
4.4 $regtyp: conversion to urban / rural area
The new BBSR typology describes settlement structure and allows regions to be categorized into four types. However, using these four categories would make it possible to identify specific administrative districts (Landkreise) in the federal states of Saxony, Mecklenburg-Western Pomerania, and Baden-Württemberg. Therefore, we must use a condensed two-category classification: urban and rural areas.
The new data distribution (1984-2012) "SOEP v29" provides, for the most recent survey year 2012, the usual wave-specific data files BCPBRUTTO, BCP, BCPKAL, BCPGEN, BCPAGE17, BCHBRUTTO, BCH, BCHGEN, BCKIND, and BBPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors).
1. New subsample K
In 2012, we added a new refreshment sample with 1,526 new households (Sample K). In total, 12,322 households were interviewed as part of the 2012 fieldwork. As with previous general population samples, refreshment sample K was realized using a multi-stage stratified sampling design. Refreshment sample K achieved a response rate of 34.7%, very similar to that of our previous refreshment sample J. Thus, the general downward trend in participation was successfully stopped through a range of measures including centralized face-to-face interviewer training, better pay for interviewers, and more attractive incentives for respondents.
In the current refreshment samples, fieldwork is conducted exclusively by CAPI, as it was with the previous refreshments H (2006), I (2009), and J (2011). Similarly to our other refreshment samples, data collection is focused on three main questionnaires: the household, the individual, and the youth questionnaire. Thus, no supplementary questionnaires were used with respondents in wave 1. The reason for focusing on the key questionnaires is to avoid "overburdening" respondents with a lengthy wave 1 interview.
2. Revision of the weighting and estimation procedure
In version SOEP v29 of the SOEP data, the data from subsamples J and K (first collected in 2011 and 2012, respectively) have been adjusted to the German Microcensus for the number of employed people in households of different sizes as well as for the number of private households receiving Unemployment Benefit II (ALG II). This correction prevents an overestimation of households receiving ALG II in the unweighted samples J and K.
Also, for all newly drawn samples since 1998, a minor adjustment has been made to the definition of households containing foreign nationals. The criterion is no longer the household head but the presence of at least one person of foreign nationality in the household. The revision was made due to a slightly increasing discrepancy between the reference person chosen in the German Microcensus and the household head in the SOEP.
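The change in criterion can be illustrated with a small sketch. The record layout (`is_head`, `german`) and the function names are hypothetical and chosen only for this example; they are not SOEP variable names.

```python
def foreign_household_old(members):
    """Old criterion: the household head is a foreign national."""
    return any(m["is_head"] and not m["german"] for m in members)


def foreign_household_new(members):
    """Revised criterion: at least one member is a foreign national."""
    return any(not m["german"] for m in members)


# Hypothetical household: German head, foreign partner.
household = [
    {"is_head": True, "german": True},
    {"is_head": False, "german": False},
]
print(foreign_household_old(household), foreign_household_new(household))
```

A household like the one above is classified differently under the two criteria, which is the kind of case the revision is meant to capture.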
3. New datasets / variables
4. Improvements and Bug Fixes
| Value | Wave BB (2011) | Wave BC (2012) |
| 0 | Head Of Household | Head Of Household |
| 1 | Spouse Of HH Head | |
| 2 | Life Partner | |
| 3 | Son, Daughter | |
| 4 | Foster Child | |
| 5 | Son, Daughter-In-Law | |
| 6 | Father, Mother | |
| 7 | Parent-In-Law | |
| 8 | Brother, Sister,-In Law | |
| 9 | Grandchild | |
| 10 | Other Relative | |
| 11 | Non-Relative | Spouse Of HH Head |
| 12 | Child of HH-Heads Partner | Same-Sex Spouse |
| 13 | Same-Sex Spouse | Life Partner |
| 21 | | Son, Daughter |
| 22 | | Stepchild (Child of the Partner) |
| 23 | | Adoptive Child |
| 24 | | Foster Child |
| 25 | | Grandchild |
| 26 | | Great-Grandchild |
| 27 | | Son, Daughter-In-Law |
| 31 | | Father, Mother |
| 32 | | Step Father / Step Mother / Spouse of Father or Mother |
| 33 | | Adoptive Father or Mother |
| 34 | | Foster Father or Mother |
| 35 | | Parent-In-Law |
| 36 | | Grandparents |
| 41 | | Brother, Sister |
| 42 | | Half-Brother, Half-sister |
| 43 | | Stepbrother, Stepsister |
| 44 | | Adoptive Brother/Sister |
| 45 | | Foster Brother/Sister |
| 51 | | Brother, Sister -in Law (spouse of brother/sister) |
| 52 | | Brother, Sister -in Law (brother/sister of spouse) |
| 61 | | Aunt, Uncle |
| 62 | | Niece/ Nephew |
| 63 | | Cousin/Cousine |
| 64 | | Other Relative |
| 71 | | Others |
| 99 | Unknown | Unknown |
Please note that this also affects the corresponding variables in the dataset $KIND (and KIDLONG) and BIOPAREN.
With the integration of sample J in 2011, the biographical questionnaire was moved from the second to the first wave and combined with the individual questionnaire in an integrated survey. This means that there are some slight differences in the survey instrument between the old samples A-H and the supplementary sample J.
The following additional missing codes have been introduced to the survey data to document these possible differences:
| -4 | "Inadmissible multiple response" |
| -5 | "Not included in this version of the questionnaire" |
| -6 | "Version of questionnaire with modified filtering" |
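When working with these codes, it can help to separate them from valid values before analysis. The following is a minimal sketch that covers only the three codes listed above; the function name is hypothetical, and other negative codes used in the SOEP data are deliberately passed through unchanged.

```python
# Labels copied from the table above.
ADDITIONAL_MISSING_CODES = {
    -4: "Inadmissible multiple response",
    -5: "Not included in this version of the questionnaire",
    -6: "Version of questionnaire with modified filtering",
}


def split_value(value):
    """Return (valid_value, missing_label); one of the two is None."""
    if value in ADDITIONAL_MISSING_CODES:
        return None, ADDITIONAL_MISSING_CODES[value]
    return value, None


print(split_value(-5))
print(split_value(3))
```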
The SOEP Innovation Sample has been launched now and includes, inter alia, sample I. Sample I is therefore no longer part of the main survey as of 2011. See SOEP-IS on our website for further information about the Innovation Sample and the possibility of including your own questions.
3.1 BIOCOUPLM
BIOCOUPLM provides spell data on partnership histories from the first to last personal interview of a respondent. Spells are measured on a monthly basis.
3.2 BIOCOUPLY
BIOCOUPLY provides spell data on partnership histories. It contains annual information on partnership status since the respondent’s year of birth, including available retrospective data and annually updated information.
3.3 BIOSIB (beta version)
The new file BIOSIB provides information on siblings living in the SOEP households. The dataset contains the person numbers of all siblings in an observed family. It includes information on their gender, their year of birth, and on the relationship between the observed siblings.
BIOSIB is included as a beta version in the current data release. Please do not hesitate to send both positive and negative feedback or suggestions to Daniel Schnitzlein (dschnitzlein@diw.de).
3.4 BIOEDU
The BIOEDU dataset contains details on educational transitions beginning with entry into childcare up to tertiary education in a consistently structured form.
3.5 BIOAGE long
In the new integrated bioage long dataset (BIOAGEL), data are presented in “long” format, i.e. this dataset will contain information from BIOAGE01, BIOAGE03, BIOAGE06, as well as BIOAGE08a and BIOAGE08b.
3.6 TRUST
Dataset on the Economic Behavior Experiment on Trust and Trustworthiness in the 2003, 2004, & 2005 SOEP Survey
This experiment to measure trust is based on the investment game introduced by Berg et al. (1995), a one-shot game for two players or movers who anonymously interact with each other. The first mover receives an endowment of 10 points and can transfer zero to ten points to the second mover. Every point that is transferred is doubled by the experimenters. The second mover is also given an endowment of ten points. After receiving points from the first mover, he/she decides on how much of the endowment to transfer back to the first mover (zero to ten points). As with the first mover's transfer, the back-transfer by the second mover is doubled by the experimenters. After the second mover's decision, the game ends and the subjects are paid their income in euros (one point equals one euro) by check sent a few days later.
A fundamental component of the game is that the participants actually receive money in accordance with the fixed payout function, i.e., all the decisions always have monetary consequences. This version of the game was developed by Fehr, Fischbacher, Schupp, von Rosenbladt & Wagner (2002).
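The payout rule described above can be written out as a short sketch. The function and constant names are hypothetical; the endowment of ten points, the transfer range of zero to ten, and the doubling of transferred points are taken from the description.

```python
ENDOWMENT = 10   # points given to each mover
MULTIPLIER = 2   # transferred points are doubled by the experimenters


def trust_game_payoffs(transfer, back_transfer):
    """Return (first_mover_points, second_mover_points)."""
    for amount in (transfer, back_transfer):
        if not 0 <= amount <= ENDOWMENT:
            raise ValueError("transfers must be between 0 and 10 points")
    first = ENDOWMENT - transfer + MULTIPLIER * back_transfer
    second = ENDOWMENT - back_transfer + MULTIPLIER * transfer
    return first, second


# First mover sends 5 points, second mover sends back 3.
print(trust_game_payoffs(5, 3))
```

Since one point equals one euro, the returned points correspond directly to the amounts paid out.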
The combination of representative survey and behavioral experiment was used in the SOEP main surveys in 2003, 2004, and 2005, with only minor modifications. Of the 1,432 original participants in 2003, 1,202 also took part in the experiment in 2004 and 2005.
The data are available in long format in the "TRUST" dataset. Consequently, this dataset contains information from each of the three waves in which the behavioral experiment was conducted.
3.7 TIMEPREF
Dataset on the Economic Behavior Experiment on Time Preferences in the 2006 SOEP Survey
In this experiment on economic behavior, respondents were asked to decide how they would like to receive €200 in prize money: if they would rather receive it immediately by check, or if they would prefer to wait and receive a larger amount later—that is, with interest. By splitting the sample (N = 1,503 persons) into random subsamples (splits), it was possible to vary both the time horizon and the implied interest rate to test possible incentive effects on the choice between a low payoff in the short term and a high payoff in the long term. The scientific director of the project was Prof. Dr. Armin Falk, CENs, University of Bonn.
4.1 $HBRUTTO dataset
REGTYP$$:
The $HBRUTTO dataset will include a new variable to distinguish between urban, suburban and rural regions. This is based on the spatial categories of counties (as of December 31, 2009) used by the Federal Institute for Research on Building, Urban Affairs and Spatial Development (BBSR). The following spatial structure characteristics are used to define the categories:
Thus, three categories can be defined:
4.2 $PGEN dataset
BILZTCH$$ / BILZTEV$$:
BILZTCH$$ indicates whether the respondents’ answers suggest a downward shift in years of education or training ($BILZEIT) since the last observation or an upward change since the last year which is inconsistent with additional information on education or training recently completed.
BILZTEV$$ is a flag variable which indicates whether the respondent showed some inconsistent change in $BILZEIT either upwards or downwards over the entire observation period.
$VEBZEIT and $UEBSTD
To be consistent with the FID dataset, the missing values of the variables $VEBZEIT and $UEBSTD were slightly recoded, as the missing value –2 is now assigned to self-employed individuals. In previous waves, self-employed persons had the missing value –3 (implausible answer).
For $UEBSTD, the value –3 (implausible answer) is assigned to all individuals with more than ten hours of weekly overtime AND who also had an agreed working time of over 80 weekly hours ($VEBZEIT is implausible, value –3) or actual weekly working time of more than 80 hours a week ($TATZEIT is implausible, value –3).
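The recoding rules for $UEBSTD described above can be sketched as follows. This is an illustration only: the function and parameter names are hypothetical, inputs are assumed to be valid (non-missing) hours, and the actual generation procedure may include further checks.

```python
SELF_EMPLOYED = -2   # missing code now assigned to self-employed persons
IMPLAUSIBLE = -3     # implausible answer


def recode_overtime(overtime, agreed_hours, actual_hours, self_employed=False):
    """Apply the two rules from the text to a weekly overtime value."""
    if self_employed:
        return SELF_EMPLOYED
    # More than ten hours of overtime AND agreed or actual hours above 80.
    if overtime > 10 and (agreed_hours > 80 or actual_hours > 80):
        return IMPLAUSIBLE
    return overtime


print(recode_overtime(12, 85, 97))
print(recode_overtime(12, 40, 52))
```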
4.3 BIOPAREN dataset
Seven new variables have been added to BIOPAREN:
VAORT11 and MAORT11 indicate the father's and the mother’s current place of residence, respectively.
GESCHW, GESCHWUP, NUMS, NUMB and TWIN provide information on siblings. The variable GESCHW indicates whether the respondent had ever had any siblings at the time of the interview. GESCHWUP gives the year in which the sibling information was collected. NUMB and NUMS provide the number of brothers and sisters, respectively, that the respondent reports, and TWIN indicates whether any of these are twin siblings of the respondent (and of which type).
The release of the 1984-2010 SOEP data (waves A-BA) will contain the usual year-specific data files (BAP, BAH, BAPGEN, BAHGEN, BAPKAL, BAPBRUTTO, BAHBRUTTO, BAKIND and ZPLUECKE) and the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors). The respondents of Sample I (Incentives Sample) answered the biographical background questionnaire for the very first time in 2010.
Since minor changes have been made to many of the older datasets as well, we strongly recommend reinstalling all of the datasets from the new DVD.
1. New two-letter prefix (BA)
This SOEP data release (v27) will include, for the first time in the survey's 27 years, a two-letter rather than a single-letter wave prefix. Since we came to the end of the Latin alphabet with the letter Z in our last data release, we decided to use the wave prefix BA for the cross-sectional data format.
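The prefix scheme can be expressed as a small helper. The function name is hypothetical; the mapping 1984 = A through 2009 = Z and 2010 = BA follows these release notes, while continuing with CA after BZ is an assumption made only to keep the sketch well defined.

```python
import string


def wave_prefix(survey_year, first_year=1984):
    """Return the wave prefix for a survey year (1984 -> 'A', 2010 -> 'BA')."""
    index = survey_year - first_year
    if index < 0:
        raise ValueError("survey year precedes the first wave")
    letters = string.ascii_uppercase
    if index < 26:
        return letters[index]
    # Two-letter prefixes start at BA once the single letters are used up.
    index -= 26
    first = 1 + index // 26  # assumption: BZ would be followed by CA
    if first >= 26:
        raise ValueError("survey year outside the range covered by this sketch")
    return letters[first] + letters[index % 26]


print(wave_prefix(2009), wave_prefix(2010), wave_prefix(2013))
```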
2. Updated beta version in "long format"
The SOEP data are now also available in "long format" as a beta version in addition to the usual data format. SOEPlong refers to a compressed form of the SOEP data. Rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format). The data are available on the second DVD. For details, see SOEPnewsletter No. 90/2010 (PDF, 3.53 MB).
3. Elimination of fakes
When the data for the second wave of our newest sample I were checked, 36 households were identified as faked interviews and will therefore no longer be included in this data release.
4. New and renamed datasets
4.1 BIOAGE08[A|B]
The BIOAGE08 dataset contains data from the new "parent questionnaire" which is given to the mothers and fathers of seven- to eight-year-old children. Thus, data are now available on the 2002/2003 birth cohorts that were first observed with the "newborn questionnaire." The new "parent questionnaire" is given to both mothers and fathers and thus provides two sets of responses on many of the children in the sample. Therefore, the file was split into two on the basis of the parent's gender and the household type the respondent is living in. BIOAGE08A includes mothers as well as those fathers for whose children no information from the mother was available. BIOAGE08B includes fathers only. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).
4.2 LIFESPELL
The LIFESPELL dataset contains data from the follow-up studies of SOEP dropouts (1992, 2001, 2006, and 2008), which were not previously included in the regular data release. The follow-up studies, based on information from public registers, serve to identify the current residence of former SOEP respondents, and thus allow studies of life expectancy and decisions to emigrate for a large percentage of SOEP respondents, even long after they have dropped out of the study. The documentation of this new dataset is included as a new chapter in our documentation on biography and life history data in SOEP (coming soon).
4.3 BIOEDU (beta version)
The BIOEDU dataset, which is being released this year in provisional form (beta release), contains details on educational transitions beginning with entrance into childcare up to tertiary education in consistently structured form. Users who work with these data are requested to report on their experiences (especially any problems they might have), so that a final version can be released next year.
Detailed documentation is in the DIW Data Documentation 58 (PDF, 383.03 KB).
Because of its provisional form, this dataset is not part of the normal distribution and you will find the data on the DVD in an extra archive (link available in the NEWS file on the DVD).
5. New Variables
5.1 $PEQUIV dataset
In the $PEQUIV-files, there will be a new additional variable on support payments. With the 2010 questionnaire, the SOEP has split the item "support payments" into two separate items. The first one now collects information on "alimony from legal spousal support, child support, and child care support" (ALIM$$), while the second item asks about "advance child maintenance payments" (IACHM$$). More information about the $PEQUIV files and the new variables is available in the DIW Data Documentation 57 (PDF, 0.54 MB).
5.2 PFLEGE dataset
The PFLEGE file now includes two additional variables. "FURTHER" gives the number of further persons requiring help in the household. This question has been asked since 2009. "CARECOST" represents the regular monthly amount that a household normally spends on care. This question has been asked since 2010.
5.3 Dataset $PGEN
We are now providing detailed data on educational degrees and training qualifications prior to joining the panel: life course questionnaires have been distributed since 2001 to collect data on apprenticeship occupation, type of qualification (e.g., diploma), and the field of study for those who have obtained a degree. Up to now, data from these open-answer questions were not included in the data release. From now on, however, these data will be released in coded form. The classifications used for the data from the individual questionnaire have been slightly modified in the process of these revisions. A more detailed description is available within the PGEN documentation.
The new variables are
| FIELD$$ | Field of tertiary education |
| DEGREE$$ | Type of tertiary degree |
| TRAINA$$ | Apprenticeship, two-digit occupation KldB92 |
| TRAINB$$ | Vocational school, two-digit occupation KldB92 |
| TRAINC$$ | Higher vocational school, two-digit occupation KldB92 |
| TRAIND$$ | Civil servant training, two-digit occupation KldB92 |
| FDT_F$$ | Data source FIELD, DEGREE, TRAIN |
6. Revised Variables
6.1 $P dataset
Name changes to the variables in the different classifications for occupation and sector in $P: the variables contained in the $P datasets are collected in alternate years from all respondents and from those individuals who changed occupations. Simultaneously, we also generate and distribute all the information on all years and all individuals in the $PGEN datasets. To more clearly distinguish the generated variables from the originally surveyed variables, and to establish a clear connection to the question number in the respective questionnaire, we have renamed the variables according to the following system:
| Old variable name | New variable name |
| $IS88 | $pXX_IS88 |
| $KLAS | $pXX_KLAS |
| $NACE | $pXX_NACE |
Thus, ZIS88, for example, is now ZP29_IS88. The corresponding variables of all datasets from the individual questionnaire ($P) have been renamed, but not the variables recommended for use from the generated datasets ($PGEN, e.g., IS8809 from ZP).
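The renaming pattern can be sketched as a one-line helper. The function name is hypothetical; the question number has to be looked up in the questionnaire of the respective wave and is passed in explicitly here.

```python
def renamed_variable(wave_prefix, question_number, classification):
    """Build the new $P variable name, e.g. Z + P + 29 + _IS88."""
    return f"{wave_prefix}P{question_number}_{classification}"


# Example from the text: ZIS88 becomes ZP29_IS88.
print(renamed_variable("Z", 29, "IS88"))
```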
6.2 $HBRUTTO dataset
The coding of the variables identifying the federal state ($BULA, Bundesland) in which a household was included is now consistent with the coding of the official statistics.
| $bula (old codings) | $bula (new codings) |
| 0 Berlin | |
| 1 Schleswig - Holstein | 1 Schleswig-Holstein |
| 2 Hamburg | 2 Hamburg |
| 3 Niedersachsen | 3 Niedersachsen |
| 4 Bremen | 4 Bremen |
| 5 Nordrhein-Westfalen | 5 Nordrhein-Westfalen |
| 6 Hessen | 6 Hessen |
| 7 Rheinl.-Pfalz, Saarl. | 7 Rheinland-Pfalz |
| 8 Baden-Wuerttemberg | 8 Baden-Wuerttemberg |
| 9 Bayern | 9 Bayern |
| 10 Saarland | |
| 11 Berlin (Ost) | 11 Berlin |
| 12 Mecklenburg-Vorpommern | 12 Brandenburg |
| 13 Brandenburg | 13 Mecklenburg-Vorpommern |
| 14 Sachsen-Anhalt | 14 Sachsen |
| 15 Thueringen | 15 Sachsen-Anhalt |
| 16 Sachsen | 16 Thueringen |
A differentiation between East and West Berlin can still be achieved by a combination with $SAMPREG (Sample Region in $PPFAD).
6.3. Minor bug fixes
The 2010 data distribution (data for years 1984-2009) has comprehensive improvements, additions, and modifications. For the most recent survey year 2009, it also provides the usual wave-specific data ZPBRUTTO, ZP, ZPKAL, ZPGEN, ZPAGE17, ZHBRUTTO, ZH, ZHGEN, ZKIND and YPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).
1. Beta version in 'long format'
The SOEP data are being provided for the first time ever as a beta version in "long format" in addition to the standard data format. SOEPlong refers to a compressed form of the SOEP data: rather than being provided as wave-specific individual files, all available years and cohorts are pooled (long format).
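The idea of pooling wave-specific files can be illustrated with a simplified sketch. It assumes that each wave file carries the wave prefix at the start of its variable names and that stripping the prefix yields comparable names; the variable names and function name below are hypothetical, and the actual SOEPlong harmonization is considerably more involved.

```python
def pool_waves(wave_files, id_vars=("persnr",)):
    """Stack wave-specific rows into one list with a 'wave' column.

    wave_files maps a wave prefix (e.g. 'Z', 'BA') to a list of row dicts
    whose variable names start with that prefix.
    """
    long_rows = []
    for prefix, rows in wave_files.items():
        for row in rows:
            long_row = {"wave": prefix}
            for name, value in row.items():
                # Strip the wave prefix from everything except identifiers.
                if name not in id_vars and name.upper().startswith(prefix):
                    name = name[len(prefix):]
                long_row[name.lower()] = value
            long_rows.append(long_row)
    return long_rows


# Hypothetical input: one person observed in waves Z and BA.
wave_files = {
    "Z": [{"persnr": 101, "ZBILZEIT": 10.5}],
    "BA": [{"persnr": 101, "BABILZEIT": 12.0}],
}
for row in pool_waves(wave_files):
    print(row)
```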
2. New Subsample I
As part of the recent SOEP innovations, fieldwork began in fall 2009 on a new subsample (Sample I). The subsample is currently being used to test the effect of different incentive strategies on participation in the SOEP and it will become part of the innovation sample. See SOEPnewsletter 89 (PDF, 1.37 MB) for more on this new sample.
In four randomly assigned groups, the following strategies were used:
The data from the new Sample I has been included in the new release of SOEP data (SOEP, v26), but due to the particular features of the subsample, it does not have an integrated weighting framework with the rest of the SOEP samples. For subsample I we are conducting a mail survey of all non-participants in the four groups. Since this is the first wave of subsample I, we were not able to integrate the biographical information from Sample I into the existing biography files. The same applies to the biographical information in the dataset PPFAD, e.g. the variable MIGBACK is completely set to -2.
Apart from that the following additions and modifications have been made:
3. New and Renamed Datasets
4. New Variables
4.1 Dataset $HGEN
Two new variables describing the quality of the dwelling:
4.2 Dataset $PEQUIV:
4.3 Dataset $HBRUTTO - Calendar Year of Interview
5. Revised Variables
5.1 Datasets $HGEN
5.2 Datasets $KIND – KIDLONG
5.3 Datasets BIOMARSM/BIOMARSY
5.4 Dataset BIOTWIN
The dataset BIOTWIN contains 100 additional cases since wave Z. This considerable increase in case numbers is due to an adjustment in the data generation procedure: in contrast to the previous procedure, all siblings with an identical year of birth are considered twins under the condition that the information on the month of birth remains missing. This less restrictive procedure is based on the assumption that two separate births in a single calendar year remain rare occurrences. Nevertheless, the number of false positives in this group with a missing month of birth is likely to exceed the BIOTWIN average. Hence a new value label was introduced with the variable INFOTWIN in order to flag these twin groups for the user (Code "6": Coverage since 2007, congruent year of birth, missing month; see in contrast to this code "5": Coverage since 2007, congruent year & month of birth).
In its current state (wave Z) the dataset BIOTWIN covers 250 sets of twins and 5 sets of triplets.
infotwin:
[1] Twins - Not in 2006 (gen.)
[2] Twins - 2006 (Answer Not Verifiable)
[3] Twins - 2006 (Answer Refused)
[4] Twins - 2006 (Answer Validated)
[5] Twins - since 2007 (gen.)
[6] Twins - since 2007 (gen.)
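The distinction between codes 5 and 6 can be sketched for a pair of siblings as follows. This is a simplified illustration, not the actual generation procedure: the function name is hypothetical, codes 1 to 4 are ignored, and treating a month as "missing" when it is missing for either sibling is an assumption.

```python
def infotwin_code(sibling_a, sibling_b):
    """Classify a sibling pair given (birth_year, birth_month) tuples.

    Returns 5 for congruent year and month, 6 for congruent year with a
    missing month (None), and None if the pair is not flagged as twins.
    """
    (year_a, month_a), (year_b, month_b) = sibling_a, sibling_b
    if year_a != year_b:
        return None
    if month_a is None or month_b is None:
        return 6
    if month_a == month_b:
        return 5
    return None


print(infotwin_code((2003, 5), (2003, 5)))
print(infotwin_code((2003, None), (2003, 7)))
```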
5.5 Minor bugs fixed
The new dataset (Waves 1-25, 1984-2008) contains extensive improvements, additions, and modifications. In addition to the usual wave-specific data YPBRUTTO, YP, YPKAL, YPGEN, YHBRUTTO, YH, YHGEN, YKIND, and XPLUECKE and the updated files with a longitudinal component (PFAD files, biography files, spell data, and weighting factors), we now also provide, as a beta release, the data in a more user-friendly format called "SOEPlong".
We announced this in SOEPnewsletter 80/2008 and thank all those who provided input on this issue. The new and preliminary version of the SOEP data in long format can be obtained upon request. We suggest that this version be ordered only by "power users" who would like to work with us to improve data management. This version contains all data and thus can essentially already be used for final analyses. It is, however, a preliminary version, and we do not recommend the new format for inexperienced users. If new SOEP users want to work with the new format, they should at least be familiar with other panel datasets.
The most important improvements in the new data distribution are listed below:
1. New Datasets
1.1 Dataset BIOAGE06
The new data distribution contains the new file BIOAGE06. For the first time in 2008, it includes the information collected using the special mother-child questionnaire, usually from mothers of five- to six-year-old pre-schoolers. The data are thus on a birth cohort that was first “surveyed” in the year 2002/2003 with a special Newborn Questionnaire. The new data on pre-school-age children cover children’s height and weight, health, care situation, activities with and without the mother, and media usage. Detailed questions address the care situation. Furthermore, valid information is collected for the first time on the child’s personality (based on the “Big Five” personal traits indicator in the main questionnaire for adults) and socio-emotional behavior (surveyed with a modified version of the Strength and Difficulties Questionnaire).
1.2 Dataset MOVEDIST
We provide a new dataset on changes of residence. Based on the geo-coordinates at block level, we provide information about the distance (in meters) between the former and the present residence. However, the information is only available for moves since 2000 and is NOT available on this DVD! We distribute these data together with data on the spatial planning regions (ROR) on an extra CD-ROM. You need an extended data distribution contract including a data protection concept if you want to use this kind of data. After signing your contract extension, you will receive these data on CD-ROM (at no additional cost).
2 New Variables
2.1 Dataset PPFAD
2.2 Dataset PFLEGE
2.3 Dataset PBIOSPE
The data generation process has been updated completely but without changing the basic principles. Therefore, there are only a few barely discernible deviations in the main variables (due to slight changes in the consistency checks of the data). But there are a number of visible changes in the form of additional variables or additional values in already existing variables. A detailed description is available in our documentation on biography and life history data.
2.4 Dataset BIOPAREN
3 Revised Variables
3.1 Dataset PWEALTH and HWEALTH
In the year 2007, all individuals aged 17 and up were again surveyed on wealth, just as they were for the first time in 2002. These “raw” data were already part of the standard data distribution for Wave 24 and will be distributed with the upcoming data distribution in a file containing the data for 2002 and 2007 in “long format” – the file PWEALTH for individual data, HWEALTH with data aggregated according to household context. Missing values due to item or partial unit non-response (e.g., missing interviews with individual household members in interviewed households) will be subjected to multiple imputations in complex procedures taking longitudinal information into account. Documentation on this is under preparation. An initial analysis of the new wealth data for 2002 and 2007 is provided in: Joachim R. Frick and Markus M. Grabka. 2009. Wealth Inequality on the Rise in Germany. Weekly Report 5 (10), 62-73 (PDF, 383.22 KB).
3.2 Dataset $PEQUIV
3.3 Dataset HHRF/PHRF
3.4 Dataset $PGEN
3.5 Dataset $HGEN
The domicile-related variables in the wave-specific $HGEN files have been completely revised. New additions include the full imputation of missing values (due to item non-response) for the housing-related variables number of rooms, heating costs, gross rent excluding heating, as well as the newly generated variable on utility costs in addition to rent. Finally, “flag variables” show the imputation status, if relevant. Experienced SOEP users may also note the changes to various variable names in the file $HGEN.
3.6 Dataset PPFAD
The 2008 data distribution (1984-2007) provides, for the year 2007, the usual wave-specific data XPBRUTTO, XP, XPKAL, XPGEN, XHBRUTTO, XH, XHGEN, XKIND and WPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).
In the survey year 2006, a representative supplementary sample for all of Germany was added: refreshment sample H. Biographical background information was collected from respondents in sample H for the first time in 2007. This data has been fully integrated into all relevant biography files (BIOxxxx).
As part of the SOEP innovations projects, TNS Infratest Sozialforschung conducted a postal survey in December 2006 among former SOEP panel members from households that had been classified as final refusals in 2001-2004. As a byproduct, we were able to change the information on year of birth from missing to a valid value for 21 of these persons (more information can be found in the executive summary (PDF, 36.18 KB) of the TNS Infratest Methodenbericht).
Furthermore the following additions and modifications have been made:
A. New and Renamed Datasets
COGNIT06:
In the 2006 survey year, for the first time, short cognitive tests were carried out with a subsample of the SOEP. The goal was to employ a robust set of instruments that could be administered easily by trained interviewers in just a few minutes. Close to 80% of all persons chosen for participation in the cognitive test provided valid answers. Thus, for the first time, the SOEP now contains indicators of cognitive potentials for more than 5,500 persons, along with diverse educational information based on degrees and certifications. It is planned that the first repeat of the test will take place in the 2010 survey year. Detailed documentation and selection analyses can be found in Schupp et al. (2008) Erfassung kognitiver Leistungspotentiale Erwachsener im Sozio-oekonomischen Panel (SOEP), DIW Berlin, Data Documentation 32 (PDF, 447.63 KB).
PBR_EXIT and PBR_HHCH:
These two datasets replace the former dataset YPBRUTTO. This year, however, both variants are available.
MIHINC:
Multiply imputed dataset on monthly net household income for the years 1996 to 2007. The dataset is stored in long format (identifiers: hhnrakt, svyyear, mj; also called mim format within Stata). Each item non-response on net household income was imputed 10 times. More information can be found in HGEN.pdf (PDF, 0.64 MB).
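As a rough sketch of how a multiply imputed long-format file of this kind might be analysed, the following Python snippet computes a statistic within each imputation and then averages across imputations. The names hhnrakt, svyyear and mj come from the description above; the income field `hinc`, the reading of mj as the imputation number, and the toy records are assumptions made purely for illustration.

```python
from collections import defaultdict
from statistics import mean


def pooled_mean(records, value_key="hinc", imputation_key="mj"):
    """Average the per-imputation means (the pooled point estimate)."""
    by_imputation = defaultdict(list)
    for row in records:
        by_imputation[row[imputation_key]].append(row[value_key])
    per_imputation_means = [mean(values) for values in by_imputation.values()]
    return mean(per_imputation_means)


# Toy data (hypothetical): two households, one survey year, two imputations.
records = [
    {"hhnrakt": 1, "svyyear": 2007, "mj": 1, "hinc": 2000},
    {"hhnrakt": 2, "svyyear": 2007, "mj": 1, "hinc": 3000},
    {"hhnrakt": 1, "svyyear": 2007, "mj": 2, "hinc": 2000},
    {"hhnrakt": 2, "svyyear": 2007, "mj": 2, "hinc": 3400},
]

print(pooled_mean(records))
```

Analysing a single imputation on its own would understate the uncertainty introduced by the imputation; the sketch covers only the point estimate, not the pooled variance.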
B. New Variables
B.1 Dataset XPBRUTTO
B.2 Dataset $PEQUIV
B.3 Dataset $HGEN
C. Revised Variables
C.1 In the Dataset $PKAL
C.2 In the Dataset HHRF/PHRF
However, the weighting factors for the year 2007 are also based on (newest available) microcensus benchmark data from 2006; they are therefore only provisional with regard to the figures given for households and individuals in Germany.
C.3 In the Dataset $PGEN
D. Error Updates
D.1 In the Dataset VH and WH
Variable Label: Owner Of The Dwelling
Value | Wrong | Correct
-2 | Does not apply | Does not apply
-1 | No answer | No answer
1 | Self Owned Res. Property | Local Govt. Apt.
2 | Local Govt. Apt. | Co-Operative Apt.
3 | Co-Operative Apt. | Company Apt.
4 | Company Apt. | Private Owner
5 | Private Owner | Do Not Know
D.2 In the dataset $PGEN
The 2007 data distribution (1984-2006) provides, for the year 2006, the usual wave-specific data WPBRUTTO, WP, WPKAL, WPGEN, WHBRUTTO, WH, WHGEN, WKIND and VPLUECKE as well as the updated files with a longitudinal component (PFAD files, biography files, spell data and weighting factors).
In the survey year 2006, a representative supplementary sample for all of Germany was added: refreshment sample H. Detailed information on the integration of this sample and additional changes in both files using weighting and extrapolation factors can be found below (see item 2).
A further important change is the introduction of a new survey instrument for first-time respondents at the age of 17 years. These persons now receive an expanded youth questionnaire, which provides current information as a supplement to the biographical data already collected, thus rendering the previous individual questionnaire used for this group obsolete. This also means that the survey population for the standard individual questionnaire (stored in the files $P) has changed slightly: 17-year-olds have not been included since survey year 2006. (First-time surveying of sample H constitutes an exception: here the 17-year-olds have again been surveyed with the individual questionnaire, since the biographical survey in new subsamples starts only with the second wave.) The revised $NETTO variables and the file $PAGE17 are of interest in this context (see below).
The educational variables in the generated datasets ($PGEN) have been revised: the integration of vocational qualifications attained abroad has been improved, and the corresponding variables have been subjected to extensive testing for consistency. These variables will be described in greater detail below.
The information on twins in SOEP was validated by a special survey of "potential" twins, and is integrated into the dataset BIOTWIN.
This year as in all previous years, the variables contained in the file WPEQUIV (wave 2006) relating to previous year's income take into account the various structural changes in the tax and transfer system, using these as part of the basic informational framework for generating and simulating annual income. Not only do the changes in the 2005 tax rate (reduction of the top tax rate, personal exemption) play an important role here but also the new guidelines contained in the Old Age Income Act (Alterseinkünftegesetz). The introduction of Unemployment Benefit II (Arbeitslosengeld II) also plays an important role, along with the extensive changes in the transfer system it entails (Social Security, Rent Subsidy, etc.). The generated information on (previous) year's income from SOEP survey year 2006 has thus been subject to thorough testing for internal and external consistency.
This year, the data is being distributed for the first time on DVD. This means that the language of variable and value labels can be chosen even more easily: right in the SOEP data installation program. If you install the data in Windows Vista using our setup program, please follow the installation instructions on DVD.
The following additions and modifications have been made:
New and Renamed Datasets 2006
$PAGE17
From 2007 on, persons who have reached the age of their first individual SOEP interview (17 years) are not given the usual individual questionnaire but a special youth questionnaire. Wave-specific information not contained in the biographical data or other generated datasets (like $PGEN, HEALTH) is provided in the dataset $PAGE17. Youth questionnaire respondents are identifiable with the help of the new $NETTO code "17" (see also the changes in the $NETTO variables in PPFAD). More information can be found in the biography documentation on our homepage and on the new DVD.
DESIGN
Starting in 2007, the information on SOEP sample design previously compiled in the dataset VARIANZ (Spiess 2001) is now being disseminated in a revised and amended dataset DESIGN. Preliminary documentation can be found in designdoku.pdf (PDF, 57.99 KB) on our homepage and on the new DVD.
HEALTH
Starting with 2002, the SOEP health module in the individual questionnaire has been revised and put on a two-year replication period. In the HEALTH file, users find the generated SF-12 variables (measuring health-related quality of life) as well as variables on height and weight with imputation flags and a user-friendly, longitudinally checked generated variable of the Body Mass Index (BMI). More information can be found in health.pdf (PDF, 110.01 KB) on the SOEP homepage or on the new DVD.
PWEALTH and WEALTH
The wealth data asked in 2002 were thoroughly revised and checked for inconsistencies. The data are now provided in two (multiply) imputed datasets for the individual and the household level, with the corresponding flag variables for identification of the imputed values. The two datasets also each contain a generated variable on "net wealth" (see SOEPpapers No. 18 (PDF, 0.76 MB)).
Interviewer Survey
The interviewer dataset (available up to 2006 only as a "stand-alone" version) is now integrated into the standard data distribution under the name INTVIEW and thus provided in the different software formats (SAS, SPSS, STATA).
Cross-Sectional Weighting Scheme 2006
With the 2006 data distribution, important changes have been made in the cross-sectional weights. They are described in detail (in German) in the DIW Data Documentation 22 (PDF, 310.7 KB).
1. Types of Weighting Factors Redefined
Each cross-sectional weight is designated $xHRFy. Here, $ represents the wave identifier, x the differentiation between households (x = H) and persons (x = P) and y an additional identifier that describes the type of weighting factor.
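The naming pattern can be illustrated with a small Python helper. This is only a sketch of the $xHRFy convention described above; the function name `weight_name` and its arguments are hypothetical, and the suffixes used in the example ("1", "ALL") are taken from weight names mentioned elsewhere in these notes.

```python
def weight_name(wave, unit, kind=""):
    """Compose a cross-sectional weight name following the $xHRFy pattern.

    wave: wave identifier (the "$"), e.g. "W" for 2006
    unit: "H" for households or "P" for persons (the "x")
    kind: additional identifier for the type of weighting factor (the "y")
    """
    if unit not in ("H", "P"):
        raise ValueError("unit must be 'H' (households) or 'P' (persons)")
    return f"{wave}{unit}HRF{kind}"


print(weight_name("W", "P", "1"))    # WPHRF1
print(weight_name("W", "H", "1"))    # WHHRF1
print(weight_name("W", "P", "ALL"))  # WPHRFALL
```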
2. Modifications to the External Information Used in the Weighting Scheme
Since the year 2005, the Federal Statistical Office no longer provides data differentiating between East and West Berlin. This has led to minor retrospective changes in the external information on the number of households since survey year 2005.
3. New Refreshment Sample H
In 2006, SOEP expanded to include another sample: refreshment sample H. The new households, which are representative for Germany as a whole, were also included in the weighting scheme. The integration of sample H is currently underway. Tests are still being conducted to determine if and how sample H can be further adapted to the external information. This is not a serious problem since in any case, for descriptive analyses we recommend the use of the weighting factors WxHRF1, which exclude sample H.
4. Weighting Factors are Based on Benchmark Data from the 2005 Microcensus
The weighting factors for the year 2006 are based on microcensus benchmark data from 2005; they are therefore only provisional with regard to the figures given for households and individuals in Germany. Please address any questions to Martin Kroh.
BIOAGE01 and BIOAGE17 2006
1. BIOAGE01
Four new variables on pregnancy status have been generated, based essentially on the month of the interview from $P and the month and year of the child's birth, as well as the duration of pregnancy in weeks from BIOAGE01.
BCPREGY 'Mother: pregnant at the time of individual interview wave ($)?'
Value Labels:
2002 | Pregnant at Time of Personal Interview 2002
2003 | Pregnant at Time of Personal Interview 2003
2004 | Pregnant at Time of Personal Interview 2004
2005 | Pregnant at Time of Personal Interview 2005
2006 | Pregnant at Time of Personal Interview 2006
2007 | Pregnant at Time of Personal Interview 2007
BCPREGMO 'Mother: estimated month of pregnancy at the time of individual interview, wave($)'
Value Labels:
1 | First Month of Pregnancy
2 | Second Month of Pregnancy
3 | Third Month of Pregnancy
4 | Fourth Month of Pregnancy
5 | Fifth Month of Pregnancy
6 | Sixth Month of Pregnancy
7 | Seventh Month of Pregnancy
8 | Eighth Month of Pregnancy
9 | Ninth Month of Pregnancy
10 | Last Month of Pregnancy or after Birth
Furthermore the beginning and end of pregnancy are also available as spell data. Analogously to BIOMARSM, for example, we start counting with month 1 (January 1983), such that December 2007 is month 300. The data are generated based on month of birth and duration of pregnancy in weeks from BIOAGE01.
PREGBEGM 'Spell - Month beginning of pregnancy / conception (1 = Jan 1983)'
PREGENDM 'Spell - Month end of pregnancy / Birth (1 = Jan 1983)'
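The month counting used for these spell variables can be sketched in Python as follows. The function names are hypothetical; only the convention itself (month 1 = January 1983, so that December 2007 is month 300) is taken from the text above.

```python
def spell_month(year, month, base_year=1983):
    """Convert a calendar year and month into the running month index (1 = Jan 1983)."""
    if not 1 <= month <= 12:
        raise ValueError("month must be between 1 and 12")
    return (year - base_year) * 12 + month


def from_spell_month(index, base_year=1983):
    """Convert a running month index back into a (year, month) pair."""
    years, month0 = divmod(index - 1, 12)
    return base_year + years, month0 + 1


print(spell_month(1983, 1))    # 1
print(spell_month(2007, 12))   # 300
print(from_spell_month(300))   # (2007, 12)
```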
2. BIOAGE17
You will find detailed information on the structure and the content of the dataset in the documentation of the biographical data on our SOEP homepage or the DVD.
$HGEN 2006
NUTS1$$
In addition to the Bundesland (federal state) variable, starting this year, the corresponding NUTS (Nomenclature of Territorial Units for Statistics) Level 1 Variable is also provided. This variable is generally identical with $BULA in $HBRUTTO but without pooling Rheinland-Pfalz/Saarland (from 2000 on) and without differentiating between East and West Berlin.
$PGEN 2006
1. New Variables
JOBCH$$
A variable for identification of job change was generated to supplement ERWTYP$$ (and eventually to replace it). The categories for this variable are independent of whether the information was obtained in a first-time or a subsequent interview. For respondents to a subsequent interview, JOBCH$$ refers to job changes since the last interview, and for first-time respondents, it refers to job changes since the beginning of the previous year. Respondents who started their first job and respondents who made a job change are reported separately. In contrast to ERWTYP$$, JOBCH$$ has been subjected to a check for longitudinal consistency. Cases showing inconsistencies (such as duplicate entries of the same job change in two subsequent interviews) have been corrected.
Value Labels:
1 | Not Employed
2 | Employed No Change
3 | Employed No Info If Change
4 | Employed With Change
5 | First-Time Employed
2. Revised Variables
GERWZEIT, HERWZEIT
For the years 1990 and 1991, values for job tenure are now provided for sample C (East) as well. Given the potentially limited comparability due to the East German transformation process, this data should be handled with particular care.
$ERWZEIT
Job tenure has been tested for longitudinal consistency due to repeated evidence of inconsistencies. Cases that proved longitudinally inconsistent were corrected using the following procedure:
From the longitudinally consistent start of employment with current employer, we determine the duration of job tenure. When a respondent who started working again after a break can be assumed to have returned to his or her former employer, the full duration of job tenure is taken. The period of the break in employment is then not subtracted, potentially resulting in an implicit overestimation of firm-specific human capital.
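The rule described above might be sketched as follows. This is a simplified illustration, not the actual generation procedure: the function name, the month-level inputs, and the choice to report tenure in years with one decimal are all assumptions.

```python
def job_tenure_years(start_year, start_month, interview_year, interview_month):
    """Tenure from the consistent start with the current employer to the interview.

    Breaks in employment are deliberately not subtracted, mirroring the rule
    described above for respondents assumed to have returned to their former
    employer.
    """
    months = (interview_year - start_year) * 12 + (interview_month - start_month)
    if months < 0:
        raise ValueError("interview date lies before the start of employment")
    return round(months / 12, 1)


# Hypothetical respondent: started in March 1995, interviewed in March 2006.
print(job_tenure_years(1995, 3, 2006, 3))  # 11.0
```

Because the break is ignored, a respondent who paused for two years and then returned is treated the same as one who worked continuously, which is the overestimation mentioned above.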
AUSB$$
Since 1999, the required job training variable has distinguished between studies at universities and technical colleges, and now, different categories have also been created for the years prior to and since 1999. For the years since 1999, separate categories have been introduced explicitly differentiating among these different kinds of educational qualifications. Furthermore, technical colleges and technical schools are now designated separately.
AUSB$$ 'required job training'
Value Labels:
1 | No Training
2 | Introduction to Job
3 | On-The-Job Training
4 | Courses
5 | Vocational Training
6 | Technical School, Engineering (East) 1990-96
7 | Technical College or University, up to 1998
8 | Technical College, since 1999
9 | University, since 1999
MPS$$
For waves U, V, and W, values for Wegener's Magnitude Prestige Scale have been added for respondents without a household interview ($NETTO=19).
ERWTYP$$
For the employment type variable, the old categories have been maintained but due to the common value for first-time job holders and those who have made a job change, the label for this category has been changed. Thus, the label 'employed, with change or first time employed' is now applied to the value 6.
3. Update Educational Variables
Thanks to our users, an error was identified in the generation of the educational variables in $PGEN, which had crept in some time ago in the process of retrospective generation for the years 2000 and 2001 and continued on since then. The error was in the variable $PBBIL02, and consisted in assigning foreign university degrees too high a value. The error came about through the integration of the variables $PBBILA and $PBBIL02 in these two years. All educational degrees have therefore now been generated again retrospectively for the years 2000 to 2006. The resulting variables $BILZEIT, ISCED$$ and CASMIN$$ have also been updated retrospectively from 2000 on.
PPFAD 2006
Revision of the $NETTO Codes
$NETTO
With this year's wave W (23rd survey wave), 2006, the compilation of data on the survey population has changed fundamentally. Previously, an individual interview was carried out with all household members above the age of 16. As of 2006, the regular individual interviews based on the standard adult questionnaire are introduced one year later when household members reach the age of 18. Seventeen-year-olds instead receive an expanded youth questionnaire in their first year as SOEP respondents. (This applies to the old samples A-G; for the new sample H, distribution of this youth questionnaire will start next year, while this year's 17-year-olds have received the regular individual questionnaire, in line with the old system).
This means that we now have two instruments instead of one to obtain data on respondents: the individual and the youth questionnaire. To ensure a consistent differentiation over time, it will therefore be necessary either to include the youth population of the current year or to increase the age limit for all previous years.
The newly revised $NETTO variable assists retrospectively in both differentiations for the entire survey period. The connection between survey population and survey instrument can be retraced with the help of the variable $NETTO in PPFAD or $HNETTO in HPFAD. As a result of the change in the survey population as well as the expansion of the survey instrument to include detailed information on biographical contexts, the corresponding variable $NETTO in PPFAD has been fundamentally revised and is now provided as a two-digit variable. To ease the transition to the new variable, the old one-digit variable is still provided as well under a different name $NETOLD; the variable $HNETTO in HPFAD is unaffected by this and remains unchanged.
Value Labels:
10 | Respondent Completed Interview
11 | Individual Questionnaire
12 | Individual Questionnaire and Biography
13 | Individual and Youth Questionnaire
14 | Individual and other Questionnaires
15 | Individual Questionnaire and Experiments, Tests
16 | Individual Questionnaire, First-Time Respondent, Age 17
17 | Youth Questionnaire, First-Time Respondent, Age 17
19 | Individual Questionnaire without Household Interview
20 | Children in Household Interviewed ($KIND)
21 | Children with Mother-Child Questionnaire I, Age 0-1
22 | Children with Mother-Child Questionnaire II, Age 2-3
30 | Persons in successfully interviewed household without Individual Interview
31 | Completed Gap Interview ($LUECKE)
32 | Completed Biography Questionnaire
33 | Successful Youth Questionnaire
34 | Successful Tests and Experiments
60 | Only Questionnaire without Individual or Household Interview
61 | Gap Interview without household reference
62 | Gap Interview with drop out
70 | Only Participation in Tests, Experiments, etc.
80 | Individual did not withdraw from panel population
81 | Previous respondent lacking current information
89 | Repatriate - (was Drop Out)
90 | Individual Dropouts $YPBRUTTO
91 | Moved abroad
99 | Died
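To show how the two-digit codes might be used to separate survey instruments, here is a minimal Python sketch. The grouping of codes is an interpretation read off the value labels above, not an official rule, and the field name `wnetto` and the toy records are hypothetical.

```python
# Grouping read off the value labels above (an interpretation, not an official rule).
INDIVIDUAL_QUESTIONNAIRE = {10, 11, 12, 13, 14, 15, 16, 19}
YOUTH_QUESTIONNAIRE = {17}


def instrument(netto_code):
    """Classify a two-digit $NETTO code by the main survey instrument."""
    if netto_code in INDIVIDUAL_QUESTIONNAIRE:
        return "individual"
    if netto_code in YOUTH_QUESTIONNAIRE:
        return "youth"
    return "other"


# Toy records (hypothetical): person identifier and $NETTO code for wave W.
persons = [
    {"persnr": 101, "wnetto": 11},
    {"persnr": 102, "wnetto": 17},
    {"persnr": 103, "wnetto": 20},
    {"persnr": 104, "wnetto": 19},
]

with_individual = [p["persnr"] for p in persons if instrument(p["wnetto"]) == "individual"]
print(with_individual)  # [101, 104]
```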
$NETOLD
In the new variable $NETOLD, the old $NETTO code can still be used. Persons at the age of 17 who either filled out a youth questionnaire (n=307) or an individual questionnaire (sample H, n=31) are each coded with the value of 1. In this way, the selection on (WNETTO == 1 | WNETTO == 5) is not identical with the population in WP.
$PEQUIV 2006
1. New Variables
ALG2$$: Sum of all transfers from Unemployment Benefit II (Arbeitslosengeld II) received by the household.
FALG2$$: Flag to identify the imputation of Unemployment Benefit II (ALG2$$).
IDEMY$$: Sum of indemnity payments received in the previous year.
FDEMY$$: Flag to identify the imputation of indemnity payments (IDEMY$$).
ITRAY$$: Sum of commuting and travel grants received in the previous year.
FTRAY$$: Flag to identify the imputation of commuting and travel grants (ITRAY$$).
2. Revised Variables
I11105$$
The variable (rental value of personally used living space = imputed rent) has been generated up to now only for persons living in owner-occupied housing. In line with recent research findings and also European Community guidelines for the generation of imputed rent in EU-SILC, this fictitious income advantage is now generated for persons in rental households as well who claim to pay below-market rental prices. These include people in rent-free housing, in socially subsidized housing, and in rental properties offered at a special rate (company dwellings, apartments provided by relatives at reduced rent, etc.).
W11101$$ and W11102$$
Due to the changes to the weighting factors in the files PHRF and HHRF, the variable W11101$$ now contains the individual weighting factor $PHRF1 (from the file PHRF) and the variable W11102$$ now contains the household weighting factor $HHRF1 (from the file HHRF).
First-time SOEP respondents show a significantly higher rate of item non-response in their first wave, which cannot be corrected adequately through imputation. For this reason, these two weights do not take into account the first wave of each new SOEP subsample. Furthermore, high-income subsample G has been excluded from the weighting scheme in order to prevent structural breaks in the analysis of income with vs. without this subsample. These two weighting variables are thus particularly well suited to a consistent time series of income inequality analysis.
W11105$$
The variable W11105$$ now contains the individual weighting factor $PHRFALL (from the file PHRF). This weighting variable takes into account all SOEP subsamples.
E11105$$
The content of the variable E11105$$ is now based on the ISCO88 International Standard Classification of Occupations.
E11106$$ and E11107$$
The variables E11106$$ and E11107$$ now provide information on sector affiliation in the form of a one or two-digit number according to the NACE scheme, the International Standard Industrial Classification of all Economic Activities.
3. Deleted Variable
W11106$$ 'HH-Weight immigrant sample'
BIOBIRTH; BIOBRTHM
KIDMON[n]
With wave W, the birth biographies of men (BIOBRTHM), like those of women (BIOBIRTH), include not only the year of birth (KIDGEB[n], with n = 1...15) but also the month of birth for each child (KIDMON[n]). This birth month is identical with the child's birth month given in PPFAD.
BIOTWIN
In 2006, a separate survey was carried out in all households with twins. This twin survey had the goal of validating the data on all twins in SOEP and gaining new information. The following variables have been changed or added in BIOTWIN as a result:
BIOMONOZ
The variable BIOMONOZ differentiates between identical and fraternal twins based on a question asked to first-time respondents. This information used to be obtained through a question asking whether the twins were of the same or different sexes. New codes have been introduced for the variable BIOMONOZ to reflect the improved information available. The values are thus no longer compatible with those from prior to wave W contained in variable BIOMONOZ in the dataset BIOTWIN.
INFOTWIN
The variable INFOTWIN has been introduced. This variable indicates whether information on twins was given in the 2006 twin survey, whether the information was derived from previously existing SOEP data, and whether previously existing data on the twins coincides with the results of the twin survey.
EGP$$
The variable "Erikson and Goldthorpe Class Category" (International Socio-Economic Index of Occupational Status) has been corrected with regard to the categorization of freelance academics, who were previously grouped together with the self-employed (values of 5 or 6). The corrected generation process assigns academic freelancers to the upper service class, which corresponds to a value of 1.
The 2006 SOEP data distribution (1984-2005, Waves A-V) includes the usual wave-specific data VPBRUTTO, VP, VPKAL, VPGEN, VHBRUTTO, VH, VHGEN, VKIND and UPLUECKE, as well as updated versions of all datasets with a longitudinal component (spell data, biographical data, and weights).
The first CD-ROM contains, as usual, all SOEP data with variable labels and value labels in German, and the second contains all SOEP data with variable labels and value labels in English.
Please also note the following improvements and changes:
New and renamed datasets 2005
With the current data distribution, we renamed all SOEP datasets based on age-specific biographical questionnaires (e.g., "Mother and Child") in a more consistent manner. Since all these datasets are saved in long format, the names now start with "BIOAGE" and a two-digit suffix. This suffix gives the maximum age of the individuals in question during the survey year.
BIOAGE01
New name for the dataset previously known as BIOCHILD (based on the questionnaire for mothers with a newborn child below the age of 15 months).
BIOAGE03
New dataset based on mother-and-child questionnaire for mothers with a child between the ages of 2 and 3 years. For further information, please see the biographical data documentation.
BIOAGE17
New name for the dataset previously known as BIOYOUTH (based on a survey of adolescents between 16 and 17 years old).
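The renaming scheme can be summarised in a short Python sketch. The helper `bioage_name` is hypothetical; the old and new dataset names are the ones listed above.

```python
def bioage_name(max_age):
    """Build a dataset name from the prefix "BIOAGE" and a two-digit maximum age."""
    if not 0 <= max_age <= 99:
        raise ValueError("max_age must fit into two digits")
    return f"BIOAGE{max_age:02d}"


# Old dataset names and their replacements, as listed above.
RENAMED = {
    "BIOCHILD": bioage_name(1),
    "BIOYOUTH": bioage_name(17),
}

print(RENAMED)         # {'BIOCHILD': 'BIOAGE01', 'BIOYOUTH': 'BIOAGE17'}
print(bioage_name(3))  # BIOAGE03 (new dataset, no predecessor)
```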
Weighting 2005
The 2005 cross-sectional weights are provisional; an update of VPHRF and VHHRF will be released in fall 2006.
The wave-specific projection and weighting variables will be adjusted annually to external official data to ensure the accuracy of marginal distributions on age, sex, household size and nationality. The source of the data is the German Federal Statistical Office's official microcensus. From 2005 on, the data on Berlin will no longer be reported separately for the areas comprising former West Berlin / East Berlin; rather, Berlin will be considered part of East Germany. As a consequence, the data required to adjust our weights to the official marginal distributions will not be available before fall 2006.
To prevent this from causing a delay in the distribution of the SOEP data up to Wave V (2005), the weights (VPHRF* and VHHRF*) have been adjusted to the data used for Wave U (2004).
From our experience, there is a very low deviation in the benchmark data over the years (the new definition for West Berlin / East Berlin being one exception). Please keep in mind the provisional nature of the weighting scheme, and indicate this explicitly in any publications using the weights for Wave V. We will inform you as soon as the final version, based on the 2005 microcensus data, becomes available via the SOEP NEWSLETTER and listserver.
$HGEN 2005
AHINC$$
The adjusted screener (AHINC$$) is now available for all waves (Exception: Sample C in 1990/1991).
$PGEN 2005
ALLBET$$ (new)
Raw categories for the size of the company. A consistent variable over all waves for the size of the company ("least common denominator" of the variable BETR$$).
Categories:
BETR$$ (revised):
The variable BETR$$ now has eleven instead of nine categories. The reason is the more detailed questions from Wave V onwards. The old category "5 to 20 employees" is now split into two categories ("5 to 10 employees" and "11 to 20 employees").
The new categories are:
TIP: The variable ALLBET$$ in the dataset $PGEN offers consistent data on company size throughout all waves of the SOEP, although with fewer categories in a less detailed classification.
EMPLST$$ (new):
Employment Status. A consistent variable over all waves to differentiate employment status (in addition to the variable LFS$$, which differentiates non-employed persons).
Categories:
EXPFT$$ (new):
Working experience full-time employment. Coverage of complete working experience in full-time employment (in years, one digit after the decimal point).
EXPPT$$ (new):
Working experience part-time employment. Coverage of complete working experience in part-time employment (in years, one digit after the decimal point).
EXPUE$$ (new):
Unemployment experience. Coverage of unemployment experience throughout the entire period of working life (in years, one digit after the decimal point).
Contact: Silke Anger
$PEQUIV 2005
SSOLD$$ (new):
Social assistance to the elderly ("Grundsicherung im Alter").
FSSOLD$$ (new):
Imputation flag: Social assistance to the elderly.
LOSSR$$ (new):
Losses from renting and leasing.
FLOSSR$$ (new):
Imputation flag: losses from renting and leasing.
LOSSC$$ (new):
Losses from capital investment.
FLOSSC$$ (new):
Imputation flag: losses from capital investment.
D11112LL (new):
Race of individual
D11110$$ (erased):
Data already included in the variable M11124$$.
D11111$$ (erased):
Data already included in the variable M11125$$.
Contact: Markus Grabka
Bug fixes
Correction of [T-U]HPOP in HPFAD.
Correction of some individual and household weights for the years 2003 and 2004 (THHRF, UPHRF, and UHHRF).
PPFAD 2004
LOC1989
The basic demographic information in PPFAD has been expanded to include location of residence in 1989, i.e., where an individual lived when the Berlin wall fell (variable LOC1989). This information is differentiated into the categories "East Germany", "West Germany", and "Abroad" and is available for all respondents (adults and children, see further documentation in Biography and Life History Data).
Contact: Joachim R. Frick
PGEN 2004
LABGRO$$ and LABNET$$
New variables have been generated for all waves (A-U) providing information on monthly gross and net labor income (LABGRO$$ and LABNET$$), consistently declared in euro. Missing values in case of item non-response are imputed as indicated by the corresponding imputation flag variables IMPGRO$$ and IMPNET$$ respectively (see also additional documentation in PGEN.PDF (PDF, 232.61 KB)).
Contact: Markus Grabka
HGEN 2004
HINC$$
$HGEN now includes the monthly net household income consistently named (HINC$$) and declared in euro over all waves (A-U).
Contact: Jan Goebel or Peter Krause
AHINC$$
A new variable has been generated for waves L-U (1995-2004) providing information on monthly net household income adjusted for possible underreporting (AHINC$$), also consistently declared in euro. Possible underreporting is checked with the help of the current individual incomes of all household members (see also additional documentation in HGEN.PDF (PDF, 61.43 KB)).
Contact: Jan Goebel or Peter Krause
$PEQUIV or SOEP-CNEF 2004
M11101$$-M11127$$
The files $PEQUIV now also include a set of cross-nationally harmonized health-related variables M11101$$-M11127$$ (see also the additional documentation in the Codebook for the $PEQUIV File 1984 - 2004 (PDF, 0.55 MB)).
Contact: Markus Grabka
The data of the German SOEP (100% version) are distributed on three CD-ROMs covering the years 1984-2003. New data sets for the survey year 2003 are the usual wave-specific data TPBRUTTO, TP, TPKAL, TPGEN, THBRUTTO, TH, THGEN, TKIND and SPLUECKE. There are also updates of data sets with a longitudinal component (biographical data and weights). The information collected for the first time in 2003 in the biographical questionnaire for sample G ("high-income sample") has been completely integrated into the user-friendly biographical data sets.
As of this year, the data on CD-ROM #2 also contains all SOEP data with variable labels and value labels in English (including the data from the 1988 financial statement in file EV).
In addition, we have made the following additions and changes:
Sample G "High Income Sample" (Start 2002)
The revised sampling design, using a higher income threshold, results in a smaller number of observations in wave 2.
Contact: Jürgen Schupp
HHRF and PHRF 2003
The standard weighting variables for waves S and T (SPHRF, TPHRF or SHHRF, THHRF) are based on sub-samples A-F, that is, without considering high-income sample G. In addition, we now offer a new integrated weighting variable for all sub-samples A-G (variables $PHRFAG or $HHRFAG, see also documentation (PDF, 267.43 KB) on the integrated weights for A-G vs. A-F).
Contact: Martin Kroh
Rectypes 2003
1. BIOCHILD: Information from the 'Mother and Child Questionnaire'
In this new file, information on newborns in the SOEP will be collected each year from now on (see further documentation in Biography Data).
Contact: Jürgen Schupp
2. BIORESID: Information on second residence in the first interview
The data set BIORESID includes information on length of residency, and on second residence. The information comes from the biographical questionnaire, which has consistently contained questions on this since 1994 (see further documentation in Biography Data).
Contact: Thorsten Schneider
3. BIOBRTHM: Birth biography information for men - from 2001 on
This new data set includes information on the birth biographies of men interviewed with this modified questionnaire since 2001. BIOBRTHM is structured analogously to BIOBIRTH, based on a question formerly only answered by women (see further documentation in Biography Data).
Contact: Christian Schmitt
4. BIOTWIN: data for identifying births of twins, triplets, etc.
BIOTWIN includes all identifiable births of twins, triplets, etc. in the SOEP. Identifiers (PERSNR) for the mother and siblings are included (see further documentation in Biography Data).
Contact: Jürgen Schupp and Christian Schmitt
5. HBRUTT98:
This new file contains the complete gross population of sample E in the year 1998. It is useful in attrition analysis of the first wave of this sample.
Contact: Peter Krause
BIOPAREN 2003
Variables on the nationalities of parents have been corrected (see further documentation in Biography Data).
Contact: Jürgen Schupp
PGEN 2003
MODE$$ and MONTH$$
Two new variables have been generated for all previous waves to describe interview method and month (MODE$$ or MONTH$$). See also additional documentation (PDF, 0.66 MB).
Contact: Jürgen Schupp
$PSBIL
Update of $PSBIL: For foreigners, the category "leave without graduating" [code 6] had to be updated in 2000, which in turn made it necessary to update $BILZEIT, ISCED$$ and CASMIN$$.
Contact: Bettina Isengard and Peter Krause
$FAMSTD
The variable for marital status has been updated.
Contact: Christian Schmitt
HGEN 2003
HMODE$$ and HMONTH$$
Two new variables were generated for all previous waves to describe interview method and month (HMODE$$ or HMONTH$$). See also additional documentation (PDF, 0.64 MB).
Contact: Jürgen Schupp
PPFAD 2003
GEBMONAT
The central demographic information in PPFAD has been expanded to include the month of birth (variable GEBMONAT). This information is now collected for all adults and children as well (see further documentation in Biography Data).
Contact: Christian Schmitt
Update of EINTRITT, ERSTBEFR, AUSTRITT, LETZTBEF (see further documentation (PDF, 0.53 MB)).
Contact: Peter Krause
BIOBIRTH 2003
The information on women's birth biographies was expanded to include information from the Youth Questionnaire, which is given to 16-17 year-olds being interviewed for the first time instead of the standard biographical questionnaire (see further documentation in Biography Data).
Contact: Christian Schmitt
BIOIMMIG 2003
This data was corrected to fix a case of miscoding in past years that occurred due to a reversal of the item sequence. This applies to the variables BIEXPRLV, BIEXPRAC and BIEXPRAN (see further documentation in Biography Data).
Contact: Jan Goebel
PFLEGE 2003
The new variable PNRCARE is now available for the years since 1999, that is, for waves P - T. PNRCARE is a time-invariant number identifying the primary caregiver in a household. In three cases, the person identified as caregiver was identical with the person being cared for. In these cases, PNRCARE was set to -3 (implausible value). For the waves prior to 1999, PNRCARE has been assigned the value -2.
Contact: Rainer Pischner
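The special codes of PNRCARE described above need to be filtered out before the variable is used as a person identifier. The following is a minimal sketch, assuming that every non-positive value is a special code rather than a person number; the helper names and the example values are hypothetical.

```python
# Special codes named in the text above. Treating every other non-positive
# value as missing as well is an assumption of this sketch.
SPECIAL_CODES = {
    -2: "wave prior to 1999, PNRCARE not available",
    -3: "implausible value (caregiver identical with person cared for)",
}


def caregiver_or_none(pnrcare):
    """Return the caregiver's person number, or None for a special code."""
    return pnrcare if pnrcare > 0 else None


def count_valid_caregivers(values):
    """Count the entries that identify an actual caregiver."""
    return sum(1 for value in values if caregiver_or_none(value) is not None)


# Hypothetical values for illustration only, not taken from the data.
example_values = [10203, -3, -2, 50701]
valid = count_valid_caregivers(example_values)
```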
YPBRUTTO 2003
Revision of HHNRAKT and HHNROLD for persons listed twice while living in a previous household.
Contact: Peter Krause
$EQUIV 2003
All income data since 1984 is coded in EURO.
As a supplement to the annual income aggregates offered thus far, we now add the individual income components (sum of all income earned by all household members, variables I111xx$$) with consistent variable names over time.
All information missing due to item-non-response was imputed and marked using flag variables.
All income variables are also included for sample G, but standard weights were used on the basis of sub-samples A-F (see also the additional documentation (PDF, 41.18 KB)).
Contact: Markus Grabka
Rectypes 2002
1. HBRUTT02
In addition to the continuous, wave-specific brutto information regarding progress in the field (SPBRUTTO, SHBRUTTO), the file HBRUTT02 also includes those households of the new subsample G that were not surveyed. HBRUTT02 therefore contains all the households selected for subsample G, while the information on the households surveyed for subsample G can also be found in the continuous household brutto file SHBRUTTO. This matches the approach used for samples A (HBRUTT84), E (HBRUTT98) and F (HBRUTT00).
Contact: Peter Krause
2. BIOSOC
The new data set BIOSOC contains youth information on everybody who has completed the biography questionnaire since 2000. This includes information such as arguments with parents, leisure activities, school grades and the federal state where they last attended school.
Contact: Thorsten Schneider
BIOJOB 2002
The data set BIOJOB contains detailed information on first jobs. As of now this also includes ISCO88 data, occupational scales, classification schemes (ISEI, SIOPS, EGP, MPS) as well as information about the sector (BRANCHE). Information regarding last jobs is a new addition and can be found in BIOJOB.
Contact: Thorsten Schneider
BIOPAREN 2002
The person to contact for the update of the Prestige-Scores for parents is Jürgen Schupp.
PGEN 2002
AUTONO$$
This new variable is based on the answers to 'Occupational Status' and represents the degree of autonomy in a person's occupation.
Contact: Jürgen Schupp
STIB$$
This variable unifies the answers to 'Occupational Status' over all waves.
Contact: Jürgen Schupp
ISCED$$, CASMIN$$
The wave-specific files $PGEN have been retroactively (from 1984 onwards) expanded to include two further education variables (ISCED$$ and CASMIN$$), which are respectively based on the international classification schemes ISCED (International Standard Classification of Education) and CASMIN (Comparative Analysis of Social Mobility in Industrial Nations). This will help improve the comparability of education-related analyses based on SOEP data.
Contact: Bettina Isengard
$EQUIV 2002
Compared to the last data set, there have been fundamental changes to the handling of item nonresponse for annually-based income information and the aggregated income information contained in $PEQUIV. The established longitudinal procedure used for the imputation of item nonresponse has been expanded to include a purely cross-sectional imputation for all income variables, which, however, is only used when individual longitudinal information is unavailable. This has resulted in a complete replacement of all the missing income data in the $PEQUIV files (for further information on the methodological procedure for the additional imputation cf. Frick, J.R. and Grabka, M. (2003): Missing Income Data in the GSOEP: Incidence, Imputation and its Impact on the Income Distribution (PDF, 1.01 MB)).
Due to this, all the so-called imputation flags have been revised. They now indicate the share of imputed income in the respective income aggregate, i.e. if all information is present the value will be 0, and if any item nonresponse is present the value may be anything up to 100.
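The meaning of the revised flags can be illustrated with a small calculation. This is only a sketch of the idea of a percentage share, assuming an aggregate built from non-negative components that are each marked as observed or imputed; the function name and the example amounts are hypothetical, and the exact computation used for the $PEQUIV files is not described here.

```python
def imputation_share(components):
    """Return the percentage of an income aggregate that was imputed.

    components: list of (amount, was_imputed) pairs, with amounts >= 0.
    The result is 0 if nothing was imputed and 100 if everything was.
    """
    total = sum(amount for amount, _ in components)
    if total == 0:
        return 0.0  # assumption: an empty aggregate carries no imputed share
    imputed = sum(amount for amount, was_imputed in components if was_imputed)
    return 100.0 * imputed / total


# Hypothetical households: fully observed, half imputed, fully imputed.
fully_observed = imputation_share([(1000.0, False), (500.0, False)])
half_imputed = imputation_share([(1000.0, False), (1000.0, True)])
fully_imputed = imputation_share([(800.0, True)])
```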
In addition, complete income information for the new sample F for the years 2000 to 2002 is now also available.
The CNEF data is not yet available for the first wave of sample G, as the methodologically demanding imputation algorithms applied by the SOEP require longitudinal data.
Contact: Markus Grabka
DM-EURO conversion
The income in $PEQUIV always refers to that of the previous year; this means that data collected in 2002 for the 2001 income year will still be in DM. There will be a conversion to euros for all the $PEQUIV information in the next data distribution. Apart from that, all the data contained in the $P files correspond to the information collected with the original questionnaire, i.e. the data collected in euros in 2002 and the data collected in DM in 2001 are each stored in the currency used in the questionnaire.
Contact: Peter Krause
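Users who need comparable amounts before the converted files are distributed can convert DM values themselves. The following is a minimal sketch using the official fixed conversion rate of 1 euro = 1.95583 DM; the function names are hypothetical, and the rule that survey years up to 2001 are in DM follows the description of the $P files above.

```python
DM_PER_EURO = 1.95583  # official fixed conversion rate


def dm_to_euro(amount_dm):
    """Convert an amount in DM to euros, rounded to cents."""
    return round(amount_dm / DM_PER_EURO, 2)


def questionnaire_amount_in_euro(amount, survey_year):
    """Return a $P questionnaire amount in euros.

    Amounts collected up to and including 2001 are in DM and are converted;
    amounts collected from 2002 onwards are already in euros.
    """
    return dm_to_euro(amount) if survey_year <= 2001 else amount


converted = questionnaire_amount_in_euro(1955.83, 2001)
unchanged = questionnaire_amount_in_euro(1000.0, 2002)
```

Note that income in $PEQUIV refers to the previous year, so for those files the income year rather than the survey year decides which currency applies.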
With the 18th wave of the SOEP, the concept for the construction of cross-sectional weights has been changed slightly. This change affects neither the derivation of the staying probabilities nor the construction of the weights for Subsample D units.
For more details, please see the Newsletter 60, April 2003.
With the current release of SOEP data (survey years 1984-2001), the coding frame for industry and occupation (first and second job) has been changed to the international standard of NACE and ISCO88, respectively. Especially "old friends" among SOEP-users should be aware that the variables ISCO$$, ISCOU$$, ISCOH$$ and $BRANCHE are no longer available. The respective new variables in the files $P and $PGEN are described in detail in the documentation of the generated variables at person-level (see file pgen.pdf (PDF, 0.66 MB)).
Other than that, the SOEP-group at DIW is currently fixing some minor bugs and deficiencies in the current data release. Firstly, the variables TODJAHR and TODINFO in the file PPFAD, which give year of death and the source of death information, will include all mortality information as given by a recent follow-up study ("Verbleibstudie 2001") carried out by Infratest. Secondly, the variable $ERWZEIT in the file $PGEN will be updated so that there is valid information on the number of years with the current employer for all employed respondents in subsample C. Thirdly, the variable RP4002 in the file RP (occupational status: self-employed) and the variables RHHTAGIN, RHHMONIN, RHINTNR in RH (day and month of the interview as well as the interviewer's ID) had not been defined properly. All these problems will be fixed with the next release of data. However, users who need to use these variables should subscribe to our listserver so they will receive information about these updates sooner.
Rectypes 2000
1. VARIANZ
In addition to the household indicator this file contains the variables STRAT1, STRAT2, SAMPOINT and INTNR. Some software packages (such as STATA, SUDAAN) are able to use these to estimate variances. All four variables provide information on the respective subsample for the start of each first wave, i.e. they are saved at the case-level (variable HHNR).
STRAT1 identifies the strata that were relevant for drawing the Primary Sampling Units (PSUs) for the respective sample. For subsample B, these were the five nationalities. Therefore, "artificial" strata corresponding to those of the other subsamples were created for subsample B and stored in STRAT2.
The variable SAMPOINT identifies the respective PSU (e.g. in subsample A voting constituencies, in Subsample D not present).
Due to data protection laws the various values of the variables STRAT1, STRAT2 and SAMPOINT were given transformed values, in order to prevent regional units from being identified.
The variable INTNR assigns a number to every interviewer, so that clusters of households that were surveyed by the same interviewer can be identified.
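How strata and PSU identifiers enter a variance estimate can be sketched as follows. This is an illustration only, assuming the common with-replacement ("ultimate cluster") approximation for the variance of a weighted total; it is not the procedure used by the SOEP group or by any particular software package, and the function name and example records are hypothetical. The stratum and PSU arguments correspond to variables such as STRAT1 and SAMPOINT.

```python
from collections import defaultdict


def stratified_total_variance(records):
    """Estimate a weighted total and its variance from clustered data.

    records: iterable of (stratum, psu, weight, value) tuples.
    Variance: sum over strata of n/(n-1) * sum((z - mean_z)**2), where z is
    the weighted total of a PSU and n the number of PSUs in the stratum.
    """
    psu_totals = defaultdict(lambda: defaultdict(float))
    for stratum, psu, weight, value in records:
        psu_totals[stratum][psu] += weight * value

    total = 0.0
    variance = 0.0
    for psus in psu_totals.values():
        z = list(psus.values())
        n = len(z)
        total += sum(z)
        if n < 2:
            continue  # assumption: a single-PSU stratum adds no variance here
        mean_z = sum(z) / n
        variance += n / (n - 1) * sum((x - mean_z) ** 2 for x in z)
    return total, variance


# Hypothetical records: (stratum, PSU, weight, value).
example = [
    (1, "a", 2.0, 5.0),
    (1, "b", 2.0, 3.0),
    (1, "b", 4.0, 2.0),
    (2, "c", 1.0, 5.0),
    (2, "d", 5.0, 1.0),
]
estimated_total, estimated_variance = stratified_total_variance(example)
```

Because households within one PSU are summed before the variance is computed, the clustering of the sample is reflected in the estimate rather than ignored.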
2. HBRUTT00
Similarly to the file for the supplementary sample of 1998 (sample E), this file contains all brutto information on all households in the Innovation Sample that were newly sampled in the year 2000 using the random-route method, regardless of whether these households were successfully surveyed or not. Such information can be used for methodological studies of the participation of households in (SOEP) surveys.
3. QJUGEND
In the year 2000, a youth questionnaire was introduced to be used instead of the biography questionnaire. It was aimed at all "new" participants who had reached the minimum age of 16 and were therefore able to take part in the SOEP survey. The 232 records available so far supplement the information from the person questionnaire, which these respondents also answered for the first time, with retrospective details on education as well as basic indicators of educational success. The youth questionnaire indicators were thoroughly revised and supplemented in 2001, when the youth participants of sample F also answered this new questionnaire for the first time. As a result, the data set QJUGEND represents, so to speak, a kind of pre-test for the recently prepared biography data set BIOYOUTH (available from 2001 onwards).
Reworking of labels
The VAR LABELS and VALUE LABELS have been completely reworked for all previous years (up to and including 1999). Missing labels were included where applicable and the labelling scheme was standardised (for instance for sub-items or variables with just one answer category). Furthermore, the labels were made consistent over time. At the same time the reworked label text was transferred to the English labels, so that these too are now retrospectively fully identical to the German scheme.
$PGEN 2000
For the current data distribution, extensive revisions were made to the variables from earlier waves. For instance, note that there are far fewer missing values -1 (no answer) for many variables related to occupations. The education variables in all $PGEN files were reworked and supplemented. New variables include a differentiated labour force status for all participants and education information generated on the basis of data first collected in the year 2000, which dealt with the highest level of education and employment achieved to date. The existing generated education variables were retrospectively reworked, extrapolated and supplemented: you will now be able to access data on temporarily absent respondents, as well as information on current school attendance, apprenticeship or studies. Furthermore, the variable BETR$$ in $PGEN was recoded (the data on the size of the firm and therefore the codes in SOEP have changed over time). Please take this into account when updating your programs.
Contact: Jürgen Schupp and Peter Krause
$PEQUIV 2000
The $PEQUIV files were updated. This affects:
Contact: Markus Grabka
Rectype 1999
INTERVIEW
This interviewer data set contains information on the sex, age, education, occupation and marital status of 1048 interviewers who worked on samples A, B, C and D in survey waves 1 to 12 (Documentation (PDF, 75.75 KB)).
For more information concerning the data distributions back to 1995 please refer to our German Site.