Reports, News of 6 June 2016

SOEPlong data up to the year 2014 available

We recently published the SOEP data (SOEP v31.1) in long format.

If you are registered as a SOEP data user, you may order the data online and receive a personalized link to download them via fully encrypted transfer.

Other interested researchers please have a look at our websites or contact the  to get information on access to the data.

 "SOEPlong" is a highly compressed, easily analyzed version of the SOEP data that, according to numerous enthusiastic users, is much simpler to handle than the usual version. The data are no longer provided as wave-specific individual files but rather pooled across all available years (in "long" format). In some cases, variables are harmonized to ensure that they are defined consistently over time. For example, the income information provided up to 2001 is given in euros, and categories are modified over time when versions of the questionnaire have been changed. All these modifications are clearly documented and described for ease of understanding. In the case of recoding or integration of data (for example, datasets specific to East German or foreign populations), documentation is generated automatically and all modified variables are provided in their original form as well.

SOEPlong thus provides a well-documented compilation of all variables and data that is consistent over time. It thereby significantly reduces both the number of datasets and the number of variables. Our main structural tracking files $PFAD and $PHRF are also provided in a merged "long" form; in other words, weighting factors are already integrated into PPFADL and HPFADL. The so-called "enumerated weights" [PHRFE] are provided in PPFADL, particularly for the analysis of household characteristics at the individual level.
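To make the pooled "long" layout concrete, the following is a minimal sketch in Python, not the procedure actually used to build SOEPlong. It stacks wave-specific records into one row per person and survey year, and expresses income in euros while keeping the original value alongside. The record layout, the names pid, income_orig and income_eur, the toy values, and the assumption that pre-2002 amounts were recorded in Deutsche Mark are illustrative only; the fixed conversion rate of 1.95583 DM per euro is the official one.

```python
# Official fixed conversion rate between Deutsche Mark and euro.
DM_PER_EURO = 1.95583

# Hypothetical wave-specific "files": one mapping per survey year,
# keyed by a person identifier. Values are toy data, not SOEP data.
waves = {
    2000: {101: {"income": 3000.0}, 102: {"income": 4500.0}},  # assumed DM
    2002: {101: {"income": 1600.0}, 102: {"income": 2400.0}},  # assumed EUR
}


def pool_waves(waves):
    """Stack wave-specific records into one list of person-year rows."""
    rows = []
    for syear, persons in waves.items():
        for pid, record in persons.items():
            income = record["income"]
            # Harmonize: express amounts up to 2001 in euros.
            if syear <= 2001:
                income_eur = round(income / DM_PER_EURO, 2)
            else:
                income_eur = income
            rows.append(
                {
                    "pid": pid,
                    "syear": syear,
                    "income_orig": income,  # original value is kept as well
                    "income_eur": income_eur,
                }
            )
    # Long format: sorted by person, then by survey year.
    rows.sort(key=lambda r: (r["pid"], r["syear"]))
    return rows


long_rows = pool_waves(waves)
for row in long_rows:
    print(row)
```

Each person now appears once per survey year in a single table, so an analysis over time no longer needs to open and merge one file per wave.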

Further variables included in the "long" format of the PFAD dataset are:

  • The variable IYEAR (interview year), which corresponds to the variable DATUMY in HBRUTTO, marks the actual interview year and supplements the variable SYEAR (survey year, referring to the reference year for the survey instrument).
  • The generated partner IDs (PARID, PARTNER) are also included in PPFADL. They supplement the corresponding PGEN variables and allow partners to be identified in households where one partner could not be interviewed.
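
As a rough illustration of how these variables might be used together, the sketch below looks up each person's partner within the same survey year and lists interviews that took place after the survey year. It is a sketch under stated assumptions, not SOEP code: the rows are toy data, the lowercase keys pid, syear, iyear and parid are illustrative stand-ins for the corresponding variables, and None is an assumed placeholder for "no partner" rather than the missing-value code used in the actual data.

```python
# Toy person-year rows; keys and values are hypothetical.
rows = [
    {"pid": 101, "syear": 2013, "iyear": 2013, "parid": 102},
    {"pid": 102, "syear": 2013, "iyear": 2014, "parid": 101},
    {"pid": 103, "syear": 2013, "iyear": 2013, "parid": None},  # no partner
    {"pid": 104, "syear": 2013, "iyear": 2013, "parid": 105},   # partner 105 has no row
]


def attach_partner_iyear(rows):
    """Add the partner's interview year to each row, or None if unavailable."""
    # Index rows by (person, survey year) for direct lookup.
    by_key = {(r["pid"], r["syear"]): r for r in rows}
    result = []
    for r in rows:
        partner = None
        if r["parid"] is not None:
            partner = by_key.get((r["parid"], r["syear"]))
        partner_iyear = partner["iyear"] if partner is not None else None
        result.append({**r, "partner_iyear": partner_iyear})
    return result


pairs = attach_partner_iyear(rows)

# Persons whose interview took place after the survey year.
late = [r["pid"] for r in rows if r["iyear"] > r["syear"]]

for p in pairs:
    print(p)
print("interviewed after survey year:", late)
```

Person 104 shows the case mentioned above: the partner ID is known even though the partner has no row of their own, so the partnership can still be identified.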

Also added to the "long" format of the SOEP data are the cumulative original data from the biographical questionnaire (BIO dataset).

Preparation of the SOEPlong format also includes all datasets that are provided regularly as cross-sectional files. In generating the individual and household data in the SOEPlong format from the original survey data, comprehensive information is also generated from the cross-sectional variables documenting the long variables over time. This allows users to trace all adaptations that had to be made in variables over time and verify their comparability.

The "long" data will be provided, as always, for users of different data formats: Stata, SPSS, and SAS (and in the unlabeled ASCII format). An English version of all datasets will be made available.


For a first overview of the variables included, please refer to the following Excel file (XLSX, 15.6 MB). We hope to implement the dataset in the forthcoming update of our information system.