Report of December 20, 2013
With this year's data distribution, SOEP v29, the data are again being provided in both the normal and the "long" format. "SOEPlong" is a highly compressed, easily analyzed version of the SOEP data that, according to numerous enthusiastic users, is much simpler to handle than the usual version. In SOEPlong, the data are no longer provided as wave-specific individual files but are pooled across all available years (in "long" format). In some cases, variables are harmonized to ensure that they are defined consistently over time. For example, the income information collected up to 2001 is given in euros, and categories are adjusted where versions of the questionnaire have changed over time. All these modifications are documented and described so that they are easy to follow. In the case of recoding or integration of data (for example, datasets specific to East German or foreign populations), documentation is generated automatically and all modified variables are provided in their original form as well.
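The pooling and harmonization described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build SOEPlong: the record layout, the field names (pid, syear, income, income_orig), and the sample values are hypothetical. The sketch also assumes that income surveyed up to 2001 was recorded in Deutsche Mark and uses the official fixed conversion rate of 1 euro = 1.95583 Deutsche Mark.

```python
# Hypothetical wave-specific files: one list of records per survey year.
# Assumption: income up to 2001 is recorded in Deutsche Mark, afterwards in euros.
DM_PER_EURO = 1.95583  # official fixed conversion rate

waves = {
    2001: [{"pid": 101, "income": 3911.66}],  # assumed Deutsche Mark
    2002: [{"pid": 101, "income": 2100.00}],  # euros
}


def pool_to_long(waves):
    """Stack wave-specific records into one list with a survey-year field."""
    long_rows = []
    for year in sorted(waves):
        for rec in waves[year]:
            row = {
                "pid": rec["pid"],
                "syear": year,
                # Keep the unmodified value next to the harmonized one.
                "income_orig": rec["income"],
            }
            if year <= 2001:
                row["income"] = round(rec["income"] / DM_PER_EURO, 2)
            else:
                row["income"] = rec["income"]
            long_rows.append(row)
    return long_rows


long_data = pool_to_long(waves)
for row in long_data:
    print(row)
```

Keeping income_orig alongside the harmonized value mirrors the idea that modified variables remain available in their original form.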
SOEPlong thus provides a well-documented compilation of all variables and data that is consistent over time. It thereby significantly reduces both the number of datasets and the number of variables. Our main structural tracking files $PFAD and $PHRF are also provided in a merged "long" form; in other words, weighting factors are already integrated into PPFADL and HPFADL. For the first time, a beta version of so-called "enumerated weights" [PHRFE] is also provided in PPFADL, particularly for the analysis of household characteristics at the individual level.
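Because weighting factors are already part of the person-level "long" file, a weighted statistic needs no separate merge step. The following Python sketch shows the idea on made-up person-year records; the field names (pid, syear, weight, hh_size) and all values are hypothetical and are not the actual variable names in PPFADL.

```python
# Hypothetical person-year records that already carry their weighting factor.
rows = [
    {"pid": 1, "syear": 2012, "weight": 1500.0, "hh_size": 2},
    {"pid": 2, "syear": 2012, "weight": 500.0, "hh_size": 4},
    {"pid": 1, "syear": 2011, "weight": 1400.0, "hh_size": 2},
]


def weighted_mean(rows, year, var, weight="weight"):
    """Weighted mean of `var` over all records of one survey year."""
    selected = [r for r in rows if r["syear"] == year]
    total_weight = sum(r[weight] for r in selected)
    if total_weight == 0:
        raise ValueError("no weighted records for year %s" % year)
    return sum(r[var] * r[weight] for r in selected) / total_weight


mean_hh_size_2012 = weighted_mean(rows, 2012, "hh_size")
print(mean_hh_size_2012)
```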
Further variables included in the "long" format of the PFAD dataset are:
A further addition to the "long" format of the SOEP data is the cumulative original data from the biographical questionnaire in the BIO dataset.
Preparation of the SOEPlong format also covers all datasets that are provided regularly as cross-sectional files. When the individual and household data in the SOEPlong format are generated from the original survey data, comprehensive information is also generated from the cross-sectional variables that documents the long variables over time. This allows users to trace all adaptations that had to be made in variables over time and to verify their comparability.
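One way to picture this kind of documentation is as a lookup table that records, for each long variable and survey year, the cross-sectional source variable and any adaptation applied. The Python sketch below is purely illustrative: the table structure, the variable names (income, inc_w18, inc_w19), and the adaptation texts are hypothetical and do not reproduce the actual SOEPlong documentation.

```python
# Hypothetical documentation table: one entry per long variable and survey year.
origin_table = [
    {"long_var": "income", "syear": 2001, "source_var": "inc_w18",
     "adaptation": "converted to euros"},
    {"long_var": "income", "syear": 2002, "source_var": "inc_w19",
     "adaptation": "none"},
]


def trace(long_var, table):
    """Return {survey year: (source variable, adaptation)} for one long variable."""
    return {
        entry["syear"]: (entry["source_var"], entry["adaptation"])
        for entry in table
        if entry["long_var"] == long_var
    }


income_history = trace("income", origin_table)
for year in sorted(income_history):
    print(year, income_history[year])
```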
The "long" data will be provided, as in previous years, for users of different data formats: Stata, SPSS, and SAS (and in the unlabeled ASCII format). In addition, this year an English version of all datasets will be made available.
Documentation
The first complete documentation of the SOEPlong format is the first content made available in the new version of SOEPinfo, which is currently under development (see "DDI on Rails" in SOEPnewsletter 102). The documentation will also be provided there in graphic form. These figures show all years in which a given variable was collected and provide information on changes in distributions over time. In addition, the distribution of missing values is presented in the lower part of each figure. The figures are provided for all "long" variables; for categorical variables the distributions are presented using bar plots, and for metric variables using violin plots.
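The per-year information on missing values shown in these figures can be approximated from the pooled data itself. The Python sketch below computes the share of missing values per survey year; it assumes, for illustration only, that missing values are stored as negative codes, and the field names (syear, satisfaction) and values are hypothetical.

```python
from collections import defaultdict

# Hypothetical pooled records; negative codes are assumed to mark missing values.
rows = [
    {"syear": 2011, "satisfaction": 7},
    {"syear": 2011, "satisfaction": -1},
    {"syear": 2012, "satisfaction": 8},
    {"syear": 2012, "satisfaction": 6},
]


def missing_share_by_year(rows, var):
    """Share of records per survey year whose value of `var` is a missing code."""
    counts = defaultdict(lambda: [0, 0])  # year -> [missing, total]
    for r in rows:
        counts[r["syear"]][1] += 1
        if r[var] < 0:
            counts[r["syear"]][0] += 1
    return {year: missing / total
            for year, (missing, total) in sorted(counts.items())}


shares = missing_share_by_year(rows, "satisfaction")
print(shares)
```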
Please see SOEPinfo v.2 beta under https://data.soep.de.
Your feedback is very welcome!