Arne Bethmann, Malte Schierholz, Knut Wenzig, Markus Zielonka
In recent years, several large-scale German panel studies have demonstrated the demand for coding open-ended survey questions on respondents’ occupations, e.g. the National Educational Panel Study (NEPS), the German Socio-Economic Panel (SOEP), and the Panel Study “Labour Market and Social Security” (PASS). So far, occupational coding in Germany has mostly been done semi-automatically: dictionary approaches code the cases they can, and the remaining cases are coded manually. Since manual coding of occupations is considerably more expensive than automatic coding, it is highly desirable from a survey cost perspective to increase the proportion of cases that can be coded automatically. At the same time, the quality of the coding is of paramount importance and calls for close scrutiny. The quality of automatic coding must at least match that of manual coding if survey cost is not to be traded for survey error. From a total survey error perspective, this would free resources formerly spent on reducing processing error and allow them to be used to reduce other sources of error. In contrast to the dictionary approaches that are mainly used for automatic occupational coding in German surveys, we employ two machine learning algorithms (i.e. Naive Bayes and Bayesian Multinomial) for the task. Since we have a substantial amount of manually coded occupations from recent studies at our disposal, we use these as training data for the automatic classification. This enables us to evaluate the performance as well as the quality, and hence the feasibility, of machine learning algorithms for the automatic coding of open-ended survey questions on occupations.
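To make the idea of learning a coder from manually coded answers concrete, the following is a minimal sketch of a multinomial Naive Bayes text classifier with add-one (Laplace) smoothing. It is an illustration under stated assumptions, not the implementation used in the study: the training answers, the category labels (`TEACHING`, `NURSING`, `DRIVING`), the whitespace tokenizer, and the class name `NaiveBayesCoder` are all hypothetical, and real occupational classifications contain far more categories than this toy example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a verbatim answer and split it on whitespace (an assumed, very simple tokenizer)."""
    return text.lower().split()


class NaiveBayesCoder:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, answers, codes):
        # Count how often each code occurs and which tokens occur under it.
        self.code_counts = Counter(codes)
        self.token_counts = defaultdict(Counter)
        self.vocabulary = set()
        for answer, code in zip(answers, codes):
            tokens = tokenize(answer)
            self.token_counts[code].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def log_scores(self, answer):
        """Return log prior + sum of smoothed log likelihoods for each code."""
        n_train = sum(self.code_counts.values())
        vocab_size = len(self.vocabulary)
        # Tokens never seen in training carry no information and are skipped.
        tokens = [t for t in tokenize(answer) if t in self.vocabulary]
        scores = {}
        for code, n_code in self.code_counts.items():
            total = sum(self.token_counts[code].values())
            score = math.log(n_code / n_train)
            for token in tokens:
                count = self.token_counts[code][token]
                score += math.log((count + 1) / (total + vocab_size))
            scores[code] = score
        return scores

    def posterior(self, answer):
        """Normalize the log scores into probabilities that sum to 1."""
        scores = self.log_scores(answer)
        top = max(scores.values())
        weights = {code: math.exp(s - top) for code, s in scores.items()}
        norm = sum(weights.values())
        return {code: w / norm for code, w in weights.items()}

    def predict(self, answer):
        scores = self.log_scores(answer)
        return max(scores, key=scores.get)


# Hypothetical, invented training data standing in for manually coded answers.
train_answers = [
    "primary school teacher", "high school teacher",
    "hospital nurse", "nurse in elderly care",
    "truck driver", "bus driver",
]
train_codes = ["TEACHING", "TEACHING", "NURSING", "NURSING", "DRIVING", "DRIVING"]

coder = NaiveBayesCoder().fit(train_answers, train_codes)
print(coder.predict("teacher at a vocational school"))
print(coder.posterior("delivery driver"))
```

In a semi-automatic workflow, one plausible use of `posterior` is to accept the automatic code only when the highest probability exceeds a chosen threshold and to route the remaining answers to manual coding; the threshold would have to be calibrated against manually coded data.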