Imputing Top‐Coded Income Data in Longitudinal Surveys*

DOI: http://doi.org/10.1111/obes.12400
Published date: 01 February 2021
Author: Li Tan
©2020 The Department of Economics, University of Oxford and John Wiley & Sons Ltd.
OXFORD BULLETIN OF ECONOMICS AND STATISTICS, 83, 1 (2021) 0305–9049
Li Tan
Purdue University, School of Engineering Education, 701 West Stadium Avenue,
West Lafayette, Indiana 47907, USA (e-mail: tan304@purdue.edu)
Abstract
The incomes of top earners are typically top-coded in survey data. I show that the accuracy
of imputed income values for top earners in longitudinal surveys can be improved
significantly by incorporating information from multiple time periods into the imputation
process in a simple way. Moreover, I introduce an innovative, nonparametric empirical Bayes
imputation method that further improves imputation quality. I show that the empirical Bayes
imputation method reduces the RMSE of imputed income values by 19–51% relative to
standard approaches in the literature. I also illustrate the benefits of the empirical Bayes
method for investigating multi-year income inequality.
I. Introduction
Longitudinal surveys are used in research in many fields. Among the most widely used
longitudinal surveys in the United States are the Survey of Income and Program
Participation (SIPP), the Panel Study of Income Dynamics (PSID) and the National Longitudinal
Survey of Youth (NLSY). A common feature shared by the surveys is that income values of
the highest earners are censored to protect individuals’ identities. This censoring practice
is referred to as income top-coding.
JEL Classification numbers: D31, C81.
*I thank Cory Koedel, David Kaplan and Peter Mueser for helpful comments and suggestions. Financial support
from the University of Missouri Census Regional Data Center Interdisciplinary Doctoral Fellowship Program is
gratefully acknowledged. All errors are my own. The Bayesian imputation method introduced in this paper can be
implemented with the R package BayesImp available at https://github.com/ltanecon/BayesImp.
Three types of methods have typically been employed in previous research to handle
income top-coding in applications. The first type circumvents the top-coding
problem by excluding top earners from the analysis, for example by redefining the
income distribution of interest, such as by dropping the top-coded observations (e.g. Shin
and Solon, 2011; Jensen and Shore, 2015). The second type replaces top-coded income
values with basic, non-model-based imputed values. Survey-provided ‘default’ values are
a frequently used example (e.g. Heywood and O’Halloran, 2005; Christie-Mizell, 2006);
another example is the use of a ‘common multiplier’, in which imputed values are obtained
by multiplying the top-coding threshold value by an ad hoc common multiplier (e.g. Katz
and Murphy, 1992; Lemieux, 2006; Autor, Katz and Kearney, 2008). The third method type
is model based. It estimates a population income distribution under parametric assumptions
with the truncated sample and then randomly draws imputed income values from the right
tail for top-coded earners. The distributional assumptions used to recover the population
income distribution vary somewhat across studies, but the fundamental approach is the same.
I refer to this latter approach as the standard imputation method hereafter (e.g. Kopczuk,
Saez and Song, 2010; Burkhauser et al., 2012; Attanasio and Pistaferri, 2014; Armour,
Burkhauser and Larrimore, 2016).
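To make the second and third method types concrete, the following Python sketch may help. It is an illustration under stated assumptions, not the procedure of any cited study: the multiplier of 1.5 is a purely hypothetical value, the Pareto tail is only one of the parametric assumptions a study might adopt, and the function names and the `cutoff` parameter are invented for this example.

```python
import numpy as np


def common_multiplier_impute(threshold, multiplier=1.5):
    """Ad hoc imputation: threshold times a common multiplier.

    The default multiplier of 1.5 is a hypothetical illustrative value.
    """
    return threshold * multiplier


def fit_pareto_alpha(incomes, top_coded, threshold, cutoff):
    """Maximum-likelihood Pareto shape from right-censored data above `cutoff`.

    ASSUMPTION: incomes above `cutoff` follow a Pareto distribution.
    Top-coded observations are only known to exceed `threshold`.
    """
    incomes = np.asarray(incomes, dtype=float)
    top_coded = np.asarray(top_coded, dtype=bool)
    in_tail = top_coded | (incomes >= cutoff)
    # Censored observations enter the log-sum at the threshold value.
    x = np.where(top_coded, threshold, incomes)[in_tail]
    n_uncensored = np.sum(in_tail & ~top_coded)
    return n_uncensored / np.sum(np.log(x / cutoff))


def draw_pareto_tail(alpha, threshold, size, rng):
    """Inverse-CDF draws from the fitted Pareto, conditional on exceeding threshold."""
    u = rng.uniform(size=size)  # u lies in [0, 1), so 1 - u lies in (0, 1]
    return threshold * (1.0 - u) ** (-1.0 / alpha)


# Illustrative usage on simulated (not survey) data.
rng = np.random.default_rng(0)
true_income = 20_000 * (1.0 - rng.uniform(size=50_000)) ** (-1.0 / 2.0)
threshold = np.quantile(true_income, 0.98)
top_coded = true_income > threshold
observed = np.where(top_coded, threshold, true_income)

alpha_hat = fit_pareto_alpha(observed, top_coded, threshold,
                             cutoff=np.quantile(observed, 0.80))
imputed = draw_pareto_tail(alpha_hat, threshold, top_coded.sum(), rng)
```

Note that the draws in this sketch are assigned to top-coded individuals at random within a period, so nothing links one person's imputed values across periods.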
Despite the popularity of these methods, they possess important limitations, particularly
in longitudinal-data applications. For example, the standard imputation method, the most
rigorous of the three, has been shown to recover the population income distribution
accurately in any given year (Burkhauser et al., 2012), but it originates from
cross-sectional data applications and is not well suited to longitudinal data, even though
such applications are quite common (Nichols, 2008; Bhuller, Mogstad and Salvanes, 2011;
Orthofer, 2016). The reason is that it does not leverage longitudinal information to
improve imputation accuracy.
This paper develops and tests methods that incorporate information in longitudinal
data into the imputation process to account for income dependency across time periods
within individuals. I first show that the standard imputation method can be improved
significantly by incorporating available longitudinal information in a simple way. Without
substantially modifying the analytic framework of the standard method, the suggested
approach, which I call ‘rank-based imputation’, generates a ranking algorithm based on
longitudinal information to assign imputed income values to top-coded individuals. An
appeal of rank-based imputation is its simplicity and effectiveness in improving imputation
accuracy, but it has some limitations: perhaps most importantly, the ‘rank assignment’ for
individuals is static and does not account for the variation in ranking caused by income
fluctuations over time.
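The idea of assigning imputed values by rank can be sketched as follows. This is only an illustration: the ranking algorithm itself is not specified in this part of the text, so the `rank_score` input (for instance, each individual's average income over periods in which they are not top-coded) is a hypothetical stand-in, and the function name is invented for this example.

```python
import numpy as np


def rank_based_assign(rank_score, tail_draws):
    """Give the largest tail draw to the individual with the highest rank score.

    rank_score : one longitudinal score per top-coded individual
                 (HYPOTHETICAL; the actual ranking rule is not shown here).
    tail_draws : imputed values drawn from the fitted right tail, one per individual.
    """
    rank_score = np.asarray(rank_score, dtype=float)
    tail_draws = np.asarray(tail_draws, dtype=float)
    if rank_score.shape != tail_draws.shape:
        raise ValueError("need exactly one tail draw per top-coded individual")
    order = np.argsort(rank_score, kind="stable")  # indices from lowest to highest score
    imputed = np.empty_like(tail_draws)
    imputed[order] = np.sort(tail_draws)           # lowest score gets the smallest draw
    return imputed


# Three top-coded individuals with illustrative scores and draws.
print(rank_based_assign([3.0, 1.0, 2.0], [10.0, 30.0, 20.0]))  # [30. 10. 20.]
```

Because the same score is reused in every period, the ordering of individuals never changes in this sketch, which mirrors the static-rank limitation described above.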
Next, I introduce an innovative, empirical Bayes-based imputation method that further
improves imputation accuracy by combining the analytic insights of the rank-based method
and the methodological advantages of the non-parametric empirical Bayes framework
recently developed by Gu and Koenker (2016, 2017; also see Koenker and Mizera, 2014).
The advantage of this approach is that it leverages additional information from the sample
to increase imputation accuracy, and it is not subject to the static-rank limitation of the
rank-based method.
I compare these new imputation methods to the standard method using data from the
1996 SIPP. In addition to the actual income top-coding in the SIPP, applied to the roughly
0.5% of highest earners, I pseudo top-code income values starting from the 98th percentile
of the income distribution. This follows the NLSY top-coding protocol to create a realistic
top-coding scenario (i.e. the NLSY top-codes the highest 2% of income observations).
Individuals affected by the pseudo top-coding are valuable for examining the efficacy of
the methods because their true income values are available in the SIPP but temporarily
expunged, which allows for straightforward comparisons between imputed and actual
income values. The results from the comparison of methods are not qualitatively sensitive to
alternative pseudo top-coding thresholds (see Appendix A).
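The pseudo top-coding exercise and the comparison of imputed with actual values can be sketched in a few lines of Python. This is a minimal illustration with invented function names; whether the 98th percentile is computed within each period or on pooled data is not stated here, so the sketch simply takes one array of incomes as given.

```python
import numpy as np


def pseudo_top_code(incomes, pct=98.0):
    """Censor incomes above the `pct`-th percentile, keeping a mask of affected rows."""
    incomes = np.asarray(incomes, dtype=float)
    threshold = np.percentile(incomes, pct)
    mask = incomes > threshold               # observations whose true value is hidden
    censored = np.where(mask, threshold, incomes)
    return censored, mask, threshold


def rmse(imputed, actual):
    """Root mean squared error between imputed and true income values."""
    imputed = np.asarray(imputed, dtype=float)
    actual = np.asarray(actual, dtype=float)
    return float(np.sqrt(np.mean((imputed - actual) ** 2)))


# Illustrative usage on made-up incomes 1, 2, ..., 100.
incomes = np.arange(1, 101)
censored, mask, threshold = pseudo_top_code(incomes)
# Score a naive imputation (threshold itself) against the withheld true values.
naive_error = rmse(censored[mask], incomes[mask])
```

Any imputation method can be scored the same way: impute values for the rows flagged by `mask`, then compare them with the withheld true incomes.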
Using the pseudo top-coded sample, I show that the income values imputed by the
standard method exhibit excessive income volatility within individuals. Noting the short