Modelling missing values in cross-national surveys: a
latent variable approach
M. Katsikatsou, J. Kuha and I. Moustaki
London School of Economics and Political Science
Workshop on Cross-National Surveys: Methods of Design and Analysis
Katsikatsou, Kuha, Moustaki
Modelling item non-response
14th of December 2014
1/33
Outline
1. An Introduction to Latent Variable Models.
2. Latent Variable Models for Multi-group Complete Binary Data.
3. Variables and Joint Distributions.
4. Various Model Specifications for Handling Item Non-response.
5. An Application from the European Social Survey.
Measurement and Structural Models
1. Many theories in behavioral and social sciences are formulated in terms of
   theoretical constructs that are not directly observed or measured: prejudice,
   ability, radicalism, motivation, wealth.
2. A construct is measured through one or more observable indicators, such as
   questionnaire items (measurement model).
3. The purpose of a measurement model is to describe how well the observed
   indicators serve as a measurement instrument for the constructs, also known
   as latent variables.
4. In some cases a concept may be represented by a single latent variable, but
   concepts are often multidimensional and so involve more than one latent
   variable.
5. Subject-matter theories and research questions usually concern relationships
   among the latent variables, and perhaps also observed explanatory variables
   (structural models).
Application
2008 European Social Survey. Countries selected: Denmark, Great
Britain, The Netherlands, Poland.
Three questions are selected which aim to measure attitudes towards
receivers of welfare provision.
Most unemployed people do not really try to find a job.
Many people manage to obtain benefits and services to which they are
not entitled.
Employees often pretend they are sick in order to stay at home.
Response options (5-point scale): Agree strongly (negative attitude)
to Disagree strongly (positive attitude).
Missing categories: Refusal, Don’t know, No answer.
Aim of the analysis
Valid comparisons of the latent variable 'attitude towards receivers
of welfare' among the countries, taking into account possible
differences in the measurement of the latent variable across groups
(measurement invariance) and the effect of item non-response.
Measurement invariance will be assumed.
Various model specifications are proposed for the missing data
mechanism.
Family of Latent Variable Models
                    Manifest variables
Latent variables    Metrical                  Categorical              Mixed
Metrical            factor analysis           latent trait analysis    latent trait analysis
Categorical         latent profile analysis   latent class analysis    latent class analysis
Mixed               Hybrid models
Bartholomew, D. J., Knott, M. and Moustaki, I. (2011). Latent Variable
Models and Factor Analysis: A Unified Approach. Wiley.
Skrondal, A. and Rabe-Hesketh, S. (2004). Generalized Latent Variable
Modelling: Multilevel, Longitudinal, and Structural Equation Models. Boca
Raton, FL: Chapman and Hall/CRC.
General scope and Notation
Observed (manifest) variables, or items, are denoted by
$Y = (Y_1, Y_2, \ldots, Y_p)'$.
Latent variables are denoted by $\eta = (\eta_1, \eta_2, \ldots, \eta_q)'$. Latent
variables can be continuous, discrete or mixed.
Covariates, such as group variables, gender and age, are denoted by
$X = (X_1, \ldots, X_c)'$.
Response/non-response indicators: $R = (R_1, R_2, \ldots, R_p)'$, defined
as $R_j = 1$ if $Y_j$ is observed and $R_j = 0$ if $Y_j$ is not observed. This
stochastic vector contains all the information about the missingness
patterns.
Important features of our approach
It allows information about the unobserved part of Y to be inferred
from the observed part of Y, since the manifest variables are
expected to be correlated; $Y_C = (Y_{obs}, Y_{mis})$.
An underlying model is assumed for the relationships among the
variables being measured. Specifically, a single continuous latent
variable η is assumed to be responsible for the dependencies among
the Ys.
A discrete latent dimension, response propensity, is assumed, on
which individuals in the population vary.
A latent trait model and a latent class model are therefore combined,
so that the probability of responding to an item varies not only with
an individual's position on the latent variable η but also with their
position on the response propensity dimension. Together these
produce a non-ignorable non-response model.
Model specification for multi-group complete binary data
Suppose that each respondent belongs to one of G groups. The group is
treated as a fixed and observed explanatory variable for η and Y.
As η is unobserved, any inference will be based on the conditional
distribution of Y given the group and other covariates:
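\[
P(Y = y \mid X) = \int P(Y = y \mid \eta)\, p(\eta \mid X)\, d\eta \tag{1}
\]
Under conditional independence:
\[
P(Y = y \mid \eta) = \prod_{i=1}^{p} [\pi_i(\eta)]^{y_i}\, [1 - \pi_i(\eta)]^{1 - y_i}, \quad y_i = 0, 1,
\]
\[
\pi_i(\eta) = P(Y_i = 1 \mid \eta).
\]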
Measurement model: P(Y ∣ η)
Structural model: p(η ∣ X)
Here X includes the group variable and other covariates.
Latent Trait Model
In a latent trait model with X containing only group information,
the latent variable is distributed as
\[
\eta \sim N(m^{(g)}, \phi^{(g)}), \quad g = 1, \ldots, G.
\]
This is the structural model.
For the measurement model, we use the logistic model
\[
\mathrm{logit}[\pi_i(\eta)] = \tau_i + \alpha_i \eta, \quad i = 1, \ldots, p, \tag{2}
\]
where $\tau_i$ and $\alpha_i$ are the intercept and loading parameters, respectively,
taken here to be invariant across groups (measurement invariance).
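The integral in (1) generally has no closed form. As a minimal sketch, assuming a single continuous η, a normal structural model and the logistic measurement model (2), Gauss-Hermite quadrature can approximate the probability of a response pattern in a given group. The function names and all parameter values below are hypothetical and are not taken from the authors' software; φ is treated as a variance.

```python
import itertools
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def item_probs(eta, tau, alpha):
    """pi_i(eta) = expit(tau_i + alpha_i * eta); result has shape (n_nodes, p)."""
    return 1.0 / (1.0 + np.exp(-(tau + np.outer(eta, alpha))))


def marginal_prob(y, tau, alpha, m, phi, n_nodes=41):
    """Approximate P(Y = y | group) when eta ~ N(m, phi), phi a variance."""
    z, w = hermegauss(n_nodes)        # nodes/weights for weight exp(-z^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)      # rescale so the weights sum to one
    eta = m + np.sqrt(phi) * z        # move the nodes to the N(m, phi) scale
    pi = item_probs(eta, tau, alpha)
    cond = np.prod(pi ** y * (1.0 - pi) ** (1 - y), axis=1)  # P(y | eta)
    return float(np.sum(w * cond))


# Hypothetical intercepts and loadings, shared by both groups
# (measurement invariance); only the distribution of eta differs.
tau = np.array([-0.5, 0.0, 0.8])
alpha = np.array([1.0, 1.5, 0.7])
patterns = [np.array(pat) for pat in itertools.product([0, 1], repeat=3)]

probs_g1 = [marginal_prob(y, tau, alpha, m=0.0, phi=1.0) for y in patterns]
probs_g2 = [marginal_prob(y, tau, alpha, m=-1.0, phi=1.5) for y in patterns]
total_g1 = sum(probs_g1)
total_g2 = sum(probs_g2)
```

Summing over all 2^p patterns gives a quick sanity check: within each group the pattern probabilities should add up to one.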
Estimation
For a random sample of size n, the log-likelihood function is
\[
\ell(\theta) = \sum_{j=1}^{n} \log P(Y = y_j \mid X_j; \theta), \tag{3}
\]
where $\theta$ denotes the parameters of the model. Maximum likelihood or
Bayesian estimation can be applied. The most commonly used maximization
algorithms are EM and Newton-Raphson. Numerical methods for
approximating the integrals are also needed.
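To make the log-likelihood (3) concrete, the following sketch simulates binary items for a single group and evaluates the log-likelihood by quadrature. For brevity it maximizes over the latent mean only, using a plain grid search in place of EM or Newton-Raphson, and holds the other parameters at their simulated values. All names and numbers are hypothetical.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss


def log_likelihood(Y, tau, alpha, m, phi, n_nodes=31):
    """Sum over respondents of log P(Y = y_j), with eta ~ N(m, phi)."""
    z, w = hermegauss(n_nodes)
    w = w / np.sqrt(2.0 * np.pi)                 # N(0, 1) quadrature weights
    eta = m + np.sqrt(phi) * z
    lin = tau + np.outer(eta, alpha)             # shape (nodes, p)
    log_pi = -np.logaddexp(0.0, -lin)            # log pi_i(eta), computed stably
    log_1m_pi = -np.logaddexp(0.0, lin)          # log [1 - pi_i(eta)]
    # log P(y_j | eta_q) for every respondent j and node q: shape (n, nodes)
    log_cond = Y @ log_pi.T + (1 - Y) @ log_1m_pi.T
    return float(np.sum(np.log(np.exp(log_cond) @ w)))


rng = np.random.default_rng(2014)
tau = np.array([-0.5, 0.0, 0.8])      # hypothetical intercepts
alpha = np.array([1.0, 1.5, 0.7])     # hypothetical loadings
true_m, phi, n = 0.5, 1.0, 2000

# Simulate eta and three binary items from the latent trait model.
eta = rng.normal(true_m, np.sqrt(phi), size=n)
prob = 1.0 / (1.0 + np.exp(-(tau + np.outer(eta, alpha))))
Y = (rng.random((n, 3)) < prob).astype(int)

# Grid search over the latent mean only (a stand-in for EM / Newton-Raphson).
grid = np.round(np.arange(-2.0, 2.01, 0.1), 1)
loglik = np.array([log_likelihood(Y, tau, alpha, m, phi) for m in grid])
m_hat = float(grid[np.argmax(loglik)])
```

In practice all parameters would be estimated jointly, and the grid search would be replaced by one of the algorithms named above.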
Variables and their Joint Distribution
Conditional on the observed covariates X, the joint distribution of the
other variables can be written as
\[
p(R, Y_C, \eta \mid X) = p(R \mid Y_C, \eta, X)\, p(Y_C \mid \eta, X)\, p(\eta \mid X), \tag{4}
\]
where $p(\cdot \mid \cdot)$ denotes a conditional probability function or probability density
function. We will refer to $p(R \mid Y_C, \eta, X)$, $p(Y_C \mid \eta, X)$ and $p(\eta \mid X)$ as the
non-response model, measurement model and structural model, respectively.
As $\eta$ and $Y_{mis}$ are not observed, the conditional distribution of the
observed variables is obtained from (4) as
\[
p(R, Y \mid X) = \iint p(R \mid Y_C, \eta, X)\, p(Y_C \mid \eta, X)\, p(\eta \mid X)\, d\eta\, dY_{mis}, \tag{5}
\]
where the integrals are over the possible values of $\eta$ and $Y_{mis}$.
Variables and their Joint Distribution, cont’d
Finally, we make some further assumptions which reduce (5) to
\[
p(R, Y \mid X) = \int p(R \mid \eta, X) \left[ \prod_{i=1}^{p} p(Y_i \mid \eta) \right] p(\eta \mid X)\, d\eta. \tag{6}
\]
Measurement Invariance
Conditional Independence
Missingness depends on η and covariates X
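One way to see how the three assumptions listed above take (5) to (6) is sketched below. The sketch treats the items as binary, so the integral over the missing items becomes a sum. It writes obs and mis for the sets of observed and missing items (labels introduced here, not in the original notation) and reads Y on the left-hand side as the observed part of $Y_C$.

```latex
\begin{align*}
p(R, Y \mid X)
 &= \int \sum_{y_{mis}} p(R \mid Y_C, \eta, X)\, p(Y_C \mid \eta, X)\, p(\eta \mid X)\, d\eta \\
 &= \int p(R \mid \eta, X)
    \Big[ \prod_{i \in obs} p(Y_i \mid \eta) \Big]
    \Big[ \prod_{i \in mis} \sum_{y_i \in \{0,1\}} p(y_i \mid \eta) \Big]
    p(\eta \mid X)\, d\eta \\
 &= \int p(R \mid \eta, X)
    \Big[ \prod_{i \in obs} p(Y_i \mid \eta) \Big]
    p(\eta \mid X)\, d\eta .
\end{align*}
```

The second line uses $p(R \mid Y_C, \eta, X) = p(R \mid \eta, X)$ (missingness depends only on η and X) together with $p(Y_C \mid \eta, X) = \prod_i p(Y_i \mid \eta)$ (conditional independence and measurement invariance). The third line follows because each inner sum equals one, so under these assumptions only the items with $R_i = 1$ contribute factors to the product.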
Assumptions about missingness
Ignorable non-response
Missingness does not depend on the variable of interest but depends on
covariates (MAR).
p(R∣η, X) = p(R∣X).
Missingness does not depend on the variable of interest or covariates
(MCAR)
p(R∣η, X) = p(R)
It is evident that in the presence of MCAR or MAR, missing values can
be ignored with no effect on the inference about η.
In the MAR case, ML estimation and inference methods should be
applied.
Assumptions about missingness
Non-ignorable non-response
Missingness depends on the variable of interest, e.g. an attitude.
If we ignore it (treat it as random) the inference about the variable of
interest is likely to be biased. To avoid bias in inference, include the
missing data mechanism in the model.
To avoid a confounding correlation between η and R, one needs to
condition on all the important covariates.
A rich/appropriate model needs to be chosen for the missing
data mechanism.
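The bias claim can be illustrated with a small simulation. This is a sketch under assumed values only: one binary item with intercept 0 and loading 1, a 70% response rate under MCAR, and a hypothetical logistic response mechanism that depends on η in the non-ignorable case.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000


def expit(x):
    return 1.0 / (1.0 + np.exp(-x))


eta = rng.normal(size=n)                  # latent attitude, N(0, 1)
y = rng.random(n) < expit(eta)            # one binary item (tau = 0, alpha = 1)

# MCAR: responding is unrelated to eta.
r_mcar = rng.random(n) < 0.7
# Non-ignorable: respondents with higher eta are more likely to answer.
r_nonign = rng.random(n) < expit(1.0 + 1.5 * eta)

full_mean = y.mean()                               # proportion with Y = 1, complete data
bias_mcar = y[r_mcar].mean() - full_mean           # close to zero
bias_nonign = y[r_nonign].mean() - full_mean       # clearly positive
```

Analysing only the observed answers recovers the complete-data proportion under MCAR, but overstates it when response depends on η, which is why the missing data mechanism has to enter the model in that case.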
Models proposed for p(R∣η, X)
Missing indicators are summarised by a latent variable ξ that
measures response propensity.
The latent variable response propensity can be assumed to be
continuous (latent trait model) or discrete (latent class model).
Emphasis is also placed on specifying a rich model for the
response/non-response indicators.
Diagrammatic representation of a model for missing data,
Holman and Glas (2005)
[Path diagram: observed items y1, y2, ...; response indicators R1, R2, ...; latent variables η and ξ.]
Diagrammatic representation of model for missing data,
Knott, Albanese, Galbraith (1990), O’Muircheartaigh and Moustaki (1999),
Moustaki and Knott (2000)
[Path diagram: observed items y1, y2, ...; response indicators R1, R2, ...; latent variables η and ξ; covariates x.]
Response propensity is treated as continuous: Latent Trait
Model
For each binary response indicator $R_i$, dropping the group
variable:
\[
\mathrm{logit}[\pi_i(\eta, \xi, x)] = \beta_{0,i} + \beta_{1,i}\, \eta + \beta_{2,i}\, \xi + \sum_{s=1}^{S} \gamma_{s,i}\, x_s, \quad i = 1, \ldots, p, \tag{7}
\]
where $\pi_i(\eta, \xi, x) = P(R_i = 1 \mid \eta, \xi, x)$, and $\beta_{0,i}$, $\beta_{1,i}$, $\beta_{2,i}$ and $\gamma_{s,i}$ are the
intercept, loading parameters and regression coefficients, respectively, of
the model for the $i$th response indicator.
The parameters $\beta_{1,i}$ provide information on non-ignorability separately for
each item.
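The role of $\beta_{1,i}$ can be seen by evaluating (7) for one item at a few values of η. The sketch below uses hypothetical coefficients and two hypothetical covariates. With $\beta_{1,i} = 0$ the response probability is flat in η, while a non-zero value makes it change with the attitude itself.

```python
import numpy as np


def response_prob(eta, xi, x, beta0, beta1, beta2, gamma):
    """P(R_i = 1 | eta, xi, x) under the logistic model (7) for one item i."""
    linear = beta0 + beta1 * eta + beta2 * xi + np.dot(gamma, x)
    return 1.0 / (1.0 + np.exp(-linear))


x = np.array([1.0, 0.0])            # two hypothetical covariates
gamma = np.array([0.4, -0.2])       # hypothetical regression coefficients
eta_values = np.array([-1.0, 0.0, 1.0])

# beta1 = 0: the response probability does not change with eta.
p_flat = response_prob(eta_values, 0.5, x,
                       beta0=1.0, beta1=0.0, beta2=0.8, gamma=gamma)
# beta1 < 0: a higher eta lowers the probability of answering item i.
p_dependent = response_prob(eta_values, 0.5, x,
                            beta0=1.0, beta1=-0.7, beta2=0.8, gamma=gamma)
```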
The proposed model for missing data, η is continuous and
ξ is discrete
[Path diagram: observed items y1, y2, ...; response indicators R1, R2, ...; continuous latent variable η; discrete latent variable ξ; covariates x.]
Model specification: Latent Class Model
The latent class variable is denoted by ξ and has K latent classes, where $K \ll 2^p$.
The latent class model approximates the observed multinomial
distribution of R increasingly closely as the number of latent classes increases.
The number of latent classes needs to be decided using model fit
criteria such as AIC and BIC.
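As a rough illustration of why K can stay far below 2^p, the sketch below counts free parameters for a latent class model with binary indicators, conditional independence and no covariates (assumptions made here for simplicity) and compares them with the saturated multinomial. It also gives the usual AIC and BIC formulas. The helper names are hypothetical.

```python
import math


def n_params_latent_class(K, p):
    """(K - 1) class proportions plus K * p conditional response probabilities."""
    return (K - 1) + K * p


def n_params_saturated(p):
    """Free cell probabilities of the 2^p possible response patterns."""
    return 2 ** p - 1


def aic(loglik, k):
    """Akaike information criterion for log-likelihood loglik and k parameters."""
    return -2.0 * loglik + 2.0 * k


def bic(loglik, k, n):
    """Bayesian information criterion for a sample of size n."""
    return -2.0 * loglik + k * math.log(n)


p = 10
counts = {K: n_params_latent_class(K, p) for K in (1, 2, 3)}
saturated = n_params_saturated(p)
```

With ten indicators, one to three classes need 10, 21 and 32 parameters, against 1023 for the saturated model. Candidate values of K would then be compared by fitting each model and choosing the one with the smallest AIC or BIC.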
Two ways of looking at the missing data mechanism model
Version A Use the latent class membership
(respondents/non-respondents latent classes) as a predictor for the
mean and the variance of η. Such an interpretation is plausible in real
data.
For example, high levels of non-response are very likely to indicate
low ability, extreme attitudes, etc.
Version B Use the attitude latent variable η as a predictor for the
latent class membership.
For example, respondents with more liberal views might be less likely
to respond to questions about immigration. That definition is more in
line with the literature on missing data.
Find a good model for the response indicators, the single
group case
Fit a latent class model to the missing indicators R and determine the
number of latent classes in each country.
The measurement model for the indicators in R under
conditional independence and measurement invariance:
\[
P(R = r \mid X = x) = \sum_{k=1}^{K} P(R = r \mid \xi = k)\, P(\xi = k \mid X = x). \tag{8}
\]
The structural part of the model: it is important to consider all
potential covariates that affect both $\eta$ and $\xi$.
\[
P(\xi = k \mid X = x) = \frac{\exp(\alpha_{0|k} + \gamma_\xi' x)}{\sum_{l=1}^{K} \exp(\alpha_{0|l} + \gamma_\xi' x)}, \tag{9}
\]
where $\alpha_{0|K} = 0$ for identification purposes.
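The mixture in (8) and the multinomial logit in (9) can be sketched numerically as follows. The sketch lets the covariate coefficients differ by class, with the reference class K fixed at zero. That is an assumption about how (9) is parameterised, made because a coefficient vector common to all classes would cancel from the ratio. All values and names are hypothetical.

```python
import itertools
import numpy as np


def class_probs(x, alpha0, gamma):
    """P(xi = k | X = x) for k = 1..K; the last class is the reference."""
    linear = alpha0 + gamma @ x          # one linear predictor per class
    linear = linear - linear.max()       # guard against overflow in exp
    expl = np.exp(linear)
    return expl / expl.sum()


def pattern_prob(r, x, alpha0, gamma, rho):
    """P(R = r | X = x) as in (8); rho[k, i] = P(R_i = 1 | xi = k)."""
    # Conditional independence: P(r | xi = k) is a product over indicators.
    within = np.prod(rho ** r * (1.0 - rho) ** (1 - r), axis=1)
    return float(within @ class_probs(x, alpha0, gamma))


# Hypothetical values: K = 2 classes, p = 3 indicators, 2 covariates.
alpha0 = np.array([2.0, 0.0])                   # intercept of class K fixed at 0
gamma = np.array([[0.5, -0.3], [0.0, 0.0]])     # reference-class row is zero
rho = np.array([[0.98, 0.97, 0.99],             # class of likely respondents
                [0.40, 0.30, 0.50]])            # class of likely non-respondents
x = np.array([1.0, 0.0])

pi_x = class_probs(x, alpha0, gamma)
total = sum(pattern_prob(np.array(r), x, alpha0, gamma, rho)
            for r in itertools.product([0, 1], repeat=3))
```

As a check, the class probabilities sum to one, and so do the probabilities of the 2^p response patterns.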
Testing for ignorability - the multigroup case
The test for MCAR:
\[
p^{(g)}(\xi = k \mid X = x, \eta) = p^{(g)}(\xi = k). \tag{10}
\]
The test for MAR:
\[
p^{(g)}(\xi = k \mid X = x, \eta) = p^{(g)}(\xi = k \mid X = x). \tag{11}
\]
Testing for non-ignorability - the multigroup case
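Under non-ignorability, the class probabilities depend on $\eta$:
\[
p^{(g)}(\xi = k \mid X = x, \eta), \tag{12}
\]
\[
P^{(g)}(\xi = k \mid X = x, \eta) =
\frac{\exp\big(\alpha_{0|k}^{(g)} + \gamma_\xi'^{(g)} x + \lambda_\xi^{(g)} \eta\big)}
     {\sum_{l=1}^{K} \exp\big(\alpha_{0|l}^{(g)} + \gamma_\xi'^{(g)} x + \lambda_\xi^{(g)} \eta\big)}, \tag{13}
\]
\[
\eta^{(g)} = \lambda_\eta'^{(g)} X + \varepsilon^{(g)}. \tag{14}
\]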
European Social Survey: a study of non-response
Scale on attitudes towards welfare:
Most unemployed people do not really try to find a job.
Many people manage to obtain benefits and services to which they
are not entitled.
Employees often pretend they are sick in order to stay at home.
Response alternatives (5-point scale): Agree Strongly; Agree ; Neither
agree nor disagree; Disagree; Disagree Strongly.
Missing categories: Refusal, Don’t know, No answer.
Regression of ξ on covariates and attitude, analysis
conducted separately in each country
              FI        DK        GB        FR        NL        PL        GR
Intercept     4.59      3.92      3.48      3.82      3.76      2.53      3.10
Age → ξ      -0.50**   -0.38**   -0.27*    -0.36**    0.08     -0.27**   -0.03
Ed L3 → ξ     1.09      2.12*     2.02      0.11      1.65**    1.85***   1.44**
Ed L45 → ξ    1.00      1.51*     0.40      0.65      1.87**    1.78***   1.85*
F → ξ        -0.74     -1        -0.53     -0.54     -0.02     -0.74*    -0.75*
η → ξ         0.04     -0.29     -0.49*    -0.54*    -1.17*    -1.06**    0.01
***: p-value<0.001; **: p-value≤0.01; *: p-value≤0.05
ESS: Multigroup model
[Path diagram: items Q1, Q2, Q3; response indicators R1, R2, R3; latent variables η and ξ; covariates Country, Age, Education, Gender.]
Parameter estimates, multi-group model
Measurement model for the attitudinal items (all countries):
  $\hat{\lambda}_1$ = 0.52***    $\hat{\lambda}_2$ = 0.50***    $\hat{\lambda}_3$ = 0.49***

Regression of attitude on covariates:
       Country → η   Age → η   Ed L3 → η   Ed L45 → η   F → η
DK      0 (fixed)    -0.02      0.56***     1.09***      0.10
GB     -1.12***       0.01      0.11        0.46***     -0.05
NL     -0.35**        0.05*     0.36***     0.92***      0.03
PL     -0.96***       0.01     -0.10       -0.10         0.15*
***: p-value<0.001; **: p-value≤0.01; *: p-value≤0.05
Regression of ξ on covariates and attitude
               DK          GB          NL         PL
Intercept      3.86***
Country → ξ    0 (fixed)   -1.14       -0.66      -2.32**
Age → ξ       -0.38**      -0.25***     0.06      -0.27**
Ed L3 → ξ      2.09*        2.14*       1.54**     1.84***
Ed L45 → ξ     1.55*        0.38        1.62**     1.75***
F → ξ         -1.01        -0.48       -0.07      -0.73*
η → ξ         -0.28        -0.50*      -0.85*     -1.01**
Ongoing research
Test other model specifications such as a continuous response
propensity and discrete attitudes.
Perform a sensitivity analysis that examines the effect of ignoring
non-ignorable non-response on the structural part of the model.
Perform a sensitivity analysis that examines the effect of various
models for missing data on the structural and measurement models.
References
Knott, M., Albanese, M. T. and Galbraith, J. (1990). Scoring attitudes to abortion. The Statistician, 40, 217-223.
O’Muircheartaigh, C. and Moustaki, I. (1999). Symmetric pattern models: a latent variable approach to item
non-response in attitude scales. Journal of the Royal Statistical Society, Series A, 162, 177-194.
Moustaki, I. and Knott, M. (2000). Weighting for item non-response in attitude scales using latent variable models
with covariates. Journal of the Royal Statistical Society, Series A, 163(3), 445-459.
Holman, R. and Glas, C. A. W. (2005). Modelling non-ignorable missing-data mechanisms with item response theory
models. British Journal of Mathematical and Statistical Psychology, 58, 1-17.
Katsikatsou, M., Kuha, J. and Moustaki, I. (in preparation). Multigroup data and item non-response: a general model
framework.
Thank You and Many Good Wishes for 2015