Panel Data Methods in Stata - ReSAKSS

Panel data methods in Stata
Session 2
By Ziyodullo Parpiev, PhD
Overview
• Panel data
  - How to get to know the data
• Change over time
  - Tabulating
  - Calculating transition probabilities
Using panel data in Stata
• Data on n cases, over t time periods, giving a total of n × t observations
• One record per observation, i.e. long format
• Stata tools for analyzing panel data begin with the prefix xt
• First need to tell Stata that you have panel data using xtset
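A minimal sketch of the long format and the xtset declaration, using a small made-up dataset. The values and the variable name income are hypothetical; only pid, wave and the xtset call mirror the slides.

```stata
* Hypothetical toy data in long format: one row per person-wave
clear
input long pid byte wave double income
1 1  780
1 2  759
1 3  924
2 1 1501
2 3 1943
end

* Declare the panel: cross-wave identifier first, then the time variable
xtset pid wave

* Person 2 has no wave 2 record, so the panel is unbalanced with a gap
xtdescribe
```

Once the data are declared this way, the other xt commands pick up the panel structure without being told again.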
Complete and incomplete person-wave data
+------------------------------------------------------------+
|      pid  wave     sex  age    mastat    jbstat     fihhmn |
|------------------------------------------------------------|
| 10019057     1  female   59  never ma   retired        780 |
| 10019057     2  female   60  never ma   retired     759.14 |
| 10019057     3  female   61  never ma   retired      923.5 |
| 10019057     4  female   62  never ma   retired       62.5 |
| 10019057     5  female   63  never ma   retired        663 |
| 10019057     6  female   64  never ma   retired  missing o |
| 10019057     7  female   65  never ma   retired   1254.963 |
| 10019057     8  female   66  never ma   retired   1270.432 |
| 10019057     9  female   67  never ma   retired   1364.555 |
| 10019057    10  female   67  never ma   retired    1479.74 |
| 10019057    11  female   68  never ma   retired    1328.25 |
| 10019057    12  female   69  never ma   retired    1371.49 |
| 10019057    13  female   71  never ma   retired  missing o |
| 10019057    14  female   71  never ma   retired   1372.333 |
| 10019057    15  female   73  never ma   retired   1475.812 |
|------------------------------------------------------------|
| 10028005     1    male   30  never ma  employed   1501.155 |
| 10028005     2    male   31  never ma  employed   1636.259 |
| 10028005     3    male   32  never ma  employed   1943.283 |
| 10028005     6    male   35  never ma  employed    2001.54 |
| 10028005     7    male   36  never ma  employed    1634.33 |
| 10028005     9    male   38  never ma  employed   1587.945 |
+------------------------------------------------------------+
Telling Stata you have time series data

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

• pid: unique cross-wave identifier
• wave: time variable
• (unbalanced): cases not observed for every time period
• delta: period between observations, in units of the time variable
Describing the patterns in panel data

. xtdes, patterns(20)

     Freq.  Percent    Cum. |  Pattern
 ---------------------------+-----------------
     1294     28.12   28.12 |  111111111111111
      248      5.39   33.51 |  1..............
      157      3.41   36.93 |  11.............
      115      2.50   39.43 |  ..............1
      105      2.28   41.71 |  111............
      104      2.26   43.97 |  1111...........
       73      1.59   45.56 |  11111..........
       69      1.50   47.05 |  ............111
       66      1.43   48.49 |  ..........11111
       62      1.35   49.84 |  .............11
       60      1.30   51.14 |  .1.............
       60      1.30   52.45 |  11111111111....
       58      1.26   53.71 |  11111111.......
       58      1.26   54.97 |  111111111......
       57      1.24   56.21 |  11111111111111.
       55      1.20   57.40 |  .....1.........
       54      1.17   58.57 |  ........1111111
       54      1.17   59.75 |  .11111111111111
       54      1.17   60.92 |  1111111111.....
       53      1.15   62.07 |  .........111111
     1745     37.93  100.00 | (other patterns)
 ---------------------------+-----------------
     4601    100.00         |  XXXXXXXXXXXXXXX
Examining change over two waves

      1991 |
Employment |    1992 Employment status
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |      961       35       76 |    1,072
         2 |       36       38       24 |       98
         3 |       40       23      524 |      587
-----------+----------------------------+---------
     Total |    1,037       96      624 |    1,757

      2001 |
Employment |    2002 Employment status
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |      991       15       46 |    1,052
         2 |       20       12        9 |       41
         3 |       56       20      495 |      571
-----------+----------------------------+---------
     Total |    1,067       47      550 |    1,664
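Calculating transition probabilities
The transition probability is the probability of transitioning from one state to another:
$p_{ij} = \Pr\{X_t = j \mid X_{t-1} = i\}$
So to calculate by hand, divide each cell count by its row total:
$p_{ij} = N_{ij} \Big/ \sum_{j=1}^{n} N_{ij}$
where $N_{ij}$ is the cell count and $\sum_{j=1}^{n} N_{ij}$ is the row total.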
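The hand calculation can be sketched in Mata, Stata's matrix language, using the 1991 to 1992 cell counts from the cross-tabulation shown earlier. The matrix names N and P are illustrative only.

```stata
mata:
// Cell counts N_ij: rows are the 1991 state, columns the 1992 state
N = (961, 35, 76 \ 36, 38, 24 \ 40, 23, 524)

// Divide every cell by its row total to get p_ij
P = N :/ rowsum(N)

// Show the probabilities rounded to two decimals
round(P, 0.01)
end
```

The first row works out as 961/1072, 35/1072 and 76/1072, that is roughly 0.90, 0.03 and 0.07, and each row of P sums to one.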
Transition probability matrix

      1991 |
Employment |    1992 Employment status
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |     0.90     0.03     0.07 |     1.00
         2 |     0.37     0.39     0.24 |     1.00
         3 |     0.07     0.04     0.89 |     1.00
-----------+----------------------------+---------

      2001 |
Employment |    2002 Employment status
    status |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |     0.94     0.01     0.04 |     1.00
         2 |     0.49     0.29     0.22 |     1.00
         3 |     0.10     0.04     0.87 |     1.00
-----------+----------------------------+---------
Transition probability matrices in Stata
If you leave out the “if” qualifier, xttrans reports mean transition probabilities over all pairs of waves t to t+1.

. xttrans jbstat if wave<3, freq

   current |
  economic |    current economic activity
  activity |        1        2        3 |    Total
-----------+----------------------------+---------
         1 |      961       35       76 |    1,072
           |    89.65     3.26     7.09 |   100.00
-----------+----------------------------+---------
         2 |       36       38       24 |       98
           |    36.73    38.78    24.49 |   100.00
-----------+----------------------------+---------
         3 |       40       23      524 |      587
           |     6.81     3.92    89.27 |   100.00
-----------+----------------------------+---------
     Total |    1,037       96      624 |    1,757
           |    59.02     5.46    35.52 |   100.00
Change in a categorical variable over time
A decision tree
[Figure: decision tree of moves between the employment states empl, unemp and olf across successive waves, with the transition probability marked on each branch]
Change in a continuous variable over time
• Size transition matrix
• Quantile transition matrix
• Mean transition matrix
• Median transition matrix
Size transition matrix
• Absolute mobility
• Boundaries set exogenously, i.e. predetermined
  - e.g. movement in and out of poverty
  - e.g. poverty defined a priori as an income below £5,000
• Do not depend on distribution under investigation
  - e.g. comparing mobility in 1990s and 2000s
  - incorporates both movements of positions of individuals and economic growth
Quantile transition matrix
• Mobility as a relative concept
• Same number of individuals in each class
• Only records movements involving re-ranking
• Cannot take account of economic growth, for example when comparing matrices
• Cannot draw a complete picture if comparing mobility in different cohorts/countries/welfare regimes
Mean/median transition matrices
• Both absolute and relative approaches incorporated into matrices
• Class boundaries defined as percentages of mean or median income of the origin and destination distributions
• Example:
  - 25%, 50%, 75% of median income
  - Note that this is not the same as quartiles
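A rough sketch of how median-based classes could be built in Stata. The five income values are made up, and the cut points at 50%, 100% and 150% of the median are an illustrative assumption; the slide's own example uses 25%, 50% and 75%, and any other percentages slot in the same way.

```stata
* Hypothetical incomes for five households
clear
input double income
 400
 900
1500
2100
5000
end

* Store the median of this distribution (1500 here)
quietly summarize income, detail
local med = r(p50)

* Classes 1 to 4 with boundaries at 50%, 100% and 150% of the median
generate byte class = 1 + (income >= 0.5*`med') + (income >= `med') ///
    + (income >= 1.5*`med')
list income class
```

Because the boundaries move with the median of each year's distribution, the same household income can fall into different classes in the origin and destination years.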
Example: income 1991-1992

wave = 1
          household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%       181.86              0
 5%       349.82              0
10%       458.98              0       Obs                2795
25%     826.6895              0       Sum of Wgt.        2795

50%     1511.067                      Mean           1773.253
                        Largest       Std. Dev.      1299.089
75%     2365.493       9230.818
90%     3329.769       9230.818       Variance        1687633
95%     4062.217       9230.818       Skewness       1.836874
99%     6748.689       9230.818       Kurtosis       8.622895

wave = 2
          household income: month before interview
-------------------------------------------------------------
      Percentiles      Smallest
 1%     207.9433              0
 5%     338.7431              0
10%       460.68              0       Obs                2639
25%       861.67              5       Sum of Wgt.        2639

50%         1508                      Mean           1795.179
                        Largest       Std. Dev.      1229.827
75%     2449.813       8405.636
90%     3414.511       8405.636       Variance        1512476
95%     4103.649       10491.08       Skewness       1.352148
99%     5824.449       10491.08       Kurtosis       6.370836
Category boundaries for each method

Matrix    Year  Boundary 1 (n)   Boundary 2 (n)     Boundary 3 (n)      Boundary 4 (n)
Size      1991  0 - 800 (580)    800 - 1500 (650)   1500 - 2200 (504)   2200 - 9231 (715)
          1992  0 - 800 (580)    800 - 1500 (645)   1500 - 2200 (473)   2200 - 10491 (751)
Quartile  1991  0 - 827 (609)    827 - 1511 (615)   1511 - 2365 (611)   2365 - 9231 (614)
          1992  0 - 862 (610)    862 - 1508 (612)   1508 - 2450 (612)   2450 - 10491 (615)
Mean      1991  0 - 887 (654)    887 - 1773 (814)   1773 - 2660 (506)   2660 - 9231 (475)
          1992  0 - 898 (652)    898 - 1795 (766)   1795 - 2693 (501)   2693 - 10491 (530)
Median    1991  0 - 750 (539)    750 - 1500 (685)   1500 - 2250 (540)   2250 - 9231 (685)
          1992  0 - 746 (536)    746 - 1491 (686)   1491 - 2237 (505)   2262 - 10491 (722)
Warning!
• Measurement error
  - Causes an over-estimation of mobility
  - If mother’s and baby’s weights are reported to the nearest half pound, this can affect which band an observation falls in
  - A respondent may describe their marital status as separated in year 1 and single in year 2
Overview
• Types of questions, types of variables: time-invariant, time-varying and trend
• Between- and within-individual variation
• Concept of individual heterogeneity
• From OLS to models that allow causal interpretations: fixed effects and random effects models
• The basics of these models’ implementation in Stata
Types of variable
• Those which vary between individuals but hardly ever over time
  - Sex
  - Ethnicity
  - Parents’ social class when you were 14
  - The type of primary school you attended (once you’ve become an adult)
• Those which vary over time, but not between individuals
  - The retail price index
  - National unemployment rates
  - Age, in a cohort study
• Those which vary both over time and between individuals
  - Income
  - Health
  - Psychological wellbeing
  - Number of children you have
  - Marital status
• Trend variables: vary between individuals and over time, but in highly predictable ways
  - Age
  - Year
Between- and within-individual variation
If you have a sample with repeated observations on the same individuals, there are two sources of variance within the sample:
• The fact that individuals are systematically different from one another (between-individual variation)
• The fact that individuals’ behaviour varies between observations over time (within-individual variation)

With k individuals (i = 1, ..., k) each observed m times (j = 1, ..., m), the data form a k × m array:
$\begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & & \vdots \\ x_{k1} & x_{k2} & \dots & x_{km} \end{pmatrix}$

Total variation is the sum, over all individuals and years, of the squared difference between each observation of x and the overall mean:
$T = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x})^2$

Within variation is the sum of squared deviations of each individual’s observations from his or her own mean:
$W = \sum_{i=1}^{k} \sum_{j=1}^{m} (x_{ij} - \bar{x}_i)^2$

Between variation is the sum of squared differences between individual means and the whole-sample mean:
$B = \sum_{i=1}^{k} \sum_{j=1}^{m} (\bar{x}_i - \bar{x})^2$

Remember: from the variation you get to the variance, and from the variance to the standard deviation:
$SD = \sqrt{T/(N - 1)}$
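The three sums of squares can be sketched by hand in Stata on a tiny balanced panel. The data and the variable names (xbar, xbar_i, t_sq, w_sq, b_sq) are made up for illustration.

```stata
* Hypothetical balanced panel: 2 persons, 3 waves each
clear
input byte pid byte wave double x
1 1  2
1 2  4
1 3  6
2 1 10
2 2 12
2 3 14
end
xtset pid wave

* Overall mean and person-specific means
egen double xbar   = mean(x)
egen double xbar_i = mean(x), by(pid)

* Squared deviations behind the total, within and between variation
generate double t_sq = (x - xbar)^2
generate double w_sq = (x - xbar_i)^2
generate double b_sq = (xbar_i - xbar)^2

* Column sums give T, W and B
tabstat t_sq w_sq b_sq, statistics(sum)

* Stata's own decomposition, reported as standard deviations
xtsum x
```

With these numbers the person means are 4 and 12 and the overall mean is 8, so T = 112, W = 16 and B = 96, and the total splits exactly into T = W + B.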
xtsum in STATA
Similar to ordinary “sum” command

. xtset pid wave
       panel variable:  pid (unbalanced)
        time variable:  wave, 1 to 15, but with gaps
                delta:  1 unit

. xtsum female partner age ue_sick LIKERT wave if nwaves == 15

Variable         |      Mean   Std. Dev.       Min        Max |    Observations
-----------------+--------------------------------------------+----------------
female   overall |  .5397574   .4984321          0          1 |     N =   16324
         between |             .4989059          0          1 |     n =    1237
         within  |                    0   .5397574   .5397574 | T-bar = 13.1964
partner  overall |  .6892954   .4627963          0          1 |     N =   16292
         between |             .4217842          0          1 |     n =    1234
         within  |              .243531   -.244038   1.622629 | T-bar = 13.2026
age      overall |  40.03349   19.74332          0         98 |     N =   19410
         between |             19.27238        6.4   90.93333 |     n =    1294
         within  |              4.31763   31.30015   54.30015 |     T =      15
ue_sick  overall |  .0672924   .2505353          0          1 |     N =   16302
         between |             .1738938          0          1 |     n =    1237
         within  |             .1852756   -.866041   1.000626 | T-bar = 13.1787
LIKERT   overall |  11.26167   5.344825          0         36 |     N =   15661
         between |             3.609665          0   29.69231 |     n =    1225
         within  |             4.030974  -6.738331   35.12834 | T-bar = 12.7845
wave     overall |         8   4.320605          1         15 |     N =   19410
         between |                    0          8          8 |     n =    1294
         within  |             4.320605          1         15 |     T =      15

• if nwaves == 15: have chosen a balanced sample
• female: all variation is “between”
• partner: most variation is “between”, because it’s fairly rare to switch between having and not having a partner
• wave: all variation is within, because this is a balanced sample
More on xtsum…
How to read the output of
. xtsum female partner age ue_sick LIKERT wave if nwaves == 15
• N: observations with non-missing variable
• n: number of individuals
• T-bar: average number of time-points
• “between” row: min & max refer to xi-bar
• “within” row: min & max refer to individual deviation from own averages, with global averages added back in
The xttab command
For simplicity, omitted jbstats of missing, maternity leave, gov training and other.

. xttab jbstat if nwaves == 15 & jbstat >= 1 & jbstat != 5 & jbstat <= 8

                 Overall             Between            Within
  jbstat |    Freq.  Percent      Freq.  Percent        Percent
---------+-----------------------------------------------------
self-emp |     1388     8.66        228    18.45          42.72
employed |     8982    56.03        974    78.80          68.27
unemploy |      539     3.36        274    22.17          17.51
 retired |     2687    16.76        314    25.40          58.49
family c |     1159     7.23        292    23.62          28.97
ft studt |      718     4.48        271    21.93          42.93
lt sick, |      558     3.48        105     8.50          39.08
---------+-----------------------------------------------------
   Total |    16031   100.00       2458   198.87          50.28
                              (n = 1236)

• Overall: pooled sample, broken down by person/years
• Between: number of people who spent any time in this state
• Within: of those who spent any time in this state, the proportion of their time (on average) they spent in it
Which statistical model for panel data?
Your research question will guide which models are most suitable, but the nature of your data is also important.
Is your research question cross-sectional or longitudinal, or both?
• Cross-sectional: exploit variation between individuals
• Longitudinal: exploit variation “within” individuals over time and permit causal interpretation of effects
  - and can consider “between” variation if needed
Example: What is the effect on income of having more children?
• What is the difference in income between individuals who have a different number of children?
• What is the difference in income before and after the birth of a child?
• What is the difference in income between men and women and before and after the birth of a child?
• How does income change in the time leading up to the birth of a child? → survival analysis → later in this course!
Longitudinal analysis is concerned with modelling individual heterogeneity
A very simple concept: people are different!
In social science, when we talk about heterogeneity, we are really talking about unobservable (or unobserved) heterogeneity:
• Observed heterogeneity: differences in education levels, or parental background, or anything else that we can measure and control for in regressions
• Unobserved heterogeneity: anything which is fundamentally unmeasurable, or which is rather poorly measured, or which does not happen to be measured in the particular data set we are using
With panel data we can do something about unobserved heterogeneity, as we can differentiate between person-level unobserved x that are identical over time and those that vary over time!
OLS with panel data

pid  wave   x1      y
  1     1    0   2340
  1     2    5   2405
  1     3   10   2730
  1     4   15   3250
  1     5   20   3705
  1     6   25   4030
  2     1    5   1885
  2     2   10   2145
  2     3   15   2275
  2     4   20   2470
  2     5   25   2762
  2     6   30   3120
  3     1   10    780
  3     2   15   1170
  3     3   20   1365
  3     4   25   2405
  3     5   25... 
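The two fitted lines quoted on this slide can be checked against the 18 toy observations. This is only a sketch: the data are typed in by hand from the slide's table, and the variable names follow it.

```stata
* Toy panel from the slide: 3 persons, 6 waves
clear
input byte pid byte wave double x1 double y
1 1  0 2340
1 2  5 2405
1 3 10 2730
1 4 15 3250
1 5 20 3705
1 6 25 4030
2 1  5 1885
2 2 10 2145
2 3 15 2275
2 4 20 2470
2 5 25 2762
2 6 30 3120
3 1 10  780
3 2 15 1170
3 3 20 1365
3 4 25 2405
3 5 30 2405
3 6 35 2470
end

* Cross-section: wave 1 only, one observation per person
regress y x1 if wave == 1

* Pooled OLS: all 18 person-wave observations, ignoring who they belong to
regress y x1
```

The wave 1 regression should give a slope of -156 with an intercept near 2448, and the pooled regression a slope of about 29 with an intercept near 1925. The sign flips because the cross-section only compares different people.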
An illustration of how unobserved heterogeneity matters
Considering this is from panel data, two problems become apparent:
• Error terms for persons 1, 2 and 3 differ systematically
• The association between x and y appears to be biased
[Figure: two panels, “OLS: pooled” and “OLS: unobs het”, plotting Income against number of years since leaving school for pid=1, pid=2 and pid=3; the composite errors w1 and w3 and the person effect u1 are marked]
Panel data allows you to:
• Break down the error term (wi) into two components: the unobservable characteristics of the person (ui), and genuine “error” (ei)
• → then model ui and ei
Expanding the OLS model to consider unobserved heterogeneity
Analytically, think of splitting the error term into its two components $u_i$ and $\varepsilon_i$:
$y_i = \alpha + x_{i1}\beta_1 + x_{i2}\beta_2 + x_{i3}\beta_3 + \dots + x_{iK}\beta_K + u_i + \varepsilon_i$
… and consider that you have repeated observations over time:
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
• $u_i$: individual-specific, fixed over time
• $\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
… and then either reduce the complexity of the information available, or model it in full:
• Focus on “between” variation: lose info on “within” variation
• Focus on “within” variation: lose info on “between” variation
• Model both types of variation, making further assumptions
Within and between estimators
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
• $u_i$: individual-specific, fixed over time
• $\varepsilon_{it}$: varies over time, usual assumptions apply (mean zero, homoscedastic, uncorrelated with x or u or itself)
Not interested in within variation? Use the means of all observations for each person i:
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
This is the “between” estimator.
Not interested in “between” variation? Why not “remove” it in that case!
$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$
And this is the “within” estimator, i.e. “fixed effects”.
Interested in both? Then treat xi_bar as an imperfect measure of the person fixed effect, and use between variation where within variation is poorly captured:
$(y_{it} - \theta\bar{y}_i) = (1 - \theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1 - \theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\}$
θ measures the weight given to between-group variation, and is derived from the variances of ui and εi.
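The step from the panel model to the within estimator is a single subtraction, written out here as a short worked derivation in the slide's notation. Nothing new is assumed beyond $u_i$ being constant over time.

```latex
\begin{aligned}
y_{it} &= \alpha + x_{it}\beta + u_i + \varepsilon_{it}
  && \text{(panel model)} \\
\bar{y}_i &= \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i
  && \text{(average over } t \text{ for person } i\text{)} \\
y_{it} - \bar{y}_i &= (x_{it} - \bar{x}_i)\beta
  + (\varepsilon_{it} - \bar{\varepsilon}_i)
  && \text{(subtract: } \alpha \text{ and } u_i \text{ cancel)}
\end{aligned}
```

The quasi-demeaned equation nests both extremes: setting θ = 0 returns the first line (pooled OLS on the raw data), and setting θ = 1 returns the last line (the within estimator).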
Between estimator
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$\bar{y}_i = \alpha + \bar{x}_i\beta + u_i + \bar{\varepsilon}_i$
• Interpret as how much y changes between different people
• Not much used
  - Except to calculate the θ parameter for random effects, but Stata does this, not you!
  - It’s inefficient compared to random effects
  - It doesn’t use as much information as is available in the data (only uses means)
• Assumption required: that ui is uncorrelated with xi
  - Easy to see why: if they were correlated, how could one decide how much of the variation in y to attribute to the x’s (via the betas) as opposed to the correlation?
• Can’t estimate effects of variables where mean is invariant over individuals
  - Age in a cohort study
  - Macro-level variables
Focusing on “within” variation – the fixed effects family
• “Fixed effects” estimator: xtreg y x, fe
  - Basic idea: For each individual, calculate the mean of x and the mean of y. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - \bar{y}_i)$ and each $x_{it}$ is replaced by $(x_{it} - \bar{x}_i)$
Identical to:
• Least Squares Dummy Variables regression: areg y x, absorb(pid)
  - Include a dummy indicator for each individual; all time-constant individual-level differences, including the individual-specific error component $u_i$, will then be captured in the person-specific intercept.
Members of the same family, which you may come across in the literature:
• First Differences: regress D.(y x)
  - For each individual, and each time period’s y and x, calculate the difference between the value in this period and that in the last period. Then run OLS on a transformed dataset where each $y_{it}$ is replaced by $(y_{it} - y_{it-1})$ and each $x_{it}$ is replaced by $(x_{it} - x_{it-1})$
• “Hybrid models”: regress y x mean_x z
  - Run standard OLS but add the person mean $\bar{x}_i$ of each time-varying variable as an additional regressor
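A sketch comparing the family members on the three-person toy panel used in the earlier slides, typed in by hand. The noconstant option on the first-difference regression is an adjustment for this particular toy data and is not part of the slide's command.

```stata
* Toy panel: 3 persons, 6 waves
clear
input byte pid byte wave double x1 double y
1 1  0 2340
1 2  5 2405
1 3 10 2730
1 4 15 3250
1 5 20 3705
1 6 25 4030
2 1  5 1885
2 2 10 2145
2 3 15 2275
2 4 20 2470
2 5 25 2762
2 6 30 3120
3 1 10  780
3 2 15 1170
3 3 20 1365
3 4 25 2405
3 5 30 2405
3 6 35 2470
end
xtset pid wave

* Within (fixed effects) estimator
xtreg y x1, fe

* Least Squares Dummy Variables: same slope, person dummies absorbed
areg y x1, absorb(pid)

* First differences. Here x1 rises by exactly 5 every wave, so D.x1 is
* constant and would be collinear with an intercept; hence noconstant.
regress D.(y x1), noconstant
```

xtreg, fe and areg should report the same slope of about 65.3. The first-difference slope is a different estimate from the same family and need not coincide with it.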
Fixed effects estimator
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$(y_{it} - \bar{y}_i) = (x_{it} - \bar{x}_i)\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$

pid  wave     y   x1   ybar_i  xbar_i  (y - ybar_i)  (x - xbar_i)
  1     1  2340    0   3076.7    12.5        -736.7         -12.5
  1     2  2405    5   3076.7    12.5        -671.7          -7.5
  1     3  2730   10   3076.7    12.5        -346.7          -2.5
  1     4  3250   15   3076.7    12.5         173.3           2.5
  1     5  3705   20   3076.7    12.5         628.3           7.5
  1     6  4030   25   3076.7    12.5         953.3          12.5
  2     1  1885    5   2442.8    17.5        -557.8         -12.5
  2     2  2145   10   2442.8    17.5        -297.8          -7.5
  2     3  2275   15   2442.8    17.5        -167.8          -2.5
  2     4  2470   20   2442.8    17.5          27.2           2.5
  2     5  2762   25   2442.8    17.5         319.2           7.5
  2     6  3120   30   2442.8    17.5         677.2          12.5
  3     1   780   10   1765.8    22.5        -985.8         -12.5
  3     2  1170   15   1765.8    22.5        -595.8          -7.5
  3     3  1365   20   1765.8    22.5        -400.8          -2.5
  3     4  2405   25   1765.8    22.5         639.2           2.5
  3     5  2405   30   1765.8    22.5         639.2           7.5
  3     6  2470   35   1765.8    22.5         704.2          12.5

[Figure “Fixed Effects”: demeaned Income (-1000 to 1000) against demeaned number of years since leaving school (-10 to 10) for pid=1, pid=2 and pid=3, with fitted line y = 65*x1]

Fixed effects:
• Ignores between-group variation, so it’s an inefficient estimator
• However, few assumptions are required for FE to be consistent: ui is allowed to correlate with xi
• Disadvantage: can’t estimate the effects of any time-invariant variables
• Need to consider change in interpretation of effects
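The demeaning shown in the table can be reproduced step by step. This sketch re-enters the toy panel by hand; the names ybar_i, xbar_i, y_dm and x_dm are illustrative.

```stata
* Toy panel: 3 persons, 6 waves
clear
input byte pid byte wave double x1 double y
1 1  0 2340
1 2  5 2405
1 3 10 2730
1 4 15 3250
1 5 20 3705
1 6 25 4030
2 1  5 1885
2 2 10 2145
2 3 15 2275
2 4 20 2470
2 5 25 2762
2 6 30 3120
3 1 10  780
3 2 15 1170
3 3 20 1365
3 4 25 2405
3 5 30 2405
3 6 35 2470
end

* Person means of y and x1
egen double ybar_i = mean(y),  by(pid)
egen double xbar_i = mean(x1), by(pid)

* Deviations from the person means
generate double y_dm = y  - ybar_i
generate double x_dm = x1 - xbar_i

* OLS on the demeaned data; both variables now have mean zero
regress y_dm x_dm, noconstant
```

The slope comes out at about 65.3, in line with the fitted line y = 65*x1. The standard errors from this manual regression are not the ones xtreg, fe reports, because the degrees of freedom used up by the person means are not accounted for here.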
Want to look at the effect of non-time-varying x? Use $x_{it}$ and $\bar{x}_i$ in OLS
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
$y_{it} = \alpha + \beta_1 x_{it} + \beta_2 \bar{x}_i + \beta_3 z_i + u_i^{residual} + \varepsilon_{it}$
Hint: create $\bar{x}_i$ yourself

pid  wave     y   x   z  x_bar
  1     1  2340   1   1   1.5
  1     2  2405   2   1   1.5
  1     3  2730   2   1   1.5
  1     4  3250   2   1   1.5
  1     5  3705   1   1   1.5
  1     6  4030   1   1   1.5
  2     1  1885   0   2   0.67
  2     2  2145   1   2   0.67
  2     3  2275   1   2   0.67
  2     4  2470   1   2   0.67
  2     5  2762   1   2   0.67
  2     6  3120   0   2   0.67
  3     1   780   1   2   0.33
  3     2  1170   1   2   0.33
  3     3  1365   0   2   0.33
  3     4  2405   0   2   0.33
  3     5  2405   0   2   0.33
  3     6  2470   0   2   0.33

• $z_i$: non-time-varying individual characteristics for which you do not need to include group means
• The effect of any unobserved characteristic otherwise transported in the effect of $x_{it}$ is shifted to the effect of $\bar{x}_i$: $\beta_1$ approximates the coefficient in the FE model, $\beta_3$ gives you, approximately, the OLS estimate for non-time-varying variables $z_i$
• Typically no interest in the effect of $\bar{x}_i$, so no need to worry about its interpretation. Note that $\beta_1 + \beta_2$ is approximately equal to the effect in the pooled OLS
• Disadvantage: can only control for unobserved heterogeneity associated with observed time-varying variables $x_i$; the rest remains in $u_i^{residual}$
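A sketch of the hybrid regression on the slide's own 18 rows, typed in by hand. The variable name mean_x follows the command shown earlier in the deck; everything else is as in the table.

```stata
* Data from the slide: x varies over time, z is fixed per person
clear
input byte pid byte wave double y byte x byte z
1 1 2340 1 1
1 2 2405 2 1
1 3 2730 2 1
1 4 3250 2 1
1 5 3705 1 1
1 6 4030 1 1
2 1 1885 0 2
2 2 2145 1 2
2 3 2275 1 2
2 4 2470 1 2
2 5 2762 1 2
2 6 3120 0 2
3 1  780 1 2
3 2 1170 1 2
3 3 1365 0 2
3 4 2405 0 2
3 5 2405 0 2
3 6 2470 0 2
end

* Create the person mean of x yourself
egen double mean_x = mean(x), by(pid)

* Hybrid model: time-varying x, its person mean, and time-invariant z
regress y x mean_x z

* Fixed effects, for comparing the coefficient on x
xtset pid wave
xtreg y x, fe
```

With only three people this is purely a mechanical illustration. The coefficient on x in the hybrid regression can be compared with the one from xtreg, fe, while z is only estimable in the hybrid version.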
Random effects estimator
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
“Random Effects Model”, here RE Generalised Least Squares:
$(y_{it} - \theta\bar{y}_i) = (1 - \theta)\alpha + (x_{it} - \theta\bar{x}_i)\beta + \{(1 - \theta)u_i + (\varepsilon_{it} - \theta\bar{\varepsilon}_i)\}$
• Uses both within- and between-group variation, so makes best use of the data and is efficient. Starts off with the idea that using xi_bar is not the best we can do to capture within variation.
  - The more imprecise the estimate of the person-level variation (as measured by the person xi_bar), the more we should draw on the information from other units (x_bar)
• Assumption required: that ui is uncorrelated with xi
  - Rather heroic assumption: think of examples
  - Will see a test for this later
• Note that the within and between effects are constrained to be identical (much more like OLS in this respect, so no causal interpretation!)
  - E.g., when you include a location indicator in your model, you are saying that the effect on y of moving to a new town is the same as the effect on y of living in different towns. When you include a female dummy, you are saying that the effect of being female on y is the same as the effect on y of changing gender.
Estimating fixed effects in STATA

. xtreg LIKERT female ue_sick partner age age2 badh, fe

Fixed-effects (within) regression               Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0501                         Obs per group: min =         1
       between = 0.1906                                        avg =       7.3
       overall = 0.1285                                        max =        14

                                                F(5,20882)         =    220.44
corr(u_i, Xb)  = 0.1561                         Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |  (dropped)
     ue_sick |   1.951485   .1394164    14.00   0.000     1.678218    2.224752
     partner |   -.298668    .118635    -2.52   0.012    -.5312018   -.0661342
         age |   .1141748   .0214403     5.33   0.000     .0721501    .1561994
        age2 |  -.0011833   .0002209    -5.36   0.000    -.0016163   -.0007503
        badh |   1.230831   .0428556    28.72   0.000      1.14683    1.314831
       _cons |   6.252975   .4932977    12.68   0.000     5.286073    7.219877
-------------+----------------------------------------------------------------
     sigma_u |  3.9934565
     sigma_e |  4.0525618
         rho |  .49265449   (fraction of variance due to u_i)
------------------------------------------------------------------------------
F test that all u_i=0:     F(3316, 20882) =     4.56         Prob > F = 0.0000

• R-sq: an “R-square-like” statistic
• age and age2: the age profile peaks at age 48
• sigma_u and sigma_e: “u” and “e” are the two parts of the error term
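Two of the annotations can be checked with plain arithmetic on the reported estimates. This is a sketch using display as a calculator, with the coefficients copied from the fixed-effects output.

```stata
* Turning point of the quadratic age profile: -b_age / (2 * b_age2)
display -.1141748 / (2 * -.0011833)

* rho: share of the error variance due to u_i
display 3.9934565^2 / (3.9934565^2 + 4.0525618^2)
```

The first line gives roughly 48.2, matching the “peaks at age 48” note, and the second about 0.493, matching the reported rho. Run straight after the xtreg command itself, `nlcom -_b[age]/(2*_b[age2])` would return the same turning point together with a standard error.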
Between regression:
Not much used, but useful to compare coefficients with fixed effects

. xtreg LIKERT female ue_sick partner age age2 badh, be

Between regression (regression on group means)  Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0480                         Obs per group: min =         1
       between = 0.2322                                        avg =       7.3
       overall = 0.1482                                        max =        14

                                                F(6,3310)          =    166.80
sd(u_i + avg(e_i.))=  3.833357                  Prob > F           =    0.0000

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.476659   .1350226    10.94   0.000     1.211923    1.741395
     ue_sick |   2.038192    .312191     6.53   0.000     1.426085    2.650299
     partner |  -.0101941   .1777423    -0.06   0.954      -.35869    .3383019
         age |   .0827335   .0219026     3.78   0.000     .0397895    .1256775
        age2 |  -.0009489   .0002263    -4.19   0.000    -.0013927   -.0005052
        badh |   2.275832   .0926521    24.56   0.000     2.094171    2.457493
       _cons |   3.953941   .4430909     8.92   0.000     3.085181    4.822701
------------------------------------------------------------------------------

• Coefficient on “partner” was negative and significant in FE model.
• In FE, the “partner” coeff really measures the events of gaining or losing a partner
Random effects regression

. xtreg LIKERT female ue_sick partner age age2 badh, re theta

Random-effects GLS regression                   Number of obs      =     24204
Group variable: pid                             Number of groups   =      3317

R-sq:  within  = 0.0500                         Obs per group: min =         1
       between = 0.2239                                        avg =       7.3
       overall = 0.1471                                        max =        14

Random effects u_i ~ Gaussian                   Wald chi2(6)       =   2013.32
corr(u_i, X)       = 0 (assumed)                Prob > chi2        =    0.0000

--------------------- theta --------------------
  min      5%       median        95%      max
0.1986   0.1986     0.5482     0.6629   0.6629

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.       z    P>|z|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.493431   .1259931    11.85   0.000     1.246489    1.740373
     ue_sick |   2.045302   .1271039    16.09   0.000     1.796183    2.294422
     partner |  -.1947691   .0973734    -2.00   0.045    -.3856175   -.0039207
         age |   .1058038    .014544     7.27   0.000     .0772981    .1343094
        age2 |  -.0011062   .0001498    -7.39   0.000    -.0013998   -.0008126
        badh |   1.433115   .0385506    37.17   0.000     1.357558    1.508673
       _cons |   5.181864   .3137662    16.52   0.000     4.566894    5.796835
-------------+----------------------------------------------------------------
     sigma_u |  3.0248563
     sigma_e |  4.0525618
         rho |   .3577895   (fraction of variance due to u_i)
------------------------------------------------------------------------------

• Option “theta” gives a summary of weights
• Theta tells you how good an approximation xi_bar is of the person-level effect; or how much of the within variation we used to determine the effect size
• theta = 0 gives the OLS estimator, theta = 1 gives the FE estimator
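The theta summary can be reproduced from sigma_u and sigma_e, assuming the usual random-effects GLS weight theta_i = 1 - sqrt(sigma_e^2 / (T_i*sigma_u^2 + sigma_e^2)), where T_i is the number of waves observed for person i. The scalar names below are illustrative, and the choice of T = 7 for the median is an inference from the numbers rather than something the output states.

```stata
* Variance components copied from the random-effects output
scalar s_u2 = 3.0248563^2
scalar s_e2 = 4.0525618^2

* theta for a person observed 1, 7 or 14 times
foreach T of numlist 1 7 14 {
    display "T = `T'   theta = " %6.4f 1 - sqrt(s_e2 / (`T' * s_u2 + s_e2))
}
```

This gives 0.1986 for T = 1 and 0.6629 for T = 14, the reported minimum and maximum, with 0.5482 for T = 7 matching the median. The more waves a person contributes, the closer their data are treated to the fixed-effects transformation.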

OLS simply treats within- and between-group variation as the same
Pools data across waves

. reg LIKERT female ue_sick partner age age2 badh

      Source |       SS       df       MS              Number of obs =   24204
-------------+------------------------------           F(  6, 24197) =  706.54
       Model |  103583.505     6  17263.9175           Prob > F      =  0.0000
    Residual |  591239.694 24197  24.4344214           R-squared     =  0.1491
-------------+------------------------------           Adj R-squared =  0.1489
       Total |  694823.199 24203  28.7081436           Root MSE      =  4.9431

------------------------------------------------------------------------------
      LIKERT |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
      female |   1.409466   .0640651    22.00   0.000     1.283895    1.535038
     ue_sick |   2.031815   .1240757    16.38   0.000     1.788619    2.275011
     partner |  -.0751296   .0769271    -0.98   0.329    -.2259116    .0756524
         age |   .0983746   .0103316     9.52   0.000      .078124    .1186252
        age2 |  -.0010613   .0001049   -10.12   0.000     -.001267   -.0008557
        badh |   1.841796   .0357165    51.57   0.000     1.771789    1.911802
       _cons |   4.450393   .2212733    20.11   0.000     4.016684    4.884102
------------------------------------------------------------------------------
Test whether pooling data is valid
$y_{it} = \alpha + x_{it}\beta + u_i + \varepsilon_{it}$
• If the ui do not vary between individuals, they can be treated as part of α and OLS is fine.
• Breusch-Pagan Lagrange multiplier test
  - H0: Variance of ui = 0
  - H1: Variance of ui not equal to zero
• If H0 is not rejected, you can pool the data and use OLS
• Post-estimation test after random effects

. quietly xtreg LIKERT female ue_sick partner age age2 badh, re
. xttest0

Breusch and Pagan Lagrangian multiplier test for random effects

        LIKERT[pid,t] = Xb + u[pid] + e[pid,t]

        Estimated results:
                         |       Var     sd = sqrt(Var)
                ---------+-----------------------------
                  LIKERT |   28.70814       5.357998
                       e |   16.42326       4.052562
                       u |   9.149756       3.024856

        Test:   Var(u) = 0
                              chi2(1) = 10816.48
                          Prob > chi2 =     0.0000
Comparing models
• Compare coefficients between models
  - Reasonably similar: differences in “partner” and “badhealth” coeffs
• R-squareds are similar
  - Within and between estimators maximise within and between R-squared respectively.

                    FE         RE         BE        OLS
female                     1.49 ***   1.48 ***   1.41 ***
ue_sick         1.95 ***   2.05 ***   2.04 ***   2.03 ***
partner        -0.30 **   -0.19 **   -0.01      -0.08
age             0.11 ***   0.11 ***   0.08 ***   0.10 ***
age2            0.00 ***   0.00 ***   0.00 ***   0.00 ***
badh            1.23 ***   1.43 ***   2.28 ***   1.84 ***
_cons           6.25 ***   5.18 ***   3.96 ***   4.45 ***
R-2 within      0.050      0.050      0.048
R-2 between     0.191      0.224      0.232
R-2 overall     0.129      0.147      0.148      0.149