Panel data
Applied Time Series Econometrics, Spring 2014

Advantages of panel data
• More observations
• Time dimension allows taking dynamics into account
• More variation in the data (over time, across cross sections)
• Unobserved variables that may be correlated with the variables in the model can be eliminated

Handling panel data
• Some concepts:
• Panel data / longitudinal data / pooled cross section – time series data:
– Same observation units followed over time
• Balanced panel:
– Information available on all observation units for all time periods
• Unbalanced panel:
– For some units, not all time periods available
• Unbalanced panel caused by:
– Some survey respondents stop participation
– New individuals taken to sample (to replace exited ones)
– Rotation of sample
– Demographic events
  • in firm data: entry of new firms, exit of firms, mergers
  • in individual data: deaths, births
– Errors in data
– Sample attrition
• Pseudo panel: separate cross sections are aggregated to cells (e.g. based on age cohort, sex etc.) that are followed over time
– Sometimes the term pseudo panel is also used for aggregate panel data, e.g. industry panels or country panels (in contrast to "true" panels that cover micro units)
• The data set can be ordered in alternative ways:
– Stacked by cross section: list all years for the 1st cross section unit, then for the 2nd, etc.
– Stacked by year: list all observations for the 1st year, then for the 2nd year, etc.
– In either case, variables as columns
– Both are OK, but you must have a time identifier and a cross-section identifier as variables (columns)
– The data can be unbalanced (no need to have "empty" rows)
• The data can also be in columns by cross section ("wide form") and converted to stacked data ("long form") in Stata
• Data set stacked by cross-section unit or by date
– Columns: date, unit, variable $y_{it}$, variables $x_{it}'$ (1 x K vector)

Stacked by cross-section unit:

  date  unit  y       x'
  1     1     y_11    x_11'
  ...   ...   ...     ...
  T     1     y_1T    x_1T'
  1     2     y_21    x_21'
  ...   ...   ...     ...
  T     2     y_2T    x_2T'
  ...   ...   ...     ...
  1     N     y_N1    x_N1'
  ...   ...   ...     ...
  T     N     y_NT    x_NT'

Stacked by date:

  date  unit  y       x'
  1     1     y_11    x_11'
  1     2     y_21    x_21'
  ...   ...   ...     ...
  1     N     y_N1    x_N1'
  ...   ...   ...     ...
  T     1     y_1T    x_1T'
  T     2     y_2T    x_2T'
  ...   ...   ...     ...
  T     N     y_NT    x_NT'

• A panel need not have time and cross section as the dimensions
• You could have for example region and industry as the dimensions
– In this case, the regions have to have code numbers; treat them as if they were "time periods"

• Stata data set wagepanel.dta in the course homepage
• Number of cross-section units: 545
• Number of time periods: 8
• Number of obs.: 8 x 545 = 4360
• Variables:

  1.       nr            person number (cross section identifier)
  2.       year          1980 to 1987 (time identifier)
  3.       black         =1 if black
  4.       exper         labor market experience in years
  5.       hisp          =1 if Hispanic
  6.       hours         annual hours worked
  7.       married       =1 if married
  8.-16.   occ1 to occ9  occ1=1 if occupation = 1 etc.
  17.      educ          years of schooling
  18.      union         =1 if union member
  19.      lwage         log(wage)
  20.-26.  d81 to d87    d81=1 if year = 1981 etc.
  27.      expersq       exper^2

• When you have imported panel data to Stata, you have to tell the program that it is a panel
• Statistics > Time series > Setup and utilities > Declare dataset to be time series
– Then give both the time variable and the panel ID variable
• Command xtset nr year (also tsset nr year works)
– Where year and nr are the names of the variables in the wagepanel.dta data set that indicate time and cross section
– Commands that start with "xt" are panel commands
• Time series operators L., F., D.
work also with panel data

Pooled estimation
• If we use OLS directly for the panel, this is usually called pooled estimation
  $y_{it} = \alpha + x_{it}'\beta + \varepsilon_{it}$
• i = 1,...,N (cross sections); t = 1,...,T (time)
• Use normal regression (regress)

[Output slide: Pooled estimation]

• Estimation methods that take into account the panel nature of the data are fixed effects (FE) and random effects (RE) estimation
  $y_{it} = \alpha_i + x_{it}'\beta + \varepsilon_{it}$
• $\alpha_i$ = fixed effect or random effect
– depending on the application, also called individual effect, firm effect etc.; other terms: unobserved effect, idiosyncratic effect
– The terms fixed effects and random effects are misleading, since both allow $\alpha_i$ to be random!

Fixed effects (FE) estimation
• A) least squares dummy variables estimation (LSDV)
– Include a dummy variable for each cross section unit (except leave out one) and estimate the model with OLS
– The coefficients of the dummies estimate the unobserved effects
– Not useful if there are many cross section units; in the wage data N = 545
– In principle possible to include the dummies in the form i.nr, e.g. reg lwage educ i.nr, but there may be problems with matrix size
• Incidental parameters problem
– When the number of observations increases, we can use asymptotic properties of the estimators, e.g. consistency
– In time series data $T \to \infty$; in cross-section data $N \to \infty$; in panel data $T \to \infty$ and/or $N \to \infty$
– If in panel data T is fixed and the model has cross section dummies, when $N \to \infty$, also the number of parameters to be estimated increases and there is no gain from more observations, so the parameters of the "fixed effects" are not consistent
• The idea in the other fixed effects methods is to get rid of the unobserved effect $\alpha_i$
• B) take differences over time
  $\Delta y_{it} = \Delta x_{it}'\beta + \Delta\varepsilon_{it}$
• $\Delta\alpha_i = 0$, so the "fixed" effect disappears
• First observation for each cross section is lost
• OLS used for the differenced model; use regress and the D. operator for the variables
• This is called first difference transformation
• Sometimes long differences (e.g.
3-year differences) used, but this requires more data
• Variables that are constant over time drop out

[Output slide: First-difference estimation (note: constant dropped)]

• C) take differences from cross-section means (demeaning)
• Estimate the model
  $y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + (\varepsilon_{it} - \bar{\varepsilon}_i)$, where $\bar{y}_i = \sum_t y_{it} / T$, etc.
• Again, the unobserved effect disappears, because the mean of $\alpha_i$ is $\alpha_i$ (so $\alpha_i - \alpha_i = 0$)
• OLS used for the demeaned model
• This is called within transformation
• If T = 2, first-difference and within approaches give the same results

[Figure: y plotted against x for two cross-section units, showing the true regression line for cross-section unit 1, the true regression line for cross-section unit 2, the regression line estimated with pooled OLS, and the regression line estimated with FE]

• Examples:
– $y_{it}$ = log(wage); $\alpha_i$ = unobserved ability; $x_{it}$ includes education, which is correlated with ability (high ability leads to more education)
– $y_{it}$ = log(output); $\alpha_i$ = unobserved managerial ability; $x_{it}$ includes logs of inputs (labor and capital), which are correlated with managerial ability (with good management the firm grows and uses more labor and capital)
• In both cases OLS estimates are inconsistent
• The unobservable can be eliminated in FE estimation

• Estimation of FE (within) model in Stata
• Statistics > Longitudinal/panel data > Linear models > Linear regression
• In the estimation window, specify the model in the normal way and choose model type
• Command for example xtreg lwage exper hours, fe
– Options fe = fixed effects, re = random effects, be = between, pa = population averaged (rarely used in econometrics)

[Output slide: FE (within) estimation]

• Some notes on the output
• Stata output shows a constant even for the fixed effects model
– This is an "average" constant
  • In demeaning, overall averages are added to the variables
• Output gives 3 different R2 values
– Within: for demeaned model
  • R2 in dummy variable OLS would be much higher
– Between: for time averaged model
– Overall: for pooled data
  • Between and overall R2 are not quite equal to "traditional" R2

• What has to be assumed in FE estimation?
• Errors uncorrelated with the explanatory variables:
  $E[(x_{it} - \bar{x}_i)'(\varepsilon_{it} - \bar{\varepsilon}_i)] = 0$
• This implies that $x_{it}$ are uncorrelated with all (past and future) errors, since they are part of the average error
– This is called the strong exogeneity assumption

Dynamic panel data with FE
• Strong exogeneity fails (at least) when there are lagged values of the dependent variable:
  $y_{it} = \alpha_i + x_{it}'\beta + \gamma y_{i,t-1} + \varepsilon_{it}$
  $y_{it} - \bar{y}_i = (x_{it} - \bar{x}_i)'\beta + \gamma (y_{i,t-1} - \bar{y}_i) + (\varepsilon_{it} - \bar{\varepsilon}_i)$
• The transformed lagged dependent variable is correlated with the transformed error
• Estimation by GMM (generalized method of moments): a combination of GLS and instrumental variables, with lagged values of variables as instruments
• Statistics > Longitudinal/panel data > Dynamic panel data (DPD)
• xtabond (Arellano-Bond)
• xtdpdsys (Blundell-Bond)
• xtdpd (both of above)
• Fairly complicated, require many choices (lag lengths, instruments, endogeneity of variables)

Clustering
• Transformed errors for cross section unit i (individual, firm etc.) tend to be correlated with each other (they all include the same average error)
– Typical approach: correct standard errors for "clustering" (cluster = cross section unit)
– Command for example xtreg lwage exper, fe vce(cluster nr) or specify the standard error option from the panel data menu (SE/Robust)
– nr is the name of the cross-section identifier

[Output slide: FE with standard error corrected for clustering]

Several fixed effects
• It is also possible to specify fixed time effects (effects that vary over time, but not across cross sections)
– "two-way" model
– Dummy variables for time periods (in the example data, dummies d81 to d87; leave out one of them!)
• Sometimes three-way models
– for example if the employer of individuals is known, there could be individual effects, firm effects, and time effects
– Complicated, if large data sets (e.g.
1 million individuals, 10000 firms)

Random effects (RE) estimation
• Include the unobserved effect in the error term:
  $y_{it} = x_{it}'\beta + u_{it}$, $u_{it} = \alpha_i + \varepsilon_{it}$, $E(\alpha_i) = 0$, $Var(\alpha_i) = \sigma_\alpha^2$
– Note: Now x includes a constant
• Use generalized least squares (GLS) or maximum likelihood (ML) to estimate the model, taking into account the error structure (all errors for i are correlated with each other, since they include the same random $\alpha_i$)

[Output slide: RE estimation]

Issues in choosing between FE and RE
• Traditional view: In FE the "effects" are time invariant parameters to be estimated, in RE they are random terms
• When the data set is a random sample of a large population, RE may be appropriate
• If data cover certain individuals / firms / etc., FE is appropriate
– For example, data on all the biggest firms in Finland, data on OECD countries, etc.
• Contemporary view:
• The "effects" are random in both approaches; the main issue is whether they are correlated with the x's (allowed in FE) or not (assumed in RE)
• In FE, inferences are conditional on the effects; in RE, inferences are unconditional
• Out-of-sample projections of $y_{it}$ are possible with RE, but not with FE, since $\alpha_i$ is not known for out-of-sample observations
• In FE, variables that do not change over time (for a cross section unit) cannot be used
– Their mean equals their constant value, so the difference from the mean is zero (the variable is "wiped out" in the within transformation)
– E.g. education cannot be included, if nobody's educational level changes over time!
– Other examples:
  • Female dummy in wage equation
  • Many country characteristics in country panels

Between estimation
• FE is based on variation within cross section units
• Average the data for each cross section and use them in estimation of the model
  $\bar{y}_i = \alpha + \bar{x}_i'\beta + \bar{\varepsilon}_i$
– we get the between estimator (BE), which is based on variation between cross section units
• It can be shown that the RE estimator can be written as a combination of the FE and BE estimators
– RE is a "compromise" of FE and BE

[Output slide: Between estimation]

• Breusch-Pagan LM (Lagrange multiplier) test; tests the hypothesis that $Var(\alpha_i) = 0$
– If the hypothesis is not rejected, pooled OLS can be used (instead of RE)
– After xtreg estimation with option re, use command xttest0
– Or Statistics > Longitudinal/panel data > Linear models > Lagrange multiplier test for random effects
– The test output gives a chi-square test statistic and p-value
– A high value for the test statistic (and small p-value) would indicate rejection of the hypothesis $Var(\alpha_i) = 0$, i.e. rejection of the hypothesis of no random effects

[Output slide: Breusch-Pagan test for random effects]

• Testing RE vs. FE
• Hausman test
– Tests whether the coefficients in FE and RE estimations are equal
– If $x_i$ is correlated with $\alpha_i$, FE is consistent but RE is not, so the estimates should be different
– If equality of FE and RE estimates is not rejected, both are consistent and we conclude that $x_i$ is not correlated with $\alpha_i$, so RE can be used
– If equality of estimates is rejected, we conclude that $x_i$ is correlated with $\alpha_i$, so FE should be used
• Hausman test in Stata
– Statistics > Postestimation > Tests > Hausman specification test
– With commands: After xtreg estimation with option fe, store the estimates with estimates store fe_e
  • Here fe_e is an arbitrary name given for the stored estimates
– Then estimate the same model (i.e. same variables) with xtreg and option re, and store the estimates: estimates store re_e
– Finally, give command hausman fe_e re_e
– A high value for the test statistic (and small p-value) would indicate rejection of the hypothesis that FE and RE estimates are equal; if rejected, FE should be used
– The test sometimes does not work (involves inverting a matrix that may not be invertible)

[Output slide: Hausman test]
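
The contrast between pooled OLS, the within (FE) transformation and first differences can be checked on simulated data. The following is a minimal sketch in Python rather than Stata, under stated assumptions: the data are simulated (not wagepanel.dta), there is a single regressor, the true slope is set to 1, and the unobserved effect is built to be correlated with x. All function names and parameter values are hypothetical and chosen only for illustration.

```python
import random


def simulate_panel(n_units=500, n_periods=2, beta=1.0, seed=0):
    """Simulate rows (unit, period, x, y) where alpha_i is correlated with x_it."""
    rng = random.Random(seed)
    rows = []
    for i in range(n_units):
        alpha = rng.gauss(0.0, 1.0)              # unobserved effect of unit i
        for t in range(n_periods):
            x = alpha + rng.gauss(0.0, 1.0)      # regressor correlated with alpha
            y = alpha + beta * x + rng.gauss(0.0, 0.5)
            rows.append((i, t, x, y))
    return rows


def slope_through_origin(xs, ys):
    """OLS slope of ys on xs without a constant."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def group_by_unit(rows):
    """Return one time-ordered list of (period, x, y) per cross-section unit."""
    units = {}
    for i, t, x, y in rows:
        units.setdefault(i, []).append((t, x, y))
    return [sorted(obs) for obs in units.values()]


def pooled_ols(rows):
    """Pooled OLS slope with a common constant (overall demeaning)."""
    xs = [r[2] for r in rows]
    ys = [r[3] for r in rows]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return slope_through_origin([x - mx for x in xs], [y - my for y in ys])


def within_estimator(rows):
    """FE slope: subtract each unit's own means, then OLS on the demeaned data."""
    xs, ys = [], []
    for obs in group_by_unit(rows):
        mx = sum(x for _, x, _ in obs) / len(obs)
        my = sum(y for _, _, y in obs) / len(obs)
        xs += [x - mx for _, x, _ in obs]
        ys += [y - my for _, _, y in obs]
    return slope_through_origin(xs, ys)


def first_difference_estimator(rows):
    """FD slope: OLS on period-to-period differences, constant dropped."""
    xs, ys = [], []
    for obs in group_by_unit(rows):
        for (_, x0, y0), (_, x1, y1) in zip(obs, obs[1:]):
            xs.append(x1 - x0)
            ys.append(y1 - y0)
    return slope_through_origin(xs, ys)


rows = simulate_panel()
b_pooled = pooled_ols(rows)
b_fe = within_estimator(rows)
b_fd = first_difference_estimator(rows)
print(f"pooled OLS slope:       {b_pooled:.3f}")
print(f"within (FE) slope:      {b_fe:.3f}")
print(f"first-difference slope: {b_fd:.3f}")
```

In this design the pooled slope should come out well above the true value of 1, because the unobserved effect moves together with x, while the within and first-difference slopes should be close to 1 and, since T = 2, equal to each other up to rounding. In Stata the corresponding steps are the commands already mentioned in the slides: regress for pooled OLS, xtreg with option fe for the within estimator, and regress with the D. operator for first differences.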