Stata
A short introduction
Applied Time Series Econometrics
31E01300
Spring 2014
General
• Stata is a general statistical package,
which is widely used by econometricians,
biometricians, and social statisticians
• To use Stata efficiently, it is necessary to
learn commands
• However, you can also point-and-click, and
Stata shows the commands implied by the
”clicks” → easy to learn the commands
Starting
• Start the program by clicking the shortcut symbol
• Or: if you already have a data file, click its
name
• When you start the program, you will see
– Title
– Menu row
– Shortcut symbols
– Command window
– Review window
– Results window
– Variables window
– Address line
Shortcut symbols:
• Open file
• Save file
• Print
• Start/close log
• Viewer
• Edit graph
• Open do file
• Edit data
• View data
• Variables manager
• Stop execution
• Log is a record of everything that appears in the
results window, including commands and results
• Viewer: for example, help files are shown in a viewer
• Data editor: input or change data
– Not necessary to use, most likely better not to
use
• Data browser: you can look at data, but not
change it (read-only)
• Do file: a file with commands
Menus:
• File: open a data file, import data, print etc.
• Edit: edit text
• Data: go to data editor or browser, combine data
sets, get descriptions of data, matrix operations
• Graphics: graphics tools
• Statistics: statistics and estimation commands
• Window: determine which windows are shown
• Help: search help files (on the computer) or add-on
programs available on the net; manual in PDF
form
Reading in data
• To try something quickly with a small number of
observations, you can type in the data or copy &
paste data in the Data Editor
• To import data from a file, click File → Import →
Excel spreadsheet
– You come to a window where you can choose
Excel file and worksheet names.
• Similarly, if the data is in a text file, choose the
format in the Import menu
• There are also some specialized file transfer
programs (e.g. StatTransfer) to convert data
directly e.g. from Excel to Stata
• It is useful to organize the data in Excel so that
variable names are in the first row and the
observation number (or the date in time series) is in the
first column
• When reading in a very large data set, new versions of
Stata automatically adjust the memory
size; with older versions of Stata you had to
make sure that there is enough memory for the data
– To increase memory in those older versions, write in the command
window for example: set memory 50m
• If a Stata data file already exists, you can
start Stata by clicking the Stata data file name
(file with .dta ending)
• Or start by opening Stata and then click
the Open file shortcut symbol, or click File
→ Open, and choose an existing data file
Saving your data
• Click the save button and the dataset (with possible
changes) replaces the old one with the same name
• To give a new name, click the save shortcut symbol or
File → Save As and give a name to the data set
– Stata data file names always have the ending .dta
• Note: it is always important to have a safe copy of your
original, unchanged data
• When you generate new variables etc., save the new
data file with another name than the original file
• If the data set is small, a good policy is to do the variable
creation always ”from scratch” with a do-file (see below)
After reading in data
• In the Variables window:
– Variable names
– Possibly some explanations (Labels)
– Variable type and format
– If the type is str (string), the variable has been read
as text and not numbers
• It could be genuinely text (like automobile names) or just
numbers coded as text, in which case they can be converted
to numbers
• In the Results window: command that has been
executed (and output when applicable)
• In the Review window: command that has been
executed
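• Numbers that were read in as text can, for instance, be converted with the destring command. The following is only a minimal sketch: the variable names price_txt and price_num and the three values are made up for illustration and are not part of the course data.

```stata
* Hypothetical example: three numbers stored as text
clear
input str5 price_txt
"10.5"
"12"
"9.75"
end
* The storage type is str5, i.e. text
describe price_txt
* Create a numeric copy of the text values
destring price_txt, generate(price_num)
summarize price_num
```

Using generate() keeps the original string variable, so the conversion can be checked before the text version is dropped.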
• There are two other ways of giving commands in
Stata (besides point-and-click)
• 1) Write command in the Command window
• Then press Enter and Stata executes the
command
• The command appears in the Review window
and the result in the Results window
• Examples of commands to read in data
import excel "C:\Documents\try.xlsx", sheet("Sheet1")
– Reads an Excel file; ‘Documents’ is whatever
place (folder) where you saved your data file
use "C:\Documents\datasetname.dta", clear
– Reads a Stata file that already exists
• An example of a command to save data
save "C:\Documents\newfilename.dta"
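• When the Excel sheet has variable names in its first row, the firstrow option of import excel reads them as variable names. A rough sketch of a round trip, assuming the example dataset auto.dta that ships with Stata and a made-up file name auto_example.xlsx written to the current working directory:

```stata
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
* Write the data to Excel with variable names in the first row
* (auto_example.xlsx is a hypothetical file name)
export excel using "auto_example.xlsx", sheet("Sheet1") firstrow(variables) replace
* Read it back, treating the first row as variable names
import excel using "auto_example.xlsx", sheet("Sheet1") firstrow clear
describe
```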
• 2) Write the commands in a Do-file and run it
• Click the Do-file shortcut symbol and an empty do-file appears.
• Write the commands there
• You can also copy commands from the Review
window and paste them in the do-file
• Then highlight the command(s) and click the
symbol in the extreme right (Do-button)
• Stata executes the command(s) (for example,
data is read).
• Now you can save the do-file and repeat the
same thing in the future
• What’s the point in using a do-file with
commands?
– You have a record of what you have done:
how you handled the data, generated
variables, exact estimation commands used
etc.
– → you (and others!) can always replicate
what you have done
– → easy to add new variables, change
estimation methods etc.
– → easier to detect errors
– → you can apply user-written extensions to
Stata commands
• Often easiest to start by doing initial
estimations using point-and-click mode; then
copy the estimation commands to a do-file
• When you have learnt the basics, faster to
write the commands directly
• To learn about commands, see the manual
and the help file
• When you read data or save datasets to
your computer or memory stick, it is often
useful to define the path so you don’t have
to repeat it
• Example: you keep the files in the folder
project in the documents folder and have
the Stata dataset mydata there
local mypath "C:/documents/project"
cd "`mypath'"
use mydata.dta
Example file
• We will use several example files, which will
be available on the home page
– The dataset icecream.dta contains time series
data from a paper by Hildreth and Lu (1960)
– four-week observations from 18 March 1951 to
11 July 1953
– consumption: consumption of ice cream per
head (pints)
– income: avg. family income per week (US $)
– price: price of ice cream (per pint)
– temp: average temperature (in Fahrenheit)
– time: index from 1 to 30
Looking at the data
• Viewing the data set
– Click the Data Browser button
– You see the variable names and values
– You can sort the data according to some
variable by clicking some cell in the column
according to which the data is to be sorted, and
then clicking Data → Sort
– You can order the columns by clicking Data →
Data utilities → Change order of variables
• Or drag variables in the Variables window of the Data
browser to the order you want
– Note:
– Decimals are noted with periods, not commas
– Missing values denoted by dots ( . )
• No missing values in the icecream data set
Printing data and summary statistics
• Click Data → Describe data
• There are several options, which all lead
to a window where you can choose the
variables you are interested in
– List: lists variable values in the results window
– Describe: gives information on e.g. the range,
and distribution of the variables and their
format
– Summary statistics: mean, variance, min, max
• Descriptive statistics can also be obtained
by clicking Statistics → Summaries,
tables, and tests → Summary and
descriptive statistics → Summary
statistics
– There you can also e.g. get correlations of
variables
• The same with commands:
– you can write the command
summarize consumption price income temp
– Or
list consumption price income temp
– You can either write these in the command
window or in a do-file
• To tabulate distribution for variables with
discrete values, use tabulate or tab, for
example:
tab temp
• If several variable names have the same
beginning, e.g. if we have variables x1 and x2,
you can write summarize x* to get a summary
of both
• Most commands can be shortened, for example
instead of summarize, you could write sum
• if specifies conditions, e.g. if you want to
summarize the variable consumption when the variable temp
exceeds 60, choose the if/by sheet in the summary statistics
window and write the condition
temp>60 there
– The corresponding command is summarize
consumption if temp>60
– Several if-conditions can be combined: if temp>60
& income<80
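• The wildcard, abbreviation and if-condition ideas above can be tried out on any dataset. A small sketch, assuming the example dataset auto.dta that ships with Stata (its variables price, mpg and weight stand in for the ice cream variables):

```stata
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
* Wildcard: all variables whose names start with m
summarize m*
* Abbreviated command with one condition
sum price if mpg > 25
* Two conditions combined with & (and)
sum price if mpg > 25 & weight < 2500
* Number of observations satisfying both conditions
count if mpg > 25 & weight < 2500
```

Conditions can also be combined with | (or), which this sketch does not show.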
• by makes it possible to do an operation for
separate subgroups
– For example, we could summarize ice cream
consumption by temperature
– Choose in the summary statistics window the
if/by sheet and there click Repeat commands
by groups and write temp as the Variables that
define groups
– With command: by temp, sort: sum
consumption or bysort temp: sum
consumption
– Here most ”groups” have only one observation,
so the operation does not make much sense
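• by is more informative when the grouping variable takes only a few values. A sketch under the assumption that the example dataset auto.dta shipped with Stata is used, where foreign splits the cars into two groups:

```stata
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
* Summary of mpg separately for each value of foreign
bysort foreign: summarize mpg
* Equivalent longer form
by foreign, sort: summarize mpg
* Group sizes
tabulate foreign
```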
Create variables
• Click Data → Create or change data → Create
new variable
• Example: convert the temperature, which is in
Fahrenheit (°F), to Celsius (°C)
• Give a name to the new variable, e.g. ctemp
• Then write the expression for the new variable:
(temp-32)*5/9
– By clicking Create, you get mathematical and other
functions, and from Data → Create or change data
→ Create new variable (extended) some additional
expressions
– Alternatively, you can write the command generate
ctemp=(temp-32)*5/9 in the Command window or a do-file
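• The conversion formula can be checked on a few values whose Celsius equivalents are known. A minimal sketch with three made-up observations (32, 50 and 68 °F), not the course data:

```stata
* Three hypothetical temperatures: 32, 50 and 68 degrees Fahrenheit
clear
set obs 3
generate temp = 32 + 18*(_n - 1)
* Convert to Celsius
generate ctemp = (temp - 32)*5/9
list temp ctemp
* 32 F is the freezing point (0 C); 50 F is 10 C; 68 F is 20 C
assert ctemp == 0 in 1
assert ctemp == 10 in 2
assert ctemp == 20 in 3
```

assert stops with an error if a condition is false, which makes it a convenient check inside a do-file.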
• To change the content of a variable, click
Data → Create or change data →
Change contents of variable
– With commands: for example the income
variable is weekly income. If you wanted to
convert that to annual income, without
changing the name of the variable, you could
write
replace income = 52*income
• To rename a variable, click Data →
Variable utilities → Rename a variable
– Alternatively, write a command, e.g.
rename temp temp_fahrenheit
• Some additional points:
– Stata allows long variable names, so try to
avoid short, cryptic names (difficult to
remember afterwards what they are)
– Upper case and lower case letters are treated
as different, so for example temp and Temp
would be different variable names
– If you try to create a new variable and give it a
name that an existing variable has, you’ll get
an error message
• Often useful to add explanations on the
variables, i.e. labels
– Click the shortcut symbol for the Variables manager, or
click Data → Variables manager, then
choose a variable and write its label
(description)
– The label appears in the Variables window
– You can also choose value labels: Click the
value labels manager and give the values and
their labels and give a name for the labels
– Example: give the temperature variable temp
the label ”Fahrenheit” and the variable ctemp
the label ”Celsius”
– Do it in the menu, or use commands
label variable temp "Fahrenheit"
label variable ctemp "Celsius"
– The variable labels now appear e.g. in graphs
and tables instead of the variable names
– Example: generate a dummy variable for
temperatures above 15 °C. Then give the
value labels high and low to the values 1 and
0, and the name templabel to the value
labels:
gen hightemp = ctemp>15
label variable hightemp "dummy for high temp"
label define templabel 1 "high" 0 "low"
label values hightemp templabel
• We need the last command to attach the value labels to the
variable hightemp
– If we now give the command tab hightemp, we see
the value labels instead of 1’s and 0’s
• In the previous example, we used a logical
expression ctemp>15
• This returns 1 if the expression is true and
0 if not
• Similarly, one can use
>, <, >=, <=, != (not equal), == (equal)
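• One point worth knowing about logical expressions: Stata treats a missing value ( . ) as larger than any number, so a condition like ctemp>15 is true for missing observations. The icecream data has no missing values, so the example above is unaffected; the sketch below uses three made-up values and the hypothetical variable names high_naive and high_safe.

```stata
* Hypothetical data: two temperatures and one missing value
clear
input ctemp
5
20
.
end
* Missing counts as "large", so the third observation gets 1
generate high_naive = ctemp > 15
* Restricting to non-missing values leaves the third observation missing
generate high_safe = ctemp > 15 if !missing(ctemp)
list
```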
Some examples on plotting data
• Click Graphics → Histogram
– Then choose the variable
– Note: several other choices are available,
concerning if/by, graph titles etc.
– Alternatively, write a command, e.g. histogram
ctemp
• To get a smoothed continuous distribution
graph, click Graphics → Smoothing and
densities → Kernel density
– Alternatively, use the command kdensity ctemp
• Plot ice cream demand (consumption) against
temperature (ctemp)
• Click Graphics → Twoway graph
– Then click Create, and you get a window where
you can choose graph type (take Scatter) and
choose the variables
– Click Submit and you’ll get the graph
– Note that this is named Plot 1 in the Twoway
graph window. You can now specify several
plots and include them in the same graph
– Alternatively, you’ll get the same graph with a
command scatter consumption ctemp
– Line graph: line consumption time
[Figure: scatter plot of cons (vertical axis) against temperature in Celsius (horizontal axis)]
[Figure: line plot of cons (vertical axis) against time (horizontal axis)]
• Saving a graph
– After getting the graph, you can save it
• File → Save as and give a name, or use the command graph
save graphname, where graphname is the name you want to
give
• Saved as a file with ending .gph
– You can also copy the graph and paste it in a Word
document
– Graphs in Stata mostly not as flexible as in Excel
– In Stata the graphs can be edited, for example text
can be added, color changed etc. using the Graph
editor
• If you have saved two graphs and given them
e.g. the names graph1 and graph2, you can
combine them into one graph: graph combine
graph1.gph graph2.gph
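• Graphs can also be combined without saving each one to disk first, by giving them names in memory with the name() option. A sketch assuming the example dataset auto.dta shipped with Stata; the graph names g_scatter and g_hist and the file name combined_example.gph are made up:

```stata
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
* Two graphs kept in memory under hypothetical names
scatter mpg weight, name(g_scatter, replace)
histogram mpg, name(g_hist, replace)
* Combine the two named graphs into one
graph combine g_scatter g_hist
* Save the combined graph to the current working directory
graph save "combined_example.gph", replace
```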
Using estimation commands
• Example: simple regression model
• Click Statistics → Linear models and
related → Linear regression
• Then choose dependent variable and
independent variable
– Constant term is added automatically
– Alternatively, write command, e.g.:
regress consumption ctemp income
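• After an estimation command, the results stay in memory and can be used by follow-up commands. A sketch of some common ones, assuming the example dataset auto.dta shipped with Stata (mpg, weight and foreign stand in for the ice cream variables; mpg_hat and mpg_res are made-up names):

```stata
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
regress mpg weight foreign
* Stored coefficient on weight
display "Coefficient on weight: " _b[weight]
* Fitted values (the default of predict) and residuals
predict mpg_hat
predict mpg_res, residuals
summarize mpg_hat mpg_res
* Joint test that both slope coefficients are zero
test weight foreign
```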
Log file
• It is useful to have a record of what you have done
during a session; this is kept in a log file
• To start a log file, click the log shortcut symbol, then
choose file name and location
– Choose log file type
– log: text file that can be directly read by Word (e.g. copy tables to
text)
– smcl: formatted log file, has to be ”translated” to a text file before
use
• The Results window shows that the log has been started
• In the Review window you see a corresponding
command, which looks something like log using
"C:/Folder name/logfilename.log"
• You can see the log file by clicking File → Log → View
• To stop logging, click the log shortcut and Close log file, or
write the command log close
• All this you can do in a do-file with commands (i.e. write
these in a do-file, save it, highlight commands and click the
Do-button)
• Example:
log using "C:\Folder name\logfilename.log", replace
use "C:\Folder name\icecream.dta", clear
generate ctemp=(temp-32)*5/9
sum consumption ctemp income price
reg consumption ctemp income price
save "C:\Folder name\icecream_modified.dta", replace
log close
Some command & do-file hints:
• In the above do-file, the log command has
”, replace ” in the end
– This replaces an existing log file with a new one
– ”, append ” adds the new log to the end of an old one
– Of course you can always give a new name to the log
file
• use command has ”, clear ” in the end
– This clears possible old data from memory before
reading new one
• save command has ”, replace ” in the end
– Replaces old data file with a new one
– Be careful with this: don’t destroy possibly the only
copy of your original data
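• A do-file that is run repeatedly can fail at the log using line if a log from an earlier run is still open. One common pattern, shown here only as a sketch, is to start with capture log close. The file names example_session.log and auto_copy.dta are made up, and the example dataset auto.dta shipped with Stata stands in for the course data.

```stata
* Close any log left open; capture suppresses the error if none is open
capture log close
* Plain-text log in the current working directory
log using "example_session.log", replace text
* auto.dta is an example dataset installed with Stata
sysuse auto, clear
summarize mpg weight
* Save under a new name so the original data stays untouched
save "auto_copy.dta", replace
log close
```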
• It is useful to write some explanations and
comments in the do-file
– Easier to remember afterwards what you did and why
– Comments can be added by starting a line with *
– Or use /* and */ and include comments between them
• Lines can be broken by /// in the end of the line.
The command continues from the next line
• Examples:
*Regress icecream consumption on income
/* Regress icecream consumption
on temperature and income */
reg consumption ///
ctemp income
• Some users prefer to end all commands with
some symbol, e.g. a semicolon ;
• In this case, you have to tell Stata that the
’delimiter’ is ;
#delimit ;
reg consumption
temp income;
– Note that you can in this case continue the command
in several lines, only the last line of a command ends
with ;
• To change commands back to ending with a line break (carriage
return), use #delimit cr (this is the default)
Time series operations
• When the data set consists of time series variables,
we can use special time series operations,
but first we have to tell Stata that we have
time series
– The data has to have a variable that is a time
identifier, e.g. time in the icecream.dta data
– Statistics → Time series → Setup and
utilities → Declare dataset to be time
series data
– With command (assume time is the name of
the identifier):
tsset time
• In models with a time series dimension, we often
need lags
• The lag operator is L. ; for two lags L2., etc.
– L.x is x_{t-1}, L2.x is x_{t-2} etc.
– Note the dot ”.” after L
• You can either generate lagged variables:
gen lagincome= L.income and use the
generated variables in estimation command, for
example reg consumption lagincome
• Or: specify the lags in the estimation command:
reg consumption L.income
• To use differences over time, the operator is D. (or
D2. for twice differencing etc.)
– D.x is x_t - x_{t-1}, D2.x is (x_t - x_{t-1}) - (x_{t-1} - x_{t-2})
• You can generate
gen dcons = D.consumption
or
gen dcons = consumption - L.consumption
• Or use the operator directly in estimation:
reg D.consumption D.income
• You can also combine these, e.g.
gen dlagcons = D.L.consumption
• Note: when using lagged or differenced variables,
you lose observations from the beginning of the
data period
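• The definitions of L., D. and D2. and the loss of observations can be checked on simulated data. A sketch with 30 made-up observations; the variable names x, lagx, dx and d2x and the seed are arbitrary choices, not from the course data.

```stata
* Simulated series of 30 observations (not the ice cream data)
clear
set obs 30
set seed 2014
generate time = _n
generate double x = rnormal()
tsset time
generate double lagx = L.x
generate double dx = D.x
generate double d2x = D2.x
* L.x is the previous observation; D.x is x minus its lag
assert lagx == x[_n-1] if time > 1
assert dx == x - L.x if time > 1
* D2.x is the difference of the difference (up to rounding)
assert abs(d2x - (D.x - L.D.x)) < 1e-12 if time > 2
* One observation is lost with D., two with D2.
count if missing(dx)
count if missing(d2x)
```

The variables are stored as double so that the exact comparisons in assert are not affected by rounding to float.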
• For leads, use F. , F2. etc.
– F.x is x_{t+1}, F2.x is x_{t+2} etc.
• For seasonal differences, use S. , S2. etc.
– S.x is x_t - x_{t-1} (the same as D.x), S2.x is x_t - x_{t-2}
etc.
• E.g. to get 4-quarter differences of
consumption, use gen dcons =
S4.consumption (or gen dcons =
consumption - L4.consumption)
• These can be combined with lags, e.g.
L.S4.consumption
• Using multiple lags:
reg consumption L(1,2,3).income
reg consumption L(1/3).income
(lags from 1 to 3)
reg consumption L(1(2)5).income
(lags from 1 to 5 with an increment of 2, i.e. lags 1, 3 and 5)
• Alternatively, you can generate the lags; in this
case, it is useful to give them names
ending in lag numbers: gen
c1=L.consumption, gen c2=L2.consumption,
and similarly gen i1=L.income, gen i2=L2.income
• These can easily be handled in the
commands:
reg consumption i1-i3 c1-c3
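• The two routes (lag operators in the command, or generated lag variables) can be compared on simulated data. A sketch with made-up series x and y; the forvalues loop is one possible way to generate the numbered lags. Note that a range such as x1-x3 refers to the variables between x1 and x3 in the order they are stored in the data set, so the lags should be generated one after another.

```stata
* Simulated data: 40 observations of two unrelated series
clear
set obs 40
set seed 2014
generate time = _n
generate double x = rnormal()
generate double y = rnormal()
tsset time
* Lags 1 to 3 specified directly in the command
regress y L(1/3).x
* The same lags generated as variables x1, x2, x3
forvalues k = 1/3 {
    generate double x`k' = L`k'.x
}
* Same regression using the variable range x1-x3
regress y x1-x3
```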
Dates
• Suggestions on coding dates e.g. in an
Excel file that you read into Stata
– Annual data: years are already numbers, so
no problem
– Quarterly data: 1980q1, etc., or q1-1980 etc.
– Monthly data: 1996Jan, 1996Feb, etc., or
Jan1996 etc., or 1996m01, 1996m02, etc.
– Daily data: 1987Dec13 etc., or 13dec1987, or
13121987
– also other formats possible
• Time codes of this kind would be strings
(text) in a Stata data file.
• For example, file nyse.dta has information
on monthly volume of NY stock exchange,
with string variable obs showing the
months in the form 1900m01 etc.
• To generate a number for the months,
generate a new variable,
gen month =monthly(obs,"YM")
– YM tells that we have used coding like
1980m1 where the year comes first and then the
month; note the use of quotation marks and
capital letters ("YM")
• Stata produces a new variable month,
which is a series of numbers that always
starts running from the beginning of year
1960, in this case with Jan 1960 coded as 0
• With monthly data, there are 60*12=720
monthly observations in 1900m1 –
1959m12, so the numbering in the data
set starts from -720
• A similar method applies to quarterly,
daily, etc. data, e.g. gen day =
daily(date,"YMD") if the original string
variable is date
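• The numbering can be checked on a few hand-typed month strings. A minimal sketch with three made-up observations (not the nyse.dta file); the format command with %tm is one way to display the numbers as months without running tsset.

```stata
* Three hypothetical month codes stored as strings
clear
input str7 obs
"1900m01"
"1959m12"
"1960m01"
end
* Convert the strings to Stata monthly dates
generate month = monthly(obs, "YM")
* The underlying numbers count months from January 1960
list obs month
assert month == -720 in 1
assert month == -1 in 2
assert month == 0 in 3
* Display the same numbers as months
format month %tm
list obs month
```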
• Then in the tsset command, one can specify
the month variable as the time dimension:
tsset month
• Better still, use the command
tsset month, monthly
– Now the variable month is displayed with
”labels” 1900m1 etc. (but it is not a
string), although it still has the numerical
values -720, …; this is convenient e.g. in
graphs
– If the option , monthly is not used, the
periods are numbered from 109 even in
figures where we might prefer to have
actual dates