Stata is statistical analysis software commonly used in the social sciences. It is known for its ease of use, robust support for complex survey designs, and comprehensive, clear documentation.
Stata (pronounced either stay-ta or stat-ta; the official FAQ supports both) is used primarily through typed commands written in the Stata syntax. There is a GUI that you can access via midesktop, or you can SSH into one of the university Linux servers and run Stata at the command line (via the stata
command).
Once you have Stata up and running, the simplest form of use is as a calculator.
. display 2 + 4
The .
is the command prompt in the Stata console, similar to >
in R or $
for the BASH shell.
The command name is display
and it outputs the result of the input expression. This is a trivial example, but display
is much more flexible.
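As a small sketch of that flexibility (written as it would appear in a Do-file, so without the leading . prompt), display can mix quoted strings, expressions, and a display format. The particular numbers are only illustrative.

```stata
* display accepts strings, expressions, and display formats
display "2 + 4 = " 2 + 4
* a display format placed before the expression controls how it is printed
display %6.2f 10 / 3
* built-in functions can be used inside the expression
display sqrt(16) + log(1)
```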
Most Stata commands follow the same basic format:
. command <variable(s)>, <options>
The command can have sub-commands. The number of variables which need to be or can be passed obviously varies by command. The order of variables may matter. For example, any regression model treats the first listed variable as the response.
The options are space-separated words (e.g. robust
) or words with arguments (e.g. level(90)
). Almost all commands support some number of options, but most commands do not require any options.
For example, a regression would be as simple as
. regress y x1 x2, noconstant
This is defining a linear regression model predicting y
based upon the continuous variables x1
and x2
. The noconstant
option removes the implied intercept from the design matrix for this model.
Stata supports abbreviations for most commands. For example, the regression command could have been typed as:
. reg y x1 x2, nocons
I do not recommend using abbreviations in most cases, as they create ambiguity for readers of your code. You should still be aware that they work, as you may see others use them.
Most commands support an if
or in
qualifier to operate on a subset of the data. Unlike options, these qualifiers are placed before the comma. For example,
. regress y x1 x2 if group == 1
performs linear regression only on the set of cases for which the variable group equals 1. The in
qualifier operates on specific rows and may be useful for testing slow-running code:
. regress y x1 x2 in 1/100
The 1/100
syntax is equivalent to writing 1 2 3 4 .... 99 100
.
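The pieces of the command format can be tried on the auto example data that ships with Stata. This is a minimal sketch; the choice of mpg, weight, and length as variables is mine and only for illustration.

```stata
sysuse auto, clear
* option after the comma: 90% confidence intervals instead of the default 95%
regress mpg weight length, level(90)
* if restricts the model to foreign cars
regress mpg weight length if foreign == 1
* in restricts the model to the first 20 rows
regress mpg weight length in 1/20
```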
Stata code you submit for problem sets or project work should use the full names of commands.
Stata has well-written, comprehensive documentation. Use the help _____
syntax to look up documentation on any command. Help files contain the syntax of the command, a list of options allowed, a description of the command, and a number of examples. Additionally, the very top of each help file contains a link to the appropriate location in the full Stata manual which typically has further examples. Use these help files liberally.
Note that running help
on an abbreviation (e.g. help reg
) pulls up the help correctly. This is useful when reading others' code. Additionally, in the syntax and options listings, you'll see partial underlining. The underlined portion is the shortest abbreviation that Stata accepts.
The file extension for the binary format native to Stata is .dta
.
You can access files with:
use to load a local file or, when passed a URL, an online file.
sysuse or webuse to access Stata example data sets by name (e.g. in Help examples).
import excel and import delimited for importing Excel or CSV files.
For importation, if working in the Stata GUI you may consider using the File -> Import menu instead of trying to type the command. The Import dialog box has a live preview as you change settings which can be useful. After you run the import, the command it generated is echoed and you can copy it to your .do file to be run again.
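A self-contained way to try the import commands is to write an example data set out as a CSV file and read it back. In this sketch auto_copy.csv is a hypothetical file name, and export delimited (not discussed above) is used only to create something to import.

```stata
* load an example data set shipped with Stata
sysuse auto, clear
* write it to a CSV file in the working directory (hypothetical file name)
export delimited using auto_copy.csv, replace
* read the CSV back in; variable names are taken from the first row
import delimited using auto_copy.csv, clear
describe
```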
You can save the reference data set with the save
command,
. save newfilename
Passing the replace
option allows Stata to overwrite an existing file. Stata will throw an error if you try to overwrite an existing data set without explicitly specifying replace
.
To make your analyses reproducible, I recommend always specifying replace for data sets created by your scripts, and never saving to data files obtained from elsewhere.
If you try and open a new data set when there are un-saved changes in the existing data, Stata will refuse with an error. The clear
command will remove the existing data and allow you to load a new data set. You can also pass clear
as an option to the commands use
/sysuse
/webuse
/import
.
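The save, replace, and clear behavior can be seen in a short sequence. This is a sketch using the auto and census example data; auto_gpm is a hypothetical file name for a derived data set.

```stata
sysuse auto, clear
generate gpm = 1 / mpg        // an unsaved change to the data in memory
save auto_gpm, replace        // replace lets the script be re-run without error
sysuse census, clear          // clear discards the data currently in memory
use auto_gpm, clear           // reload the derived file created above
```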
A quirk of Stata is that until the most recent version of Stata (16), only a single data set could be open at a time. Because of this, in earlier versions of Stata, if you wanted to operate on multiple data sets you needed to either switch between them or merge them into one file using the append
or merge
commands.
The benefit of having a single data set open is that you never need to refer to it - any command you give must operate on the open data set. For instance, in our previous regression examples the variables y
, x1
, x2
, and group
all referred to columns of the reference data in memory.
The most recent version of Stata introduced the concept of frames, rectangular data sets similar to data.frames in R, which can be used to store multiple data sets in memory.
The preserve
and restore
commands can be useful for switching between files or making destructive modifications to an existing data set. The preserve
command takes a snapshot of the data as it currently is, and restore
switches back to it. To see an example of it, consider the collapse
command. The collapse
command generates a summary data set by collapsing the existing data by a given variable and creating summary statistics. For example, if we have some sort of census data,
. collapse (mean) age female, by(state)
This would replace the existing data set with a new one with one row per state, containing a variable indicating which state, a variable indicating the average age in that state, and, assuming female is coded 0/1, a variable indicating the proportion of the state which identifies as female. (The percent statistic would not give this; it reports the percentage of nonmissing observations falling in each group.)
The command collapse
is destructive; after running it the original data (and any unsaved changes) are lost. We can wrap this in preserve
and restore
to save the descriptive data and recover the original data.
. use data1
. preserve
. collapse (median) x1 x2 (min) x3 (sum) x4, by(group)
. save summary_data
. restore
After running the above five lines, “data1” will still be open, but we’ll have a new file summary_data.dta
.
Alternatively, since version 16, frames can be used in place of the above.
. use data1
. frame copy default group_summary
. frame group_summary: collapse (median) x1 x2 (min) x3 (sum) x4, by(group)
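The frames approach can be tried end to end on the auto example data. This is a sketch that requires Stata 16 or later; by_origin is a hypothetical frame name and the summary statistics are chosen only for illustration.

```stata
sysuse auto, clear
capture frame drop by_origin      // allows the script to be re-run
* copy the data in memory into a second frame
frame copy default by_origin
* collapse only the copy; the default frame is untouched
frame by_origin: collapse (mean) mpg (median) price, by(foreign)
frame by_origin: list
* list the frames in memory, then confirm the original data are intact
frames dir
describe, short
```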
The following commands may be useful for exploring the data in memory:
list prints the data to the output window. list <variable name> prints only the requested variables. Combine with if and in to make the output more compact.
describe displays some information about the variable, including its type (e.g. string or numeric) and whether it has labels attached to it.
summarize, codebook, tabulate, and mean display descriptive statistics.
browse (or edit) to browse (or edit) the data in an Excel-style window. Combine with if and in. (Only works in GUI.)
Never use browse or edit within a .do file submitted for a course assignment. Use of edit is not reproducible and code should always be written so that it can run in batch mode (non-interactively).
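A quick tour of these commands on the auto example data might look like the following sketch. The variables are chosen only for illustration.

```stata
sysuse auto, clear
* variable types and labels
describe make mpg foreign
* first five rows of selected variables
list make mpg price in 1/5
* only the rows meeting a condition
list make mpg if mpg > 35
* descriptive statistics
summarize mpg price
tabulate foreign
codebook rep78, compact
mean mpg
```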
Stata scripts are called “Do-files”, named for their extensions (.do
). In the GUI, you can open a new Do-file for editing by typing doedit
. You can run Do-files interactively by highlighting the desired code and hitting the “Execute (Do)” (Windows) or “Do” (Mac) buttons or using an associated shortcut key (cmd/ctrl + shift + d).
Do-files can also be run from start to finish by using the do
command. Alternatively, if you are accessing Stata at the command line, you can simply run stata do mydofile.do
to launch Stata and run the Do-file. Or, better, stata -b do mydofile.do
to execute the file in batch mode.
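A Do-file intended for batch mode might be organized along these lines. This is only a sketch: the log name analysis_results is hypothetical, and the version, clear all, and log commands are additions that are not discussed above.

```stata
* a minimal Do-file layout (hypothetical file and log names)
version 16                      // interpret commands as Stata 16 would
clear all
capture log close               // close any log left open by an earlier run
log using analysis_results, text replace
sysuse auto, clear
regress mpg weight
log close
```

Saved as, say, analysis.do, this could be run from the shell with stata -b do analysis.do. The log is given a name different from the Do-file because batch mode also writes its own log named after the Do-file.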
Stata uses some programming terminology in unique ways. Specifically:
Commands, e.g. regress, use, display.
Expressions, e.g. (weight / height^2) * 703.
Functions, which are used within expressions (e.g. in log(salary), log() is the function).
In addition to the one data principle, Stata also operates on one estimation command at a time. An estimation command is any command which computes estimates with associated inference, for example mean
to obtain the mean and a confidence interval, or regress
as we discussed above. The commands save
or preserve
are examples of commands which are not estimation commands. After running an estimation command, Stata provides access to post-estimation commands and returned objects.
Estimation commands support commands that follow them and use their results. For example, after a regression command we can obtain the AIC and BIC by running
. estat ic
Note that we do not refer to a specific model. Instead, reference is to the values currently in the ereturn
and return
spaces as created by the most recent estimation command.
You can see a list of all post-estimation commands supported by running help ____ postestimation
. For example, help regress postestimation
.
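A few post-estimation commands in sequence, as a sketch on the auto example data. The commands test and predict are not covered above; they are included here only as further examples of commands that act on the most recent model.

```stata
sysuse auto, clear
regress mpg weight foreign
* information criteria for the model just fit
estat ic
* Wald test that the weight coefficient is zero
test weight = 0
* fitted values and residuals from the same model
predict mpg_hat, xb
predict mpg_resid, residuals
summarize mpg_hat mpg_resid
```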
The code run by issuing an estimation command typically creates several objects which can be examined. Estimation commands are either R-class or E-class (the distinction is not very important). What is important is that E-class commands store things in the ereturn
while R-class commands store in the return
.
For example, after running regress
, the object e(r2)
contains the \(R^2\) for the model.
. regress y x
. display e(r2)
Matrices can also be returned, but cannot be accessed with display
; instead, we use matrix list
:
. matrix list e(V)
which will return the variance-covariance matrix of the estimators.
You can see a full list of the returned objects via return list
or ereturn list
. The help file for each command will also describe what the command returns.
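To see both storage areas side by side, the sketch below runs summarize (which stores its results in return) and then regress (which stores its results in ereturn) on the auto example data.

```stata
sysuse auto, clear
* summarize stores its results in r()
summarize mpg
return list
display "mean mpg: " r(mean)
* regress stores its results in e()
regress mpg weight
ereturn list
display "R-squared: " e(r2)
display "observations: " e(N)
* matrices are shown with matrix list rather than display
matrix list e(b)
```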
We will see how these objects can be manipulated after we discuss macros.
Macros in Stata use a simple text substitution evaluation system.
We define a macro with the local
command.
. local myvars x1 x2 x3
. regress y `myvars'
The local
command stores whatever is past the name of the macro. When it is referenced via the back-tick and single quote syntax, the stored value is substituted into the command before the command is executed. For example, when you run the regress
line above, Stata will replace `myvars' with x1 x2 x3
and then execute the regression model.
Macros can also store numeric values and can be operated on.
. local x 3
. display `x'
. local y = `x' + 2
. display `y'
You’ll notice the first local
call contains no equal sign, whereas the second does. The equals sign forces immediate evaluation. You can see the difference by running:
. display `y'
. display "`y'"
. local y2 `x' + 2
. display `y2'
. display "`y2'"
You can store returned objects and operate on or display them.
. regress y x
. local r2 = e(r2)
. display "The model R^2 is " `r2'
Matrices can be stored as well
. regress y x
. matrix v = e(V)
. display "The variance/covariance matrix:"
. matrix list v
Note that when referring to matrices, no tick/quote is needed. However, matrices can be referred to in more limited contexts than can macros.
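One common use of a stored result is to reuse it in a later command. The sketch below centers a variable at its mean using the auto example data; the names mpg_mean and mpg_centered are mine.

```stata
sysuse auto, clear
summarize mpg
* the equals sign stores the number itself rather than the text r(mean)
local mpg_mean = r(mean)
generate mpg_centered = mpg - `mpg_mean'
* this second summarize overwrites r(mean), but the macro keeps the old value
summarize mpg_centered
display "mean of mpg: " `mpg_mean'
```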
Loops are another place where macros are used often. The syntax is very similar to other languages.
foreach <macro name> of <list of numbers/words/variables> {
..
}
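The list can be:
a list of numbers: numlist 1/5 or numlist 1 4 29 192
a general list of words, which is introduced with in rather than of: in a b c
a list of variables: varlist x y z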
For example, to regress over a series of outcome variables:
foreach var of varlist y1 y2 y3 {
regress `var' x1 x2
estimates store reg_`var'
}
The estimates store
command saves the regression results so that you can restore them later using estimates restore
to make them the most recent estimation command, or use a user-written command such as outreg
to produce an output document.
Alternatively, if the variable names are that clean, you could use the numerical suffix as the iterator:
foreach i of numlist 1/3 {
regress y`i' x1 x2
estimates store reg_y`i'
}
For more on loops, see help foreach
.
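A runnable variant of the loop, as a sketch on the auto example data. Here the loop runs over predictors rather than outcomes, and estimates table (not covered above) is used to show the stored models side by side.

```stata
sysuse auto, clear
* fit one simple regression per predictor and store each set of results
foreach var of varlist price weight length {
    regress mpg `var'
    estimates store m_`var'
}
* compare coefficients and standard errors across the stored models
estimates table m_price m_weight m_length, b se
```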
drop/keep - Remove columns (drop <varname>) or rows (drop if <conditional>).
generate - Create new variables using generate <newvarname> = <expression>. The expression can depend on other variables.
replace - Modify an existing variable. Commonly used with if.
destring/tostring - Convert strings that are really numbers to numeric variables, or the reverse.
encode/decode - Convert strings with words to numbers with associated labels (e.g. factors) and the reverse.
Mata is a matrix programming language which is part of Stata. It has two primary uses:
While objects from Stata can be passed in and out of Mata, most of the time we can think of Mata as completely independent of the Stata language.
Mata is entered by using the mata
command.
. mata
When in Mata mode, the prompt changes from a .
to :
. Stata commands will not be accepted inside Mata. To exit Mata, enter the end
command.
: end
You can enter mathematical expressions directly into Mata:
: 5-4
When inside Mata, you can define and use “variables” in the same sense as R.
: x = 4
: x + 2
It also supports built-in mathematical functions, e.g.
: sqrt(4)
: log(x)
Mata sessions have permanence. If you end
a Mata session and then invoke a new session, it retains the same variables.
You can also run a single line of Mata with the mata:
preface. After running this command, you will be in Stata, not Mata.
. mata: 2 + 2
To define a Mata matrix, we can combine the column-join operator ,
and the row-join operator \
. We can print a matrix by calling it alone.
: M = (1,2\3,4)
: M
To help keep track of object dimensions I recommend using lowercase letters for scalars and upper case letters for matrices.
Matrix operations work as expected. We can:
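multiply or add by a scalar: mata: 4*M, mata: M :+ 2
add or multiply matrices: mata: A + B, mata: A*B (an error if dimensions are not compatible)
transpose: mata: A'
join by rows: mata: A\B would stack A on top of B (assuming dimension compatibility)
multiply element-wise: mata: A:*B
create an identity matrix: mata: C = I(5)
create a constant matrix (here 4 by 2, filled with 0): mata: D = J(4, 2, 0)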
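These operations can be tried together in one Mata block inside a Do-file. This is a sketch; the matrices A and B are arbitrary small examples.

```stata
mata:
A = (1, 2 \ 3, 4)
B = I(2)          // 2 x 2 identity matrix
4 * A             // scalar multiplication
A :+ 2            // add 2 to every element
A * B             // matrix product; equals A because B is the identity
A :* B            // element-wise product
A'                // transpose
A \ B             // stack A on top of B, giving a 4 x 2 matrix
J(2, 3, 0)        // 2 x 3 matrix of zeros
end
```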
We can pass matrices between Stata and Mata using the st_matrix()
function. This function works inside of Mata.
Say we run a regression in Stata:
. sysuse auto
. regress mpg headroom
We saw before that e(V)
contains the variance/covariance matrix, but now let us obtain the standard errors for the coefficients.
. matrix v = e(V)
. matrix list v
To manipulate these, we can pass them into Mata.
. mata:
: V = st_matrix("v")
: SE = sqrt(diagonal(V))
: st_matrix("se", SE)
: end
. matrix list se
The diagonal
function extracts only the diagonal of the matrix. Confusingly, the related diag
function builds a diagonal matrix from its argument, setting off-diagonal elements to 0.
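As a check on this round trip, the standard error computed in Mata can be compared with the one Stata itself reports through _se[]. This is a sketch; the one-line mata: form is used so that the whole example stays in Stata mode.

```stata
sysuse auto, clear
regress mpg headroom
matrix v = e(V)
* read v into Mata, take square roots of its diagonal, and send the result back as se
mata: st_matrix("se", sqrt(diagonal(st_matrix("v"))))
matrix list se
* the first element should match the standard error Stata reports for headroom
display "Mata:  " se[1, 1]
display "Stata: " _se[headroom]
```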
margins
One of Stata’s most popular tools is the margins
command. It is a very complex command (the help manual for this one command is ~55 pages long), but extremely powerful. We’ll explore two common uses.
First, when running a regression with a categorical variable, it is common to use reference (0/1) encoding with one level of the categorical variable being the reference or intercept, and other groups represented by indicator variables. If you wanted to estimate the difference between levels and neither is the reference, you need to either change the reference category or develop a contrast. The margins
command facilitates this second approach.
. regress y x i.z
. margins z
. margins z, pwcompare
First, note the i.z
variable in the regress
command. All Stata models assume by default that variables are continuous. To treat a variable as categorical, preface it with i.
: e.g. a variable group
would be i.group
.
The first margins command will estimate the marginal mean for each level in z
. This is done by taking the original data, replacing z
by its first level for every row or case, using the regression equation to estimate the response, and finally computing the average response. This process is repeated with z
replaced by each of its subsequent levels.
The second margins command adds the pwcompare
option which will generate all contrasts or comparisons between levels of z
taken pairwise. By default it produces a confidence interval for each contrast; you can obtain a p-value by passing pwcompare(pv)
instead.
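The same steps on the auto example data, as a sketch in which rep78 (repair record, five levels) plays the role of z and weight the role of x:

```stata
sysuse auto, clear
* i.rep78 enters repair record as a categorical predictor
regress mpg weight i.rep78
* marginal mean of mpg at each level of rep78
margins rep78
* all pairwise comparisons between levels, reported with p-values
margins rep78, pwcompare(pv)
```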
Another use of margins
is for the creation of interaction plots. Interactions are entered into regressions via x##z
. This syntax includes the main effects of both x
and z
as well as their interaction. When a variable is involved in an interaction, Stata assumes it is categorical; you can use the c.
prefix to force Stata to treat it as continuous.
. regress y c.x##i.z
. margins z, at(x = (1 2 3 4 5))
. marginsplot
This margins
call obtains the marginal mean of y for each level of z at each of those 5 values of x
. The marginsplot
command is a post-post-estimation command which can be run after margins
to produce a plot.
We can test whether slopes are the same across several subgroups with margins
as well:
. margins z, dydx(x)
. margins z, dydx(x) pwcompare(pv)
The dydx()
option estimates the slope on a continuous variable by taking the derivative of the regression equation with respect to the continuous variable. By passing the categorical variable z
we’re asking for the slope in each group, and then testing them against each other using the pwcompare
option.
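Both the interaction plot and the comparison of slopes can be tried on the auto example data. In this sketch foreign plays the role of z and weight the role of x; the grid of weight values is an arbitrary choice.

```stata
sysuse auto, clear
* main effects of weight and foreign plus their interaction
regress mpg c.weight##i.foreign
* marginal means for each origin over a grid of weights, then plot them
margins foreign, at(weight = (2000(500)4500))
marginsplot
* slope of mpg on weight within each origin, then a test of their difference
margins foreign, dydx(weight)
margins foreign, dydx(weight) pwcompare(pv)
```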
Professor Shedden’s Stata Intro
Stata workshop notes from CSCAR: Intro to Stata
Stata material from IDRE at UCLA: Stata