Stata is a statistical software package most frequently used for data analysis in academic research. It is especially popular in health services research, epidemiology, and various social sciences.
Two commands that make Stata particularly appealing in these fields are:
margins for post-estimation summaries such as adjusted predictions and marginal effects,
svy for incorporating survey weights into an analysis.
Stata is commercial software that requires a license for use. As a UM student you can use Stata on the SCS servers by typing stata
at the command line. There are a limited number of licenses available, so please do not leave Stata open when not in use. Also be aware that all the licenses could be in use at busy times, such as the night before an assignment for this course is due.
Stata does support markdown from version 15, but the most recent version of Stata on the SCS servers is 14:
which stata
ls /usr/local/bin/ | grep "stata"
Stata 15 is available as a module on the Flux cluster.
Disclaimer: You are not required to use Stata markdown.
Stata primarily works with a single rectangular data set with observations in rows and variables in columns. Variables can be referred to by name and always reference this “master” dataset.
Stata variables come in two primary types - numeric and string. Strings are stored as str#
with #
indicating the maximum length.
Numeric variables come in the following storage types:
Storage Type | Bytes | What |
---|---|---|
byte | 1 | small integers (up to \(2^8- 1=255\)) |
int | 2 | big integers (up to \(2^{16}-1\)) |
long | 4 | very big integers |
float | 4 | up to 38 decimal places |
double | 8 | up to 323 decimal place |
Stata programs generally prefer / require numeric types.
Running compress
instructs Stata to switch variables to smaller storage types where possible.
A common pattern for commands in Stata is,
/* Template */
command <variable(s)>, <option>
/* Example */
command regress A1C BMI, level(99)
where command
is the name of command followed, when needed, by one or more specific variables (columns) to operate on, and then a list of options for modifying default behavior. If you are familiar with functions in languages like R
or python
that follow a syntax f(var1,var2)
you can think of commands as rough equivalents with variables being necessary arguments and options being, well, optional arguments.
The syntax for specific commands may vary from this pattern and it is a good idea to read the help documentation when using a command for the first time:
/* Template */
help <command>
/* Example */
help regress
use
, sysuse
, webuse
- load a Stata native .dta
file into memoryimport delimited
- read delimited data filesclear
- clear the current datasave
help
describe
- overview of the current data setlist
- list a subset of variables; useful with i.e. list <var> in 1/10
summarize
- compute and display summary statisticscodebook
- summarize entire datasettabulate
- compute frequency tablesexit
- quit Stata.do
filesA script is a set of instructions to a computing language for carrying out a particular purpose such as data preparation or analysis. Stata scripts use the extension .do
and will generally be your primary way of interacting with and using Stata.
You can execute a do file by typing, i.e. stata -b my_analysis.do
at the command line or do my_analysis
within an interactive Stata console.
Scripts serve several purposes:
Serve as a record for how a particular analysis was carried out,
Communicate to others (including future you 😕) your thought process during an analysis,
Communicate a set of instructions for what you want Stata to do.
When learning a new computing language it is not uncommon to get hung up on item 3, aka syntax, at the expense of other purposes a script serves. You can combat this tendency by paying attention to style and developing good commenting habits.
Here are some opinions on style in Stata:
The following commands are useful for manipulating data in Stata:
keep
, drop
- keep or drop a subset of variablesgenerate
- create a new variable using functions of existing onesreplace
- replace an existing variable, especially useful with replace <var> if <condition>
label
- change display labels for variables
label variable
label define
label values
label data
encode
, decode
- use to switch between string and integer representation for categorical variablesrecode
- to re-code into different values, i.e \(0 \to 1, 1 \to 0\)
In class we demonstrated the above commands using the Resedidential Energy Consumption Survey (RECS) from 2009. On the CSC servers, you can obtain a local copy of the data using:
wget http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv
You can get short descriptions of these data here:
wget http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv
More detailed descriptions are available as an Excel file here.
The data management demonstration from class can be found at: RECS_prep_subset1.do.
See Professor Shedden’s Stata Intro for additional examples using the RECS data.
Boolean operators are useful for generating values conditionally on other values. Here are the basics:
Operator | Meaning |
---|---|
& | and |
| | or |
== | equal |
!= | not equal |
>, >= | greater than (or equal to) |
<, <= | less than (or equal to) |
We will review the regression example from Professor Shedden’s Stata Intro.
The key Stata commands in the demonstration are:
regress <dv> <ivlist>, options
- for computing regression estimatesdisplay
- for evaluating an expressionr()
and e()
- for extracting results from the most recent commandc.
, i.
- for instructing Stata how to treat variables#
, ##
- for specifying interaction termsThe script from the in-class demonstration can be found at: RECS_Consump_Analysis.do
The script is likely more useful and readable, but a log from the live demonstration is also available here: RECS_Consump_Analysis_14Sep2017.txt
Some important data analysis principals:
In Stata “variable” always refers to a column of the dataset. However, in programming it is also useful to have access to variables in the general sense.
“Macros” serve this role in Stata and are somewhat similar to shell variables in bash and other Linux shells. A macro is a string that is interspersed into a Stata program and evaluated when that program is executed. The key to understanding and using Stata macros is knowing when they are evaluated.
A local macro can be defined by:
local life_questions 42
The value of a macro is retrieved by encapsulation between a back tick ` and an apostrophe ’:
display `life_questions'
The live demonstration of macros from class is available here macros_14Sep2017.txt
Macros are frequently used in loops:
foreach var in varlist yearmade-kwh {
summarize `var'
}
Examples of Stata loops can be found in the data management script linked above and the data merge example here: RECS_merge.do.
Many Stata commands are defined by .ado files rather than built into the Stata source code. Some of these commands are user-contributed and you can extend the functionality of Stata by obtaining programs written as .ado
files or writing your own.
Those interested can learn more about programming Stata here.
Disclaimer: You will not need to know anything about .ado
files for problem sets or exams in this course.
Links to the in-class demonstrations:
Data management: RECS_prep_subset1.do
Macros: macros_14Sep2017.txt
Regression: RECS_Consump_Analysis.do RECS_Consump_Analysis_14Sep2017.txt
Merging data: RECS_merge.do.
Post-estimation summaries using margins: RECS_ES.do
You may find the following resources useful in learning to use Stata:
These notes are based in large part on: