Course Homepage

About Stata

Stata is a statistical software package most frequently used for data analysis in academic research. It is especially popular in health services research, epidemiology, and various social sciences.

Two commands that make Stata particularly appealing in these fields are:

Licensing

Stata is commercial software that requires a license. As a UM student you can use Stata on the SCS servers by typing stata at the command line. Only a limited number of licenses are available, so please do not leave Stata open when you are not using it. Also be aware that all the licenses may be in use at busy times, such as the night before an assignment for this course is due.

Markdown

Stata supports markdown starting with version 15, but the most recent version of Stata on the SCS servers is 14:

which stata
ls /usr/local/bin/ | grep "stata"   

Stata 15 is available as a module on the Flux cluster.

Disclaimer: You are not required to use Stata markdown.

One Data, Two Types

Stata primarily works with a single rectangular data set with observations in rows and variables in columns. Variables can be referred to by name and always reference this “master” dataset.

Stata variables come in two primary types: numeric and string. Strings are stored as str#, with # indicating the maximum length.

Numeric variables come in the following storage types:

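Storage Type  Bytes  What
byte          1      small integers (-127 to 100)
int           2      larger integers (-32,767 to 32,740)
long          4      very large integers (roughly \(\pm 2.1 \times 10^9\))
float         4      real numbers with about 7 significant digits (magnitudes up to about \(1.7 \times 10^{38}\))
double        8      real numbers with about 16 significant digits (magnitudes up to about \(9.0 \times 10^{307}\))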

Most Stata commands expect, and many require, numeric variables.

Running compress instructs Stata to switch variables to smaller storage types where possible.
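As a minimal sketch of what compress does, the lines below build a small dataset with deliberately oversized storage types and then compress it. The variable names id and grp are made up for illustration.

```stata
* Build a tiny dataset with deliberately oversized storage types
clear
set obs 5
* integers 1..5 stored as 8-byte doubles
generate double id = _n
* one-character strings stored as str20
generate str20 grp = "a"
describe

* compress picks the smallest type that holds every value exactly
compress
describe
```

Comparing the two describe listings shows the storage type of each variable before and after compressing.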

Using Stata

Command Syntax

A common pattern for commands in Stata is,

/* Template */
command <variable(s)>, <option(s)>

/* Example */
regress A1C BMI, level(99)

where command is the name of the command, followed when needed by one or more variables (columns) to operate on and then a list of options that modify the default behavior. If you are familiar with functions in languages like R or Python that follow a syntax f(var1,var2), you can think of commands as rough equivalents, with variables being required arguments and options being, well, optional arguments.
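To see the pattern with data you can actually load, here is a short sketch using the auto example dataset that ships with Stata (an assumption made for illustration; the A1C and BMI variables above are not in it).

```stata
* Load an example dataset shipped with Stata
sysuse auto, clear

* command followed by variables, no options
summarize price mpg

* command, one variable, and an option after the comma
summarize price, detail

* command, a dependent and an independent variable, and an option
regress mpg weight, level(99)
```

Everything before the comma says what to operate on; everything after it changes how the command behaves.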

The syntax for specific commands may vary from this pattern and it is a good idea to read the help documentation when using a command for the first time:

/* Template */
help <command> 

/* Example */
help regress

Basic commands

  • use, sysuse, webuse - load a Stata native .dta file into memory
  • import delimited - read delimited data files
  • clear - clear the current data
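  • save - write the current data to a .dta file
  • help - display the documentation for a command
  • describe - overview of the current data set
  • list - print the values of selected variables; useful with a range of rows, e.g. list <var> in 1/10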
  • summarize - compute and display summary statistics
  • codebook - summarize entire dataset
  • tabulate - compute frequency tables
  • exit - quit Stata
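A first look at a dataset often strings several of these commands together. The sketch below uses the auto example dataset shipped with Stata, which is an assumption for illustration rather than the data used in class.

```stata
* Load an example dataset into memory, replacing whatever is there
sysuse auto, clear

* Overview: variable names, storage types, and labels
describe

* Print two variables for the first five observations
list make price in 1/5

* Summary statistics for numeric variables
summarize price mpg

* Frequency table for a categorical variable
tabulate foreign

* Detailed summary of a single variable, including missing values
codebook rep78
```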

Stata scripts or .do files

A script is a set of instructions to a computing language for carrying out a particular purpose such as data preparation or analysis. Stata scripts use the extension .do and will generally be your primary way of interacting with and using Stata.

You can execute a do file by typing, e.g., stata -b do my_analysis.do at the command line or do my_analysis within an interactive Stata console.

Scripts serve several purposes:

  1. Serve as a record for how a particular analysis was carried out,

  2. Communicate to others (including future you 😕) your thought process during an analysis,

  3. Communicate a set of instructions for what you want Stata to do.

When learning a new computing language it is not uncommon to get hung up on item 3, aka syntax, at the expense of other purposes a script serves. You can combat this tendency by paying attention to style and developing good commenting habits.
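One possible layout for a commented do file is sketched below. The file name, the header fields, and the analysis itself are placeholders, and the auto example dataset is used only so the script runs as written.

```stata
*------------------------------------------------------------*
* my_analysis.do  (hypothetical file name)
* Purpose: summarize fuel efficiency in the example auto data
* Author:  <your name>
* Date:    <date>
*------------------------------------------------------------*

* Record the Stata version the script was written for
version 14

* 1. Load data
sysuse auto, clear

* 2. Prepare: express weight in thousands of pounds for readability
generate weight_k = weight / 1000
label variable weight_k "Weight (1000 lbs)"

* 3. Analyze
summarize mpg weight_k
```

The header and the numbered comments address the first two purposes; the commands themselves are the third.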

Here are some opinions on style in Stata:

Data Management

The following commands are useful for manipulating data in Stata:

  • keep, drop - keep or drop a subset of variables
  • generate - create a new variable using functions of existing ones
  • replace - replace an existing variable, especially useful with replace <var> if <condition>
  • label - change display labels for variables
      • label variable
      • label define
      • label values
      • label data
  • encode, decode - use to switch between string and integer representation for categorical variables
  • recode - to re-code into different values, e.g. \(0 \to 1, 1 \to 0\)
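The commands above can be sketched on the auto example dataset shipped with Stata. This is not the RECS demonstration from class, and the new variable and label names (price_k, high_mpg, low_mpg, yesno, origin_str, origin_num) are made up for illustration.

```stata
sysuse auto, clear

* keep a subset of variables
keep make price mpg foreign

* generate a new variable from an existing one, then drop the original
generate price_k = price / 1000
label variable price_k "Price (thousands of USD)"
drop price

* generate an indicator, then replace its value conditionally
generate byte high_mpg = 0
replace high_mpg = 1 if mpg > 25

* define value labels and attach them to the indicator
label define yesno 0 "No" 1 "Yes"
label values high_mpg yesno

* recode 0 -> 1 and 1 -> 0 into a new variable
recode high_mpg (0 = 1) (1 = 0), generate(low_mpg)

* decode a labeled integer into a string, then encode it back
decode foreign, generate(origin_str)
encode origin_str, generate(origin_num)
```

Note that encode numbers the categories starting from 1 in alphabetical order, so origin_num is not coded the same way as the original foreign variable.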

Data Management Demonstration

In class we demonstrated the above commands using the Residential Energy Consumption Survey (RECS) from 2009. On the SCS servers, you can obtain a local copy of the data using:

wget http://www.eia.gov/consumption/residential/data/2009/csv/recs2009_public.csv

You can get short descriptions of these data here:

wget http://www.eia.gov/consumption/residential/data/2009/csv/public_layout.csv

More detailed descriptions are available as an Excel file here.

The data management demonstration from class can be found at: RECS_prep_subset1.do.

See Professor Shedden’s Stata Intro for additional examples using the RECS data.

Boolean Operators

Boolean operators are useful for generating values conditionally on other values. Here are the basics:

Operator  Meaning
&         and
|         or
==        equal
!=        not equal
>, >=     greater than (or equal to)
<, <=     less than (or equal to)
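Here is a short sketch of these operators in use, again on the auto example dataset shipped with Stata (an assumption for illustration). It also uses ! for "not" and the missing() function, neither of which appears in the table above. The variable name good_deal is hypothetical.

```stata
sysuse auto, clear

* A logical expression evaluates to 1 (true) or 0 (false)
generate byte good_deal = (price < 5000) & (mpg >= 25)
tabulate good_deal

* "or": count cars that are foreign or very fuel efficient
count if foreign == 1 | mpg > 30

* Numeric missing values are treated as larger than any number,
* so this count also includes cars with a missing repair record
count if rep78 > 4

* Guard against that by excluding missing values explicitly
count if rep78 > 4 & !missing(rep78)
```

The last two counts differ by the number of observations with a missing rep78, which is an easy thing to overlook when using > or >=.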

Regression

We will review the regression example from Professor Shedden’s Stata Intro.

The key Stata commands in the demonstration are:

  • regress <dv> <ivlist>, <options> - for computing regression estimates
  • display - for evaluating and printing an expression
  • r() and e() - for extracting stored results from the most recent command
  • c., i. - prefixes telling Stata to treat a variable as continuous (c.) or categorical (i.)
  • #, ## - for specifying interaction terms (## also includes the main effects)

The script from the in-class demonstration can be found at: RECS_Consump_Analysis.do

The script is likely more useful and readable, but a log from the live demonstration is also available here: RECS_Consump_Analysis_14Sep2017.txt

Some important data analysis principles:

  • Pay attention to the scale of your variables.
  • Summarize on a natural and easy-to-understand scale.
  • Center variables before creating interactions to reduce collinearity.
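The commands and principles above can be sketched together as follows. This uses the auto example dataset shipped with Stata rather than the RECS data from the demonstration, and the variable name weight_c is made up for illustration.

```stata
sysuse auto, clear

* Center weight at its mean and rescale to thousands of pounds
summarize weight, meanonly
generate weight_c = (weight - r(mean)) / 1000

* Continuous weight, categorical foreign, and their interaction
regress mpg c.weight_c##i.foreign, level(99)

* Pull stored results out of the fitted model
display "R-squared: " e(r2)
display "Weight slope for domestic cars: " _b[weight_c]
display "Difference in slope for foreign cars: " _b[1.foreign#c.weight_c]
```

Because weight is centered, the coefficient on foreign is the estimated difference between foreign and domestic cars at the average weight rather than at a weight of zero.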

Macros & Programming Statements

In Stata “variable” always refers to a column of the dataset. However, in programming it is also useful to have access to variables in the general sense.

“Macros” serve this role in Stata and are somewhat similar to shell variables in bash and other Linux shells. A macro is a named string whose contents are substituted into a Stata command before that command is executed. The key to understanding and using Stata macros is knowing when they are evaluated.

A local macro can be defined by:

local life_questions 42

The value of a macro is retrieved by enclosing its name between a back tick ` and an apostrophe ':

display `life_questions'

The live demonstration of macros from class is available here macros_14Sep2017.txt
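The question of when a macro is evaluated can be illustrated with a small sketch. The macro names expr, four, and outcomes are made up for illustration, and the lines should be run together from a do file, since local macros disappear when the do file that defines them finishes.

```stata
* Without "=", the macro stores the text exactly as typed
local expr 2 + 2

* With "=", the expression is evaluated first and the result is stored
local four = 2 + 2

* Inside double quotes the stored text is shown as is
display "`expr'"

* Without quotes the text is substituted first, then display evaluates it
display `expr'
display `four'

* Macros can also hold a list of variable names
sysuse auto, clear
local outcomes price mpg
summarize `outcomes'
```

In each case Stata first replaces the macro reference with its contents and only then runs the resulting command.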

Macros are frequently used in loops:

foreach var of varlist yearmade-kwh {
  summarize `var'
}

Examples of Stata loops can be found in the data management script linked above and the data merge example here: RECS_merge.do.
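For a self-contained sketch of the common loop forms, the example below uses the auto example dataset shipped with Stata (an assumption for illustration). The counter n_vars and the word list are hypothetical.

```stata
sysuse auto, clear

* Loop over a range of variables; "of varlist" expands price-weight
local n_vars = 0
foreach var of varlist price-weight {
    quietly summarize `var'
    display "`var': mean = " r(mean)
    local ++n_vars
}
display "Variables summarized: `n_vars'"

* Loop over a sequence of numbers
forvalues i = 1/3 {
    display "i squared = " `i'^2
}

* Loop over an arbitrary list of words; "in" takes the list literally
foreach word in alpha beta gamma {
    display "`word'"
}
```

The difference between "of varlist" and "in" matters: with "in", a range such as price-weight would be treated as a single word rather than expanded into variable names.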

Extending Stata

Many Stata commands are defined by .ado files rather than built into the Stata source code. Some of these commands are user-contributed and you can extend the functionality of Stata by obtaining programs written as .ado files or writing your own.
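As a rough sketch of what such a program looks like, the lines below define a small command inside a do file. The program name range_of is hypothetical; saving the same definition in a file named range_of.ado on Stata's ado path is what would make it available as a regular command.

```stata
* Drop any earlier definition so the do file can be re-run
capture program drop range_of

* Define a command that reports the range of one variable
program define range_of, rclass
    syntax varname
    quietly summarize `varlist', meanonly
    display "Range of `varlist': " r(max) - r(min)
    return scalar range = r(max) - r(min)
end

* Use the new command like any other
sysuse auto, clear
range_of mpg
display r(range)
```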

Those interested can learn more about programming Stata here.

Disclaimer: You will not need to know anything about .ado files for problem sets or exams in this course.

Resources

Links to the in-class demonstrations:

You may find the following resources useful in learning to use Stata:

These notes are based in large part on:
