SAS is both a programming language and a collection of data analysis routines. It is closed-source commercial software widely used in industry. For instance, SAS promotional materials claim 83,000 installations, including most of the top 100 companies from the Fortune 500. It is also quite popular in biostatistics and in the healthcare industry.
SAS is primarily a declarative rather than an imperative language. In other words, you tell SAS what you wish to accomplish and let the program figure out how to accomplish it.
Another feature of SAS is that it is designed to work efficiently with data on disk rather than in RAM, unlike R or Stata.
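A SAS program typically consists of two types of code blocks: data steps, which read and manipulate data sets, and proc steps, which call procedures to analyze or report on them.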
SAS also has the capability to define macros and variables. As in Stata, macros in SAS work through string substitution.
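As a minimal sketch of string substitution, the following defines a macro variable and a small macro. The data set name recs, the variable kwh, and the macro name summarize are hypothetical names chosen for illustration.

```sas
/* A macro variable holds plain text. */
%let yvar = kwh;

/* A macro whose body is pasted into the program when it is called. */
%macro summarize(dsn, var);
  proc means data=&dsn mean std;
    var &var;
  run;
%mend summarize;

/* After substitution SAS sees:
   proc means data=recs mean std; var kwh; run; */
%summarize(recs, &yvar)
```

The macro processor only rewrites text; the resulting proc step is then compiled and run like any other code.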
You have several options for accessing SAS for learning and assignments.
All examples shown in class will use SAS in batch mode as that is the way I primarily use it.
You can use SAS in batch mode on GreatLakes (from within the campus network):
When we run SAS in batch mode, the SAS program has extension .sas. After running this file, a .log file will be created with the code run and messages from the SAS program, and any displayed or printed output will end up in a file with extension .lst. These are all plain text files, so you can view them with a page viewer such as less.
You can also use SAS in an interactive “line” mode. To do this on the SCS servers, invoke SAS with the -nodms option:
Some procedures, such as proc import, attempt to create an additional window. This will cause an error if graphical forwarding is not set up. To prevent this, you can add the -noterminal option when invoking SAS at the command line.
To exit this command line interface, use the endsas; statement.
SAS offers a free “University Edition” for academic use.
You can also access SAS using midesktop through the UM computing service.
Several of our examples are based on Professor Shedden’s 2016 course notes.
All of the examples discussed below can be found at the git repo Stats506_F20 under examples/sas. To run the examples, you will need to download data to the examples/sas/data folder yourself.
This video explains the basics of a SAS program and how to write one using SAS studio.
Here are some key points to keep in mind:
A SAS program is made up of data and proc steps.
Every statement ends with a semicolon, ;.
The run; statement tells SAS to execute a block of code.
Printed output goes to the listing (.lst) file in batch mode.
In example 0 we import delimited data using a data step with an infile statement to parse a file and an input statement to specify the formats. We then run the contents and print procedures to examine the data set created.
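A data step of this kind might look like the sketch below. The file path, its three columns, and the data set name example are assumptions for illustration and are not taken from example 0.

```sas
/* Hypothetical file: ./data/example.csv with a header row and
   three comma-separated columns (id, state, kwh). */
data example;
  infile './data/example.csv' dsd dlm=',' firstobs=2 missover;
  /* :$20. reads state as a character value of up to 20 characters. */
  input id state :$20. kwh;
run;

/* Describe the variables, then print the first few rows. */
proc contents data=example;
run;

proc print data=example(obs=5);
run;
```

Here firstobs=2 skips the header row, and dsd treats consecutive delimiters as missing values.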
Next, we use proc import to import a comma-delimited copy of the 2009 RECS data and again explore it using proc print and proc contents.
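A call to proc import for a CSV file could be sketched as follows. The file name is a placeholder for wherever you saved the RECS data, and the guessingrows value is an arbitrary choice.

```sas
/* Placeholder path: adjust to the location of your copy of the data. */
proc import datafile='./data/recs2009_public.csv' out=recs dbms=csv replace;
  getnames=yes;        /* take variable names from the header row */
  guessingrows=1000;   /* rows scanned when guessing variable types */
run;

proc contents data=recs;
run;
```

Unlike the data step approach, proc import guesses variable types and lengths from the data, so it is worth checking the proc contents output.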
You can read more about formats for SAS variables here.
Note that character formats are preceded by $ and that all formats end with a period (.), except that in numeric formats the period can be followed by an integer d giving the number of decimal places.
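To make the naming pattern concrete, here is a small sketch with one character and one numeric format. The data set and variable names are made up for illustration.

```sas
data fmt_demo;
  length name $ 10;
  name = 'Michigan';
  x = 1234.5678;
  /* $10. : character format of width 10.
     8.2  : numeric format of width 8 with 2 decimal places. */
  format name $10. x 8.2;
run;

proc print data=fmt_demo;
run;
```

The format changes only how x is displayed; the stored value is unchanged.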
Another file format frequently used with SAS is a “fixed-width” file.
Here, rather than using a delimiter to separate columns, each column has a standard or fixed width. In example 1 we read a fixed-width file using a data step with an input and an infile statement.
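With column input, the input statement lists the character positions each variable occupies. The layout and file name below are hypothetical and are not those used in example 1.

```sas
/* Hypothetical layout: columns 1-5 id, 6-7 state code, 8-15 kwh. */
data fixed;
  infile './data/example_fixed.txt' truncover;
  input id 1-5 state $ 6-7 kwh 8-15;
run;
```

The truncover option keeps short lines from spilling over into the next record.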
The example, as posted, has several messages about invalid data in the log file.
Can you figure out how to resolve these?
In Professor Shedden’s notes, you can find a filename statement which uses a “pipe” to read data in a compressed format.
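The general shape of such a statement is sketched below. This is an illustration and not the code from those notes: it assumes a Unix-like system with gzip available and a hypothetical gzipped CSV file.

```sas
/* The quoted command is run by the operating system and its
   standard output is read as if it were a file. */
filename gzdata pipe 'gzip -dc ./data/example.csv.gz';

data example;
  infile gzdata dsd dlm=',' firstobs=2;
  input id state :$2. kwh;
run;
```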
You have previously encountered the open XPT format for NHANES data.
Please see Professor Shedden’s notes for how to reference this file type within SAS.
SAS uses a binary format, sas7bdat, for native data storage on disk. SAS also uses the concept of ‘libraries’, similar to how schemas are used in SQL. The default library is named WORK and is set up in a temporary directory.
You create handles for libraries using a libname statement.
In example 2, we create a library handle mylib and save the RECS data to it after importing.
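A minimal sketch of this pattern, assuming the imported data already sits in the WORK library as recs and that ./data is a directory you can write to:

```sas
/* mylib is a handle for the ./data directory. */
libname mylib './data';

/* Writing to mylib.recs2009 creates ./data/recs2009.sas7bdat on disk. */
data mylib.recs2009;
  set work.recs;
run;
```

Data sets referenced with a one-part name, such as recs, live in WORK and are deleted when the SAS session ends.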
In example 3, we create a data table recs referencing the RECS data in .sas7bdat format downloaded from the EIA site. Note the additional metadata it contains relative to the version imported from CSV (by comparing the outputs from proc contents).
In example 4 we create rural and urban subsets of the RECS data and save them to our library using data steps.
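One way to write such subsets is sketched below. It assumes a library handle mylib, a stored data set mylib.recs2009, and a character variable ur coded 'U' and 'R'; check these against your copy of the data. For brevity it uses a single data step with two output data sets.

```sas
data mylib.recs_rural mylib.recs_urban;
  set mylib.recs2009;
  /* Route each row to one of the two output data sets. */
  if ur = 'R' then output mylib.recs_rural;
  else if ur = 'U' then output mylib.recs_urban;
run;
```

Two separate data steps, each with a where statement, would give the same result at the cost of reading the input twice.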
We then use a data step to find the last 5 rows of the recs data as imported from csv or read natively from sas7bdat to compare.
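A common idiom for selecting the final rows uses the nobs= option of the set statement together with the automatic row counter _n_. The sketch assumes a data set named recs in WORK.

```sas
data recs_tail;
  /* n holds the total number of rows; it is not written to the output. */
  set recs nobs=n;
  /* Subsetting if: keep only the last 5 rows. */
  if _n_ > n - 5;
run;

proc print data=recs_tail;
run;
```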
There are several procedures useful for obtaining descriptive statistics.
In example 5 we explore proc tabulate.
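As a rough illustration of the syntax, the sketch below tabulates counts and means of one analysis variable by two classification variables. The variable names regionc, ur, and kwh are assumed to exist in recs.

```sas
proc tabulate data=recs;
  class regionc ur;   /* classification (grouping) variables */
  var kwh;            /* analysis variable */
  /* Rows: regionc. Columns: ur crossed with n and mean of kwh. */
  table regionc, ur*kwh*(n mean);
run;
```

In the table statement a comma separates the row and column dimensions and an asterisk crosses elements within a dimension.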
In example 6 we explore proc means, proc summary, and proc freq.
An important difference between proc means and proc summary is that the former prints its output to the listing file while the latter constructs a table of summary statistics. The latter is thus useful for implementing the “split, apply, combine” pattern of grouped aggregation. (There is an output statement in proc means that can be used to produce both.)
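The contrast can be sketched as follows, assuming a data set recs with a grouping variable regionc and a numeric variable kwh.

```sas
/* proc means prints group means and standard deviations to the listing. */
proc means data=recs mean std;
  class regionc;
  var kwh;
run;

/* proc summary prints nothing by default and instead writes the
   statistics to a new data set, one row per group plus an overall row. */
proc summary data=recs;
  class regionc;
  var kwh;
  output out=kwh_by_region mean=mean_kwh std=sd_kwh;
run;

proc print data=kwh_by_region;
run;
```

The output data set also contains the automatic variables _TYPE_ and _FREQ_.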
In example 7 we look for the state(s) with the highest proportion of wood-shingle roofs among single-family homes using proc summary.
Some notes about proc summary:
With a class statement, observe that we see both group totals and an overall total, differentiated by _TYPE_;
with a by statement, the data must be sorted first;
with a by statement, we do not get an overall total.
In example 7, we also make use of proc format to create a format state that we later use to print nice values for the REPORTABLE_DOMAIN variable. By specifying a format library, we add those values to a SAS format catalog with extension .sas7bcat. In a later example we will reuse those formats after setting the fmtsearch option to include this format library.
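The sketch below shows the general pattern of storing a format in a library and finding it again later. The two value labels are illustrative only; consult the RECS codebook for the actual REPORTABLE_DOMAIN coding. The library path is also an assumption.

```sas
libname mylib './data';

/* Store the format in the mylib library (a formats.sas7bcat catalog). */
proc format library=mylib;
  value state
    2 = 'Massachusetts'   /* illustrative label, check the codebook */
    3 = 'New York'        /* illustrative label, check the codebook */
    other = 'Other';
run;

/* In a later program: tell SAS where to look for user-defined formats. */
options fmtsearch=(mylib work);

proc print data=recs(obs=5);
  var reportable_domain;
  format reportable_domain state.;
run;
```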
(Note: SAS is being introduced before SQL this year. You may wish to review this section after the SQL notes rather than now.)
SAS has a procedure, proc sql, which allows you to write SQL-like queries within SAS. This can be more efficient than similar programs constructed using multiple proc and data steps.
In example 8, adapted from Professor Shedden’s notes, we use proc sql to find all single-family homes in the RECS data with mean ‘heating degree days’ above 2000.
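A grouped query with a having clause might be sketched as below. This is an illustration and not the code from example 8: it assumes variables reportable_domain, hdd65, and typehuq in recs, and that typehuq = 2 codes single-family detached homes.

```sas
proc sql;
  create table high_hdd as
    select reportable_domain,
           mean(hdd65) as mean_hdd
    from recs
    where typehuq = 2            /* assumed code for single-family homes */
    group by reportable_domain
    having mean(hdd65) > 2000    /* filter on the group-level mean */
    order by mean_hdd desc;
quit;
```

Note that proc sql ends with quit; and that where filters rows before grouping while having filters groups afterwards.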
Then in example 9, we use proc sql to repeat the analysis of finding the “States” with the highest proportion of wood-shingled roofs. Following the analysis, we use proc export to write the resulting table to csv.
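The two steps could be sketched like this. The query is an unweighted illustration that assumes rooftype = 2 codes wood shingles and typehuq = 2 codes single-family homes; the output file name is a placeholder.

```sas
proc sql;
  create table wood_roof as
    select reportable_domain,
           /* The comparison evaluates to 1 or 0, so its mean is a proportion. */
           mean(rooftype = 2) as p_wood
    from recs
    where typehuq = 2
    group by reportable_domain
    order by p_wood desc;
quit;

/* Write the result to a comma-delimited file, replacing any existing copy. */
proc export data=wood_roof outfile='./wood_roof.csv' dbms=csv replace;
run;
```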
While there are many useful procedures in SAS, custom analyses often involve data manipulations done using multiple data steps. This is often called “data step programming”.
In example 10, we use data step programming along with proc sort and proc summary to find the percent of single-family homes within each census region whose electrical usage is more than one standard deviation above the mean for that region, using the RECS data.
This example makes use of a technique called “re-merging” to add group-level summary statistics as variables in our data. The basic idea of “re-merging” is the following:
first, compute group-level summary statistics and store in a new table;
then, merge this table (e.g. left join) back into the table it came from using the grouping variable from step 1 to identify common rows.
To merge, note that we use a data step with a merge statement identifying the data sets to be merged and a by statement identifying the variable(s) to join on.
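The re-merging steps above can be sketched as follows. This is a simplified illustration and not the code from example 10: it assumes variables regionc and kwh in recs and omits the restriction to single-family homes.

```sas
/* Step 1: group-level summary statistics in a new table.
   nway keeps only the rows for individual regions (no overall row). */
proc summary data=recs nway;
  class regionc;
  var kwh;
  output out=region_stats(drop=_type_ _freq_) mean=mean_kwh std=sd_kwh;
run;

/* The merge below requires the data to be sorted by the by variable. */
proc sort data=recs out=recs_sorted;
  by regionc;
run;

/* Step 2: merge the statistics back onto every row of the original data. */
data recs_flagged;
  merge recs_sorted region_stats;
  by regionc;
  /* 1 if usage is more than one sd above the regional mean, else 0. */
  high_kwh = (kwh > mean_kwh + sd_kwh);
run;

/* The mean of the 0/1 flag is the proportion of flagged homes per region. */
proc means data=recs_flagged mean;
  class regionc;
  var high_kwh;
run;
```

Because region_stats has one row per region, each of its rows is matched to every row of recs_sorted in the same region.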
Example 12 repeats example 10 using the RECS sample weights and a better programming style.
Repeat example 10 using proc sql. You can find a solution to this exercise on the course repo (as example11.sas).
In the case studies folder, you will find a short case study fitting a linear mixed model to the sleepstudy data from R’s lme4 package. This case study illustrates:
fitting mixed models with proc mixed,
using the ods system to create SAS tables with components from models fit using proc mixed,
defining macro variables with the %let construction,