SAS is both a programming language and a collection of data analysis routines. It is closed-source commercial software widely used in industry. For instance, SAS promotional materials claim 83,000 installations, including most of the top 100 companies from the Fortune 500. It is also quite popular in biostatistics and the healthcare industry.
As with SQL, SAS is primarily a declarative rather than imperative language. In other words, you tell SAS what you wish to accomplish and let the program figure out how to accomplish it.
Another feature of SAS is that it is designed to work with data on disk rather than in RAM, unlike R or Stata.
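A SAS program typically consists of two types of code blocks: data steps, which read and manipulate datasets, and proc steps, which call procedures to analyze or process them.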
SAS also has capabilities to define macros and variables though we will not cover these here.
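As a minimal sketch of the two block types, the program below builds a tiny dataset in a data step and then summarizes it in a proc step. The dataset name scores and its values are made up for illustration and are not part of the course examples.

```sas
/* Data step: build a small dataset from inline records (made-up values). */
data scores;
  input name $ score;
  datalines;
Ann 90
Bob 85
Cal 78
;
run;

/* Proc step: request summary statistics; SAS decides how to compute them. */
proc means data=scores mean min max;
  var score;
run;
```

Each block ends with a run statement, and the proc step only states which statistics we want.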
Our examples are largely based on Professor Shedden’s 2016 course notes and case studies.
You have several options for accessing SAS for learning and assignments. All examples shown in class will use SAS in batch mode as that is the way I primarily use it.
You can use SAS in batch mode on the scs servers:
ssh luigi.dsc.umich.edu
sas example0.sas -log example0.log
Several versions of SAS are also available on Flux (module load SAS).
When we run SAS in batch mode, the SAS program has extension .sas. After running this file, a .log file will be created with the code run and messages from the SAS program, and any displayed or printed output will end up in a file with extension .lst. These are all plain text files, so you can view them with a pager such as less or more.
SAS offers a free “University Edition” for academic use.
You can also access SAS using midesktop through the UM computing service. You will need to figure out the details yourself if you choose this route.
All of the examples shown below can be found at the git repo Stats506_F18 under Examples/SAS. To run the examples, you will need to download data to the Examples/SAS/data folder yourself.
This video explains the basics of a SAS program and how to write one using SAS studio.
Here are some key points to keep in mind:
In example0 we import delimited data using a data step with an infile statement to parse a file and an input statement to specify the formats. We then run the contents and print procedures to examine the dataset created.
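The pattern just described can be sketched roughly as follows. The file path, dataset name, and column names are hypothetical placeholders, not the actual contents of example0.

```sas
/* Hypothetical comma-delimited file with a header row (not the example0 data). */
data mydata;
  infile './data/example.txt' dlm=',' dsd firstobs=2;
  /* id and region are character; region is read with width up to 12; value is numeric */
  input id $ region :$12. value;
run;

/* Describe the variables and their types. */
proc contents data=mydata;
run;

/* Print the first five rows. */
proc print data=mydata(obs=5);
run;
```

Here firstobs=2 skips the header line and dsd treats consecutive delimiters as missing values.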
We then use proc import to import a comma delimited copy of the 2009 RECS data and again explore it using proc print and proc contents.
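A call to proc import for a CSV file might look like the sketch below. The file name is an assumption; substitute the name of the file you downloaded.

```sas
/* Assumed file name; adjust to match your local copy of the RECS CSV. */
proc import datafile='./data/recs2009_public.csv' out=recs dbms=csv replace;
  getnames=yes;  /* take variable names from the header row */
run;

proc contents data=recs;
run;
```

Unlike the data step approach, proc import guesses the variable types from the file itself.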
You can read more about formats for SAS variables here. Note that character formats are preceded by $ and that all format types end with . with the exception of numeric types, where the . can be followed by an integer d for decimal precision.
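To illustrate the naming rules for formats, here is a small sketch using one character and one numeric format. The dataset and variable names are invented for this illustration.

```sas
/* Made-up dataset with one character and one numeric variable. */
data fmt_demo;
  length unit $ 12;
  unit = 'kilowatts';
  x = 1234.5678;
run;

proc print data=fmt_demo;
  /* $12. : character format of width 12
     8.2  : numeric format of width 8 with 2 decimal places */
  format unit $12. x 8.2;
run;
```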
Another file format frequently used with SAS is a “fixed-width” file. Here, rather than using a delimiter to separate columns, each column has a standard or fixed width. In example 10 we read a fixed-width file using a data step with an input and infile statement.
The example, as posted, has several error messages about invalid data in the log file. Can you figure out how to resolve these?
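For reference, a generic fixed-width read using column input can be sketched as below. The file name and column positions are hypothetical and do not correspond to the file used in example 10.

```sas
/* Hypothetical layout: id in columns 1-5, state in 6-7, value in 8-15. */
data fixed;
  infile './data/fixed_width.txt';
  input id $ 1-5 state $ 6-7 value 8-15;
run;
```

Each variable is followed by the column range it occupies, with $ marking character variables.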
In example 11, we use a filename statement and a “pipe” to read data in a compressed format.
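A minimal sketch of this idea, assuming a Unix-like system where gzip is available and SAS is permitted to run shell commands. The fileref gzin, the file name, and the columns are all hypothetical.

```sas
/* The pipe device runs the quoted command and reads its standard output. */
filename gzin pipe 'gzip -dc ./data/example.csv.gz';

data unzipped;
  infile gzin dlm=',' dsd firstobs=2;
  input id $ value;
run;
```

The data step reads from the fileref exactly as it would from an uncompressed file on disk.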
You have previously worked with the open XPT format in Stata. Please see Professor Shedden’s notes for how to reference this file type within SAS.
SAS uses a binary format, sas7bdat, for native data storage on disk. SAS also uses the concept of ‘libraries’, similar to how schemas are used in SQL. The default library is WORK, set up in a temporary directory. You create handles for libraries using a libname statement.
In example1, we create a library handle mylib and save the RECS data to it after importing.
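In outline, saving a dataset to a library might look like the following. This sketch assumes a dataset named recs already exists in WORK and that ./data is a writable directory; the name recs2009 is a placeholder.

```sas
/* Point the handle mylib at a directory on disk (assumed path). */
libname mylib './data';

/* Copy work.recs into the library; this writes ./data/recs2009.sas7bdat. */
data mylib.recs2009;
  set recs;
run;
```

Datasets referenced without a library prefix, such as recs above, live in the temporary WORK library.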
In example 2, we create a data table recs referencing the RECS data in sas7bdat format downloaded from the EIA site. Note the additional metadata it contains relative to the version imported from CSV.
In example 3 we create rural and urban subsets of the RECS data and save them to our library using “data” steps.
We then use a data step to find the last 5 rows of the RECS data, as imported from CSV or read natively from sas7bdat, to compare.
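Both steps can be sketched as below. This assumes a library mylib containing a dataset recs2009, and that the urban/rural indicator is a character variable UR coded 'U' or 'R'; check the RECS codebook before relying on those names and codes.

```sas
libname mylib './data';  /* assumed location of the saved data */

/* Rural subset, assuming UR = 'R' marks rural households. */
data mylib.recs_rural;
  set mylib.recs2009;
  where ur = 'R';
run;

/* Last five rows: nobs= stores the row count in n, _n_ is the current row. */
data last5;
  set mylib.recs2009 nobs=n;
  if _n_ > n - 5;
run;
```

The subsetting if statement keeps only rows for which the condition is true.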
There are several procedures useful for obtaining descriptive statistics.
In example 4 we explore proc tabulate.
In example 5 we explore proc means, proc summary and proc freq.
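As a rough sketch of two of these procedures, assuming a dataset recs in WORK with a numeric variable KWH and a categorical variable REGIONC (variable names assumed from the RECS codebook):

```sas
/* Count, mean, and standard deviation of electricity usage. */
proc means data=recs n mean std;
  var kwh;
run;

/* One-way frequency table of census region. */
proc freq data=recs;
  tables regionc;
run;
```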
An important difference between proc means and proc summary is that the former computes output to be printed to the listing file while the latter constructs a table of summary statistics. The latter is thus useful for implementing the “split, apply, combine” pattern of grouped aggregation. In example 6 we look for the state with the highest proportion of wood-shingle roofs using proc summary.
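The grouped-aggregation use of proc summary can be sketched as follows. This is not the code from example 6; it assumes a dataset recs with variables REGIONC and KWH, and the output names are placeholders.

```sas
/* Split by REGIONC, apply mean and std to KWH, combine into a new table. */
proc summary data=recs;
  class regionc;
  var kwh;
  output out=kwh_by_region mean=mean_kwh std=sd_kwh;
run;

/* The summary statistics are now an ordinary dataset we can print or reuse. */
proc print data=kwh_by_region;
run;
```

The output table also carries the automatic variables _TYPE_ and _FREQ_ describing each row.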
Some notes about proc summary:
when using a class statement, observe that we see both group totals and an overall total;
to use a by statement, the data must be sorted first;
when using a by statement, we do not get an overall total.
In example 6, we also make use of proc format to create a format state that we later use to print nice values for the REPORTABLE_DOMAIN variable. By specifying a format library, we add those values to a SAS format catalog with extension .sas7bcat. Later, in example 8 we will reuse those formats after setting the fmtsearch option to include this format library.
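A sketch of defining and later reusing a stored format is given below. The labels are deliberately placeholders rather than the real REPORTABLE_DOMAIN labels, and the library path and the dataset recs are assumptions.

```sas
libname mylib './data';  /* assumed directory for the format catalog */

/* Store the format in mylib rather than the temporary WORK library. */
proc format library=mylib;
  value state
    1 = 'Placeholder label 1'
    2 = 'Placeholder label 2'
    other = 'Other';
run;

/* In a later program: tell SAS where to search for user-defined formats. */
options fmtsearch=(mylib work);

proc print data=recs(obs=5);
  var reportable_domain;
  format reportable_domain state.;
run;
```

Note the trailing . when the format is applied, consistent with the format naming rules discussed earlier.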
SAS has a procedure, proc sql, which allows you to form SQL-like queries within SAS. This can be more efficient than similar programs constructed using multiple proc and data steps.
In example 7, taken from Professor Shedden’s notes, we use proc sql to find all single family homes in the RECS data with mean ‘heating degree days’ above 2000.
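One possible reading of this query is sketched below; it is not the code from Professor Shedden’s notes. It assumes a dataset recs, that TYPEHUQ = 2 codes single-family detached homes, that HDD65 holds heating degree days, and that the mean is taken within REPORTABLE_DOMAIN. All of these are assumptions to verify against the codebook.

```sas
proc sql;
  create table high_hdd as
    select reportable_domain, mean(hdd65) as mean_hdd65
    from recs
    where typehuq = 2              /* assumed code for single-family homes */
    group by reportable_domain
    having mean(hdd65) > 2000;     /* filter on the group-level mean */
quit;
```

Note that proc sql ends with quit rather than run.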
Then in example 8, we use proc sql to repeat the analysis of finding the “States” with the highest proportion of wood-shingled roofs. Following the analysis, we use proc export to write the resulting table to csv.
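A rough sketch of this workflow follows; it is not the code from example 8. It assumes a dataset recs, that ROOFTYPE = 2 codes wood shingles, and that NWEIGHT is the sample weight. The output file name is a placeholder.

```sas
proc sql;
  create table wood_shingle as
    select reportable_domain,
           /* (rooftype = 2) evaluates to 1 or 0, giving a weighted proportion */
           sum(nweight * (rooftype = 2)) / sum(nweight) as p_wood
    from recs
    group by reportable_domain
    order by p_wood desc;
quit;

/* Write the result to a CSV file, overwriting any existing copy. */
proc export data=wood_shingle outfile='./wood_shingle.csv' dbms=csv replace;
run;
```

Whether homes with a missing or not-applicable roof type belong in the denominator is an analysis choice this sketch does not settle.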
While there are many useful “procs” in SAS, custom analyses often involve data manipulations done using multiple data steps. This is often called “data step programming”. In example 9, we use data step programming along with proc sort and proc summary to find the percent of single family homes within each census region more than one standard deviation above the mean electrical usage for that region using the RECS data.
In example 9, we make use of a technique called “remerging” to add group-level summary statistics as variables in our data. The basic idea of “remerging” is the following:
first, compute group-level summary statistics and store in a new table;
then, merge this table (e.g. left join) back into the table it came from using the grouping variable from step 1 to identify common rows.
To merge, note that we use a data step with a “merge” statement identifying the datasets to be merged and a “by” statement identifying the variable(s) to join on.
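The remerging steps above can be sketched on a small made-up dataset. The names homes, region, and kwh are illustrative and are not the RECS variables used in example 9.

```sas
/* Made-up data, already sorted by region. */
data homes;
  input region $ kwh;
  datalines;
A 10
A 20
A 60
B 5
B 7
B 30
;
run;

/* Step 1: group-level statistics; nway keeps only the per-group rows. */
proc summary data=homes nway;
  class region;
  var kwh;
  output out=region_stats(drop=_type_ _freq_) mean=mean_kwh std=sd_kwh;
run;

/* Step 2: merge the statistics back onto every row of the original data. */
data homes_remerged;
  merge homes region_stats;
  by region;
  /* Flag rows more than one standard deviation above the group mean. */
  high_kwh = (kwh > mean_kwh + sd_kwh);
run;
```

Both inputs to the merge must be sorted by the by variable; with real data, run proc sort on the original table first.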
Repeat example 9 using proc sql.