Introduction

Change-point analysis seeks to identify the point or points in a time series at which a change, or break in the trend, takes place. This technique is applicable to a wide variety of social science questions where the researcher is interested in the point of time at which a statistically significant change in the quantity being studied occurs.

Our tutorial demonstrates the application of change-point analysis to life expectancy data. Our World in Data provides time series of life expectancy at birth for many countries in the world over a long period of time. We selected two countries - the United States and Japan - on which to perform our change-point analysis. Life expectancy data for both these countries was available going back over a century. In the case of the United States, data was available for 1880, 1890, and every year from 1901 to 2011. For Japan, data was available approximately every five years from 1865 to 1945 and every year from 1947 to 2011.

Plotting the time series for each country, we can see that life expectancy has changed dramatically in both the United States and Japan over the domain of the data:

While both time series clearly indicate an overall upward trend in life expectancy, less obvious is the specific year or years at which a change can be considered to have occurred. (Note that outliers are not necessarily change points, as change-point analysis is concerned with changes in the overall trend rather than individual point-to-point variations. However, certain change-point models may be more sensitive to outliers than others, as we’ll see below.)

Identifying these points of change in the trend is precisely what our analysis seeks to do. In total, we built three different change-point models: one in SAS using PROC MCMC (Markov Chain Monte Carlo), a second in R using the ‘bcp’ (Bayesian Change Point) package, and a third also in R using the ‘changepoint’ package. In the first SAS analysis, we assumed that there was only one underlying change point. The R packages allowed for multiple change points. Analysis and results are given below.


Analysis and Results


SAS: PROC MCMC

This SAS procedure performs Markov Chain Monte Carlo (MCMC) simulation to fit a wide range of Bayesian statistical models. Here we used it to create Bayesian change-point models for the life expectancy data.

The first stage involved preparing the data. We standardized the year and life expectancy variables prior to running the model.

/* read in the data */
proc import datafile="M:\Private\Stats506\project\life-expectancy.csv" dbms=CSV out=le_all;
run;

/* filter to just include selected country */
data le_usa;
    set le_all;
    where code = "USA";
    *year = _N_;
run;

/* calculate summary statistics */
proc summary data=le_usa missing nway;
    var year life_expectancy;
    class code;
output out=le_usa_stats(drop=_TYPE_ _FREQ_) 
    mean(year) = meanyr std(year) = stdyr 
    mean(life_expectancy) = meanle std(life_expectancy) = stdle;
run;

/* standardize year and life_expectancy using summary statistics */
data le_usa_z;
    merge le_usa(in=a) le_usa_stats(in=b);
    by code;
    if a;
        x = (year-meanyr)/stdyr;
        y = (life_expectancy - meanle)/stdle;

/* remove first two years when measurements were at 10-yr intervals */
    if year in (1880,1890) then delete;
run;

We then used Markov Chain Monte Carlo simulation to fit a Bayesian change-point model, following closely the example given in the PROC MCMC documentation on the SAS website.

/* changepoint model using Markov Chain Monte Carlo */
proc mcmc data=le_usa_z outpost=postout seed=24860 
    ntu=1000 nmc=20000 /* use 20000 iterations after burn-in of 1000 */
    statistics=summary;
ods select PostSummaries;
ods output PostSummaries=test1;
    
array beta[2];
parms alpha cp beta1 beta2;
parms s2;
    
prior cp ~ unif(-1.3, 1.1);
prior s2  ~ uniform(0, 5);
prior alpha beta:  ~ normal(0, v = 1e3);
   
j = 1 + (x >= cp);
mu = alpha + beta[j] * (x - cp);
model y ~ normal(mu, var=s2);
run;

Here are the resulting posterior summary statistics for the model parameters. Note that the standardized mean of the change point (cp) is \(-0.0395\). This corresponds to the gap in between the years 1953 and 1954.

We can visualize the posterior distribution of the changepoint as follows. Once again, we rely heavily on the example from the SAS documentation.

data _null_;
    set test1;
    call symputx(parameter, mean);
run;
    
data b;
    missing A;
    input x1 @@;
    if x1 eq .A then x1 = &cp;
    if _n_ <= 2 then y1 = &alpha + &beta1 * (x1 - &cp);
    else y1 = &alpha + &beta2 * (x1 - &cp);
    datalines;
    -3 A 1.5;
run;
    
proc kde data=postout;
univar cp / out=m1 (drop=count);
run;
    
data m1;
    set m1;
    density = (density / 3) - 4;
run;
    
data all;
    set le_usa_z b m1;
    rename x=Year_Standardized y=Life_Expectancy_Standardized;
run;

The posterior distribution of the change point appears roughly normal.

Finally, we can generate a scatterplot that shows the change point.

title "Change-Point Model of United States Life Expectancy";
title;
proc sgplot data=all noautolegend;
    scatter x=x y=y;
    series x=x1 y=y1 / lineattrs = graphdata2;
    series x=value y=density / lineattrs = graphdata1;
run;

In the plot above, we can see the “jump” between the two points to the immediate left (\(year=1953\)) and right (\(year=1954\)) of the dividing line, thus giving visual support to our change-point model. Note that this model was seemingly unaffected by the outlier around 1917.

We can rerun the same analysis on the subset of data from Japan to obtain the following results.

In the plot above, the change point (represented by the vertical line) corresponds to the gap between the years 1959 and 1960. However, the change point’s underlying distribution appears to be bimodal, leading us to seriously question whether the mean is an appropriate summary statistic in this case. It could be that multiple change points, or an entirely different model, would better represent the data for Japan. This could be a subject of future research. Regardless, it is apparent that this particular model was not sensitive to the outlier around 1945.


R: ‘bcp’ Package

The following analysis uses the ‘bcp’ (Bayesian Change Point) and the ‘dplyr’ packages in R. Like PROC MCMC in SAS, the heart of the change-point analysis in this package is Bayesian analysis using a Markov Chain Monte Carlo (MCMC) algorithm. So we would need to specify a prior distribution, and then use the \(bcp\) function to get the desired results, including the posterior probability of each year being a change point.

Make sure that you can load both packages before running the examples below. If you do not already have them installed, first run:

install.packages("bcp")
install.packages("dplyr")

Example 1:

To prepare for our data analysis below, we first imported the dataset as a data frame, and then used more interpretable names for the columns of the dataset.

library(dplyr)
library(bcp)

allcountries = read.csv("life_expectancy.csv", header = TRUE)
colnames(allcountries) = c("country", "code", "year", "expectancy")
# View the first few rows of the data
head(allcountries)
##       country code year expectancy
## 1 Afghanistan  AFG 1950     27.706
## 2 Afghanistan  AFG 1951     27.706
## 3 Afghanistan  AFG 1952     27.706
## 4 Afghanistan  AFG 1953     27.706
## 5 Afghanistan  AFG 1954     27.706
## 6 Afghanistan  AFG 1955     30.284

Since for this example we were only interested in life expectancy for the United States, we used piping to subset the data and also dropped the “country” column as it does not provide any more information than the “code” column.

We can also get the basic descriptives for the new dataset by using \(summary\).

us_expectancy = allcountries %>% 
                filter(code == "USA") %>%
                select(code, year, expectancy)

summary(us_expectancy)
##       code          year        expectancy   
##  USA    :113   Min.   :1880   Min.   :39.41  
##         :  0   1st Qu.:1927   1st Qu.:58.50  
##  ABW    :  0   Median :1955   Median :69.50  
##  AFG    :  0   Mean   :1955   Mean   :66.13  
##  AGO    :  0   3rd Qu.:1983   3rd Qu.:74.56  
##  ALB    :  0   Max.   :2011   Max.   :78.83  
##  (Other):  0

It can be helpful to visualize our data in a plot:

plot(us_expectancy$year, us_expectancy$expectancy, type = "b", 
     xlab = "year", ylab = "Life Expectancy", 
     main = "USA Life Expectancy at Birth")

Notice that there are two gaps of 10+ years between the first 3 observations. We dropped the first two observations so that all observations in the dataset would be one year apart in our analysis.

us_expectancy = us_expectancy[-c(1,2),]
# check the dimension of our udpated dataset
dim(us_expectancy)
## [1] 111   3

We also noticed that there’s a dip in life expectancy on the graph around year 1920. We can check exactly which year it is and keep that in mind in our analysis.

us_expectancy[which(us_expectancy$expectancy == min(us_expectancy$expectancy)),]
##    code year expectancy
## 20  USA 1918       47.2

We now need to specify the parameters for the ‘bcp’ model. One of the most important parameters to specify is the prior parameter \(p0\). According to the package documentation, “for sequential data, \(p0\) is the parameter of the prior on change-point probabilities, Unif(0, \(p0\)), on the probability of a change point at each location in the sequence.” For this example, we assigned \(p0\) a value of 0.008. Then, since the draws from our distribution were random, we set a seed to make sure that our result is reproducible. Next, we set \(burnin\), the number of iterations that were excluded from the estimation of the posterior means and probabilities of changes, to 1000, and set \(mcmc\), the total number of iterations used, to 10000.

We can then graph the result of our Bayesian analysis of change-point problems.

set.seed(101)
bcp_us = bcp(us_expectancy$expectancy, p0 = 0.008, burnin = 1000, 
             mcmc = 10000)

plot(bcp_us, main = "USA Life Expectancy Example")

Note that this Bayesian approach estimates both the posterior means and the posterior probability of a change point for each position in our sequence of observations from years 1901 to 2011. We can make a graph of posterior probability vs. year to visualize our result.

year_prob_us = cbind(us_expectancy$year, bcp_us$posterior.prob)
plot(year_prob_us, type = "l", xlab = "Year", ylab = "Posterior Probability")

We can also create a data frame with “year” and “posterior probability” being the columns and arrange them in descending order based on the posterior probabilities. We can then view the first few rows of the data and check which years are likely to be the change points.

colnames(year_prob_us) = c("year", "prob")
year_prob_us = as.data.frame(year_prob_us) %>% arrange(desc(prob))
head(year_prob_us)
##   year   prob
## 1 1917 1.0000
## 2 1918 1.0000
## 3 1920 0.9934
## 4 1907 0.8236
## 5 1937 0.7528
## 6 1953 0.6281

We can see that years 1917 and 1918 (around WWI) are change points with probability 100%, and year 1920 also has a probability of almost 100% to be a change point. This is interesting since previously we noticed that year 1918 could very likely be an outlier in our dataset. Hence, we may guess that the calculation for change points in this package may be very sensitive to outliers. Additionally, years 1907, 1937, and 1953 are likely to be change points.


Example 2:

For this example, since we were interested in life expectancy for Japan, we again used piping to subset the data and dropped the “country” column as it does not provide any more information than the “code” column.

We can again get the basic descriptives for the new dataset by using \(summary\) and also graphing the data.

jpn_expectancy = allcountries %>% 
    filter(code == "JPN") %>%
    select(code, year, expectancy)

summary(jpn_expectancy)
##       code         year        expectancy   
##  JPN    :81   Min.   :1865   Min.   :30.54  
##         : 0   1st Qu.:1951   1st Qu.:60.92  
##  ABW    : 0   Median :1971   Median :72.81  
##  AFG    : 0   Mean   :1964   Mean   :67.17  
##  AGO    : 0   3rd Qu.:1991   3rd Qu.:79.25  
##  ALB    : 0   Max.   :2011   Max.   :83.09  
##  (Other): 0
plot(jpn_expectancy$year, jpn_expectancy$expectancy, type = "b",
     xlab = "year", ylab = "Life Expectancy", 
     main = "Japan Life Expectancy at Birth")

We noticed that there’s a dip in life expectancy on the graph around year 1945. We can check exactly which year it is and keep that in mind in our analysis.

jpn_expectancy[which(jpn_expectancy$expectancy == min(jpn_expectancy$expectancy)),]
##    code year expectancy
## 16  JPN 1945    30.5368

For this example, we assigned \(p0\) a value of 0.009, and set \(burning\) and \(mcmc\) to 1000 and 10000, respectively.

We can then graph the result of our Bayesian analysis of change-point problems.

set.seed(101)
bcp_jpn = bcp(jpn_expectancy$expectancy, p0 = 0.009, burnin = 1000, 
             mcmc = 10000)

plot(bcp_jpn, main = "Japan Life Expectancy Example")

Once again, we made a graph of posterior probability vs. year to visualize the probability of each year being a change point.

year_prob_jpn = cbind(jpn_expectancy$year, bcp_jpn$posterior.prob)
plot(year_prob_jpn, type = "l", xlab = "Year", ylab = "Posterior Probability")

We can also create a data frame with “year” and “posterior probability” being the columns and arrange them in descending order based on the posterior probabilities. Now, we can view the first few rows of the data and check which years are likely to be the change points.

colnames(year_prob_jpn) = c("year", "prob")
year_prob_jpn = as.data.frame(year_prob_jpn) %>% arrange(desc(prob))
head(year_prob_jpn)
##   year   prob
## 1 1935 1.0000
## 2 1945 1.0000
## 3 1947 0.8187
## 4 1922 0.8090
## 5 1951 0.6834
## 6 1957 0.4668

Similar to the conclusion in Example 1, we can see that years 1935 and 1945 (around WWII) are change points with probability 100%. Note that these are the observations right before the suspected outlier and the outlier itself. Hence, there is a good chance that the calculation for change points in this package is very sensitive to outliers. Additionally, years 1947, 1922, and 1951 are likely to be change points.


R: ‘changepoint’ Package

This package aims to identify a single or multiple change points within a data set. We would mainly be using the \(cpt.mean\) function, which calculates the optimal positions and the potential number of change points for a time series using a user-specified method. The available algorithms include “AMOC”, “BinSeg” (Binary Segmentation), and “PELT” (Pruned Exact Linear Time).

Based on the same set of data (“life_expectancy.csv”), the following analysis was done with the ‘changepoint’ and ‘data.table’ packages in R. Make sure that you can load them before running the examples below. If you do not have the package installed, first run:

install.packages("changepoint")
install.packages("data.table")

Again, we first imported the dataset as a data frame, and then used more interpretable names for the columns of the dataset.

library(changepoint)
library(data.table)
## data: life expectancy for 199 countries
life <- read.csv('life_expectancy.csv', header = T)
# turn column name into more interpretable ones
colnames(life) = c("country", "code", "year", "expectancy")
# ignore first column since it's equivalent to the second one
life <- life[, -1]

Example 1:

In the first example, we would only need the data for the United States, so we used the data.table package to filter to rows whose code column is “USA”.

#USA
USA <- as.data.table(life)[code == "USA"]

Let’s have a quick look at our data in a plot:

plot(USA$year, USA$expectancy, type = "b", 
     xlab = "year", ylab = "Life Expectancy")

Notice that there are two gaps of 10+ years between the first three observations. We dropped the first two observations so that all observations in the dataset would be one year apart in our analysis.

USA <- USA[-c(1,2), ]

First of all, the trend of this data needs to be omitted before the analysis. We can find the first difference using the \(diff()\) function.

# first difference
de.tr_USA <- c(0,diff(USA$expectancy))
plot(de.tr_USA)

Now, we use the \(cpt.mean\) function to find the change point(s). By default, the “AMOC” algorithm would try to obtain a single change point. We would also consider obtaining multiple change points using the approximate (BinSeg) and the exact (PELT) methods.

Note that the function \(cpts\) can actually directly give us the location of a change point in our data set without the other output of the function \(cpt.mean\) (like ‘Changepoint type’).

# AMOC, by default
cpts(cpt.mean(de.tr_USA))
## numeric(0)
# Binary Segmentation
cpts(cpt.mean(de.tr_USA, method = "BinSeg"))
## numeric(0)
# PELT
cpts(cpt.mean(de.tr_USA, method = "PELT"))
## [1] 17 18 19
plot(cpt.mean(de.tr_USA, method = "PELT"))

From the output above, we can see that there’s no indication of change points in our data set after removing the trend using either the “AMOC” method or the “BinSeg” method. However, three change points, the \(17^{th}\) to the \(19^{th}\) observations, can be identified by using the “PELT” method,

USA[c(17:19),]
##    code year expectancy
## 1:  USA 1917       54.0
## 2:  USA 1918       47.2
## 3:  USA 1919       55.3

which correspond to the years of 1917, 1918, and 1919.


Example 2:

The second example focuses on Japan.

# Japan
Japan <- as.data.table(life)[code == "JPN"]

plot(Japan$year, Japan$expectancy, type = "b", 
     xlab = "year", ylab = "Life Expectancy")

This time we would choose the “BinSeg” (binary segmentation) method, since it tends to find fewer numbers of change points, which is what we prefer.

# locate the changepoint
cpt.mean(c(0,diff(Japan$expectancy)), method = "BinSeg")
## Class 'cpt' : Changepoint Object
##        ~~   : S4 class containing 14 slots with names
##               cpts.full pen.value.full data.set cpttype method test.stat pen.type pen.value minseglen cpts ncpts.max param.est date version 
## 
## Created on  : Mon Dec 04 01:43:11 2017 
## 
## summary(.)  :
## ----------
## Created Using changepoint version 2.2.2 
## Changepoint type      : Change in mean 
## Method of analysis    : BinSeg 
## Test Statistic  : Normal 
## Type of penalty       : MBIC with value, 13.18335 
## Minimum Segment Length : 1 
## Maximum no. of cpts   : 5 
## Changepoint Locations : 15 16 18 
## Range of segmentations:
##      [,1] [,2] [,3] [,4] [,5]
## [1,]   16   NA   NA   NA   NA
## [2,]   16   15   NA   NA   NA
## [3,]   16   15   18   NA   NA
## [4,]   16   15   18   13   NA
## [5,]   16   15   18   13   22
## 
##  For penalty values: 15.0407 15.0407 15.0407 8.915925 4.054938
plot(cpt.mean(c(0, diff(Japan$expectancy)), method = "BinSeg"))

Japan[c(15,16,18),]
##    code year expectancy
## 1:  JPN 1935    48.2425
## 2:  JPN 1945    30.5368
## 3:  JPN 1948    56.8300

We can see that the \(15^{th}\), \(16^{th}\), and the \(18^{th}\) observations (i.e., years 1935, 1945, and 1948) are where the change points are in this data set using the ‘changepoint’ package.


Comparsion of the Results

Here is a summary of the years identified as change points by each of the three models in the United States and Japan life expectancy data.

United States
Package Year
SAS 1954
R (‘bcp’) 1907, 1917, 1918, 1920
R (‘changepoint’) 1917, 1918, 1919
Japan
Package Year
SAS 1960
R (‘bcp’) 1922, 1935, 1945, 1947
R (‘changepoint’) 1935, 1945, 1948

As we can see, the two R packages produced fairly similar results, despite ‘bcp’ employing a Bayesian approach and ‘changepoint’ employing a frequentist approach. SAS, although employing a Bayesian approach like ‘bcp’, produced results quite different than either R package. This could be due to a difference in the prior distribution, which must be specified explicitly in SAS while in ‘bcp’ it is regulated by a built-in parameter of the function call. On the positive side, the SAS model seemed much less sensitive to the outliers in the data. Ultimately, it is up to the researcher doing the change-point analysis to determine which algorithm is best suited for the specific model and data.