Introduction

According to the survey from American Diabetes Association, more than 30 million Americans have diabetes and around 84 million Americans have pre-diabetes. The health care costs of diagnosed diabetes are 327 billion per year according to the 2017 statistics. Compared with those without diabetes, people with diabetes are spending 2.3 times more on their medical expenditure. This gap is striking. With National Health and Nutrition Examination Survey on hand, our group are interested in exploring the prevalence of diabetes in terms of demographic features, dietary habits and alcohol use.

So our question is:

In terms of demographic features, alcohol use and dietary habits, what features are most associated with the prevalence of diabetes?

In the end of our analysis, we will show how gender, age, nutrient, occupation, alcohol use and sleep quality are associated with the probability of being diagnosed with diabetes. By comparing respective odds ratios, we will come to a conclusion that describes which population are more subject to diabetes.

Data

Here we use survey data in the 2015 - 2016 cycle using NHANES data.

Response variable

Variable Description Data Type Dataset
DIQ010 Doctor told you have diabetes Categorical. 0/1 DIQ_I.XPT

Predicting variables

Variable Description Data Type Dataset
RIAGENDER Gender(1: male; 2: female) Categorical. 1/2 DEMO_I.XPT
RIDAGEYR Age in years at screening Positive Integer. [18,80] DEMO_I.XPT
ALQ130 Average alcoholic drinks/day in the past 12 months Positive Integer. [0,15] ALQ_I.XPT
DR1TSUGR Total sugar (gm) Double. [0.33, 533.44] DR1TOT_I.XPT
DR1TTFAT Total fat (gm) Double. [3.54, 498.63] DR1TOT_I.XPT
OCQ260 Description of job/work situation Categorical. 1/2/3 OCQ_I.XPT
SLD012 Sleep hours Double. [3.0, 2.0] SLQ_I.XPT

Methods

In statistics, the logistic model is used to model the probability of a certain class, such as pass/fail, win/lose, alive/dead or healthy/sick.

Here, our response variable is binary (diabetes or not). In ordinary linear model, the response variable range from negative infinite to positive infinite. In order to model a binary dependent variable, we need to use a logistic function as a link function: \[ \\log(\tfrac{p}{1-p}) \] Then we can use logistic regression to fit our model: \[ \\log(\tfrac{p}{1-p}) =\eta = \beta_0 + \sum_{i=1}^{n}\beta_iX_{i} \] Use maximum likelihood approach to find parameters that maximize the likelihood of the data: \[ \ell(\beta) = \sum_{i=1}^{n}[y_i(x_i^\intercal\beta)-n_i\log(1+exp(x_i^\intercal\beta))] \] Then we can get the probability of success (here means being told you have diabetes): \[ p = P(Y = 1) = \tfrac{e^{\beta_0 + \sum_{i=1}^{n}\beta_iX_{i}}}{e^{\beta_0 + \sum_{i=1}^{n}\beta_iX_{i}} + 1} \]

We have set some common rules for performing the data cleaning part. The response variable DIQ010 should be coded as 1: Yes and 0: No and should be treated as a categorical variable. RIAGENDER (Gender) should be coded as 1: Males and 2: Female and should be treated as categorical variable. ALQ130 (Avg # alcoholic drinks/day past 12 mons) should range from 1 to 15. The use of two datasets DR1IFF_I.XPT and DR1TOT_I.XPT is to confirm the measurement accuracy of total sugar intake and total fat intake. OCQ260 (Description of job/work situation) should be recoded as 1: private business employee, 2: government employee and 3: self-employed and should be treated as a categorical variable.

Then we preprocessed and cleaned all the variables. The rows with missing values are removed and some variables are recoded for consistency. After the data preprocessing step, to get a basic idea about relationship between the response and some predictors, we first create a few contingency tables for each categorical predictor and the response. Then we fit the dataset with logistic regression using diabetes as the response variable and all the others as predictors, without any interaction terms or transformations. We carried out the above procedures in R, python and STATA.

Core Analysis

R

Data loading

# data: ------------------------------------------------------------------------
library(SASxport)
path = "H:/Personal/STATS506/project"
files = list.files(path, pattern = "*.XPT", full.names = T)
dataset = sapply(files, read.xport, simplify = FALSE)


# rename data
names = c("alcohol", "demo", "diabetes", "total_nutri", 
          "occupation", "sleep")
names(dataset) = names

Variables selecting

# clean up key variables used in this problem: ---------------------------------
library(data.table)
# select all key variables
ds1 = as.data.table(dataset$alcohol)[,.(SEQN, alcohol = ALQ130)]
ds2 = as.data.table(dataset$demo)[,.(SEQN, gender = RIAGENDR, age = RIDAGEYR)]
ds3 = as.data.table(dataset$diabete)[,.(SEQN, 
                                        diabetes = ifelse(DIQ010 == 2, 0, DIQ010))]
ds4 = as.data.table(dataset$total_nutri)[,.(SEQN, 
                                            tot_sugar = DR1TSUGR, 
                                            tot_fat = DR1TTFAT)]


ds5 = as.data.table(dataset$occupation)[,.(SEQN, occupation = OCQ260)]
ds6 = as.data.table(dataset$sleep)[,.(SEQN, sleep = SLD012)]

# merge all datasets
DS = Reduce(merge,list(ds1,ds2,ds3,ds4,ds5,ds6))

Data cleaning

# remove missing values and unqualified values: --------------------------------
DS = na.omit(DS)[diabetes %in% c(0,1),
                  ][!(alcohol %in% c(777,999)),
                    ][!(occupation %in% c(6, 77, 99)),
                      ][occupation %in% c(2,3,4), occupation := 2
                        ][occupation == 5, occupation := 3
                          ][
                            , `:=`(occupation = as.factor(occupation),
                                   diabetes = as.factor(diabetes),
                                   gender = as.factor(gender))
                          ][, -c("SEQN"), with = FALSE
                            ][, `:=`(age = age - mean(age),
                                   tot_sugar = tot_sugar - mean(tot_sugar),
                                   tot_fat = tot_fat - mean(tot_fat),
                                   sleep = sleep - mean(sleep))]
summary(DS)
##     alcohol       gender        age          diabetes   tot_sugar      
##  Min.   : 1.000   1:1119   Min.   :-23.162   0:1835   Min.   :-107.99  
##  1st Qu.: 1.000   2: 898   1st Qu.:-12.162   1: 182   1st Qu.: -50.30  
##  Median : 2.000            Median : -1.162            Median : -13.84  
##  Mean   : 2.829            Mean   :  0.000            Mean   :   0.00  
##  3rd Qu.: 3.000            3rd Qu.: 10.838            3rd Qu.:  33.55  
##  Max.   :15.000            Max.   : 38.838            Max.   : 425.12  
##     tot_fat        occupation     sleep         
##  Min.   :-86.415   1:1502     Min.   :-4.45811  
##  1st Qu.:-32.415   2: 298     1st Qu.:-0.95811  
##  Median : -8.175   3: 217     Median : 0.04189  
##  Mean   :  0.000              Mean   : 0.00000  
##  3rd Qu.: 24.455              3rd Qu.: 0.54189  
##  Max.   :408.675              Max.   : 4.54189
The first 10 rows of final data
alcohol gender age diabetes tot_sugar tot_fat occupation sleep
1 1 20.837878 1 -66.013342 -10.71529 2 -1.9581061
1 2 14.837878 0 -53.553342 -40.72529 2 -0.9581061
8 1 -19.162122 0 59.396658 1.11471 1 -0.9581061
1 1 4.837878 0 6.286658 -58.29529 2 2.5418939
3 1 3.837878 0 125.736658 83.59471 3 0.5418939
2 2 -11.162122 0 180.946658 77.26471 2 -0.9581061
2 1 7.837878 0 -74.323342 -46.39529 1 -0.4581061
3 2 -14.162122 0 46.556658 17.12471 1 -0.4581061
4 1 -19.162122 0 41.956658 -26.26529 1 0.5418939
2 1 -17.162122 0 60.656658 14.34471 1 -0.4581061

After data cleaning, we got a dataset with 2017 observations.

Two-way contingency table

# preliminary exploration: -----------------------------------------------------
# two-way contingency table
table(DS$diabetes, DS$occupation)
##    
##        1    2    3
##   0 1380  268  187
##   1  122   30   30
table(DS$diabetes, DS$gender)
##    
##        1    2
##   0 1011  824
##   1  108   74
# table using proportions:
# prop.table(table(DS$diabete, DS$occupation))
# prop.table(table(DS$diabete, DS$gender))

It seems like that occupation 1 has the lowest diabete percentage and occupation 3 has the highest diabete percentage.
From the diabete percetage, men are more likely to get diabetes than women.

Logistic model

# logistic model: --------------------------------------------------------------
# training

fit = glm(diabetes ~ ., data = DS, family = "binomial")
summary(fit)
## 
## Call:
## glm(formula = diabetes ~ ., family = "binomial", data = DS)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.1869  -0.4705  -0.3273  -0.2317   2.8054  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.470979   0.191880 -12.878  < 2e-16 ***
## alcohol     -0.044419   0.044452  -0.999  0.31767    
## gender2     -0.125443   0.175450  -0.715  0.47462    
## age          0.057351   0.006170   9.294  < 2e-16 ***
## tot_sugar   -0.004542   0.001450  -3.133  0.00173 ** 
## tot_fat      0.003790   0.001872   2.025  0.04292 *  
## occupation2 -0.020747   0.224888  -0.092  0.92649    
## occupation3  0.094191   0.236026   0.399  0.68984    
## sleep       -0.047183   0.064609  -0.730  0.46521    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1222.6  on 2016  degrees of freedom
## Residual deviance: 1093.6  on 2008  degrees of freedom
## AIC: 1111.6
## 
## Number of Fisher Scoring iterations: 6

We can see from the result that age, total sugar and total fat are significant to this model.
Then do some additional analysis.

Data visualization and additional analysis

# plots: ------------------------------------------------------------------------
# graphs: 
# 1. plot the fraction of respondents who have diabete against their age
# 2. plot the fraction of all respondents against their age
# 3. plot the fraction of respondents who have diabete against their sleep hours
# 4. plot the fraction of all respondents against their hours
# 5. plot the fraction of respondents who have diabete against their total sugar
# 6. plot the fraction of all respondents against their total sugar

diabetes_1 = DS[diabetes == '1']
hist(diabetes_1$age, main = "* Graph 1: AGE pattern in diabetes *",
     xlab="age", las = 1 )

hist(DS$age, main = "* Graph 2: AGE pattern in all respondents *",
     xlab="age", las = 1 )

hist(diabetes_1$sleep, main = "* Graph 3: SLEEP pattern in diabetes *",
     xlab="sleep", las = 1  )

hist(DS$sleep, main = "* Graph 4: SLEEP pattern in all respondents *",
     xlab="sleep", las = 1 )

hist(diabetes_1$tot_sugar, main = "* Graph 5: SUGAR pattern in diabetes *",
     xlab="sugar", las = 1  )

hist(DS$tot_sugar, main = "* Graph 6: SUGAR pattern in all respondents *",
     xlab="sugar", las = 1 )

Graph 2 and 3 prove that sleeping hours has little to do with diabetes, because the distribution of both look very similar. And that is also corresponding to our model result.

We can see from the model result that the coefficient of sugar is negative, which contradicts our intuition that more sugar intake increases the probability of developing diabete. So maybe genetic effect is much more important than the amount of sugar intake. Since people who have family diabete history are likely to be more carefull about sugar intake.

Python

Load data & select variables

import pandas as pd
import os
import numpy as np
# read data and extract variables simultaneously
xpt_files = ["ALQ_I.XPT", "DEMO_I.XPT", "OCQ_I.XPT", "DIQ_I.XPT", 
             "DR1IFF_I.XPT", "SLQ_I.XPT", "DR1TOT_I.XPT"]
base = ""
vars = [["SEQN", "ALQ130"],
        ["SEQN", "RIAGENDR", "RIDAGEYR"],
        ["SEQN", "OCQ260"],
        ["SEQN", "DIQ010"],
        ["SEQN", "DR1ISUGR", "DR1ITFAT"],
        ["SEQN", "SLD012"],
        ["SEQN", "DR1TSUGR", "DR1TTFAT"]
    ]

# STORE files in a list
da = []
for idf, fn in enumerate(xpt_files):
    df = pd.read_sas(os.path.join(base, fn))
    df = df.loc[:, vars[idf]]
    da.append(df)
    
# label the datasets
alcohol = da[0]
demo = da[1]
diabete = da[3]
dietary = da[4]
occupation = da[2]
sleep = da[5]
total_nutrient = da[6]

Data preparation

# data cleaning - missing, re-categorize,
# diebete
for i in range(diabete.shape[0]):
    if diabete.iloc[i][1] == 2.0:
        diabete.iloc[i][1] = int(0)
    elif diabete.iloc[i][1] == 1.0:
        diabete.iloc[i][1] = int(diabete.iloc[i][1])
    else:
        diabete.iloc[i][1] = np.nan
        
# alcohol
for j in range(alcohol.shape[0]):
    if alcohol.iloc[j][1] in [777.0, 999.0]:
        alcohol.iloc[j][1] == np.nan
    
# occupation
for k in range(occupation.shape[0]):
    if occupation.iloc[k][1] in [77.0, 99.0, 6.0]:
        occupation.iloc[k][1] = np.nan
    elif occupation.iloc[k][1] in [2.0, 3.0, 4.0]:
        occupation.iloc[k][1] = 2.0
    elif occupation.iloc[k][1] == 5.0:
        occupation.iloc[k][1] = 3.0

# dietary
dietary = dietary.groupby('SEQN').sum()
dietary.columns = ["Tot_sugar", "Tot_fat"]
dietary = dietary.reset_index()

Merge all datasets

# merge tables
table = pd.merge(left=alcohol,right=demo, left_on='SEQN', right_on='SEQN')
table = pd.merge(left=table,right=diabete, left_on='SEQN', right_on='SEQN')
table = pd.merge(left=table,right=dietary, left_on='SEQN', right_on='SEQN')
table = pd.merge(left=table,right=occupation, left_on='SEQN', right_on='SEQN')
table = pd.merge(left=table,right=sleep, left_on='SEQN', right_on='SEQN')
table = table.dropna()
# 2087 rows left after removing nan

Center age, sleep hours and tot_fat

import statistics as stats
# center age, sleep hours, and tot_fat
table['RIDAGEYR'] = table['RIDAGEYR'] - stats.mean(table['RIDAGEYR'])
table['SLD012'] = table['SLD012'] - stats.mean(table['SLD012'])
table['Tot_fat'] = table['Tot_fat'] - stats.mean(table['Tot_fat'])

Two-way contingency table

  • Diabete vs Gender
  • Diabete vs Occupation
# contingency table - gender
pd.crosstab(table["DIQ010"], table["RIAGENDR"])
# contingency table - occupation
pd.crosstab(table["DIQ010"], table["OCQ260"])
# Change numeric to dummy
# handle multi-level categorical variable - OCQ260
occ = pd.get_dummies(table['OCQ260'], prefix='OCQ260')
gender = pd.get_dummies(table['RIAGENDR'], prefix='RIAGENDR')
occ.columns = ['business', 'government', 'self-employee']
gender.columns = ['male','female']
cols_to_keep = ['ALQ130', "RIDAGEYR","SLD012", "Tot_sugar", "Tot_fat"]

df_temp = table[cols_to_keep].join(occ.loc[: , 'government' : ])
df_temp = df_temp.join(gender['female'])
X = df_temp.iloc[:,:]
# add intercept manually
X['intercept'] = 1.0
y = table['DIQ010']

Fit logistic model

import statsmodels.api as sm
logit_model=sm.Logit(y,X)
result=logit_model.fit()
print(result.summary())

STATA

Data Loading

* Title: This script analyzes the relationship between the prevalence of diabetes
*        and dietary habits, demographic features and alcohol use.
* Import and merge ------------------------------------------------------------
pwd
cd C:\Users\xiaolc\Downloads
pwd

// response variables
// diabetes
fdause DIQ_I.XPT, clear
quietly compress
gsort +seqn
keep seqn diq010
save DIABETES.dta, replace

// predicting variables
// gender, age
fdause DEMO_I.XPT, clear
quietly compress
gsort +seqn
keep seqn riagendr ridageyr
merge 1:1 seqn using DIABETES.dta

keep if _merge == 3
save DIA_DEMO.dta, replace

// alcohol
fdause ALQ_I.XPT, clear
quietly compress
gsort +seqn
keep seqn alq130
merge 1:1 seqn using DIA_DEMO.dta, generate(merge2)

keep if merge2 == 3
save DIA_DEMO_ALC.dta, replace

// total sugars, total fat
fdause DR1TOT_I.XPT
quietly compress
gsort +seqn
keep seqn dr1tsugr dr1ttfat
merge 1:1 seqn using DIA_DEMO_ALC.dta, generate(merge3)

keep if merge3 == 3
save DIA_DEMO_ALC_NUTR.dta, replace

// occupation
fdause OCQ_I.XPT, clear
quietly compress
gsort +seqn
keep seqn ocq260
merge 1:1 seqn using DIA_DEMO_ALC_NUTR.dta, generate(merge4)

keep if merge4 == 3
save DIA_DEMO_ALC_NUTR_OCU.dta, replace

// sleep disorder
fdause SLQ_I.XPT
quietly compress
gsort +seqn
keep seqn sld012
merge 1:1 seqn using DIA_DEMO_ALC_NUTR_OCU.dta, generate(merge5)

keep if merge5 == 3
save dataset.dta, replace

Data Processing

rename sld012 sleep
rename ocq260 occupation
rename dr1tsugr tot_sugr
rename dr1ttfat tot_fat
rename alq130 alcohol
rename riagendr gender
rename ridageyr age
rename diq010 diabete

// recode diabete: 0: no diabete, 1: has diabete
replace diabete = 0 if diabete == 2

// recode occupation: 1: private business employee
//                    2: government employee
//                    3: self-employed
replace occupation = 2 if (occupation == 2 | occupation == 3 | occupation == 4)
replace occupation = 3 if occupation == 5

// remove missing values and unqualified values
generate missing = 0
replace missing = 1 if (seqn == . | diabete == 3 | diabete == 7 | diabete == 9 | diabete == . | age == . | gender == . | alcohol == 777 | alcohol == 999 | alcohol == . | tot_fat == . | tot_sugr == . | occupation == 6 | occupation == 77 | occupation == 99 | occupation == . | sleep == .)
drop if missing == 1

// center age, sleep, total fat and total sugar
summarize age, meanonly
replace age = age - r(mean)

summarize sleep, meanonly
replace sleep = sleep - r(mean)

summarize tot_fat, meanonly
replace tot_fat = tot_fat - r(mean)

summarize tot_sugr, meanonly
replace tot_sugr = tot_sugr - r(mean)

Model Fitting

// fit the logistic model
logit diabete age i.gender alcohol tot_fat tot_sugr i.occupation sleep, or
matrix model = r(table)

mata
model = st_matrix("model")
model = round(model[(1, 2, 5, 6), (1, 3, 4, 5, 6, 8, 9, 10)], 0.001)
model = model'
st_matrix("final_model", model)
end

putexcel set 506project_res.csv, replace
putexcel A2 = "age"
putexcel A3 = "2.gender"
putexcel A4 = "alcohol"
putexcel A5 = "tot_fat"
putexcel A6 = "tot_sugr"
putexcel A7 = "2.occupation"
putexcel A8 = "3.occupation"
putexcel A9 = "sleep"
putexcel B1 = "est"
putexcel C1 = "se"
putexcel D1 = "lwr"
putexcel E1 = "upr"
putexcel B2 = matrix(final_model)

putexcel save
Odds Ratios
est se lwr upr
age 1.059 0.007 1.046 1.072
2.gender 0.882 0.155 0.625 1.244
alcohol 0.957 0.043 0.877 1.044
tot_fat 1.004 0.002 1.000 1.007
tot_sugr 0.995 0.001 0.993 0.998
2.occupation 0.979 0.220 0.630 1.522
3.occupation 1.099 0.259 0.692 1.745
sleep 0.954 0.062 0.840 1.083

Summary

From the regression outputs, we observe that age, total sugar intake, and total fat intake are most associated with the prevalence of diabetes. Whereas gender, sleep hours, occupation (government, self-employee, business), and alcohol intake are not significant predictors. In particular, the diabetes are more prevalent in the group of elder people, and in the group of people who ingest more fat and less sugar.
The odds ratio for age is 1.059. It means that with one unit increase in age, the odd of being diagnosed with diabetes will increase by a factor of 1.059. The odds ratio for total fat is 1.004. It means that with one unit increase in total fat intake, the odd of diagnosis will increase by a factor of 1.004. The odds ratio for total sugar is 0.995. It means that with one unit increase in total sugar intake, the odd of diagnosis will decrease by a factor of 0.995.

Discussion(To be improved)

  1. Variable selection: There may be other variables in other datasets that may have stronger associations with the prevalence of diabetes. Due to the limited time, we were not able to explore as many datasets/variables as we wanted.
  2. Model selection: In addition to the logistic regression, there may exist other models that also fit the data well such as lasso or ridge regression. If possible, we could try these models.
  3. We can improve the model performance by adding interaction terms (two-way or three-way).
  4. Our goal was to explore the data and do the inference. But beyond that we are also interested in prediction. For doing so we can split dataset into training and testing sets and use MSE to evaluate our model.

References

  1. Faraway, Julian J. Extending the linear model with R: generalized linear, mixed effects and nonparametric regression models. Chapman and Hall/CRC, 2016.
  2. American Diabetes Association, 1995 - 2019.