Diabetes is a disease that occurs when the body doesn't make or use the hormone insulin properly. In the past, doctors thought that only adults were at risk of developing type 2 diabetes, but an increasing number of children in the United States are now being diagnosed with the disease, an increase doctors attribute mostly to unhealthy eating habits among children. Since people are also paying more attention to their health nowadays, it is interesting to consider the effect of health insurance on eating habits. We therefore pose a question that connects eating habits, health insurance, and diabetes status:
Do people’s eating habits have the same effect on their diabetes status with or without health insurance?
To answer this question, we first include a broad set of dietary variables in our analysis and then use a penalized logistic model to select the important ones. To see whether the effect of eating habits differs once insurance is taken into account, we create new dietary variables interacted with insurance status and include them in our model.
The data used are from the National Health and Nutrition Examination Survey (NHANES). The specific datasets used are: 2015-2016 Demographic Variables and Sample Weights, 2015-2016 Health Insurance, 2015-2016 Diabetes, 2015-2016 Dietary Interview - Total Nutrient Intakes (First Day), and 2015-2016 Dietary Interview - Total Nutrient Intakes (Second Day). From the Demographic Variables and Sample Weights dataset we used: respondent sequence number (seqn), age, gender, and income. From the Health Insurance dataset we used: respondent sequence number (seqn) and insurance (whether the respondent was covered by insurance or not). From the Diabetes dataset we selected: respondent sequence number (seqn) and whether a doctor had told the respondent they had diabetes. Lastly, the variables chosen from the Dietary Interviews - Total Nutrient Intakes (First and Second Day) were: Iron, Calcium, Zinc, Sodium, Vitamin E, Vitamin A, Alcohol, Vitamin C, Total fat, Dietary fiber, Total sugars, Carbohydrate, Energy, and Protein. Full data is available here.
Logistic Regression with Lasso Penalty
Diabetes is a binary variable. For binary classification problems, the linear logistic regression model is the one most often used. Although logistic regression often performs comparably to competing methods such as support vector machines and linear discriminant analysis, we choose it here for several notable advantages: it provides a direct estimate of the class probability; it tends to be more robust in the case \(k \gg n\) since it makes no assumptions about the distribution of the predictors; and it needs no tuning parameters. The logistic regression model expresses the class-conditional probability through a linear function of the predictors:
\[\log\left(\frac{p}{1-p}\right)=\beta_0+\sum_{j=1}^k{x_j\beta_j}\]
Logistic regression coefficients are typically estimated by maximizing the following binomial log-likelihood:
\[\max_{\beta}\sum_{i=1}^n{\{y_i(x_i^T\beta)-\log(1+\exp(x_i^T\beta))\}}\]
In order to select variables and avoid overfitting, we use logistic regression with a lasso penalty. The coefficients are then estimated by minimizing: \[\min_{\beta}\{-\frac{1}{n}\sum_{i=1}^n{\{y_i(x_i^T\beta)-\log(1+\exp(x_i^T\beta))\}}+P_\lambda(\beta)\},\quad P_\lambda(\beta)=\lambda\sum_{j=1}^k|\beta_j|\]
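As a minimal illustration of this estimator, the following sketch fits an L1-penalized logistic regression on simulated data with glmnet (the data and variable names here are invented for illustration):

library(glmnet)

# simulated illustration: 200 observations, 10 standardized predictors,
# only the first two truly related to the binary outcome
set.seed(1)
x <- matrix(rnorm(200 * 10), ncol = 10)
p <- plogis(1.5 * x[, 1] - 1 * x[, 2])
y <- rbinom(200, 1, p)

# alpha = 1 gives the lasso penalty; family = "binomial" gives the
# penalized logistic log-likelihood shown above
fit <- glmnet(x, y, family = "binomial", alpha = 1, lambda = 0.05)
coef(fit)  # most coefficients are shrunk exactly to zero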
ROC curve and AUC
Because the data are unbalanced, accuracy alone would be misleading, so we use AUC to judge our model.
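As a quick check of that claim (Data_final is the cleaned dataset constructed in the data-cleaning section below):

# share of each diabetes class in the cleaned data; the positive class
# is a small minority
prop.table(table(Data_final$diabetes))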
| | P (true) | N (true) |
|---|---|---|
| P' (predicted) | TP | FP |
| N' (predicted) | FN | TN |
The ROC curve plots the true positive rate, TPR = TP/(TP+FN), on the Y-axis against the false positive rate, FPR = FP/(FP+TN), on the X-axis.
AUC is the area under the ROC curve.
Several equivalent interpretations of AUC (a numerical check follows this list):
* The probability that a uniformly drawn random positive is ranked before a uniformly drawn random negative.
* The expected proportion of positives ranked before a uniformly drawn random negative.
* The expected true positive rate if the ranking is split just before a uniformly drawn random negative.
* The expected proportion of negatives ranked after a uniformly drawn random positive.
* 1 – the expected false positive rate if the ranking is split just after a uniformly drawn random positive.
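The first interpretation gives a direct way to compute AUC by hand. A small sketch with made-up scores, comparing the rank-based estimate against the ROCit package used later in our analysis:

library(ROCit)

# made-up scores and labels for illustration
set.seed(1)
y <- rbinom(100, 1, 0.2)                     # unbalanced classes
s <- rnorm(100, mean = ifelse(y == 1, 1, 0)) # positives score higher on average

# AUC as P(random positive ranked above random negative), ties counted 1/2
pos <- s[y == 1]
neg <- s[y == 0]
auc_rank <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

# agrees with the AUC reported by ROCit
auc_rocit <- rocit(score = s, class = y)$AUC
c(auc_rank, auc_rocit)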
# read.xport() from the foreign package reads the SAS transport (.XPT) files
library(foreign)
Demographic_data <- read.xport("C:\\Users\\95260\\Desktop\\stats506\\group_project\\DEMO_I.XPT")
Diabetes_data <- read.xport("C:\\Users\\95260\\Desktop\\stats506\\group_project\\DIQ_I.XPT")
Day1_data <- read.xport("C:\\Users\\95260\\Desktop\\stats506\\group_project\\DR1TOT_I.XPT")
Day2_data <- read.xport("C:\\Users\\95260\\Desktop\\stats506\\group_project\\DR2TOT_I.XPT")
Insurance_data <- read.xport("C:\\Users\\95260\\Desktop\\stats506\\group_project\\HIQ_I.XPT")
For the data cleaning, we use the dplyr package.
library(dplyr)

# Demographic data
New_Demo_dat <- Demographic_data %>%
select("id" = SEQN,
"gender" = RIAGENDR,
"age" = RIDAGEYR,
"income" = INDFMIN2) %>%
filter(!is.na(income) & income != 12 & income != 13
& income != 77 & income != 99) %>%
transmute(id = id,
gender = factor(gender, labels =c("male","female")),
age_group = factor(ifelse(age < 13, 1,
ifelse(age < 19,2,
ifelse(age < 41, 3,
ifelse(age < 60, 4, 5)))),
labels = c("younger", "teenager",
"adult", "seniors", "elder")),
income = factor(ifelse(income == 14 | income ==15, income - 2,
income)))
# Insurance data
# remove values other than 1 or 2 and recode insurance to 0/1
New_Insur_dat <- Insurance_data %>%
select("id" = SEQN, "insurance" = HIQ011) %>%
filter(insurance != 7 & insurance != 9) %>%
transmute(id = id,
insurance = ifelse(insurance == 1, 1, 0))
# diabetes data
# drop borderline (3) and don't know (9) responses, and change no (2) to 0
New_Diabetes_dat <- Diabetes_data %>%
select("id" = SEQN, "diabetes" = DIQ010)%>%
filter(!is.na(diabetes) & diabetes != 3 & diabetes != 9) %>%
transmute(id = id,
diabetes = ifelse(diabetes == 2, 0, 1))
# Day1 data
New_Day1_dat <- Day1_data %>%
select("id" = SEQN, "weight" = WTDRD1, "iron" = DR1TIRON,
"calcium" = DR1TCALC, "zinc" = DR1TZINC, "sodium" = DR1TSODI,
"VE" = DR1TATOC, "VA" = DR1TVARA, "VC" = DR1TVC,
"alcohol" = DR1TALCO, "fat" = DR1TTFAT, "fiber" = DR1TFIBE,
"sugar" = DR1TSUGR, "carbonhydrate" = DR1TCARB,
"energy" = DR1TKCAL, "protein" = DR1TPROT) %>%
filter(weight != 0) %>%
filter_at(vars(-c("id", "weight")), all_vars(!is.na(.))) %>%
# standardize for modeling (the visualization uses the un-standardized values)
mutate_at(vars(-c("id", "weight")), scale) %>%
mutate(survey_day = factor(1))
# Day2 data
New_Day2_dat <- Day2_data %>%
select("id" = SEQN, "weight" = WTDR2D, "iron" = DR2TIRON,
"calcium" = DR2TCALC, "zinc" = DR2TZINC, "sodium" = DR2TSODI,
"VE" = DR2TATOC, "VA" = DR2TVARA, "VC" = DR2TVC,
"alcohol" = DR2TALCO, "fat" = DR2TTFAT, "fiber" = DR2TFIBE,
"sugar" = DR2TSUGR, "carbonhydrate" = DR2TCARB,
"energy" = DR2TKCAL, "protein" = DR2TPROT) %>%
filter(weight != 0) %>%
filter_all(all_vars(!is.na(.))) %>%
# standardize for modeling (the visualization uses the un-standardized values)
mutate_at(vars(-c("id", "weight")), scale) %>%
mutate(survey_day = factor(2))
# Merge all dataset together
Data_final <- New_Day1_dat %>%
rbind(New_Day2_dat) %>%
inner_join(New_Demo_dat, by = "id") %>%
inner_join(New_Diabetes_dat, by = "id") %>%
inner_join(New_Insur_dat, by = "id")
dim(Data_final)
## [1] 13245 22
After the cleaning process, we have a dataset with 13245 observations and 22 variables.
| id | weight | iron | calcium | zinc | sodium | VE | VA | VC | alcohol |
|---|---|---|---|---|---|---|---|---|---|
| 83732 | 92670.70 | 0.3012482 | -0.50937280 | -0.2226749 | 1.17162515 | 0.4383692 | -0.49671953 | 0.3942093 | -0.2633655 |
| 83733 | 16454.11 | -0.3121057 | -0.56027338 | -0.8197414 | -0.01921412 | -0.6230568 | -0.23275429 | -0.5722687 | 3.8017474 |
| 83734 | 6529.62 | 1.5475834 | -0.07232987 | 0.5958102 | 1.37307453 | 0.6091925 | -0.04834021 | 0.1101634 | -0.2633655 |
To better understand the final data, we do some visualization: we plot the difference in average iron, zinc, and sodium intake between the diabetes and non-diabetes groups within each demographic subgroup. These plots show directly how micronutrient intake differs between the groups.
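The plots themselves were produced outside this listing; the following is a minimal ggplot2 sketch of one such comparison (iron), where viz_data is a hypothetical un-standardized version of Data_final:

library(dplyr)
library(ggplot2)

# mean iron intake by demographic subgroup and diabetes status;
# viz_data is assumed to hold the un-standardized dietary values
plot_dat <- viz_data %>%
  group_by(age_group, gender, insurance, diabetes) %>%
  summarize(mean_iron = mean(iron), .groups = "drop")

ggplot(plot_dat,
       aes(x = age_group, y = mean_iron,
           fill = factor(diabetes, labels = c("non-diabetes", "diabetes")))) +
  geom_col(position = "dodge") +
  facet_grid(gender ~ insurance) +
  labs(y = "Mean iron intake", fill = "Group")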
From the three graphs above, we can see that:
(1) For iron intake, there are some differences between the diabetes and non-diabetes groups. In particular, among male adults (19-40) without health insurance, the diabetes group clearly consumes more iron in their meals.
(2) For zinc intake, prior research suggests that zinc may have some protective effect against diabetes. Our plot is consistent with this: among male teenagers (13-18) with health insurance, the non-diabetes group tends to get more zinc in their meals.
(3) For sodium intake, higher salt consumption tends to be associated with blood pressure problems and may be related to diabetes. In our plot, among male teenagers with health insurance, those with diabetes take in more sodium than those without.
We now build the penalized logistic model for our question. We use cross-validation to choose the lambda parameter, parallel computing to improve efficiency, and the AUC value to evaluate model performance.
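One step is not shown in the cleaning code above: the coefficient names in the output below (e.g. iron_ins, iron_ogn) indicate that each dietary variable enters the model twice, once as the original intake and once interacted with insurance, with id, weight, and the raw insurance indicator excluded. A minimal sketch of that construction (the exact code we used may differ):

diet_vars <- c("iron", "calcium", "zinc", "sodium", "VE", "VA", "VC",
               "alcohol", "fat", "fiber", "sugar", "carbonhydrate",
               "energy", "protein")
# add an insurance-interacted copy of each dietary column (suffix _ins)
# and tag the originals with _ogn, matching the coefficient names below
for (v in diet_vars) {
  Data_final[[paste0(v, "_ins")]] <- Data_final[[v]] * Data_final$insurance
  names(Data_final)[names(Data_final) == v] <- paste0(v, "_ogn")
}
# id, weight, and the raw insurance indicator do not enter the model
Data_final <- Data_final[, setdiff(names(Data_final), c("id", "weight", "insurance"))]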
# set up the parallel backend used by %dopar% and cv.glmnet below
library(doParallel)
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)
# number of repeated train/test splits; we use 10, matching the ten
# splits used in the Stata analysis below
N_test <- 10
# get the number of obs
Nnum <- nrow(Data_final)
# the number of obs as test data
test_Num <- round(0.2 * Nnum)
# the number of obs as train and cv data
train_Num <- Nnum - test_Num
# build the model N_test times in order to avoid an overfitting problem
result = foreach(i = 1:N_test) %dopar% {
library(glmnet)
library(ROCit)
# divide data into training set and test set
set.seed(i)
test_id <- sample(1:Nnum, test_Num)
test_data <- Data_final[test_id,]
train_cv_data <- Data_final[-test_id,]
# divide the training set into predictors and response
x_train <- model.matrix(diabetes ~ ., train_cv_data)[, -1]
y_train <- train_cv_data$diabetes
# use cross validation method to find the optimal lambda parameter
cv.lasso <- cv.glmnet(x_train, y_train, alpha = 1, family = "binomial",
type.measure = "auc", parallel = TRUE)
# use the lambda that gives the best AUC value without inflating model
# complexity to build our model
model_small <- glmnet(x_train, y_train, alpha = 1, family = "binomial",
lambda = cv.lasso$lambda.min)
# store the optimal lambda value
lambda <- cv.lasso$lambda.min
# make prediction using test data
x_test <- model.matrix(diabetes ~ ., test_data)[, -1]
y_test <- test_data$diabetes
y_hat_small <- predict(model_small, newx = x_test, type = "response")
# evaluate model performance on test data using ROC curve and AUC value
ROC_obj_small <- rocit(score = as.vector(y_hat_small), class = y_test)
# store the AUC value and plot the ROC curve
# plot(ROC_obj_small)
AUC <- ROC_obj_small$AUC
c(lambda, AUC)
}
# building final model for all of our data
result <- matrix(unlist(result), ncol = N_test, nrow = 2)
x_all <- model.matrix(diabetes~., Data_final)[, -1]
y_all <- Data_final$diabetes
# choose the mean value of the optimal lambdas above as the final optimal lambda
# for our final model
lambda_final <- rowMeans(result)[1]
model <- glmnet(x_all, y_all, family = "binomial", alpha = 1, lambda = lambda_final)
# the coefficients of the final model
beta_all <- coef(model)
# estimation of the final model performance using the average AUC value above
AUC_final <- rowMeans(result)[2]
stopCluster(cl)
## Estimate model performance by AUC: 0.8441269
## 46 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) -4.881773527
## iron_ins 0.092347522
## calcium_ins .
## zinc_ins .
## sodium_ins 0.139805247
## VE_ins -0.033651538
## VA_ins .
## VC_ins -0.014845925
## alcohol_ins -0.020803170
## fat_ins 0.077537639
## fiber_ins .
## sugar_ins -0.118528228
## carbonhydrate_ins .
## energy_ins .
## protein_ins -0.003628424
## survey_day2 .
## genderfemale -0.353677328
## age_groupteenager -0.259302592
## age_groupadult 1.744122645
## age_groupseniors 3.464953312
## age_groupelder 4.179725214
## income2 0.201212414
## income3 0.233047713
## income4 -0.184400970
## income5 0.098262564
## income6 -0.187208226
## income7 -0.040121849
## income8 .
## income9 .
## income10 -0.260310836
## income12 -0.342530319
## income13 -0.525611229
## iron_ogn 0.012161561
## calcium_ogn -0.035544821
## zinc_ogn -0.054355918
## sodium_ogn 0.026982772
## VE_ogn .
## VA_ogn -0.027004841
## VC_ogn .
## alcohol_ogn -0.158321949
## fat_ogn 0.001802192
## fiber_ogn .
## sugar_ogn -0.219323669
## carbonhydrate_ogn -0.014614533
## energy_ogn .
## protein_ogn .
In the final model above, the lasso shrinks many coefficients exactly to zero and retains only the variables it selects as important.
From these coefficients we can see that:
(1) For the sugar predictor, the coefficients of both the original term and the insurance-interacted term are negative, meaning that a person with higher sugar intake may be less likely to be told that he/she has diabetes.
(2) For the age predictors, the coefficients for the older age groups are large and positive, meaning that an older person is more likely to be told that they have diabetes.
(3) For the income predictors, people with higher income tend to have a lower probability of diabetes. That makes sense, as wealthier people may pay more attention to their health status.
(4) Several insurance-interacted terms (iron, sodium, vitamin E, vitamin C, alcohol, fat, sugar, and protein) remain in the model, and they answer our main question:
People with health insurance are more likely to have a diabetes issue at the same levels of sodium, iron, and fat intake than those without insurance;
and they are less likely to have a diabetes issue at the same levels of vitamin C, vitamin E, sugar, alcohol, and protein intake than those without insurance.
#environment setup
import pandas as pd
#read DEMO_I.XPT
demo = pd.read_sas('/Users/weijiepan/Desktop/506_project/DEMO_I.XPT')
#select SEQN (id), RIAGENDR (gender), RIDAGEYR (age), INDFMIN2 (annual family income)
#and rename to lower-case names
demo = demo[["SEQN","RIAGENDR","RIDAGEYR","INDFMIN2"]]
demo = demo.rename(columns={"SEQN":"id","RIAGENDR": "gender", "RIDAGEYR": "age","INDFMIN2":"income"})
#income: remove codes 12, 13, 77, 99; recode 14 -> 12 and 15 -> 13
demo = demo[((demo.income>=0) & (demo.income<=11)) | (demo.income==14) | (demo.income==15)]
demo.loc[demo.income==14, "income"] = 12
demo.loc[demo.income==15, "income"] = 13
#age groups: 0~12 => 1 | 13~18 => 2 | 19~40 => 3 | 41~59 => 4 | 60~ => 5
demo.loc[(demo.age>=0) & (demo.age<=12), "age"] = 1
demo.loc[(demo.age>=13) & (demo.age<=18), "age"] = 2
demo.loc[(demo.age>=19) & (demo.age<=40), "age"] = 3
demo.loc[(demo.age>=41) & (demo.age<=59), "age"] = 4
demo.loc[demo.age>=60, "age"] = 5
#Diabetes
diabetes = pd.read_sas('/Users/weijiepan/Desktop/506_project/DIQ_I.XPT')
diabetes = diabetes[["SEQN","DIQ010"]]
diabetes = diabetes.rename(columns={"SEQN":"id","DIQ010":"diabetes"})
#recode diabetes: 1 -> yes, 2 -> no (0); remove other codes
diabetes = diabetes[(diabetes.diabetes==1) | (diabetes.diabetes==2)]
diabetes.loc[diabetes.diabetes==2, "diabetes"] = 0
#insurance
insurance = pd.read_sas('/Users/weijiepan/Desktop/506_project/HIQ_I.XPT')
insurance = insurance[["SEQN","HIQ011"]]
insurance = insurance.rename(columns={"SEQN":"id","HIQ011":"insurance"})
insurance = insurance[(insurance.insurance==1) | (insurance.insurance==2)]
insurance.loc[insurance.insurance==2, "insurance"] = 0
#day1: add survey-day column, standardize nutrients, keep only the day-1 weight (WTDRD1)
day1 = pd.read_sas('/Users/weijiepan/Desktop/506_project/DR1TOT_I.XPT')
day1 = day1[["SEQN","WTDRD1","DR1TIRON","DR1TCALC","DR1TZINC","DR1TSODI","DR1TATOC","DR1TVARA",
"DR1TALCO","DR1TVC","DR1TTFAT","DR1TFIBE","DR1TSUGR","DR1TCARB","DR1TKCAL","DR1TPROT"]]
day1 = day1.rename(columns={"SEQN":"id","WTDRD1":"weight",
"DR1TIRON":"iron","DR1TCALC":"calcium","DR1TZINC":"zinc","DR1TSODI":"sodium",
"DR1TATOC":"VE","DR1TVARA":"VA","DR1TALCO":"alcohol","DR1TVC":"VC",
"DR1TTFAT":"fat","DR1TFIBE":"dietary fiber","DR1TSUGR":"sugar",
"DR1TCARB":"carbohydrate","DR1TKCAL":"energy","DR1TPROT":"protein"})
#standardize the nutrient columns: (x - mean) / std
nutrient_cols = ["iron","calcium","zinc","sodium","VE","VA","alcohol","VC",
"fat","dietary fiber","sugar","carbohydrate","energy","protein"]
day1[nutrient_cols] = (day1[nutrient_cols] - day1[nutrient_cols].mean())/day1[nutrient_cols].std()
day1["surveyday"]=1
#day2: same processing, keeping the day-2 weight (WTDR2D)
day2 = pd.read_sas('/Users/weijiepan/Desktop/506_project/DR2TOT_I.XPT')
day2 = day2[["SEQN","WTDR2D","DR2TIRON","DR2TCALC","DR2TZINC","DR2TSODI","DR2TATOC","DR2TVARA",
"DR2TALCO","DR2TVC","DR2TTFAT","DR2TFIBE","DR2TSUGR","DR2TCARB","DR2TKCAL","DR2TPROT"]]
day2 = day2.rename(columns={"SEQN":"id","WTDR2D":"weight",
"DR2TIRON":"iron","DR2TCALC":"calcium","DR2TZINC":"zinc","DR2TSODI":"sodium",
"DR2TATOC":"VE","DR2TVARA":"VA","DR2TALCO":"alcohol","DR2TVC":"VC",
"DR2TTFAT":"fat","DR2TFIBE":"dietary fiber","DR2TSUGR":"sugar",
"DR2TCARB":"carbohydrate","DR2TKCAL":"energy","DR2TPROT":"protein"})
#standardize the nutrient columns: (x - mean) / std
nutrient_cols = ["iron","calcium","zinc","sodium","VE","VA","alcohol","VC",
"fat","dietary fiber","sugar","carbohydrate","energy","protein"]
day2[nutrient_cols] = (day2[nutrient_cols] - day2[nutrient_cols].mean())/day2[nutrient_cols].std()
day2["surveyday"]=2
#demo -> demo + insurance
demo = demo.merge(insurance,left_on="id",right_on="id")
#demo -> diabetes
demo_diabetes = demo.merge(diabetes,left_on="id",right_on="id")
#day1 -> demo_diabetes
day1_demo_diabetes = day1.merge(demo_diabetes,left_on="id",right_on="id")
#day2 -> demo_diabetes
day2_demo_diabetes = day2.merge(demo_diabetes,left_on="id",right_on="id")
#stack the day-1 and day-2 records
day12_demo_diabetes = pd.concat([day1_demo_diabetes,day2_demo_diabetes])
#drop rows with missing values and zero dietary weights
day12_demo_diabetes = day12_demo_diabetes.dropna(axis=0,how='any')
day12_demo_diabetes = day12_demo_diabetes[day12_demo_diabetes.weight!=0]
#after discussion we decided to no longer use the weight, so drop it
day12_demo_diabetes = day12_demo_diabetes.drop(["weight"],axis=1)
As we can see from the pictures above, the intakes of vitamins A, C, and E are almost the same in most groups, except for the (teenage 13-18, male, with_ins) and (teenage 13-18, female, with_ins) groups. It appears that vitamins A, C, and E do NOT play important roles in this model. Teenagers with diabetes cannot eat much sugar, which pushes them to eat more fruits and vegetables.
import pandas as pd
#model part
#read the data
data = pd.read_csv("/Users/weijiepan/Desktop/506_project/506_project_data.csv")
data = data.drop(["Unnamed: 0"],axis=1)
#create insurance-interaction terms for each nutrient
colnames=["iron","calcium","zinc","sodium","VE","VA","alcohol","VC",
"fat","dietary fiber","sugar","carbohydrate","energy","protein"]
for i in colnames:
data[i+"_ins"] = data[i]*data["insurance"]
#divide into X (predictors) and y (response)
X = data.drop(["id","diabetes","insurance"],axis=1)
y = data["diabetes"]
#deal with categorical features
#age, gender, income
gender = pd.get_dummies(X['gender'],drop_first=True)
gender.columns=["gender"+str(i) for i in gender.columns]
age = pd.get_dummies(X['age'],drop_first=True)
age.columns=["age"+str(i) for i in age.columns]
income = pd.get_dummies(X['income'],drop_first=True)
income.columns=["income"+str(i) for i in income.columns]
X.drop(['gender','age','income'],axis=1,inplace=True)
X = pd.concat([X,gender,age,income],axis=1)
#split it into training set and test set
from sklearn import model_selection
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y,
test_size=0.2,
random_state=42,
stratify=y)
#build the model: L1-penalized logistic regression with 10-fold cross-validation
from sklearn.metrics import roc_auc_score
from sklearn.linear_model import LogisticRegressionCV
model = LogisticRegressionCV(penalty='l1',scoring='roc_auc',solver="saga",
cv=10,class_weight={1:.9,0:.1},max_iter=1000)
model.fit(X_train,y_train)
import matplotlib.pyplot as plt
#display the coef (sorted)
coef = pd.Series(model.coef_[0,:], index = X_train.columns)
plt.rcParams['figure.figsize'] = (8.0, 10.0)
coef.sort_values().plot(kind="barh")
from sklearn.metrics import roc_curve
#plot ROC curves and compute AUC
ns_probs = [0 for _ in range(len(y_test))]
# predict probabilities
lr_probs = model.predict_proba(X_test)
# keep probabilities for the positive outcome only
lr_probs = lr_probs[:, 1]
# calculate scores
ns_auc = roc_auc_score(y_test, ns_probs)
lr_auc = roc_auc_score(y_test, lr_probs)
# summarize scores
print('No Skill: ROC AUC=%.3f' % (ns_auc))
print('Logistic: ROC AUC=%.3f' % (lr_auc))
## [1] "No Skill: ROC AUC=0.500"
## [1] "Logistic: ROC AUC=0.849"
# calculate roc curves
ns_fpr, ns_tpr, _ = roc_curve(y_test, ns_probs)
lr_fpr, lr_tpr, _ = roc_curve(y_test, lr_probs)
# plot the roc curve for the model
plt.plot(ns_fpr, ns_tpr, linestyle='--', label='No Skill')
plt.plot(lr_fpr, lr_tpr, marker='.', label='Logistic')
# axis labels
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# show the legend
plt.legend()
# show the plot
plt.show()
According to the final model above, the lasso keeps 25 variables as important.
As we can see from the coef_sorted picture:
(1) Age plays an important role in the model: as people become older, they are at greater risk of diabetes. What is interesting is that income plays a tricky role: at first, as income increases, the risk of diabetes rises, but as income increases further, the risk falls.
(2) For the sugar predictor, the coefficients of both the original term and the insurance-interacted term are negative, meaning that a person with higher sugar intake may be less likely to be told that he/she has diabetes.
(3) Among the interaction terms, sodium, iron, and fat are positive, while VC, alcohol, and VE are negative.
People with health insurance are more likely to have a diabetes issue at the same levels of sodium, iron, and fat intake than those without insurance;
and they are less likely to have a diabetes issue at the same levels of VC, VE, and alcohol intake than those without insurance.
The data were merged and cleaned, and the dietary (food) variables were standardized. A second insurance indicator was added to capture the opposite case, and "original" and "alternative" interaction variables were created for each nutrient to represent consumption with and without insurance, respectively.
// Load Demographic Data
fdause https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DEMO_I.XPT, clear
// Keep relevant variables
keep seqn riagendr ridageyr indfmin2
// Rename variables
rename seqn id
rename riagendr gender
rename ridageyr age
rename indfmin2 income
// Drop Missing Observations
drop if id == .
drop if gender == .
drop if age == .
drop if income == .
// Gender kept as coded: 1 - Male, 2 - Female
// (recode to 0/1 not applied; see the labels in the visualization section)
*replace gender = 0 if gender == 2
// Recode age into groups
// 1 - 0-12
// 2 - 13-18
// 3 - 19-40
// 4 - 41-59
// 5 - 60+
replace age = 1 if inrange(age,0,12)
replace age = 2 if inrange(age,13,18)
replace age = 3 if inrange(age,19,40)
replace age = 4 if inrange(age,41,59)
replace age = 5 if age >= 60
// Drop some of the income variables
drop if income == 12
drop if income == 13
drop if income == 77
drop if income == 99
// Recode income variables
replace income = 12 if income == 14
replace income = 13 if income == 15
// Save demographic data
save demo_data, replace
* -----------------------------------------------------------------------------
// Load Health Insurance Data
fdause https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/HIQ_I.XPT, clear
// Keep relevant variables
keep seqn hiq011
// Rename variables
rename seqn id
rename hiq011 insurance1
// Recode insurance1 variable
// 0 - No Insurance
// 1 - Insurance
replace insurance1 = 0 if insurance1 == 2
// Drop refused, dont know, and missing data from insurance
keep if insurance1 == 0 | insurance1 == 1
// Generate second insurance variable (1 = no insurance)
generate insurance2 = 1 - insurance1
// Save Insurance Data
save ins_data, replace
// Merge Demographic and Health Insurance Data
merge 1:1 id using demo_data
// Keep only matching/merged observations
keep if _merge == 3
drop _merge
save main_data, replace
* -----------------------------------------------------------------------------
// Load Diabetes Data
fdause https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DIQ_I.XPT, clear
// Keep relevant variables
keep seqn diq010
// Rename variables
rename seqn id
rename diq010 diabetes
// Re-Code diabetes variable
// 0 - No diabetes
// 1 - Diabetes
replace diabetes = 0 if diabetes == 2
// Drop observations other than yes and no
keep if diabetes == 0 | diabetes == 1
// Save diabetes data
save diabetes_data, replace
// Merge with main data set
merge 1:1 id using main_data
// Keep only matching/merged observations
keep if _merge == 3
drop _merge
save main_data, replace
* -----------------------------------------------------------------------------
// Load Total Foods Data Day 1
fdause https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR1TOT_I.XPT, clear
// Keep relevant variables
keep seqn wtdrd1 wtdr2d dr1tiron dr1tcalc dr1tzinc dr1tsodi dr1tatoc dr1tvara dr1talco dr1tvc dr1ttfat dr1tfibe dr1tsugr dr1tcarb dr1tkcal dr1tprot
// Rename variables
rename seqn id
rename wtdrd1 weight1
rename wtdr2d weight2
rename dr1tiron iron
rename dr1tcalc calcium
rename dr1tzinc zinc
rename dr1tsodi sodium
rename dr1tatoc ve
rename dr1tvara va
rename dr1talco alcohol
rename dr1tvc vc
rename dr1ttfat fat
rename dr1tfibe fiber
rename dr1tsugr sugars
rename dr1tcarb carbohydrate
rename dr1tkcal energy
rename dr1tprot protein
// Drop observations with missing data
drop if id == .
drop if energy == .
drop if protein == .
drop if carbohydrate == .
drop if sugars == .
drop if fiber == .
drop if fat == .
drop if ve == .
drop if va == .
drop if vc == .
drop if calcium == .
drop if iron == .
drop if zinc == .
drop if sodium == .
drop if alcohol == .
// Generate day of survey
generate survey_day = 1
// Standardize Food Data
egen std_energy = std(energy)
egen std_protein = std(protein)
egen std_carbohydrate = std(carbohydrate)
egen std_sugars = std(sugars)
egen std_fiber = std(fiber)
egen std_fat = std(fat)
egen std_ve = std(ve)
egen std_va = std(va)
egen std_vc = std(vc)
egen std_calcium = std(calcium)
egen std_iron = std(iron)
egen std_zinc = std(zinc)
egen std_sodium = std(sodium)
egen std_alcohol = std(alcohol)
// Replace Food variables with their standardized versions
replace energy = std_energy
replace protein = std_protein
replace carbohydrate = std_carbohydrate
replace sugars = std_sugars
replace fiber = std_fiber
replace fat = std_fat
replace ve = std_ve
replace va = std_va
replace vc = std_vc
replace calcium = std_calcium
replace iron = std_iron
replace zinc = std_zinc
replace sodium = std_sodium
replace alcohol = std_alcohol
drop std*
// Recreate weight variable
replace weight1 = weight2 if survey_day == 2
rename weight1 weight
drop weight2
drop if weight == .
// Arrange data
order id survey_day
// Save day 1 total data
save day1_total, replace
* -----------------------------------------------------------------------------
// Load Total Foods Data Day 2
fdause https://wwwn.cdc.gov/Nchs/Nhanes/2015-2016/DR2TOT_I.XPT, clear
// Keep relevant variables
keep seqn wtdrd1 wtdr2d dr2tiron dr2tcalc dr2tzinc dr2tsodi dr2tatoc dr2tvara dr2talco dr2tvc dr2ttfat dr2tfibe dr2tsugr dr2tcarb dr2tkcal dr2tprot
// Rename variables
rename seqn id
rename wtdrd1 weight1
rename wtdr2d weight2
rename dr2tiron iron
rename dr2tcalc calcium
rename dr2tzinc zinc
rename dr2tsodi sodium
rename dr2tatoc ve
rename dr2tvara va
rename dr2talco alcohol
rename dr2tvc vc
rename dr2ttfat fat
rename dr2tfibe fiber
rename dr2tsugr sugars
rename dr2tcarb carbohydrate
rename dr2tkcal energy
rename dr2tprot protein
// Drop observations with missing data
drop if id == .
drop if energy == .
drop if protein == .
drop if carbohydrate == .
drop if sugars == .
drop if fiber == .
drop if fat == .
drop if ve == .
drop if va == .
drop if vc == .
drop if calcium == .
drop if iron == .
drop if zinc == .
drop if sodium == .
drop if alcohol == .
// Generate survey day
generate survey_day = 2
// Standardize food variables
egen std_energy = std(energy)
egen std_protein = std(protein)
egen std_carbohydrate = std(carbohydrate)
egen std_sugars = std(sugars)
egen std_fiber = std(fiber)
egen std_fat = std(fat)
egen std_ve = std(ve)
egen std_va = std(va)
egen std_vc = std(vc)
egen std_calcium = std(calcium)
egen std_iron = std(iron)
egen std_zinc = std(zinc)
egen std_sodium = std(sodium)
egen std_alcohol = std(alcohol)
// Replace Food variables with their standardized versions
replace energy = std_energy
replace protein = std_protein
replace carbohydrate = std_carbohydrate
replace sugars = std_sugars
replace fiber = std_fiber
replace fat = std_fat
replace ve = std_ve
replace va = std_va
replace vc = std_vc
replace calcium = std_calcium
replace iron = std_iron
replace zinc = std_zinc
replace sodium = std_sodium
replace alcohol = std_alcohol
drop std*
// Recreate weight variable
replace weight1 = weight2 if survey_day == 2
rename weight1 weight
drop weight2
drop if weight == .
// Order data
order id survey_day
// Save day 2 data
save day2_total, replace
// Append day 1 and day 2 data
append using day1_total
// Sort data by id
sort id survey_day
// Save data
save day1_day2_data, replace
// Merge day 1 and day 2 data with main data
merge m:1 id using main_data
// Keep only matching/merged observations
keep if _merge == 3
drop _merge
// Order Data
order id survey_day weight diabetes insurance1 insurance2 gender age income
save main_data, replace
export delimited main_data_csv_eric_stata.csv, replace
* -----------------------------------------------------------------------------
// Create insurance and no insurance variables
gen energy_org = energy*insurance1
gen energy_alt = energy*insurance2
gen protein_org = protein*insurance1
gen protein_alt = protein*insurance2
gen carbohydrate_org = carbohydrate*insurance1
gen carbohydrate_alt = carbohydrate*insurance2
gen sugars_org = sugars*insurance1
gen sugars_alt = sugars*insurance2
gen fiber_org = fiber*insurance1
gen fiber_alt = fiber*insurance2
gen fat_org = fat*insurance1
gen fat_alt = fat*insurance2
gen ve_org = ve*insurance1
gen ve_alt = ve*insurance2
gen va_org = va*insurance1
gen va_alt = va*insurance2
gen vc_org = vc*insurance1
gen vc_alt = vc*insurance2
gen calcium_org = calcium*insurance1
gen calcium_alt = calcium*insurance2
gen iron_org = iron*insurance1
gen iron_alt = iron*insurance2
gen zinc_org = zinc*insurance1
gen zinc_alt = zinc*insurance2
gen sodium_org = sodium*insurance1
gen sodium_alt = sodium*insurance2
gen alcohol_org = alcohol*insurance1
gen alcohol_alt = alcohol*insurance2
// Drop variables that are not needed anymore
drop weight energy protein carbohydrate sugars fiber fat ve va vc calcium iron zinc sodium alcohol
drop insurance1 insurance2
// Main data modified with interaction terms and some dropped variables
save main_data_modified, replace
export delimited main_data_modified.csv, replace
// List first three rows of data
list if _n <= 3
To perform data visualization, people with diabetes and without diabetes were examined by group according to age, gender, and insurance status. To assess differences in eating habits for selected nutrients (fat, carbohydrates, and sugar), the mean consumption of those nutrients was calculated by group.
// Plots for the data visualization of nutrients per group
import delimited nutrients_differences.csv, clear
label define gender_label 1 "Male" 2 "Female"
label define insurance1_label 1 "Insurance" 0 "No Insurance"
label define age_label 1 "0-12" 2 "13-18" 3 "19-40" 4 "41-59" 5 "60+"
label values gender gender_label
label values insurance1 insurance1_label
label values age age_label
* Graph fat consumption by groups
graph hbar fat_nd fat_d, by(age gender insurance1) bargap(-50) legend(label(1 "Mean Fat (ND)") label(2 "Mean Fat (D)"))
* Graph sugar consumption by groups
graph hbar sugar_nd sugar_d, by(age gender insurance1) bargap(-50) legend(label(1 "Mean Sugar (ND)") label(2 "Mean Sugar (D)"))
* Graph carbohydrate consumption by groups
graph hbar carbohydrate_nd carbohydrate_d, by(age gender insurance1) bargap(-50) legend(label(1 "Mean Carbohydrate (ND)") label(2 "Mean Carbohydrate (D)"))
* Graph all nutrient consumption in one by groups
graph hbar fat_nd fat_d sugar_nd sugar_d carbohydrate_nd carbohydrate_d, by(age gender insurance1) legend(label(1 "Mean Fat (ND)") label(2 "Mean Fat (D)") label(3 "Mean Sugar (ND)") label(4 "Mean Sugar (D)") label(5 "Mean Carbohydrate (ND)") label(6 "Mean Carbohydrate (D)")) bar(1, color(ltblue)) bar(2, color(navy)) bar(3, color(sand)) bar(4, color(sandb)) bar(5, color(eltgreen)) bar(6, color(dkgreen))
Figure 1: Data visualization for the mean consumption of carbohydrate, for successfully merged groups of people who had diabetes and those who did not, by Age, Gender, and Insurance.
Figure 2: Data visualization for the mean consumption of fat, for successfully merged groups of people who had diabetes and those who did not, by Age, Gender, and Insurance.
Figure 3: Data visualization for the mean consumption of sugar, for successfully merged groups of people who had diabetes and those who did not, by Age, Gender, and Insurance.
Figure 4: Data visualization for the mean consumption of fat, sugar, and carbohydrates, for successfully merged groups of people who had diabetes and those who did not, by Age, Gender, and Insurance.
The dataset is divided into ten training sets and ten test sets. LASSO 10-fold cross-validation is then performed on each training set to obtain the optimal lambdas; the lambda chosen is the one that minimizes the loss measure.
* Import Main Data Set
use main_data_modified.dta, clear
* Create 10 Training and 10 Test Data Sets
* Build Training Data - 80 % of Main Data Set
* Build Test Data - 20 % of Main Data Set
* Key:
* Training Data if "training" == 1
* Test Data if "training" == 0
forval v = 1/10 {
set seed 11`v'
gen random`v' = runiform()
sort random`v'
gen byte training`v' = _n <= _N*0.8
}
drop random*
save training_test_data, replace
* Use 10-Fold Cross-Validation with Logistic Regression on Training Data
* to obtain optimal lambda
* NOTE: install "lassopack", "pdslasso", and "cvauroc"
ssc install pdslasso
ssc install lassopack
ssc install cvauroc
forval v = 1/10 {
cvlassologit diabetes gender-alcohol_alt if training`v' == 1, nfolds(10) seed(11`v')
}
* Lambda Values (that minimize loss measure):
* 8.6141729
* 15.424845
* 13.24682
* 13.310124
* 17.646587
* 17.67645
* 18.184071
* 11.675487
* 27.217884
* 15.427758
Now we use those lambdas to run predictions and obtain the ROC area, evaluating each model's performance with the ten selected lambdas.
* Predict on test data using each lambda and calculate ROC for each
lassologit diabetes gender age income energy_org-alcohol_alt if training1 == 1, lambda(8.6141729)
predict pdiabetes1 if training1 == 0, pr
roctab diabetes pdiabetes1, graph summary
* ROC Area = 0.8482
lassologit diabetes gender age income energy_org-alcohol_alt if training2 == 1, lambda(15.424845)
predict pdiabetes2 if training2 == 0, pr
roctab diabetes pdiabetes2, graph summary
* ROC Area = 0.8396
lassologit diabetes gender age income energy_org-alcohol_alt if training3 == 1, lambda(13.24682)
predict pdiabetes3 if training3 == 0, pr
roctab diabetes pdiabetes3, graph summary
* ROC Area = 0.8545
lassologit diabetes gender age income energy_org-alcohol_alt if training4 == 1, lambda(13.310124)
predict pdiabetes4 if training4 == 0, pr
roctab diabetes pdiabetes4, graph summary
* ROC Area = 0.8495
lassologit diabetes gender age income energy_org-alcohol_alt if training5 == 1, lambda(17.646587)
predict pdiabetes5 if training5 == 0, pr
roctab diabetes pdiabetes5, graph summary
* ROC Area = 0.8546
lassologit diabetes gender age income energy_org-alcohol_alt if training6 == 1, lambda(17.67645)
predict pdiabetes6 if training6 == 0, pr
roctab diabetes pdiabetes6, graph summary
* ROC Area = 0.8551
lassologit diabetes gender age income energy_org-alcohol_alt if training7 == 1, lambda(18.184071)
predict pdiabetes7 if training7 == 0, pr
roctab diabetes pdiabetes7, graph summary
* ROC Area = 0.8394
lassologit diabetes gender age income energy_org-alcohol_alt if training8 == 1, lambda(11.675487)
predict pdiabetes8 if training8 == 0, pr
roctab diabetes pdiabetes8, graph summary
* ROC Area = 0.8462
lassologit diabetes gender age income energy_org-alcohol_alt if training9 == 1, lambda(27.217884)
predict pdiabetes9 if training9 == 0, pr
roctab diabetes pdiabetes9, graph summary
* ROC Area = 0.8513
lassologit diabetes gender age income energy_org-alcohol_alt if training10 == 1, lambda(15.427758)
predict pdiabetes10 if training10 == 0, pr
roctab diabetes pdiabetes10, graph summary
* ROC Area = 0.8534
Figure 5: Summary for ROC curve for model with lambda = 8.6141729
Figure 6: Summary for ROC curve for model with lambda = 15.424845
Figure 7: Summary for ROC curve for model with lambda = 13.24682
Figure 8: Summary for ROC curve for model with lambda = 13.310124
Figure 9: Summary for ROC curve for model with lambda = 17.646587
Figure 10: Summary for ROC curve for model with lambda = 17.67645
Figure 11: Summary for ROC curve for model with lambda = 18.184071
Figure 12: Summary for ROC curve for model with lambda = 11.675487
Figure 13: Summary for ROC curve for model with lambda = 27.217884
Figure 14: Summary for ROC curve for model with lambda = 15.427758
Taking the average of the ten optimal lambdas, we obtain 15.84241989, which we use for the final model. We then obtain the coefficients of the final model.
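Explicitly, this is just the arithmetic mean of the ten cross-validated lambdas listed above:
\[\bar{\lambda}=\frac{1}{10}(8.6141729+15.424845+13.24682+13.310124+17.646587+17.67645+18.184071+11.675487+27.217884+15.427758)=15.84241989\]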
* Fit final model with average lambda
* And calculate ROC to judge performance of model
* Average lambda = 15.84241989
* Average ROC Area = .76423
lassologit diabetes gender age income energy_org-alcohol_alt, lambda(15.84241989)
predict pdiabetesfinal, pr
roctab diabetes pdiabetesfinal, graph summary
Figure 15: Coefficients for Final Model
Figure 16: Summary for ROC curve for Final Model
Figure 17: ROC Curve for Final Model
The LASSO cross-validation selected ten lambdas, which were averaged to obtain the lambda used for the final model. Using LASSO logistic regression with a lambda of 15.84241989, we shrank the set of variables. The variables selected were: gender, age, income, original sugars, alternative sugars, original fat, original vitamin E, alternative vitamin A, original vitamin C, original calcium, alternative calcium, original iron, original zinc, alternative zinc, original sodium, original alcohol, and alternative alcohol. The area under the ROC curve was 0.8508, which tells us the model performs well. We see that eating habits do in fact differ. The original sugars variable has a negative coefficient, so among people with insurance, higher sugar consumption is associated with a lower chance of being told they have diabetes. For the fat variable, more fat consumption with insurance is associated with a higher chance of a doctor telling an individual that they have diabetes. In contrast, for vitamin E consumption with insurance, the chance of being told they have diabetes becomes smaller with higher consumption. Similar interpretations apply to the remaining variables. The demographic variables are also informative: for example, people with higher income are less likely to be told by their doctors that they have diabetes.
People with health insurance are less likely to have diabetes at the same amounts of sugar, vitamin A, calcium, zinc, and alcohol consumption than those who do not have insurance.
Common conclusions from the three final models we built:
(1) The final AUC values from the three models are all approximately 0.85, indicating that all of them perform well.
(2) The age predictors play an important role in all of our models: as people become older, they are at greater risk of having diabetes.
(3) The coefficients of both the original sugar predictors and the interacted ones are negative, meaning that people who take in a lot of sugar in their meals may be less likely to have diabetes.
(4) People at higher income levels tend not to have diabetes problems.
(5) For the effect of eating habits on people with or without health insurance:
People with health insurance are more likely to have a diabetes issue at the same levels of sodium, iron, and fat intake than those without insurance;
and they are less likely to have diabetes at the same levels of sugar, vitamin A, vitamin E, and alcohol consumption than those without insurance.
When picking the optimal lambda during cross-validation, the programming done in Python and R used the AUC, whereas the programming done in Stata used the lambda that minimizes the loss measure; this was due to a limitation encountered in Stata.
Methods:
Zhu, Y., Tan, T.L., Cheang, W.K. (2017). Penalized logistic regression for classification and feature selection with its application to detection of two official species of Ganoderma. Chemometrics and Intelligent Laboratory Systems, 171, 55-64. ISSN 0169-7439. https://doi.org/10.1016/j.chemolab.2017.09.019
STATA:
Luque-Fernandez, M.A., Redondo-Sanchez, D., Maringe, C. (2019). cvauroc: Cross-validated Area Under the Curve. GitHub repository, https://github.com/migariane/cvauroc
Ahrens, A., Hansen, C.B., Schaffer, M.E. (2019). lassologit: Stata module for logistic lasso regression. http://ideas.repec.org/c/boc/bocode/XXXXX.html