Group Project

Overview

The group project will be completed in groups of three. Groups have been assigned randomly and will be posted to Canvas. There may be 1-2 groups of four.

Each group will propose a substantive question that can be answered using NHANES data and an analysis plan for answering that question.

Analytical tools requirement

Groups will then carry out the core analysis using three distinct “tools” (or four, one per group member). Those “tools” can be either different statistical software packages (R, Stata, SAS, Python) or, when appropriate, different tools (e.g. different R packages) within a single software package. You may choose the distinct analytic tools subject to three criteria: (1) at least one analysis must be in R; (2) at least one analysis must be in a software package other than R; and (3) you must justify (to me) how analyses within the same software differ from one another.

As an example of how to meet the criteria, consider a project proposal in which your question is the same as that posed in question 3 of problem set 4. You could propose to answer that question using R, Stata, and SAS (or R, Python, XX). You could also propose to answer the question using Stata and then provide two solutions in R, one using dplyr for data manipulation and lme4 to fit the model and the other using the data.table and nlme packages.

Question and analysis scope

Your question should be of a scope similar to that of a typical homework question. Moreover, it should be a substantive inferential question rather than a narrow question about the data itself.

For instance, using the example from problem set 4,

Are people in the US more likely to drink water on a weekday than a weekend day?

is a better question than,

What fraction of people reported drinking water on weekends and weekdays in the 2006-2008 NHANES sample?

Here are some rules of thumb to gauge whether a question is of sufficient scope:

  1. It requires three or more datasets to answer, e.g. not just demographics and one other dataset.

  2. It requires only two datasets, but involves substantial cleaning, reshaping, etc.

  3. It uses “advanced” analytic techniques (e.g. margins, model checking, elastic net).
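As a rough illustration of what combining several datasets involves, the sketch below joins three toy tables on a shared respondent identifier. It assumes the NHANES convention of identifying respondents by SEQN; the other variable names, the values, and the helper inner_join are made up for illustration, and a real analysis would more likely use a data-frame library.

```python
# Toy stand-ins for three NHANES files, each keyed by respondent ID (SEQN).
# All values and non-key variable names are made up for illustration.
demographics = [
    {"SEQN": 1, "age": 34},
    {"SEQN": 2, "age": 51},
    {"SEQN": 3, "age": 27},
]
diet = [
    {"SEQN": 1, "water_g": 950.0},
    {"SEQN": 3, "water_g": 0.0},
]
exam = [
    {"SEQN": 1, "sbp": 118},
    {"SEQN": 2, "sbp": 131},
    {"SEQN": 3, "sbp": 109},
]


def inner_join(left, right, key="SEQN"):
    """Keep rows whose key appears in both tables, combining their columns.

    Assumes at most one row per key in `right` (hypothetical helper).
    """
    right_by_key = {row[key]: row for row in right}
    return [
        {**row, **right_by_key[row[key]]}
        for row in left
        if row[key] in right_by_key
    ]


# Respondent 2 has no dietary record, so the inner join drops that row.
merged = inner_join(inner_join(demographics, diet), exam)
for row in merged:
    print(row)
```

Deciding how to handle respondents who are missing from one file (dropped here) is part of the cleaning work the rules of thumb refer to.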

To get an idea of scope, you can compare to completed projects from the last two years:

Those projects are slightly different in that they are tutorials on a technique with an analysis as an example, rather than primarily focused on the analysis.

Project Proposal

Each group should write a short proposal containing:

  • your substantive question,
  • the specific datasets and variables to be used in answering that question,
  • the analytic or modeling techniques that will be used to answer the question,
  • the software/tools to be used for the parallel examples.

Each member of the group should assume primary responsibility for one software/tool, and these responsibilities should be included in your proposal.

You may use a language (e.g. Python, Matlab) not taught in this course for one or more analyses as long as I approve it in your proposal.

A group liaison should submit the group’s proposal to me via email with the subject header “Stats 506 Group Project Proposal”.

Guidelines

Introduction

Your introduction and overview should be approximately 2-3 paragraphs explaining:

  • what your question is, why it is interesting, and what your analysis will show

Data

  • a description of the specific datasets and variables used

Methods

  • a description of the analysis done
  • the languages/tools you used
  • any information needed to make the core analyses parallel

Core Analysis

All three analyses should follow a common outline to the extent possible. Deviations should be due to limitations or stylistic differences in the languages chosen rather than lack of coordination among group members. Where deviations do occur, please explain and justify them in the methods above or the discussion below.

Use tabbed sections to include code and make it easy to compare the analyses.

Additional Analysis

It is permissible for one or more analyses to extend beyond the scope of others provided that all three analyses share a common core set of tasks. For instance, you may wish to provide plots illustrating your data or fitted models – these could be done in a single language only and included in the “results” below.

Results

What is the answer to your question? What did you learn about the data?

Discussion (Optional)

You can include a section comparing and contrasting the tools used if needed to understand the similarities or differences in the separate analyses.

Git

Groups should use git to coordinate their work. Each member of the group should create an account at github.com. One group member should create a public repository for the project with others submitting pull requests to them. Your git repo is considered part of the final submission and should include at minimum:

  • source files for your tutorial including all core analyses
    • one or more scripts for each core analysis
    • an Rmd file to create the final submission
    • an html page or pdf document with the final submission
  • a readme with a short description of the project
  • a minimum of two commits per team member
  • some evidence of code reviews from each group member reviewing each other member’s work (commits) one or more times.
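One way a group might check the minimum of two commits per team member before submitting is to tally commits by author. The sketch below parses text in the format printed by `git shortlog -sn`; the sample text, the member names, and both helper functions are hypothetical, and matching by name assumes each member commits under a consistent git user name.

```python
def commits_per_author(shortlog_text):
    """Parse `git shortlog -sn` style lines ("  count<TAB>name") into a dict."""
    counts = {}
    for line in shortlog_text.splitlines():
        line = line.strip()
        if not line:
            continue
        count, name = line.split(None, 1)
        counts[name] = int(count)
    return counts


def below_minimum(counts, members, minimum=2):
    """Return members with fewer than `minimum` commits (absent means zero)."""
    return [m for m in members if counts.get(m, 0) < minimum]


# Made-up sample output. In a real repo the text could be captured with, e.g.,
# subprocess.run(["git", "shortlog", "-sn", "HEAD"],
#                capture_output=True, text=True).stdout
sample = "     5\tAlex Example\n     2\tBlake Example\n     1\tCasey Example\n"
members = ["Alex Example", "Blake Example", "Casey Example"]

counts = commits_per_author(sample)
print(below_minimum(counts, members))
```

A commit count says nothing about the code review requirement, which still needs to be visible in the pull requests themselves.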

Excluding extraordinary circumstances, all group members will receive the same grade. However, I reserve the right to modify this policy in cases where one or more group members clearly put in less effort than the others.

Timeline

  1. Proposals Due: Tuesday November 26 by 5pm. Your group must have your proposal approved by me prior to this time. Groups are required to select unique questions so it is to your benefit to submit early.

  2. Draft Due Date: Thursday December 5 by 5pm. Drafts should at a minimum have an outline of the entire project, and preferably be mostly complete with a concise to-do list of outstanding items. Submit drafts to Canvas as a link to the official git repo. Each member should submit a link and reference the files which they worked on.

  3. Peer Review Due: Monday December 9 by 5pm. You will be asked to provide constructive feedback to another group following the peer review guidelines. Additional guidelines for how to structure this feedback will be provided.

  4. Final Due Date: Wednesday December 11 by midnight. This is the deadline to submit the final version of your tutorial. Please submit a link to the official git repo and the final html document. Please make edits in response to peer feedback.

Approved Group Proposals

I will post group proposals here as they are approved. Two groups may not choose the same or closely related topics.

I plan to post your final analyses to this page – you may include or omit your name from the author information at your own discretion. If omitting your name from the html document, also include a second copy of your scripts with "_anon" appended to the file name and your name scrubbed from the header.
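A minimal sketch of the renaming and scrubbing step follows. It assumes that "_anon" goes before the file extension and that replacing the name with a placeholder counts as scrubbing; the function names, the example file name, and the "Anonymous" placeholder are all hypothetical.

```python
from pathlib import Path


def anon_name(filename):
    """Insert "_anon" between the file stem and its extension."""
    path = Path(filename)
    return str(path.with_name(path.stem + "_anon" + path.suffix))


def scrub_author(text, author, placeholder="Anonymous"):
    """Replace every occurrence of the author's name with a placeholder."""
    return text.replace(author, placeholder)


# Made-up script header for illustration.
header = "# Stats 506 group project\n# Author: Alex Example\n"

print(anon_name("core_analysis.R"))
print(scrub_author(header, "Alex Example"))
```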

  • Group 1, repo

    Languages: R, Python, Stata

    Question: Do relationships between eating habits and diabetes status differ based on health insurance status?

  • Group 2, repo

    Languages: R (data.table / glmnet), R (dplyr / glmnet), Stata

    Question: Which predictor variables for blood pressure differ the most between males and females?

  • Group 3, repo

    Languages: R, Python, SAS

    Question: In the U.S., do people diagnosed with diabetes consume fewer calories?

  • Group 4, repo

    Languages: Python, R, Stata

    Question: What factors are associated with levels of low-density lipoprotein (LDL) cholesterol?

  • Group 5, repo

    Languages: Python, R (lm / data.table), R (GLS / dplyr)

    Question: Does working overtime predict abnormal blood pressure?

  • Group 6, repo

    Languages: R, Stata

    Question: Do people with higher carbohydrate intake feel more sleepy during the day?

  • Group 7, repo

    Languages: Python, R, SAS

    Question: Are there differences in cardiovascular health between men and women?

  • Group 8, repo

    Languages: R, Stata, SAS

    Question: What is the relationship between eating habits and alcohol use? Are certain types of eating habits associated with higher alcohol consumption?

  • Group 9, repo

    Languages: Matlab, Python, R

    Question: How do drinking habits compare by health status, among the healthy?

  • Group 10, repo

    Languages: Python, R, SAS

    Question: Is salt intake associated with blood pressure? If so, to what extent is that relationship mediated or moderated by age or waist size?

  • Group 11, repo

    Languages: R, Python, Stata

    Question: In terms of demographic features, alcohol use and dietary habits, what features are most associated with the prevalence of diabetes?

  • Group 12, repo

    Languages: Python, R (dplyr / mgcv), R (data.table / splines)

    Question: Do people who report more physical activity also report longer sleep?