Group Project

Overview

The group project will be completed in groups of 3-4. Groups have been assigned according to an algorithm and are posted at Project Groups.

Each group will prepare and explain an example related to one of the topics below. Each member of the group will prepare a version of the example using a different statistical software (e.g. R, Stata, SAS, Python, Matlab) and write a short tutorial on the topic in that software as it relates to the example.

Topics

Here is a collection of topics from which to choose. Some topics should be considered “umbrella” topics, from within which you should choose 1-3 specific concepts to cover in your tutorial.

Data Manipulation Topics

  • Using regular expressions to work with string variables or variable names.
  • Importing, formatting, or otherwise working with dates and/or date-times.
  • Performing rolling joins, in which the most recent observed values on a subject from one data set are “carried forward” for that subject in the merged data.
  • Organizing transactional data (e.g. from log files) into one or more rectangular data sets.
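
To make the rolling-join topic above concrete, here is a minimal Python sketch using only the standard library. The function name `rolling_join`, the tuple layouts, and the sample records are hypothetical and chosen only for illustration; in a real tutorial you would more likely rely on a dedicated tool such as rolling joins in data.table (R) or `merge_asof` in pandas (Python).

```python
from bisect import bisect_right
from collections import defaultdict


def rolling_join(left, right):
    """Attach to each left row the most recent right value for the same subject.

    left:  iterable of (subject, time) pairs
    right: iterable of (subject, time, value) triples
    Returns a list of (subject, time, value); value is None when the subject
    has no right-hand observation at or before `time`.
    """
    # Group the right-hand observations by subject, ordered by time.
    by_subject = defaultdict(list)
    for subject, time, value in sorted(right, key=lambda r: (r[0], r[1])):
        by_subject[subject].append((time, value))

    joined = []
    for subject, time in left:
        obs = by_subject.get(subject, [])
        times = [t for t, _ in obs]
        # Position of the last observation whose time is <= the left row's time.
        i = bisect_right(times, time) - 1
        value = obs[i][1] if i >= 0 else None
        joined.append((subject, time, value))
    return joined


# Hypothetical data: visit times (left) and lab measurements (right).
visits = [("a", 3), ("a", 10), ("b", 1), ("b", 6)]
labs = [("a", 2, 5.1), ("a", 7, 4.8), ("b", 4, 6.0)]

result = rolling_join(visits, labs)
for row in result:
    print(row)
```

Subject "b" at time 1 has no earlier measurement, so its value is left missing rather than filled from a later observation; that is the "carry forward only" behavior the topic describes.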

Graphics

  • Pick 2-3 specific graphical concepts and provide a tutorial on how to carry them out. Include code to customize aesthetics so that the graphs from each software are as similar as possible. [Warning: I know nothing about graphics in SAS or Stata.]

Software Requirements

  • Each member of the group must contribute a version of the example in a different statistical software (R, Stata, SAS, Python, Matlab).

  • When applicable, two examples may come from a single software if they use different tools (e.g. different R packages).

  • As a whole, each group’s examples must meet the following criteria:

    1. at least one example must be in R;
    2. at least one example must be in Stata or SAS;
    3. if two examples will be in the same software, you must justify to me that the examples are sufficiently different.

Here are examples of how to meet the software criteria for a group of 3:

  • R, Stata, SAS
  • R, Stata, Python
  • R (tidyverse), R (data.table), SAS.
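
If it helps while drafting a proposal, the software criteria can be read as a simple check on a group's planned list of tools. The sketch below is purely illustrative: the function `check_software` and its messages are hypothetical and are not part of any grading script, and passing the check does not replace the required justification for using the same software twice.

```python
from collections import defaultdict


def check_software(examples):
    """Check a group's planned examples against the software criteria.

    examples: list of (software, tool) pairs, e.g. ("R", "tidyverse");
              tool may be None when no specific package is highlighted.
    Returns a list of problems found (empty when none are found).
    """
    software = {s.lower() for s, _ in examples}
    problems = []

    # Criterion 1: at least one example must be in R.
    if "r" not in software:
        problems.append("no example in R")

    # Criterion 2: at least one example must be in Stata or SAS.
    if not software & {"stata", "sas"}:
        problems.append("no example in Stata or SAS")

    # Criterion 3: a repeated software must use different tools each time.
    tools = defaultdict(list)
    for s, tool in examples:
        tools[s.lower()].append(tool)
    for s, used in tools.items():
        if len(used) > 1 and (None in used or len(set(used)) < len(used)):
            problems.append(f"{s} is used more than once without distinct tools")

    return problems


# The three sample groupings listed above all pass the check.
print(check_software([("R", None), ("Stata", None), ("SAS", None)]))
print(check_software([("R", None), ("Stata", None), ("Python", None)]))
print(check_software([("R", "tidyverse"), ("R", "data.table"), ("SAS", None)]))
```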

To get an idea of scope, you can compare to completed projects from previous years:

Those projects are slightly different in that some are tutorials and some are focused more specifically on a data analysis. Your projects can be simpler than these.

Project Proposal

Each group should write a short proposal containing:

  • your selected topic,
  • an outline of the example, including data sets and specific variables that will be used,
  • the software/tools to be used for the parallel examples, including the core commands/functions/packages related to the topic,
  • which group member will take primary responsibility for the example in each software.

Languages other than those listed above (R, Stata, SAS, Python, Matlab) may be included provided you meet the software requirements and receive approval from me on your proposal.

A group liaison should submit the group’s proposal to me via email with the subject header “Stats 506 Group Project Proposal” before Monday November 2, 5pm.

Guidelines

Introduction

Write an introduction 3-5 paragraphs in length covering the following topics.

  • Describe your topic and explain when or why it is useful.
  • Explain any key concepts related to your topic.
  • Describe the data you will use for the examples.
  • Discuss the software/tools you will use for the examples.

Core Example 1

This section should consist of tabbed sections, one for each version of the example, so that the code is easy to compare. Make the examples as similar and parallel to one another as possible, and use common language in the code comments to facilitate comparisons.

Additional Core Examples (Option 1)

If your core example is fairly simple, you should either provide multiple examples or collaborate on extending a subset of the examples as described below.

Extended Examples (Option 2)

Extend one or more of the examples beyond the common scope of the others. Use this section to:

  • motivate a reader about why the topic is important, e.g. if your topic was “re-shaping data” you might show an analysis making use of the re-shaped data;

  • demonstrate functionality available in one tool, but not in another, e.g. pivoting multiple columns to long at once using data.table::melt();

  • explain or demonstrate something else you’d like a reader (e.g. this class) to know.

Conclusion / Discussion

Include a section comparing and contrasting the tools used. Emphasize any important differences in functionality or defaults necessary to understand similarities or differences in the separate versions of the example.

Git

Groups should use git and GitHub to coordinate their work. One group member should create a public repository for the project, with the others submitting pull requests to it. Your git repo is considered part of the final submission and should include at a minimum:

  • source files for your tutorial, including all examples, and:
    • one or more scripts for each core analysis
    • an Rmd file to create the final submission
    • an html page or pdf document with the final submission
  • a readme with a short description of the project
  • a minimum of two commits or pull requests per team member
  • evidence that each group member has reviewed every other member’s work (commits) at least once.

Excluding extraordinary circumstances, all group members will receive the same grade. However, I reserve the right to modify this policy in cases where one or more group members clearly put in less effort than the others.

Timeline

  1. Proposals Due: Monday November 2 by 5pm. Your group must have your proposal approved by me prior to this time. Groups are required to select unique topics, so it is to your benefit to submit early.

  2. Draft Due Date: Friday November 13 by 5pm. Submit drafts to Canvas as a link to the official git repo. Each member should submit a link and reference the files which they worked on to facilitate peer review from other groups. At a minimum, your draft must include:

    • a first draft of the introduction,
    • complete or nearly complete core examples,
    • an outline of the entire project,
    • a concise to-do list of any unfinished items.

  3. Peer Review Due: Wednesday November 18 by 5pm. You will be asked to provide constructive feedback to two peers following the peer review guidelines. Additional guidelines for how to structure this feedback will be provided.

  4. Final Due Date: Wednesday November 25, noon. This is the deadline to submit the final version of your tutorial. Please submit a link to the official git repo and the final html document.

Please make edits in response to peer feedback.

Approved Group Proposals

I will post group proposals here as they are approved. Two groups may not choose the same or closely related topics.

I also plan to post your final analyses to this page.
You may include or omit your name from the author information at your own discretion. If omitting your name from the html document, also include a second copy of your scripts with "_anon" appended to the file name and your name scrubbed from the header.

  • Group 1
    • repo
    • Languages: R, SAS, Stata, Python
    • Topic: Using Propensity Scores to Understand the
    • Data: NHANES
  • Group 2
  • Group 3
    • repo
    • Languages: R, Python, Stata
    • Topic: ARIMA Models for Time-Series Data
    • Data: Nifty Fifty
  • Group 4
  • Group 5
    • repo
    • Languages: R, Stata, Python
    • Topic: Hosmer-Lemeshow Model Calibration
    • Data: Churn Modelling
  • Group 6
    • repo
    • Languages: R, Stata, Python
    • Topic: Splines
    • Data: Wage
  • Group 7
    • repo
    • Languages: R, Python, Stata
    • Topic: Inference for linear and non-linear combinations of regression coefficients
    • Data: Global Suicide Rates