
About

The course project is worth 20% of the course grade. It will be graded out of 200 points, evenly split between the group and individual components (100 points each). Read more about each part below.

Individual Component

The open cities data for this component can be found here.

In this component, you should pose two clearly defined research questions relating to the open cities data sets above. You will then answer each question yourself, using one or more data sets to support your arguments.

You must use at least three data sets in your solutions. You may either pose two questions related to the same group of three or more of these data sets, or pose one question based on a single data set and then a second utilizing a group of two or more data sources.

Your questions should be limited in their overall scope, similar to a single question from one of the first three problem sets. Your answer to each question, including any text, tables, and graphs, should be a single typed page, submitted as a PDF via Canvas. You should also submit the code you write to answer your questions. Your code can use any of the software packages we have learned this semester. Code should follow the style guidelines and be reasonably concise and efficient. Submit code as plain text (with a .txt extension), with the software used clearly labeled in the header.

Individual Component Timeline

There are three due dates for this component:

  1. Project Proposals Due: Tuesday November 21 at 2pm. You must have the questions you pose approved by me. Your proposal should include two well-posed questions, each followed by a brief (1-3 sentence) description of the data sets, approach, and software you will use to answer it. Your proposal should be sent to me in the body of an email with the subject header “Stats 506 Individual Project Proposal”. I reserve the right to deny proposed questions that are too similar to those already submitted, so it is to your benefit to submit your proposal in advance of the deadline.

  2. Draft Due: Tuesday December 12 at 9am via Canvas. We will engage in a peer-review process in class this day. Please bring three printed versions of your submissions with you. You will have an opportunity to revise your submission based on the feedback you receive.

If your draft is incomplete in any way, please include placeholders or an outline with a “to do” list.

  3. Final Due Date: Monday December 18 at 7am (originally Thursday December 14 at noon). This is the deadline to submit your final draft via Canvas.

Group Component

The group component will be completed in groups of three. Groups have been assigned randomly and will be posted to Canvas.

Each group will choose a data management, analysis, or visualization technique and produce a tutorial on the selected topic. The tutorial should include:

Topics should be of similar scope to the UCLA page here.

Each group should write a short proposal stating the topic for the tutorial and the languages/packages the group will use for the examples. Each member of the group should assume primary responsibility for one example script, and these responsibilities should be listed in the proposal.

The group liaison should submit the group’s proposal to me via email with the subject header “Stats 506 Group Project Proposal”.

The final tutorial should also be submitted to me via email as a standalone HTML page, such as one generated by R Markdown.
I plan to post your submissions to this page – you may include or omit your name from the author information at your own discretion.

Group Component Timeline

  1. Topic Proposals due November 22 at 4pm: Your group must have your proposal approved by me prior to this time. Groups are required to select unique topics so it is again to your benefit to submit early.

  2. Draft Due Date: Friday December 8 at 4pm. Drafts should be mostly complete and contain a concise to-do list of outstanding items.

  3. Final Due Date: Wednesday December 20 at 7am. This is the deadline to submit the final version of your tutorial. Please submit as a webpage to me via email.

Approved Group Proposals

I will post group proposals here as they are approved. Two groups may not choose the same or related topics. The first two topics below are also reserved:

Reserved

  1. Reshaping Data between long and wide formats

  2. The “split-apply-combine” pattern

Approved

  1. Group 15 - Topic: Logistic Regression; Data: Titanic Survival; Languages: Python, R, Stata
  2. Group 1 - Topic: Robust Regression; Data: Employee Compensation in San Francisco; Languages: R (MASS::rlm), Stata, Matlab
  3. Group 6 - Topic: Interactive Graphics; Data: Video Game Sales; Languages: R (Shiny + Base), R (Shiny + ggplot2), Serving Shiny using AWS.
  4. Group 21 - Topic: Model Selection; Data: Home Sale Prices from King County; Languages: R, Python, Stata
  5. Group 19 - Topic: Permutation Testing; Data: Gene Expression and Running Distances from Rats; Languages: R, Python, Stata
  6. Group 11 - Topic: Poisson Regression; Data: Nesting Horseshoe Crabs; Languages: Python, R, Stata
  7. Group 10 - Topic: Negative Binomial Regression; Data: Video Game Sales; Languages: R, SAS, Stata
  8. Group 7 - Topic: Decision Trees; Data: Iris; Languages: Python, R, SAS
  9. Group 16 - Topic: Comparing Clustering Techniques; Data: Funded Kickstarter projects; Languages: R, SAS, Stata
  10. Group 2 - Topic: Nearest Neighbors Classification; Data: Seed; Languages: Python, R, Stata
  11. Group 9 - Topic: Support Vector Machines; Data: Parkinson’s Disease; Languages: Matlab, Python, R
  12. Group 14 - Topic: K-means Clustering; Data: Seeds; Languages: Python, R, Stata
  13. Group 13 - Topic: Ridge regression; Data: faraway::seatpos (R); Languages: R, SAS, and Python
  14. Group 20 - Topic: Zero-inflated Poisson; Data: Crab; Languages: Python, R (pscl), Stata
  15. Group 4 - Topic: Weighted least squares and diagnostics; Data: Abalone; Languages: R, SAS, and Python
  16. Group 12 - Topic: Artificial Neural Networks; Data: Boston Housing (MASS::Boston); Languages: Matlab (‘Neural Net Clustering toolbox’), Python (‘Sklearn’), R (‘Neuralnet’)
  17. Group 17 - Topic: Lasso Regression; Data: Hand-written digits; Languages: Matlab, Python (Scikitlearn), R (glmnet)
  18. Group 8 - Topic: Changepoint Analysis; Data: Life Expectancy; Languages: R (bcp & changepoint) and SAS (proc mcmc)
  19. Group 18 - Topic: Principal Components Analysis; Data: USArrests (R); Languages: Python (Scikitlearn), R (princomp()), Stata
  20. Group 3 - Topic: K-means w/ visualization and performance evaluation; Data: Iris; Languages: R, Stata, and Python
  21. Group 5 - Topic: Adaboost; Data: Occupancy; Languages: R, Python, SAS