9/1/2020

Computational Methods and Tools in Statistics

  • This course takes a broad view of computational methods, one that encompasses the many ways, both routine and specialized, that computers are used by statistical analysts and data scientists.

Computational Methods and Tools in Statistics

  • This broad view encompasses but is not limited to:

    • Managing, obtaining, and organizing data

    • Data exploration and visualization

    • Using statistical software for data analysis

    • Reproducible reporting and presentation of analyses

    • Computationally intensive methods in statistics and data science

Course Overview

  • Course Website: www.jbhender.github.io/Stats506/F20/

  • General computing skills: Linux shell, git, literate programming

  • Scripting Languages: R, Stata, SAS

  • “Advanced” R: dplyr, data.table, functional and object-oriented programming

  • Other languages: (R)Markdown, (minimal) html

  • High performance, parallel, and batch computing

  • Computational algorithms: cross-validation, bootstrap re-sampling, permutation testing, Monte Carlo estimation, and simulation studies.

Canvas

  • Recorded lectures, reading assignments, quizzes, and surveys

  • There is a “Computing Experience” survey due this Thursday.

  • The quiz for the first module, Linux Shell Skills, is due next Tuesday.

  • New modules, with lectures, reading assignments and a quiz, will be released on Mondays. The quiz is always due the following Tuesday.

Class meetings

  • Beginning September 8, we will use class meetings on Tuesday for a synchronous activity related to the module released the previous week.

  • Activities will run each Tuesday until the Thanksgiving recess, with the exception of Tuesday November 3 (US election). This is 10 meetings.

  • Attendance on activity Tuesdays is required. Each class counts 10 points (1%) toward the course grade.

  • Class meetings on Thursdays, and on Tuesdays after the Thanksgiving recess, will be brief Q & A sessions.

Optional Texts

  • There are three optional texts for this course, all on using R.

  • None of these texts is required to complete problem sets or quizzes, but I will occasionally include optional readings from them to supplement course material.

  • The Art of R Programming, by Norman Matloff, is recommended for those with little to no previous experience in R.

  • R for Data Science, by Garrett Grolemund and Hadley Wickham, is a useful read for both new and experienced R users and will be a frequent reference for this course.

  • Advanced R, by Hadley Wickham, is for those who would like to develop a deep understanding of R and its inner workings.

Statistical Software

  • Why will we focus on R, SAS, and Stata?

  • These appear to be the most in demand in the job market for statistical analysts.

  • For data science jobs, R and Python appear to be the most in demand.

  • Other courses are devoted exclusively to Python.

Software Popularity

  • In The Popularity of Data Science Software, Robert Muenchen presents analyses measuring the popularity of various software in job postings and academic articles.

  • We will review the first 5 or so plots from the article.

Health Services Research

  • In health services research, a large majority of academic articles use SAS or Stata (Dembe et al., 2011).

About (some of) these languages

  • R, SAS, and Stata are examples of domain-specific languages used primarily for statistics and data analysis.

  • SQL, or ‘Structured Query Language’, is a specialized language for querying databases.

  • Python is a general-purpose scripting language with a number of libraries for math, statistics, data analysis, and machine learning, including numpy, pandas, statsmodels, scikit-learn, and tensorflow.

  • System and application languages such as C/C++, Java, and Go can be used to produce very high-performance code, but have a steeper learning curve and require more development effort.