9/3/2019

Computational Methods and Tools in Statistics

  • This course takes a broad view of computational methods, one that encompasses the many ways, both routine and specialized, that statistical analysts and data scientists use computers.

Computational Methods and Tools in Statistics

  • This broad view encompasses but is not limited to:

    • Managing, obtaining, and organizing data

    • Data exploration and visualization

    • Using statistical software for data analysis

    • Reproducible reporting and presentation of analyses

    • Computationally intensive methods in statistics and data science

Course Overview

  • Course Website: www.jbhender.github.io/Stats506/F19/

  • General computing skills: Linux shell, git, literate programming

  • Scripting Languages: R, Stata, SAS

  • “Advanced” R: dplyr, data.table, functional and object-oriented programming, Rcpp

  • Other languages: (R)Markdown, HTML, C++, SQL

  • High performance and parallel computing

  • Computational algorithms: cross-validation, bootstrap re-sampling, permutation testing, Monte Carlo estimation, and simulation studies.

Canvas

  • Reading assignments, quizzes, and surveys

  • The first of these is a survey, due this Thursday.

  • The second is a quiz, due next Tuesday.

  • For future readings and quizzes I will always give you at least five days and usually a week or more to complete them.

Optional Texts

  • There are three optional texts for this course, all on using R.

  • None of these texts is required to complete problem sets or quizzes, but I will occasionally include optional readings from them to supplement course material.

  • The Art of R Programming, by Norman Matloff, is recommended for those with little to no previous experience in R.

  • R for Data Science, by Garrett Grolemund and Hadley Wickham, is a useful read for both new and experienced R users and will be a frequent reference for this course.

  • Advanced R, by Hadley Wickham, is for those who would like to develop a deep understanding of R and its inner workings.

Statistical Software

  • Why will we focus on R, SAS, and Stata?

  • These appear to be the most in demand in the job market for statistical analysts.

  • For data science jobs, R and Python appear to be the most in demand.

  • Other courses are devoted exclusively to Python.

Software Popularity

  • In The Popularity of Data Science Software, Robert Muenchen presents analyses measuring the popularity of various software in job postings and academic articles.

  • We will review the first 5 or so plots from the article.

Health Services Research

  • In health services research, a large majority of academic articles use SAS or Stata (Dembe et al., 2011).

About (some of) these languages

  • R, SAS, and Stata are examples of domain-specific languages used primarily for statistics and data analysis.

  • SQL or ‘structured query language’ is a specialized language for querying databases.

  • Python is a general-purpose scripting language with a number of libraries for math, statistics, data analysis, and machine learning, including numpy, pandas, statsmodels, scikit-learn, and tensorflow.

  • System and application languages such as C/C++, Java, and Go can be used to produce very high-performance code, but they have a steeper learning curve and longer development time.