This course takes a broad view of computational methods that encompasses the many ways, both routine and specialized, that computers are used by statistical analysts and data scientists.
9/3/2019
This broad view includes, but is not limited to:
Managing, obtaining, and organizing data
Data exploration and visualization
Using statistical software for data analysis
Reproducible reporting and presentation of analyses
Computationally intensive methods in statistics and data science
Course Website: www.jbhender.github.io/Stats506/F19/
General computing skills: Linux shell, git, literate programming
Scripting Languages: R, Stata, SAS
“Advanced” R: dplyr, data.table, functional and object-oriented programming, Rcpp
Other languages: (R)Markdown, HTML, C++, SQL
High performance and parallel computing
Computational algorithms: cross-validation, bootstrap re-sampling, permutation testing, Monte Carlo estimation, and simulation studies.
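To make one of the listed algorithms concrete, here is a minimal sketch of bootstrap re-sampling used to estimate the standard error of a sample mean. It is written in Python with only the standard library and is not taken from the course materials; the function name `bootstrap_se_mean`, the sample values, the number of resamples, and the seed are all assumptions chosen for illustration.

```python
import random
import statistics


def bootstrap_se_mean(data, n_boot=2000, seed=0):
    """Estimate the standard error of the sample mean by bootstrap resampling."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    boot_means = []
    for _ in range(n_boot):
        # Resample n observations from the data with replacement.
        resample = rng.choices(data, k=n)
        boot_means.append(statistics.fmean(resample))
    # The spread of the resampled means approximates the standard error.
    return statistics.stdev(boot_means)


# Hypothetical sample, made up for illustration only.
sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7, 3.9, 2.5]
se_boot = bootstrap_se_mean(sample)
se_formula = statistics.stdev(sample) / len(sample) ** 0.5
print(f"bootstrap SE: {se_boot:.3f}, formula SE: {se_formula:.3f}")
```

For the mean, the bootstrap estimate should land close to the textbook formula s / sqrt(n). The same resampling loop applies to statistics that have no simple formula, such as a median or a ratio.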
Reading assignments, quizzes, and surveys
The first of these is a survey, due this Thursday.
The second is a quiz, due next Tuesday.
For future readings and quizzes I will always give you at least five days and usually a week or more to complete them.
There are three optional texts for this course, all on using R.
None of the texts is required to complete problem sets or quizzes, but I will occasionally include optional readings from them to supplement course material.
The Art of R Programming, by Norman Matloff, is recommended for those with little to no previous experience in R.
R for Data Science, by Garrett Grolemund and Hadley Wickham, is a useful read for both new and experienced R users and will be a frequent reference for this course.
Advanced R, by Hadley Wickham, is for those who would like to develop a deep understanding of R and its inner workings.
Why will we focus on R, SAS, and Stata?
These appear to be the most in demand in the job market for statistical analysts.
For data science jobs, R and Python appear to be the most in demand.
Other courses are devoted exclusively to Python.
In The Popularity of Data Science Software, Robert Muenchen presents analyses measuring the popularity of various software in job postings and academic articles.
We will review the first 5 or so plots from the article.
R, SAS, and Stata are examples of domain-specific languages used primarily for statistics and data analysis.
SQL or ‘structured query language’ is a specialized language for querying databases.
Python is a general-purpose scripting language with a number of libraries for math, statistics, data analysis, and machine learning, including: numpy, pandas, statsmodels, scikit-learn, and tensorflow.
System and application languages such as C/C++, Java, and Go can be used to produce very high performance code, but have a steeper learning curve and require more development effort.
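The difference between a specialized query language and a general-purpose scripting language can be seen in a small sketch that uses both. Here Python's standard `sqlite3` module builds an in-memory database and hands it an SQL query. The table `scores`, its columns, and its rows are hypothetical and made up for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table, made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (student TEXT, quiz INTEGER, score REAL)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [("a", 1, 8.0), ("a", 2, 9.0), ("b", 1, 6.0), ("b", 2, 7.0)],
)

# The SQL statement says what to compute; the database engine decides how.
query = """
    SELECT student, AVG(score) AS mean_score
    FROM scores
    GROUP BY student
    ORDER BY student
"""
rows = conn.execute(query).fetchall()
conn.close()
print(rows)  # [('a', 8.5), ('b', 6.5)]
```

The SQL portion only describes the desired result, while the surrounding Python handles setup, control flow, and output. This division of labor is typical when a scripting language is paired with a database.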