This course takes a broad view of computational methods, one that encompasses the many ways, both routine and specialized, that computers are used by statistical analysts and data scientists.
9/1/2020
This broad view includes, but is not limited to:
Managing, obtaining, and organizing data
Data exploration and visualization
Using statistical software for data analysis
Reproducible reporting and presentation of analyses
Computationally intensive methods in statistics and data science
Course Website: www.jbhender.github.io/Stats506/F20/
General computing skills: Linux shell, git, literate programming
Scripting Languages: R, Stata, SAS
“Advanced” R: dplyr, data.table, functional and object-oriented programming
Other languages: (R)Markdown, (minimal) HTML
High performance, parallel, and batch computing
Computational algorithms: cross-validation, bootstrap re-sampling, permutation testing, Monte Carlo estimation, and simulation studies.
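As a rough illustration of two of the algorithms listed above, bootstrap re-sampling and permutation testing, here is a minimal sketch. It is written in Python using only the standard library so it can run anywhere; it is not course material, and the function names, the toy data, and the defaults (1000 replicates, seed 506) are assumptions made for this example.

```python
import random
import statistics


def bootstrap_se(data, stat=statistics.mean, n_boot=1000, seed=506):
    """Estimate the standard error of `stat` by bootstrap re-sampling."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate re-samples n observations with replacement.
    replicates = [stat(rng.choices(data, k=n)) for _ in range(n_boot)]
    return statistics.stdev(replicates)


def permutation_pvalue(x, y, n_perm=1000, seed=506):
    """Two-sided permutation test for a difference in group means."""
    rng = random.Random(seed)
    observed = abs(statistics.mean(x) - statistics.mean(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_perm):
        # Shuffling the pooled data randomly reassigns the group labels.
        rng.shuffle(pooled)
        diff = abs(
            statistics.mean(pooled[:len(x)]) - statistics.mean(pooled[len(x):])
        )
        if diff >= observed:
            hits += 1
    # Add one to numerator and denominator so the p-value is never exactly zero.
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # Hypothetical toy data, for illustration only.
    x = [5.1, 4.9, 6.2, 5.8, 6.0, 5.5, 4.7, 6.3]
    y = [4.2, 4.8, 5.0, 4.1, 4.6, 5.2, 4.4, 4.9]
    print("bootstrap SE of mean(x):", round(bootstrap_se(x), 3))
    print("permutation p-value:", round(permutation_pvalue(x, y), 3))
```

Both functions follow the same pattern: repeat a random re-sampling step many times and summarize the resulting distribution. Fixing the seed makes the results reproducible from run to run.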
Recorded lectures, reading assignments, quizzes, and surveys
There is a “Computing Experience” survey due this Thursday.
The quiz for the first module, Linux Shell Skills, is due next Tuesday.
New modules, with lectures, reading assignments and a quiz, will be released on Mondays. The quiz is always due the following Tuesday.
Beginning September 8, we will use class meetings on Tuesday for a synchronous activity related to the module released the previous week.
Activities will run each Tuesday until the Thanksgiving recess, with the exception of Tuesday November 3 (US election). This is 10 meetings.
Attendance on activity Tuesdays is required. Each class counts 10 points (1%) toward the course grade.
Class meetings on Thursdays, and on Tuesdays after the Thanksgiving recess, will be brief Q & A sessions.
There are three optional texts for this course, all on using R.
None of the texts is required to complete problem sets or quizzes, but I will occasionally include optional readings from them to supplement course material.
The Art of R Programming, by Norman Matloff, is recommended for those with little to no previous experience in R.
R for Data Science, by Garrett Grolemund and Hadley Wickham, is a useful read for both new and experienced R users and will be a frequent reference for this course.
Advanced R, by Hadley Wickham, is for those who would like to develop a deep understanding of R and its inner workings.
Why will we focus on R, SAS, and Stata?
These appear to be the most in demand in the job market for statistical analysts.
For data science jobs, R and Python appear to be the most in demand.
Other courses are devoted exclusively to Python.
In The Popularity of Data Science Software, Robert Muenchen presents analyses measuring the popularity of various software in job postings and academic articles.
We will review the first 5 or so plots from the article.
R, SAS, and Stata are examples of domain-specific languages used primarily for statistics and data analysis.
SQL, or ‘structured query language’, is a specialized language for querying databases.
Python is a general-purpose scripting language with a number of libraries for math, statistics, data analysis, and machine learning, including: numpy, pandas, statsmodels, scikit-learn, and tensorflow.
System and application languages such as C/C++, Java, and Go can be used to produce very high performance code, but have a steeper learning curve and require more development time.