Readings

The readings below are assigned through Canvas:

  1. Linux Shell Skills from Prof. Shedden’s 2016 Course notes

  2. A tmux Primer by Daniel Miessler

  3. Statistics and Computation Service by UM ITS


Terminals

If you are using a computer running macOS or Linux, you already have access to a Unix-style terminal, typically installed on your computer as Terminal.

If you are using a computer running a Windows OS, you will need to install a terminal or shell program capable of acting as such. Most people use PuTTY, but PowerShell may also be an option.

Experienced users may be interested in zsh but we will not make use of it for this class.


Connecting to UM Machines

AFS

In order to connect to university Linux servers, you need to have an AFS home directory. If you do not have one, you can set it up by visiting http://mfile.umich.edu/ and selecting the ‘AFS Self-Provisioning Tool’.

Hosts

You can connect to a UM Linux server using ssh as follows:

  ssh uniqname@login.itd.umich.edu
  ssh uniqname@scs.dsc.umich.edu
  ssh uniqname@mario.dsc.umich.edu
  ssh uniqname@luigi.dsc.umich.edu

Replace uniqname with your UM unique name, which is the same as the first part of your UM email address.

Trouble connecting

If you have trouble connecting to the SCS servers please visit this help page.

Exercises

  1. In a web browser, visit https://mfile.umich.edu and log in to view your files. If needed, use the self-provisioning tool first.
  2. Find or install your terminal program and open it.
  3. Connect to scs.dsc.umich.edu using ssh. Which host were you connected to? Log out.
  4. Connect to login.itd.umich.edu using ssh. Which host were you connected to? Log out and connect again. Were you connected to the same host?

File system

In Linux, essentially everything is a file: this includes program executables and system configurations, as well as your own data and source files.

Files are organized hierarchically into directories beginning with the root directory /. Directories can contain files and sub-directories with locations in the directory hierarchy separated by a /. This collection of directories and files is called a file tree.
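
As a minimal sketch of moving around a file tree, the commands below build a small scratch hierarchy and walk through it. The directory and file names (project, data, example.csv) are hypothetical and chosen only for illustration.

```shell
# Build a small scratch file tree (all names here are made up for illustration).
top=$(mktemp -d)
mkdir -p "$top/project/data"
touch "$top/project/data/example.csv"

# Move into the project directory and print the working directory.
cd "$top/project"
pwd

# List the contents of the current directory, then of a sub-directory.
ls
ls -l data

# Relative paths are resolved from the working directory; .. is the parent.
cd data
here=$(pwd)
echo "now in: $here"
cd ..

# Absolute paths start from the root directory /.
ls /
```

Running pwd after each cd is an easy way to confirm where you are in the tree.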

Use the following commands to navigate and interact with the file tree:

In working with files, it is helpful to know:

Environment variables

Environment variables determine certain aspects of how the OS behaves and responds to your instructions. Here are a couple of important ones:

  • HOME (location of your home directory)
  • SHELL (the shell you are using to interface with the machine)
  • PATH (locations to search for executable programs)

In the Bash shell, use $ to access the value of an environment variable. The echo command can be used to print these values to the screen, e.g. echo $SHELL.

Use which to search your $PATH for an executable command.

A tilde ~ will often be expanded as $HOME.
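
A short sketch of how these pieces fit together, assuming a Bash-like shell. The scratch directory and the hello506 program are hypothetical, created only to show how PATH is searched.

```shell
# Print two of the variables described above.
echo "HOME is $HOME"
echo "SHELL is $SHELL"

# PATH is a colon-separated list; show one directory per line.
echo "$PATH" | tr ':' '\n'

# Create a tiny executable in a scratch directory (hypothetical name).
bindir=$(mktemp -d)
printf '#!/bin/sh\necho "hello from hello506"\n' > "$bindir/hello506"
chmod +x "$bindir/hello506"

# Prepend the directory to PATH so the shell can find the new program.
PATH="$bindir:$PATH"

# command -v reports where the shell finds the program; for an executable
# on PATH, `which hello506` would typically report the same location.
found=$(command -v hello506)
echo "found at: $found"
hello506

# An unquoted leading tilde expands to the home directory.
tilde_dir=~
echo "~ expands to $tilde_dir"
```

The change to PATH here lasts only for the current shell session.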

Resources

For more see the GNU Coreutils documentation.

Text Editors

In order to edit files in the shell, you will need to use a text editor. Some popular choices are:

You can find links to tutorials in Prof. Shedden’s notes assigned as reading. I personally use emacs, and that will be the editor I use in examples presented to the class.

If you do not already have a preferred text editor, please pick one to learn for the course, find a tutorial on it, and work through that tutorial.


Terminal Multiplexers

A terminal multiplexer allows you to invoke multiple shells from the same terminal connection and to keep these sessions running after you log off. The two most common are screen and tmux, with the latter being the preferred option for this course.

When using a terminal multiplexer with a networked file system such as AFS, be aware that your credentials or “ticket” for accessing the networked files will typically expire after a fixed amount of time (e.g. 24 hours). You can renew this ticket for a fixed amount of time using kinit:

kinit -4d
aklog

Transferring data

There are many ways to transfer data to a remote server using the shell.
Three common ways to do this from the command line are:

  • scp to copy to/from your local computer,
  • wget to download directly from the web,
  • sftp or ‘secure file transfer protocol’ for transferring large volumes of data.

To transfer a single smallish file from the working directory on your local machine to your AFS space:

scp ./local_file.ext uniqname@login.itd.umich.edu:~/remote_directory/

To transfer a file from the remote directory to your local computer, reverse the arguments:

scp uniqname@login.itd.umich.edu:~/remote_directory/remote_file.ext ./

For larger transfers you should use sftp for efficiency. Transfer data using the login pool to avoid adding strain to the computation servers.

To download data directly from a website to a remote server, use a web browser to find the URL of the file and then pass that URL to wget:

wget https://remote.url.edu/path/to/file/data.txt

Make sure you only download from trusted sources!

Use sftp for interactive sessions; read more using man sftp on one of the University remotes.

Exercise

Use one of the methods above to transfer the RECS data to your AFS space.


Pipes and file redirections

We will review some of the examples from sections 2.3.4 and 2.3.5 of Data Science at the Command Line. You may wish to read all of section 2.3.
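
The following is not one of the book's examples, just a minimal sketch of the basic operators, assuming a small made-up word list. It shows > (write to a file), >> (append), | (pipe one command into the next), and < (read from a file).

```shell
workdir=$(mktemp -d)

# Redirect standard output to a file with >, creating or overwriting it.
printf 'pear\napple\npear\nfig\napple\npear\n' > "$workdir/fruit.txt"

# Append to an existing file with >>.
echo "fig" >> "$workdir/fruit.txt"

# Pipe the output of one command into the next with |:
# sort the lines, count duplicates, then order by count (largest first).
sort "$workdir/fruit.txt" | uniq -c | sort -rn > "$workdir/counts.txt"
cat "$workdir/counts.txt"

# Read standard input from a file with <.
n_lines=$(wc -l < "$workdir/fruit.txt")
echo "fruit.txt has $n_lines lines"
```

Each stage of a pipeline reads the previous stage's standard output, so no intermediate files are needed until the final redirect.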


Compression and archiving

Large files often contain redundant data and can be stored using less space on disk in a compressed format. Depending on the system and the file, compression can make reading from or writing to a file more efficient as reading the bits off disk is “I/O bound” while decoding/decompressing is “CPU bound”. This is particularly useful on shared systems with I/O bottlenecks.

Disk utilization

The du (disk usage) utility can be used to see the space on disk used by a file or set of files. Use the -h option to print values in human-readable units. Use -s to get a single summary total for a directory.
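
A brief sketch of these options, assuming a scratch directory with hypothetical file names. The sizes reported will vary by file system.

```shell
workdir=$(mktemp -d)
mkdir -p "$workdir/project/data"
seq 1 100000 > "$workdir/project/data/numbers.txt"
seq 1 1000 > "$workdir/project/notes.txt"

# Space used by a single file, in human-readable units.
du -h "$workdir/project/data/numbers.txt"

# Without -s, du reports each sub-directory as well as the total.
du -h "$workdir/project"

# With -s, du prints one summary line for the whole directory.
summary=$(du -sh "$workdir/project")
echo "$summary"
```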

gzip

There are many compression tools; one of the most popular is gzip. The command,

gzip file.txt

compresses file.txt into file.txt.gz, replacing the original file.

The file can be uncompressed using,

gunzip file.txt.gz

or

gzip -d file.txt.gz

Either command restores file.txt by removing the .gz suffix.

You can retain the compressed copy and unzip directly to standard output using the -c option:

gunzip -c file.txt.gz > file.txt

zcat is a shortcut that does the same thing.
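
To see the full round trip, here is a minimal sketch that assumes a scratch directory and a generated text file. The file names are hypothetical.

```shell
workdir=$(mktemp -d)

# Write a repetitive text file, which compresses well, and keep a copy
# so the round trip can be checked at the end.
seq 1 100000 > "$workdir/file.txt"
cp "$workdir/file.txt" "$workdir/original.txt"
du -h "$workdir/file.txt"

# Compress in place: file.txt is replaced by file.txt.gz.
gzip "$workdir/file.txt"
ls "$workdir"
du -h "$workdir/file.txt.gz"

# Stream the uncompressed contents to standard output, leaving the
# compressed copy on disk; here only the first three lines are shown.
gunzip -c "$workdir/file.txt.gz" | head -n 3

# Write the uncompressed contents to a new file and compare with the original.
gunzip -c "$workdir/file.txt.gz" > "$workdir/restored.txt"
cmp "$workdir/original.txt" "$workdir/restored.txt" && echo "round trip OK"
```

Piping from gunzip -c lets downstream tools read compressed data without ever writing an uncompressed copy to disk.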

tar

A tarball is an archive of a file tree and is often compressed. This can be useful for transferring directories between machines manually. It is also a way to cleanly archive files from projects you would like to retain but no longer need to use frequently. Many programs have the ability to work directly with archived and/or compressed data.

The two most common use cases are creating an archive,

tar cvfz name.tgz ./parent_folder

and extracting the archive,

tar xvfz name.tgz

The extension .tgz is short for .tar.gz indicating that the archive has been compressed using gzip.
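
The two commands above can be tried end to end with the sketch below, which assumes a scratch directory and a small made-up tree. It also lists the archive contents with the t option before extracting.

```shell
workdir=$(mktemp -d)

# Build a small directory tree to archive (file names are hypothetical).
mkdir -p "$workdir/parent_folder/sub"
echo "alpha" > "$workdir/parent_folder/a.txt"
echo "beta" > "$workdir/parent_folder/sub/b.txt"

# Create a gzip-compressed archive. The subshell keeps the cd local.
(cd "$workdir" && tar cvfz name.tgz ./parent_folder)

# List the archive contents without extracting (t = list).
tar tfz "$workdir/name.tgz"

# Extract into a separate directory to avoid overwriting the original.
mkdir "$workdir/extracted"
(cd "$workdir/extracted" && tar xvfz ../name.tgz)

# Confirm the extracted tree matches the original.
diff -r "$workdir/parent_folder" "$workdir/extracted/parent_folder" && echo "trees match"
```

Listing an archive before extracting it is a useful habit, since extraction writes into the current directory.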

Other common tools

You may at times find the following command line tools useful:

We will look at examples in class as time permits. You may wish to read chapter 5 of Data Science at the Command Line.


Shell scripting

A shell script is a program constructed from shell commands. You can view an example here.

For more on shell scripting, see Chapter 4 from Data Science at the Command Line by Jeroen Janssens.
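
As a small illustration (not the example linked above), here is a sketch of a script that combines variables, a loop, and command substitution. The scratch directory and file names are hypothetical.

```shell
#!/bin/bash
# Count the lines in every .txt file of a scratch directory.

# Set up some made-up input files to work on.
data_dir=$(mktemp -d)
printf 'a\nb\nc\n' > "$data_dir/first.txt"
printf 'd\ne\n' > "$data_dir/second.txt"

total=0
for f in "$data_dir"/*.txt; do
  # Command substitution captures the output of wc in a variable.
  n=$(wc -l < "$f")
  echo "$(basename "$f"): $n lines"
  # Arithmetic expansion updates the running total.
  total=$((total + n))
done

echo "total: $total lines"
```

Saved to a file such as count_lines.sh (a hypothetical name), the script could be run with bash count_lines.sh.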

Statistics 506, Fall 2018 Homepage