Readings

The readings below are assigned through Canvas:

  1. Linux Shell Skills from Prof. Shedden’s 2016 Course notes

  2. A tmux Primer by Daniel Miessler

  3. Statistics and Computation Service by UM ITS


Terminals

If you are using a computer running macOS or Linux, you already have a Unix-style terminal installed, usually as an application named Terminal.

If you are using a computer running Windows, you will need to install a terminal or shell program that can act as one. Most people use PuTTY, but PowerShell may also be an option.

Experienced users may be interested in zsh but we will not make use of it for this class.


Connecting to UM Machines

AFS

In order to connect to university Linux servers, you need to have an AFS home directory. If you do not have one, you can set it up by visiting http://mfile.umich.edu/ and selecting the ‘AFS Self-Provisioning Tool’.

Hosts

You can connect to a UM Linux server using ssh as follows:

Replace uniqname with your UM unique name, which is the same as the first part of your UM email address.
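A minimal sketch of the connection command, assuming the scs.dsc.umich.edu host named in the exercises below; uniqname is a placeholder. The command is printed rather than run so the snippet is safe to paste anywhere.

```shell
# Placeholder: replace with your own UM unique name.
user="uniqname"
# Assumed host: the SCS address used in the exercises.
host="scs.dsc.umich.edu"

# Print the command; remove "echo" to actually open the connection.
echo ssh "${user}@${host}"
```

Once connected, type exit (or press Ctrl-d) to log out.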

Trouble connecting

If you have trouble connecting to the SCS servers please visit this help page.

Exercises

  1. In a web browser, visit https://mfile.umich.edu and log in to view your files. If needed, use the self-provisioning tool first. A direct link to the self-provisioning tool: https://ifsprovisioning.its.umich.edu/ifs_storage/request
  2. Find or install your terminal program and open it.
  3. Connect to scs.dsc.umich.edu using ssh. Which host were you connected to? Log out.
  4. Connect to login.itd.umich.edu using ssh. Which host were you connected to? Log out and connect again. Were you connected to the same host?

File system

In Linux essentially everything is a file: this includes program executables, system configurations, as well as your own data and source files.

Files are organized hierarchically into directories beginning with the root directory /. Directories can contain files and sub-directories with locations in the directory hierarchy separated by a /. This collection of directories and files is called a file tree.

Use the following commands to navigate and interact with the file tree:
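A few standard navigation commands, sketched in a throwaway directory. The selection of commands and the directory and file names (project, data, notes.txt) are assumptions made for illustration.

```shell
# Work in a scratch directory so nothing in your home directory is touched.
scratch="$(mktemp -d)"
cd "$scratch"

pwd                      # print the current (working) directory
mkdir -p project/data    # create a directory and its parent in one step
touch project/notes.txt  # create an empty file
ls -l project            # list the contents of a directory
cd project               # change into a directory
cp notes.txt data/       # copy a file into another directory
mv data/notes.txt data/notes_copy.txt   # move or rename a file
rm notes.txt             # remove a file (there is no undo)
ls data                  # only notes_copy.txt remains here
```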

In working with files, it is helpful to know:
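Some notation that commonly comes up when working with paths, shown as a sketch. Which points belong on this list is an assumption, and the file names are hypothetical.

```shell
scratch="$(mktemp -d)"
mkdir -p "$scratch/a/b"
touch "$scratch/a/one.txt" "$scratch/a/two.txt" "$scratch/a/.hidden"

cd "$scratch/a/b"     # a path starting with / is absolute
cd ..                 # .. is the parent directory, so we are now in a
ls ./*.txt            # . is the current directory; * matches any run of characters
ls -a                 # -a also shows hidden files, whose names start with a dot
```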

Environment variables

Environment variables determine certain aspects of how the OS behaves and responds to your instructions. Here are a couple of important ones:

  • HOME (location of your home directory)
  • SHELL (the shell you are using to interface with the machine)
  • PATH (locations to search for executable programs)

In the Bash shell, use $ to access the value of an environment variable. The echo command prints these values to the screen, e.g. echo $SHELL. You can also recognize the Bash shell by its $ prompt, in contrast to the % prompt used by csh and some other shells.

Use which to search your $PATH for an executable command.

A tilde ~ will often be expanded as $HOME.
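The points above can be tried directly at the prompt. This is a minimal sketch; the values printed depend on your machine, and home_from_tilde is a name chosen here for illustration.

```shell
echo "$HOME"     # your home directory
echo "$SHELL"    # your login shell
echo "$PATH"     # colon-separated directories searched for executables

which ls         # full path of the ls executable found via $PATH

# An unquoted ~ at the start of a word expands to your home directory.
home_from_tilde=~
echo "$home_from_tilde"
```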

Resources

For more see the GNU Coreutils documentation.

Text Editors

In order to edit files in the shell, you will need to use a text editor. Some popular choices are:

You can find links to tutorials in Prof Shedden’s notes assigned as reading. I personally use emacs and that will be the editor I use in examples presented to the class.

If you do not already have a preferred text editor, please pick one to learn for the course, find a tutorial on it, and work through that tutorial.


Terminal Multiplexers

A terminal multiplexer allows you to invoke multiple shells from the same terminal connection and to keep these sessions running after you log off. The two most common are screen and tmux, with the latter being the preferred option for this course.
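A short sketch of the tmux session life cycle, assuming tmux is installed. The session name is generated from the shell's process ID so that the demo cannot collide with a session you already have.

```shell
session="demo-$$"   # hypothetical, unique session name

if command -v tmux >/dev/null 2>&1; then
  tmux new-session -d -s "$session"   # start a detached session
  tmux list-sessions                  # show running sessions
  # Interactively: "tmux attach -t NAME" attaches, and Ctrl-b d detaches again.
  tmux kill-session -t "$session"     # clean up the demo session
else
  echo "tmux is not installed on this machine"
fi
```

In day-to-day use you would skip the -d flag, work inside the session, detach with Ctrl-b d, and re-attach after your next login.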

When using a terminal multiplexer with a networked file system such as AFS, be aware that your credentials or “ticket” for accessing the networked files will typically expire after a fixed amount of time (e.g. 24 hours). You can renew this ticket for a fixed amount of time using kinit:
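A minimal sketch of renewing credentials from inside a long-running session. kinit requests a new Kerberos ticket and prompts for your password; aklog, where installed, then obtains an AFS token from that ticket. Whether the aklog step is needed on the UM servers is an assumption, and renew_afs is a hypothetical helper name.

```shell
# Hypothetical helper: define it once, then call it when file access starts failing.
renew_afs() {
  kinit || return 1          # request a new Kerberos ticket (prompts for a password)
  if command -v aklog >/dev/null 2>&1; then
    aklog                    # assumed step: obtain an AFS token from the ticket
  fi
  klist                      # list current tickets and their expiry times
}
```

Call renew_afs from any shell inside the multiplexer session to refresh access.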

Transferring data

There are many ways to transfer data to a remote server using the shell. Three common ways to do this from the command line are:

  • scp to copy to/from your local computer,
  • wget to download directly from the web,
  • sftp or ‘secure file transfer protocol’ for transferring large volumes of data.

To transfer a single smallish file from the working directory on your local machine to your AFS space:

To transfer a file from the remote directory to your local computer reverse the arguments:
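Both directions can be sketched as follows, assuming the login.itd.umich.edu login pool from the exercises; uniqname and data.csv are placeholders. The commands are printed rather than run, so remove the leading echo to perform the transfer.

```shell
# Placeholder user and assumed login host.
remote="uniqname@login.itd.umich.edu"

# Local -> remote: copy data.csv from the working directory to your remote home.
echo scp ./data.csv "${remote}:~/"

# Remote -> local: the same command with the two arguments reversed.
echo scp "${remote}:~/data.csv" ./
```

The part after the colon is a path on the remote machine; a relative path there is interpreted relative to your remote home directory.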

For larger transfers you should use sftp for efficiency. Transfer data using the login pool to avoid adding strain to the computation servers.

To download data directly from a website to a remote server use a web browser to find the URL to the file and use wget:

Make sure you only download from trusted sources!
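A minimal sketch of the wget call. The URL is a hypothetical placeholder, not a real data source, and the commands are printed rather than run; remove the leading echo once you have substituted a real link.

```shell
# Hypothetical URL: paste the link you copied from your browser.
url="https://example.com/path/to/data.csv"

# Download into the current directory, keeping the file name from the URL.
echo wget "$url"

# -O chooses the local file name explicitly.
echo wget -O data.csv "$url"
```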

Use sftp for interactive sessions; read more using man sftp on one of the University Linux servers.

Examples

We will work through the following example.

Exercise

Use one of the methods above to transfer the RECS data to your AFS space.


Pipes and file redirections

We will review some of the examples from sections 2.3.4 and 2.3.5 of Data Science at the Command Line. You may wish to read all of section 2.3.
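As a warm-up, here is a small sketch of pipes and redirection. These examples are written for this page rather than taken from the book, and numbers.txt is a hypothetical file name.

```shell
scratch="$(mktemp -d)"
cd "$scratch"

# A pipe (|) sends the standard output of one command to the standard input of the next.
seq 1 10 | grep 1 | wc -l      # count the numbers from 1 to 10 that contain a "1"

seq 1 5 > numbers.txt          # > writes output to a file, overwriting it
seq 6 10 >> numbers.txt        # >> appends to the end of the file
wc -l < numbers.txt            # < reads standard input from a file

n_lines="$(wc -l < numbers.txt)"
```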


Compression and archiving

Large files often contain redundant data and can be stored using less space on disk in a compressed format. Depending on the system and the file, compression can make reading from or writing to a file more efficient as reading the bits off disk is an “I/O-bound” task while decoding/decompressing is a “CPU-bound” task. This is particularly useful on shared systems with I/O bottlenecks.

Disk utilization

The du utility, short for disk usage, reports the space on disk used by a file or set of files. Use the -h option to print values in human-readable units. Use -s to get a single total for a directory.
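A quick sketch of both options on a generated file; the directory and file names are hypothetical and the sizes reported will vary by file system.

```shell
scratch="$(mktemp -d)"
mkdir "$scratch/data"
seq 1 100000 > "$scratch/data/numbers.txt"

du -h "$scratch/data/numbers.txt"   # size of one file in human-readable units
du -sh "$scratch/data"              # -s: a single total for the whole directory
usage="$(du -sh "$scratch/data")"
```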

gzip

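There are many compression tools; one of the most popular is gzip. The command gzip file.txt compresses file.txt into file.txt.gz and removes the uncompressed original.

The file can be uncompressed using gunzip file.txt.gz or gzip -d file.txt.gz. Because gzip appends .gz to the full file name, the original name and extension are restored.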

You can retain the compressed copy and unzip directly to standard output using the -c option:

zcat is a shortcut that does the same thing.
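A round trip through gzip, sketched on a generated file. The file names are hypothetical, and zcat is assumed to be the GNU version found on Linux systems.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
seq 1 1000 > file.txt
cp file.txt original.txt      # keep a copy to compare against later

gzip file.txt                 # replaces file.txt with file.txt.gz
ls                            # lists file.txt.gz and original.txt

gunzip -c file.txt.gz | head -n 3   # decompress to standard output; the .gz file stays
zcat file.txt.gz | wc -l            # on Linux, zcat behaves like gunzip -c

gunzip file.txt.gz            # restores file.txt (gzip -d file.txt.gz is equivalent)
cmp file.txt original.txt && echo "round trip OK"
```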

tar

A tarball is an archive of a file tree and is often compressed. This can be useful for transferring whole directories between machines manually. It is also a way to cleanly archive files from projects you would like to retain, but no longer need to use frequently. Many programs have the ability to work directly with archived and/or compressed data.

The two most common use cases are creating an archive,

and extracting the archive,

The extension .tgz is short for .tar.gz indicating that the archive has been compressed using gzip.
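Both use cases can be sketched on a small directory tree. The directory and file names are hypothetical, and the flags shown are one common spelling among several that tar accepts.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
mkdir -p project/data
echo "x,y" > project/data/points.csv
echo "notes" > project/README.txt

# c = create, z = compress with gzip, f = name of the archive file
tar -czf project.tgz project

tar -tzf project.tgz          # t = list the contents without extracting

# x = extract; -C unpacks into another directory instead of the current one
mkdir restored
tar -xzf project.tgz -C restored
diff -r project restored/project && echo "archive matches original"
```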

Other common tools

You will likely find the following command line tools useful:
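The sketch below assumes that a handful of standard text-processing tools (head, tail, wc, grep, cut, sort, uniq) are among those meant; the sample file is made up for illustration.

```shell
scratch="$(mktemp -d)"
cd "$scratch"
# Made-up sample data: a header row followed by four records.
printf 'id,region,value\n1,east,10\n2,west,20\n3,east,30\n4,south,40\n' > sample.csv

head -n 2 sample.csv                 # first two lines
wc -l sample.csv                     # number of lines
grep east sample.csv                 # lines containing "east"
cut -d, -f2 sample.csv               # second comma-separated field

# Count records per region: drop the header, keep field 2, sort, count repeats.
tail -n +2 sample.csv | cut -d, -f2 | sort | uniq -c

n_east="$(grep -c east sample.csv)"
```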

We will look at examples in class as time permits. You may also wish to read chapter 5 of Data Science at the Command Line.


Shell scripting

A shell script is a program constructed from shell commands. You can view an example here.

For more on shell scripting, see Chapter 4 from Data Science at the Command Line by Jeroen Janssens.
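To give a flavor of what a script looks like, here is a minimal sketch. It is not the course's linked example; count_lines and the file name count_lines.sh are hypothetical.

```shell
#!/bin/bash
# Hypothetical script: report the number of lines in each file named on the command line.
# Save as count_lines.sh, then run:  bash count_lines.sh file1 file2

count_lines() {
  for f in "$@"; do
    if [ -f "$f" ]; then
      # Read the file on standard input so wc prints only the count.
      n=$(wc -l < "$f" | tr -d ' ')
      echo "$f: $n lines"
    else
      echo "$f: not a regular file" >&2
    fi
  done
}

# "$@" passes every command-line argument through, preserving spaces in names.
count_lines "$@"
```

Wrapping the work in a function keeps the script easy to extend, for example by adding a second function that reports file sizes.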

Statistics 506, Fall 2019 Homepage