Readings

The readings below are assigned through Canvas:

  1. Linux Shell Skills from Prof. Shedden’s 2016 Course notes

  2. Tactical Tmux by Daniel Miessler

  3. Statistics and Computation Service by UM ITS


Terminals

If your computer runs macOS or Linux, a terminal program is already installed, typically named Terminal.

If your computer runs Windows, you will need to install either a terminal or shell program that can act as a Linux terminal or a standalone ssh client:

Experienced users may be interested in zsh – now the default in Mac OS – but we will not make use of its additional features for this class.

You can also access a virtual Windows machine using MiDesktop, with both of the programs above available through AppsAnywhere.


Connecting to UM Machines

AFS

In order to connect to university Linux servers, you need to have an AFS home directory. If you do not have one, you can set it up by visiting http://mfile.umich.edu/ and selecting the ‘AFS Self-Provisioning Tool’.

Hosts

You can connect to a UM Linux server using ssh as follows:

  ssh uniqname@login.itd.umich.edu
  ssh uniqname@scs.dsc.umich.edu
  ssh uniqname@mario.dsc.umich.edu
  ssh uniqname@luigi.dsc.umich.edu

Replace uniqname with your UM unique name, which is the same as the part of your UM email address preceding @umich.edu.

Trouble connecting

If you have trouble connecting to the SCS servers please visit this help page.

Exercises

  1. In a web browser, visit https://mfile.umich.edu and log in to view your files. If needed, use the self-provisioning tool first. Here is a direct link to the self-provisioning tool: https://ifsprovisioning.its.umich.edu/ifs_storage/request .
  2. Find or install your terminal program and open it.
  3. Connect to scs.dsc.umich.edu using ssh. Which host were you connected to? Log out.
  4. Connect to login.itd.umich.edu using ssh. Which host were you connected to? Log out and connect again. Were you connected to the same host?

File system

In Linux essentially everything is a file: this includes program executables, system configurations, as well as your own data and source files.

Files are organized hierarchically into directories beginning with the root directory /. Directories can contain files and sub-directories with locations in the directory hierarchy separated by a /. This collection of directories and files is called a file tree.

Use the following commands to navigate and interact with the file tree:
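As an illustration, a short session using standard commands might look like the sketch below. It runs in a temporary directory created with mktemp so that none of your real files are touched; the directory and file names (project, data, notes.txt) are hypothetical.

```shell
# Work in a throwaway directory.
demo_dir=$(mktemp -d)
cd "$demo_dir"

pwd                                                # print the current working directory
mkdir -p project/data                              # create a directory and any missing parents
touch project/data/notes.txt                       # create an empty file
ls -l project/data                                 # list the contents of a directory
cp project/data/notes.txt project/notes_copy.txt   # copy a file
mv project/notes_copy.txt project/notes_old.txt    # rename (move) a file
cd project/data                                    # move down into a sub-directory
cd ..                                              # move up one level, back to project/
here=$(basename "$PWD")                            # last component of the current path
rm notes_old.txt                                   # delete a file (there is no undo)
ls                                                 # only the data directory remains
```

Paths that start with / are absolute (resolved from the root directory); all other paths, such as project/data above, are resolved relative to the current working directory.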

In working with files, it is helpful to know:

Environment variables

Environment variables determine certain aspects of how the OS behaves and responds to your instructions. Here are three important ones:

  • HOME (location of your home directory)
  • SHELL (the shell you are using to interface with the machine)
  • PATH (locations to search for executable programs)

In the Bash shell, prefix the name of an environment variable with $ to access its value. The echo command prints these values to the screen, e.g. echo $SHELL. You can also recognize the Bash shell by its default prompt $, versus the % prompt used by csh, zsh, and some other shells.

Use which to search your $PATH for an executable command.

The shell usually expands a leading tilde ~ to the value of $HOME.
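The sketch below pulls these pieces together. It uses command -v, the shell built-in that performs the same $PATH lookup as which for ordinary programs; the variable COURSE is a hypothetical name introduced only for illustration.

```shell
echo "$HOME"     # location of your home directory
echo "$SHELL"    # your login shell
echo "$PATH"     # colon-separated list of directories searched for programs

# Look up where the ls executable lives; `which ls` typically prints the same path.
ls_path=$(command -v ls)
echo "$ls_path"

# A tilde at the start of a word expands to $HOME.
tilde_dir=~
echo "$tilde_dir"

# export makes a variable visible to programs started from this shell.
export COURSE="stats506"
child_sees=$(sh -c 'echo "$COURSE"')
echo "$child_sees"
```

Variables set without export stay local to the current shell and are not passed on to the programs it starts.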

Resources

For more see the GNU Coreutils documentation.

Text Editors

In order to edit files in the shell, you will need to use a text editor. Some popular choices are:

I personally use emacs and that will be the editor I use in examples presented to the class.

If you do not already have a preferred text editor, please pick one to learn for the course, find a tutorial on it, and work through that tutorial. I also recommend finding a “cheat sheet” or reference card you can refer to as needed.


Terminal Multiplexers

A terminal multiplexer allows you to invoke multiple shells from the same terminal connection and to keep these sessions running after you log off. The two most common are screen and tmux, with the latter being the preferred option for this course.

When using a terminal multiplexer with a networked file system such as AFS, be aware that your credentials or “ticket” for accessing the networked files will typically expire after a fixed amount of time (e.g. 24 hours). You can renew this ticket for a fixed amount of time using kinit:

kinit -4d
aklog

Transferring data

There are many ways to transfer data to a remote server using the shell. Three common ways to do this from the command line are:

scp

To transfer a single smallish file from the working directory on your local machine to your AFS space:

scp ./local_file.ext uniqname@login.itd.umich.edu:~/remote_directory/

To transfer a file from the remote directory to your local computer reverse the arguments:

scp uniqname@login.itd.umich.edu:~/remote_directory/remote_file.ext ./

For larger transfers you should use sftp for efficiency.

wget

To download data directly from a website to a remote server, use a web browser to find the URL of the file and then pass that URL to wget on the server:

wget https://remote.url.edu/path/to/file/data.txt

sftp

You can use sftp for interactive data transfer sessions.

First, connect to the server using the sftp command.

sftp uniqname@login.itd.umich.edu

Then, use put to transfer files from the local host to the remote server or get to transfer files in the other direction. Use familiar file system commands such as ls and cd to navigate the file tree on the remote server.

If you want to read more, refer to the documentation using man sftp on one of the university Linux servers.

General Advice

Transfer data to and from your AFS space using the login pool to avoid adding strain to the computation servers.

Make sure you only download from trusted sources!

Example

We will work through the following example.

Exercise

Use one of the methods from the example above to transfer the RECS data from the example to your AFS space.


Streams, pipes and file redirections

There are three standard streams or channels used to communicate data in most computer programs: stdin (standard input), stdout (standard output), and stderr (standard error). In the Linux shell, these streams are files located in the /dev/ directory, e.g. /dev/stdin, /dev/stdout and /dev/stderr.

Most command line tools use stdout and stderr to communicate with the user, as these streams print to the console. However, these streams can be redirected using the symbols > (for stdout) and 2> (for stderr); in Bash, &> redirects both at once. Similarly, we can redirect the contents of a regular file to stdin using <, e.g. < file.txt.

For example, the command below will print “hello!” to stdout.

echo hello!

In comparison, the command below redirects the text to a file welcome.txt.

echo hello! > welcome.txt

We can also append stdout to an existing file using >>.

echo 'stats 506!' >> welcome.txt

The file we’ve just created can be redirected to stdin in the example below, where we translate the alphabetic characters to all caps using the tr command.

< welcome.txt tr '[a-z]' '[A-Z]'

Finally, we can build pipelines combining multiple command line tools by redirecting stdout from one command to stdin for the next using a pipe |. Short pipelines such as these are often called “one-liners”.

echo 'hello stats 506!' | tr '[a-z]' '[A-Z]' | tr ' ' \\n
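The examples above only redirect stdout. As a sketch of redirecting stderr as well, the commands below ask ls for one file that exists and one that does not, so that it writes to both streams. The file names (exists.txt, missing.txt, and the .log files) are hypothetical, and everything happens in a temporary directory.

```shell
work=$(mktemp -d)
cd "$work"
touch exists.txt

# The listing goes to stdout and the complaint about missing.txt goes to stderr.
# ls exits with a non-zero status here, hence the `|| true`.
ls exists.txt missing.txt > out.log 2> err.log || true

# 2>&1 sends stderr to wherever stdout currently points, so both land in one file.
ls exists.txt missing.txt > both.log 2>&1 || true

# Redirect stderr to /dev/null to discard error messages entirely.
ls missing.txt 2> /dev/null || true

cat out.log   # the listing only
cat err.log   # the error message only
```

The order matters in the second command: writing 2>&1 before > both.log would leave stderr attached to the console.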

For additional examples read section 2.3 (especially 2.3.4 and 2.3.5) of Data Science at the Command Line.


Compression and archiving

Large files often contain redundant data and can be stored using less space on disk in a compressed format. Depending on the system and the file, compression can also make reading from or writing to a file faster: reading bits off disk is an “I/O-bound” task, while compressing or decompressing is a “CPU-bound” task, so a smaller file trades disk I/O for CPU time. This is particularly useful on shared systems with I/O bottlenecks.

Disk utilization

The du (disk usage) utility reports the space on disk used by a file or set of files. Use the -h option to print values in human-readable units. Use -s to print a single summary total for each argument, such as a directory.
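A minimal sketch of these options, using a temporary directory and a hypothetical file numbers.txt generated with seq:

```shell
work=$(mktemp -d)
seq 1 100000 > "$work/numbers.txt"   # a text file of a few hundred kilobytes

du -h "$work/numbers.txt"            # space used by one file, human-readable units
du -sh "$work"                       # one summary total for the whole directory

# -k reports sizes in kilobytes, which is easier to use in scripts than -h.
size_kb=$(du -sk "$work" | cut -f1)
echo "$size_kb"
```

Note that du reports the space allocated on disk, which can differ from the byte count reported by ls -l or wc -c.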

gzip

One of the most popular of the many compression tools is gzip. The command,

gzip file.txt

compresses file.txt into file.txt.gz and removes the original.

The file can be uncompressed using,

gunzip file.txt.gz

or

gzip -d file.txt.gz

Either command restores file.txt; the original extension is retained in the compressed file's name.

You can retain the compressed copy and unzip directly to standard output using the -c option:

gunzip -c file.txt.gz > file.txt

zcat file.txt.gz is a shortcut for gunzip -c file.txt.gz.
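The round trip can be checked with a short sketch like the one below. It assumes gzip is installed, works in a temporary directory, and generates a hypothetical file.txt with seq; original.txt and restored.txt are likewise names chosen for illustration.

```shell
work=$(mktemp -d)
cd "$work"
seq 1 10000 > file.txt
cp file.txt original.txt          # keep an uncompressed copy for comparison

gzip file.txt                     # replaces file.txt with file.txt.gz
ls                                # now shows file.txt.gz and original.txt

# Decompress to standard output, leaving file.txt.gz in place.
gunzip -c file.txt.gz > restored.txt

cmp original.txt restored.txt     # prints nothing when the files are identical
wc -c original.txt file.txt.gz    # compare sizes in bytes
```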

tar

A tarball is an archive of a file tree and is often compressed. This can be useful for transferring whole directories between machines manually. It is also a way to cleanly archive files from projects you would like to retain but no longer need to use frequently. Many programs have the ability to work directly with archived and/or compressed data.

The two most common use cases are creating an archive,

tar cvfz name.tgz ./parent_folder

and extracting the archive,

tar xvfz name.tgz

The extension .tgz is short for .tar.gz indicating that the archive has been compressed using gzip.
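To see both commands in action, the sketch below builds a small hypothetical file tree in a temporary directory, archives it, and extracts it into a separate directory (named extracted here) using the -C option. It assumes tar and gzip are installed.

```shell
work=$(mktemp -d)
cd "$work"
mkdir -p parent_folder/sub
echo "a" > parent_folder/a.txt
echo "b" > parent_folder/sub/b.txt

# c = create, v = verbose, f = archive file name follows, z = compress with gzip
tar cvfz name.tgz ./parent_folder

# t = list the contents without extracting anything
tar tzf name.tgz

# x = extract; -C switches to the given directory before extracting
mkdir extracted
tar xvfz name.tgz -C extracted

# diff -r prints nothing when the two trees are identical.
diff -r parent_folder extracted/parent_folder
```

Listing an archive with t before extracting is a useful habit, since extraction overwrites existing files with the same names.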

Other common tools

You will likely find the following command line tools useful:

You may also wish to read chapter 5 of Data Science at the Command Line.


Shell scripting

A shell script is a program constructed from shell commands. You can view an example here.

For more on shell scripting, see Chapter 4 from Data Science at the Command Line by Jeroen Janssens.
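As a minimal sketch (not the course's own example), the commands below write a small script to disk and run it. The script name count_lines.sh and the input file three.txt are hypothetical.

```shell
work=$(mktemp -d)
cd "$work"

# Write the script with a here-document; quoting 'EOF' keeps $ unexpanded.
cat > count_lines.sh << 'EOF'
#!/bin/sh
# Print the number of lines in each file named on the command line.
for f in "$@"; do
  n=$(wc -l < "$f")
  echo "$f: $((n))"
done
EOF

# chmod +x allows running the script as ./count_lines.sh;
# passing it to sh explicitly works even without execute permission.
chmod +x count_lines.sh

printf 'a\nb\nc\n' > three.txt
result=$(sh count_lines.sh three.txt)
echo "$result"
```

The first line of the script (the “shebang”) names the interpreter used when the script is executed directly, and "$@" expands to the script's command line arguments.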

Statistics 506, Fall 2020 Homepage