The readings below are assigned through Canvas:
Linux Shell Skills from Prof. Shedden’s 2016 Course notes
Tactical Tmux by Daniel Miessler
Statistics and Computation Service by UM ITS
If you are using a computer running macOS or Linux, you already have a Unix-style terminal installed; on macOS it is the Terminal application.
If you are using a computer running Windows, you will need to install a terminal or shell program, or a standalone ssh client:
Experienced users may be interested in zsh – now the default in Mac OS – but we will not make use of its additional features for this class.
You can also access a virtual Windows machine using MiDesktop, with both of the programs above available through AppsAnywhere.
In order to connect to university Linux servers, you need to have an AFS home directory. If you do not have one, you can set it up by visiting http://mfile.umich.edu/ and selecting the ‘AFS Self-Provisioning Tool’.
You can connect to a UM Linux server using ssh as follows:
ssh uniqname@login.itd.umich.edu
ssh uniqname@scs.dsc.umich.edu
ssh uniqname@mario.dsc.umich.edu
ssh uniqname@luigi.dsc.umich.edu
Replace uniqname with your UM unique name, which is the same as the part of your UM email address preceding @umich.edu.
If you have trouble connecting to the SCS servers please visit this help page.
Connect to scs.dsc.umich.edu using ssh. Which host were you connected to? Log out.
Connect to login.itd.umich.edu using ssh. Which host were you connected to? Log out and connect again. Were you connected to the same host?
In Linux essentially everything is a file: this includes program executables, system configurations, as well as your own data and source files.
Files are organized hierarchically into directories beginning with the root directory /. Directories can contain files and sub-directories, with locations in the directory hierarchy separated by a /. This collection of directories and files is called a file tree.
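As a concrete illustration of a file tree, the sketch below builds a small hierarchy and walks through it. The directory and file names (course, hw1, data, notes.txt) are hypothetical, and everything happens in a temporary scratch directory so nothing in your home directory is touched.

```shell
# Work in a throwaway scratch directory.
cd "$(mktemp -d)"

# Create a nested set of directories; -p also creates the intermediate ones.
mkdir -p course/hw1/data

# Move down the tree and print where we are (an absolute path starting at /).
cd course/hw1/data
pwd

# Move one level up to hw1, create an empty file, and list the contents.
cd ..
touch notes.txt
ls -l

# Return to the scratch directory and search the tree for the file.
cd ../..
find . -name 'notes.txt'    # prints ./course/hw1/notes.txt
```

The path printed by find is relative to the directory the search started from, which is why it begins with a single dot.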
Use the following commands to navigate and interact with the file tree:
ls (list files), ls -a, ls -l
cd (change directories)
pwd (print the current or working directory)
mkdir (make directory), mkdir -p
rmdir (remove directory)
rm (remove a file), rm -r
mv (move a file or directory)
find (find a file)
In working with files, it is helpful to know:
. refers to the current directory,
.. refers to the parent directory, one step up the file tree.
cd invoked with no arguments will return you to your home directory.
Files whose names begin with . are hidden. To see these files, use ls -a.
* matches any sequence of characters
? matches any single character.
Environment variables determine certain aspects of how the OS behaves and responds to your instructions. Here are a couple of important ones:
In the Bash shell, use $ to access the value of an environment variable. The echo command can be used to print these values to the screen, e.g. echo $SHELL. You can also recognize the bash shell by its prompt $ vs the % prompt used by csh, zsh, and other shells.
Use which to search your $PATH for an executable command.
A tilde ~ will often be expanded as $HOME.
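A minimal sketch of inspecting these variables follows. The values printed depend entirely on your system, so no output is shown; the variable name home_from_tilde is introduced here purely for illustration.

```shell
# Print a few common environment variables.
# ${SHELL:-not set} prints a fallback if the variable happens to be undefined.
echo "Shell: ${SHELL:-not set}"
echo "Home:  $HOME"
echo "Path:  $PATH"

# Search the directories listed in $PATH for the ls executable.
which ls

# An unquoted tilde is expanded by the shell, normally to the value of $HOME.
home_from_tilde=~
echo "$home_from_tilde"
```

Note that a tilde inside quotes is not expanded, which is why it is left unquoted in the assignment.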
For more see the GNU Coreutils documentation.
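The hidden-file and wildcard rules noted earlier in this section can be checked with a short experiment. This is only a sketch: the file names are made up for the illustration and the work is done in a temporary scratch directory.

```shell
# Work in a throwaway scratch directory.
cd "$(mktemp -d)"

# Create one hidden file and three ordinary files.
touch .hidden a1.txt a22.txt b1.txt

ls          # the hidden file is not listed
ls -a       # lists ., .., and .hidden as well

ls a?.txt   # ? matches exactly one character: a1.txt
ls a*.txt   # * matches any sequence: a1.txt a22.txt
ls *.txt    # all three .txt files; .hidden is not matched
```

The shell expands the wildcards before ls runs, so ls only ever sees the resulting list of file names.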
In order to edit files in the shell, you will need to use a text editor. Some popular choices are:
I personally use emacs and that will be the editor I use in examples presented to the class.
If you do not already have a preferred text editor, please pick one to learn for the course, find a tutorial on it, and work through that tutorial. I also recommend finding a “cheat sheet” or reference card you can refer to as needed.
A terminal multiplexer allows you to invoke multiple shells from the same terminal connection and to keep these sessions running after you log off. The two most common are screen and tmux, with the latter being the preferred option for this course.
When using a terminal multiplexer with a networked file system such as AFS, be aware that your credentials or “ticket” for accessing the networked files will typically expire after a fixed amount of time (e.g. 24 hours). You can renew this ticket for a fixed amount of time using kinit:
There are many ways to transfer data to a remote server using the shell. Three common ways to do this from the command line are:
scp to copy to/from your local computer,
wget to download directly from the web,
sftp or ‘secure file transfer protocol’ for transferring large volumes of data.
scp
To transfer a single smallish file from the working directory on your local machine to your AFS space:
To transfer a file from the remote directory to your local computer reverse the arguments:
For larger transfers you should use sftp for efficiency.
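The two scp directions described above can be sketched as a pair of small shell functions. The function names upload and download are hypothetical helpers, not part of scp, and uniqname is a placeholder you would replace with your own. Defining the functions does not contact any server; a transfer only happens when one of them is called.

```shell
# Assumption: the login pool host from earlier in these notes.
# Replace uniqname with your own UM unique name.
REMOTE="uniqname@login.itd.umich.edu"

# Hypothetical helper: copy a local file into your remote home directory.
# An empty path after the colon refers to the remote home directory.
upload() {
  scp "$1" "$REMOTE:"
}

# Hypothetical helper: copy a file from your remote home directory
# into the current local directory.
download() {
  scp "$REMOTE:$1" .
}
```

After pasting these into a local terminal, upload data.csv would send a file named data.csv from your working directory, and download data.csv would fetch it back. In both cases scp takes the source first and the destination second.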
wget
To download data directly from a website to a remote server use a web browser to find the URL to the file and use wget:
sftp
You can use sftp for interactive data transfer sessions.
First, connect to the server using the sftp command.
Then, use put to transfer files from the local host to the remote server or get to transfer files in the other direction. Use familiar file system commands such as ls and cd to navigate the file tree on the remote server.
If you want to read more, refer to the documentation using man sftp on one of the University Linux servers.
Transfer data to and from your AFS space using the login pool to avoid adding strain to the computation servers.
Make sure you only download from trusted sources!
Use one of the methods from the example above to transfer the RECS data from the example to your AFS space.
There are three standard streams or channels used to communicate data in most computer programs: stdin (standard input), stdout (standard output), and stderr (standard error). In the Linux shell, these streams are files located in the /dev/ directory, e.g. /dev/stdin, /dev/stdout and /dev/stderr.
Most command line tools utilize stdout and stderr to communicate to the user as these streams print to the console. However, these streams can be redirected using the symbols > (for stdout) and 2> (for stderr); in Bash, &> redirects both at once. Similarly we can redirect the contents of a regular file to stdin using <, e.g. < file.txt.
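To see the streams separated, here is a small sketch that writes one line to each of stdout and stderr and sends them to different files. In POSIX shells stderr is file descriptor 2, so 2> redirects it and >&2 writes to it. The file names out.txt and err.txt are arbitrary, and the work is done in a temporary scratch directory.

```shell
# Work in a throwaway scratch directory.
cd "$(mktemp -d)"

# The first echo writes to stdout; the second is sent to stderr with >&2.
# The braces group the two commands so both redirections apply to the group.
{ echo "to stdout"; echo "to stderr" >&2; } > out.txt 2> err.txt

cat out.txt    # to stdout
cat err.txt    # to stderr

# Redirect a file to stdin: wc reads the file's contents from stdin.
wc -l < out.txt
```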
For example, the command below will print “hello!” to stdout.
In comparison, the command below redirects the text to a file welcome.txt.
We can also append stdout to an existing file using >>.
The file we’ve just created can be redirected to stdin in the example below, where we translate the alphabetic characters to all caps using the tr command.
Finally, we can build pipelines combining multiple command line tools by redirecting stdout from one command to stdin for the next using a pipe |. Short pipelines such as these are often called “one-liners”.
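The redirection steps described above can be sketched as follows. These commands are a reconstruction for illustration rather than the exact ones used in class; the appended line of text and the final one-liner are made up. Single quotes are used so that an interactive Bash session does not try to interpret the ! character.

```shell
# Work in a throwaway scratch directory.
cd "$(mktemp -d)"

# Print to stdout.
echo 'hello!'

# Redirect stdout to a file, overwriting welcome.txt if it exists.
echo 'hello!' > welcome.txt

# Append a second line to the existing file.
echo 'welcome to the shell' >> welcome.txt

# Redirect the file to stdin of tr and translate lower case to upper case.
tr '[:lower:]' '[:upper:]' < welcome.txt

# A one-liner: grep passes matching lines through the pipe, wc -l counts them.
grep 'hello' welcome.txt | wc -l
```

Only the first line of welcome.txt contains the string hello, so the final pipeline counts a single line.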
For additional examples read section 2.3 (especially 2.3.4 and 2.3.5) of Data Science at the Command Line.
Large files often contain redundant data and can be stored using less space on disk in a compressed format. Depending on the system and the file, compression can make reading from or writing to a file more efficient as reading the bits off disk is an “I/O-bound” task while decoding/decompressing is a “CPU-bound” task. This is particularly useful on shared systems with I/O bottlenecks.
The du (disk usage) utility can be used to see the space on disk used by a file or set of files. Use the -h option to print values in human readable units. Use -s to get sum totals for a directory.
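A brief sketch of these options, using a hypothetical project directory filled with generated numbers. The sizes reported depend on the file system, so no output is shown.

```shell
# Work in a throwaway scratch directory with some sample data.
cd "$(mktemp -d)"
mkdir -p project/data
seq 1 1000 > project/data/a.txt
seq 1 5000 > project/data/b.txt

# Space used by a single file, in human readable units.
du -h project/data/a.txt

# Without -s, du reports each sub-directory separately.
du -h project

# With -s, du prints one summed total for the whole directory.
du -sh project
```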
There are many compression tools; one of the most popular is gzip. The command,
compresses file.txt into file.gz.
The file can be uncompressed using,
or
The original extension is stored in the compressed file.
You can retain the compressed copy and unzip directly to standard output using the -c option:
zcat is a shortcut that does the same thing.
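As a hedged sketch of a typical round trip, the commands below compress and restore a generated sample file. The file name numbers.txt is hypothetical. Note that when gzip is given a file name with no other options, it appends .gz to the full name, so numbers.txt becomes numbers.txt.gz.

```shell
# Work in a throwaway scratch directory with some sample data.
cd "$(mktemp -d)"
seq 1 10000 > numbers.txt

# Compress in place: numbers.txt is replaced by numbers.txt.gz.
gzip numbers.txt
ls

# Decompress to stdout with -d and -c, leaving the compressed copy on disk.
gzip -dc numbers.txt.gz | wc -l

# Restore the original file: numbers.txt.gz is replaced by numbers.txt.
gunzip numbers.txt.gz
ls
```

Streaming with -c is convenient when a downstream tool can read from stdin, since the uncompressed data never has to be written to disk.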
A tarball is an archive of a file tree and is often compressed. This can be useful for transferring whole directories between machines manually.
It is also a way to cleanly archive files from projects you would like to retain, but no longer need to use frequently. Many programs have the ability to work directly with archived and/or compressed data.
The two most common use cases are creating an archive,
and extracting the archive,
The extension .tgz is short for .tar.gz, indicating that the archive has been compressed using gzip.
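The two use cases above, creating and extracting a gzip-compressed archive, might look like the following. This is a sketch with hypothetical names (project, points.csv, restore), run in a temporary scratch directory.

```shell
# Work in a throwaway scratch directory with a small file tree.
cd "$(mktemp -d)"
mkdir -p project/data
echo 'x,y' > project/data/points.csv

# Create an archive: c = create, z = compress with gzip, f = archive file name.
tar -czf project.tgz project

# List the contents of the archive without extracting it.
tar -tzf project.tgz

# Extract the archive (x = extract) into a separate directory chosen with -C.
mkdir restore
tar -xzf project.tgz -C restore
ls restore/project/data
```

Extracting into a separate directory avoids overwriting the original tree while you check that the archive is complete.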
You will likely find the following command line tools useful:
head - read the first n lines of a file
tail - read the last n lines of a file
wc - count words or use wc -l to count lines
grep - find lines in files that match string patterns
sort - sort a file on one or more fields
cut - extract select columns from a delimited file
paste - concatenate files line by line
join - merge two files based on a common field.
nl - number the lines in a file.
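To give a feel for how these tools fit together, here is a sketch that applies each of them to two tiny comma-delimited files. The file names and contents are invented for the example. Note that join expects both inputs to be sorted on the join field, so the header rows are dropped and the data rows sorted first.

```shell
# Work in a throwaway scratch directory with two small sample files.
cd "$(mktemp -d)"
printf 'id,name\n3,cat\n1,ant\n2,bee\n' > animals.csv
printf 'id,legs\n1,6\n2,6\n3,4\n' > legs.csv

head -n 2 animals.csv        # first two lines
tail -n 2 animals.csv        # last two lines
wc -l animals.csv            # number of lines
grep 'bee' animals.csv       # lines containing "bee"
cut -d, -f2 animals.csv      # second comma-delimited column
nl animals.csv               # prefix each line with its number

# Drop the header (tail -n +2 starts at line 2) and sort on the first field.
tail -n +2 animals.csv | sort -t, -k1,1 > animals_sorted.csv
tail -n +2 legs.csv | sort -t, -k1,1 > legs_sorted.csv

# Merge the two files on their common first field.
join -t, animals_sorted.csv legs_sorted.csv > merged.csv
cat merged.csv

# Glue the files together line by line, without matching on any field.
paste -d, animals_sorted.csv legs_sorted.csv
```

The difference between the last two commands is that join matches rows by key, while paste simply pairs up lines by position.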
You may also wish to read chapter 5 of Data Science at the Command Line.