Course: Running Bioinformatics Software on a Linux Computer Cluster
Licence
This manual is © 2025, Steven Wingett
This manual is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:
• to copy, distribute, display, and perform the work
• to make derivative works
Under the following conditions:
• Attribution. You must give the original author credit.
• Non-Commercial. You may not use this work for commercial purposes.
• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.
Please note that:
• For any reuse or distribution, you must make clear to others the licence terms of this work.
• Any of these conditions can be waived if you get permission from the copyright holder.
• Nothing in this licence impairs or restricts the author’s moral rights.
Full details of this licence can be found at http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode
Connect to the LMB cluster (you should already be connected to the LMB intranet) using a Mac terminal or PuTTY for Windows.
What message do you see when you log in?
Is your username displayed when you log in?
Connect to the cluster using FileZilla.
Using FileZilla, create a folder in your cluster home directory named LMB_Cluster_Course_Exercises.
Note: your home directory will be named after your username (e.g. /lmb/home/jsmith).
Copy the data files given to you at the start of this course into the directory on the cluster you just created.
Change your working directory to LMB_Cluster_Course_Exercises (which should already be in your home directory).
Confirm your new working directory with pwd.
Create a folder in here named Exercise2.
Rename this directory to Exercise_2.
Go into the folder Exercise_2.
Type ls to confirm the folder is empty.
We’ve not introduced the touch command so far, so let’s try it now. Enter on the command line:
touch file1.txt
Has anything happened?
file1.txt? (You will need to use a command to do this.)
Make a copy of file1.txt named file2.txt.
Are the file sizes the same?
file1.txt.
Go “up a level” in the filesystem hierarchy (when you have done this you should be able to see the folder Exercise_2 when you run the command ls).
Delete the folder Exercise_2 with the command rmdir.
Did that work? If not, why not? Find a way to delete Exercise_2 (look in the manual about recursive deletions).
Display the recent Bash commands you have entered.
Only perform this step if you were given the data files in the form of a tar archive.
Let’s open the archive file we copied from your local machine to the cluster:
tar xvzfp [course_title].tar.gz
(Use the actual filename when running this command i.e. not “[course_title]”.)
Apparently, certain web browser settings cause the archive to be unzipped upon downloading. If your archive on the cluster ends with the file extension .tar instead of .tar.gz, then execute the following command:
tar xvfp [course_title].tar
(The tar command is useful, as it allows multiple files and the associated file hierarchy to be stored within a single archive file. You don’t need to understand this command at the moment.)
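For the curious, the sketch below illustrates what "multiple files and the associated file hierarchy in a single archive" means in practice. It is only an illustration: the folder demo_course and the files inside it are made-up names, and everything happens in a temporary directory so your course files are not affected.

```shell
#!/bin/bash
# Work in a throwaway directory so nothing in your home directory is touched.
workdir=$(mktemp -d)
cd "$workdir"

# Build a small, hypothetical folder hierarchy to archive.
mkdir -p demo_course/data
echo "hello" > demo_course/readme.txt
echo "1,2,3" > demo_course/data/values.csv

# c = create, z = compress with gzip, f = name of the archive file.
tar czf demo_course.tar.gz demo_course

# t = list the contents of the archive without extracting it.
tar tzf demo_course.tar.gz

# Remove the original folder, then restore it from the archive.
# x = extract, v = verbose, p = preserve file permissions.
rm -r demo_course
tar xvzfp demo_course.tar.gz
```

After the final command the folder demo_course is back, with its subfolder and files in the same places as before.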
Explore the MAZE folder. Use cd to move around the maze and ls to check what is in each room. Can you find the treasure?
In the folder Exercise_3 you will find a file entitled poem.txt. Write out the contents to the screen using the command cat.
Now run a command so you can scroll through the text one page at a time.
Write out the first 7 lines of the poem to the screen.
Write out the last 3 lines of the poem to the screen.
Add the line “The Waste Land by T. S. Eliot” to the start of the file, with a blank line below it. Save the file.
Compress the file. Notice whether the file size changed after compression. Did the filename change?
Read the compressed file with cat and then zcat. What happens in each case?
What does the command date do?
Try this command and redirect the output to a file named date.txt.
Now try the command again, but this time append the output to date.txt.
View the contents of date.txt.
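If the difference between redirecting and appending is unclear, the following sketch shows the two operators side by side. It uses echo and a made-up file named notes.txt in a temporary directory, so it does not replace the date exercise above.

```shell
#!/bin/bash
# Work in a throwaway directory; notes.txt is a hypothetical file name.
workdir=$(mktemp -d)
cd "$workdir"

# ">" creates notes.txt, or overwrites it if it already exists.
echo "first line" > notes.txt

# ">>" adds to the end of the file and keeps what is already there.
echo "second line" >> notes.txt
cat notes.txt

# Using ">" again discards the previous contents.
echo "replacement" > notes.txt
cat notes.txt
```

The first cat shows two lines; the second shows only the replacement line, because > started the file afresh.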
What is the result of adding the flag --version to date?
Look at the file named uk_counties.csv. Using the grep command, create a new file named scotland_counties.csv that contains only Scottish counties.
Using the grep command, create a new file named other_counties.csv that contains all counties except Scottish counties. We advise reading the manual on grep to assist with this. (Hint: this is inverting the grep matching.)
Using the commands head and tail and also using a pipe, create a file named subset_counties.csv that contains the counties on lines 27-39 of the original file.
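As a hint for the last task, here is a minimal sketch of the general idea of chaining head and tail with a pipe. It uses a made-up file of the numbers 1 to 10 and a different line range, so you will still need to work out the right values for the counties file.

```shell
#!/bin/bash
# Work in a throwaway directory; numbers.txt is a hypothetical example file.
workdir=$(mktemp -d)
cd "$workdir"

# Create a file with one number per line, from 1 to 10.
seq 1 10 > numbers.txt

# head keeps lines 1-6; tail then keeps the last 3 of those (lines 4-6).
# The result is redirected to a new file.
head -n 6 numbers.txt | tail -n 3 > middle.txt
cat middle.txt
```

The number given to head is the last line wanted, and the number given to tail is how many lines the range contains.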
Using a single wildcard, create symbolic links to files in the files_list folder that meet the following criteria:
.tsv.
B or C and end with the file extension .txt.
The links should be generated outside the files_list folder, in separate folders named TSV_links and TXT_links.
View the contents of the file add_name1.txt and then edit the file by adding your name to the end of the file. However, don’t do this with a text editor (such as nano), but instead use a single Bash command which contains a redirect.
Repeat the previous step, but this time edit the file /usr/bin/who. Were you able to do this? If not, why not? Maybe checking the file permissions will clarify the situation?
To what groups do you belong (the name of the required Linux command is quite intuitive)? To what groups does the person who is running the course belong? If you can’t work out how to do this, then look in the man pages.
sorted.txt.
(curl can name the downloaded file the same as the file on the remote server – look in the Linux man pages to find the relevant curl flag.)
Print the contents of the $USER variable to the screen. Look familiar?
List your running processes with ps. Then look at all jobs running on your current node with top. Then look at ONLY YOUR jobs running on your current node using top (look in the man pages for top, there is a flag that enables users to do this).
Where is the curl program found on your system? Check this location is indeed in the $PATH variable.
Use the sleep command to suspend execution on your system for 10 seconds.
Try the sleep command again, but end the job once it has started.
Execute the sleep command for 60s, but this time background the job. Can you see the running sleep command with ps and top?
Try again, but set the sleep to 100s. Suspend the command. Can you see the suspended command with ps? Now kill the sleep command.
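The sketch below shows one possible way of starting a background job and ending it from a script. It is not the only approach: at an interactive prompt you would more likely use Ctrl+C to end a foreground job and Ctrl+Z to suspend one. The variable name sleep_pid is just an illustrative choice.

```shell
#!/bin/bash
# Start a 30-second sleep in the background; "&" returns the prompt at once.
sleep 30 &

# "$!" holds the process ID (PID) of the most recent background job.
sleep_pid=$!

# The background job should be listed when we ask ps about that PID.
ps -p "$sleep_pid"

# End the job early by sending it the default TERM signal.
kill "$sleep_pid"

# wait collects the finished job; its non-zero status reflects the signal.
wait "$sleep_pid"
echo "sleep exited with status $?"
```

Because the job is ended by kill, the script finishes almost immediately rather than after 30 seconds.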
Look at the available modules for the latest version of R. Import this latest version of R as a cluster module. Check this version of R is indeed now running on your system using the command: Rscript --version. [Note: to run R code already saved to a file, you need to execute the Rscript command.]
Look at all the currently running jobs submitted to the cluster. Can you see any long-running jobs? Are CPU / GPU nodes being used? Who has the most jobs running on the cluster?
Look at the sqsummary command. What percentage of CPUs are currently available?
Look at the sinfo command.
Look at the qinfo webpage: http://nagios2/qinfo/
Are most of the CPU nodes in use? Are any nodes listed as “down”? Which user is using the most CPU nodes?
Log in to a compute node, try some commands and then exit the compute node.
Log in to a compute node, but this time reserve 4 cores and 5GB RAM.
Let’s run the R script norm_dist_1_billion.R by submitting the job to the cluster as a non-interactive job.
This R script randomly generates 1 billion data values from a Normal Distribution. The results are then plotted as a histogram.
Firstly, make a bash script named norm_dist.sh which contains the command:
Rscript norm_dist_1_billion.R
(Don’t forget the link to the Bash Shell at the top of the file!)
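The "link to the Bash Shell" presumably refers to the first line of a script, often called the shebang. The sketch below shows the general layout using a made-up script named hello.sh. It assumes Bash is installed at /bin/bash, which is typical on Linux but worth checking on your system.

```shell
#!/bin/bash
# Work in a throwaway directory; hello.sh is a hypothetical script name.
workdir=$(mktemp -d)
cd "$workdir"

# Write a two-line script. The first line (the "shebang") tells the
# system to run the rest of the file with Bash (assumed at /bin/bash).
cat > hello.sh <<'EOF'
#!/bin/bash
echo "Hello from Bash"
EOF

# Make the script executable, then run it.
chmod +x hello.sh
./hello.sh
```

Your norm_dist.sh would follow the same pattern, with the Rscript command in place of the echo line.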
Now run the job, allocating 1 core and 1GB RAM. Make sure the cluster emails you about the job’s progress.
Did the job succeed? If not, try again but increase the RAM allocation to 30GB.
Check how much RAM was actually used by this job.