This manual is distributed under the Creative Commons Attribution-Non-Commercial-Share Alike 2.0 licence. This means that you are free:

• to copy, distribute, display, and perform the work

• to make derivative works

Under the following conditions:

• Attribution. You must give the original author credit.

• Non-Commercial. You may not use this work for commercial purposes.

• Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under a licence identical to this one.

Please note that:

• For any reuse or distribution, you must make clear to others the licence terms of this work.

• Any of these conditions can be waived if you get permission from the copyright holder.

• Nothing in this license impairs or restricts the author’s moral rights.

Full details of this licence can be found at http://creativecommons.org/licenses/by-nc-sa/2.0/uk/legalcode

Accessing the Cluster

Exercise 1

a.

Connect to the LMB cluster (you should already be connected to the LMB intranet) using a Mac terminal or PuTTY for Windows.
What message do you see when you log in?
Is your username displayed when you log in?

b.

Connect to the cluster using FileZilla.
Using FileZilla, create a folder in your cluster home directory named LMB_Cluster_Course_Exercises. Note: your home directory is named after your username (e.g. /lmb/home/jsmith).
Copy the data files given to you at the start of this course into the directory on the cluster you just created.

Getting to grips with Linux

Exercise 2

a.

Change your working directory to LMB_Cluster_Course_Exercises (which should already be in your home directory).
Confirm your new working directory with pwd.
Create a folder in here named Exercise2.
Rename this directory to Exercise_2.

b.

Go into the folder Exercise_2.
Type ls to confirm the folder is empty.
We’ve not introduced the touch command so far, so let’s try it now. Enter in the command line:

touch file1.txt

Has anything happened?

c.

What is the size of file1.txt? (You will need to use a command to do this.)
Make a copy of file1.txt named file2.txt.
Are the file sizes the same?
Delete file1.txt.

d.

Go “up a level” in the filesystem hierarchy (when you have done this you should be able to see the folder Exercise_2 when you run the command ls).
Delete the folder Exercise_2 with the command rmdir.
Did that work? If not, why not, and find a way to delete Exercise_2 (look in the manual about recursive deletions).
Display the recent Bash commands you have entered.

Exercise 3

a.

Only perform this step if you were given the data files in the form of a tar archive.

Let’s open the archive file you copied from your local machine to the cluster:

tar xvzfp [course_title].tar.gz

(Use the actual filename when running this command i.e. not “[course_title]”.)

Apparently, certain web browser settings cause the archive to be unzipped upon downloading. If your archive on the cluster ends with the file extension .tar instead of .tar.gz, then execute the following command instead:

tar xvfp [course_title].tar

(The tar command is useful, as it allows multiple files and the associated file hierarchy to be stored within a single archive file. You don’t need to understand this command at the moment.)
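For the curious, the idea of storing a folder hierarchy in a single archive can be sketched as follows. This is a minimal illustration only: the folder and file names (demo, a.txt, b.txt) are made up and are not part of the course data, and everything happens in a throwaway temporary directory.

```shell
# Work in a throwaway directory so nothing in your course folder is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a small, made-up folder hierarchy (names are illustrative only).
mkdir -p demo/subfolder
echo "hello" > demo/a.txt
echo "world" > demo/subfolder/b.txt

# c = create, z = compress with gzip, f = name of the archive file.
tar -czf demo.tar.gz demo

# t = list the contents of the archive without extracting anything.
tar -tzf demo.tar.gz

# x = extract, v = verbose, p = preserve permissions; -C sets the destination.
mkdir extracted
tar -xvzpf demo.tar.gz -C extracted
```

After extraction, the folder extracted/demo should contain the same files and subfolder as the original demo folder.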

b.

Go to the Exercise_3 folder and then explore the MAZE folder. Use cd to move around the maze and ls to check what is in each room. Can you find the treasure?

c.

In the folder Exercise_3 you will find a file entitled poem.txt. Write out the contents to the screen using the command cat.
Now run a command so you can scroll through the text one page at a time.
Write out the first 7 lines of the poem to the screen.
Write out the bottom 3 lines of the poem to the screen.
Add the line “The Waste Land by T. S. Eliot” to the start of the file, followed by a blank line. Save the file.
Compress the file. Notice whether the file size changed after compression. Did the filename change?
Read the compressed file with cat and then zcat. What happens in each case?

Exercise 4

a.

What does the command date do?
Try this command and redirect the output to a file named date.txt.
Now try the command again, but this time append the output to date.txt.
View the contents of date.txt.
What is the result of adding the flag --version to date?
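As a reminder, the difference between overwriting and appending can be sketched with echo rather than date. The file name notes.txt is made up for this illustration, and the commands run in a temporary directory.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A single > creates the file, or overwrites it if it already exists.
echo "first line" > notes.txt

# A double >> adds to the end of the file, keeping what is already there.
echo "second line" >> notes.txt
cat notes.txt

# Using > again replaces the previous contents entirely.
echo "replacement" > notes.txt
cat notes.txt
```

The first cat shows two lines; the second shows only the replacement text, because > discarded the earlier contents.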

b.

Look at the file named uk_counties.csv. Using the grep command, create a new file named scotland_counties.csv that contains only Scottish counties.
Using the grep command, create a new file named other_counties.csv that contains all counties except Scottish counties. We advise reading the manual on grep to assist with this. (Hint: this is inverting the grep matching.)
Using the commands head and tail and also using a pipe, create a file named subset_counties.csv that contains the counties on lines 27-39 of the original file.
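If chaining head and tail with a pipe is unfamiliar, the general pattern can be sketched on stand-in data. Here seq generates the numbers 1 to 100, one per line; the line range and the output file name middle.txt are chosen purely for illustration and differ from those asked for in the exercise.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# head keeps the first 20 lines; tail then keeps the last 5 of those,
# so only lines 16 to 20 are redirected into the output file.
seq 1 100 | head -n 20 | tail -n 5 > middle.txt

cat middle.txt
```

The number passed to tail is the size of the range (last line minus first line, plus one), not the starting line number.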

c.

Using a single wildcard, create symbolic links to files in the files_list folder that meet the following criteria:

end with the file extension .tsv.
start with B or C and end with the file extension .txt.

The links should be generated outside the file_list folder, in separate folders named TSV_links and TXT_links.

Exercise 5

a.

View the contents of the file add_name1.txt and then edit the file by adding your name to the end of the file. However, don’t do this with a text editor (such as nano), but instead use a single Bash command which contains a re-direct.
Repeat the previous step, but this time try to edit the file /usr/bin/who. Were you able to do this? If not, why not? Checking the file permissions may clarify the situation.
To what groups do you belong (the name of the required Linux command is quite intuitive)? To what groups does the person who is running the course belong? If you can’t work out how to do this, then look in the man pages.

b.

Write a single-line Bash command that takes the contents of the file letters.txt, sorts them alphabetically and then writes them to a new file named sorted.txt.

c.

Write a Bash command to download the file: https://raw.githubusercontent.com/StevenWingett/Bioinformatics_Computer_Cluster_Course/refs/heads/main/README.md (Hint: curl can name the downloaded file the same as the file on the remote server – look in the Linux man pages to find the relevant curl flag.)

d.

Print the contents of the $USER variable to the screen. Look familiar?
List your running processes with ps. Then look at all jobs running on your current node with top. Then look at ONLY YOUR jobs running on your current node using top (look in the man pages for top, there is a flag that enables users to do this).
Where is the curl program found on your system? Check this location is indeed in the $PATH variable.
Use the sleep command to suspend execution on your system for 10 seconds.
Try the sleep command again, but end the job once it has started.
Execute the sleep command for 60s, but this time background the job. Can you see the running sleep command with ps and top?
Try again, but set the sleep to 100s. Suspend the command. Can you see the suspended command with ps? Now kill the sleep command.

Slurm

Exercise 6

a.

Look at the available modules for the latest version of R. Import this latest version of R as a cluster module. Check this version of R is indeed now running on your system using the command: Rscript --version. [Note: to run R code already saved to a file, you need to execute the Rscript command.]

b.

Look at all the currently running jobs submitted to the cluster. Can you see any long-running jobs? Are CPU / GPU nodes being used? Who has the most jobs running on the cluster?
Look at the sqsummary command. What percentage of CPUs are currently available?
Look at the sinfo command.
Look at the qinfo webpage: http://nagios2/qinfo/

Are most of the CPU nodes in use? Are any nodes listed as “down”? Which user is using the most CPU nodes?

c.

d.

Let’s run the R script norm_dist_1_billion.R by submitting the job to the cluster as a non-interactive job.

This R script randomly generates 1 billion data values from a Normal Distribution. The results are then plotted as a histogram.

Firstly, make a bash script named norm_dist.sh which contains the command: Rscript norm_dist_1_billion.R.

(Don’t forget the link to the Bash Shell at the top of the file!)

Now run the job, allocating 1 core and 1GB RAM. Make sure the cluster emails you about the job’s progress.

Did the job succeed? If not, try again but increase the RAM allocation to 30GB.

Check how much RAM was actually used by this job.

e.

Let’s run the R script norm_dist_1_billion.R once again as a non-interactive job, this time using a Slurm script. In the Slurm script, import the R version 4.5.1 module prior to running the R script.
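The general shape of a Slurm script can be sketched as below. This is one possible layout, not the only valid one. The script name norm_dist_slurm.sh, the job name, the email address and the module name R/4.5.1 are all assumptions made for illustration; check the exact module name with module avail on the cluster. The commands only write the script to a file in a temporary directory; nothing is submitted.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the Slurm script to disk. The quoted 'EOF' stops the shell from
# expanding anything inside the here-document.
cat > norm_dist_slurm.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=norm_dist
#SBATCH --cpus-per-task=1
#SBATCH --mem=30G
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=your.name@example.org
#SBATCH --output=norm_dist_%j.out

# ASSUMPTION: the module is called R/4.5.1 on this cluster.
module load R/4.5.1
Rscript norm_dist_1_billion.R
EOF

# Check the script for Bash syntax errors without running it.
bash -n norm_dist_slurm.sh && echo "Syntax OK"

# On the cluster, submit it with:
#   sbatch norm_dist_slurm.sh
# Once the job has finished, memory use can usually be inspected with:
#   sacct -j JOBID --format=JobID,MaxRSS,Elapsed,State
```

Placing the resource requests in #SBATCH lines keeps them with the job, so the same settings are used every time the script is submitted. The email address above is a placeholder and must be replaced with your own.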

*f (optional challenge question).

This question is intended to introduce you to the concept of Slurm arrays and involves searching online to work out the answer.

You have been given 10 DNA FASTA files. Since these are human samples, the GC content should be around 41%. However, a sample swap has occurred, meaning that one of the samples actually contains E. coli DNA, which has a GC content of around 51%.

You have also been given a Python script that calculates the GC content of a FASTA file. You could run this script sequentially on all 10 files, but instead you should try to process all the FASTA files simultaneously by making use of a Slurm array.

Firstly, we need to make the data. To do this, run the Python script make_fasta_files.py.

That should have generated 10 FASTA files. Now use the script calc_gc.py in a Slurm array to determine the GC content of each sample.
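If you get stuck, the overall structure of an array job can be sketched as follows. Several details here are assumptions rather than facts about the course files: that the FASTA files end in .fasta, that calc_gc.py accepts a filename as its only argument, and that the interpreter is called python. The script name gc_array.sh is also made up. As before, the commands only write the script to a temporary directory and do not submit anything.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write the array script to disk; the quoted 'EOF' prevents expansion here.
cat > gc_array.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=gc_array
#SBATCH --array=1-10
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --output=gc_%A_%a.out

# Slurm runs this script once per array task and sets
# SLURM_ARRAY_TASK_ID to that task's number (1 to 10 here).
# ASSUMPTION: the input files match *.fasta in the submission directory.
files=(*.fasta)

# Bash arrays count from 0, so task 1 takes the first file.
fasta="${files[$((SLURM_ARRAY_TASK_ID - 1))]}"

# ASSUMPTION: calc_gc.py takes the FASTA filename as its only argument.
python calc_gc.py "$fasta"
EOF

# Check the script for Bash syntax errors without running it.
bash -n gc_array.sh && echo "Syntax OK"

# On the cluster, submit it from the folder holding the FASTA files with:
#   sbatch gc_array.sh
```

In the output filename, %A is replaced by the array job ID and %a by the task number, so each of the 10 tasks writes to its own file.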