Using PacBio Data Processing on a cluster¶
- abstract:
This document describes the process of installing and using PacBio Data Processing on a cluster with a queueing system. It will be illustrated with the goethe-HLR cluster (at the University of Frankfurt) that employs slurm as a queueing system.
Introduction¶
A computing cluster, or cluster, is a system that provides access to high computing power by joining multiple nodes and coordinating their usage. A node is basically a single powerful computer.
There are two obvious ways to give a cluster more computing power:
by adding more nodes. This strategy is sometimes referred to as horizontal scaling.
by using more powerful nodes (with faster CPU, more cores or RAM, etc). Increasing the power of each node is sometimes called vertical scaling.
Processing large bam files with PacBio Data Processing can greatly benefit from the resources provided by a cluster: you typically get access to many powerful nodes, hence you can potentially increase the throughput of your application both horizontally and vertically. Needless to say, that comes with a price to pay in terms of complexity of usage.
The goal of this document is to lower the complexity barrier of using a cluster to speed up the analysis of bam files with PacBio Data Processing. You will find some explanations on how this can be done in practice. As an example, we will use the goethe-HLR cluster managed by the CSC at the University of Frankfurt.
Preparation¶
Before anything else, you need a valid account on the cluster.
For the goethe-HLR cluster
To get an account on goethe-HLR, follow the instructions to submit a CSC user application. At the time of writing this document access to the FUCHS (sub-)cluster is granted to academic institutions in the state of Hesse (Germany).
Installation¶
Once you have an account on a cluster you need to install PacBio Data Processing on the system. Follow these instructions:
Log in to the cluster. Typically, after getting permission to use the cluster, you are provided with a user name and a password that allow you to log in through ssh.
For the goethe-HLR cluster
You will receive a username (in my case it is palao, which I will use in the examples) and instructions to get the password. To access the cluster you need to ssh into it. If you are using a terminal, type the following command (without the $, which is a symbol to indicate that you are expected to type what follows in a shell or terminal):
$ ssh palao@goethe.hhlr-gu.de
and type the password in when requested. On success (i.e. if you enter the correct password), that command starts a remote shell on the cluster, which will be our main interface to it. If the ssh command is not available on your system, an ssh client must be installed before you can access goethe-HLR.
Install PacBio Data Processing on the cluster. Follow the instructions found in the section about Installation to install PacBio Data Processing.
For the goethe-HLR cluster
In order to have the correct version of Python on the cluster you have several options:
Install Python directly from sources, or
Follow the instructions to use spack at Goethe-HLR and install the needed version of Python with Spack.
Installing Python from sources may seem daunting, but it ends up being easier than using Spack if you need a version that is not available in Spack. Of course, it all depends on your experience. In case of doubt, do not hesitate to contact the admins, who will hopefully give you good advice in this regard.
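As a rough illustration of the first option, the usual steps to build CPython from sources into a directory you own are sketched below. The version number and the installation prefix are placeholders chosen for illustration, not requirements of PacBio Data Processing, and build_python is a hypothetical helper. With DRY_RUN=1 the commands are only printed, so they can be reviewed before anything is downloaded or compiled.

```shell
#!/bin/bash
# Sketch: build CPython from sources into a user-owned prefix (no root needed).
# The function body runs in a subshell, so the caller's directory is unchanged.
build_python() (
    version="$1"   # placeholder, e.g. 3.9.18; use the version you actually need
    prefix="$2"    # placeholder, e.g. $HOME/opt/python-3.9.18
    run=${DRY_RUN:+echo}   # "echo" when DRY_RUN is set, empty otherwise
    $run curl -LO "https://www.python.org/ftp/python/${version}/Python-${version}.tgz" &&
    $run tar -xzf "Python-${version}.tgz" &&
    $run cd "Python-${version}" &&
    $run ./configure --prefix="$prefix" &&
    $run make -j 4 &&
    $run make install
)

# Review the steps without executing them:
DRY_RUN=1 build_python 3.9.18 "$HOME/opt/python-3.9.18"
```

Once the build has finished, a venv can be created with the new interpreter, for instance with python3 -m venv ~/.venvs/PacbioDataProcessing, using the python3 executable found in the bin directory of the chosen prefix.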
PacBio Data Processing needs some external tools for the main pipeline sm-analysis to work: pbindex, pbmm2 and ccs. Follow the links to install them on the cluster.
Note
Since these are external dependencies, they can be installed anywhere as long as the tools are accessible to PacBio Data Processing. For instance, in my case I did the following:
ccs. This tool's latest version is only provided as an executable (i.e., they closed the source). I downloaded it and stored it side by side with sm-analysis in the same venv. It can also be installed with conda. See the section CCS.
pbmm2. I installed it with conda in a dedicated bioconda environment and passed the path to pbmm2 directly to sm-analysis:
sm-analysis ... -a ~/miniconda/bin/pbmm2 ...
Notice that the ... are symbolic.
pbindex. I installed the pbbam package manually with meson, but it can be installed with conda in a bioconda environment. And, again, pass the path to the executable in each call to sm-analysis:
sm-analysis ... -p ~/src/pbbam/build/tools/pbindex ...
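Before submitting a job it may be worth checking that the three external tools are actually reachable. The following is a minimal sketch: check_tool is a hypothetical helper, and the paths are the ones from the examples above, which will most likely differ on your system.

```shell
#!/bin/bash
# Report whether a tool is reachable, either by name (on PATH) or by explicit path.
check_tool() {
    local tool="$1"
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found: $tool -> $(command -v "$tool")"
    else
        echo "MISSING: $tool" >&2
        return 1
    fi
}

# Run this after activating the venv if ccs is stored there.
# The two paths are taken from the examples in the text; adjust them as needed.
for tool in ccs ~/miniconda/bin/pbmm2 ~/src/pbbam/build/tools/pbindex; do
    check_tool "$tool" || true
done
```

A tool reported as missing either needs to be installed or its path corrected before it is handed to sm-analysis.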
Once PacBio Data Processing is installed, you are almost ready to use it. But on a cluster there is typically a queueing system that manages the resources. In the next section we describe the usage of PacBio Data Processing through slurm, a very common queueing system.
Running¶
A typical workflow to run software on a cluster managed by slurm (or any other queueing system) is:
prepare a batch script, and
submit it and wait for the results.
The syntax and options of batch files are wide topics covered elsewhere. In this section we focus on preparing minimally functional batch scripts to use PacBio Data Processing with slurm.
For the goethe-HLR cluster
On the CSC webpage you can find plenty of information about the Goethe-HLR Cluster Usage and the FUCHS Cluster Usage, including details about using slurm, recommended storage locations and much more.
In the rest of the section we will provide some examples of batch scripts, assuming the following:
PacBio Data Processing has been installed in a venv such that the activation step is:
source ~/.venvs/PacbioDataProcessing/bin/activate
The working directory will be:
/scratch/darmstadt/palao/projects/pacbio/m45
In that directory there is a bam file named m45.bam that we are interested in analyzing on a per-molecule basis with sm-analysis. There is also a reference in fasta format: reference.fasta and reference.fasta.fai.
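Since a job may wait in the queue for a long time before it starts, it can pay off to verify these assumptions beforehand. The sketch below checks that the three input files exist and are not empty. check_inputs is a hypothetical helper written for this example, not part of PacBio Data Processing.

```shell
#!/bin/bash
# Verify that the input files assumed in the text are present and non-empty.
check_inputs() {
    local workdir="$1" status=0 f
    for f in m45.bam reference.fasta reference.fasta.fai; do
        if [ -s "$workdir/$f" ]; then
            echo "ok: $workdir/$f"
        else
            echo "missing or empty: $workdir/$f" >&2
            status=1
        fi
    done
    return "$status"
}

check_inputs /scratch/darmstadt/palao/projects/pacbio/m45 ||
    echo "fix the inputs before submitting a job" >&2
```

If only the .fai index is missing, it can be created with samtools faidx reference.fasta, provided samtools is available on the cluster.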
A simple slurm batch script for sm-analysis¶
The following listing contains a batch script that:
reserves 1 compute node from the partition named fuchs for 2 days and 12 hours
starts 10 simultaneous instances of ipdSummary, each spawning 4 worker processes
#!/bin/bash
#SBATCH --job-name=m45
#SBATCH --partition=fuchs
#SBATCH --nodes=1
#SBATCH --time=2-12:00:00
#SBATCH --mail-type=ALL
source ~/.venvs/PacbioDataProcessing/bin/activate
cd /scratch/darmstadt/palao/projects/pacbio/m45
sm-analysis m45.bam reference.fasta -n 4 -N 10
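With -N 10 and -n 4 the script above starts up to 10 × 4 = 40 worker processes on the node. Whether that number suits a given node depends on how many cores it has. The following sketch compares both numbers, under the assumption (not stated by the sm-analysis documentation quoted here) that requesting more workers than cores is undesirable. total_workers is a hypothetical helper.

```shell
#!/bin/bash
# Total worker processes started by: sm-analysis ... -n WORKERS -N INSTANCES
total_workers() {
    local instances="$1" workers="$2"
    echo $(( instances * workers ))
}

needed=$(total_workers 10 4)   # the values used in the batch script
# nproc is part of GNU coreutils; getconf is used as a fallback.
available=$(nproc 2>/dev/null || getconf _NPROCESSORS_ONLN)
echo "worker processes requested: $needed; cores available here: $available"
if [ "$needed" -gt "$available" ]; then
    echo "warning: more workers than cores; consider lowering -n or -N" >&2
fi
```

To be meaningful, the check has to run on a compute node (for instance at the beginning of the batch script), since login nodes often have different hardware.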
A slurm batch script to run sm-analysis in parallel¶
For large bam files it could be beneficial to employ more than a single node to speed up the analysis process.
The following listing contains a batch script that:
reserves 16 compute nodes from the partition named fuchs for 10 days
starts, in each node, 10 simultaneous instances of ipdSummary, each in turn spawning 4 worker processes
#!/bin/bash
#SBATCH --job-name=m45
#SBATCH --partition=fuchs
#SBATCH --nodes=16
#SBATCH --time=10-0:00:00
#SBATCH --mail-type=ALL
source ~/.venvs/PacbioDataProcessing/bin/activate
cd /scratch/darmstadt/palao/projects/pacbio/m45
for (( t=1; t <= SLURM_NNODES; t++ )); do
    srun --nodes=1 sm-analysis m45.bam reference.fasta -n 4 -N 10 -P ${t}:${SLURM_NNODES} &
    sleep 5
done
wait
Pay attention to the following points:
we are splitting the processing into 16 partitions. Each node will produce the output corresponding to one 16th of the original bam file in this example.
sm-analysis is run with the help of srun to let slurm choose an empty node for each partition.
at the end of the srun line there is a & and at the end of the script there is a wait command. It is very important not to forget these two details: the & sends each srun to the background so that all partitions run simultaneously, and wait keeps the batch script alive until all of them have finished.
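To see which value of -P each node receives without submitting anything, the loop can be previewed as a dry run. In the sketch below NNODES stands in for SLURM_NNODES, which only exists inside a running job, and print_partition_commands is a hypothetical helper that merely prints the commands.

```shell
#!/bin/bash
# Dry run: print the command that each node would execute, followed by "wait".
print_partition_commands() {
    local nnodes="$1" t=1
    while [ "$t" -le "$nnodes" ]; do
        echo "srun --nodes=1 sm-analysis m45.bam reference.fasta -n 4 -N 10 -P ${t}:${nnodes} &"
        t=$(( t + 1 ))
    done
    echo "wait"
}

# Defaults to the 16 nodes requested in the batch script.
print_partition_commands "${NNODES:-16}"
```

Each printed line differs only in the first number passed to -P, which runs from 1 to the number of nodes.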
Submitting the job¶
Finally, once the batch script is ready, it is time to submit a job. A job is what the queueing system creates when you tell it to run some program. In order to tell the cluster to execute the task described in the script, save it as, e.g., sm-analysis.slurm and run the following command on the cluster to submit the job:
sbatch sm-analysis.slurm
Since the mail notifications are all active, you should receive an email when the job starts running and when it finishes. However, the squeue command can be handy to get immediate feedback on the status of the job:
squeue -u palao
The -u palao part means that we will get information only on jobs submitted by user palao. Other useful commands are available too. Please have a look at Goethe-HLR Cluster Usage or at FUCHS Cluster Usage for more details.
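If you prefer to script the submission and the waiting, sbatch and squeue can be combined as in the following sketch. submit_job and wait_for_job are hypothetical helpers written for this example. They rely on the --parsable option of sbatch and on the -h (no header) and -j (job id) options of squeue. The polling interval of 60 seconds is an arbitrary choice.

```shell
#!/bin/bash
# Submit a batch script and print only the numeric job id.
submit_job() {
    local script="$1" jobid
    # --parsable makes sbatch print "jobid" (or "jobid;cluster") instead of a sentence.
    jobid=$(sbatch --parsable "$script") || return 1
    echo "${jobid%%;*}"
}

# Block until the job no longer appears in the queue.
wait_for_job() {
    local jobid="$1" interval="${2:-60}"
    # With -h there is no header, so the output is empty once the job has left the queue.
    while [ -n "$(squeue -h -j "$jobid" 2>/dev/null)" ]; do
        sleep "$interval"
    done
    echo "job $jobid is no longer in the queue"
}

# Intended use on the cluster (commented out because it needs slurm):
# jobid=$(submit_job sm-analysis.slurm) && wait_for_job "$jobid"
```

Note that a job leaving the queue does not imply that it succeeded. The log file and the mail notification tell whether it completed or failed.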
Once the job successfully completes, you will find the results in the working directory, /scratch/darmstadt/palao/projects/pacbio/m45 in our case, and a log file created by slurm with all the outputs generated by the commands executed during the job. The name of the log file is, by default, something like slurm-??????.out, where ?????? is the job number assigned by slurm.
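When several jobs have been submitted from the same place, finding the right log can be tedious. The sketch below prints the end of the most recently modified slurm-*.out file. latest_slurm_log is a hypothetical helper. By default slurm writes the log to the directory from which sbatch was invoked, so that is where the helper should be pointed.

```shell
#!/bin/bash
# Print the most recently modified slurm-*.out file in a directory (default: current one).
latest_slurm_log() {
    local dir="${1:-.}"
    set -- "$dir"/slurm-*.out
    # If the pattern matched nothing, $1 is the literal pattern and does not exist.
    [ -e "$1" ] || return 0
    # ls -t sorts by modification time, newest first.
    ls -t "$@" | head -n 1
}

log=$(latest_slurm_log)
if [ -n "$log" ]; then
    tail -n 20 "$log"
else
    echo "no slurm-*.out file found here" >&2
fi
```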