Using PacBio Data Processing on a cluster

abstract:

This document describes the process of installing and using PacBio Data Processing on a cluster with a queueing system. It will be illustrated with the goethe-HLR cluster (at the University of Frankfurt) that employs slurm as a queueing system.

Introduction

A computing cluster, or cluster, is a system that provides access to high computing power by joining multiple nodes and coordinating their usage. A node is basically a single powerful computer.

There are two obvious ways to give a cluster more computing power:

  • by adding more nodes. This strategy is sometimes referred to as horizontal scaling.

  • by using more powerful nodes (with faster CPU, more cores or RAM, etc). Increasing the power of each node is sometimes called vertical scaling.

Processing large bam files with PacBio Data Processing can greatly benefit from the resources provided by a cluster: you typically get access to many powerful nodes, so you can potentially increase the throughput of your application both horizontally and vertically. Needless to say, this comes at a price in terms of complexity of usage.

The goal of this document is to lower the complexity barrier to using a cluster to speed up the analysis of bam files with PacBio Data Processing. You will find some explanations of how this can be done in practice. As an example, we will use the goethe-HLR cluster managed by the CSC at the University of Frankfurt.

Preparation

Before anything else, you need a valid account on the cluster.

For the goethe-HLR cluster

To get an account on goethe-HLR, follow the instructions to submit a CSC user application. At the time of writing this document access to the FUCHS (sub-)cluster is granted to academic institutions in the state of Hesse (Germany).

Installation

Once you have an account on a cluster you need to install PacBio Data Processing on the system. Follow these instructions:

  1. Log in to the cluster. Typically, after getting permission to use the cluster, you are provided with a user name and a password that allow you to log in through ssh.

    For the goethe-HLR cluster

    You will receive a username (in my case it is palao, which I will use in the examples) and instructions on how to get the password. To access the cluster you need to ssh into it. If you are using a terminal, type the following command (without the $, which is just a symbol indicating that what follows is to be typed in a shell or terminal):

    ssh palao@goethe.hhlr-gu.de
    

    and type in the password when requested. On success (i.e. you entered the correct password), that command starts a remote shell on the cluster, which will be our main interface to it. If the ssh command is not available on your system, an ssh client must be installed to be able to access goethe-HLR.

  2. Install PacBio Data Processing on the cluster. Follow the instructions found in the section about Installation to install PacBio Data Processing.

    For the goethe-HLR cluster

    In order to have the correct version of Python on the cluster you have several options:

    1. Install Python directly from sources, or

    2. Follow the instructions to use spack at Goethe-HLR and install the needed version of Python with Spack.

    Installing Python from sources may seem daunting, but it ends up being easier than using Spack if you need a version that is not available through Spack. Of course, it all depends on your experience. In case of doubt, do not hesitate to contact the admins; they will hopefully give you good advice in this regard. A sketch of a possible installation procedure is given after this list.

  3. PacBio Data Processing needs some external tools for the main pipeline sm-analysis to work: pbindex, pbmm2 and ccs. Follow the links to install them on the cluster.

    Note

    Since these are external dependencies, they can be installed anywhere, as long as the tools are accessible to PacBio Data Processing. For instance, in my case I did the following:

    • ccs. This tool’s latest version is only provided as an executable (i.e., they closed the source). I downloaded it and stored it side by side with sm-analysis in the same venv. It can also be installed with conda. See the section CCS.

    • pbmm2. I installed it with conda in a dedicated bioconda environment and passed the path to pbmm2 directly to sm-analysis:

      sm-analysis ... -a ~/miniconda/bin/pbmm2 ...
      

      Notice that the ... are placeholders for other arguments.

    • pbindex. I installed the pbbam package manually with meson, but it can also be installed with conda in a bioconda environment. And, again, pass the path to the executable in each call to sm-analysis:

      sm-analysis ... -p ~/src/pbbam/build/tools/pbindex ...
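
As a concrete illustration of the steps above, the following sketch creates the venv used in the rest of this document, installs PacBio Data Processing into it, and prepares a conda environment with the external tools. It is only a sketch: the PyPI package name (pacbio-data-processing) and the bioconda package names (pbmm2, pbbam, pbccs) are assumptions that you should verify before running it.

# Sketch only: package names are assumptions; adapt them to your setup.

# Create and activate a venv with a suitable Python
# (built from sources or provided by Spack, see above):
python3 -m venv ~/.venvs/PacbioDataProcessing
source ~/.venvs/PacbioDataProcessing/bin/activate

# Install PacBio Data Processing into the venv:
pip install --upgrade pip
pip install pacbio-data-processing

# Optionally, a dedicated conda environment with the external tools:
conda create -n pacbio-tools -c conda-forge -c bioconda pbmm2 pbbam pbccs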
      

Once PacBio Data Processing is installed, you are almost ready to use it. But on a cluster there is typically a queueing system that manages the resources. In the next section we describe the usage of PacBio Data Processing through slurm, a very common queueing system.

Running

A typical workflow to run software in a cluster managed by slurm (or any other queueing system) is:

  1. prepare a batch script, and

  2. submit it and wait for the results.

The syntax and options of batch scripts are wide topics covered elsewhere. In this section we focus on preparing minimally functional batch scripts to use PacBio Data Processing with slurm.

For the goethe-HLR cluster

On the CSC website you can find plenty of information about the Goethe-HLR Cluster Usage and the FUCHS Cluster Usage, including details about using slurm, recommended storage locations and much more.

In the rest of the section we will provide some examples of batch scripts and we will assume the following:

  • PacBio Data Processing has been installed in a venv such that the activation step is:

    source ~/.venvs/PacbioDataProcessing/bin/activate
    
  • The working directory will be:

    /scratch/darmstadt/palao/projects/pacbio/m45
    
  • In that directory there is a bam file named m45.bam that we are interested in analyzing on a per-molecule basis with sm-analysis. There is also a reference in FASTA format, reference.fasta, together with its index, reference.fasta.fai (see the note after this list in case the index needs to be generated).
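
If the index file is not there, it can usually be generated from the reference with samtools (this assumes samtools is available on the cluster, for instance through conda or an environment module):

# creates reference.fasta.fai next to the reference
samtools faidx reference.fasta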

A simple slurm batch script for sm-analysis

The following listing contains a batch script that:

  • reserves 1 compute node from the partition named fuchs for 2 days and 12 hours

  • starts 10 simultaneous instances of ipdSummary, each spawning 4 worker processes

#!/bin/bash
#SBATCH --job-name=m45
#SBATCH --partition=fuchs
#SBATCH --nodes=1
#SBATCH --time=2-12:00:00
#SBATCH --mail-type=ALL

source ~/.venvs/PacbioDataProcessing/bin/activate
cd /scratch/darmstadt/palao/projects/pacbio/m45

sm-analysis m45.bam reference.fasta -n 4 -N 10
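
Depending on the cluster configuration, slurm may not know where to send the notifications requested by --mail-type=ALL, and you may want to control the name of the log file too. Both can be set with two additional, optional directives; the email address below is, of course, a placeholder:

#SBATCH --mail-user=your.name@example.org
#SBATCH --output=m45-%j.out

Here %j is expanded by slurm to the job number.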

A slurm batch script to run sm-analysis in parallel

For large bam files it could be beneficial to employ more than a single node to speed up the analysis process.

The following listing contains a batch script that:

  • reserves 16 compute nodes from the partition named fuchs for 10 days

  • starts, in each node, 10 simultaneous instances of ipdSummary, each in turn spawning 4 worker processes

#!/bin/bash
#SBATCH --job-name=m45
#SBATCH --partition=fuchs
#SBATCH --nodes=16
#SBATCH --time=10-0:00:00
#SBATCH --mail-type=ALL

source ~/.venvs/PacbioDataProcessing/bin/activate
cd /scratch/darmstadt/palao/projects/pacbio/m45

# launch one sm-analysis instance per allocated node, each processing one partition
for (( t=1;  t <= SLURM_NNODES; t++)); do
  srun --nodes=1 sm-analysis m45.bam reference.fasta -n 4 -N 10 -P ${t}:${SLURM_NNODES} &
  sleep 5
done
wait

Pay attention to the following points:

  • we are splitting the processing into 16 partitions. In this example, each node will produce the output corresponding to one sixteenth of the original bam file.

  • sm-analysis is run with the help of srun to let slurm choose an empty node for each partition.

  • at the end of the srun line there is a & and at the end of the script there is a wait command. It is very important not to forget these two details: the & sends each srun to the background so that all partitions are launched concurrently, and wait keeps the batch script alive until all of them have finished.

Submitting the job

Finally, once the batch script is ready, it is time to submit a job. A job is what the queueing system creates when you tell it to run some program. In order to tell the cluster to execute the task described in the script, save it as, e.g., sm-analysis.slurm and run the following command on the cluster to submit the job:

sbatch sm-analysis.slurm
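
On success, sbatch replies with the job number assigned to the job, with something like the following (the number is just an example):

Submitted batch job 123456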

Since all the mail notifications are active (--mail-type=ALL), you should receive an email when the job starts running and when it finishes.

However, the squeue command can be handy to get immediate feedback on the status of the job:

squeue -u palao

The -u palao part means that we will only get information on jobs submitted by the user palao. Other useful commands are available too, for instance the ones shown below; please have a look at Goethe-HLR Cluster Usage or at FUCHS Cluster Usage for more details.
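
For example, scancel aborts a job that is pending or running, and sacct reports accounting information about jobs, including finished ones (123456 stands for the job number reported by sbatch):

# cancel a queued or running job
scancel 123456

# show accounting information for a job
sacct -j 123456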

Once the job successfully completes, you will find the results in the working directory (/scratch/darmstadt/palao/projects/pacbio/m45 in our case) and a log file created by slurm with all the output generated by the commands executed during the job. The name of the log file is, by default, something like slurm-??????.out, where ?????? is the job number assigned by slurm.
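
While the job is running, that log file can be followed live from a login shell, for example (again, 123456 stands for the actual job number):

tail -f slurm-123456.out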