Quickstart on a cluster

This is a concise summary of the steps needed to install and use the PacBio Data Processing software on a cluster. It is intended only as a brief reference, so few details are given here. If you need more details on a step, please follow the provided links.

Goal

Starting with a PacBio sequencing file (bam file) and a reference sequence (fasta file), the sm-analysis tool from PacBio Data Processing generates a csv file (a so-called methylation report). Each row corresponds to one molecule (hole number or ZMW in PacBio parlance), and the columns contain properties of each molecule that passed the quality filters.

In addition, a summary report is generated containing some basic statistics about the input, the process and the output files.
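
As a quick sanity check on the output, the molecules in a methylation report can be counted from the shell. This is only a sketch under stated assumptions: the function name and the file name in the usage line are hypothetical, and it assumes that the first line of the csv file is a header and that no field contains an embedded newline. Check the sm-analysis documentation for the actual layout of the report.

```shell
# Count the data rows (molecules) in a csv file.
# ASSUMPTION: the first line is a header; blank lines are ignored.
count_molecules() {
    awk 'NR > 1 && NF { n++ } END { print n + 0 }' "$1"
}
```

For example, `count_molecules methylation_report.csv` prints the number of non-empty lines after the first one.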

Steps

  1. Create a cluster account. Needless to say, this step is strongly dependent on the cluster and details cannot be given here (but see Using PacBio Data Processing on a cluster if you plan to use the Goethe-HLR cluster).

  2. Open a terminal and log in to the cluster (see Using PacBio Data Processing on a cluster).

  3. Install Python 3.9 (or above) on the cluster (see the Installation document).

  4. Create a virtual environment (see the Installation document).

  5. Install the external dependencies pbindex, pbmm2, kineticsTools and ccs (see the Using PacBio Data Processing on a cluster document).

  6. Install PacBio Data Processing (see the Installation document).

  7. Copy the input files to the cluster. Assuming that you want to process a file called pbsequencing.bam and your reference is stored in a file called reference.fasta (with its companion index reference.fasta.fai), run the following command in a terminal:

    scp pbsequencing.bam reference.fasta{,.fai} dave@goethe.hhlr-gu.de:/scratch/fuchs/darmstadt/dave/myproject/
    

    YMMV: the paths depend on the name of your account and on the destination directory. The destination directory must already exist. Recent versions of rsync (3.2.3 or later) accept a --mkpath option to create missing components of the destination path, but don't count on having recent versions of software by default on a cluster ;-)

    Note

    Cluster administrators tend to be very concerned about proper usage of the filesystems available on a cluster. Quite often they provide several filesystems with different properties (speed, size, etc.), along with suggestions and policies on how to use them. Find out what applies in your case and stick to their policy as closely as you can to minimize performance problems. On the Goethe-HLR cluster website you can learn about the filesystems in the Goethe-HLR storage or FUCHS-CSC storage sections.

  8. Prepare and submit a job (see Using PacBio Data Processing on a cluster). This is the step where the analysis by PacBio Data Processing is carried out.

  9. Copy the output files to your personal computer:

    scp dave@goethe.hhlr-gu.de:/scratch/fuchs/darmstadt/dave/[file to transfer] .
    

    where the trailing . (dot) means the current working directory; it can be replaced by any other local path.

    Or, to synchronize your current working directory with the remote location:

    rsync -avz dave@goethe.hhlr-gu.de:/scratch/fuchs/darmstadt/dave/myproject/ ./
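
The two transfer steps above (7 and 9) can be wrapped in small shell helpers. The following is a minimal sketch, not part of PacBio Data Processing: the function names are hypothetical, the account, host and paths are the example values used in this document, and it assumes that ssh access to the cluster is already working. It creates the destination directory before copying, so it does not rely on rsync's --mkpath option.

```shell
# Example values taken from this document; override them for your own account.
REMOTE="${REMOTE:-dave@goethe.hhlr-gu.de}"
REMOTE_DIR="${REMOTE_DIR:-/scratch/fuchs/darmstadt/dave/myproject}"

# Step 7: make sure the destination directory exists, then copy the inputs.
# mkdir -p succeeds even if the directory is already there.
upload_inputs() {
    ssh "$REMOTE" "mkdir -p '$REMOTE_DIR'" &&
        scp "$@" "$REMOTE:$REMOTE_DIR/"
}

# Step 9: list what rsync would transfer without copying anything
# (-n is rsync's dry-run flag); drop the n to perform the transfer.
# The optional argument is the local destination (default: ./).
preview_download() {
    rsync -avzn "$REMOTE:$REMOTE_DIR/" "${1:-./}"
}
```

With these definitions loaded in your shell, `upload_inputs pbsequencing.bam reference.fasta reference.fasta.fai` copies the input files, and `preview_download` shows which output files would be fetched into the current working directory.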