Single molecule analysis with the sm-analysis program

sm-analysis is the main program provided by PacBio Data Processing. It analyzes a given input PacBio BAM file evaluating the methylation status of each molecule.

Note

As mentioned in the Installation section, by default only the command line interface for the single molecule analysis, sm-analysis, is installed. The corresponding graphical user interface can be enabled but requires an extra library: [wxpython]. Both sm-analysis and sm-analysis-gui provide access to the same pipeline, the same program is run under the hood. They only differ in the interface.

Input

sm-analysis expects a PacBio BAM file as first input and the corresponding FASTA file containing the reference as the second argument:

$ sm-analysis myseq.bam bacterium.fasta

Other options can be added to that line to customize the default behaviour. For example, to speed up the analysis we might want to use more than one CPU core. PacBio Data Processing allows us to run several instances of the ipdSummary program in parallel. And we might also let each ipdSummary instance use multiple worker processes. The following line adds two command line options to do all that:

$ sm-analysis myseq.bam bacterium.fasta -N 3 -n 2

which, upon execution, will analize the given BAM file with the given reference using 3 ipdSummary instances, and each instance will use 2 worker processes in turn.

For a complete list of options available and their description, see the section Command line options below.

Output

Once the sm-analysis program runs successfully, some files are produced:

How it works

The following figure is a flow chart of the sm-analysis pipeline.

sm-analysis

Flow chart of the sm-analysis program.

One of the first things that the program will do is to ensure that the input BAM is aligned. Actually two alignment processes will be carried out with the help of an aligner on the input BAM file, each one producing what we call an alignment variant. One is the straight alignment variant and another is an alignment with a rotated reference, the so-called pi shifted alignment variant, where the starting point of the reference is rotated by 180 degrees. The aim of this second alignment process is to catch molecules that cross the origin. With the help of those two files a complete set of molecules can be detected.

Before running the aligner the program will try to find the two aligned versions of the input file, if it is unaligned.

On the other hand, if the input file is actually aligned, a pi shifted version of it will be sought. And if found, it will be used.

If only a straight aligned file is at hand, the circular topology of the reference will not be considered.

To find the aligned versions of the input BAM, the program tries to answer three questions:

  1. Is there a candidate with the expected filename?

  2. Is the candidate aligned?

  3. Are the molecules in it a subset of the molecules in the input BAM?

If the answer to the three questions is yes, the candidate is considered a plausible aligned version of the input BAM, and it is as such used within the rest of the analysis. If not, the alignment process is carried out.

Filters

The sm-analysis program applies several filters to each molecule in the input BAM file. The aim is to ensure a minimum of quality in the sequencing data for the processed molecules. The following filters are applied, in the given order:

  1. A minimum of 50 base pairs is required to each molecule. The sequence corresponding to each molecule is taken from the aligned CCS BAM file.

  2. A minimum value is set for the mapping quality (column 5 in each subread) to half of the estimated maximum mapping quality found in the aligned input bam file. That is done using the pabio_data_processing.bam_utils.estimate_max_mapping_quality() function. This filter is applied to the subreads in each molecule.

  3. At least 90% of the subreads in each molecule must have a mapping of 0 or 16.

  4. Only subreads with mapping in the set {0, 16} are taken.

  5. Only molecules with:

    1. at least 20 subreads, and

    2. data for both strands (+ and -)

    are taken.

For details about the technical terms used in the description of the filters (i.e. what is a mapping or what is the meaning of mapping quality), please consult the SAM/BAM format specification.

Command line options

The sm-analysis program has a Command Line Interface (CLI) with the following options:

<BAM-FILE>

Input file in BAM format

<ALIGNMENT-FILE>

Input file containing the alignment in FASTA format (typically a file ending in ‘.fa’ or ‘.fasta’). A companion ‘.fa.fai’/’.fasta.fai’ file is also needed but it will be created if not found.

-M <MODEL>, --ipd-model <MODEL>

Model to be used by ipdSummary to identify the type of modification. MODEL must be either the model name or the path to the ipd model. First, the program will make an attempt to interprete MODEL as a path to a file defining a model; if that fails, MODEL will be understood to be the name of a model that must be accessible in the resources directory of kineticsTools (e.g. -M SP3-C3 would trigger a search for a file called SP3-C3.npz.gz within the directory with models provided by kineticsTools). If this option is not given, the default model in ipdSummary is used.

-a <PROGRAM>, --aligner <PROGRAM>

program to use as aligner. It can be a path or an executable in the PATH (default: Pbmm2)

-p <PROGRAM>, --pbindex <PROGRAM>

program to generate indices of BAM files. It must have the same interface as PacBio’s pbindex and it can be a path or an executable in the PATH (default: pbindex)

-i <PROGRAM>, --ipdsummary <PROGRAM>

program to analyze the IPDs. It must have the same interface as PacBio’s ipdSummary. It can be a path or an executable in the PATH (default: ipdSummary).

-c <PROGRAM>, --ccs <PROGRAM>

program to compute the Hi-Fi version of the input BAM. It must have the same interface as PacBio’s CCS. It can be a path or an executable in the PATH (default: ccs)

-N <INT>, --num-simultaneous-ipdsummarys <INT>

Number of simultaneous instances of ipdSummary that will cooperate to process the molecules (default: 1).

-n <INT>, --num-workers-per-ipdsummary <INT>

Number of worker processes that each instance of ipdSummary will spawn (default: 1).

--nprocs-blasr <INT>

Number of worker processes that each instance of blasr will spawn (default: 1).

-P <PARTITION:NUMBER-OF-PARTITIONS>, --partition <PARTITION:NUMBER-OF-PARTITIONS>

This option instructs the program to only analyze a fraction (partition) of the molecules present in the input bam file. The file is divided in NUMBER OF PARTITIONS (almost) equal pieces but ONLY the PARTITION-th partition (fraction) is analyzed. For instance, –partition 3:7 means that the bam file is divided in seven pieces but only the third piece is analyzed by the current instance of sm-analysis. By default, all the file is analyzed.

-C <PATH>, --CCS-bam-file <PATH>

The CCS file in BAM format can be optionally provided; otherwise it is computed. It is necessary to create the reference mapping between hole numbers and the DNA sequence of the corresponding fragment, or molecule. After being aligned, the file will be also used to determine the position of each molecule in the report of methylation states. If the CCS BAM file is provided, and any of the necessary aligned versions of it is not found, the CCS file will be aligned to be able to get the positions. If this option is not used, a CCS BAM will be generated from the original BAM file using the ccs program.

--keep-temp-dir

Should we keep a copy of the temporary files generated? (default: No).

-m <MOD-TYPE>, --modification-types <MOD-TYPE>

Focus only in the requested modification types (default: m6A). Multiple space-separated values can be given.

--only-produce-methylation-report

Use this flag to only produce the methylation report from the per detection csv file (default: No).

--use-blasr-aligner

this option sets blasr as the aligner, instead of the default aligner (pbmm2)

--mapping-quality-threshold <INT>

minimum mapping quality that each individual subread is required to have in order to pass the filters. The possible mapping quality values are positive integers in the range [0, 255] (default: half the estimated maximum value found in the aligned BAM file).

-v, --verbose

Run the program in verbose mode such that much more runtime details are produced.

Graphical User Interface: sm-analysis-gui

Despite the power, beauty and vintage flavor of the command line, PacBio Data Processing offers a Graphical User Interface (GUI) for its main executable sm-analysis: sm-analysis-gui which, upon execution, will open a window that will allow you to drive the single molecule analysis.