What is the PacBio sequencing technology?

Warning

This document, or parts of it, might be obsolete. Be careful.

PacBio sequencing technology allows the measurement of DNA polymerization kinetics during the sequencing process. When the DNA presents some type of modification such as methylation, the PacBio sequencer records a change in the polymerization kinetics. The relationship between modified DNA and the change in polymerization time detected by this technology has allowed the study of methylation in many organisms.

PacBio protocol

  1. DNA extraction

  2. DNA processing

  3. DNA sequencing

  4. Data output

How is the PacBio sequencer output processed?

The first version of the PacBio sequencer (PacBio RS) produced a single output file named bas.h5. The following version (PacBio RS II) produced four output files: three bax.h5 files and one bas.h5 file. The latest generation of PacBio sequencers produces a single output file with a .bam extension (binary alignment map). The bas.h5/bax.h5 formats can be converted to .bam with bioinformatic tools installed through PacBio-Bioconda, which offers several tools that are useful during the primary and secondary analysis of the data generated by the PacBio sequencer.
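As a rough illustration of the conversion step, the sketch below builds a bax2bam command line for the three bax.h5 files of one PacBio RS II movie. The file names and the output prefix are hypothetical, and the -o (output prefix) option is an assumption that should be checked against the help text of the installed bax2bam version.

```python
import shlex


def build_bax2bam_command(bax_files, output_prefix):
    """Return the argument list for converting bax.h5 files to BAM.

    Assumption: bax2bam takes the output prefix through -o, followed by
    the input bax.h5 files. Verify with `bax2bam --help` before use.
    """
    if not bax_files:
        raise ValueError("at least one bax.h5 file is required")
    return ["bax2bam", "-o", output_prefix, *bax_files]


# Hypothetical file names for one PacBio RS II movie.
example_files = ["movie.1.bax.h5", "movie.2.bax.h5", "movie.3.bax.h5"]
command = build_bax2bam_command(example_files, "movie")
print(shlex.join(command))
# To actually run the conversion: subprocess.run(command, check=True)
```

The command is only printed here, so the sketch can be inspected without having bax2bam installed.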

Primary and Secondary analysis

After sequencing, the generated PacBio output is processed in two steps.

  • A primary analysis is performed by the instrument to obtain information about the sequenced DNA, such as IPD values and sequencing quality.

  • A secondary analysis covers steps such as alignment and circular consensus sequence generation, some of which are included in the sm-analysis software.

More information can be found here.

PacBio Tools

PacBio provides the following analysis tools:

  • SMRT: Must be installed on server

  • PacBio Bioconda: Can be installed on server and personal computer

PacBio-Bioconda tools are installed in a virtual environment. The collection includes several tools; some of them are described here:

  • bax2bam - Converts PacBio files produced by older sequencer versions into a BAM file that the PacBio tools can handle.

  • blasr - The official PacBio aligner, adapted for long sequencing reads. In a comparison with other aligners such as BWA, Segemehl, and pbalign, blasr and pbalign gave the best mapping; both are available in the PacBio-Bioconda tools. We chose blasr for the analyses because it is the only one whose output includes the IPD columns needed to detect DNA modifications.

  • pbmm2 - An aligner suggested as a substitute for blasr. In our evaluation it was faster in the alignment process, but the total number of aligned subreads did not differ much. Its output was not sorted by molecule, so it has been discarded for the time being.

  • CCS - This tool generates the Circular Consensus Sequence (CCS) by combining multiple subreads.

Modification detection tools

There are several Base Modification Tools related to the detection of modified bases in DNA from files generated by PacBio sequencing.

Base Modification Tool    Programming language
R-kinetics                R
kineticsTools             Python
MotifMaker                Java
MotifFinder               SMRT, R

kineticsTools

For the analyses we have decided to use kineticsTools for two reasons:

  1. It is written in Python, the same language used by PacBio-Bioconda and by sm-analysis.

  2. It has a tool called ipdSummary that allows us to predict DNA modifications without the need for a control sample.

ipdSummary

ipdSummary predicts sequence modifications using a computational (in-silico) model instead of a control sample. It detects modifications at the nucleotide level in sequences that present m5C, m4C or m6A methylation, with m6A being the one it detects best.
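The idea of comparing measured kinetics with a model can be sketched as a ratio between the mean observed IPD at a position and the IPD expected for unmodified DNA. The values below are made up for illustration and the function is not part of kineticsTools; the model and scoring used by ipdSummary are more elaborate.

```python
from statistics import mean


def ipd_ratio(observed_ipds, expected_ipd):
    """Mean observed IPD divided by the IPD expected for unmodified DNA."""
    if expected_ipd <= 0:
        raise ValueError("expected_ipd must be positive")
    return mean(observed_ipds) / expected_ipd


# Made-up IPD values for one reference position.
observed = [1.2, 1.5, 1.8, 1.5]
ratio = ipd_ratio(observed, expected_ipd=0.5)
print(f"IPD ratio: {ratio:.1f}")
```

A ratio well above 1 means the polymerase paused longer than the model expects at that position, which is the kind of signal associated with a modified base.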

The following filters and data properties affect ipdSummary:

  • Mapping quality - By default the minimum mapping quality required is 10, which implies that BLASR is 90% confident that the read is mapped correctly. However, we also find many subreads that map to more than one position, sometimes far apart from each other.

  • Number of subreads per molecule - ipdSummary is effective on molecules that have at least 20 mapped subreads.

  • Length of the subreads - The PacBio output file can contain subreads of different sizes, ranging from fewer than 50 bases to thousands of bases.

  • Multi-mapping - Some molecules contain subreads with different mapping positions, which lowers the confidence of the modification predicted at a position. In some cases, multi-mapping occurs in the region comprising position 0 of the reference sequence.
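The 90% figure for a mapping quality of 10 follows from the Phred scale, where a quality Q corresponds to an error probability of 10^(-Q/10). A minimal sketch of that conversion (the function name is illustrative):

```python
def mapq_to_confidence(mapq):
    """Convert a Phred-scaled mapping quality into the probability
    that the reported mapping position is correct."""
    error_probability = 10 ** (-mapq / 10)
    return 1 - error_probability


for quality in (10, 20, 30):
    print(quality, round(mapq_to_confidence(quality), 4))
```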

sm-analysis

sm-analysis builds on several PacBio-Bioconda tools to analyze the data coming from PacBio sequencers. Unlike other approaches, sm-analysis can analyze every PacBio bell (a DNA molecule with adapters) separately to obtain its methylation status, circular consensus sequence, start and end positions, and GATCs (if any) relative to the reference sequence. sm-analysis uses ipdSummary, a pre-trained algorithm able to detect m6A methylation from Inter-Pulse Duration (IPD) values.
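Analyzing each molecule separately requires grouping the subreads that come from the same molecule. PacBio subread names follow the pattern movie/holeNumber/qStart_qEnd, so the hole number (ZMW) can serve as the molecule identifier. The sketch below shows the grouping on made-up read names; it is an illustration, not the sm-analysis implementation.

```python
from collections import defaultdict


def molecule_id(subread_name):
    """Extract the hole number (ZMW) from a PacBio subread name."""
    _movie, hole_number, _interval = subread_name.split("/")
    return int(hole_number)


def group_by_molecule(subread_names):
    """Map each molecule id to the list of its subread names."""
    groups = defaultdict(list)
    for name in subread_names:
        groups[molecule_id(name)].append(name)
    return dict(groups)


# Made-up subread names: two subreads from one molecule, one from another.
names = [
    "m54099_200101_000000/4194/0_1200",
    "m54099_200101_000000/4194/1250_2430",
    "m54099_200101_000000/5001/0_980",
]
groups = group_by_molecule(names)
for zmw, members in sorted(groups.items()):
    print(zmw, len(members))
```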

As mentioned previously, PacBio chemistries undergo periodic updates that create data compatibility conflicts during processing. To solve this, sm-analysis lets the user choose the chemistry to be used in the analysis.

We have also developed a bam-filter step that includes options to filter our data. The following sequence of instructions can be used to filter the aligned subreads.bam file:

Option                         Description
-l 50 -q 254 -m 0 16 -R 0.9    Minimum subread length of 50 bases; keep only high quality mapping; molecules with 90% unique mapping
-m 0 16                        Keeps only uniquely mapped subreads
-r 20                          Only molecules with at least 20 subreads

*In the table above, the filters are applied to a subreads.bam file aligned with blasr. When using a different aligner, such as pbmm2, change the parameters for mapped subreads and mapping quality.
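To make the filtering criteria concrete, here is a minimal sketch of the same logic applied to in-memory records instead of a BAM file. The Subread class, the function name, and the reading of each option (in particular that -m lists the accepted SAM flags and that -R is the minimum fraction of a molecule's subreads carrying one of those flags) are assumptions made for illustration, not the bam-filter implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Subread:
    molecule: int  # hole number (ZMW) of the source molecule
    length: int    # subread length in bases
    mapq: int      # mapping quality reported by the aligner
    flag: int      # SAM flag of the alignment


def filter_subreads(subreads, min_length=50, min_mapq=254, flags=(0, 16),
                    min_ratio=0.9, min_subreads=20):
    """Return the subreads that survive all filters, molecule by molecule."""
    by_molecule = defaultdict(list)
    for subread in subreads:
        by_molecule[subread.molecule].append(subread)

    kept = []
    for members in by_molecule.values():
        accepted = [s for s in members if s.flag in flags]
        # Assumed meaning of -R: fraction of subreads with an accepted flag.
        if len(accepted) / len(members) < min_ratio:
            continue
        # Assumed meaning of -l and -q: per-subread thresholds.
        good = [s for s in accepted
                if s.length >= min_length and s.mapq >= min_mapq]
        # Assumed meaning of -r: minimum surviving subreads per molecule.
        if len(good) >= min_subreads:
            kept.extend(good)
    return kept


# Made-up records; min_subreads is lowered so the tiny example passes.
example = [
    Subread(1, 800, 254, 0), Subread(1, 750, 254, 16), Subread(1, 40, 254, 0),
    Subread(2, 900, 254, 0), Subread(2, 900, 254, 256),
]
survivors = filter_subreads(example, min_subreads=2)
print([(s.molecule, s.length) for s in survivors])
```

In this example molecule 2 is dropped because only half of its subreads carry an accepted flag, and the 40-base subread of molecule 1 is dropped by the length threshold.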

Pipeline

TBD

Usage

  • Input

  • Execution

  • Output description