What is the PacBio sequencing technology?
Warning
This document, or parts of it, might be obsolete. Be careful.
PacBio sequencing technology measures DNA polymerization kinetics
during the sequencing process. When the DNA carries a modification,
such as methylation, the PacBio sequencer records a change in the
polymerization kinetics. This relationship between modified DNA and
the change in polymerization time has allowed the study of methylation
in many organisms.
PacBio protocol
DNA extraction
DNA processing
DNA sequencing
Data output
How is the PacBio sequencer output processed?
The first version of the PacBio sequencer (PacBio RS) produced a single
output file named bas.h5. The following version of the sequencer
(PacBio RS II) produced four output files: three bax.h5 files and one
bas.h5 file. Currently, the latest generation of PacBio sequencers
produces a single output file with a .bam extension (binary alignment
map). It is possible to convert the bas.h5/bax.h5 output formats to
the .bam format using bioinformatic tools that can be installed through
PacBio-Bioconda, which offers several tools that are useful during the
primary and secondary analysis of the data generated by the PacBio
sequencer.
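The naming rule above can be sketched in a few lines of Python. This is only an illustration: the function names and the example file names are hypothetical, and the snippet only identifies which files are in a legacy format; it does not perform the conversion to .bam.

```python
from pathlib import Path


def classify_pacbio_output(filename: str) -> str:
    """Return the PacBio output format suggested by the file name."""
    name = Path(filename).name.lower()
    if name.endswith(".bas.h5"):
        return "bas.h5"
    if name.endswith(".bax.h5"):
        return "bax.h5"
    if name.endswith(".bam"):
        return "bam"
    return "unknown"


def needs_conversion(filenames):
    """List the files that are in a legacy (bas.h5/bax.h5) format."""
    legacy = {"bas.h5", "bax.h5"}
    return [f for f in filenames if classify_pacbio_output(f) in legacy]


# Hypothetical file names, for illustration only.
files = ["movie.bas.h5", "movie.1.bax.h5", "movie.subreads.bam"]
print(needs_conversion(files))
```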
Primary and Secondary analysis
After sequencing, the PacBio output is processed in two steps.
The primary analysis is performed by the instrument to obtain
information about the sequenced DNA, such as IPD values and sequencing
quality.
The secondary analysis includes some of the steps performed by the
sm-analysis software, such as alignment and circular consensus sequence
generation.
More information can be found here.
ipdSummary
ipdSummary allows us to predict sequence modifications using a
computational (in-silico) model. It detects modifications at the
nucleotide level in sequences that present m5C, m4C or m6A
methylation; of these, m6A is the one best detected by this tool.
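To make the idea of comparing observed kinetics against an in-silico model concrete, the sketch below computes an IPD ratio: the mean observed IPD at a position divided by the value a model predicts for unmodified DNA. The function name and the numbers are hypothetical, and this is not ipdSummary's implementation, which also applies statistical testing before calling a modification.

```python
from statistics import mean


def ipd_ratio(observed_ipds, model_ipd):
    """Mean observed IPD at one position divided by the model prediction.

    A ratio well above 1 means the polymerase paused longer than the
    in-silico model expects for unmodified DNA.
    """
    if not observed_ipds:
        raise ValueError("at least one observed IPD is required")
    if model_ipd <= 0:
        raise ValueError("the model prediction must be positive")
    return mean(observed_ipds) / model_ipd


# Hypothetical values: IPDs from several subreads covering one position.
print(ipd_ratio([2.0, 4.0], model_ipd=1.5))
```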
ipdSummary has its own filters:
Mapping Quality - By default, the minimum mapping quality required
is 10, which implies that BLASR is 90% confident that the read is
mapped correctly. However, we also find many subreads that map to
more than one position, and these positions are sometimes far apart.
Number of subreads per molecule - ipdSummary is effective on
molecules that have at least 20 mapped subreads.
Length of the subreads - The PacBio output file can contain
subreads of different sizes, ranging from fewer than 50 bases to
thousands of bases.
Multi-mapping - Some molecules contain subreads with different
mapping positions, which affects the confidence of the modification
predicted at a position. In some cases, multi-mapping occurs in the
region that includes position 0 of the reference sequence.
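Two of the points above lend themselves to a short worked example. A Phred-scaled mapping quality Q corresponds to a probability of correct mapping of 1 - 10^(-Q/10), so Q = 10 gives 90%. Subreads can also be grouped by molecule using the PacBio subread naming convention movie/holeNumber/qStart_qEnd. The helper names and read names below are hypothetical, and this sketch is not how ipdSummary applies its filters internally.

```python
from collections import Counter


def mapq_to_confidence(mapq: int) -> float:
    """Probability that a mapping is correct for a Phred-scaled MAPQ."""
    return 1.0 - 10.0 ** (-mapq / 10.0)


def molecule_id(subread_name: str) -> str:
    """Extract 'movie/holeNumber' from 'movie/holeNumber/qStart_qEnd'."""
    movie, hole_number, _span = subread_name.split("/")
    return f"{movie}/{hole_number}"


def molecules_with_enough_subreads(subread_names, min_subreads=20):
    """Return the molecules that have at least `min_subreads` subreads."""
    counts = Counter(molecule_id(name) for name in subread_names)
    return {mol for mol, n in counts.items() if n >= min_subreads}


# Hypothetical read names: 20 subreads from hole 42, one from hole 7.
names = [f"movie/42/{i * 100}_{i * 100 + 90}" for i in range(20)]
names.append("movie/7/0_90")
print(round(mapq_to_confidence(10), 3))
print(molecules_with_enough_subreads(names))
```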
sm-analysis
sm-analysis is based on different tools from PacBio-Bioconda that
allow us to analyze the information coming from PacBio sequencers.
What distinguishes sm-analysis from other approaches is the possibility
of analyzing every PacBio SMRTbell (DNA molecule with adapters)
separately, to obtain its methylation status, circular consensus
sequence, start and end positions, and GATCs (if present) relative to
the reference sequence. sm-analysis uses ipdSummary, a pre-trained
algorithm able to detect m6A methylations using the Inter-Pulse
Duration (IPD) values.
As mentioned previously, there are different PacBio chemistries, and
their periodic updates create data compatibility conflicts during
data processing. To solve this, sm-analysis lets the user choose the
chemistry to be used in the analysis.
We have also developed a bam-filter step that includes options to filter
the data. The following options can be used to filter the aligned
subreads.bam file:
| Option | Description |
| --- | --- |
| -l 50 -q 254 -m 0 16 -R 0.9 | Minimum subread length of 50 bases; keep only high quality mapping; molecules with 90% unique mapping |
| -m 0 16 | Keeps only uniquely mapped subreads |
| -r 20 | Only molecules with at least 20 subreads |
*In the table above, the filters are applied to a subreads.bam file
aligned with blasr. When using a different aligner, such as pbmm2, you
should change the parameters for mapped subreads and mapping quality.
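The filtering logic described by the table can be sketched in Python over in-memory records. Everything here is an assumption made for illustration: the Subread class, the function name and the parameter names are hypothetical, and the interpretation of the 90% ratio (the fraction of a molecule's subreads that pass the per-subread filters) is one plausible reading, not the confirmed behaviour of the bam-filter step.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Subread:
    molecule: str  # hypothetical molecule identifier
    length: int    # subread length in bases
    mapq: int      # mapping quality reported by the aligner
    flag: int      # SAM flag (0 = forward strand, 16 = reverse strand)


def filter_subreads(subreads, min_length=50, min_mapq=254,
                    allowed_flags=(0, 16), min_ratio=0.9, min_subreads=20):
    """Keep the subreads of molecules that satisfy all thresholds.

    Assumed semantics: a subread passes if it is long enough, has a high
    enough mapping quality and has an allowed flag; a molecule is kept if
    it has at least `min_subreads` passing subreads and these make up at
    least `min_ratio` of all its subreads.
    """
    by_molecule = defaultdict(list)
    for subread in subreads:
        by_molecule[subread.molecule].append(subread)

    kept = []
    for group in by_molecule.values():
        passing = [s for s in group
                   if s.length >= min_length
                   and s.mapq >= min_mapq
                   and s.flag in allowed_flags]
        if len(passing) < min_subreads:
            continue
        if len(passing) / len(group) < min_ratio:
            continue
        kept.extend(passing)
    return kept
```

The default thresholds mirror the values in the table, which refer to a blasr alignment. With another aligner, the mapping-quality and flag thresholds passed to such a function would need to be adapted.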
Usage
Input
Execution
Output description