BAM format¶
The input/output sequencing files used by PacBio Data Processing are BAM files (see SAM/BAM format for a description).
The bam file¶
The bam file is is a binary file. It is a compressed version of a sam file —that is a human readable format— containing all the sequencing information. To manipulate the bam file format you can use packages like pysam compatible only with python 2 or pybam for python2 and pybam for python 3. However, for the PacBio bam file Pysam and Pybam are usefull to explore basic fiels in the aligned PacBio bam file.
PacBio bam file¶
Compared with the standard bam format, the PacBio.bam have extra columns as those containing the Interpulse Duration (IPD) value and Pulse Width (PW) that are important for the kinetic analysis.
The bam file of the main PacBio output (filename.subreads.bam) contains
26 columns which can be inspected using samtools
. For instance, the
following shell
command tells us the number of columns in a bam file:
samtools view filename.subreads.bam|awk -F'\t' '{print NF; exit}'
In the next table, the most important columns in a PacBio bam are described:
Column number |
Tag or content |
Description |
---|---|---|
1 |
Molecule identifier |
Molecule indentifier containing {movieName}/ {holeNumber}/ {qStart}_{qEnd} |
10 |
…AGTAC… |
Sequence |
11 |
…~~C!~… |
QUAL |
12 |
RG |
ReadGroup |
13 |
dq |
DeletionQV |
14 |
dt |
DeletionTag |
15 |
ip |
Ipd: B,C or B,S (raw frames or codec V1) |
16 |
iq |
InsertionQv |
17 |
mq |
MergeQv |
18 |
np |
NumPasses |
19 |
pw |
PulseWith: B,C or B,S (raw frames or codec V1) |
20 |
qe |
0_based end |
21 |
qs |
0_based start |
22 |
rq |
Float in [0,1] encoding expected accuracy |
23 |
sn |
4 floats for the average signal-to-noise ratio of A,C,G, and T (in that order) over the HQ region |
24 |
sq |
SubstitutionQV |
25 |
zm |
ZNW hole number |
26 |
cx |
Subread local context flags |
If you want to check the differents length of the subreads using command line, you can type:
samtools view filename.subreads.bam|awk '{print length ($10)}'|sort -nur
For more information see the following link: BAM format specification for PacBio
Aligned PacBio bam file¶
The aligned bam file of the main PacBio output (moviename.subreads.bam)
contains 28 columns. Again, the next shell
one-liner counts the
columns in the bam:
samtools view moviename.subreads.bam|awk -F'\t''{print NF; exit}'
It follows a brief description of the most important columns:
Column number |
Tag or content |
Description |
---|---|---|
1 |
Molecule identifier |
Molecule indentifier containing {movieName}/ {holeNumber}/ {qStart}_{qEnd} |
2 |
mapping flag |
Value related to the alignment type (forward strand (0) and reverse strand (16) are the most important. More details in the link ‘’Map Format Specification’’ below) |
4 |
position |
Position where the sequence was mapped |
5 |
mapping quality |
Quality of the mapping |
10 |
…AGTAC… |
Sequence |
11 |
…~~C!~… |
QUAL |
12 |
RG |
ReadGroup |
13 |
dq |
DeletionQV |
14 |
dt |
DeletionTag |
15 |
ip |
Ipd: B,C or B,S (raw frames or codec V1) |
16 |
iq |
InsertionQv |
17 |
mq |
MergeQv |
18 |
np |
NumPasses |
19 |
pw |
PulseWith: B,C or B,S (raw frames or codec V1) |
20 |
qe |
0_based end |
21 |
qs |
0_based start |
22 |
rq |
Float in [0,1] encoding expected accuracy |
23 |
sn |
4 floats for the average signal-to-noise ratio of A,C,G, and T (in that order) over the HQ region |
24 |
sq |
SubstitutionQV |
25 |
zm |
ZNW hole number |
26 |
cx |
Subread local context flags |
27 |
AS |
Alignment score generated by aligner |
28 |
NM |
Number of differences (mismatches plus inserted and deleted bases) between the sequence and reference |
For more information:
Fields¶
In this section we give details on some particular fields (columns) in a bam file.
Quailty of sequencing¶
In the SAM/BAM format specification it is declared that the 11-th
column in the alignment section of BAM files is named QUAL
, and
it is described like follows:
(brief description) ASCII of Phred-scaled base QUALity+33
QUAL: ASCII of base QUALity plus 33 (same as the quality string in the Sanger FASTQ format). A base quality is the phred-scaled base error probability which equals -10 log10 Pr{base is wrong}. This field can be a ‘*’ when quality is not stored. If not a ‘*’, SEQ must not be a ‘*’ and the length of the quality string ought to equal the length of SEQ.
And the Wikipedia (FASTQ) explains:
The byte representing quality runs from
0x21
(lowest quality;!
in ASCII) to0x7e
(highest quality;~
in ASCII). Here are the quality value characters in left-to-right increasing order of quality (ASCII):!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
For example, each base in a sequence like (in the 10-th column of a BAM file):
AATGCTAGCTAGCTCCTTGGATCGATCCGAT
will have an ASCII symbol (between !
and ~
) associated with it
that will be the contents of the 11-th column in the BAM file. For
instance:
~~~~i~l~~~~_~~~~Z~~~~~~~~~~~~~~
Each symbol tells us the quality of sequencing the corresponding base.
Since the ASCII symbols !
and ~
correspond to 33
and
126
in decimal (or 0x21
and 0x7e
in hexadecimal), and since
each quality value is shifted by 33
it means that the range
of allowed qualities, [0, 93]
, corresponds to a range of allowed
probabilities for each base being wrong of, roughly [1, 0.00005]
(beware the scale).