CCS¶
The sm-analysis program uses the ccs
program to
obtain highly accurate reads from the input BAM that are taken as the
true sequences and positions within the reference for each molecule in
the analysis done by the sm-analysis program.
ccs
is a runtime dependency: it must be accessible at latest at
runtime and it is called on demand: if its output is already there, it will
not be computed again.
By default, ccs
is searched for in the PATH. If it is
not found in the PATH, you will receive a common runtime
error message:
[CRITICAL] [Errno 2] No such file or directory: 'ccs'
and the sm-analysis program itself will terminate.
In that case, the instructions in the following sections can help you.
Installing CCS¶
An easy way to install the ccs
program is with conda
.
Have a look at Setting up Bioconda, and follow the instructions, if
you haOnce those steps are followed,
and the resulting conda
environment is active, install pbccs
:
$ conda install pbccs
Warning
Notice that, contrary to the suggestion given in PacBio & Bioconda,
the explicit selection of the bioconda
channel by means of the -c
option of conda install
(e.g., conda install -c bioconda ...
)
triggers a dependency error. DO NOT USE the -c bioconda
option,
just run conda install ...
instead, as explained in the main text.
Note
An alternative way to install pbindex
is through SMRT-link software server tool.
Upon success, you will be able to pass the path to the ccs
executable to sm-analysis if needed (see
below for details).
Using ccs from sm-analysis¶
Let us assume that PacBio Data Processing was installed inside a virtual environment located in:
/home/dave/.venvs/pdp
and let us assume that pbbioconda
was installed in:
/home/dave/miniconda3
then, after activating the PacBio Data Processing’s virtual environment:
$ source /home/dave/.venvs/pdp/bin/activate
you can tell sm-analysis
about ccs
by using a command
line option (sm-analysis -c
) as follows:
$ sm-analysis --ccs-path /home/dave/miniconda3/bin/ccs
Issues¶
Multimapping¶
In some cases an aligned CCS file presents multimapping. Two examples take from the
st1A09
file:
m54099_200720_153206/4194505/ccs 0 U00096.3 392180 0 150= * 0 0 ATCTGTACGTAAGTACGTGATGTCTCCTGCCCACTTCT...
m54099_200720_153206/4194505/ccs 256 U00096.3 1094716 0 150= * 0 0 ATCTGTACGTAAGTACGTGATGTCTCCTGCCCACTTCT...
m54099_200720_153206/4194505/ccs 272 U00096.3 2170808 0 150= * 0 0 GGACTGAGGGCAAAGGCCTCCCGGAAGTTCAGCCCGGT...
m54099_200720_153206/4194505/ccs 272 U00096.3 567414 0 150= * 0 0 GGACTGAGGGCAAAGGCCTCCCGGAAGTTCAGCCCGGT...
m54099_200720_153206/4194505/ccs 272 U00096.3 315863 0 150= * 0 0 GGACTGAGGGCAAAGGCCTCCCGGAAGTTCAGCCCGGT...
...
m54099_200720_153206/4194627/ccs 0 U00096.3 274198 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 256 U00096.3 574834 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 256 U00096.3 688094 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 272 U00096.3 3130803 0 295= * 0 0 CGGCCAACGAGCATGACCTCAATCAGCTGGGTAATCTG...
m54099_200720_153206/4194627/ccs 256 U00096.3 2101992 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 256 U00096.3 2289162 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 272 U00096.3 1396701 0 295= * 0 0 CGGCCAACGAGCATGACCTCAATCAGCTGGGTAATCTG...
m54099_200720_153206/4194627/ccs 272 U00096.3 1300156 0 295= * 0 0 CGGCCAACGAGCATGACCTCAATCAGCTGGGTAATCTG...
m54099_200720_153206/4194627/ccs 256 U00096.3 3365799 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
m54099_200720_153206/4194627/ccs 256 U00096.3 3652279 0 295= * 0 0 CCCTTGTATCTGGCTTTCACGAAGCCGAACTGTCGCTT...
How do we decide the position? In the current implementation, the first
subread of each molecule is taken (for details, see
pacbio_data_processing.sm_analysis.map_molecules_with_highest_sim_ratio()
),
because all the subreads are perfect. But notice that the positions (4th
column) differ.