Glossary

alignment variant

The result of aligning a BAM file using a rotated reference. The word rotated implies that the reference is considered to have a circular topology (unless, of course, the angle of the rotation is 0). If the rotation angle is 0 degrees/radians, i.e. no rotation is applied to the reference, the result of the alignment is called straight in PacBio Data Processing. If a rotation angle of 180 degrees (or π radians) is applied to the refereence, the resulting alignment is called pi-shifted, or π-shifted.

Bash

The default shell of a GNU operating system, as its documentation declares. If the target OS is Linux, Bash is probably the shell, or command line interface, that you are using to enter commands. In case of doubt, type:

$ echo $SHELL

and if you are using a bash shell you will get /bin/bash.

Command Line Interface (CLI)

An interface between a system and its user based on the command line, i.e. the system’s behaviour is controled by instructions passed to it as text through the keyboard. See Command Line Interface (CLI).

Command Line Option

A flag that can be used in a Command Line Interface (CLI) to customize the behaviour of the program. In Unix a command line option typically begins by either - for short option names, e.g. -h or by -- for long option names, e.g. --help. A command line option might accept a value, e.g. -N 3. That depends on the nature of the option.

CPython

Python is a programming language with multiple implementations. Its reference implementation is written in the C programming language and its name is CPython. There are other implementations like RustPython (Python implemented in Rust) or PyPy (Python implemented in RPython), for instance. All implementations should be equivalent in functionality. Sometimes the term Python is used instead of CPython. Though imprecise, this is common practice, if there is no confusion.

CSV file

A Comma Separated Values file. As its name suggests, the file is structured in a table-like fashion, but, interestingly, the separator must not be a comma, although the comma is a very common choice. The CSV standard is defined in RFC 4180.

exit status

The exit status of an executed command is the value returned by it (actually, by a waitpid system call or equivalent function). From the shell, the $? variable holds the value returned by the last executed command. Type echo $? right after the command you are interested in terminates, to find out its exit status. The exit statuses are integers in the range 0-255. A value of 0 means success. Non-zero values indicate failure.

FASTA

Text based file format to store sequences of DNA, or in general, nucleotides or amino acids. See the Wikipedia page on FASTA format, and references therein.

GFF

A file format to encode genetic features. See the GFF3 definition.

Graphical User Interface (GUI)

An interface between a system and its user based on graphical icons, where the mouse is typically involved. See Graphical User Interface (GUI).

MD5 checksum

A checksum based on the MD5 algorithm. Used only in PacBio Data Processing as a mechanism to protect the data integrity against unintentional corruption.

module

In the context of a cluster, a module usually refers to a so-called environment module. It is a relatively low-level administrative tool to automate the steps required to use software installed in a non-standard location. Sometimes the term modulefile is used too, because it is a file, the modulefile, that defines the necessary modifications of the environment to enable a straightforwad usage of a particular piece of software that you are interested in. A module can be loaded (to add or make accessible the target software in the current environment) and unloaded (to remove it from the current environment). Environment modules give great flexibility to use software on a system that is not under your complete control (like a cluster): multiple implementations of the same facilities can coexist in the same system without conflicts, and without interfering with the base system. In a cluster, modules are typically loaded for instance, to use some compiler (version) or some library (version) or any other tool that is not available in the base system. See environment modules or an introductory article in Admin magazine on environment modules.

molecule

In the context of PacBio Data Processing molecule refers to a fragment of DNA that was captured in a hole, aka ZMW, in the sequencing machine. Each molecule in a BAM file is identified with a positive integer and typically spans several subreads.

PATH

An environment variable that contains the search path for commands. It is a colon-separated list of directories in which the shell looks for commands. Type man bash or info bash in your shell for more details.

reference

A DNA sequence used as a reference for the single molecule analysis stored as a file in the FASTA format.

subread

A single line in the BAM file. Each subread belongs to one molecule.

summary report

An HTML file created by Single molecule analysis with the sm-analysis program with basic statistics about the input BAM, the input reference and the output produced by the sm-analysis program during its analysis. It includes also some intermediate details of the process and selected plots that provide a visual help for some quantities or additional information about a certain distribution or quantity.

variant

See alignment variant.