Summary Reports

abstract:

The sm-analysis program summarizes some statistics about the analysis in a file aptly called the summary report. This document describes its contents.

Introduction

Most of the data in the summary report are self-explanatory, but some points deserve special attention. This document provides additional comments that clarify the precise meaning and format of the quantities given in the summary report.

Description

The remaining subsections of this document mirror the sections of the summary report. Not every quantity that appears in a summary report is described below; only the points that require additional remarks are documented here.

Overview

This section contains general information about the particular run of Single molecule analysis with the sm-analysis program that produced the summary report in question. The command used to produce the analysis can be obtained by joining Program name and Program options. The date of the analysis is the date and time at which the sm-analysis program started, given in ISO 8601 format with a precision of minutes, following this pattern: YYYY-MM-DDTHH:MM.
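As an illustration of the timestamp pattern described above, the following minimal sketch formats and parses a date with minute precision. The function names are hypothetical and are not part of sm-analysis; the sketch assumes a naive (timezone-free) timestamp.

```python
from datetime import datetime

# Pattern quoted in the summary report description: YYYY-MM-DDTHH:MM
DATE_PATTERN = "%Y-%m-%dT%H:%M"


def format_analysis_date(started: datetime) -> str:
    """Render a start time with minute precision (seconds are dropped)."""
    return started.strftime(DATE_PATTERN)


def parse_analysis_date(text: str) -> datetime:
    """Read back a date written in the YYYY-MM-DDTHH:MM pattern."""
    return datetime.strptime(text, DATE_PATTERN)


# Made-up start time, used only for illustration.
stamp = format_analysis_date(datetime(2023, 5, 17, 9, 3, 42))
print(stamp)
```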

Result filenames

The Single molecule analysis with the sm-analysis program produces several output files. Local links to those files are provided in this section. See Methylation Reports, Raw Detections and GFF format for a description of their contents. Note that the link to the GFF file points to a concatenation of all the GFF files produced by ipdSummary.

Input files

This section contains details about the input files passed in to the Single molecule analysis with the sm-analysis program: the BAM file and the reference file.

BAM File

Apart from the filename and the size in bytes, the report includes the MD5 checksum of the full file, as it was at the time of the analysis, and the MD5 checksum of the body of that file. Both are included because a tool could alter the header of the BAM file while leaving the body intact. Normally, the full checksum is enough. But if, for some reason, the header of the BAM file is modified after the analysis, having the body checksum at hand can be helpful.
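To compare a file on disk with the full-file checksum in the report, an MD5 digest can be recomputed as in the sketch below. The function name and the temporary file are illustrative assumptions. The body checksum is not shown, because the way sm-analysis separates the header from the body is not described here.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large BAM files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a small temporary file standing in for a BAM file.
demo_data = b"stand-in for the bytes of a BAM file"
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(demo_data)
    demo_path = tmp.name
demo_checksum = md5_of_file(demo_path)
os.remove(demo_path)
print(demo_checksum)
```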

Reference

In addition to the filename, the report gives the reference name contained in the FASTA file (under Reference name), the length of the sequence in base pairs, and the MD5 checksum of the sequence after normalizing it to upper case.
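The idea of a case-insensitive sequence checksum can be sketched as follows. This is only an approximation under stated assumptions: the FASTA text holds a single record, line breaks are not part of the sequence, and the function name is hypothetical. The exact normalization performed by sm-analysis may differ in its details.

```python
import hashlib


def reference_sequence_md5(fasta_text: str) -> str:
    """MD5 of the sequence in a single-record FASTA text, upper-cased."""
    sequence = "".join(
        line.strip()
        for line in fasta_text.splitlines()
        if line and not line.startswith(">")  # skip blank and header lines
    )
    return hashlib.md5(sequence.upper().encode("ascii")).hexdigest()


# Two made-up records that differ only in letter case and line wrapping.
lower = ">ref\ngatcaa\ngatc\n"
upper = ">ref\nGATCAAGATC\n"
print(reference_sequence_md5(lower) == reference_sequence_md5(upper))
```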

Molecules/subreads

Basic statistics about the body of the input BAM file. All the quantities in this section are given for molecules and for subreads.

Initial contains the molecules/subreads counts in the input BAM file. All percentages in this section refer to these quantities. In particular, the statistics about subreads refer to what is found in the input BAM file.

Used in aligned CCS BAM refers to how many molecules/subreads from the input BAM file are also in any of the alignment variants of the CCS BAM file. Each molecule from the input BAM file will be assigned, at most, to one variant, i.e. even if a molecule is found in all the alignment variants, it (and its subreads) will not be counted more than once.

DNA mismatch discards counts the molecules for which the sequence provided by the aligned CCS BAM file does not match the reference at the position also given by the aligned CCS BAM file.

Filtered out contains statistics about the molecules discarded by the filters applied by Single molecule analysis with the sm-analysis program to the input BAM file.

Faulty (with processing error) are molecules whose corresponding single molecule BAM file had problems when it was processed by either pbindex or ipdSummary. The details about what went wrong are given in the output displayed on the screen if the sm-analysis program was executed in verbose mode, i.e. sm-analysis -v. Faulty molecules are exceptional: a normal Single molecule analysis with the sm-analysis program is expected to have no faulty molecules at all.

The In methylation report… row gives the fraction of the initial data that ends up in the methylation report. These quantities are further split according to whether the molecules and subreads contain GATCs or not, in the rows …only with GATCs and …only without GATCs, respectively. Of course, …only with GATCs and …only without GATCs add up to In methylation report….

The quantities in Used in aligned CCS BAM should be the sum of the corresponding numbers found in the following rows: DNA mismatch discards, Filtered out, Faulty (with processing error) and In methylation report…. The difference between Initial and Used in aligned CCS BAM is due to molecules that do not survive the CCS and alignment processes. Since the positioning of the molecules is essential to determine the location of the detected methylation, the aligned CCS BAM file is taken as a baseline.
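The bookkeeping described above can be checked mechanically. The sketch below uses a hypothetical container with made-up counts; the field names are illustrative and do not correspond to any structure exposed by sm-analysis.

```python
from dataclasses import dataclass


@dataclass
class MoleculeCounts:
    """Hypothetical container for one column (molecules or subreads)."""
    used_in_aligned_ccs: int
    dna_mismatch_discards: int
    filtered_out: int
    faulty: int
    in_report: int
    in_report_with_gatcs: int
    in_report_without_gatcs: int


def is_consistent(counts: MoleculeCounts) -> bool:
    """Check the two sums that the rows of the section should satisfy."""
    rows_add_up = counts.used_in_aligned_ccs == (
        counts.dna_mismatch_discards
        + counts.filtered_out
        + counts.faulty
        + counts.in_report
    )
    report_split_adds_up = counts.in_report == (
        counts.in_report_with_gatcs + counts.in_report_without_gatcs
    )
    return rows_add_up and report_split_adds_up


# Made-up numbers, for illustration only.
example = MoleculeCounts(
    used_in_aligned_ccs=1000,
    dna_mismatch_discards=20,
    filtered_out=130,
    faulty=0,
    in_report=850,
    in_report_with_gatcs=800,
    in_report_without_gatcs=50,
)
print(is_consistent(example))
```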

New in version 1.0: Faulty (with processing error) added to the summary report.

Sequencing Position Coverage

Positions covered by molecules in the BAM file refers to the position coverage provided by all the molecules in all the alignment variants of the CCS BAM file. The percentages in this section are relative to the length of the reference.
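A simplified picture of how such a coverage figure arises is sketched below. It assumes that each molecule is represented by a half-open interval (start, end) of 0-based reference positions on a linear reference, and it ignores how the alignment variants are merged. The function name and the data are hypothetical.

```python
def covered_positions(intervals, reference_length):
    """Count distinct reference positions covered by at least one interval."""
    covered = set()
    for start, end in intervals:
        # Clip each interval to the reference before collecting its positions.
        covered.update(range(max(start, 0), min(end, reference_length)))
    return len(covered)


# Made-up molecules on a reference of 100 base pairs.
molecules = [(0, 10), (5, 20), (30, 40)]
n_covered = covered_positions(molecules, 100)
percentage = 100 * n_covered / 100
print(n_covered, percentage)
```

Overlapping molecules contribute each position only once, which is why the count cannot exceed the reference length.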

GATCs

As in the section Sequencing Position Coverage, the Number of GATCs identified in the BAM file includes all the molecules in the merged alignment variants of the CCS BAM file.

Methylations

As the summary report itself declares, in this section the individual methylation detections are considered. Any GATC in the reference can be detected multiple times: several molecules can cover the same GATC, but each molecule will be analyzed independently by sm-analysis. That is why Total number of GATCs in all the analyzed molecules does not (and must not!) agree with the numbers in the GATCs section.
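The difference between the two counts can be seen in a toy example. The sketch below assumes molecules given as half-open intervals on a short, made-up linear reference; the names are hypothetical and the code is not the procedure used by sm-analysis.

```python
def gatc_positions(sequence):
    """Return the 0-based start positions of every GATC in the sequence."""
    positions = []
    start = sequence.find("GATC")
    while start != -1:
        positions.append(start)
        start = sequence.find("GATC", start + 1)
    return positions


def gatcs_in_molecule(positions, start, end):
    """GATCs lying entirely within the half-open interval [start, end)."""
    return [p for p in positions if start <= p and p + 4 <= end]


reference = "GATCAAGATCTTGATC"        # made-up reference with three GATCs
molecules = [(0, 10), (4, 16)]        # made-up molecules; both cover position 6

positions = gatc_positions(reference)
per_molecule = [gatcs_in_molecule(positions, s, e) for s, e in molecules]

# Each molecule is counted independently, so shared GATCs are counted twice.
total_in_molecules = sum(len(found) for found in per_molecule)
distinct_covered = len({p for found in per_molecule for p in found})
print(total_in_molecules, distinct_covered)
```

Here the GATC at position 6 is seen by both molecules, so the per-molecule total exceeds the number of distinct GATCs in the reference.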