Methylation Reports
- abstract:
The single molecule analysis pipeline, run with the sm-analysis program, produces a so-called methylation report. Its contents are described in this document.
Format
Note
This section describes the most recent version of the methylation report’s format.
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 21 columns with the following header:
molecule id;sequence;start of molecule;end of molecule;len(molecule);count(subreads+);count(subreads-);combined QUAL;mean(QUAL);sim ratio;count(GATC);positions of GATCs;count(methylation states);methylation states;combined score;mean(score);min(IPDRatio);mean(IPDRatio);combined idQV;mean(idQV);mean(coverage)
Some columns can themselves hold several values (see e.g. columns 12 and 14). In that case an internal separator, namely , (comma), is used.
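As a minimal sketch of how the two separators nest, the following Python snippet splits one report line first on ; and then splits every field on the internal separator. The sample line is made up and shortened to three fields for illustration, and the helper name split_fields is hypothetical; it is not part of sm-analysis.

```python
def split_fields(line, sep=";", inner_sep=","):
    """Split a report line into columns, and each column into its elements."""
    return [column.split(inner_sep) for column in line.rstrip("\n").split(sep)]


# Made-up, shortened line: a molecule id, three GATC positions and three states.
sample = "23480;315,699,902;f,-,0"
print(split_fields(sample))
# → [['23480'], ['315', '699', '902'], ['f', '-', '0']]
```

Single-valued columns simply come back as one-element lists, so the caller can treat every column uniformly.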
The following table summarizes the meaning of each column.
col num | field name | possible values | description | example |
---|---|---|---|---|
1 | molecule id | positive int | value provided by the sequencer | 23480 |
2 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
3 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
4 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
5 | len(molecule) | positive int | length of the DNA sequence corresponding to the molecule, according to the aligned CCS file | 1198 |
6 | count(subreads+) | int >= 0 | number of subreads in the + strand found in the input BAM | 51 |
7 | count(subreads-) | int >= 0 | number of subreads in the - strand found in the input BAM | 48 |
8 | combined QUAL | positive float | combined QUAL (ASCII of base quality plus 33); each QUAL is the Phred-transformed probability value that the base is wrong | 95.2 |
9 | mean(QUAL) | positive float | mean QUAL (ASCII of base quality plus 33); each QUAL is the Phred-transformed probability value that the base is wrong | 101.4 |
10 | sim ratio | float between 0 and 1 | ratio of similarity between the molecule’s sequence and the corresponding piece in the reference | 1.0 |
11 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
12 | positions of GATCs | comma separated sequence of int>0 | 1-based absolute positions of the Gs for all the GATCs present in col 2 (i.e. in the current molecule) | 315,699,902 |
13 | count(methylation states) | positive int | in how many positions the molecule was detected to have a methylation ( | 2 |
14 | methylation states | comma separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the | f,-,0 |
15 | combined score | positive float | combined score of the feature for each detection (each score is the Phred-transformed probability value that a kinetic deviation exists at a position) | 118 |
16 | mean(score) | positive float | mean score of the feature over all detections in the molecule (each score is the Phred-transformed probability value that a kinetic deviation exists at a position) | 150 |
17 | min(IPDRatio) | positive float | min of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 3.4 |
18 | mean(IPDRatio) | positive float | mean of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 5.2 |
19 | combined idQV | positive float | combined | 19.6 |
20 | mean(idQV) | positive float | mean | 30 |
21 | mean(coverage) | positive float | mean value of the coverage levels used to assign the modification type label | 42 |
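Several columns in the table (QUAL, score, idQV) are described as Phred-transformed values. As a reminder of what that transformation means, here is a minimal sketch of the standard Phred relation, Q = -10·log10(p), and of the ASCII encoding with offset 33 mentioned for QUAL. How sm-analysis combines or averages these values is not specified in this document, and the function names below are illustrative only; they are not part of the package.

```python
import math


def phred_to_probability(q):
    """Return the probability p encoded by a Phred value q, i.e. p = 10^(-q/10)."""
    return 10 ** (-q / 10)


def probability_to_phred(p):
    """Return the Phred value q = -10 * log10(p) for a probability p > 0."""
    return -10 * math.log10(p)


def decode_qual_char(char, offset=33):
    """Decode one QUAL character (ASCII code minus the offset) into a base quality."""
    return ord(char) - offset


# A base quality of 30 corresponds to a 1 in 1000 chance that the base is wrong.
print(phred_to_probability(30))
# The character "I" has ASCII code 73, which encodes the base quality 40.
print(decode_qual_char("I"))
```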
Some notes:
the number of elements in columns 12 and 14 must be equal to the value in column 11
idQV is the Phred-transformed QV of having a modification at a given position
The meaning of the methylation state symbols:
0: not methylated
-: hemi-methylated, negative strand
+: hemi-methylated, positive strand
f: fully methylated
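The consistency rule and the state symbols above can be checked mechanically. The following is a minimal sketch, assuming the report is read with Python's standard csv module using ; as delimiter. The function names and the sample data are made up for illustration; the sample is restricted to the three columns the check needs, whereas a real report has all 21 columns.

```python
import csv
import io

# Meaning of the methylation state symbols, as listed above.
STATE_MEANING = {
    "0": "not methylated",
    "-": "hemi-methylated, negative strand",
    "+": "hemi-methylated, positive strand",
    "f": "fully methylated",
}


def split_inner(value):
    """Split a multi-valued field on the internal separator (comma)."""
    return value.split(",") if value else []


def check_row(row):
    """Return a list of human-readable problems found in one report row."""
    problems = []
    expected = int(row["count(GATC)"])
    positions = split_inner(row["positions of GATCs"])
    states = split_inner(row["methylation states"])
    if len(positions) != expected:
        problems.append(f"expected {expected} GATC positions, found {len(positions)}")
    if len(states) != expected:
        problems.append(f"expected {expected} methylation states, found {len(states)}")
    for state in states:
        if state not in STATE_MEANING:
            problems.append(f"unknown methylation state symbol: {state!r}")
    return problems


# Made-up report fragment: the first row is consistent, the second one is not.
report = io.StringIO(
    "count(GATC);positions of GATCs;methylation states\n"
    "3;315,699,902;f,-,0\n"
    "2;315,699,902;f,-\n"
)
for row in csv.DictReader(report, delimiter=";"):
    print(check_row(row))
```

Because csv.DictReader keys each row by the header names, the same check works unchanged on a full 21-column report.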
Format (version 2)
Warning
Please ignore the content of this section if you are working with a public release of PacBio Data Processing (one installed with pip, for instance). It is kept here for reference.
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 7 columns with the following header:
molecule id;count(GATC);sequence;start of molecule;end of molecule;positions of GATCs;methylation states
The following table summarizes the meaning of each column.
col num | field name | possible values | description | example |
---|---|---|---|---|
1 | molecule id | positive int | value provided by the sequencer | 23480 |
2 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
4 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
5 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
6 | positions of GATCs | comma separated sequence of int>0 | 1-based absolute positions of the Gs for all the GATCs present in col 3 (i.e. in the current molecule) | 315,699,1002 |
7 | methylation states | comma separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the | f,-,0 |
Some notes:
the number of elements in columns 6 and 7 must be equal to the value in column 2
The meaning of the methylation state symbols:
0: not methylated
-: hemi-methylated, negative strand
+: hemi-methylated, positive strand
f: fully methylated
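For readers who still have version 2 files, a line can be turned into a typed record along the following lines. This is only a sketch under the column layout described in this section: the class and function names are hypothetical, and the sample line combines the example values from the table with a shortened placeholder sequence.

```python
from dataclasses import dataclass


@dataclass
class MoleculeV2:
    molecule_id: int
    gatc_count: int
    sequence: str
    start: int  # 1-based position within the reference
    end: int  # 1-based position within the reference
    gatc_positions: list  # 1-based absolute positions of the Gs
    methylation_states: list  # one of "0", "-", "+", "f" per GATC


def parse_v2_line(line):
    """Parse one data line of a version 2 report into a MoleculeV2 record."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, found {len(fields)}")
    mol_id, count, sequence, start, end, positions, states = fields
    record = MoleculeV2(
        molecule_id=int(mol_id),
        gatc_count=int(count),
        sequence=sequence,
        start=int(start),
        end=int(end),
        gatc_positions=[int(p) for p in positions.split(",")],
        methylation_states=states.split(","),
    )
    # Columns 6 and 7 must have as many elements as column 2 states.
    if not (len(record.gatc_positions) == len(record.methylation_states) == record.gatc_count):
        raise ValueError("columns 6 and 7 are inconsistent with column 2")
    return record


# Placeholder sequence; a real one would be as long as the molecule.
record = parse_v2_line("23480;3;AGACTTTC;312;1509;315,699,1002;f,-,0")
print(record.gatc_positions, record.methylation_states)
```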
Format (version 1)
Warning
Please ignore the content of this section if you are working with a public release of PacBio Data Processing (one installed with pip, for instance). It is kept here for reference.
Note
This version, v1, is an old format that is no longer used. In a work meeting (with DP, DV and TW) on 18 June 20201 it was decided to replace it with version 2 (described above).
The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 6 columns with the following header:
molecule id;count(GATC);sequence;start-end of molecule;positions of GATCs;methylation states
The following table summarizes the meaning of each column.
col num | field name | possible values | description |
---|---|---|---|
1 | molecule id | positive int | value provided by the sequencer |
2 | count(GATC) | positive int | number of GATCs found in the DNA sequence |
3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS |
4 | start-end of molecule | [int>=0,int>0] | inclusive interval corresponding to the start and end of the molecule within the reference; the values are taken from the aligner but shifted such that the minimum position is 0 (i.e. 0-based indexing is used) |
5 | positions of GATCs | space separated sequence of int>0 | 0-based positions of the A in all the GATCs present in col 3, relative to that sequence |
6 | methylation states | space separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence as returned by the |
Some notes:
the number of elements in columns 5 and 6 must be equal to the value in column 2
The meaning of the methylation state symbols:
0: not methylated
-: hemi-methylated, negative strand
+: hemi-methylated, positive strand
f: fully methylated
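Version 1 and version 2 differ in their coordinate conventions: v1 stores 0-based positions of the A of each GATC relative to the molecule's sequence, whereas v2 stores 1-based absolute positions of the G. The sketch below shows the arithmetic that relates the two. It assumes that "absolute" means relative to the reference and that the sequence maps onto the reference without gaps, so that offsets can simply be added. The function names are hypothetical, and the textual layout of the v1 interval in column 4 is not parsed here; the 0-based start is passed in as an integer.

```python
def v1_interval_to_v2(start0, end0):
    """Shift a 0-based inclusive interval (v1) to 1-based positions (v2)."""
    return start0 + 1, end0 + 1


def v1_positions_to_v2(start0, a_positions):
    """Convert v1 GATC positions (space separated) to the v2 convention.

    For a relative 0-based position p of the A:
      p - 1            is the relative 0-based position of the G,
      start0 + p - 1   is its absolute 0-based position,
      start0 + p       is its absolute 1-based position.
    """
    absolute_g = [start0 + int(p) for p in a_positions.split()]
    return ",".join(str(p) for p in absolute_g)


# A molecule starting at 0-based position 311 with As at offsets 4, 388 and 691.
print(v1_interval_to_v2(311, 1508))
# → (312, 1509)
print(v1_positions_to_v2(311, "4 388 691"))
# → 315,699,1002
```

The two corrections (minus 1 to move from the A to the G, plus 1 to move from 0-based to 1-based indexing) cancel out, which is why only the start offset is added.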