Methylation Reports

abstract:

The single molecule analysis pipeline, run with the sm-analysis program, produces a so-called methylation report. Its contents are described in this document.

Format

Note

This section describes the most recent version of the methylation report’s format.

The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 21 columns with the following header:

molecule id;sequence;start of molecule;end of molecule;len(molecule);count(subreads+);count(subreads-);combined QUAL;mean(QUAL);sim ratio;count(GATC);positions of GATCs;count(methylation states);methylation states;combined score;mean(score);min(IPDRatio);mean(IPDRatio);combined idQV;mean(idQV);mean(coverage)

A column can itself hold several values (see, e.g., columns 12 and 14). In that case the values are joined with an internal separator, namely , (comma).
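The layout described above can be read with Python's standard csv module. The following is only a minimal sketch: the function name read_methylation_report is hypothetical (it is not part of sm-analysis), and the sample report is made up and reduced to four of the 21 columns to keep it short.

```python
import csv
import io

# Columns whose cells hold several values joined by the internal "," separator.
MULTI_VALUED = ("positions of GATCs", "methylation states")


def read_methylation_report(handle):
    """Yield one dict per molecule, keyed by the field names of the header."""
    for row in csv.DictReader(handle, delimiter=";"):
        for name in MULTI_VALUED:
            # An empty cell becomes an empty list instead of [""].
            row[name] = row[name].split(",") if row[name] else []
        yield row


# Made-up report with a reduced header (hypothetical data, not real output).
sample = io.StringIO(
    "molecule id;count(GATC);positions of GATCs;methylation states\n"
    "23480;3;315,699,902;f,-,0\n"
)
rows = list(read_methylation_report(sample))
print(rows[0]["positions of GATCs"])  # ['315', '699', '902']
```

All values are kept as strings here; converting them to int or float is left to the caller, since the appropriate type depends on the column.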

The following table summarizes the meaning of each column.

| col num | field name | possible values | description | example |
|---|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer | 23480 |
| 2 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
| 3 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
| 4 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
| 5 | len(molecule) | positive int | length of the DNA sequence corresponding to the molecule, according to the aligned CCS file | 1198 |
| 6 | count(subreads+) | int >= 0 | number of subreads in the + strand found in the input BAM | 51 |
| 7 | count(subreads-) | int >= 0 | number of subreads in the - strand found in the input BAM | 48 |
| 8 | combined QUAL | positive float | combined QUAL (ASCII of base quality plus 33); each QUAL is the Phred-transformed probability that the base is wrong | 95.2 |
| 9 | mean(QUAL) | positive float | mean QUAL (ASCII of base quality plus 33); each QUAL is the Phred-transformed probability that the base is wrong | 101.4 |
| 10 | sim ratio | float between 0 and 1 | ratio of similarity between the molecule’s sequence and the corresponding piece of the reference | 1.0 |
| 11 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
| 12 | positions of GATCs | comma-separated sequence of int>0 | 1-based absolute positions of the Gs of all the GATCs present in col 2 (i.e., in the current molecule) | 315,699,902 |
| 13 | count(methylation states) | positive int | number of positions at which the molecule was detected to have a methylation (+, - or f in column 14) | 2 |
| 14 | methylation states | comma-separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence, as returned by the ipdSummary program | f,-,0 |
| 15 | combined score | positive float | combined score of the feature for each detection (each score is the Phred-transformed probability that a kinetic deviation exists at a position) | 118 |
| 16 | mean(score) | positive float | mean score of the feature over all detections in the molecule (each score is the Phred-transformed probability that a kinetic deviation exists at a position) | 150 |
| 17 | min(IPDRatio) | positive float | min of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 3.4 |
| 18 | mean(IPDRatio) | positive float | mean of tMean/modelPredictions (tMean is the capped mean of normalized IPDs observed at this position) | 5.2 |
| 19 | combined idQV | positive float | combined idQV value for all the detected modifications of the correct type in the given molecule | 19.6 |
| 20 | mean(idQV) | positive float | mean idQV value for all the detected modifications of the correct type in the given molecule | 30 |
| 21 | mean(coverage) | positive float | mean value of the coverage levels used to assign the modification type label | 42 |

Some notes:

  • the number of elements in columns 12 and 14 must be equal to the value in column 11

  • idQV is the Phred-transformed QV of having a modification at a given position

  • The meaning of the methylation state symbols:

    • 0: not methylated

    • -: hemi-methylated (negative strand)

    • +: hemi-methylated (positive strand)

    • f: fully methylated
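The consistency rules in these notes, together with the description of column 13, can be checked mechanically. The sketch below is an illustration only: check_row is a hypothetical helper, not something provided by sm-analysis, and it assumes the row has already been split and converted to Python values.

```python
# Symbols that count as a detected methylation (column 13 counts these).
METHYLATED = {"+", "-", "f"}
KNOWN_STATES = METHYLATED | {"0"}


def check_row(count_gatc, positions, count_methylated, states):
    """Return the list of inconsistencies found in one row (empty if none).

    count_gatc: column 11, positions: column 12 (list),
    count_methylated: column 13, states: column 14 (list).
    """
    problems = []
    if len(positions) != count_gatc:
        problems.append("column 12 length differs from column 11")
    if len(states) != count_gatc:
        problems.append("column 14 length differs from column 11")
    unknown = sorted(set(states) - KNOWN_STATES)
    if unknown:
        problems.append(f"unknown methylation states: {unknown}")
    if sum(state in METHYLATED for state in states) != count_methylated:
        problems.append("column 13 does not match the states in column 14")
    return problems


# Values taken from the example column of the table above.
print(check_row(3, [315, 699, 902], 2, ["f", "-", "0"]))  # []
```

Returning a list of messages rather than raising on the first problem makes it possible to report every inconsistency of a row at once.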

Format (version 2)

Warning

Please ignore the content of this section if you are working with a public release of PacBio Data Processing (one installed with pip, for instance). It is kept here for reference.

The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 7 columns with the following header:

molecule id;count(GATC);sequence;start of molecule;end of molecule;positions of GATCs;methylation states

The following table summarizes the meaning of each column.

| col num | field name | possible values | description | example |
|---|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer | 23480 |
| 2 | count(GATC) | positive int | number of GATCs found in the DNA sequence | 3 |
| 3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS | AGACTTTC… |
| 4 | start of molecule | positive int | 1-based start position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the first base of the sequence plus 1 (the minimum position is 1) | 312 |
| 5 | end of molecule | positive int | 1-based end position of the molecule within the reference; the value is taken from the aligner and equals the number of bases before the last base of the sequence plus 1 | 1509 |
| 6 | positions of GATCs | comma-separated sequence of int>0 | 1-based absolute positions of the Gs of all the GATCs present in col 3 (i.e., in the current molecule) | 315,699,1002 |
| 7 | methylation states | comma-separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence, as returned by the ipdSummary program | f,-,0 |

Some notes:

  • the number of elements in columns 6 and 7 must be equal to the value in column 2

  • The meaning of the methylation state symbols:

    • 0: not methylated

    • -: hemi-methylated (negative strand)

    • +: hemi-methylated (positive strand)

    • f: fully methylated
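The coordinate convention of columns 4 and 6 can be made concrete with a little arithmetic: since both the start of the molecule and the GATC positions are 1-based positions within the reference, the 0-based offset of a G inside the sequence is its position minus the start. The sketch below rests on two assumptions that this document does not state: the molecule aligns to the reference without insertions or deletions, and it does not wrap around the origin of a circular reference. The helper gatc_offsets and the data are made up for illustration.

```python
def gatc_offsets(start, positions):
    """Convert 1-based reference positions of the Gs into 0-based offsets
    within the molecule's sequence (assumes an alignment without indels)."""
    return [position - start for position in positions]


# Made-up molecule: 13 bases starting at reference position 101.
sequence = "TTGATCAAGATCC"
start = 101
positions = [103, 109]

offsets = gatc_offsets(start, positions)
print(offsets)  # [2, 8]
# Each offset should point at the G of a GATC motif.
print([sequence[offset:offset + 4] for offset in offsets])  # ['GATC', 'GATC']
```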

Format (version 1)

Warning

Please ignore the content of this section if you are working with a public release of PacBio Data Processing (one installed with pip, for instance). It is kept here for reference.

Note

This version, v1, is an old format that is no longer used. Its replacement by version 2 (described above) was decided in a work meeting (with DP, DV and TW) on 18 June 20201.

The methylation report produced by sm-analysis is a CSV file with ; (semicolon) as separator and 6 columns with the following header:

molecule id;count(GATC);sequence;start-end of molecule;positions of GATCs;methylation states

The following table summarizes the meaning of each column.

| col num | field name | possible values | description |
|---|---|---|---|
| 1 | molecule id | positive int | value provided by the sequencer |
| 2 | count(GATC) | positive int | number of GATCs found in the DNA sequence |
| 3 | sequence | [ACGT]* | DNA sequence corresponding to the molecule, as reported by CCS |
| 4 | start-end of molecule | [int>=0,int>0] | inclusive interval corresponding to the start and end of the molecule within the reference; the values are taken from the aligner but shifted so that the minimum position is 0 (i.e., 0-based indexing is used) |
| 5 | positions of GATCs | space-separated sequence of int>0 | 0-based positions of the A of each GATC present in col 3, relative to that sequence |
| 6 | methylation states | space-separated sequence of [0-+f] | each element corresponds to the methylation state of one GATC in the sequence, as returned by the ipdSummary program |

Some notes:

  • the number of elements in columns 5 and 6 must be equal to the value in column 2

  • The meaning of the methylation state symbols:

    • 0: not methylated

    • -: hemi-methylated (negative strand)

    • +: hemi-methylated (positive strand)

    • f: fully methylated
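For readers comparing old v1 reports with version 2, the differences in coordinate conventions can be worked out as follows. In v1 the interval is 0-based and each GATC is located by the offset of its A within the sequence; in version 2 positions are 1-based, absolute, and point at the G. The G sits one base before the A, so its 1-based reference position is (start0 + offset - 1) + 1 = start0 + offset. This is only a sketch under the assumption that both versions refer to the same alignment and that the alignment contains no insertions or deletions; v1_to_v2_coordinates is a hypothetical name and the numbers are made up.

```python
def v1_to_v2_coordinates(start0, end0, a_offsets):
    """Convert v1 coordinates to the version 2 convention.

    start0, end0: 0-based inclusive interval of the molecule (v1 column 4).
    a_offsets: 0-based offsets of the A of each GATC within the sequence
               (v1 column 5).
    Returns (start, end, positions) as 1-based reference positions, each
    position pointing at the G of a GATC (version 2 columns 4, 5 and 6).
    """
    start = start0 + 1
    end = end0 + 1
    # 0-based reference index of the G is start0 + offset - 1; adding 1 to
    # make it 1-based gives start0 + offset.
    positions = [start0 + offset for offset in a_offsets]
    return start, end, positions


# Made-up molecule covering 0-based reference indices 100 to 112.
print(v1_to_v2_coordinates(100, 112, [3, 9]))  # (101, 113, [103, 109])
```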