Quality Control of Mod Data

Before diving into downstream analysis, it's important to assess whether your modification calls are trustworthy. This tutorial covers how to check modification call quality and apply filters to improve data reliability.

Prerequisites

You will need:

A BAM file with modification tags (MM and ML tags)
Nanalogue installed
jq for JSON processing (optional but recommended)

Note: Most filtering options in nanalogue (such as --mapq-filter, --min-align-len, --mod-prob-filter) are shared across subcommands. The examples below show specific commands, but you can apply these filters to any nanalogue subcommand that accepts them.

Understanding Modification Probabilities

Modification callers like Dorado output a probability for each potential modification site. These probabilities are stored in the ML tag as values from 0-255, which represent 0-100% probability of modification.

A high-quality dataset typically shows a bimodal distribution:

Many values near 0 (confidently unmodified)
Many values near 255 (confidently modified)
Few values in the middle (uncertain calls)

A dataset with many values around 128 (50%) indicates the caller is uncertain, which may signal poor signal quality or an inappropriate basecalling model.

Extracting Raw Modification Probabilities

To check your probability distribution, extract the raw values:

nanalogue read-info --detailed input.bam | jq '.[].mod_table[].data[][2]' | shuf | head -20

These values (0-255) can be plotted as a histogram. In Python:

import subprocess
import matplotlib.pyplot as plt

# Extract probabilities
result = subprocess.run(
    ["bash", "-c", "nanalogue read-info --detailed input.bam | jq '.[].mod_table[].data[][2]'"],
    capture_output=True, text=True
)
probs = [int(x) for x in result.stdout.strip().split('\n') if x]

# Plot histogram
plt.hist(probs, bins=50, edgecolor='black')
plt.xlabel('Modification probability (0-255)')
plt.ylabel('Count')
plt.title('Modification call distribution')
plt.savefig('mod_distribution.png')

Filtering Low-Confidence Calls

If you have many uncertain calls (probabilities near 128), you can filter them out using --mod-prob-filter:

nanalogue window-dens --win 10 --step 5 --mod-prob-filter 0.3,0.7 input.bam

This excludes modification calls where the probability falls between 0.3 and 0.7 (77-179 in 0-255 scale), keeping only confident calls.

When to use this:

When you see a large peak around 128 in your probability histogram
When you want to be conservative about modification calls
For samples with lower signal quality

When NOT to use this:

When your data already shows a clean bimodal distribution
When you're studying intermediate modification states

Base Quality Filtering

Poor basecalling quality can affect modification calls. Use --base-qual-filter-mod to exclude positions with low basecall confidence:

nanalogue window-dens --win 10 --step 5 --base-qual-filter-mod 20 input.bam

This excludes modification calls at positions where the basecall quality is below 20 (Phred scale).

Read-Level Quality Filters

Mapping Quality

Poorly mapped reads may have incorrect modification positions. Filter by MAPQ:

nanalogue read-stats --mapq-filter 20 input.bam

Alignment Length

Short alignments may not provide enough context for reliable modification calls:

nanalogue read-stats --min-align-len 500 input.bam

Quick Subsampling

For large files, quickly check a subset using -s:

nanalogue read-stats -s 0.1 input.bam

This analyzes approximately 10% of reads, useful for rapid QC checks.

Expected vs Observed: Sanity Checks

If you have positive or negative controls, verify your data matches expectations.

Check Overall Modification Levels

Use window-dens on a region you expect to be highly modified:

nanalogue window-dens --win 20 --step 10 --region chr1:100-200 input.bam

If your positive control region shows low modification, investigate:

Was the correct basecaller model used?
Is the sample truly modified?
Are there alignment issues in this region?

Compare Modification Counts

Use read-table-show-mods to see per-read modification statistics:

nanalogue read-table-show-mods --tag m input.bam | head -10

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count
0.d6265c11-75ad-4f81-81bc-40eefcd5b41e	387	387	secondary_reverse	m:60
0.13dbd875-0589-4fb7-bf01-2b77493bf153	532	532	primary_reverse	m:87
0.fbbdfbe3-81ee-4eea-b4ea-2cd86b52a928	455	455	supplementary_reverse	m:75
...

QC Checklist

Before proceeding with analysis, verify:

Probability distribution is bimodal — Use read-info --detailed + histogram
Modification types detected — Use peek to confirm expected mod codes
Sufficient read depth — Use read-stats to check alignment counts
Reasonable modification levels — Compare to expected values for your sample
No systematic issues — Check if problems are genome-wide or region-specific

Next Steps

Once your data passes QC:

Find highly modified reads — Filter reads by modification level
Explore modification gradients — Detect directional patterns
Analyze specific regions — Focus on genes or features

Nanalogue cookbook