Finding Highly Modified Reads

A common task in single-molecule genomics with DNA/RNA modifications is to identify reads that are "highly modified" — for example, reads where a large fraction of CpG sites are methylated. This can be useful for:

Identifying molecules from hypermethylated regions (e.g., imprinted loci, repeat elements)
Quality control: checking if your sample has the expected modification levels
Filtering reads for downstream analysis based on modification status
Studying heterogeneity in modification patterns across single molecules

Nanalogue provides the find-modified-reads command for exactly this purpose.

Prerequisites

You will need:

A BAM file with modification tags (MM and ML tags), typically produced by a basecaller like Dorado
Nanalogue installed

The `find-modified-reads` Command

The find-modified-reads command outputs a list of read IDs that satisfy a specified criterion. To see available subcommands and options:

nanalogue find-modified-reads --help

Finding Reads by Modification Density

The most common use case is finding reads where the modification density (the fraction of modified bases) exceeds a threshold. Use the any-dens-above subcommand:

nanalogue find-modified-reads any-dens-above \
    --win 10 \
    --step 5 \
    --tag m \
    --high 0.8 \
    input.bam

This outputs read IDs where at least one window has a modification density at or above 0.8 (80%). The required parameters are:

--win: Window size in number of modified bases (e.g., 10 Cs per window)
--step: How many bases to slide the window by
--tag: The modification code to look for (e.g., m for 5mC)
--high: The threshold value

Example output:

a4f36092-b4d5-47a9-813e-c22c3b477a0f
5d10eb9a-aae1-4db8-8ec6-7ebb34d32576
fffffff1-10d2-49cb-8ca3-e8d48979001a

Understanding Windowed Analysis

Nanalogue computes modification density over sliding windows along each read. This windowed approach is important because:

A read might be partially modified (e.g., one end is methylated, the other is not)
Modification patterns often have biological meaning at specific scales
It allows you to detect local hotspots of modification

You can control the window size and step with additional parameters. Run nanalogue find-modified-reads any-dens-above --help for all available options.

Practical Example: Finding Hypermethylated Reads

Suppose you have nanopore sequencing data from a human sample and want to find reads that are highly methylated at CpG sites. A typical workflow might look like:

# Step 1: Find read IDs with >=70% methylation in any window
nanalogue find-modified-reads any-dens-above \
    --win 10 \
    --step 5 \
    --tag m \
    --high 0.7 \
    aligned_reads.bam > hypermethylated_reads.txt

# Step 2: Count how many reads were found
wc -l hypermethylated_reads.txt

# Step 3: Extract these reads using samtools for further analysis
samtools view -h -b -N hypermethylated_reads.txt -o hypermethylated.bam aligned_reads.bam
samtools index hypermethylated.bam

Exploring Modification Patterns with Windowed Densities

If you want to see the actual modification densities (not just filter reads), use the window-dens command:

nanalogue window-dens \
    --win 10 \
    --step 5 \
    input.bam > densities.tsv

Example output:

#contig	ref_win_start	ref_win_end	read_id	win_val	strand	base	mod_strand	mod_type	win_start	win_end	basecall_qual
contig_00000	6	35	0.a9ff3bb7-8611-4b4f-a7cb-8becf44d3183	0.7	+	C	+	m	0	29	27
contig_00000	26	51	0.a9ff3bb7-8611-4b4f-a7cb-8becf44d3183	0.5	+	C	+	m	20	45	27
contig_00000	38	66	0.a9ff3bb7-8611-4b4f-a7cb-8becf44d3183	0.7	+	C	+	m	32	60	30
contig_00000	55	98	0.a9ff3bb7-8611-4b4f-a7cb-8becf44d3183	0.6	+	C	+	m	49	92	26
...

This output can be loaded into Python/R for custom analysis and visualization.

Tips

Choosing a threshold: The appropriate threshold depends on your biological question. For CpG methylation in mammals, "highly methylated" often means >80% methylation. For other modifications or organisms, you may need to adjust.
Window size matters: Smaller windows capture local patterns but are noisier. Larger windows give more stable estimates but may miss focal modifications.

Combine with region filtering: Use the --region parameter to filter reads to a genomic region:

nanalogue find-modified-reads any-dens-above \
    --win 10 --step 5 --tag m --high 0.8 \
    --region chr1:100-200 \
    input.bam

Performance: Nanalogue is written in Rust and designed to handle large BAM files efficiently. For very large files, you can process in parallel by splitting by chromosome.

Next Steps

Explore modification gradients with window-grad
Create synthetic test data with pynanalogue.simulate_mod_bam() for benchmarking
Visualize your results with tools like IGV or custom matplotlib plots

Nanalogue cookbook