Spotting Variants in Sequence Data

When inspecting aligned sequences, you may notice positions where some reads differ from others. These could be sequencing errors (random, scattered) or real variants like SNPs (consistent, appearing in a subset of reads). This tutorial shows how to use nanalogue's sequence visualization to tell these patterns apart by eye. We'll first see what random errors look like, then contrast them with a cleaner variant pattern where only some reads carry the difference.

Key Concept

When viewing many reads across a short region (10-20bp) with --full-region, all displayed sequences have the same length, so a given position falls in the same column in every row. Random errors scatter across positions. A real variant appears as a vertical column where a subset of reads consistently shows a different base.
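
The column idea can be made concrete with a small Python sketch. It is not part of nanalogue; the function names and the toy reads are hypothetical, and it assumes you already have equal-length sequences for one window. For each column it reports the fraction of reads that do not carry the most common base.

```python
from collections import Counter

def column_counts(sequences):
    """Return one Counter of bases per column for equal-length sequences."""
    lengths = {len(s) for s in sequences}
    if len(lengths) != 1:
        raise ValueError("all sequences must have the same length")
    return [Counter(col) for col in zip(*sequences)]

def minor_fraction(counter):
    """Fraction of reads in a column that do not carry the most common base."""
    total = sum(counter.values())
    top = counter.most_common(1)[0][1]
    return (total - top) / total

# Hypothetical toy reads: half of them carry G instead of A in column 2.
reads = ["ACATG", "ACATG", "ACGTG", "ACGTG", "ACATG", "ACGTG"]
fractions = [minor_fraction(c) for c in column_counts(reads)]
print(fractions)  # [0.0, 0.0, 0.5, 0.0, 0.0]
```

A column whose fraction stands well above its neighbours is the numeric counterpart of the vertical stripe you would look for by eye.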

Prerequisites

You will need:

A BAM file with modification tags (MM and ML)
Nanalogue installed
For this tutorial: test data with simulated mismatches (configurations provided below)

Please read the Extracting sequences tutorial first, as this tutorial builds on those concepts.

Note: The contig names contig_00001, etc. are example names used throughout this guide. In real BAM files aligned to a reference genome, you will see names like chr1, chr2, NC_000001.11, or similar depending on your reference.

What Random Errors Look Like

First, let's see what data looks like when all reads have random mismatches scattered throughout. This simulates sequencing errors or very noisy data.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region error_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
0.337a590a-04bb-443e-8ea1-0a3dd9749112	200	200	supplementary_forward	m:31	GGTACAATCCGATGCAACTA
0.930fcb84-5c0a-4f59-86b6-e0069f13ceb7	200	200	primary_forward	m:33	CTATTCTCTCTCTCTTAATT
0.62dcc8b7-d6a1-4853-b40c-2a3c45a6f116	200	200	secondary_forward	m:33	GTTACGACCCGATGGTTGTT
0.f00e2a1e-7cf3-426d-a559-740fd8661daf	200	200	supplementary_reverse	m:37	TTTAGTTTATGAAGGACGTT
0.2ba9f7c3-13e9-4723-9854-0c5a80538a52	200	200	primary_forward	m:29	AGGAGGACGTTAAACTTCTA
0.94f3d1c7-33de-45ca-bb5b-154bd7ddc9d0	200	200	primary_forward	m:30	ATTCGCTTATAAAACACCCG
0.d055550c-6bf2-472d-9e0c-0297df5391f9	200	200	supplementary_forward	m:29	CTTTTCGTATCATGAGACCC
0.9ed07340-11b4-4b0e-9055-66ea0e4da96a	200	200	secondary_forward	m:34	GGTACTAGACCAATCGGCTA
0.12a294b9-222d-47c6-9b68-35e31e60a219	200	200	secondary_reverse	m:33	GAAGGTACCGAAGGAAACAC

Notice how the mismatches are distributed randomly across positions. No single column shows a consistent pattern. If you were looking for a real variant, you wouldn't find a clear signal here - just noise.
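
If you want to scan more reads than fit on a screen, the table can be loaded into a script. The following is a minimal Python sketch, assuming the output is tab-separated with a leading "#" comment line and a header row, as it appears in the example above. The function name read_table is ours, not a nanalogue feature, and the two rows are copied from the example output.

```python
import csv
import io

def read_table(text):
    """Parse tab-separated output into a list of dicts, skipping '#' lines."""
    lines = [ln for ln in text.splitlines()
             if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(io.StringIO("\n".join(lines)), delimiter="\t"))

# Two rows copied from the example output above.
example = (
    "# mod-unmod threshold is 0.5\n"
    "read_id\talign_length\tsequence_length_template\t"
    "alignment_type\tmod_count\tsequence\n"
    "0.337a590a-04bb-443e-8ea1-0a3dd9749112\t200\t200\t"
    "supplementary_forward\tm:31\tGGTACAATCCGATGCAACTA\n"
    "0.930fcb84-5c0a-4f59-86b6-e0069f13ceb7\t200\t200\t"
    "primary_forward\tm:33\tCTATTCTCTCTCTCTTAATT\n"
)
rows = read_table(example)
sequences = [row["sequence"] for row in rows]
print(sequences)
```

In practice you would save the command's output to a file with shell redirection and pass the file contents to read_table.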

Spotting a Consistent Variant

Now let's look at data where only some reads have mismatches - simulating a heterozygous-like variant. This test data has two groups of reads: one clean, one with mismatches.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
0.fa2fb3e5-a2bb-44a5-82fc-0bacc07d7c57	200	200	supplementary_reverse	m:35	CTTGTCACTTGGGGGCAAGG
1.70634869-5b0c-4745-b0b7-a877144a186b	200	200	primary_forward	m:30	ACCGTCACGTTATGGCGGCC
0.587ff0f4-46c8-49d5-9b8a-e09ff3aeb943	200	200	supplementary_reverse	m:35	CTTGTCACTTGGGGGCAAGG
0.f4c791d5-f223-4923-abd8-f866dcf76f43	200	200	secondary_forward	m:35	CTTGTCACTTGGGGGCAAGG
0.853b6de5-e4b4-4fc4-a1e2-810492c76fed	200	200	primary_forward	m:35	CTTGTCACTTGGGGGCAAGG
1.67afe3c4-624f-4263-b5d5-f65b71cbba4d	200	200	secondary_reverse	m:40	GTTGTGACCAAGGCGTAAGC
1.bacb8ed5-cdbb-4d61-949a-29772ee8979b	200	200	secondary_reverse	m:34	GTTAGGCTACTGGCGCCGGG
1.aa87f8c2-d4eb-45fe-9aa0-809cb534e181	200	200	secondary_forward	m:30	CTTGTATTGGGGAGCCAGGG
1.4616177c-4ab4-4292-acc7-24c634de27d0	200	200	secondary_reverse	m:35	GTTTACAGGCTGGAGTGAGA
1.c779961e-0d5b-448b-8aa1-692b3486aa78	200	200	secondary_forward	m:36	TAGGTAAATTCCATGCTATG
0.2b5c2290-7587-4492-bd89-9149790ab236	200	200	primary_forward	m:35	CTTGTCACTTGGGGGCAAGG
0.ebe40ffa-d883-4e57-b5b0-fa728d481e8b	200	200	supplementary_reverse	m:35	CTTGTCACTTGGGGGCAAGG
1.2603e53e-4a5d-4237-8beb-cd6b6a99e683	200	200	supplementary_forward	m:35	CGTGTTAAGTAAGGGGACAG
1.5a7d129e-9ecd-4a4b-9024-aadfe45db632	200	200	supplementary_reverse	m:32	ATGGGTATTGACGTGTCAGG
1.0355700d-abe6-446e-a9dc-ebad37f0b0bd	200	200	supplementary_reverse	m:30	CTAGCCCGTTATGATCAGGC
1.aca9bf96-0175-42c3-a4d9-7b07aee69561	200	200	secondary_forward	m:30	CTAATCGTTATAGTGCACGG
0.3b8f2657-64c0-40dc-83d5-0097223f24dd	200	200	primary_reverse	m:35	CTTGTCACTTGGGGGCAAGG
0.fe9b97f8-412b-497f-b24a-478bdeac3fb2	200	200	primary_reverse	m:35	CTTGTCACTTGGGGGCAAGG
0.5b91b2dc-c6ba-439e-862b-f4dc85d29f2b	200	200	supplementary_forward	m:35	CTTGTCACTTGGGGGCAAGG
1.96e2c65c-9b28-4431-861e-270e3f349145	200	200	secondary_forward	m:32	CGGTAGTCTCGGAGTAAGGA

Compare the read IDs to the sequences. The reads with IDs starting with "1." carry mismatches while those starting with "0." are clean.

Important: This ID-based grouping exists only because we created the test data this way. In real datasets, read IDs have no relationship to sequence features. However, the visual pattern remains the same: a subset of reads consistently differing at certain positions. That's the visual signature of a potential variant.

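Unlike the random noise in the previous section, here we see structure: the "0." reads are identical across the window, while the "1." reads differ from them. In this simulated data the mismatches within the "1." group still vary from read to read, so it only approximates a real heterozygous variant, where a subset of reads would show the same alternate base in the same column.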
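
The two groups can also be separated without looking at the read IDs. The sketch below is illustrative Python, not a nanalogue feature. It builds a majority-vote consensus per column and counts how many positions each read differs from it. The five reads are copied from the example output above; the helper names are hypothetical, and the approach assumes the clean reads form the majority in every column.

```python
from collections import Counter

def consensus(sequences):
    """Most common base in each column of equal-length sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sequences))

def mismatches(seq, ref):
    """Number of positions where seq differs from ref."""
    return sum(a != b for a, b in zip(seq, ref))

# Five reads copied from the example output above.
reads = {
    "0.fa2fb3e5-a2bb-44a5-82fc-0bacc07d7c57": "CTTGTCACTTGGGGGCAAGG",
    "1.70634869-5b0c-4745-b0b7-a877144a186b": "ACCGTCACGTTATGGCGGCC",
    "0.587ff0f4-46c8-49d5-9b8a-e09ff3aeb943": "CTTGTCACTTGGGGGCAAGG",
    "0.f4c791d5-f223-4923-abd8-f866dcf76f43": "CTTGTCACTTGGGGGCAAGG",
    "1.67afe3c4-624f-4263-b5d5-f65b71cbba4d": "GTTGTGACCAAGGCGTAAGC",
}
ref = consensus(list(reads.values()))
per_read = {rid: mismatches(seq, ref) for rid, seq in reads.items()}
for rid, count in per_read.items():
    print(rid, count)
```

Reads with zero mismatches form the clean group; the rest are the subset worth a closer look.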

Combining Variant and Modification Views

Add --show-mod-z to see modifications marked alongside the base differences:

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region --show-mod-z variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
1.5a7d129e-9ecd-4a4b-9024-aadfe45db632	200	200	supplementary_reverse	m:32	ATGZZTATTZACZTZTCAGG
0.5b91b2dc-c6ba-439e-862b-f4dc85d29f2b	200	200	supplementary_forward	m:35	ZTTGTCACTTGGGGGCAAGG
1.70634869-5b0c-4745-b0b7-a877144a186b	200	200	primary_forward	m:30	AZZGTCACGTTATGGCGGZZ
0.f4c791d5-f223-4923-abd8-f866dcf76f43	200	200	secondary_forward	m:35	ZTTGTCACTTGGGGGCAAGG
1.aa87f8c2-d4eb-45fe-9aa0-809cb534e181	200	200	secondary_forward	m:30	ZTTGTATTGGGGAGZZAGGG
1.2603e53e-4a5d-4237-8beb-cd6b6a99e683	200	200	supplementary_forward	m:35	ZGTGTTAAGTAAGGGGAZAG
0.fe9b97f8-412b-497f-b24a-478bdeac3fb2	200	200	primary_reverse	m:35	CTTGTCACTTGZZZZCAAZG
1.96e2c65c-9b28-4431-861e-270e3f349145	200	200	secondary_forward	m:32	ZGGTAGTZTZGGAGTAAGGA
1.4616177c-4ab4-4292-acc7-24c634de27d0	200	200	secondary_reverse	m:35	ZTTTACAZGCTGGAZTZAZA
1.bacb8ed5-cdbb-4d61-949a-29772ee8979b	200	200	secondary_reverse	m:34	GTTAZZCTACTZZCZCCGGG
0.587ff0f4-46c8-49d5-9b8a-e09ff3aeb943	200	200	supplementary_reverse	m:35	CTTGTCACTTGZZZZCAAZG
0.3b8f2657-64c0-40dc-83d5-0097223f24dd	200	200	primary_reverse	m:35	CTTGTCACTTGZZZZCAAZG
1.0355700d-abe6-446e-a9dc-ebad37f0b0bd	200	200	supplementary_reverse	m:30	CTAZCCCZTTATZATCAGGC
0.ebe40ffa-d883-4e57-b5b0-fa728d481e8b	200	200	supplementary_reverse	m:35	CTTGTCACTTGZZZZCAAZG
0.853b6de5-e4b4-4fc4-a1e2-810492c76fed	200	200	primary_forward	m:35	ZTTGTCACTTGGGGGCAAGG
0.2b5c2290-7587-4492-bd89-9149790ab236	200	200	primary_forward	m:35	ZTTGTCACTTGGGGGCAAGG
1.c779961e-0d5b-448b-8aa1-692b3486aa78	200	200	secondary_forward	m:36	TAGGTAAATTZZATGCTATG
1.67afe3c4-624f-4263-b5d5-f65b71cbba4d	200	200	secondary_reverse	m:40	GTTGTZACCAAZZCZTAAZC
0.fa2fb3e5-a2bb-44a5-82fc-0bacc07d7c57	200	200	supplementary_reverse	m:35	CTTGTCACTTGZZZZCAAZG
1.aca9bf96-0175-42c3-a4d9-7b07aee69561	200	200	secondary_forward	m:30	ZTAATZGTTATAGTGZAZGG

This combined view lets you inspect whether variants occur near modified bases. In some biological contexts, SNPs can affect modification patterns - or a variant at a modified position might affect how the modification is called. Visual inspection gives you a quick sanity check before deeper analysis.

Note: Modification calls at variant positions may be less reliable, since the basecaller's modification model may assume the reference base.
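
To see which bases sit under the Z marks, you can line up the marked and unmarked sequence of the same read. The Python sketch below assumes, as the output above suggests, that Z replaces a base called as modified. The helper names are hypothetical, and the reported positions are 0-based offsets within the displayed window, not genomic coordinates.

```python
def modified_positions(marked):
    """0-based window offsets where the sequence shows a Z."""
    return [i for i, base in enumerate(marked) if base == "Z"]

def bases_under_marks(marked, plain):
    """Map each Z offset to the base shown at that offset without --show-mod-z."""
    if len(marked) != len(plain):
        raise ValueError("sequences must have the same length")
    return {i: plain[i] for i in modified_positions(marked)}

# Read 0.fe9b97f8-412b-497f-b24a-478bdeac3fb2 from the two example outputs.
plain = "CTTGTCACTTGGGGGCAAGG"   # without --show-mod-z
marked = "CTTGTCACTTGZZZZCAAZG"  # with --show-mod-z
print(bases_under_marks(marked, plain))
# {11: 'G', 12: 'G', 13: 'G', 14: 'G', 18: 'G'}
```

Comparing these offsets with the columns where reads disagree shows quickly whether any modification call coincides with a candidate variant.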

Interpreting What You See

Patterns to look for:

Consistent column differences in a subset of reads → potential variant worth investigating
Scattered differences across positions → likely sequencing errors or very noisy data
Single read with many differences → possible alignment issue or sample contamination

Limitations:

Visual inspection works for quick exploration, not rigorous variant calling
High coverage helps - with few reads, random errors can look like variants
Short regions (10-20bp) work best for this approach; longer regions become hard to scan visually
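
A rough calculation illustrates the coverage point. The Python sketch below assumes each read independently shows the same wrong base at a given position with probability p. The value p = 0.02 is an illustrative assumption, not a measured error rate, and real errors are not fully independent. It computes the binomial probability that at least half of the reads agree on that wrong base purely by chance.

```python
from math import comb

def prob_at_least(k, n, p):
    """Binomial tail: probability that at least k of n reads show the error."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

p = 0.02  # assumed per-read chance of the same wrong base at one position
low_cov = prob_at_least(2, 4, p)     # half of 4 reads agree by chance
high_cov = prob_at_least(20, 40, p)  # half of 40 reads agree by chance
print(f"4 reads:  {low_cov:.2e}")
print(f"40 reads: {high_cov:.2e}")
```

Under these assumptions a chance half-and-half split is far less likely at 40 reads than at 4, which is why a column pattern seen at higher coverage deserves more trust.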

What to do next:

For rigorous SNP detection and genotyping, use specialized variant calling tools. Nanalogue's strength is quick visual inspection - useful for QC, sanity checks, or exploring specific regions of interest.

Creating Test Data

To create your own test BAM files for this tutorial:

Test data with random errors — All reads have scattered mismatches
Test data with variants — Two read groups simulating heterozygous-like pattern

Next Steps

Extracting sequences — More sequence display options
Quality control of mod data — Assess modification call quality
Finding highly modified reads — Filter reads by modification level
