Spotting Variants in Sequence Data
When inspecting aligned sequences, you may notice positions where some reads differ from others. These could be sequencing errors (random, scattered) or real variants like SNPs (consistent, appearing in a subset of reads). This tutorial shows how to use nanalogue's sequence visualization to spot these patterns visually. We'll first see what random errors look like, then contrast with a cleaner variant pattern where only some reads carry the difference.
Key Concept
When viewing many reads across a short region (10-20bp) with --full-region, all sequences align to the same length.
Random errors scatter across positions.
A real variant appears as a vertical column where a subset of reads consistently shows a different base.
Prerequisites
You will need:
- A BAM file with modification tags (
MMandMLtags) - Nanalogue installed
- For this tutorial: test data with simulated mismatches (configurations provided below)
Please read the Extracting sequences tutorial first, as this tutorial builds on those concepts.
Note: The contig names
contig_00001, etc. are example names used throughout this guide. In real BAM files aligned to a reference genome, you will see names likechr1,chr2,NC_000001.11, or similar depending on your reference.
What Random Errors Look Like
First, let's see what data looks like when all reads have random mismatches scattered throughout. This simulates sequencing errors or very noisy data.
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region error_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
0.5baeece2-5bce-44d4-9a44-14514e238c3f 200 200 primary_forward m:27 GGGTCCTCGGTTTTGGTTCG
0.c1f4f826-d447-4dcd-841d-c473fb7d69b8 200 200 supplementary_reverse m:30 AGATCATCCGCTCCCGTTGC
0.9837fce4-2a05-444d-9e5f-9ace9a0a479b 200 200 supplementary_reverse m:33 GGAGACTCGCTTGGGTTTAA
0.3e27c960-f501-45b6-bb00-5f8dca01d762 200 200 supplementary_reverse m:28 CGCTGCCAGTTCATCCAAAC
0.a417fee6-663d-4d3e-a626-103fe9642ac6 200 200 secondary_reverse m:40 GGGCAGAGGGCCTTAGTTAT
0.6dbb322b-671d-4be9-a5e3-91aaff7de597 200 200 secondary_reverse m:28 AATACCCCGATATTAGGTAC
0.54c1d349-2a46-4335-b277-c0c60a2ca1d5 200 200 primary_forward m:27 CGGGGCACGTTCACACTTTT
Notice how the mismatches are distributed randomly across positions. No single column shows a consistent pattern. If you were looking for a real variant, you wouldn't find a clear signal here - just noise.
Spotting a Consistent Variant
Now let's look at data where only some reads have mismatches - simulating a heterozygous-like variant. This test data has two groups of reads: one clean, one with mismatches.
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region variant_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
1.79cbeb0a-8f0f-4d94-be45-a093648442fc 200 200 secondary_reverse m:34 TCCAATGTGGGGCGCGTCTT
0.f10ef60b-f3be-4533-add7-f5f65886bf9b 200 200 supplementary_reverse m:35 TGCACTGTCGCGCGAGTCGC
1.95198662-f71a-4357-9844-0212b8b67bea 200 200 secondary_reverse m:35 TGGGCCGTCGCGCGAAACAA
1.17aca6a6-6fa1-4c88-ab42-1e93576f827c 200 200 secondary_forward m:33 TGTTCCTTTGTGTGTATTCG
0.ccd9fbcd-c7b8-4ff8-a8d0-cb3ac12959b8 200 200 secondary_forward m:33 TGCACTGTCGCGCGAGTCGC
1.7f29111a-c40e-4cdc-b8e6-09c42eb84945 200 200 primary_forward m:27 TTGAATATGCTCCCAGAGCA
1.912cb0b4-ea05-43bb-b676-4c6ff370139e 200 200 primary_forward m:30 CGTTATGGCTTCCGACGGGC
1.f04db211-a5cc-4305-a416-9c116251feec 200 200 primary_forward m:34 AAAACTTAAGCGAGCTTCAT
0.c0b92f6a-2b69-4cc9-83be-a72de23e34d8 200 200 secondary_reverse m:35 TGCACTGTCGCGCGAGTCGC
0.b24b8b0f-19a5-4f40-bd3e-c7e70d53143a 200 200 supplementary_forward m:33 TGCACTGTCGCGCGAGTCGC
1.cf3407a8-e0f1-4db6-9050-5356e3d66ab2 200 200 supplementary_forward m:34 GAGATTGTCTCACACGGGTG
0.eb6c7249-4a6c-4e41-80ec-1cb4c1b6f15e 200 200 secondary_forward m:33 TGCACTGTCGCGCGAGTCGC
1.69c2b170-232c-46d0-8c95-0897dc42364b 200 200 supplementary_reverse m:31 TGCAGTCTCGCCCCTGCTAG
0.3d31b1dc-3f11-424d-89c8-ac117984a3e5 200 200 primary_reverse m:35 TGCACTGTCGCGCGAGTCGC
0.d1c4bef6-e405-4de1-819d-e4ae97ed865b 200 200 primary_reverse m:35 TGCACTGTCGCGCGAGTCGC
0.dd01e25b-e02c-48b1-9ca2-7907bd752cdc 200 200 secondary_reverse m:35 TGCACTGTCGCGCGAGTCGC
1.96d6b835-b780-4092-bf06-125945513e12 200 200 primary_reverse m:33 CTTGCTGTCAATCGGGCGGC
1.85b854c5-e148-4cd2-a12d-e0e8ef1ff5ee 200 200 supplementary_forward m:31 TTGACCATAGCTCACCTGGA
0.c4211ca9-d683-4a4e-93fa-900463d6ae41 200 200 supplementary_reverse m:35 TGCACTGTCGCGCGAGTCGC
0.8b5365ff-5c69-4106-8821-5a01a426429f 200 200 supplementary_forward m:33 TGCACTGTCGCGCGAGTCGC
0.c0349a71-754d-4722-879c-3910d56fa5e6 200 200 secondary_reverse m:35 TGCACTGTCGCGCGAGTCGC
1.7a3dcd5c-ef3f-4cdd-bd83-daa678963d26 200 200 supplementary_reverse m:38 AACACTGTCCCATGAGTCCG
0.a049173c-4fcf-4f8c-bab8-26eb29c81690 200 200 supplementary_forward m:33 TGCACTGTCGCGCGAGTCGC
0.25e8bcf8-179d-45b4-afc5-d57768082c3e 200 200 primary_reverse m:35 TGCACTGTCGCGCGAGTCGC
0.9381eb02-b7e4-41da-9bbd-38ee15c63b07 200 200 primary_reverse m:35 TGCACTGTCGCGCGAGTCGC
1.98ac705f-0d5b-4928-b3da-17802ad07c74 200 200 primary_forward m:29 TCCACTTTGGTTCTCTTGGC
1.5381c007-404f-44fe-9bed-ccdcf1656a69 200 200 primary_forward m:25 TGGCGTGACGTTCGATTGGT
Compare the read IDs to the sequences. The reads with IDs starting with "1." carry mismatches while those starting with "0." are clean.
Important: This ID-based grouping exists only because we created the test data this way. In real datasets, read IDs have no relationship to sequence features. However, the visual pattern remains the same: a subset of reads consistently differing at certain positions. That's the visual signature of a potential variant.
Unlike the random noise in the previous section, here we see structure. This is closer to what a real heterozygous variant looks like - consistent differences in a subset of reads.
Combining Variant and Modification Views
Add --show-mod-z to see modifications marked alongside the base differences:
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region --show-mod-z variant_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
1.17aca6a6-6fa1-4c88-ab42-1e93576f827c 200 200 secondary_forward m:33 TGTTZZTTTGTGTGTATTCG
0.c0349a71-754d-4722-879c-3910d56fa5e6 200 200 secondary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.eb6c7249-4a6c-4e41-80ec-1cb4c1b6f15e 200 200 secondary_forward m:33 TGCACTGTCGZGZGAGTZGZ
1.912cb0b4-ea05-43bb-b676-4c6ff370139e 200 200 primary_forward m:30 ZGTTATGGZTTZZGAZGGGC
0.dd01e25b-e02c-48b1-9ca2-7907bd752cdc 200 200 secondary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.c4211ca9-d683-4a4e-93fa-900463d6ae41 200 200 supplementary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.b24b8b0f-19a5-4f40-bd3e-c7e70d53143a 200 200 supplementary_forward m:33 TGCACTGTCGZGZGAGTZGZ
0.ccd9fbcd-c7b8-4ff8-a8d0-cb3ac12959b8 200 200 secondary_forward m:33 TGCACTGTCGZGZGAGTZGZ
0.9381eb02-b7e4-41da-9bbd-38ee15c63b07 200 200 primary_reverse m:35 TZCACTZTCZCGCGAGTCZC
1.7a3dcd5c-ef3f-4cdd-bd83-daa678963d26 200 200 supplementary_reverse m:38 AACACTGTCCCATZAZTCCZ
0.f10ef60b-f3be-4533-add7-f5f65886bf9b 200 200 supplementary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.25e8bcf8-179d-45b4-afc5-d57768082c3e 200 200 primary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.a049173c-4fcf-4f8c-bab8-26eb29c81690 200 200 supplementary_forward m:33 TGCACTGTCGZGZGAGTZGZ
1.f04db211-a5cc-4305-a416-9c116251feec 200 200 primary_forward m:34 AAAAZTTAAGZGAGZTTCAT
1.96d6b835-b780-4092-bf06-125945513e12 200 200 primary_reverse m:33 CTTZCTGTCAATCGGZCZZC
1.95198662-f71a-4357-9844-0212b8b67bea 200 200 secondary_reverse m:35 TGGGCCZTCZCZCZAAACAA
1.85b854c5-e148-4cd2-a12d-e0e8ef1ff5ee 200 200 supplementary_forward m:31 TTGAZZATAGZTCACCTGGA
1.69c2b170-232c-46d0-8c95-0897dc42364b 200 200 supplementary_reverse m:31 TZCAGTCTCGCCCCTGCTAZ
0.3d31b1dc-3f11-424d-89c8-ac117984a3e5 200 200 primary_reverse m:35 TZCACTZTCZCGCGAGTCZC
1.98ac705f-0d5b-4928-b3da-17802ad07c74 200 200 primary_forward m:29 TCZAZTTTGGTTZTZTTGGZ
1.7f29111a-c40e-4cdc-b8e6-09c42eb84945 200 200 primary_forward m:27 TTGAATATGZTZCCAGAGCA
1.cf3407a8-e0f1-4db6-9050-5356e3d66ab2 200 200 supplementary_forward m:34 GAGATTGTZTZAZACGGGTG
1.79cbeb0a-8f0f-4d94-be45-a093648442fc 200 200 secondary_reverse m:34 TCCAATZTZZZGCGCGTCTT
1.5381c007-404f-44fe-9bed-ccdcf1656a69 200 200 primary_forward m:25 TGGZGTGAZGTTZGATTGGT
0.d1c4bef6-e405-4de1-819d-e4ae97ed865b 200 200 primary_reverse m:35 TZCACTZTCZCGCGAGTCZC
0.8b5365ff-5c69-4106-8821-5a01a426429f 200 200 supplementary_forward m:33 TGCACTGTCGZGZGAGTZGZ
0.c0b92f6a-2b69-4cc9-83be-a72de23e34d8 200 200 secondary_reverse m:35 TZCACTZTCZCGCGAGTCZC
This combined view lets you inspect whether variants occur near modified bases. In some biological contexts, SNPs can affect modification patterns - or a variant at a modified position might affect how the modification is called. Visual inspection gives you a quick sanity check before deeper analysis.
Note: Modifications at variant positions may be less reliable since the basecaller's modification model may assume the reference base.
Interpreting What You See
Patterns to look for:
- Consistent column differences in a subset of reads → potential variant worth investigating
- Scattered differences across positions → likely sequencing errors or very noisy data
- Single read with many differences → possible alignment issue or sample contamination
Limitations:
- Visual inspection works for quick exploration, not rigorous variant calling
- High coverage helps - with few reads, random errors can look like variants
- Short regions (10-20bp) work best for this approach; longer regions become hard to scan visually
What to do next:
For rigorous SNP detection and genotyping, use specialized variant calling tools. Nanalogue's strength is quick visual inspection - useful for QC, sanity checks, or exploring specific regions of interest.
Creating Test Data
To create your own test BAM files for this tutorial:
- Test data with random errors — All reads have scattered mismatches
- Test data with variants — Two read groups simulating heterozygous-like pattern
Next Steps
- Extracting sequences — More sequence display options
- Quality control of mod data — Assess modification call quality
- Finding highly modified reads — Filter reads by modification level
See Also
- Quick look at your data — Initial data inspection
- CLI Reference — Full documentation of all nanalogue commands
- Recipes — Quick copy-paste snippets