Spotting Variants in Sequence Data
When inspecting aligned sequences, you may notice positions where some reads differ from others. These could be sequencing errors (random, scattered) or real variants like SNPs (consistent, appearing in a subset of reads). This tutorial shows how to use nanalogue's sequence visualization to spot these patterns visually. We'll first see what random errors look like, then contrast with a cleaner variant pattern where only some reads carry the difference.
Key Concept
When viewing many reads across a short region (10-20bp) with --full-region, all sequences align to the same length.
Random errors scatter across positions.
A real variant appears as a vertical column where a subset of reads consistently shows a different base.
Prerequisites
You will need:
- A BAM file with modification tags (
MMandMLtags) - Nanalogue installed
- For this tutorial: test data with simulated mismatches (configurations provided below)
Please read the Extracting sequences tutorial first, as this tutorial builds on those concepts.
Note: The contig names
contig_00001, etc. are example names used throughout this guide. In real BAM files aligned to a reference genome, you will see names likechr1,chr2,NC_000001.11, or similar depending on your reference.
What Random Errors Look Like
First, let's see what data looks like when all reads have random mismatches scattered throughout. This simulates sequencing errors or very noisy data.
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region error_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
0.0ab823f1-f8e4-4522-961e-fb1832f3975b 200 200 secondary_reverse m:30 GTGGAGCGCTAAGTAACGAA
0.803d426e-b704-4f32-b864-2e69b75327d2 200 200 secondary_reverse m:35 ATGTGGCAATGCTGACGAGA
0.2294a43e-d27e-4f3e-a66a-e1d20a9f7051 200 200 secondary_reverse m:30 GAGGTCAAATGCGGCAAGGA
0.9fb8a78c-22d9-4127-861c-de8cd7fae11c 200 200 primary_reverse m:38 ACGCCCCAGTCGGAGCGAGG
Notice how the mismatches are distributed randomly across positions. No single column shows a consistent pattern. If you were looking for a real variant, you wouldn't find a clear signal here - just noise.
Spotting a Consistent Variant
Now let's look at data where only some reads have mismatches - simulating a heterozygous-like variant. This test data has two groups of reads: one clean, one with mismatches.
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region variant_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
0.009ec9ba-8d9b-4f96-b895-81a11aff4c7d 200 200 supplementary_reverse m:32 TCTGAGGTGCACCACCTGAC
1.1d26ce15-97a7-4ef0-9450-a2e1bd97dacc 200 200 primary_reverse m:32 CCTGACTTGCAGCACAAATC
1.6511c942-e05f-481d-badf-0b032595f91e 200 200 secondary_reverse m:35 TTCCAGGTACTACAATAGCC
0.4d011595-05f4-4dc5-8dfd-03e70d1c5816 200 200 primary_forward m:38 TCTGAGGTGCACCACCTGAC
1.55001acf-6aed-4794-b16a-e21df6c9f942 200 200 secondary_reverse m:30 TTTGACGCTAGCAAGAGACC
1.081f7f88-516c-44b1-bf79-3ae512528a62 200 200 primary_reverse m:30 AGGGTAGCACTCCCCTAAAT
1.e5c144f5-41fb-4204-ac32-f28e22ceb2e1 200 200 supplementary_forward m:32 AGTGAGCATAAACTCTGGAT
1.037a63f0-2622-4bd3-b979-2a4ae3c5a3cf 200 200 supplementary_reverse m:30 TCTCTGGACTATCACCTCTC
1.7ab027ef-dd27-4368-a643-9be8f5800864 200 200 supplementary_forward m:29 TGCGAGGGACCCGATCTTAT
0.7ef74789-efa0-4312-88ab-357b4de499dc 200 200 primary_reverse m:32 TCTGAGGTGCACCACCTGAC
1.f6eb17bc-b317-4d47-a780-4085f3784c1d 200 200 secondary_reverse m:30 ACCCATGTCCACCACCTGAT
1.5cb0e996-83c5-41b6-a14f-26ff0508ec70 200 200 secondary_reverse m:31 ATAGAGGCCCAAGATCTGAC
1.2ff3ed50-4265-4a50-9c12-fd903c72ab87 200 200 supplementary_forward m:30 GATGCTCCATACAAGCTGTA
0.49f7e109-e06a-42f5-9904-f1f1c8cbc3ff 200 200 primary_reverse m:32 TCTGAGGTGCACCACCTGAC
1.99a44ca6-81de-41f2-b62e-df6e399b9376 200 200 secondary_forward m:35 CATCATGGGGACTAAATGTG
1.110424ff-bb76-46f4-8f60-80c367ee6a6e 200 200 supplementary_forward m:32 ACTACAGACAACCACCTGGT
0.6a19555d-b96d-4979-a61b-0208e11d264c 200 200 primary_forward m:38 TCTGAGGTGCACCACCTGAC
1.f5c76ac6-b4f4-4e96-af2c-43da8b76d9a5 200 200 secondary_forward m:30 TCTAATATTGAGTCCTAAGT
Compare the read IDs to the sequences. The reads with IDs starting with "1." carry mismatches while those starting with "0." are clean.
Important: This ID-based grouping exists only because we created the test data this way. In real datasets, read IDs have no relationship to sequence features. However, the visual pattern remains the same: a subset of reads consistently differing at certain positions. That's the visual signature of a potential variant.
Unlike the random noise in the previous section, here we see structure. This is closer to what a real heterozygous variant looks like - consistent differences in a subset of reads.
Combining Variant and Modification Views
Add --show-mod-z to see modifications marked alongside the base differences:
nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
--seq-region contig_00001:90-110 --full-region --show-mod-z variant_data.bam
Example output:
# mod-unmod threshold is 0.5
read_id align_length sequence_length_template alignment_type mod_count sequence
1.037a63f0-2622-4bd3-b979-2a4ae3c5a3cf 200 200 supplementary_reverse m:30 TCTCTZZACTATCACCTCTC
1.2ff3ed50-4265-4a50-9c12-fd903c72ab87 200 200 supplementary_forward m:30 GATGZTZZATAZAAGCTGTA
1.f6eb17bc-b317-4d47-a780-4085f3784c1d 200 200 secondary_reverse m:30 ACCCATGTCCACCACCTGAT
1.6511c942-e05f-481d-badf-0b032595f91e 200 200 secondary_reverse m:35 TTCCAZZTACTACAATAZCC
0.009ec9ba-8d9b-4f96-b895-81a11aff4c7d 200 200 supplementary_reverse m:32 TCTZAGGTGCACCACCTZAC
1.110424ff-bb76-46f4-8f60-80c367ee6a6e 200 200 supplementary_forward m:32 AZTACAGACAACZAZZTGGT
1.f5c76ac6-b4f4-4e96-af2c-43da8b76d9a5 200 200 secondary_forward m:30 TZTAATATTGAGTZZTAAGT
1.99a44ca6-81de-41f2-b62e-df6e399b9376 200 200 secondary_forward m:35 ZATCATGGGGACTAAATGTG
1.7ab027ef-dd27-4368-a643-9be8f5800864 200 200 supplementary_forward m:29 TGCGAGGGACCZGATZTTAT
0.49f7e109-e06a-42f5-9904-f1f1c8cbc3ff 200 200 primary_reverse m:32 TCTZAGGTGCACCACCTZAC
1.081f7f88-516c-44b1-bf79-3ae512528a62 200 200 primary_reverse m:30 AGZZTAZCACTCCCCTAAAT
1.1d26ce15-97a7-4ef0-9450-a2e1bd97dacc 200 200 primary_reverse m:32 CCTZACTTGCAGCACAAATC
0.4d011595-05f4-4dc5-8dfd-03e70d1c5816 200 200 primary_forward m:38 TZTGAGGTGZAZZACCTGAC
0.7ef74789-efa0-4312-88ab-357b4de499dc 200 200 primary_reverse m:32 TCTZAGGTGCACCACCTZAC
1.55001acf-6aed-4794-b16a-e21df6c9f942 200 200 secondary_reverse m:30 TTTGACGCTAZCAAZAZACC
0.6a19555d-b96d-4979-a61b-0208e11d264c 200 200 primary_forward m:38 TZTGAGGTGZAZZACCTGAC
1.5cb0e996-83c5-41b6-a14f-26ff0508ec70 200 200 secondary_reverse m:31 ATAGAZZCCCAAZATCTZAC
1.e5c144f5-41fb-4204-ac32-f28e22ceb2e1 200 200 supplementary_forward m:32 AGTGAGCATAAACTCTGGAT
This combined view lets you inspect whether variants occur near modified bases. In some biological contexts, SNPs can affect modification patterns - or a variant at a modified position might affect how the modification is called. Visual inspection gives you a quick sanity check before deeper analysis.
Note: Modifications at variant positions may be less reliable since the basecaller's modification model may assume the reference base.
Interpreting What You See
Patterns to look for:
- Consistent column differences in a subset of reads → potential variant worth investigating
- Scattered differences across positions → likely sequencing errors or very noisy data
- Single read with many differences → possible alignment issue or sample contamination
Limitations:
- Visual inspection works for quick exploration, not rigorous variant calling
- High coverage helps - with few reads, random errors can look like variants
- Short regions (10-20bp) work best for this approach; longer regions become hard to scan visually
What to do next:
For rigorous SNP detection and genotyping, use specialized variant calling tools. Nanalogue's strength is quick visual inspection - useful for QC, sanity checks, or exploring specific regions of interest.
Creating Test Data
To create your own test BAM files for this tutorial:
- Test data with random errors — All reads have scattered mismatches
- Test data with variants — Two read groups simulating heterozygous-like pattern
Next Steps
- Extracting sequences — More sequence display options
- Quality control of mod data — Assess modification call quality
- Finding highly modified reads — Filter reads by modification level
See Also
- Quick look at your data — Initial data inspection
- CLI Reference — Full documentation of all nanalogue commands
- Recipes — Quick copy-paste snippets