Spotting Variants in Sequence Data

When inspecting aligned sequences, you may notice positions where some reads differ from others. These could be sequencing errors (random, scattered) or real variants like SNPs (consistent, appearing in a subset of reads). This tutorial shows how to use nanalogue's sequence visualization to spot these patterns visually. We'll first see what random errors look like, then contrast with a cleaner variant pattern where only some reads carry the difference.

Key Concept

When viewing many reads across a short region (10-20bp) with --full-region, all sequences align to the same length. Random errors scatter across positions. A real variant appears as a vertical column where a subset of reads consistently shows a different base.

Prerequisites

You will need:

  • A BAM file with modification tags (MM and ML tags)
  • Nanalogue installed
  • For this tutorial: test data with simulated mismatches (configurations provided below)

Please read the Extracting sequences tutorial first, as this tutorial builds on those concepts.

Note: The contig names contig_00001, etc. are example names used throughout this guide. In real BAM files aligned to a reference genome, you will see names like chr1, chr2, NC_000001.11, or similar depending on your reference.

What Random Errors Look Like

First, let's see what data looks like when all reads have random mismatches scattered throughout. This simulates sequencing errors or very noisy data.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region error_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
0.0ab823f1-f8e4-4522-961e-fb1832f3975b	200	200	secondary_reverse	m:30	GTGGAGCGCTAAGTAACGAA
0.803d426e-b704-4f32-b864-2e69b75327d2	200	200	secondary_reverse	m:35	ATGTGGCAATGCTGACGAGA
0.2294a43e-d27e-4f3e-a66a-e1d20a9f7051	200	200	secondary_reverse	m:30	GAGGTCAAATGCGGCAAGGA
0.9fb8a78c-22d9-4127-861c-de8cd7fae11c	200	200	primary_reverse	m:38	ACGCCCCAGTCGGAGCGAGG

Notice how the mismatches are distributed randomly across positions. No single column shows a consistent pattern. If you were looking for a real variant, you wouldn't find a clear signal here - just noise.

Spotting a Consistent Variant

Now let's look at data where only some reads have mismatches - simulating a heterozygous-like variant. This test data has two groups of reads: one clean, one with mismatches.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
0.009ec9ba-8d9b-4f96-b895-81a11aff4c7d	200	200	supplementary_reverse	m:32	TCTGAGGTGCACCACCTGAC
1.1d26ce15-97a7-4ef0-9450-a2e1bd97dacc	200	200	primary_reverse	m:32	CCTGACTTGCAGCACAAATC
1.6511c942-e05f-481d-badf-0b032595f91e	200	200	secondary_reverse	m:35	TTCCAGGTACTACAATAGCC
0.4d011595-05f4-4dc5-8dfd-03e70d1c5816	200	200	primary_forward	m:38	TCTGAGGTGCACCACCTGAC
1.55001acf-6aed-4794-b16a-e21df6c9f942	200	200	secondary_reverse	m:30	TTTGACGCTAGCAAGAGACC
1.081f7f88-516c-44b1-bf79-3ae512528a62	200	200	primary_reverse	m:30	AGGGTAGCACTCCCCTAAAT
1.e5c144f5-41fb-4204-ac32-f28e22ceb2e1	200	200	supplementary_forward	m:32	AGTGAGCATAAACTCTGGAT
1.037a63f0-2622-4bd3-b979-2a4ae3c5a3cf	200	200	supplementary_reverse	m:30	TCTCTGGACTATCACCTCTC
1.7ab027ef-dd27-4368-a643-9be8f5800864	200	200	supplementary_forward	m:29	TGCGAGGGACCCGATCTTAT
0.7ef74789-efa0-4312-88ab-357b4de499dc	200	200	primary_reverse	m:32	TCTGAGGTGCACCACCTGAC
1.f6eb17bc-b317-4d47-a780-4085f3784c1d	200	200	secondary_reverse	m:30	ACCCATGTCCACCACCTGAT
1.5cb0e996-83c5-41b6-a14f-26ff0508ec70	200	200	secondary_reverse	m:31	ATAGAGGCCCAAGATCTGAC
1.2ff3ed50-4265-4a50-9c12-fd903c72ab87	200	200	supplementary_forward	m:30	GATGCTCCATACAAGCTGTA
0.49f7e109-e06a-42f5-9904-f1f1c8cbc3ff	200	200	primary_reverse	m:32	TCTGAGGTGCACCACCTGAC
1.99a44ca6-81de-41f2-b62e-df6e399b9376	200	200	secondary_forward	m:35	CATCATGGGGACTAAATGTG
1.110424ff-bb76-46f4-8f60-80c367ee6a6e	200	200	supplementary_forward	m:32	ACTACAGACAACCACCTGGT
0.6a19555d-b96d-4979-a61b-0208e11d264c	200	200	primary_forward	m:38	TCTGAGGTGCACCACCTGAC
1.f5c76ac6-b4f4-4e96-af2c-43da8b76d9a5	200	200	secondary_forward	m:30	TCTAATATTGAGTCCTAAGT

Compare the read IDs to the sequences. The reads with IDs starting with "1." carry mismatches while those starting with "0." are clean.

Important: This ID-based grouping exists only because we created the test data this way. In real datasets, read IDs have no relationship to sequence features. However, the visual pattern remains the same: a subset of reads consistently differing at certain positions. That's the visual signature of a potential variant.

Unlike the random noise in the previous section, here we see structure. This is closer to what a real heterozygous variant looks like - consistent differences in a subset of reads.

Combining Variant and Modification Views

Add --show-mod-z to see modifications marked alongside the base differences:

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region --show-mod-z variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
1.037a63f0-2622-4bd3-b979-2a4ae3c5a3cf	200	200	supplementary_reverse	m:30	TCTCTZZACTATCACCTCTC
1.2ff3ed50-4265-4a50-9c12-fd903c72ab87	200	200	supplementary_forward	m:30	GATGZTZZATAZAAGCTGTA
1.f6eb17bc-b317-4d47-a780-4085f3784c1d	200	200	secondary_reverse	m:30	ACCCATGTCCACCACCTGAT
1.6511c942-e05f-481d-badf-0b032595f91e	200	200	secondary_reverse	m:35	TTCCAZZTACTACAATAZCC
0.009ec9ba-8d9b-4f96-b895-81a11aff4c7d	200	200	supplementary_reverse	m:32	TCTZAGGTGCACCACCTZAC
1.110424ff-bb76-46f4-8f60-80c367ee6a6e	200	200	supplementary_forward	m:32	AZTACAGACAACZAZZTGGT
1.f5c76ac6-b4f4-4e96-af2c-43da8b76d9a5	200	200	secondary_forward	m:30	TZTAATATTGAGTZZTAAGT
1.99a44ca6-81de-41f2-b62e-df6e399b9376	200	200	secondary_forward	m:35	ZATCATGGGGACTAAATGTG
1.7ab027ef-dd27-4368-a643-9be8f5800864	200	200	supplementary_forward	m:29	TGCGAGGGACCZGATZTTAT
0.49f7e109-e06a-42f5-9904-f1f1c8cbc3ff	200	200	primary_reverse	m:32	TCTZAGGTGCACCACCTZAC
1.081f7f88-516c-44b1-bf79-3ae512528a62	200	200	primary_reverse	m:30	AGZZTAZCACTCCCCTAAAT
1.1d26ce15-97a7-4ef0-9450-a2e1bd97dacc	200	200	primary_reverse	m:32	CCTZACTTGCAGCACAAATC
0.4d011595-05f4-4dc5-8dfd-03e70d1c5816	200	200	primary_forward	m:38	TZTGAGGTGZAZZACCTGAC
0.7ef74789-efa0-4312-88ab-357b4de499dc	200	200	primary_reverse	m:32	TCTZAGGTGCACCACCTZAC
1.55001acf-6aed-4794-b16a-e21df6c9f942	200	200	secondary_reverse	m:30	TTTGACGCTAZCAAZAZACC
0.6a19555d-b96d-4979-a61b-0208e11d264c	200	200	primary_forward	m:38	TZTGAGGTGZAZZACCTGAC
1.5cb0e996-83c5-41b6-a14f-26ff0508ec70	200	200	secondary_reverse	m:31	ATAGAZZCCCAAZATCTZAC
1.e5c144f5-41fb-4204-ac32-f28e22ceb2e1	200	200	supplementary_forward	m:32	AGTGAGCATAAACTCTGGAT

This combined view lets you inspect whether variants occur near modified bases. In some biological contexts, SNPs can affect modification patterns - or a variant at a modified position might affect how the modification is called. Visual inspection gives you a quick sanity check before deeper analysis.

Note: Modifications at variant positions may be less reliable since the basecaller's modification model may assume the reference base.

Interpreting What You See

Patterns to look for:

  • Consistent column differences in a subset of reads → potential variant worth investigating
  • Scattered differences across positions → likely sequencing errors or very noisy data
  • Single read with many differences → possible alignment issue or sample contamination

Limitations:

  • Visual inspection works for quick exploration, not rigorous variant calling
  • High coverage helps - with few reads, random errors can look like variants
  • Short regions (10-20bp) work best for this approach; longer regions become hard to scan visually

What to do next:

For rigorous SNP detection and genotyping, use specialized variant calling tools. Nanalogue's strength is quick visual inspection - useful for QC, sanity checks, or exploring specific regions of interest.

Creating Test Data

To create your own test BAM files for this tutorial:

Next Steps

See Also