Spotting Variants in Sequence Data

When inspecting aligned sequences, you may notice positions where some reads differ from others. These could be sequencing errors (random, scattered) or real variants like SNPs (consistent, appearing in a subset of reads). This tutorial shows how to use nanalogue's sequence visualization to spot these patterns visually. We'll first see what random errors look like, then contrast with a cleaner variant pattern where only some reads carry the difference.

Key Concept

When viewing many reads across a short region (10-20bp) with --full-region, all sequences align to the same length. Random errors scatter across positions. A real variant appears as a vertical column where a subset of reads consistently shows a different base.

Prerequisites

You will need:

  • A BAM file with modification tags (MM and ML tags)
  • Nanalogue installed
  • For this tutorial: test data with simulated mismatches (configurations provided below)

Please read the Extracting sequences tutorial first, as this tutorial builds on those concepts.

Note: The contig names contig_00001, etc. are example names used throughout this guide. In real BAM files aligned to a reference genome, you will see names like chr1, chr2, NC_000001.11, or similar depending on your reference.

What Random Errors Look Like

First, let's see what data looks like when all reads have random mismatches scattered throughout. This simulates sequencing errors or very noisy data.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region error_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
0.5baeece2-5bce-44d4-9a44-14514e238c3f	200	200	primary_forward	m:27	GGGTCCTCGGTTTTGGTTCG
0.c1f4f826-d447-4dcd-841d-c473fb7d69b8	200	200	supplementary_reverse	m:30	AGATCATCCGCTCCCGTTGC
0.9837fce4-2a05-444d-9e5f-9ace9a0a479b	200	200	supplementary_reverse	m:33	GGAGACTCGCTTGGGTTTAA
0.3e27c960-f501-45b6-bb00-5f8dca01d762	200	200	supplementary_reverse	m:28	CGCTGCCAGTTCATCCAAAC
0.a417fee6-663d-4d3e-a626-103fe9642ac6	200	200	secondary_reverse	m:40	GGGCAGAGGGCCTTAGTTAT
0.6dbb322b-671d-4be9-a5e3-91aaff7de597	200	200	secondary_reverse	m:28	AATACCCCGATATTAGGTAC
0.54c1d349-2a46-4335-b277-c0c60a2ca1d5	200	200	primary_forward	m:27	CGGGGCACGTTCACACTTTT

Notice how the mismatches are distributed randomly across positions. No single column shows a consistent pattern. If you were looking for a real variant, you wouldn't find a clear signal here - just noise.

Spotting a Consistent Variant

Now let's look at data where only some reads have mismatches - simulating a heterozygous-like variant. This test data has two groups of reads: one clean, one with mismatches.

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
1.79cbeb0a-8f0f-4d94-be45-a093648442fc	200	200	secondary_reverse	m:34	TCCAATGTGGGGCGCGTCTT
0.f10ef60b-f3be-4533-add7-f5f65886bf9b	200	200	supplementary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
1.95198662-f71a-4357-9844-0212b8b67bea	200	200	secondary_reverse	m:35	TGGGCCGTCGCGCGAAACAA
1.17aca6a6-6fa1-4c88-ab42-1e93576f827c	200	200	secondary_forward	m:33	TGTTCCTTTGTGTGTATTCG
0.ccd9fbcd-c7b8-4ff8-a8d0-cb3ac12959b8	200	200	secondary_forward	m:33	TGCACTGTCGCGCGAGTCGC
1.7f29111a-c40e-4cdc-b8e6-09c42eb84945	200	200	primary_forward	m:27	TTGAATATGCTCCCAGAGCA
1.912cb0b4-ea05-43bb-b676-4c6ff370139e	200	200	primary_forward	m:30	CGTTATGGCTTCCGACGGGC
1.f04db211-a5cc-4305-a416-9c116251feec	200	200	primary_forward	m:34	AAAACTTAAGCGAGCTTCAT
0.c0b92f6a-2b69-4cc9-83be-a72de23e34d8	200	200	secondary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
0.b24b8b0f-19a5-4f40-bd3e-c7e70d53143a	200	200	supplementary_forward	m:33	TGCACTGTCGCGCGAGTCGC
1.cf3407a8-e0f1-4db6-9050-5356e3d66ab2	200	200	supplementary_forward	m:34	GAGATTGTCTCACACGGGTG
0.eb6c7249-4a6c-4e41-80ec-1cb4c1b6f15e	200	200	secondary_forward	m:33	TGCACTGTCGCGCGAGTCGC
1.69c2b170-232c-46d0-8c95-0897dc42364b	200	200	supplementary_reverse	m:31	TGCAGTCTCGCCCCTGCTAG
0.3d31b1dc-3f11-424d-89c8-ac117984a3e5	200	200	primary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
0.d1c4bef6-e405-4de1-819d-e4ae97ed865b	200	200	primary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
0.dd01e25b-e02c-48b1-9ca2-7907bd752cdc	200	200	secondary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
1.96d6b835-b780-4092-bf06-125945513e12	200	200	primary_reverse	m:33	CTTGCTGTCAATCGGGCGGC
1.85b854c5-e148-4cd2-a12d-e0e8ef1ff5ee	200	200	supplementary_forward	m:31	TTGACCATAGCTCACCTGGA
0.c4211ca9-d683-4a4e-93fa-900463d6ae41	200	200	supplementary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
0.8b5365ff-5c69-4106-8821-5a01a426429f	200	200	supplementary_forward	m:33	TGCACTGTCGCGCGAGTCGC
0.c0349a71-754d-4722-879c-3910d56fa5e6	200	200	secondary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
1.7a3dcd5c-ef3f-4cdd-bd83-daa678963d26	200	200	supplementary_reverse	m:38	AACACTGTCCCATGAGTCCG
0.a049173c-4fcf-4f8c-bab8-26eb29c81690	200	200	supplementary_forward	m:33	TGCACTGTCGCGCGAGTCGC
0.25e8bcf8-179d-45b4-afc5-d57768082c3e	200	200	primary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
0.9381eb02-b7e4-41da-9bbd-38ee15c63b07	200	200	primary_reverse	m:35	TGCACTGTCGCGCGAGTCGC
1.98ac705f-0d5b-4928-b3da-17802ad07c74	200	200	primary_forward	m:29	TCCACTTTGGTTCTCTTGGC
1.5381c007-404f-44fe-9bed-ccdcf1656a69	200	200	primary_forward	m:25	TGGCGTGACGTTCGATTGGT

Compare the read IDs to the sequences. The reads with IDs starting with "1." carry mismatches while those starting with "0." are clean.

Important: This ID-based grouping exists only because we created the test data this way. In real datasets, read IDs have no relationship to sequence features. However, the visual pattern remains the same: a subset of reads consistently differing at certain positions. That's the visual signature of a potential variant.

Unlike the random noise in the previous section, here we see structure. This is closer to what a real heterozygous variant looks like - consistent differences in a subset of reads.

Combining Variant and Modification Views

Add --show-mod-z to see modifications marked alongside the base differences:

nanalogue read-table-show-mods --tag m --region contig_00001:90-110 \
    --seq-region contig_00001:90-110 --full-region --show-mod-z variant_data.bam

Example output:

# mod-unmod threshold is 0.5
read_id	align_length	sequence_length_template	alignment_type	mod_count	sequence
1.17aca6a6-6fa1-4c88-ab42-1e93576f827c	200	200	secondary_forward	m:33	TGTTZZTTTGTGTGTATTCG
0.c0349a71-754d-4722-879c-3910d56fa5e6	200	200	secondary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.eb6c7249-4a6c-4e41-80ec-1cb4c1b6f15e	200	200	secondary_forward	m:33	TGCACTGTCGZGZGAGTZGZ
1.912cb0b4-ea05-43bb-b676-4c6ff370139e	200	200	primary_forward	m:30	ZGTTATGGZTTZZGAZGGGC
0.dd01e25b-e02c-48b1-9ca2-7907bd752cdc	200	200	secondary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.c4211ca9-d683-4a4e-93fa-900463d6ae41	200	200	supplementary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.b24b8b0f-19a5-4f40-bd3e-c7e70d53143a	200	200	supplementary_forward	m:33	TGCACTGTCGZGZGAGTZGZ
0.ccd9fbcd-c7b8-4ff8-a8d0-cb3ac12959b8	200	200	secondary_forward	m:33	TGCACTGTCGZGZGAGTZGZ
0.9381eb02-b7e4-41da-9bbd-38ee15c63b07	200	200	primary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
1.7a3dcd5c-ef3f-4cdd-bd83-daa678963d26	200	200	supplementary_reverse	m:38	AACACTGTCCCATZAZTCCZ
0.f10ef60b-f3be-4533-add7-f5f65886bf9b	200	200	supplementary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.25e8bcf8-179d-45b4-afc5-d57768082c3e	200	200	primary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.a049173c-4fcf-4f8c-bab8-26eb29c81690	200	200	supplementary_forward	m:33	TGCACTGTCGZGZGAGTZGZ
1.f04db211-a5cc-4305-a416-9c116251feec	200	200	primary_forward	m:34	AAAAZTTAAGZGAGZTTCAT
1.96d6b835-b780-4092-bf06-125945513e12	200	200	primary_reverse	m:33	CTTZCTGTCAATCGGZCZZC
1.95198662-f71a-4357-9844-0212b8b67bea	200	200	secondary_reverse	m:35	TGGGCCZTCZCZCZAAACAA
1.85b854c5-e148-4cd2-a12d-e0e8ef1ff5ee	200	200	supplementary_forward	m:31	TTGAZZATAGZTCACCTGGA
1.69c2b170-232c-46d0-8c95-0897dc42364b	200	200	supplementary_reverse	m:31	TZCAGTCTCGCCCCTGCTAZ
0.3d31b1dc-3f11-424d-89c8-ac117984a3e5	200	200	primary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
1.98ac705f-0d5b-4928-b3da-17802ad07c74	200	200	primary_forward	m:29	TCZAZTTTGGTTZTZTTGGZ
1.7f29111a-c40e-4cdc-b8e6-09c42eb84945	200	200	primary_forward	m:27	TTGAATATGZTZCCAGAGCA
1.cf3407a8-e0f1-4db6-9050-5356e3d66ab2	200	200	supplementary_forward	m:34	GAGATTGTZTZAZACGGGTG
1.79cbeb0a-8f0f-4d94-be45-a093648442fc	200	200	secondary_reverse	m:34	TCCAATZTZZZGCGCGTCTT
1.5381c007-404f-44fe-9bed-ccdcf1656a69	200	200	primary_forward	m:25	TGGZGTGAZGTTZGATTGGT
0.d1c4bef6-e405-4de1-819d-e4ae97ed865b	200	200	primary_reverse	m:35	TZCACTZTCZCGCGAGTCZC
0.8b5365ff-5c69-4106-8821-5a01a426429f	200	200	supplementary_forward	m:33	TGCACTGTCGZGZGAGTZGZ
0.c0b92f6a-2b69-4cc9-83be-a72de23e34d8	200	200	secondary_reverse	m:35	TZCACTZTCZCGCGAGTCZC

This combined view lets you inspect whether variants occur near modified bases. In some biological contexts, SNPs can affect modification patterns - or a variant at a modified position might affect how the modification is called. Visual inspection gives you a quick sanity check before deeper analysis.

Note: Modifications at variant positions may be less reliable since the basecaller's modification model may assume the reference base.

Interpreting What You See

Patterns to look for:

  • Consistent column differences in a subset of reads → potential variant worth investigating
  • Scattered differences across positions → likely sequencing errors or very noisy data
  • Single read with many differences → possible alignment issue or sample contamination

Limitations:

  • Visual inspection works for quick exploration, not rigorous variant calling
  • High coverage helps - with few reads, random errors can look like variants
  • Short regions (10-20bp) work best for this approach; longer regions become hard to scan visually

What to do next:

For rigorous SNP detection and genotyping, use specialized variant calling tools. Nanalogue's strength is quick visual inspection - useful for QC, sanity checks, or exploring specific regions of interest.

Creating Test Data

To create your own test BAM files for this tutorial:

Next Steps

See Also