Simulating Test Data

Nanalogue includes nanalogue_sim_bam, a tool for creating synthetic BAM files with modifications. This is useful for testing, learning, and benchmarking.

Basic Usage & Configuration Structure

Create a JSON configuration file describing your desired data, then run:

cat > config.json << 'EOF'
{
  "contigs": {
    "number": 3,
    "len_range": [200, 200]
  },
  "reads": [
    {
      "number": 30,
      "mapq_range": [20, 60],
      "base_qual_range": [20, 40],
      "len_range": [0.3, 0.8],
      "mods": [{
        "base": "C",
        "is_strand_plus": true,
        "mod_code": "m",
        "win": [5, 3],
        "mod_range": [[0.7, 1.0], [0.1, 0.4]]
      }]
    }
  ]
}
EOF
nanalogue_sim_bam config.json output.bam output.fasta

This produces:

A BAM file with simulated reads and modification tags
A FASTA file with the reference contigs

All config files follow the basic structure shown above. Further details follow.

Contigs Section

Field	Description
`number`	Number of contigs to create
`len_range`	`[min, max]` length of contigs in bp
`repeated_seq`	Optional: repeat this sequence to build contig

Reads Section

Each entry in the reads array creates a group of reads. Multiple groups can have different properties.

Field	Description
`number`	Number of reads in this group
`mapq_range`	`[min, max]` mapping quality
`base_qual_range`	`[min, max]` base quality scores
`len_range`	`[min, max]` as fraction of contig length
`delete`	`[start%, end%]` deletion position range
`insert` or `insert_middle`	Insertion sequence string
`mismatch`	Mismatch rate (0.0 to 1.0)
`mods`	Array of modification configurations

Modifications Section

Field	Description
`base`	Target base (e.g., "C" for cytosine)
`is_strand_plus`	true for + strand, false for - strand
`mod_code`	Modification code (e.g., "m" for 5mC)
`win`	List of window sizes (number of target bases, e.g., cytosines)
`mod_range`	List of `[min, max]` probability ranges, same length as `win`

The win and mod_range fields work together to create repeating modification patterns. Each window size in win pairs with the corresponding probability range in mod_range. The pattern cycles along the read until it ends.

For example, with "base": "C", "win": [5, 3] and "mod_range": [[0.7, 1.0], [0.1, 0.4]] creates:

5 cytosines with modification probability 0.7-1.0 (high)
3 cytosines with modification probability 0.1-0.4 (low)
Then repeats: 5 high, 3 low, 5 high, 3 low, ...

Read ID Prefixes

When you create multiple read groups, each group's read IDs are prefixed with the group index:

First group: 0.uuid...
Second group: 1.uuid...
And so on...

This helps identify which group a read belongs to when inspecting output.

Example Configurations

Test data with indels — Insertions and deletions
Test data with random errors — Simulated sequencing errors
Test data with variants — Heterozygous-like SNP patterns

Nanalogue cookbook