Skip to content

Add Qualimap RNA-Seq QC module#19

Merged
ewels merged 25 commits into
mainfrom
qualimap
Mar 4, 2026
Merged

Add Qualimap RNA-Seq QC module#19
ewels merged 25 commits into
mainfrom
qualimap

Conversation

@ewels
Copy link
Copy Markdown
Member

@ewels ewels commented Feb 17, 2026

Summary

Adds a new qualimap/ module that produces Qualimap-compatible RNA-Seq QC output (rnaseq_qc_results.txt and coverage profile TSVs), integrated into the existing single-pass BAM processing pipeline.

  • Phase 1: Core data structures — exon/intron interval tree index, read accumulator with PE mate buffering, per-transcript coverage tracking
  • Phase 2: Output writer, counting fixes, and Qualimap compatibility — rnaseq_qc_results.txt output, coverage profiles (total/high/low), bias computation, junction motif analysis

Key implementation details

  • Reimplements Qualimap's read counting logic: enclosure-based gene assignment, PE fragment processing, SSP estimation per enclosed M-block, intronic/intergenic classification for no-feature reads
  • Replicates Qualimap's CIGAR bug (I/S/H/P ops advance reference offset) for output compatibility
  • Junction reads counted per-junction (per N op), motifs limited to top 11 by percentage
  • Coverage mapped to exonic coordinates (matching Picard's Gene.Transcript model)
  • Auto-detects paired-end mode from BAM flags (no CLI flag needed)

Small benchmark comparison vs Qualimap reference

Metric Status
total / secondary / non-unique Exact match
left / right / read pairs Exact match
ambiguous / intergenic / not aligned Exact match
aligned to genes / no feature 99.3% (212 read difference)
exonic / intronic Very close
junction reads 99% (27,858 vs 28,125)
junction motifs 11 entries, percentages very close
overlapping exon 93% (6,133 vs 5,748)
SSP Close (0.01/0.99 vs 0.03/0.97)
5'/3'/5'-3' bias Needs work (~half of reference values)

Known remaining work

  • Bias values: Coverage magnitude ~half of reference — needs merged gene model (all exons from all transcripts merged per gene, matching Qualimap's Picard Gene model)
  • SSP estimation: Close but not exact — likely related to the 212 read classification difference
  • 212 read discrepancy: Minor, possibly CIGAR edge case or coordinate boundary difference
  • Plots: Not yet implemented (coverage profile plots, junction analysis, genomic origin pie chart)
  • HTML report: Not yet implemented
  • Old genebody module: Still present alongside new qualimap module; needs removal once validated

Note

Medium Risk
Adds a new QC pipeline path and hooks it into the core single-pass BAM counting flow, which could affect performance and output correctness for GTF-based runs. Risk is mitigated by being largely additive and gated by qualimap.enabled (default-on) but still changes default output behavior.

Overview
Adds a new rna::qualimap module that computes Qualimap-style RNA-Seq QC metrics during the existing single-pass BAM processing and writes Qualimap-compatible outputs (e.g. rnaseq_qc_results.txt plus coverage profile TSVs).

Wires the Qualimap accumulator/index into dupradar::counting::count_reads and main.rs, replaces the prior genebody-based Qualimap text output with the new qualimap::output::write_qualimap_results, and introduces qualimap configuration (qualimap.enabled, default true) including passing the GTF path through for report metadata.

Updates/extends benchmark artifacts (small and large) with Qualimap reference outputs and adds a vendored HTML report + static assets under benchmark/qualimap/large; .gitignore now ignores tests/qualimap_ref/.

Written by Cursor Bugbot for commit b142f24. This will update automatically on new commits. Configure here.

… (Phase 1)

Implement the core Qualimap-compatible RNA-Seq QC engine as a new
src/rna/qualimap/ module with its own counting logic, fundamentally
different from the existing dupRadar engine:

- index.rs: Per-chromosome exon COITree with enclosure-based gene
  assignment, intron interval tree, and per-transcript metadata
- accumulator.rs: M-only CIGAR extraction, NH>1 multi-mapper exclusion,
  PE mate buffering with interval combining, junction motif extraction,
  and Qualimap-compatible read counters
- coverage.rs: Per-transcript per-base depth tracking with lazy
  allocation and genomic-to-relative coordinate mapping
- mod.rs: Module re-exports and QualimapResult struct

Integrated into the BAM processing pipeline alongside existing RSeQC
and dupRadar accumulators (both parallel and single-threaded paths).
QualimapIndex built from GTF genes in main.rs, gated by config toggle.
No existing outputs are altered.
Major changes:
- Add output.rs: writes rnaseq_qc_results.txt, coverage profiles (total/
  high/low), bias computation, junction motif output
- Fix PE mate pairing: auto-detect from BAM flags instead of CLI --paired
  flag (was routing all reads through SE path)
- Fix CIGAR bug compatibility: I/S/H/P ops advance reference offset to
  match Qualimap's behavior
- Fix non-unique count: include secondary reads in NH>1 tally
- Fix overlapping exon: count no-feature reads overlapping exon regions
  (not exonic reads in overlapping transcripts)
- Fix junction count: per-junction (per N op) not per-read
- Limit junction motifs to top 11 entries sorted by percentage
- Fix left/right counting: all paired reads, add both_proper_in_pair
- Add SSP estimation per enclosed M-block in both SE and PE paths
- Switch coverage to exonic coordinates with best-per-gene profile
- Auto-detect paired mode for output format (left/right, read pairs)
- Remove unused paired field from QualimapAccum
- Fix clippy warnings, run cargo fmt

Small benchmark results vs Qualimap reference:
- Exact: total/secondary/non-unique/ambiguous/intergenic/left/right/pairs
- Very close (99%+): aligned_to_genes, no_feature, exonic, intronic,
  junction reads, motif percentages
- Remaining work: bias values (~half of ref, needs merged gene model),
  SSP (0.01/0.99 vs 0.03/0.97), 212 read discrepancy
@netlify

This comment was marked as outdated.

…tion, merged gene models

Major changes to match Qualimap's coverage behavior:

- Add MergedGeneModel to index.rs: merges all exons from all transcripts
  of a gene into non-redundant intervals for per-gene coverage tracking
- Add MergedGeneCoverage to coverage.rs: parallel per-gene coverage
  tracking on the merged model
- Rewrite coverage accumulation in accumulator.rs:
  - Per-block, per-read coverage (not per-fragment) so each mate
    independently contributes, matching Qualimap
  - Overlap-based gene detection for coverage (not strict enclosure)
    so partially-overlapping blocks contribute exonic coverage
  - Coverage added before multi-mapper skip, matching Qualimap where
    multi-mapped reads contribute to coverage profiles
- Rewrite profile computation in output.rs: use ALL transcripts with
  mean > 0 (not best-per-gene) via raw TranscriptCoverage struct
- Add transcript_coverage_raw field to QualimapResult for direct
  per-transcript access
- Fix clippy dead_code warnings in genebody and qualimap modules

Coverage profiles now within 10-40% of Qualimap reference (was 250x off).
5' bias improved from 0.36 to 0.54 (ref 0.71). All 12 tests pass.
Qualimap Java merges overlapping/abutting exons per gene before
building its interval tree for enclosure checks. Our code previously
used per-transcript exon intervals, causing reads spanning overlapping
exons from different transcripts to fail enclosure.

Changes:
- Add MergedExonTree (COITree<MergedExonMeta>) built from merged gene
  models, matching Qualimap's per-gene merged interval tree
- Use merged exon tree for find_enclosing_genes() and
  block_enclosed_by_gene() (SSP), keeping per-transcript tree for
  coverage tracking
- Switch classify_no_feature overlap check to merged exon tree
- Add flat_idx to TranscriptCoverageEntry for diagnostic logging
- Add test_merged_exon_enclosure unit test
- Update benchmark reference outputs

Result: exonic, intronic, intergenic, no_feature, ambiguous counts
now exactly match Qualimap reference (small benchmark).
Minor improvements to compute_coverage_profile(): clearer variable
names, better comments explaining the ceiling-step binning approach.
Add debug log for active transcript count and flat_idx field to
TranscriptCoverageEntry for diagnostic purposes.
…ing tests

- Change merge_intervals() condition from <= to < (strict overlap only)
  to match Qualimap's 1-based inclusive coordinate semantics where
  abutting intervals are NOT merged
- Add 5 unit tests for compute_coverage_profile binning logic
Three root causes for the +279 excess (6,027 vs 5,748):

1. Scope: overlapping_exon was only counted for no-feature reads.
   Qualimap counts for ALL reads. Extracted check into new
   check_overlapping_exon() method called from both reconcile_pair()
   and assign_se_read() before gene assignment.

2. Semantics: used 'any overlap' instead of per-node 'overlap but NOT
   enclosed'. Now checks each exon node individually — if the exon
   overlaps the M-block but does NOT enclose it, the read is counted.
   Matches Qualimap's segmentEnclosed / readOverlaps logic.

3. COITree boundary: COITree uses closed [first, last] intervals but
   our coordinates are half-open. Abutting exons were falsely reported
   as overlapping (23 reads). Added explicit half-open overlap check
   in the query callback to filter these out.
Comment thread src/rna/qualimap/accumulator.rs
Comment thread src/rna/qualimap/accumulator.rs
Comment thread src/rna/qualimap/index.rs
…alimap

Qualimap's findIntersectingFeatures() counts SSP per (M-block, gene) pair
before the exonic/ambiguous/no-feature decision. Move SSP counting out of
the per-category branches and into a unified count_ssp_for_blocks() method
that runs for all fragments. This also handles multiple enclosing genes per
block with HashSet deduplication, replacing the single-gene block_enclosed_by_gene helper.
…locks only

- Remove overlap-based per-block coverage accumulation that ran before
  the multi-mapper skip in process_read()
- Add add_enclosed_coverage() that uses the same enclosure check as
  find_enclosing_genes() (block fully within exon interval)
- Call add_enclosed_coverage() after multi-mapper skip: in process_read()
  for SE reads, in reconcile_pair() for PE reads with combined blocks
- Coverage now only counts unique mappers and fully-enclosed M-blocks,
  matching Qualimap's model more closely
- Also fix bias computation: exclude transcripts with zero coverage at
  the respective prime end before computing 5'/3'/5'3' bias ratios
- Fix pre-existing dead_code clippy warning on ExonMeta.gene_idx
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

Comment thread src/main.rs
None
};

// === Build Qualimap exon index (if enabled, GTF-only) ===
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Genebody computation runs but output was removed

Medium Severity

The old genebody output code was removed and replaced by the new qualimap module, but TranscriptPositionMap is still built and GenebodyCoverageAccum::record_coverage still runs per-read in the hot counting loop. genebody_coverage.enabled defaults to true, so every assigned read incurs an O(aligned_blocks × exons) cost whose results are computed into CountResult.genebody but never read or written anywhere. The #[allow(dead_code)] on the field confirms this.

Additional Locations (1)

Fix in Cursor Fix in Web

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Valid observation. The old GenebodyCoverageAccum / TranscriptPositionMap code still runs in the counting hot loop but its output is now superseded by the qualimap module. Removing it is a larger refactoring task (deeply integrated into count_reads return types) — will address as a follow-up to avoid introducing regressions in this PR.

ewels added 2 commits March 3, 2026 20:19
Re-ran Qualimap with correct settings:
- Large dataset: filtered GTF (genes.filtered.gtf), strand-specific-reverse,
  name-sorted BAM, paired-end
- Small dataset: updated to strand-specific-reverse protocol (was non-strand-specific)

Large dataset reference values:
  aligned_to_genes = 126,404,800
  ambiguous = 2,552,785
  no_feature = 39,037,765
  5' bias = 0.87, 3' bias = 0.78
  junctions = 60,467,173
… coverage, abutting exon merge

- Implement per-mate intersection model for strand-specific read assignment,
  matching Qualimap's ComputeCountsTask exactly (ambiguous: 1754 -> 1054)
- Switch to per-block coverage model (addCoverage per CIGAR M-block)
- Fix abutting exon merge in index (< to <= to match Qualimap's abuts())
- Fix integration test binary helper (use rustqc_binary() not env macro)
- Fix unused variable in preseq.rs
- Pass stranded parameter through to QualimapAccum

Results: all assignment counts exact match on small dataset, near-exact
on large dataset (~0.002% off). 5'/3' bias still lower than Qualimap.
@ewels

This comment was marked as outdated.

…off-by-one

Two root causes identified and fixed for the coverage bias discrepancy:

1. Exon ordering for minus-strand transcripts (index.rs): Qualimap Java's
   createHelperMaps() sorts negative-strand exons in descending start order
   via in-place Arrays.sort, so getTranscriptCoordinate() maps the biological
   5' end (rightmost genomic) to position 0 in the coverage array. RustQC was
   always sorting ascending. Now re-sorts minus-strand exons descending after
   interval tree/intron work (which needs ascending order) is done.

2. Picard 1.70 off-by-one in addCoverageCounts (coverage.rs): The loop uses
   'pos < endPos' with 1-based inclusive coordinates, skipping the last base
   of every alignment block. Replicated via effective_end = block_end - 1.

Also fixed: unconditional bias push (matching Qualimap's no-guard behavior)
with NaN filtering for the 5'-3' ratio to avoid sort panics.

Results now match Qualimap exactly on both datasets:
- Small:  5' bias=0.71, 3' bias=0.57, 5'-3' bias=1.30
- Large:  5' bias=0.87, 3' bias=0.78, 5'-3' bias=1.11
- All count metrics (aligned_to_genes, ambiguous, junctions, etc.) unchanged
- HIGH tier coverage profile: exact match to all decimal places
@ewels
Copy link
Copy Markdown
Member Author

ewels commented Mar 3, 2026

5'/3' Coverage Bias — Fully Resolved ✅

Commit cfe9501 fixes the long-standing coverage bias discrepancy. All Qualimap metrics now match exactly on both the small (chr6) and large (GM12878_REP1) datasets.

Root Causes

Two bugs in the coverage-to-exonic-coordinate mapping were identified by deep analysis of Qualimap's Java source and Picard 1.70's decompiled bytecode:

1. Exon ordering for minus-strand transcripts (index.rs)

Qualimap Java's createHelperMaps() sorts negative-strand transcript exons in descending start order via an in-place Arrays.sort() mutation. This means Picard's getTranscriptCoordinate() maps the biological 5' end (rightmost genomic position for minus-strand genes) to position 0 in the coverage array. RustQC was always sorting exons ascending regardless of strand, producing a systematically different coverage array layout for all minus-strand transcripts — the primary cause of the ~30–45% bias error on the small dataset.

2. Picard 1.70 off-by-one in addCoverageCounts (coverage.rs)

Picard's coverage loop uses for (pos = startPos; pos < endPos; ...) with 1-based inclusive coordinates, systematically skipping the last base of every M-block alignment. RustQC's range-based overlap was counting all bases correctly. Replicated via effective_end = block_end - 1.

Results

Metric Small QM Small RQ Large QM Large RQ
aligned_to_genes 31,654 31,654 126,404,800 126,404,800
ambiguous 1,054 1,054 2,552,785 2,552,785
no_feature 14,635 14,635
5' bias 0.71 0.71 0.87 0.87
3' bias 0.57 0.57 0.78 0.78
5'-3' bias 1.30 1.30 1.11 1.11
junctions 28,125 28,125 60,467,173 60,467,173
intronic 30,407,985 30,407,985
intergenic 8,629,780 8,629,780
overlapping_exon 5,748 5,748 6,853,516 6,853,516

Coverage profile (HIGH tier, top 500 transcripts): exact match to all decimal places.

All 178 tests pass, clippy clean, fmt clean.

ewels added 2 commits March 3, 2026 23:34
# Conflicts:
#	benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (High).png
#	benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Low).png
#	benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Total).png
#	benchmark/qualimap/large/images_qualimapReport/Junction Analysis.png
#	benchmark/qualimap/large/images_qualimapReport/Reads Genomic Origin.png
#	benchmark/qualimap/large/images_qualimapReport/Transcript coverage histogram.png
#	benchmark/qualimap/large/qualimapReport.html
#	benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(high).txt
#	benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(low).txt
#	benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt
#	benchmark/qualimap/large/rnaseq_qc_results.txt
…rage clone

Consolidate 4+ redundant merged exon tree queries per M-block into a single
cached query (CachedBlockHits). Reuse pre-computed enclosing genes and cached
hits for buffered PE mates instead of recomputing in reconcile_pair(). Move
TranscriptCoverage by ownership in into_result() instead of deep cloning.

Counting phase: 256s → 155s (39% faster on 11GB BAM, 8 threads).
Qualimap overhead reduced from 85% to 12% of base processing time.
All outputs verified identical.
@netlify
Copy link
Copy Markdown

netlify Bot commented Mar 3, 2026

Deploy Preview for rustqc ready!

Name Link
🔨 Latest commit 56c0ce1
🔍 Latest deploy log https://app.netlify.com/projects/rustqc/deploys/69a805c27465a1000847f12a
😎 Deploy Preview https://deploy-preview-19--rustqc.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

ewels added 8 commits March 4, 2026 00:42
…educe verbosity

- Fix cross-chrom PE merge: count both reads (num_reads=2) not 1
- Remove dead JunctionMotif.intron_start/intron_end fields
- Replace MateInfo + assign_se_read with MateData + classify_and_count
  shared helper, reducing reconcile_pair from 14 to 4 params
- Store (gene_idx, strand) in CachedBlockHits.enclosing_genes,
  eliminating the overlapping Vec and reducing per-mate memory
- Remove unused _index param from find_enclosing_genes_cached
- Remove redundant gene dedup in count_ssp_for_blocks_cached
- Audit #[allow(dead_code)]: remove stale annotation on
  MergedExonMeta.strand (now actively used), add targeted annotations
  on genuinely unused fields (fragment_count, transcript_coverage,
  merged_gene_coverage)
- Add doc comments on CachedBlockHits fields
- Update all benchmark reference outputs (large + small datasets)
- README: add Qualimap rnaseq as a fully documented tool with output
  file descriptions, update How it Works section, add references
- Qualimap outputs include: rnaseq_qc_results.txt, coverage profiles
  for total/high/low expression tiers
- Create outputs/qualimap.mdx: comprehensive output docs with file
  format tables, QC results sections, interpretation guides, config,
  and compatibility notes
- Create benchmarks/qualimap.mdx: performance and output comparison
- Update outputs/tin.mdx: TIN-only page (gene body coverage moved
  to qualimap.mdx)
- Update sidebar: separate TIN and Qualimap entries in both Outputs
  and Benchmarks sections
- Update cross-references in quickstart, introduction, cli-reference,
  rseqc outputs, rseqc benchmarks, and configuration pages
- Add qualimap: config section to configuration.md
…filtered GTF)

Reran benchmarks with correct parameters matching Qualimap Java reference:
strand-specific-reverse, filtered GTF, name-sorted BAM for Java. All key
metrics now match exactly between RustQC and Qualimap Java (both small and
large datasets). Updated docs benchmark page with side-by-side comparison
tables and corrected example values in outputs page.
@ewels ewels merged commit 7bc33a7 into main Mar 4, 2026
8 checks passed
@ewels ewels deleted the qualimap branch March 4, 2026 10:16
@ewels ewels mentioned this pull request Mar 4, 2026
4 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant