Add Qualimap RNA-Seq QC module by ewels · Pull Request #19 · seqeralabs/RustQC

ewels · 2026-02-17T07:29:19Z

Summary

Adds a new qualimap/ module that produces Qualimap-compatible RNA-Seq QC output (rnaseq_qc_results.txt and coverage profile TSVs), integrated into the existing single-pass BAM processing pipeline.

Phase 1: Core data structures — exon/intron interval tree index, read accumulator with PE mate buffering, per-transcript coverage tracking
Phase 2: Output writer, counting fixes, and Qualimap compatibility — rnaseq_qc_results.txt output, coverage profiles (total/high/low), bias computation, junction motif analysis

Key implementation details

Reimplements Qualimap's read counting logic: enclosure-based gene assignment, PE fragment processing, SSP estimation per enclosed M-block, intronic/intergenic classification for no-feature reads
Replicates Qualimap's CIGAR bug (I/S/H/P ops advance reference offset) for output compatibility
Junction reads counted per-junction (per N op), motifs limited to top 11 by percentage
Coverage mapped to exonic coordinates (matching Picard's Gene.Transcript model)
Auto-detects paired-end mode from BAM flags (no CLI flag needed)

Small benchmark comparison vs Qualimap reference

Metric	Status
total / secondary / non-unique	Exact match
left / right / read pairs	Exact match
ambiguous / intergenic / not aligned	Exact match
aligned to genes / no feature	99.3% (212 read difference)
exonic / intronic	Very close
junction reads	99% (27,858 vs 28,125)
junction motifs	11 entries, percentages very close
overlapping exon	93% (6,133 vs 5,748)
SSP	Close (0.01/0.99 vs 0.03/0.97)
5'/3'/5'-3' bias	Needs work (~half of reference values)

Known remaining work

Bias values: Coverage magnitude ~half of reference — needs merged gene model (all exons from all transcripts merged per gene, matching Qualimap's Picard Gene model)
SSP estimation: Close but not exact — likely related to the 212 read classification difference
212 read discrepancy: Minor, possibly CIGAR edge case or coordinate boundary difference
Plots: Not yet implemented (coverage profile plots, junction analysis, genomic origin pie chart)
HTML report: Not yet implemented
Old genebody module: Still present alongside new qualimap module; needs removal once validated

Note

Medium Risk
Adds a new QC pipeline path and hooks it into the core single-pass BAM counting flow, which could affect performance and output correctness for GTF-based runs. Risk is mitigated by being largely additive and gated by qualimap.enabled (default-on) but still changes default output behavior.

Overview
Adds a new rna::qualimap module that computes Qualimap-style RNA-Seq QC metrics during the existing single-pass BAM processing and writes Qualimap-compatible outputs (e.g. rnaseq_qc_results.txt plus coverage profile TSVs).

Wires the Qualimap accumulator/index into dupradar::counting::count_reads and main.rs, replaces the prior genebody-based Qualimap text output with the new qualimap::output::write_qualimap_results, and introduces qualimap configuration (qualimap.enabled, default true) including passing the GTF path through for report metadata.

Updates/extends benchmark artifacts (small and large) with Qualimap reference outputs and adds a vendored HTML report + static assets under benchmark/qualimap/large; .gitignore now ignores tests/qualimap_ref/.

^{Written by Cursor Bugbot for commit b142f24. This will update automatically on new commits. Configure here.}

… (Phase 1) Implement the core Qualimap-compatible RNA-Seq QC engine as a new src/rna/qualimap/ module with its own counting logic, fundamentally different from the existing dupRadar engine: - index.rs: Per-chromosome exon COITree with enclosure-based gene assignment, intron interval tree, and per-transcript metadata - accumulator.rs: M-only CIGAR extraction, NH>1 multi-mapper exclusion, PE mate buffering with interval combining, junction motif extraction, and Qualimap-compatible read counters - coverage.rs: Per-transcript per-base depth tracking with lazy allocation and genomic-to-relative coordinate mapping - mod.rs: Module re-exports and QualimapResult struct Integrated into the BAM processing pipeline alongside existing RSeQC and dupRadar accumulators (both parallel and single-threaded paths). QualimapIndex built from GTF genes in main.rs, gated by config toggle. No existing outputs are altered.

Major changes: - Add output.rs: writes rnaseq_qc_results.txt, coverage profiles (total/ high/low), bias computation, junction motif output - Fix PE mate pairing: auto-detect from BAM flags instead of CLI --paired flag (was routing all reads through SE path) - Fix CIGAR bug compatibility: I/S/H/P ops advance reference offset to match Qualimap's behavior - Fix non-unique count: include secondary reads in NH>1 tally - Fix overlapping exon: count no-feature reads overlapping exon regions (not exonic reads in overlapping transcripts) - Fix junction count: per-junction (per N op) not per-read - Limit junction motifs to top 11 entries sorted by percentage - Fix left/right counting: all paired reads, add both_proper_in_pair - Add SSP estimation per enclosed M-block in both SE and PE paths - Switch coverage to exonic coordinates with best-per-gene profile - Auto-detect paired mode for output format (left/right, read pairs) - Remove unused paired field from QualimapAccum - Fix clippy warnings, run cargo fmt Small benchmark results vs Qualimap reference: - Exact: total/secondary/non-unique/ambiguous/intergenic/left/right/pairs - Very close (99%+): aligned_to_genes, no_feature, exonic, intronic, junction reads, motif percentages - Remaining work: bias values (~half of ref, needs merged gene model), SSP (0.01/0.99 vs 0.03/0.97), 212 read discrepancy

…tion, merged gene models Major changes to match Qualimap's coverage behavior: - Add MergedGeneModel to index.rs: merges all exons from all transcripts of a gene into non-redundant intervals for per-gene coverage tracking - Add MergedGeneCoverage to coverage.rs: parallel per-gene coverage tracking on the merged model - Rewrite coverage accumulation in accumulator.rs: - Per-block, per-read coverage (not per-fragment) so each mate independently contributes, matching Qualimap - Overlap-based gene detection for coverage (not strict enclosure) so partially-overlapping blocks contribute exonic coverage - Coverage added before multi-mapper skip, matching Qualimap where multi-mapped reads contribute to coverage profiles - Rewrite profile computation in output.rs: use ALL transcripts with mean > 0 (not best-per-gene) via raw TranscriptCoverage struct - Add transcript_coverage_raw field to QualimapResult for direct per-transcript access - Fix clippy dead_code warnings in genebody and qualimap modules Coverage profiles now within 10-40% of Qualimap reference (was 250x off). 5' bias improved from 0.36 to 0.54 (ref 0.71). All 12 tests pass.

Qualimap Java merges overlapping/abutting exons per gene before building its interval tree for enclosure checks. Our code previously used per-transcript exon intervals, causing reads spanning overlapping exons from different transcripts to fail enclosure. Changes: - Add MergedExonTree (COITree<MergedExonMeta>) built from merged gene models, matching Qualimap's per-gene merged interval tree - Use merged exon tree for find_enclosing_genes() and block_enclosed_by_gene() (SSP), keeping per-transcript tree for coverage tracking - Switch classify_no_feature overlap check to merged exon tree - Add flat_idx to TranscriptCoverageEntry for diagnostic logging - Add test_merged_exon_enclosure unit test - Update benchmark reference outputs Result: exonic, intronic, intergenic, no_feature, ambiguous counts now exactly match Qualimap reference (small benchmark).

Minor improvements to compute_coverage_profile(): clearer variable names, better comments explaining the ceiling-step binning approach. Add debug log for active transcript count and flat_idx field to TranscriptCoverageEntry for diagnostic purposes.

…ing tests - Change merge_intervals() condition from <= to < (strict overlap only) to match Qualimap's 1-based inclusive coordinate semantics where abutting intervals are NOT merged - Add 5 unit tests for compute_coverage_profile binning logic

Three root causes for the +279 excess (6,027 vs 5,748): 1. Scope: overlapping_exon was only counted for no-feature reads. Qualimap counts for ALL reads. Extracted check into new check_overlapping_exon() method called from both reconcile_pair() and assign_se_read() before gene assignment. 2. Semantics: used 'any overlap' instead of per-node 'overlap but NOT enclosed'. Now checks each exon node individually — if the exon overlaps the M-block but does NOT enclose it, the read is counted. Matches Qualimap's segmentEnclosed / readOverlaps logic. 3. COITree boundary: COITree uses closed [first, last] intervals but our coordinates are half-open. Abutting exons were falsely reported as overlapping (23 reads). Added explicit half-open overlap check in the query callback to filter these out.

…alimap Qualimap's findIntersectingFeatures() counts SSP per (M-block, gene) pair before the exonic/ambiguous/no-feature decision. Move SSP counting out of the per-category branches and into a unified count_ssp_for_blocks() method that runs for all fragments. This also handles multiple enclosing genes per block with HashSet deduplication, replacing the single-gene block_enclosed_by_gene helper.

…locks only - Remove overlap-based per-block coverage accumulation that ran before the multi-mapper skip in process_read() - Add add_enclosed_coverage() that uses the same enclosure check as find_enclosing_genes() (block fully within exon interval) - Call add_enclosed_coverage() after multi-mapper skip: in process_read() for SE reads, in reconcile_pair() for PE reads with combined blocks - Coverage now only counts unique mappers and fully-enclosed M-blocks, matching Qualimap's model more closely - Also fix bias computation: exclude transcripts with zero coverage at the respective prime end before computing 5'/3'/5'3' bias ratios - Fix pre-existing dead_code clippy warning on ExonMeta.gene_idx

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

^{Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.}

cursor · 2026-02-18T08:29:40Z

        None
    };

+    // === Build Qualimap exon index (if enabled, GTF-only) ===


Genebody computation runs but output was removed

Medium Severity

The old genebody output code was removed and replaced by the new qualimap module, but TranscriptPositionMap is still built and GenebodyCoverageAccum::record_coverage still runs per-read in the hot counting loop. genebody_coverage.enabled defaults to true, so every assigned read incurs an O(aligned_blocks × exons) cost whose results are computed into CountResult.genebody but never read or written anywhere. The #[allow(dead_code)] on the field confirms this.

Additional Locations (1)

src/rna/dupradar/counting.rs#L357-L360

Valid observation. The old GenebodyCoverageAccum / TranscriptPositionMap code still runs in the counting hot loop but its output is now superseded by the qualimap module. Removing it is a larger refactoring task (deeply integrated into count_reads return types) — will address as a follow-up to avoid introducing regressions in this PR.

Re-ran Qualimap with correct settings: - Large dataset: filtered GTF (genes.filtered.gtf), strand-specific-reverse, name-sorted BAM, paired-end - Small dataset: updated to strand-specific-reverse protocol (was non-strand-specific) Large dataset reference values: aligned_to_genes = 126,404,800 ambiguous = 2,552,785 no_feature = 39,037,765 5' bias = 0.87, 3' bias = 0.78 junctions = 60,467,173

… coverage, abutting exon merge - Implement per-mate intersection model for strand-specific read assignment, matching Qualimap's ComputeCountsTask exactly (ambiguous: 1754 -> 1054) - Switch to per-block coverage model (addCoverage per CIGAR M-block) - Fix abutting exon merge in index (< to <= to match Qualimap's abuts()) - Fix integration test binary helper (use rustqc_binary() not env macro) - Fix unused variable in preseq.rs - Pass stranded parameter through to QualimapAccum Results: all assignment counts exact match on small dataset, near-exact on large dataset (~0.002% off). 5'/3' bias still lower than Qualimap.

…off-by-one Two root causes identified and fixed for the coverage bias discrepancy: 1. Exon ordering for minus-strand transcripts (index.rs): Qualimap Java's createHelperMaps() sorts negative-strand exons in descending start order via in-place Arrays.sort, so getTranscriptCoordinate() maps the biological 5' end (rightmost genomic) to position 0 in the coverage array. RustQC was always sorting ascending. Now re-sorts minus-strand exons descending after interval tree/intron work (which needs ascending order) is done. 2. Picard 1.70 off-by-one in addCoverageCounts (coverage.rs): The loop uses 'pos < endPos' with 1-based inclusive coordinates, skipping the last base of every alignment block. Replicated via effective_end = block_end - 1. Also fixed: unconditional bias push (matching Qualimap's no-guard behavior) with NaN filtering for the 5'-3' ratio to avoid sort panics. Results now match Qualimap exactly on both datasets: - Small: 5' bias=0.71, 3' bias=0.57, 5'-3' bias=1.30 - Large: 5' bias=0.87, 3' bias=0.78, 5'-3' bias=1.11 - All count metrics (aligned_to_genes, ambiguous, junctions, etc.) unchanged - HIGH tier coverage profile: exact match to all decimal places

ewels · 2026-03-03T22:32:16Z

5'/3' Coverage Bias — Fully Resolved ✅

Commit cfe9501 fixes the long-standing coverage bias discrepancy. All Qualimap metrics now match exactly on both the small (chr6) and large (GM12878_REP1) datasets.

Root Causes

Two bugs in the coverage-to-exonic-coordinate mapping were identified by deep analysis of Qualimap's Java source and Picard 1.70's decompiled bytecode:

1. Exon ordering for minus-strand transcripts (index.rs)

Qualimap Java's createHelperMaps() sorts negative-strand transcript exons in descending start order via an in-place Arrays.sort() mutation. This means Picard's getTranscriptCoordinate() maps the biological 5' end (rightmost genomic position for minus-strand genes) to position 0 in the coverage array. RustQC was always sorting exons ascending regardless of strand, producing a systematically different coverage array layout for all minus-strand transcripts — the primary cause of the ~30–45% bias error on the small dataset.

2. Picard 1.70 off-by-one in addCoverageCounts (coverage.rs)

Picard's coverage loop uses for (pos = startPos; pos < endPos; ...) with 1-based inclusive coordinates, systematically skipping the last base of every M-block alignment. RustQC's range-based overlap was counting all bases correctly. Replicated via effective_end = block_end - 1.

Results

Metric	Small QM	Small RQ	Large QM	Large RQ
aligned_to_genes	31,654	31,654 ✅	126,404,800	126,404,800 ✅
ambiguous	1,054	1,054 ✅	2,552,785	2,552,785 ✅
no_feature	14,635	14,635 ✅	—	—
5' bias	0.71	0.71 ✅	0.87	0.87 ✅
3' bias	0.57	0.57 ✅	0.78	0.78 ✅
5'-3' bias	1.30	1.30 ✅	1.11	1.11 ✅
junctions	28,125	28,125 ✅	60,467,173	60,467,173 ✅
intronic	—	—	30,407,985	30,407,985 ✅
intergenic	—	—	8,629,780	8,629,780 ✅
overlapping_exon	5,748	5,748 ✅	6,853,516	6,853,516 ✅

Coverage profile (HIGH tier, top 500 transcripts): exact match to all decimal places.

All 178 tests pass, clippy clean, fmt clean.

# Conflicts: # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (High).png # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Low).png # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Total).png # benchmark/qualimap/large/images_qualimapReport/Junction Analysis.png # benchmark/qualimap/large/images_qualimapReport/Reads Genomic Origin.png # benchmark/qualimap/large/images_qualimapReport/Transcript coverage histogram.png # benchmark/qualimap/large/qualimapReport.html # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(high).txt # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(low).txt # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt # benchmark/qualimap/large/rnaseq_qc_results.txt

…rage clone Consolidate 4+ redundant merged exon tree queries per M-block into a single cached query (CachedBlockHits). Reuse pre-computed enclosing genes and cached hits for buffered PE mates instead of recomputing in reconcile_pair(). Move TranscriptCoverage by ownership in into_result() instead of deep cloning. Counting phase: 256s → 155s (39% faster on 11GB BAM, 8 threads). Qualimap overhead reduced from 85% to 12% of base processing time. All outputs verified identical.

netlify · 2026-03-03T23:35:51Z

✅ Deploy Preview for rustqc ready!

Name	Link
🔨 Latest commit	`56c0ce1`
🔍 Latest deploy log	https://app.netlify.com/projects/rustqc/deploys/69a805c27465a1000847f12a
😎 Deploy Preview	https://deploy-preview-19--rustqc.netlify.app
📱 Preview on mobile	Toggle QR Code... Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

…educe verbosity - Fix cross-chrom PE merge: count both reads (num_reads=2) not 1 - Remove dead JunctionMotif.intron_start/intron_end fields - Replace MateInfo + assign_se_read with MateData + classify_and_count shared helper, reducing reconcile_pair from 14 to 4 params - Store (gene_idx, strand) in CachedBlockHits.enclosing_genes, eliminating the overlapping Vec and reducing per-mate memory - Remove unused _index param from find_enclosing_genes_cached - Remove redundant gene dedup in count_ssp_for_blocks_cached - Audit #[allow(dead_code)]: remove stale annotation on MergedExonMeta.strand (now actively used), add targeted annotations on genuinely unused fields (fragment_count, transcript_coverage, merged_gene_coverage) - Add doc comments on CachedBlockHits fields

…println with log::info

- Update all benchmark reference outputs (large + small datasets) - README: add Qualimap rnaseq as a fully documented tool with output file descriptions, update How it Works section, add references - Qualimap outputs include: rnaseq_qc_results.txt, coverage profiles for total/high/low expression tiers

- Create outputs/qualimap.mdx: comprehensive output docs with file format tables, QC results sections, interpretation guides, config, and compatibility notes - Create benchmarks/qualimap.mdx: performance and output comparison - Update outputs/tin.mdx: TIN-only page (gene body coverage moved to qualimap.mdx) - Update sidebar: separate TIN and Qualimap entries in both Outputs and Benchmarks sections - Update cross-references in quickstart, introduction, cli-reference, rseqc outputs, rseqc benchmarks, and configuration pages - Add qualimap: config section to configuration.md

…ap Java

…filtered GTF) Reran benchmarks with correct parameters matching Qualimap Java reference: strand-specific-reverse, filtered GTF, name-sorted BAM for Java. All key metrics now match exactly between RustQC and Qualimap Java (both small and large datasets). Updated docs benchmark page with side-by-side comparison tables and corrected example values in outputs page.

ewels added 3 commits February 16, 2026 20:27

Add Qualimap rnaseq benchmark output for large dataset

b6533db

This comment was marked as outdated.

Sign in to view

ewels added 6 commits February 17, 2026 09:48

Fix junction count: count all CIGAR N-ops for reads_at_junctions

cd70d58

cursor Bot reviewed Feb 18, 2026

View reviewed changes

Comment thread src/rna/qualimap/accumulator.rs

Comment thread src/rna/qualimap/accumulator.rs

Comment thread src/rna/qualimap/index.rs

ewels added 2 commits February 18, 2026 07:54

cursor Bot reviewed Feb 18, 2026

View reviewed changes

ewels added 2 commits March 3, 2026 20:19

This comment was marked as outdated.

Sign in to view

ewels added 2 commits March 3, 2026 23:34

ewels added 8 commits March 4, 2026 00:42

Merge branch 'main' into qualimap

e122246

Fix PR review: merge merged_gene_coverage in parallel mode, replace e…

33d608b

…println with log::info

Note that RustQC skips the samtools name-sort step required by Qualim…

2c0ece4

…ap Java

Fix MDX build: escape < in qualimap benchmark page

623933e

Add samtools name-sort step and Qualimap to combined benchmark page

56c0ce1

ewels merged commit 7bc33a7 into main Mar 4, 2026
8 checks passed

ewels deleted the qualimap branch March 4, 2026 10:16

ewels mentioned this pull request Mar 4, 2026

Remove dead genebody coverage code #24

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add Qualimap RNA-Seq QC module#19

Add Qualimap RNA-Seq QC module#19
ewels merged 25 commits into
mainfrom
qualimap

ewels commented Feb 17, 2026 •

edited by cursor Bot

Loading

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Uh oh!

cursor Bot Feb 18, 2026

Uh oh!

ewels Mar 3, 2026

Uh oh!

This comment was marked as outdated.

ewels commented Mar 3, 2026

Uh oh!

netlify Bot commented Mar 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

ewels commented Feb 17, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Key implementation details

Small benchmark comparison vs Qualimap reference

Known remaining work

Uh oh!

This comment was marked as outdated.

Uh oh!

Uh oh!

Uh oh!

cursor Bot left a comment

Choose a reason for hiding this comment

Uh oh!

cursor Bot Feb 18, 2026

Choose a reason for hiding this comment

Genebody computation runs but output was removed

Uh oh!

ewels Mar 3, 2026

Choose a reason for hiding this comment

Uh oh!

This comment was marked as outdated.

ewels commented Mar 3, 2026

5'/3' Coverage Bias — Fully Resolved ✅

Root Causes

Results

Uh oh!

netlify Bot commented Mar 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ Deploy Preview for rustqc ready!

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

ewels commented Feb 17, 2026 •

edited by cursor Bot

Loading

netlify Bot commented Mar 3, 2026 •

edited

Loading