Conversation
… (Phase 1) Implement the core Qualimap-compatible RNA-Seq QC engine as a new src/rna/qualimap/ module with its own counting logic, fundamentally different from the existing dupRadar engine: - index.rs: Per-chromosome exon COITree with enclosure-based gene assignment, intron interval tree, and per-transcript metadata - accumulator.rs: M-only CIGAR extraction, NH>1 multi-mapper exclusion, PE mate buffering with interval combining, junction motif extraction, and Qualimap-compatible read counters - coverage.rs: Per-transcript per-base depth tracking with lazy allocation and genomic-to-relative coordinate mapping - mod.rs: Module re-exports and QualimapResult struct Integrated into the BAM processing pipeline alongside existing RSeQC and dupRadar accumulators (both parallel and single-threaded paths). QualimapIndex built from GTF genes in main.rs, gated by config toggle. No existing outputs are altered.
Major changes: - Add output.rs: writes rnaseq_qc_results.txt, coverage profiles (total/ high/low), bias computation, junction motif output - Fix PE mate pairing: auto-detect from BAM flags instead of CLI --paired flag (was routing all reads through SE path) - Fix CIGAR bug compatibility: I/S/H/P ops advance reference offset to match Qualimap's behavior - Fix non-unique count: include secondary reads in NH>1 tally - Fix overlapping exon: count no-feature reads overlapping exon regions (not exonic reads in overlapping transcripts) - Fix junction count: per-junction (per N op) not per-read - Limit junction motifs to top 11 entries sorted by percentage - Fix left/right counting: all paired reads, add both_proper_in_pair - Add SSP estimation per enclosed M-block in both SE and PE paths - Switch coverage to exonic coordinates with best-per-gene profile - Auto-detect paired mode for output format (left/right, read pairs) - Remove unused paired field from QualimapAccum - Fix clippy warnings, run cargo fmt Small benchmark results vs Qualimap reference: - Exact: total/secondary/non-unique/ambiguous/intergenic/left/right/pairs - Very close (99%+): aligned_to_genes, no_feature, exonic, intronic, junction reads, motif percentages - Remaining work: bias values (~half of ref, needs merged gene model), SSP (0.01/0.99 vs 0.03/0.97), 212 read discrepancy
This comment was marked as outdated.
This comment was marked as outdated.
…tion, merged gene models
Major changes to match Qualimap's coverage behavior:
- Add MergedGeneModel to index.rs: merges all exons from all transcripts
of a gene into non-redundant intervals for per-gene coverage tracking
- Add MergedGeneCoverage to coverage.rs: parallel per-gene coverage
tracking on the merged model
- Rewrite coverage accumulation in accumulator.rs:
- Per-block, per-read coverage (not per-fragment) so each mate
independently contributes, matching Qualimap
- Overlap-based gene detection for coverage (not strict enclosure)
so partially-overlapping blocks contribute exonic coverage
- Coverage added before multi-mapper skip, matching Qualimap where
multi-mapped reads contribute to coverage profiles
- Rewrite profile computation in output.rs: use ALL transcripts with
mean > 0 (not best-per-gene) via raw TranscriptCoverage struct
- Add transcript_coverage_raw field to QualimapResult for direct
per-transcript access
- Fix clippy dead_code warnings in genebody and qualimap modules
Coverage profiles now within 10-40% of Qualimap reference (was 250x off).
5' bias improved from 0.36 to 0.54 (ref 0.71). All 12 tests pass.
Qualimap Java merges overlapping/abutting exons per gene before building its interval tree for enclosure checks. Our code previously used per-transcript exon intervals, causing reads spanning overlapping exons from different transcripts to fail enclosure. Changes: - Add MergedExonTree (COITree<MergedExonMeta>) built from merged gene models, matching Qualimap's per-gene merged interval tree - Use merged exon tree for find_enclosing_genes() and block_enclosed_by_gene() (SSP), keeping per-transcript tree for coverage tracking - Switch classify_no_feature overlap check to merged exon tree - Add flat_idx to TranscriptCoverageEntry for diagnostic logging - Add test_merged_exon_enclosure unit test - Update benchmark reference outputs Result: exonic, intronic, intergenic, no_feature, ambiguous counts now exactly match Qualimap reference (small benchmark).
Minor improvements to compute_coverage_profile(): clearer variable names, better comments explaining the ceiling-step binning approach. Add debug log for active transcript count and flat_idx field to TranscriptCoverageEntry for diagnostic purposes.
…ing tests - Change merge_intervals() condition from <= to < (strict overlap only) to match Qualimap's 1-based inclusive coordinate semantics where abutting intervals are NOT merged - Add 5 unit tests for compute_coverage_profile binning logic
Three root causes for the +279 excess (6,027 vs 5,748): 1. Scope: overlapping_exon was only counted for no-feature reads. Qualimap counts for ALL reads. Extracted check into new check_overlapping_exon() method called from both reconcile_pair() and assign_se_read() before gene assignment. 2. Semantics: used 'any overlap' instead of per-node 'overlap but NOT enclosed'. Now checks each exon node individually — if the exon overlaps the M-block but does NOT enclose it, the read is counted. Matches Qualimap's segmentEnclosed / readOverlaps logic. 3. COITree boundary: COITree uses closed [first, last] intervals but our coordinates are half-open. Abutting exons were falsely reported as overlapping (23 reads). Added explicit half-open overlap check in the query callback to filter these out.
…alimap Qualimap's findIntersectingFeatures() counts SSP per (M-block, gene) pair before the exonic/ambiguous/no-feature decision. Move SSP counting out of the per-category branches and into a unified count_ssp_for_blocks() method that runs for all fragments. This also handles multiple enclosing genes per block with HashSet deduplication, replacing the single-gene block_enclosed_by_gene helper.
…locks only - Remove overlap-based per-block coverage accumulation that ran before the multi-mapper skip in process_read() - Add add_enclosed_coverage() that uses the same enclosure check as find_enclosing_genes() (block fully within exon interval) - Call add_enclosed_coverage() after multi-mapper skip: in process_read() for SE reads, in reconcile_pair() for PE reads with combined blocks - Coverage now only counts unique mappers and fully-enclosed M-blocks, matching Qualimap's model more closely - Also fix bias computation: exclude transcripts with zero coverage at the respective prime end before computing 5'/3'/5'3' bias ratios - Fix pre-existing dead_code clippy warning on ExonMeta.gene_idx
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
| None | ||
| }; | ||
|
|
||
| // === Build Qualimap exon index (if enabled, GTF-only) === |
There was a problem hiding this comment.
Genebody computation runs but output was removed
Medium Severity
The old genebody output code was removed and replaced by the new qualimap module, but TranscriptPositionMap is still built and GenebodyCoverageAccum::record_coverage still runs per-read in the hot counting loop. genebody_coverage.enabled defaults to true, so every assigned read incurs an O(aligned_blocks × exons) cost whose results are computed into CountResult.genebody but never read or written anywhere. The #[allow(dead_code)] on the field confirms this.
Additional Locations (1)
There was a problem hiding this comment.
Valid observation. The old GenebodyCoverageAccum / TranscriptPositionMap code still runs in the counting hot loop but its output is now superseded by the qualimap module. Removing it is a larger refactoring task (deeply integrated into count_reads return types) — will address as a follow-up to avoid introducing regressions in this PR.
Re-ran Qualimap with correct settings: - Large dataset: filtered GTF (genes.filtered.gtf), strand-specific-reverse, name-sorted BAM, paired-end - Small dataset: updated to strand-specific-reverse protocol (was non-strand-specific) Large dataset reference values: aligned_to_genes = 126,404,800 ambiguous = 2,552,785 no_feature = 39,037,765 5' bias = 0.87, 3' bias = 0.78 junctions = 60,467,173
… coverage, abutting exon merge - Implement per-mate intersection model for strand-specific read assignment, matching Qualimap's ComputeCountsTask exactly (ambiguous: 1754 -> 1054) - Switch to per-block coverage model (addCoverage per CIGAR M-block) - Fix abutting exon merge in index (< to <= to match Qualimap's abuts()) - Fix integration test binary helper (use rustqc_binary() not env macro) - Fix unused variable in preseq.rs - Pass stranded parameter through to QualimapAccum Results: all assignment counts exact match on small dataset, near-exact on large dataset (~0.002% off). 5'/3' bias still lower than Qualimap.
This comment was marked as outdated.
This comment was marked as outdated.
…off-by-one Two root causes identified and fixed for the coverage bias discrepancy: 1. Exon ordering for minus-strand transcripts (index.rs): Qualimap Java's createHelperMaps() sorts negative-strand exons in descending start order via in-place Arrays.sort, so getTranscriptCoordinate() maps the biological 5' end (rightmost genomic) to position 0 in the coverage array. RustQC was always sorting ascending. Now re-sorts minus-strand exons descending after interval tree/intron work (which needs ascending order) is done. 2. Picard 1.70 off-by-one in addCoverageCounts (coverage.rs): The loop uses 'pos < endPos' with 1-based inclusive coordinates, skipping the last base of every alignment block. Replicated via effective_end = block_end - 1. Also fixed: unconditional bias push (matching Qualimap's no-guard behavior) with NaN filtering for the 5'-3' ratio to avoid sort panics. Results now match Qualimap exactly on both datasets: - Small: 5' bias=0.71, 3' bias=0.57, 5'-3' bias=1.30 - Large: 5' bias=0.87, 3' bias=0.78, 5'-3' bias=1.11 - All count metrics (aligned_to_genes, ambiguous, junctions, etc.) unchanged - HIGH tier coverage profile: exact match to all decimal places
5'/3' Coverage Bias — Fully Resolved ✅Commit Root CausesTwo bugs in the coverage-to-exonic-coordinate mapping were identified by deep analysis of Qualimap's Java source and Picard 1.70's decompiled bytecode: 1. Exon ordering for minus-strand transcripts ( Qualimap Java's 2. Picard 1.70 off-by-one in Picard's coverage loop uses Results
Coverage profile (HIGH tier, top 500 transcripts): exact match to all decimal places. All 178 tests pass, clippy clean, fmt clean. |
# Conflicts: # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (High).png # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Low).png # benchmark/qualimap/large/images_qualimapReport/Coverage Profile Along Genes (Total).png # benchmark/qualimap/large/images_qualimapReport/Junction Analysis.png # benchmark/qualimap/large/images_qualimapReport/Reads Genomic Origin.png # benchmark/qualimap/large/images_qualimapReport/Transcript coverage histogram.png # benchmark/qualimap/large/qualimapReport.html # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(high).txt # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(low).txt # benchmark/qualimap/large/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt # benchmark/qualimap/large/rnaseq_qc_results.txt
…rage clone Consolidate 4+ redundant merged exon tree queries per M-block into a single cached query (CachedBlockHits). Reuse pre-computed enclosing genes and cached hits for buffered PE mates instead of recomputing in reconcile_pair(). Move TranscriptCoverage by ownership in into_result() instead of deep cloning. Counting phase: 256s → 155s (39% faster on 11GB BAM, 8 threads). Qualimap overhead reduced from 85% to 12% of base processing time. All outputs verified identical.
✅ Deploy Preview for rustqc ready!
To edit notification comments on pull requests, go to your Netlify project configuration. |
…educe verbosity - Fix cross-chrom PE merge: count both reads (num_reads=2) not 1 - Remove dead JunctionMotif.intron_start/intron_end fields - Replace MateInfo + assign_se_read with MateData + classify_and_count shared helper, reducing reconcile_pair from 14 to 4 params - Store (gene_idx, strand) in CachedBlockHits.enclosing_genes, eliminating the overlapping Vec and reducing per-mate memory - Remove unused _index param from find_enclosing_genes_cached - Remove redundant gene dedup in count_ssp_for_blocks_cached - Audit #[allow(dead_code)]: remove stale annotation on MergedExonMeta.strand (now actively used), add targeted annotations on genuinely unused fields (fragment_count, transcript_coverage, merged_gene_coverage) - Add doc comments on CachedBlockHits fields
…println with log::info
- Update all benchmark reference outputs (large + small datasets) - README: add Qualimap rnaseq as a fully documented tool with output file descriptions, update How it Works section, add references - Qualimap outputs include: rnaseq_qc_results.txt, coverage profiles for total/high/low expression tiers
- Create outputs/qualimap.mdx: comprehensive output docs with file format tables, QC results sections, interpretation guides, config, and compatibility notes - Create benchmarks/qualimap.mdx: performance and output comparison - Update outputs/tin.mdx: TIN-only page (gene body coverage moved to qualimap.mdx) - Update sidebar: separate TIN and Qualimap entries in both Outputs and Benchmarks sections - Update cross-references in quickstart, introduction, cli-reference, rseqc outputs, rseqc benchmarks, and configuration pages - Add qualimap: config section to configuration.md
…filtered GTF) Reran benchmarks with correct parameters matching Qualimap Java reference: strand-specific-reverse, filtered GTF, name-sorted BAM for Java. All key metrics now match exactly between RustQC and Qualimap Java (both small and large datasets). Updated docs benchmark page with side-by-side comparison tables and corrected example values in outputs page.


Summary
Adds a new
qualimap/module that produces Qualimap-compatible RNA-Seq QC output (rnaseq_qc_results.txtand coverage profile TSVs), integrated into the existing single-pass BAM processing pipeline.rnaseq_qc_results.txtoutput, coverage profiles (total/high/low), bias computation, junction motif analysisKey implementation details
Gene.Transcriptmodel)Small benchmark comparison vs Qualimap reference
Known remaining work
Note
Medium Risk
Adds a new QC pipeline path and hooks it into the core single-pass BAM counting flow, which could affect performance and output correctness for GTF-based runs. Risk is mitigated by being largely additive and gated by
qualimap.enabled(default-on) but still changes default output behavior.Overview
Adds a new
rna::qualimapmodule that computes Qualimap-style RNA-Seq QC metrics during the existing single-pass BAM processing and writes Qualimap-compatible outputs (e.g.rnaseq_qc_results.txtplus coverage profile TSVs).Wires the Qualimap accumulator/index into
dupradar::counting::count_readsandmain.rs, replaces the prior genebody-based Qualimap text output with the newqualimap::output::write_qualimap_results, and introducesqualimapconfiguration (qualimap.enabled, default true) including passing the GTF path through for report metadata.Updates/extends benchmark artifacts (small and large) with Qualimap reference outputs and adds a vendored HTML report + static assets under
benchmark/qualimap/large;.gitignorenow ignorestests/qualimap_ref/.Written by Cursor Bugbot for commit b142f24. This will update automatically on new commits. Configure here.