Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
25 commits
Select commit Hold shift + click to select a range
cf4357c
Add Qualimap RNA-Seq QC module: index, accumulator, and coverage core…
ewels Feb 16, 2026
b6533db
Add Qualimap rnaseq benchmark output for large dataset
ewels Feb 16, 2026
cf56eb7
Add Qualimap output writer and fix accumulator counting (Phase 2)
ewels Feb 17, 2026
5446c7d
Improve coverage tracking: per-block overlap model, per-read accumula…
ewels Feb 17, 2026
03af466
Fix 212-read exonic/no-feature gap with merged exon tree
ewels Feb 17, 2026
4da391a
Clean up coverage profile binning and add diagnostic logging
ewels Feb 17, 2026
cd70d58
Fix junction count: count all CIGAR N-ops for reads_at_junctions
ewels Feb 17, 2026
daea1e9
Fix merge_intervals to not merge abutting exons and add coverage binn…
ewels Feb 17, 2026
9c861cd
Fix overlapping_exon count: match Qualimap exactly (5,748)
ewels Feb 18, 2026
9073f71
Fix SSP estimation: count all fragments, not just exonic, to match Qu…
ewels Feb 18, 2026
b142f24
Fix coverage: move to after multi-mapper skip, restrict to enclosed b…
ewels Feb 18, 2026
dd1f24b
Update Qualimap benchmark references for large dataset
ewels Mar 3, 2026
9f42109
Fix Qualimap counting: per-mate strand-specific assignment, per-block…
ewels Mar 3, 2026
cfe9501
Fix 5'/3' coverage bias: replicate Qualimap exon ordering and Picard …
ewels Mar 3, 2026
fc3e159
Merge remote-tracking branch 'origin/main' into qualimap
ewels Mar 3, 2026
0570c46
Optimize qualimap: cache COITree queries, reuse mate data, avoid cove…
ewels Mar 3, 2026
e122246
Merge branch 'main' into qualimap
ewels Mar 3, 2026
52cdeec
Clean up qualimap module: fix cross-chrom PE bug, remove dead code, r…
ewels Mar 3, 2026
33d608b
Fix PR review: merge merged_gene_coverage in parallel mode, replace e…
ewels Mar 3, 2026
83b6367
Update benchmark outputs and document Qualimap functionality
ewels Mar 4, 2026
1e2792d
Add standalone Qualimap documentation pages
ewels Mar 4, 2026
2c0ece4
Note that RustQC skips the samtools name-sort step required by Qualim…
ewels Mar 4, 2026
d61d3b7
Update Qualimap benchmark outputs with matched parameters (stranded, …
ewels Mar 4, 2026
623933e
Fix MDX build: escape < in qualimap benchmark page
ewels Mar 4, 2026
56c0ce1
Add samtools name-sort step and Qualimap to combined benchmark page
ewels Mar 4, 2026
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -56,3 +56,4 @@ __pycache__/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/
tests/qualimap_ref/
22 changes: 15 additions & 7 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -62,9 +62,9 @@ All RSeQC tools run by default when annotation is provided via `--gtf` or `--bed

| Tool | Upstream equivalent | Description |
|------|-------------------|-------------|
| Qualimap rnaseq | [Qualimap](http://qualimap.conesalab.org/) rnaseq | Full RNA-Seq QC: gene body coverage profiling (5'→3'), 5'/3' bias metrics, read origin classification (exonic/intronic/intergenic), strand-specificity estimation, and junction analysis. Qualimap-compatible output parseable by MultiQC. |
| preseq | [`preseq lc_extrap`](http://smithlabresearch.org/software/preseq/) | Library complexity extrapolation -- estimates distinct molecules at increasing sequencing depths |
| TIN | [`tin.py`](https://rseqc.sourceforge.net/#tin-py) | Transcript Integrity Number -- measures transcript degradation via coverage uniformity |
| Gene body coverage | [Qualimap](http://qualimap.conesalab.org/) rnaseq | Coverage profile along gene bodies (5'→3'), Qualimap-compatible output for MultiQC |
| flagstat | [`samtools flagstat`](http://www.htslib.org/) | Alignment flag summary statistics, identical format to samtools |
| idxstats | [`samtools idxstats`](http://www.htslib.org/) | Per-chromosome read counts, identical format to samtools |
| stats | [`samtools stats`](http://www.htslib.org/) | Summary Numbers (SN) section, MultiQC-compatible format |
Expand Down Expand Up @@ -358,12 +358,16 @@ RSeQC outputs are organized into per-tool subdirectories under `rseqc/`.
| `samtools/sample.idxstats` | samtools idxstats-compatible per-chromosome read counts |
| `samtools/sample.stats` | samtools stats SN-section compatible summary numbers (MultiQC-parseable) |

### Gene body coverage outputs (`qualimap/`)
### Qualimap outputs (`qualimap/`)

Reimplements [Qualimap](http://qualimap.conesalab.org/) rnaseq mode. The output files are written in the same format and directory structure as Qualimap, making them directly parseable by [MultiQC](https://multiqc.info/). Qualimap analyses only run in GTF mode (skipped with `--bed`).

| File | Description |
|------|-------------|
| `qualimap/coverage_profile_along_genes_(total).txt` | Gene body coverage profile (100 percentile bins, 5'→3') |
| `qualimap/rnaseq_qc_results.txt` | Qualimap rnaseq-compatible QC results (MultiQC-parseable) |
| `qualimap/rnaseq_qc_results.txt` | Qualimap rnaseq-compatible QC report: alignment statistics, read origin (exonic/intronic/intergenic/overlapping exon), 5'/3' coverage bias, strand specificity, and junction analysis. MultiQC-parseable. |
| `qualimap/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt` | Gene body coverage profile across all genes (100 percentile bins, 5'→3') |
| `qualimap/raw_data_qualimapReport/coverage_profile_along_genes_(high).txt` | Gene body coverage profile for highly-expressed genes (top 500 by mean coverage) |
| `qualimap/raw_data_qualimapReport/coverage_profile_along_genes_(low).txt` | Gene body coverage profile for lowly-expressed genes (bottom 500 by mean coverage) |

### Duplication matrix columns

Expand Down Expand Up @@ -418,16 +422,18 @@ Note: PGO profiles are machine-specific and `target-cpu=native` produces non-por

### `rustqc rna`

1. **GTF parsing**: Reads gene annotations (plain or gzip-compressed), computes effective gene lengths from non-overlapping exon bases, and extracts additional attributes (e.g., biotype) for downstream grouping.
1. **GTF parsing**: Reads gene annotations (plain or gzip-compressed), computes effective gene lengths from non-overlapping exon bases, and extracts additional attributes (e.g., biotype) for downstream grouping. Builds interval trees for merged exon regions used by the Qualimap module.
2. **Read counting**: Reads the alignment file (SAM/BAM/CRAM) once, assigning each read to a gene based on exon overlap. Four count modes are tracked simultaneously:
- With/without multimappers x with/without duplicates
- Assignment statistics (assigned, ambiguous, no features) are tracked for the featureCounts summary.
- Qualimap accumulator runs in the same pass: per-transcript coverage, gene body profiling, read origin classification (exonic/intronic/intergenic), overlapping-exon detection, strand-specificity estimation, and splice junction motif counting.
3. **featureCounts output**: Writes gene-level counts and summary statistics in the standard featureCounts format.
4. **Biotype counting**: Aggregates assigned read counts by a configurable GTF attribute (e.g., `gene_biotype`), producing biotype count tables and MultiQC-compatible rRNA QC metrics.
5. **Duplication matrix**: Computes RPK, RPKM, and duplication rates for each gene in all four modes.
6. **Logistic regression**: Fits a binomial GLM (`dupRate ~ log10(RPK)`) using iteratively reweighted least squares (IRLS) to model the relationship between expression and duplication.
7. **Plots**: Generates density scatter, boxplot, and histogram visualizations.
8. **MultiQC integration**: Outputs files compatible with [MultiQC](https://multiqc.info/) for pipeline reporting (dupRadar intercept, fit curve, biotype bargraph, rRNA percentage).
7. **Qualimap output**: Computes gene body coverage profiles (total, high-expression, low-expression tiers), 5'/3' bias metrics, and produces a Qualimap-compatible `rnaseq_qc_results.txt` report with read origin counts, mapping statistics, and strand specificity.
8. **Plots**: Generates density scatter, boxplot, and histogram visualizations.
9. **MultiQC integration**: Outputs files compatible with [MultiQC](https://multiqc.info/) for pipeline reporting (dupRadar intercept, fit curve, biotype bargraph, rRNA percentage, Qualimap rnaseq reports).

### RSeQC tools

Expand All @@ -454,6 +460,8 @@ Each RSeQC tool is a single-pass BAM reader that processes the alignment file an
- Original dupRadar R package: https://github.com/ssayols/dupRadar
- Wang L, Wang S, Li W (2012). RSeQC: quality control of RNA-seq experiments. *Bioinformatics*, 28(16), 2184-2185. doi:10.1093/bioinformatics/bts356
- RSeQC: https://rseqc.sourceforge.net/
- García-Alcalde F, Okonechnikov K, Carbonell J, et al. (2012). Qualimap: evaluating next-generation sequencing alignment data. *Bioinformatics*, 28(20), 2678-2679. doi:10.1093/bioinformatics/bts503
- Qualimap: http://qualimap.conesalab.org/

## License

Expand Down
43 changes: 34 additions & 9 deletions benchmark/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -355,32 +355,57 @@ docker run --rm --platform linux/amd64 \

### 9. Generate Qualimap reference outputs

Reference outputs for validating the gene body coverage reimplementation are
stored in `qualimap/small/` and `qualimap/large/`. These were generated with
Reference outputs for validating the Qualimap reimplementation are stored in
`qualimap/small/` and `qualimap/large/`. These were generated with
[Qualimap 2.3](http://qualimap.conesalab.org/) run via Docker.

> **Note:** Qualimap Java requires a **name-sorted BAM** (`samtools sort -n`).
> The `-p strand-specific-reverse` and `-s` flags match the library preparation
> of the GM12878 dataset. Both tools must use the same GTF and strandedness
> settings for results to be comparable.

```bash
QUALIMAP_IMG="quay.io/biocontainers/qualimap:2.3--hdfd78af_0"

# Small (requires uncompressed GTF)
# Small (name-sort BAM first, requires uncompressed GTF)
gunzip -k benchmark/input/small/chr6.gtf.gz
samtools sort -n -o benchmark/input/small/test.namesorted.bam benchmark/input/small/test.bam
docker run --rm -v $(pwd)/benchmark:/data \
$QUALIMAP_IMG qualimap rnaseq \
-bam /data/input/small/test.bam -gtf /data/input/small/chr6.gtf \
-outdir /data/qualimap_tmp_small --java-mem-size=4G -pe
-bam /data/input/small/test.namesorted.bam -gtf /data/input/small/chr6.gtf \
-outdir /data/qualimap_tmp_small --java-mem-size=4G \
-p strand-specific-reverse -pe -s
cp benchmark/qualimap_tmp_small/rnaseq_qc_results.txt benchmark/qualimap/small/
cp "benchmark/qualimap_tmp_small/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt" benchmark/qualimap/small/

# Large
gunzip -k benchmark/input/large/genes.gtf.gz
# Large (requires name-sorted BAM and filtered GTF)
docker run --rm -v $(pwd)/benchmark:/data \
$QUALIMAP_IMG qualimap rnaseq \
-bam /data/input/large/GM12878_REP1.markdup.sorted.bam -gtf /data/input/large/genes.gtf \
-outdir /data/qualimap_tmp_large --java-mem-size=8G -pe
-bam /data/input/large/GM12878_REP1.namesorted.bam \
-gtf /data/input/large/genes.filtered.gtf \
-outdir /data/qualimap_tmp_large --java-mem-size=8G \
-p strand-specific-reverse -pe -s
cp benchmark/qualimap_tmp_large/rnaseq_qc_results.txt benchmark/qualimap/large/
cp "benchmark/qualimap_tmp_large/raw_data_qualimapReport/coverage_profile_along_genes_(total).txt" benchmark/qualimap/large/
```

To generate matching RustQC Qualimap outputs for comparison, run with the same
strandedness and GTF as the Java reference:

```bash
# Small (coord-sorted BAM, same GTF and strandedness as Java run)
cargo run --release -- rna benchmark/input/small/test.bam \
--gtf benchmark/input/small/chr6.gtf \
--paired --stranded 2 --skip-dup-check \
--outdir benchmark/RustQC/small

# Large (coord-sorted BAM, filtered GTF, same strandedness)
cargo run --release -- rna benchmark/input/large/GM12878_REP1.markdup.sorted.bam \
--gtf benchmark/input/large/genes.filtered.gtf \
--paired --stranded 2 --biotype-attribute gene_biotype --threads 10 \
--outdir benchmark/RustQC/large
```

### 10. Compare results

```bash
Expand Down
Loading
Loading