Simplified, more flexible annotation #166

iraiosub · 2025-04-09T14:13:34Z

Simplified, more flexible annotation:

closes:
- Address complicated annotation generation #119
- make filtering the GTF and downstream steps optional #140
- When transcript filtering is applied to GTF, the selected transcripts should also be used for transcriptome analyses #141
- rename longest_transcript-related params and outputs more meaningfully #156
- document the annotation options and outputs #163
bug fixes:
- Crosslinks and peaks are not produced for all samples in the samplesheet #167
- iCount summaries error #168

What this PR means for usage:

users can choose to filter the GTF (--skip_filter_gtf false) so they can prioritise a single representative transcript per gene when regions are assigned across the genome
users can now provide their own list of representative transcripts (--representative_transcript) to be prioritised for region assignment and for transcriptome analyses
more intuitive outputs in the 00_genome folder (only the used GTFs and segmentation files are saved)

Module changes:

replaced FIND_LONGEST_TRANSCRIPT and CLIPSEQ_FILTER_GTF modules with → FILTER_GTF_BY_TRANSCRIPT module.
FILTER_GTF_BY_TRANSCRIPT does the following:
- filters the GTF provided with --gtf to only keep features of representative transcripts.
- representative transcripts can be user-provided with --representative_transcript (which now replaces --longest_transcript); if none are provided, the module auto-selects the longest transcript per gene (instead of the former basic tag and TSL-based filtering)
- validates the GTF format (must contain valid gene, transcript and exon features)
- validates transcripts provided with --representative_transcript: all genes in the GTF must have exactly one representative transcript
- provides clear errors and troubleshooting tips on failure
- produces transcriptome fai and GTF (necessary if transcriptome enabled: --skip_transcriptome false)
- exports a filtered GTF only if --skip_filter_gtf is false
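The selection and validation steps listed above can be sketched roughly as follows. This is only an illustration, not the module's actual script: the function names are hypothetical, and summed exon length is an assumed criterion (the commit log indicates the real script also takes unspliced transcript length into account).

```python
import re
from collections import defaultdict

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def exon_lengths(gtf_lines):
    """Sum exon lengths per transcript; return (length_by_tx, gene_by_tx)."""
    length_by_tx = defaultdict(int)
    gene_by_tx = {}
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            raise ValueError(f"Expected 9 tab-separated columns: {line!r}")
        if cols[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(cols[8]))
        tx, gene = attrs["transcript_id"], attrs["gene_id"]
        gene_by_tx[tx] = gene
        # GTF coordinates are 1-based and inclusive.
        length_by_tx[tx] += int(cols[4]) - int(cols[3]) + 1
    return dict(length_by_tx), gene_by_tx


def longest_per_gene(length_by_tx, gene_by_tx):
    """Pick one transcript per gene; ties break on transcript ID order."""
    best = {}
    for tx in sorted(length_by_tx):
        gene = gene_by_tx[tx]
        if gene not in best or length_by_tx[tx] > length_by_tx[best[gene]]:
            best[gene] = tx
    return best


def check_one_per_gene(gene_by_tx, representatives):
    """Raise ValueError unless every gene has exactly one representative."""
    counts = {gene: 0 for gene in gene_by_tx.values()}
    for tx in representatives:
        if tx not in gene_by_tx:
            raise ValueError(f"Transcript {tx} is not in the GTF")
        counts[gene_by_tx[tx]] += 1
    bad = sorted(gene for gene, n in counts.items() if n != 1)
    if bad:
        raise ValueError(f"Genes without exactly one representative: {bad}")
```

The same check applies whether the list comes from the user or from the automatic selection, which is why it is kept separate from the selection step in this sketch.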
updated CLIPSEQ_RESOLVE_UNANNOTATED template script with important checks, including error messages if unannotated regions remain even with the full GTF; if the fai, regions, or annotation are mismatched; or if the regions don’t span the genome properly, among others.
both FILTER_GTF_BY_TRANSCRIPT and CLIPSEQ_RESOLVE_UNANNOTATED are now emitting detailed logs
curated saved outputs: in modules.config added conditional publishDir directives to only save the GTF files used downstream in the pipeline outside of PREPARE_GENOME. Only final GTF and regions files (those actually used) are published in 00_genome results.
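To give a feel for the "regions don't span the genome properly" and "fai and regions are mismatched" checks mentioned for CLIPSEQ_RESOLVE_UNANNOTATED, here is a minimal sketch. It is not the template script itself: the function names are hypothetical, intervals are assumed to be 1-based inclusive (chrom, start, end) tuples, and strand is ignored for simplicity.

```python
def read_fai(lines):
    """Parse a FASTA index: the first two tab-separated columns are name and length."""
    lengths = {}
    for line in lines:
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def find_gaps(chrom_lengths, intervals):
    """Return uncovered (chrom, start, end) stretches; raise on fai/regions mismatch."""
    by_chrom = {chrom: [] for chrom in chrom_lengths}
    for chrom, start, end in intervals:
        if chrom not in by_chrom:
            raise ValueError(f"Region on {chrom}, which is absent from the fai")
        if not 1 <= start <= end <= chrom_lengths[chrom]:
            raise ValueError(f"Region {chrom}:{start}-{end} is out of bounds")
        by_chrom[chrom].append((start, end))
    gaps = []
    for chrom, spans in by_chrom.items():
        pos = 1  # next position not yet covered
        for start, end in sorted(spans):
            if start > pos:
                gaps.append((chrom, pos, start - 1))
            pos = max(pos, end + 1)
        if pos <= chrom_lengths[chrom]:
            gaps.append((chrom, pos, chrom_lengths[chrom]))
    return gaps
```

An empty result means every chromosome in the fai is fully covered; anything else would be a candidate for a clear error message of the kind described above.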

PREPARE_GENOME subworkflow changes:

removed unnecessary execution of CLIPSEQ_RESOLVE_UNANNOTATED on transcript segmentation, it only needs to be done on regions file
removed unnecessary input and output channels
added boolean value input channels for --skip_filter_gtf and --skip_transcriptome so FILTER_GTF_BY_TRANSCRIPT is only executed depending on these parameters
added logic to control FILTER_GTF_BY_TRANSCRIPT execution (this module replaces FIND_LONGEST_TRANSCRIPT).

Parameter changes:

new: --skip_filter_gtf (false by default)
renamed:
- --representative_transcript → replaces --longest_transcript
- --representative_transcript_fai → replaces --longest_transcript_fai
- --representative_transcript_gtf → replaces --longest_transcript_gtf
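For anyone keeping run parameters in a file (for example one passed with Nextflow's -params-file), the renames above amount to a key substitution. The helper below is a hypothetical convenience sketch, not something shipped with this PR; it assumes the parameters have already been loaded into a plain dict.

```python
# Old parameter name -> new parameter name, as listed in this PR.
RENAMED = {
    "longest_transcript": "representative_transcript",
    "longest_transcript_fai": "representative_transcript_fai",
    "longest_transcript_gtf": "representative_transcript_gtf",
}


def migrate_params(params):
    """Return a copy of params with the renamed keys applied."""
    migrated = {}
    for key, value in params.items():
        new_key = RENAMED.get(key, key)
        if new_key in migrated:
            raise ValueError(f"Both old and new names given for {new_key}")
        migrated[new_key] = value
    # --skip_filter_gtf is new and false by default.
    migrated.setdefault("skip_filter_gtf", False)
    return migrated
```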

Main CLIPSEQ workflow changes:

added logic to set the regions and GTF files used downstream (see ch_regions_used and ch_gtf_used)
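How --skip_filter_gtf and --skip_transcriptome interact can be summarised with a small decision function. This is a sketch of the behaviour as described in this PR, not the Nextflow code: the function and key names are hypothetical, and the run condition (run the module when either the filtered GTF or the transcriptome files are needed) is an inference from the module description.

```python
def annotation_plan(skip_filter_gtf, skip_transcriptome):
    """Summarise what happens for a given combination of the two flags."""
    return {
        # The module is needed for the filtered GTF and/or the transcriptome fai and GTF.
        "run_filter_gtf_by_transcript": (not skip_filter_gtf) or (not skip_transcriptome),
        # The filtered GTF is only exported when filtering is enabled.
        "export_filtered_gtf": not skip_filter_gtf,
        # Analogous to ch_gtf_used: which GTF downstream steps receive.
        "gtf_used": "original" if skip_filter_gtf else "filtered",
    }
```

With both flags set to true the module would not need to run at all, and downstream steps would use the GTF supplied with --gtf.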

Test configuration changes:

test_full.config:
- switched to Ensembl-hosted FASTA/GTF (downloaded at runtime)
- renamed UMI related params to match those in nextflow.config and enabled UMI extraction (so now UMI collapse doesn't fail)
- test_full runs further, but still fails during consensus peak analysis steps.
test.config:
- renamed parameters and removed unused ones

Documentation changes:

added a "Note on annotation" section, explaining the iCount-Mini segmentation, rationale for GTF filtering prior to segmentation and how it works, what outputs are expected in the results folder.

PR checklist


iraiosub · 2025-04-09T14:24:03Z

One small note: apart from PREPARE_GENOME, the only place where the filtered GTF is used (if filtering is enabled) is clippy. We are unsure whether there is any benefit to using the filtered GTF for clippy.
In terms of pipeline simplicity, using the unfiltered GTF for clippy would make things conceptually simpler: we could then safely say that the filtered GTF is only used to better assign regions when preparing the genome. The genome prep subworkflow could then be reduced even further to emit fewer files. But this needs testing.

This fix was implemented in goodwrigth/clipseq 9dfeafe to tweak the channel creation for the MERGE_SUMMARY process. In its previous iteration, the premapped cDNA crosslink files could be randomly paired with other iCount summary files, producing erroneous "premapadjusted" iCount summary files.

fix merging of iCount summaries with premapped crosslinks
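The general principle behind this kind of fix is to pair files by sample ID rather than by arrival order. The sketch below illustrates that idea only; it is not the channel logic from the commit, the function name is hypothetical, and inputs are assumed to be lists of (sample_id, path) tuples.

```python
def pair_by_sample(summaries, premapped):
    """Join two lists of (sample_id, path) on sample_id, never on position."""
    premapped_by_id = dict(premapped)
    pairs = []
    for sample_id, summary in summaries:
        if sample_id not in premapped_by_id:
            raise KeyError(f"No premapped crosslinks for sample {sample_id}")
        pairs.append((sample_id, summary, premapped_by_id[sample_id]))
    return pairs
```

Because the lookup is keyed, the result is the same regardless of the order in which the two inputs arrive, which a positional pairing cannot guarantee.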

kkuret and others added 30 commits March 24, 2025 12:10

changing filtering to ensembl canonical intial commit

b62c0ac

changes to imports etc

a5974d2

canonical filtering testing ok

3a3b824

more changes to filtering by canonical

5773bdf

add new ref filtering test for human

43c2b9e

comment on output name

546c348

added human test data

07bbfa2

use premade bowtie index for testing

042f16b

comment out unnecessary gtf channels

01ec151

Modify resolve unannotated to remove genic other option

64bcd5a

modify main.nf inputs and prepare genome subworkflow

3f05a38

renamed CLIPSEQ_RESOLVE_UNANNOTATED inputs

1519c6b

Merge remote-tracking branch 'origin/feat-2-0' into feat-2-0-canonical

47e0a9c

Merge remote-tracking branch 'origin/feat-2-0' into feat-2-0-canonical

fee5f80

latest changes feat-2-0

Merge remote-tracking branch 'origin/feat-2-0' into feat-2-0-canonical

b81ffd3

fixes cadinality issue CLIPSEQ_RESOLVE_UNANNOTATED

d984ac2

filtergtf outputting tx gene pairs

0049a4b

Edited find longest transcript to also filter gtf and output all longest transcript files.

afe7534

changes to longest tx selection

42be21b

Fix file saving options

7740e49

fixed newline

6115fa4

correct excessive quoting and edit attributes

fa2b3cb

Edit the longest transcript script to accept user provided transcripts

484ad07

incorporated new tx selection into prep genome subworkflow

412942d

some more ref test data

f4f5f01

added ensembl 103 test ref

a33d0d4

added logging to the transcript selection and filtering script

0e1c646

Merge remote-tracking branch 'origin/feat-2-0' into feat-2-0-canonical

4b81518

raising error when not exactly 1 tx per gene

fdcb29c

Addditional checkes and added unspliced transcript length into transcript selection criteria.

267adbf

iraiosub and others added 16 commits April 7, 2025 14:32

undo resource changes for bedtools sort

3708dca

added missing negation

5721d99

removed unnecessary input files from test profile

6c4cba8

Reverted test config to original

7084539

Additional validation that no unexpected region types formed during resolve (like intron,CDS)

b313822

unified logging

9d81833

trailing whitespaces removed

c084bbc

Fixed logging, printing of a set

62d744e

Fixed logging

9cc5de3

Merge pull request #1 from iraiosub/feat-2-0-klara

93d5f85

RESOLVE_UNANNOTATED process: added validation and logging

small changes to docs

80193c1

docs restructureing

c96d71f

typos

11ae020

added outputs documentation

a0e4fb5

docs update

9c1c7e2

more renaming

b53ef83

iraiosub added the documentation and enhancement labels Apr 9, 2025

iraiosub assigned iraiosub and kkuret Apr 9, 2025

iraiosub mentioned this pull request Apr 9, 2025

updated reference paths and UMI param names in test_full.config #165

Closed

11 tasks

iraiosub requested review from CharlotteAnne and amchakra April 9, 2025 14:14

Use ternary operator to create empty ch when optional inputs are undefined; remove unused profiles

106d3b2

kkuret mentioned this pull request May 18, 2025

Crosslinks and peaks are not produced for all samples in the samplesheet #167

Open

iraiosub and others added 4 commits June 9, 2025 17:22

added collect to deal with queue ref channels

f8b1d6b

Merge pull request #2 from iraiosub/feat-collect-fix

f40f034

added collect to deal with queue ref channels

Merge pull request #3 from iraiosub/fix-premap-summary

59ef35c

fix merging of iCount summaries with premapped crosslinks
