@iraiosub iraiosub commented Apr 9, 2025

Simplified, more flexible annotation:

What this PR means for usage:

  • users can choose to filter the GTF (--skip_filter_gtf false) so they can prioritise a single representative transcript per gene when regions are assigned across the genome
  • users can now provide their own list of representative transcripts (--representative_transcript) to be prioritised for region assignment and for transcriptome analyses
  • more intuitive outputs in the 00_genome folder (only the used GTFs and segmentation files are saved)
  1. Module changes:
  • replaced the FIND_LONGEST_TRANSCRIPT and CLIPSEQ_FILTER_GTF modules with a single FILTER_GTF_BY_TRANSCRIPT module.

  • FILTER_GTF_BY_TRANSCRIPT does the following:

    • filters the GTF provided with --gtf to only keep features of representative transcripts.
    • representative transcripts can be provided by the user with --representative_transcript (which now replaces --longest_transcript); if none are provided, the module auto-selects the longest transcript per gene (instead of the former basic tag and TSL-based filtering)
    • validates the GTF format (must contain valid gene, transcript and exon features)
    • validates transcripts provided with --representative_transcript: all genes in the GTF must have exactly one representative transcript
    • provides clear errors and troubleshooting tips on failure
    • produces the transcriptome fai and GTF (required when the transcriptome is enabled: --skip_transcriptome false)
    • exports a filtered GTF only if --skip_filter_gtf is false
  • updated the CLIPSEQ_RESOLVE_UNANNOTATED template script with important checks, including error messages if unannotated regions remain even with the full GTF; if the fai, regions, or annotation are mismatched; or if the regions don't span the genome properly.

  • both FILTER_GTF_BY_TRANSCRIPT and CLIPSEQ_RESOLVE_UNANNOTATED now emit detailed logs

  • curated saved outputs: in modules.config added conditional publishDir directives to only save the GTF files used downstream in the pipeline outside of PREPARE_GENOME. Only final GTF and regions files (those actually used) are published in 00_genome results.
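
For reviewers, a minimal Python sketch of the selection and validation rules listed above for FILTER_GTF_BY_TRANSCRIPT. It is not the module's actual script. The function names are hypothetical, and "longest" is assumed here to mean the largest summed exon length, with ties broken by transcript ID; the module may define both differently.

```python
import re
from collections import defaultdict

# Matches GTF attribute pairs such as: gene_id "G1";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_exons(gtf_lines):
    """Yield (gene_id, transcript_id, exon_length) for every exon line."""
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "exon":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        # GTF coordinates are 1-based and inclusive.
        length = int(fields[4]) - int(fields[3]) + 1
        yield attrs["gene_id"], attrs["transcript_id"], length


def transcript_lengths(gtf_lines):
    """Map gene_id -> {transcript_id: summed exon length}."""
    lengths = defaultdict(lambda: defaultdict(int))
    for gene, transcript, length in parse_exons(gtf_lines):
        lengths[gene][transcript] += length
    return lengths


def longest_per_gene(lengths):
    """Pick one transcript per gene (assumption: longest summed exon length)."""
    # Sorting first makes ties resolve to the smallest transcript ID.
    return {gene: max(sorted(txs), key=txs.get) for gene, txs in lengths.items()}


def validate_representatives(lengths, chosen):
    """Return one error message per gene lacking exactly one representative."""
    chosen = set(chosen)
    errors = []
    for gene, txs in sorted(lengths.items()):
        hits = chosen & set(txs)
        if len(hits) != 1:
            errors.append(
                f"{gene}: expected exactly 1 representative transcript, found {len(hits)}"
            )
    return errors


# Toy annotation (hypothetical IDs), tab-separated as in a real GTF.
EXAMPLE_GTF = [
    'chr1\tsrc\texon\t1\t100\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t201\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    'chr1\tsrc\texon\t1\t150\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    'chr1\tsrc\texon\t500\t600\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]
```

With the toy annotation, `longest_per_gene(transcript_lengths(EXAMPLE_GTF))` selects T1 for G1 and T3 for G2, while `validate_representatives(transcript_lengths(EXAMPLE_GTF), ["T1"])` reports G2 as having no representative transcript.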

  1. PREPARE_GENOME subworkflow changes:
  • removed the unnecessary execution of CLIPSEQ_RESOLVE_UNANNOTATED on the transcript segmentation; it only needs to run on the regions file
  • removed unnecessary input and output channels
  • added boolean value input channels for --skip_filter_gtf and --skip_transcriptome so that FILTER_GTF_BY_TRANSCRIPT is executed only when these parameters require it
  • added logic to control FIND_LONGEST_TRANSCRIPT execution.
  1. Parameter changes:
  • new: --skip_filter_gtf (false by default)
  • renamed:
    • --representative_transcript (replaces --longest_transcript)
    • --representative_transcript_fai (replaces --longest_transcript_fai)
    • --representative_transcript_gtf (replaces --longest_transcript_gtf)
  1. Main CLIPSEQ workflow changes:
  • added logic to set the regions and GTF files used downstream (see ch_regions_used and ch_gtf_used)
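
A rough Python sketch of how the two skip flags are understood to interact, for illustration only. The real logic lives in the Nextflow workflow, and `AnnotationParams`, `should_run_filter_module` and `select_used` are hypothetical names. The run condition is an assumption inferred from the module description (it produces the filtered GTF when filtering is enabled, and the transcriptome fai/GTF when the transcriptome is enabled).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnnotationParams:
    # Defaults mirror the PR description for --skip_filter_gtf;
    # the default for --skip_transcriptome is an assumption.
    skip_filter_gtf: bool = False
    skip_transcriptome: bool = False


def should_run_filter_module(params: AnnotationParams) -> bool:
    """Assumed rule: run if either of the module's outputs is needed."""
    return (not params.skip_filter_gtf) or (not params.skip_transcriptome)


def select_used(params: AnnotationParams, full, filtered):
    """Pick the file used downstream (stand-in for ch_gtf_used / ch_regions_used)."""
    return full if params.skip_filter_gtf else filtered
```

For example, with the defaults `select_used(AnnotationParams(), "full.gtf", "filtered.gtf")` returns the filtered file, and with `skip_filter_gtf=True` it returns the full one.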
  1. Test configuration changes:
  • test_full.config:
    • switched to Ensembl-hosted FASTA/GTF (downloaded at runtime)
    • renamed UMI-related params to match those in nextflow.config and enabled UMI extraction (so UMI collapse no longer fails)
    • test_full runs further, but still fails during consensus peak analysis steps.
  • test.config:
    • renamed parameters and removed unused ones
  1. Documentation changes:
  • added a "Note on annotation" section explaining the iCount-Mini segmentation, the rationale for GTF filtering prior to segmentation and how it works, and what outputs are expected in the results folder.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
    • If you've added a new tool - add to the software_versions process and a regex to scrape_software_versions.py
    • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
    • If necessary, also make a PR on the nf-core/clipseq branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core lint .).
  • Ensure the test suite passes (nextflow run . -profile test,docker).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

kkuret and others added 30 commits March 24, 2025 12:10
@iraiosub iraiosub added documentation Improvements or additions to documentation enhancement Improvement for existing functionality labels Apr 9, 2025

iraiosub commented Apr 9, 2025

One small note: apart from PREPARE_GENOME, the only place where the filtered GTF (if filtering is enabled) is used is clippy. We are unsure whether using the filtered GTF offers any benefit for clippy.
In terms of pipeline simplicity, using the unfiltered GTF for clippy would be conceptually simpler: we could then safely say that the filtered GTF is only used to better assign regions when preparing the genome. The genome prep subworkflow could then be reduced even further to emit fewer files. But this needs testing.

iraiosub and others added 4 commits June 9, 2025 17:22
added collect to deal with queue ref channels
This fix was implemented in goodwright/clipseq 9dfeafe to tweak the channel creation for the MERGE_SUMMARY process. In its previous iteration, the premapped cDNA crosslink files could be randomly paired with other iCount summary files to produce erroneous "premapadjusted" iCount summary files
fix merging of iCount summaries with premapped crosslinks
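
To illustrate the class of bug described in the commit message above, here is a small Python sketch of pairing files by sample ID rather than by arrival order. It is an analogy only, not the Nextflow channel code from the fix; `pair_by_sample` and the file names are hypothetical.

```python
def pair_by_sample(summaries, crosslinks):
    """Pair each summary with the crosslink file of the same sample.

    Both inputs are lists of (sample_id, path) tuples in arbitrary order,
    so pairing by position (e.g. with zip) could mix up samples.
    """
    crosslinks_by_id = dict(crosslinks)
    missing = [sid for sid, _ in summaries if sid not in crosslinks_by_id]
    if missing:
        raise KeyError(f"no premapped crosslinks for samples: {missing}")
    return [(sid, path, crosslinks_by_id[sid]) for sid, path in summaries]


# The two lists arrive in different orders, as unordered channels may.
summaries = [("s1", "s1.summary.tsv"), ("s2", "s2.summary.tsv")]
crosslinks = [("s2", "s2.premapped.bed"), ("s1", "s1.premapped.bed")]
paired = pair_by_sample(summaries, crosslinks)
```

Here `paired` keeps s1 with `s1.premapped.bed` and s2 with `s2.premapped.bed`, whereas a positional `zip` of the two lists would have crossed them.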