nf-core · iraiosub · Mar 24, 2025
diff --git a/.gitignore b/.gitignore
@@ -1,7 +1,7 @@
 .nextflow*
 work/
 data/
-results/
+results*/
 .DS_Store
 testing/
 testing*

diff --git a/README.md b/README.md
@@ -99,6 +99,76 @@ For more details about the output files and reports, please refer to the
 
 The pipeline currently does not support paired-end reads, as in our experience alignment using both reads when available doesn't improve analysis of CLIP data. When receiving CLIP data sequenced paired-end, we recommend running the pipeline with the read containing the crosslink and ensuring the crosslink_position parameter is set appropriately. If you have evidence to the contrary please do get in touch and let us know, or if you are working on a new variant protocol where paired-end alignment is important please do reach out.
 
+## A note on annotation
+
+In the current implementation, certain tools for peak-calling, motif discovery and analysis of crosslink distribution around landmarks or transcript regions (Clippy, PEKA, iCount summary and iCount RNA-maps) rely on GTF files generated by the [iCount-Mini](https://github.com/ulelab/iCount-Mini) segment [script](https://github.com/ulelab/iCount-Mini/blob/main/iCount/genomes/segment.py).
+
+Segmentation divides the genome into regions such as CDS, UTR5, UTR3, ncRNA, introns, and intergenic, at:
+- the **transcript level** (`*seg.gtf`): each transcript is divided into non-overlapping segments (e.g., CDS, UTRs, introns). Segments can overlap across transcripts or genes.
+- the **genome level** (`*regions.gtf`): the genome is partitioned into non-overlapping regions. Each position is assigned to exactly one region based on iCount's priority: `CDS > UTR3 > UTR5 > ncRNA > intron > intergenic`
+
+See the [iCount segment documentation](https://icount.readthedocs.io/en/latest/_modules/iCount/genomes/segment.html) for details.
+
+> **Warning:**
+> iCount-Mini only supports **Ensembl** or **GENCODE-style** annotations.
+
+### GTF filtering for iCount segmentation
+
+Pre-filtering the annotation can improve iCount genome-level segmentation by:
+- Prioritizing one representative transcript per gene
+- Reducing conflicts in genomic region assignments caused by overlapping isoforms
+
+This can improve the biological interpretability of region assignments, especially at the genome level.
+
+GTF filtering is enabled by default, which is equivalent to setting `--skip_filter_gtf false`. To disable it, set `--skip_filter_gtf true`.
+
+> **Warning:**
+> Filtering requires that your GTF contains valid transcript and exon features for all genes.
+> If your annotation does not meet this requirement, consider disabling filtering with `--skip_filter_gtf true`.
+
+When enabled, the GTF is filtered prior to segmentation to include **one transcript per gene**.
+These representative transcripts can be either a user-defined set of transcripts (`--representative_transcript`) or automatically selected by the pipeline as the longest transcript per gene.
+
+#### Transcript selection:
+- If `--representative_transcript` is provided:
+  - Must be a `.txt` file with **one transcript ID per line**
+  - Must include **exactly one transcript for each gene** in the input GTF (`--gtf`)
+  - Only these transcripts and their associated features will be retained
+- If not provided, the pipeline auto-selects one representative transcript per gene using the hierarchy:
+    1. **CDS length**
+    2. **Exon length**
+    3. **Unspliced (transcript) length**
+    4. Tie-breaker: transcript ID
+
+#### How segmentation uses the filtered GTF:
+
+When filtering is **enabled** (`--skip_filter_gtf false`), the genome is segmented **twice**, once with each of:
+
+1. the **filtered GTF**, to prioritize representative transcripts. Some regions may remain unannotated because the gene-level annotation can extend beyond the boundaries of the representative transcripts.
+2. the **unfiltered GTF** (the original GTF provided via `--gtf`), to ensure full coverage of genes.
+
+Any regions left unannotated after segmentation on the filtered GTF (**1**) are filled in using the unfiltered GTF regions (**2**) during the `RESOLVE_UNANNOTATED` step. This ensures full genome coverage while still prioritizing the set of representative transcripts.
+
+> This way, the final genomic regions (CDS, UTRs, introns, etc.) mostly reflect a single representative transcript per gene, while still ensuring no regions are left unannotated.
+
+Key outputs (filtering enabled):
+ - `*_representative_transcript_filtered.gtf`: Filtered GTF containing only features for the selected representative transcripts.
+ - `*_seg.gtf`: Transcript-wise segmentation (segments) based on the unfiltered GTF.
+ - `*_representative_transcript_filtered_regions.resolved.gtf`: Resolved genome-wise segmentation (regions) based on the filtered GTF, with unannotated parts filled in from the unfiltered regions.
+
+#### When GTF filtering is disabled:
+
+If `--skip_filter_gtf true` is set:
+- Segmentation is run **once**, using the original GTF provided via `--gtf`
+- All transcripts per gene are included
+- Regions (e.g., UTRs, CDS, introns) are assigned by collapsing annotations across all transcripts
+- iCount’s internal rules resolve overlapping features
+- This may result in more complex region annotations for genes with many transcripts
+
+Key outputs (filtering disabled):
+ - `*_seg.gtf`: Transcript-wise segmentation (segments) using the unfiltered GTF.
+ - `*_regions.gtf`: Genome-wise segmentation (regions) from all transcripts in the unfiltered GTF.
+
 ## Credits
 
 nf-core/clipseq was originally written by Charlotte West ([@charlotte-west](https://github.com/charlotte-west)) and Anob Chakrabarti ([@amchakra](https://github.com/amchakra)) from [Luscombe Lab](https://www.crick.ac.uk/research/labs/nicholas-luscombe) at [The Francis Crick Institute](https://www.crick.ac.uk/), London, UK. It started life in April 2020 as a Nextflow DSL2 Luscombe Lab ([@luslab](https://github.com/luslab)) lockdown hackathon day and we thank all the lab members for their early contributions.
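
Note on the README hunk above: the automatic selection hierarchy (CDS length, then exon length, then unspliced length, then transcript ID) can be read as a lexicographic comparison. The following is a minimal Python sketch of that reading, not the pipeline's own code. The `Transcript` record, the `select_representatives` function and the example values are hypothetical, and breaking ties by ascending transcript ID is an assumption, since the README does not state the direction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transcript:
    # Hypothetical record; in practice the lengths would be derived from GTF features.
    gene_id: str
    transcript_id: str
    cds_length: int
    exon_length: int
    transcript_length: int


def _rank(tx):
    # Smaller tuple wins: lengths are negated so that longer values sort first.
    # The transcript ID breaks remaining ties (ascending order is an assumption).
    return (-tx.cds_length, -tx.exon_length, -tx.transcript_length, tx.transcript_id)


def select_representatives(transcripts):
    """Return {gene_id: transcript_id} with one representative per gene."""
    best = {}
    for tx in transcripts:
        current = best.get(tx.gene_id)
        if current is None or _rank(tx) < _rank(current):
            best[tx.gene_id] = tx
    return {gene: tx.transcript_id for gene, tx in best.items()}


transcripts = [
    Transcript("geneA", "tx2", 900, 1500, 4000),
    Transcript("geneA", "tx1", 900, 1500, 4000),  # full tie with tx2, decided by ID
    Transcript("geneA", "tx3", 600, 2000, 9000),  # longer overall, but shorter CDS
    Transcript("geneB", "tx4", 0, 700, 700),      # non-coding: exon length decides
    Transcript("geneB", "tx5", 0, 800, 750),
]
representatives = select_representatives(transcripts)
print(representatives)
```

Under these assumptions `geneA` resolves to `tx1` and `geneB` to `tx5`: CDS length dominates, so the longer but less coding `tx3` is not chosen.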

diff --git a/conf/modules.config b/conf/modules.config
@@ -111,16 +111,7 @@ if(params.run_genome_prep) {
             ]
         }
 
-        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:FIND_LONGEST_TRANSCRIPT' {
-            publishDir = [
-                path: { "${params.outdir}/00_genome" },
-                mode: "${params.publish_dir_mode}",
-                saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-                enabled: params.save_reference
-            ]
-        }
-
-        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:CLIPSEQ_FILTER_GTF' {
+        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:FILTER_GTF_BY_TRANSCRIPT' {
             publishDir = [
                 path: { "${params.outdir}/00_genome" },
                 mode: "${params.publish_dir_mode}",
@@ -133,7 +124,15 @@ if(params.run_genome_prep) {
             publishDir = [
                 path: { "${params.outdir}/00_genome" },
                 mode: "${params.publish_dir_mode}",
-                saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+                saveAs: { filename ->
+                    if (filename.equals('versions.yml')) {
+                        return null
+                    }
+                    if (!params.skip_filter_gtf && filename.endsWith('regions.gtf.gz')) {
+                        return null
+                    }
+                    return filename
+                },
                 enabled: params.save_reference
             ]
         }
@@ -143,16 +142,7 @@ if(params.run_genome_prep) {
                 path: { "${params.outdir}/00_genome" },
                 mode: "${params.publish_dir_mode}",
                 saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-                enabled: params.save_reference
-            ]
-        }
-
-        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:RESOLVE_UNANNOTATED' {
-            publishDir = [
-                path: { "${params.outdir}/00_genome" },
-                mode: "${params.publish_dir_mode}",
-                saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-                enabled: params.save_reference
+                enabled: false
             ]
         }
 
@@ -164,24 +154,6 @@ if(params.run_genome_prep) {
                 enabled: params.save_reference
             ]
         }
-
-        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:RESOLVE_UNANNOTATED_GENIC_OTHER' {
-            publishDir = [
-                path: { "${params.outdir}/00_genome" },
-                mode: "${params.publish_dir_mode}",
-                saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-                enabled: params.save_reference
-            ]
-        }
-
-        withName: 'NFCORE_CLIPSEQ:CLIPSEQ:PREPARE_GENOME:RESOLVE_UNANNOTATED_GENIC_OTHER_REGIONS' {
-            publishDir = [
-                path: { "${params.outdir}/00_genome" },
-                mode: "${params.publish_dir_mode}",
-                saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
-                enabled: params.save_reference
-            ]
-        }
     }
 }
 

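Note on the new `saveAs` closure in `conf/modules.config`: its publishing rule can be restated outside Nextflow to check which files end up in `00_genome`. The following is a Python sketch of the same three branches, not pipeline code. The function name `published_name` is hypothetical, and the file names are taken from the test configs purely as examples.

```python
def published_name(filename, skip_filter_gtf):
    """Mirror the closure: return the name to publish under, or None to skip."""
    if filename == "versions.yml":
        return None
    # With filtering enabled, plain regions files are withheld.
    if not skip_filter_gtf and filename.endswith("regions.gtf.gz"):
        return None
    return filename


examples = [
    "versions.yml",
    "yeast_MitoV_seg.gtf",
    "yeast_MitoV_regions.gtf.gz",
    "yeast_MitoV_filtered_regions.gtf.gz",
]
for name in examples:
    print(name, published_name(name, skip_filter_gtf=False))
```

Because the check is a suffix match, any file ending in `regions.gtf.gz` is withheld when filtering is enabled, including a `*_filtered_regions.gtf.gz` if the same process emits one. Files ending in `.resolved.gtf` do not match and are published as before.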
diff --git a/conf/test.config b/conf/test.config
@@ -31,18 +31,14 @@ params {
     ncrna_genome_index = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/bowtie.tar.gz"
     genome_chrom_sizes = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV.fa.sizes"
     ncrna_chrom_sizes = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/homosapiens_smallRNA.fa.sizes"
-    longest_transcript = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.txt"
-    longest_transcript_fai = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.fai"
-    longest_transcript_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.gtf"
+    representative_transcript = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.txt"
+    representative_transcript_fai = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.fai"
+    representative_transcript_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.gtf"
     filtered_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered.gtf"
     seg_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_seg.gtf"
-    seg_filt_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg.gtf"
     regions_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_regions.gtf.gz"
     regions_filt_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions.gtf.gz"
-    seg_resolved_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg_genicOtherfalse.resolved.gtf"
     regions_resolved_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions_genicOtherfalse.resolved.gtf"
-    seg_resolved_gtf_genic = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg_genicOthertrue.resolved.gtf"
-    regions_resolved_gtf_genic = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions_genicOthertrue.resolved.gtf"
 
     // Logic
     debug                 = true
@@ -53,6 +49,7 @@ params {
     save_unaligned_output = true
     save_align_intermed   = true
     skip_transcriptome    = true
+    skip_filter_gtf       = false
 
     // Pipeline params
     umitools_bc_pattern = 'NNNNNNNNN'

diff --git a/conf/test_bam.config b/conf/test_bam.config
@@ -31,18 +31,14 @@ params {
     ncrna_genome_index = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/bowtie.tar.gz"
     genome_chrom_sizes = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV.fa.sizes"
     ncrna_chrom_sizes = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/homosapiens_smallRNA.fa.sizes"
-    longest_transcript = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.txt"
-    longest_transcript_fai = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.fai"
-    longest_transcript_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.gtf"
+    representative_transcript = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.txt"
+    representative_transcript_fai = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.fai"
+    representative_transcript_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/longest_transcript.gtf"
     filtered_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered.gtf"
     seg_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_seg.gtf"
-    seg_filt_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg.gtf"
     regions_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_regions.gtf.gz"
     regions_filt_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions.gtf.gz"
-    seg_resolved_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg_genicOtherfalse.resolved.gtf"
     regions_resolved_gtf = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions_genicOtherfalse.resolved.gtf"
-    seg_resolved_gtf_genic = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_seg_genicOthertrue.resolved.gtf"
-    regions_resolved_gtf_genic = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/yeast_MitoV_filtered_regions_genicOthertrue.resolved.gtf"
 
     // Logic
     debug                 = true

diff --git a/conf/test_full.config b/conf/test_full.config
@@ -17,13 +17,17 @@ params {
     config_profile_description = 'Full test dataset to check pipeline function'
 
     // Input data for full size test
-    input = 'https://raw.githubusercontent.com/nf-core/clipseq/refs/heads/feat-2-0/tests/test_new_samplesheet_FASTQ_full.csv'
+    input  = 'https://raw.githubusercontent.com/nf-core/clipseq/refs/heads/feat-2-0/tests/test_new_samplesheet_FASTQ_full.csv'
     source = "fastq"
 
     // Genome references
-    ncrna_fasta   = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/homosapiens_smallRNA.fa.gz"
-    fasta         = 's3://ngi-igenomes/test-data/clipseq/input_data/reference/GRCh38.primary_assembly.genome.fa.gz'
-    gtf           = 's3://ngi-igenomes/test-data/clipseq/input_data/reference/gencode.v37.primary_assembly.annotation.gtf.gz'
-    move_umi      = 'NNNNNNNNN'
-    umi_separator = '_'
+    ncrna_fasta    = "https://raw.githubusercontent.com/nf-core/test-datasets/clipseq/v_2_0/genome/homosapiens_smallRNA.fa.gz"
+    fasta          = 'https://ftp.ensembl.org/pub/release-111/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz'
+    gtf            = 'https://ftp.ensembl.org/pub/release-111/gtf/homo_sapiens/Homo_sapiens.GRCh38.111.gtf.gz'
+    save_reference = true
+
+    // UMI options
+    umitools_bc_pattern     = 'NNNNNNNNN'
+    umitools_umi_separator  = '_'
+    skip_umi_extract        = false
 }
diff --git a/modules/local/filter_gtf/README.md b/modules/local/filter_gtf/README.md
diff --git a/modules/local/filter_gtf/main.nf b/modules/local/filter_gtf/main.nf
diff --git a/modules/local/filter_gtf/meta.yml b/modules/local/filter_gtf/meta.yml