VIB-PSB · nicomaper · May 23, 2024 · May 13, 2024 · May 13, 2024 · May 14, 2024
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,13 @@ singularity_cache/
 # ignore large motif mapping files
 *motif_mappings*.bed
 
+# ignore icres files (too large for repo)
+*icres*.bed
+
+# ignore iCREs output files (confidentiality until publication)
+example/outputs_icres/*
+!example/outputs_icres/.gitkeep
+
 # ignore nf-test executable
 nf-test
 
@@ -18,8 +25,8 @@ nf-test
 tests/outputs/
 
 # ignore SLURM output and error files
-slurm.*.out
-slurm.*.err
+slurm*.out
+slurm*.err
 
 # ignore jupyter notebook checkpoints
 .ipynb_checkpoints/
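
The relaxed SLURM patterns in this hunk matter because SLURM's default output name is `slurm-%j.out`, which has no dot after `slurm`. The sketch below approximates the two slash-free `.gitignore` globs with Python's `fnmatch`, which is only an approximation of Git's matcher. The file names and the `is_ignored` helper are made up for illustration.

```python
from fnmatch import fnmatchcase

# Patterns taken from the hunk above (old vs. new).
OLD_PATTERN = "slurm.*.out"
NEW_PATTERN = "slurm*.out"


def is_ignored(filename, pattern):
    """Approximate a slash-free .gitignore glob with shell-style matching."""
    return fnmatchcase(filename, pattern)


# Hypothetical log names: the SLURM default style and a dotted style.
for name in ["slurm-123456.out", "slurm.123456.out"]:
    print(name, is_ignored(name, OLD_PATTERN), is_ignored(name, NEW_PATTERN))
```

Under this approximation the old pattern only catches the dotted name, while the new one catches both.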

diff --git a/README.md b/README.md
@@ -13,12 +13,10 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr
 
 Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
 * **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is adviced when working with species with long intergenic regions and distal regulatory elements, like maize for example.
-* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/configuration_pipeline.md).
-
+* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check  the instructions [here](docs/pipeline_configuration.md).
 
 A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.
 
-
 ## **Inputs**
 * **MINI-AC mode**: genome-wide or locus-based.
 * **Species**: Arabidopsis or maize (maize genome version 4 or 5).
@@ -63,15 +61,31 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10
 
 ## Usage
 
-
-Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:
+Define the paths to the input files and the desired parameter settings in the [configuration file](docs/pipeline_configuration.md), then run the pipeline with the following Nextflow command:
 
 ```shell
 nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
 ```
 
 Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).
 
+## iCREs-based MINI-AC [NOT AVAILABLE UNTIL PUBLICATION]
+
+Given the amount of resources available to profile regulatory DNA in maize, we curated a collection of integrated cis-regulatory elements (iCREs) by combining and comparing different CRE-profiling methods (details to be published).
+
+We implemented a new framework in which it is possible to run MINI-AC given a list of maize genes. It works by retrieving the genomic coordinates of the iCREs associated with the genes of interest, and submitting them to motif enrichment and GRN inference using the genome-wide mode of MINI-AC. iCREs-based MINI-AC can only be run for maize, and not for Arabidopsis. In addition, we offer two sets of iCREs that can be used in the run: the "maxF1" (```maxf1```) set or the "all" (```all```) set. The first uses a set of putative CREs that is smaller but more precise (fewer false positives), while the second uses a more comprehensive and complete collection of maize putative CREs.
+
+To download the files with the genomic coordinates of the iCREs, the following commands should be executed from the **top-level directory of the repository**:
+
+```shell
+NOT AVAILABLE UNTIL PUBLICATION
+```
+
+To run iCREs-based MINI-AC, the [configuration file](./mini_ac_icres.config) should be prepared as explained [here](./docs/pipeline_configuration.md). Only two parameters change in comparison to the regular MINI-AC runs. Instead of providing a BED file with ACR genomic coordinates, a list of gene IDs from the maize genome version V4 or V5 should be provided, as exemplified [here](./example/inputs/gene_set_files/UP_gene_set.txt). In addition, an iCREs set should be specified (```maxf1``` or ```all```). Next, the following Nextflow command should be executed:
+
+```shell
+nextflow -C mini_ac_icres.config run mini_ac_icres.nf --icres_set <all|maxf1> --species <maize_v4|maize_v5>
+```
 
 ## Support
 
@@ -81,7 +95,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an
 
 When publishing results generated using MINI-AC, please cite:
 
-Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.
+Nicolás Manosalva Pérez, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal 117, no. 1 (2024): 280–301. https://doi.org/10.1111/tpj.16483.
 
 ## Contact
 

diff --git a/bin/geneList2iCREs.py b/bin/geneList2iCREs.py
@@ -0,0 +1,51 @@
+# %%
+import argparse
+
+def parseArgs():
+
+    parser = argparse.ArgumentParser(description = 'Script to get a BED file with iCREs ' + \
+                                            'coordinates given a list of genes',
+                        conflict_handler='resolve')
+
+    parser.add_argument('annotated_icres', nargs = 1, type = str,
+                        metavar = 'annotated_icres',
+                        help = 'BED file with 4th column being ' +\
+                                    'an annotated gene ID')
+
+    parser.add_argument('gene_list', nargs = 1, type = str,
+                        metavar = 'gene_list',
+                        help = 'One column file containing gene IDs '+ \
+                                'of interest')
+
+    parser.add_argument('bed_of_genes_icres', nargs = 1, type = str,
+                        metavar = 'bed_of_genes_icres',
+                        help = 'Output BED file with coordinates '+\
+                            'of iCREs associated with genes of interest')
+
+    args = parser.parse_args()
+
+    return args
+
+args = parseArgs()
+
+annot_icres = args.annotated_icres[0]
+genes_oi_file = args.gene_list[0]
+output_file = args.bed_of_genes_icres[0]
+
+# %%
+genes_oi = set()
+
+with open(genes_oi_file, "r") as fin:
+    for line in fin:
+        rec = line.strip().split("\t")
+        gene_id = rec[0]
+        genes_oi.add(gene_id)
+
+with open(output_file, "w") as fout:
+    with open(annot_icres, "r") as fin:
+        for line in fin:
+            rec = line.strip().split("\t")
+            gene_id = rec[3] if len(rec) > 3 else None
+            if gene_id in genes_oi:
+                fout.write("\t".join(rec[0:3]))
+                fout.write("\n")
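
The filtering logic of this script can also be sketched as two small functions that work on open file handles, which makes it easy to exercise on in-memory data. This is a minimal illustration and not the code in the PR: the function names `read_gene_ids` and `filter_icres`, the gene IDs, and the coordinates are all hypothetical, and the variant skips blank or short lines.

```python
import io


def read_gene_ids(handle):
    """Collect gene IDs from the first tab-separated column, skipping blank lines."""
    genes = set()
    for line in handle:
        gene_id = line.strip().split("\t")[0]
        if gene_id:
            genes.add(gene_id)
    return genes


def filter_icres(icres_handle, genes_of_interest):
    """Yield (chrom, start, end) for iCREs whose 4th column is a gene of interest."""
    for line in icres_handle:
        rec = line.rstrip("\n").split("\t")
        if len(rec) < 4:
            continue  # blank or malformed line: no gene annotation to test
        if rec[3] in genes_of_interest:
            yield tuple(rec[0:3])


# Made-up inputs standing in for the gene list and the annotated iCREs BED file.
genes = read_gene_ids(io.StringIO("geneA\n\ngeneC\n"))
icres = io.StringIO(
    "chr1\t100\t200\tgeneA\n"
    "chr1\t300\t400\tgeneB\n"
    "\n"
    "chr2\t50\t90\tgeneC\n"
)
for chrom, start, end in filter_icres(icres, genes):
    print("\t".join((chrom, start, end)))
```

Keeping the gene IDs in a set gives constant-time membership tests, so the BED file is streamed once regardless of how many genes are requested.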
diff --git a/data/icres/.gitkeep b/data/icres/.gitkeep
diff --git a/docs/FAQ.md b/docs/FAQ.md
@@ -2,7 +2,7 @@
 
 ## Q: MINI-AC failed, how can I fix it?
 A: 
-* Check the [config file](/docs/configuration_pipeline.md):
+* Check the [config file](/docs/pipeline_configuration.md):
   * Did you specify the correct [executor](https://www.nextflow.io/docs/latest/executor.html) (e.g. SGE, SLURM, ...)? Cluster-related options (i.e., all the lines starting with `clusterOptions`) should also be adapted to match the options of the selected executor.
   * Did you [specify to Singularity the path to the temporary directory](https://docs.sylabs.io/guides/3.5/user-guide/bind_paths_and_mounts.html)? It can be done by adjusting the parameter ```runOptions``` of singularity in Nextflow to ```--bind /absolute/path/to/tmp/folder```. To know the absolute path to the tmp folder in linux execute in the command line ```echo $TMPDIR```
 

diff --git a/docs/configuration_pipeline.md → docs/pipeline_configuration.md b/docs/configuration_pipeline.md → docs/pipeline_configuration.md
@@ -110,7 +110,7 @@ executor {
 }
 ```
 
-MINI-AC was developed in an SGE computer cluster, for which we used the configuration below. This was used to run the genome-wide mode on maize using an input dataset of ~600,000 MOA-seq peaks. For smaller datasets, the memory values can be further reduced. Addionally, for Arabidopsis, a species with a smaller genome, less memory can also be used.
+MINI-AC was developed in an SGE computer cluster, for which we used the configuration below. This was used to run the genome-wide mode on maize using an input dataset of ~600,000 MOA-seq peaks. For smaller datasets, the memory values can be further reduced. Additionally, for Arabidopsis, a species with a smaller genome, less memory can also be used.
 
 ```nextflow
 executor {
@@ -223,3 +223,42 @@ params {
 It is important, however, to make sure that the format is correct. The GO terms should be extended for parental terms, and this file should contain two tab-separated columns (no header),  where the first column is the GO ID, and the second column is the gene ID, as shown [here](../data/zma_v4/zma_v4_go_gene_file.txt). It is vital that the gene IDs are either on Araport11 or AGPv4/NAM5.0.
 
 This same principle can also be applied to other parameters that the user wants to change.
+
+## iCREs-based MINI-AC configuration file
+
+The configuration file of iCREs-based MINI-AC has a structure and input parameters similar to those of regular MINI-AC (given that it runs genome-wide MINI-AC "under the hood"). The parameter ```ACR_dir``` is replaced by ```Gene_list_dir```, which should be the path to a directory containing ".txt" files, with each line containing a maize gene ID from the V4 or V5 genome version. One example can be found [here](../example/inputs/gene_set_files/UP_gene_set.txt). One GRN will be predicted for each input file.
+
+There is an additional input parameter named ```--icres_set``` that can be either ```all``` or ```maxf1```. The value ```all``` uses a more comprehensive and complete collection of maize putative CREs, while ```maxf1``` uses a set of putative CREs that is smaller but more precise (fewer false positives).
+
+One example of the parameters configuration from the file [mini_ac_icres.config](../mini_ac_icres.config) can be found below:
+
+```nextflow
+params {
+
+    //// Output folder
+    OutDir = "$projectDir/example/outputs_icres"
+
+    //// Required input
+    Gene_list_dir = "$projectDir/example/inputs/gene_set_files"
+
+    //// Optional input
+    // Differential expression data
+    DE_genes = false
+    DE_genes_dir = "$projectDir/example/inputs/de_files"
+    One_DE_set = true
+    // Expression data
+    Filter_set_genes = false
+    Set_genes_dir = "$projectDir/example/inputs/exp_genes_files"
+    One_filtering_set = true
+
+    //// Prediction parameters
+    Bps_intersect = false
+
+
+    //// Prediction parameters only genome-wide
+    Second_gene_annot = false
+    Second_gene_dist = 500
+}
+```
+
+This version of MINI-AC can also be run with ```DE_genes = true``` and ```Filter_set_genes = true```. However, the DEGs and expressed genes files must be named after the corresponding gene list file: the gene list file name without its extension, followed by ```_icres``` and then ```_degs_table.txt``` and/or ```_expressed_genes.txt```. For example, for the input file [UP_gene_set.txt](../example/inputs/gene_set_files/UP_gene_set.txt), the corresponding DEGs and expressed genes files should be named ```UP_gene_set_icres_degs_table.txt``` and ```UP_gene_set_icres_expressed_genes.txt```, respectively.
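
The naming convention for the optional input files can be checked with a few lines of Python before launching a run. The helper below is a hypothetical sketch inferred only from the `UP_gene_set.txt` example in the documentation text. It is not part of the pipeline, and `expected_optional_inputs` is a made-up name.

```python
from pathlib import Path


def expected_optional_inputs(gene_list_file):
    """Derive the DEGs and expressed-genes file names expected for a gene list file."""
    stem = Path(gene_list_file).stem  # file name without directory and ".txt"
    return {
        "degs_table": f"{stem}_icres_degs_table.txt",
        "expressed_genes": f"{stem}_icres_expressed_genes.txt",
    }


names = expected_optional_inputs("example/inputs/gene_set_files/UP_gene_set.txt")
print(names["degs_table"])
print(names["expressed_genes"])
```

Comparing these names against the contents of the `DE_genes_dir` and `Set_genes_dir` folders is one way to spot a misnamed file early.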
diff --git a/example/README.md b/example/README.md
@@ -165,3 +165,15 @@ The [OUTPUTS folder](outputs/) contains four sub-folders:
 		- Maize gene name and Arabidopsis ortholog gene name combined.
 		- (Optional; if expressed genes provided)  True if the TF is present in the user-provided list of expressed genes, False otherwise.
 		- (Optional; if DE table provided) Differential expression information. The first column is the gene ID, and the rest of columns depend on the content of the user-provided table in input folder "de_files".
+
+## iCREs-based MINI-AC
+
+The outputs of the iCREs-based MINI-AC runs are identical to those of the default MINI-AC, as can be seen in the folder [outputs_icres](outputs_icres) (not available until publication). However, two input parameters change:
+
+* Instead of providing an input BED file with genomic coordinates, the input should be a list of gene IDs from the version V4 or V5 of the maize genome, as in this [example](./inputs/gene_set_files/UP_gene_set.txt).
+
+* There is an additional input parameter named ```--icres_set``` that can be either ```all``` or ```maxf1```. The value ```all``` uses a more comprehensive and complete collection of maize putative CREs, while ```maxf1``` uses a set of putative CREs that is smaller but more precise (fewer false positives). To download the files with the genomic coordinates of these two iCREs sets, the following commands should be executed from the **top-level directory of the repository**:
+
+```shell
+NOT AVAILABLE UNTIL PUBLICATION
+```