Feature/icres based grns #22

Merged
merged 16 commits into from
May 23, 2024
Changes from 11 commits
11 changes: 9 additions & 2 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -8,6 +8,13 @@ singularity_cache/
# ignore large motif mapping files
*motif_mappings*.bed

# ignore icres files (too large for repo)
*icres*.bed

# ignore iCREs output files (confidentiality until publication)
example/outputs_icres/*
!example/outputs_icres/.gitkeep

# ignore nf-test executable
nf-test

Expand All @@ -18,8 +25,8 @@ nf-test
tests/outputs/

# ignore SLURM output and error files
slurm.*.out
slurm.*.err
slurm*.out
slurm*.err

# ignore jupyter notebook checkpoints
.ipynb_checkpoints/
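
The effect of relaxing the SLURM ignore pattern can be checked with a small sketch. It uses Python's `fnmatch` as a stand-in for gitignore globbing, which is an approximation that should hold for slash-free file names like these. The file names below are hypothetical and chosen only for illustration.

```python
from fnmatch import fnmatch

OLD_PATTERN = "slurm.*.out"  # requires a literal dot right after "slurm"
NEW_PATTERN = "slurm*.out"   # accepts any characters after "slurm"

# Hypothetical file names (assumption, not taken from the repository).
candidates = ["slurm.12345.out", "slurm-12345.out", "slurm_job.out"]

# For each name, record whether the old and the new pattern match it.
matches = {
    name: (fnmatch(name, OLD_PATTERN), fnmatch(name, NEW_PATTERN))
    for name in candidates
}

for name, (old, new) in matches.items():
    print(f"{name}: old={old} new={new}")
```

Under this approximation, the new pattern also covers SLURM's default `slurm-<jobid>.out` naming, which the old pattern missed.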
26 changes: 20 additions & 6 deletions README.md
Expand Up @@ -13,12 +13,10 @@ MINI-AC uses a dual license to offer the distribution of the software under a pr

Currently, two species are supported by MINI-AC: *Arabidopsis thaliana* and two maize genome versions (B73 RefGen_v4 and B73 RefGen_v5). Additionally, it can be run on two different modes depending on the non-coding genomic space considered for motif mapping:
* **genome-wide**: strategy where the whole non-coding genome is considered for motif mappings. It captures all the ACRs of the input dataset for the GRN prediction, which is advised when working with species with long intergenic regions and distal regulatory elements, such as maize.
* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/configuration_pipeline.md).

* **locus-based**: strategy where the neighboring sequences within a pre-defined window of each locus, and introns are considered for motif mapping. It only captures the proximal ACRs of the input dataset within the pre-defined window, which can lead to missing distal ACRs in species with long intergenic regions and distal regulatory elements. However, it has the advantage of having a higher density of TFBS, which are mostly located close to the genes. The locus-based mode uses a "medium" non-coding genomic space, which corresponds, for each locus in the genome, to the 5kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns. However, for maize (but not for Arabidopsis; see publication), we generated two additional motif mapping files for the locus-based mode, that cover "large" (15kb upstream of the translation start site, the 2.5kb downstream of the translation end site, and the introns), and "small" (1kb upstream of the translation start site, the 1kb downstream of the translation end site, and the introns) non-coding genomic spaces. To use these files, check the instructions [here](docs/pipeline_configuration.md).

A detailed overview of the necessary input files and expected output files can be found in this [example](example), done on **maize V4 with the genome-wide mode**, and using as input a single-cell-derived ACR dataset of mesophyll and bundle sheath.


## **Inputs**
* **MINI-AC mode**: genome-wide or locus-based.
* **Species**: Arabidopsis or maize (maize genome version 4 or 5).
Expand Down Expand Up @@ -63,15 +61,31 @@ NOTE: MINI-AC was developed using the following versions: Nextflow version 21.10

## Usage


Define the paths with the input files and the desired parameters setting in the [configuration file](docs/configuration_pipeline.md), and run it executing the following Nextflow command:
Define the paths with the input files and the desired parameters setting in the [configuration file](docs/pipeline_configuration.md), and run it executing the following Nextflow command:

```shell
nextflow -C mini_ac.config run mini_ac.nf --mode <genome_wide|locus_based> --species <arabidopsis|maize_v4|maize_v5>
```

Having problems running MINI-AC? Check the [FAQ](docs/FAQ.md).

## iCREs-based MINI-AC [NOT AVAILABLE UNTIL PUBLICATION]

Given the amount of resources available to profile regulatory DNA in maize, we curated a collection of integrated cis-regulatory elements (iCREs) by combining and comparing different CRE-profiling methods (details to be published).

We implemented a new framework that makes it possible to run MINI-AC given a list of maize genes. It retrieves the genomic coordinates of the iCREs associated with the genes of interest and submits them to motif enrichment and GRN inference using the genome-wide mode of MINI-AC. iCREs-based MINI-AC can only be run for maize, not for Arabidopsis. In addition, two sets of iCREs are offered for the run: the "maxF1" (```maxf1```) set or the "all" (```all```) set. The first is a smaller but more precise set of putative CREs (fewer false positives), while the second is a more comprehensive collection of maize putative CREs.

To download the files with the genomic coordinates of the iCREs, the following commands should be executed from the **top-level directory of the repository**:

```shell
NOT AVAILABLE UNTIL PUBLICATION
```

To run iCREs-based MINI-AC, the [configuration file](./mini_ac_icres.config) should be prepared as explained [here](./docs/pipeline_configuration.md). Only two parameters change in comparison to the regular MINI-AC runs. Instead of providing a BED file with ACR genomic coordinates, a list of gene IDs from the maize genome version V4 or V5 should be provided, as exemplified [here](./example/inputs/gene_set_files/UP_gene_set.txt). In addition, an iCREs set should be specified (```maxf1``` or ```all```). Next, the following Nextflow command should be executed:

```shell
nextflow -C mini_ac_icres.config run mini_ac_icres.nf --icres_set <all|maxf1> --species <maize_v4|maize_v5>
```

## Support

Expand All @@ -81,7 +95,7 @@ Should you encounter a bug or have any questions or suggestions, please [open an

When publishing results generated using MINI-AC, please cite:

Manosalva Pérez, Nicolás, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal. https://doi.org/10.1111/tpj.16483.
Nicolás Manosalva Pérez, Camilla Ferrari, Julia Engelhorn, Thomas Depuydt, Hilde Nelissen, Thomas Hartwig, and Klaas Vandepoele. “MINI-AC: Inference of Plant Gene Regulatory Networks Using Bulk or Single-Cell Accessible Chromatin Profiles.” The Plant Journal 117, no. 1 (2024): 280–301. https://doi.org/10.1111/tpj.16483.

## Contact

51 changes: 51 additions & 0 deletions bin/geneList2iCREs.py
@@ -0,0 +1,51 @@
# %%
import argparse

def parseArgs():

    parser = argparse.ArgumentParser(prog = 'Script to get a BED file with iCREs ' + \
                                            'coordinates given a list of genes',
                                     conflict_handler='resolve')

    parser.add_argument('annotated_icres', nargs = 1, type = str,
                        help = '',
                        metavar = 'BED file with 4th column being ' +\
                                  'an annotated gene ID')

    parser.add_argument('gene_list', nargs = 1, type = str,
                        help = '',
                        metavar = 'One column file containing gene IDs '+ \
                                  'of interest')

    parser.add_argument('bed_of_genes_icres', nargs = 1, type = str,
                        help = '',
                        metavar = 'Output BED file with coordinates '+\
                                  'of iCREs associated with genes of interest')

    args = parser.parse_args()

    return args

args = parseArgs()

annot_icres = args.annotated_icres[0]
genes_oi_file = args.gene_list[0]
output_file = args.bed_of_genes_icres[0]

# %%
genes_oi = set()

with open(genes_oi_file, "r") as fin:
    for line in fin:
        rec = line.strip().split("\t")
        gene_id = rec[0]
        genes_oi.add(gene_id)

with open(output_file, "w") as fout:
    with open(annot_icres, "r") as fin:
        for line in fin:
            rec = line.strip().split("\t")
            gene_id = rec[3]
            if gene_id in genes_oi:
                fout.write("\t".join(rec[0:3]))
                fout.write("\n")
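
The filtering step of `bin/geneList2iCREs.py` can be tried without the confidential iCREs files using the minimal sketch below. It is not the script itself: the helper `filter_icres` is a hypothetical name, the coordinates and gene IDs are made up, and unlike the script it skips lines with fewer than four columns instead of raising an error.

```python
import io

def filter_icres(annotated_icres, genes_of_interest):
    """Return the first three BED columns of every record whose 4th column
    (the annotated gene ID) is in the set of genes of interest."""
    kept = []
    for line in annotated_icres:
        rec = line.strip().split("\t")
        # Assumption: malformed or blank lines are skipped rather than failing.
        if len(rec) >= 4 and rec[3] in genes_of_interest:
            kept.append("\t".join(rec[0:3]))
    return kept

# Made-up annotated iCREs (chromosome, start, end, gene ID), for illustration only.
annotated = io.StringIO(
    "chr1\t100\t200\tGeneA\n"
    "chr1\t300\t400\tGeneB\n"
    "chr2\t50\t90\tGeneA\n"
)

result = filter_icres(annotated, {"GeneA"})
print("\n".join(result))
```

Judging from the argument order in the parser, the script itself would be called as `python bin/geneList2iCREs.py <annotated_icres.bed> <gene_list.txt> <output.bed>`, where the three file names are placeholders.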
Empty file added data/icres/.gitkeep
Empty file.
2 changes: 1 addition & 1 deletion docs/FAQ.md
Expand Up @@ -2,7 +2,7 @@

## Q: MINI-AC failed, how can I fix it?
A:
* Check the [config file](/docs/configuration_pipeline.md):
* Check the [config file](/docs/pipeline_configuration.md):
* Did you specify the correct [executor](https://www.nextflow.io/docs/latest/executor.html) (e.g. SGE, SLURM, ...)? Cluster-related options (i.e., all the lines starting with `clusterOptions`) should also be adapted to match the options of the selected executor.
* Did you [specify to Singularity the path to the temporary directory](https://docs.sylabs.io/guides/3.5/user-guide/bind_paths_and_mounts.html)? It can be done by adjusting the parameter ```runOptions``` of singularity in Nextflow to ```--bind /absolute/path/to/tmp/folder```. To know the absolute path to the tmp folder in linux execute in the command line ```echo $TMPDIR```

Expand Up @@ -110,7 +110,7 @@ executor {
}
```

MINI-AC was developed in an SGE computer cluster, for which we used the configuration below. This was used to run the genome-wide mode on maize using an input dataset of ~600,000 MOA-seq peaks. For smaller datasets, the memory values can be further reduced. Addionally, for Arabidopsis, a species with a smaller genome, less memory can also be used.
MINI-AC was developed in an SGE computer cluster, for which we used the configuration below. This was used to run the genome-wide mode on maize using an input dataset of ~600,000 MOA-seq peaks. For smaller datasets, the memory values can be further reduced. Additionally, for Arabidopsis, a species with a smaller genome, less memory can also be used.

```nextflow
executor {
Expand Down Expand Up @@ -223,3 +223,42 @@ params {
It is important, however, to make sure that the format is correct. The GO terms should be extended for parental terms, and this file should contain two tab-separated columns (no header), where the first column is the GO ID, and the second column is the gene ID, as shown [here](../data/zma_v4/zma_v4_go_gene_file.txt). It is vital that the gene IDs are either on Araport11 or AGPv4/NAM5.0.

This same principle can also be applied to other parameters that the user wants to change.

## iCREs-based MINI-AC configuration file

The configuration file of iCREs-based MINI-AC has a similar structure and input parameters as regular MINI-AC (given that it runs genome-wide MINI-AC "under the hood"). The parameter ```ACR_dir``` should be replaced by ```Gene_list_dir```. This parameter should be the path to a directory containing files in a ".txt" format, with each line containing a maize gene ID from the V4 or V5 genome version. One example can be found [here](../example/inputs/gene_set_files/UP_gene_set.txt). One GRN will be predicted for each input file.
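
Before launching a run, the contents of the ```Gene_list_dir``` folder can be inspected with a minimal sketch like the one below. It is a pre-flight check and not part of the pipeline: the helper `read_gene_sets` is a hypothetical name, the gene IDs are placeholders, and the temporary directory only stands in for a real gene set folder.

```python
from pathlib import Path
import tempfile

def read_gene_sets(gene_list_dir):
    """Map each .txt file (name without extension) to its set of gene IDs.
    The first tab-separated field of each non-empty line is taken as the ID."""
    gene_sets = {}
    for path in sorted(Path(gene_list_dir).glob("*.txt")):
        with open(path) as fin:
            genes = {line.strip().split("\t")[0] for line in fin if line.strip()}
        gene_sets[path.stem] = genes
    return gene_sets

# Temporary directory with one made-up gene list (placeholder IDs).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "UP_gene_set.txt").write_text("GeneA\nGeneB\n\nGeneA\n")
    gene_sets = read_gene_sets(tmp)

# One entry per input file, which is also the number of GRNs to expect.
for name, genes in gene_sets.items():
    print(name, len(genes))
```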

There is an additional input parameter named ```--icres_set```, which can be either ```all``` or ```maxf1```. The value ```all``` uses a more comprehensive collection of maize putative CREs, while ```maxf1``` uses a set of putative CREs that is smaller but more precise (fewer false positives).

An example of the parameter configuration from the file [mini_ac_icres.config](../mini_ac_icres.config) can be found below:

```nextflow
params {

    //// Output folder
    OutDir = "$projectDir/example/outputs_icres"

    //// Required input
    Gene_list_dir = "$projectDir/example/inputs/gene_set_files"

    //// Optional input
    // Differential expression data
    DE_genes = false
    DE_genes_dir = "$projectDir/example/inputs/de_files"
    One_DE_set = true
    // Expression data
    Filter_set_genes = false
    Set_genes_dir = "$projectDir/example/inputs/exp_genes_files"
    One_filtering_set = true

    //// Prediction parameters
    Bps_intersect = false


    //// Prediction parameters only genome-wide
    Second_gene_annot = false
    Second_gene_dist = 500
}
```

This version of MINI-AC can also be run with ```DE_genes = true``` and ```Filter_set_genes = true```. However, the input files should be named accordingly, with the same name as the input file, followed by ```_icres_``` and ```_degs_table.txt``` and/or ```_expressed_genes.txt```. For example, in the case of the input file [UP_gene_set.txt](../example/inputs/gene_set_files/UP_gene_set.txt), the corresponding DEGs and expressed genes files should be named ```UP_gene_set_icres_degs_table.txt``` and ```UP_gene_set_icres_expressed_genes.txt```, respectively.
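
The naming rule above can be expressed as a small sketch that derives the expected file names from a gene list file. The helper `expected_optional_inputs` is a hypothetical name introduced here for illustration; only the suffixes come from the description above.

```python
from pathlib import Path

def expected_optional_inputs(gene_list_file):
    """Derive the DEGs and expressed-genes file names expected for a gene list
    file, following the "<name>_icres_<suffix>" convention described above."""
    stem = Path(gene_list_file).stem
    return {
        "degs_table": f"{stem}_icres_degs_table.txt",
        "expressed_genes": f"{stem}_icres_expressed_genes.txt",
    }

names = expected_optional_inputs("example/inputs/gene_set_files/UP_gene_set.txt")
print(names["degs_table"])
print(names["expressed_genes"])
```

Comparing these names against the contents of ```DE_genes_dir``` and ```Set_genes_dir``` before a run is one way to catch misnamed files early.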
12 changes: 12 additions & 0 deletions example/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -165,3 +165,15 @@ The [OUTPUTS folder](outputs/) contains four sub-folders:
- Maize gene name and Arabidopsis ortholog gene name combined.
- (Optional; if expressed genes provided) True if the TF is present in the user-provided list of expressed genes, False otherwise.
- (Optional; if DE table provided) Differential expression information. The first column is the gene ID, and the rest of columns depend on the content of the user-provided table in input folder "de_files".

## iCREs-based MINI-AC

The outputs of the iCREs-based MINI-AC runs are identical to those of the default MINI-AC, as can be seen in the folder [outputs_icres](outputs_icres) (not available until publication). However, two input parameters change:

* Instead of providing an input BED file with genomic coordinates, the input should be a list of gene IDs from the version V4 or V5 of the maize genome, as in this [example](./inputs/gene_set_files/UP_gene_set.txt).

* There is an additional input parameter named ```--icres_set``` that can be either ```all``` or ```maxf1```. The value ```all``` uses a more comprehensive collection of maize putative CREs, while ```maxf1``` uses a set of putative CREs that is smaller but more precise (fewer false positives). To download the files with the genomic coordinates of these two iCREs sets, the following commands should be executed from the **top-level directory of the repository**:

```shell
NOT AVAILABLE UNTIL PUBLICATION
```