CoRAL - Complete Reconstruction of Amplifications with Long reads

Reference

CoRAL is a tool which utilizes aligned, single-molecule long-read data (.bam) as input, and identifies candidate ecDNA structures. The original Genome Research '24 paper is available here: https://genome.cshlp.org/content/34/9/1344.

CoRAL only works on long-read whole-genome sequencing data (PacBio, Oxford Nanopore, etc.) - not targeted sequencing!

Installation

CoRAL can be installed and run on most modern Unix-like operating systems (e.g. Ubuntu 18.04+, CentOS 7+, macOS). Python >= 3.10 is required.

Clone the repository

git clone https://github.com/AmpliconSuite/CoRAL
cd CoRAL

Install Poetry (if not already installed)
```
pip install --user pipx
pipx install poetry
```
pipx installs Poetry in its own isolated environment, preventing conflicts with system Python packages. Do not use pip install poetry directly on a system Python.
Install CoRAL dependencies
```
poetry install
```
Poetry creates an isolated virtual environment automatically. If pysam fails to build, install htslib first (sudo apt install libhtslib-dev on Debian/Ubuntu, or brew install htslib on macOS), then re-run poetry install.
Activate the environment and verify the installation
```
source $(poetry env info --path)/bin/activate
coral --help
```
This works with all versions of Poetry. To deactivate the environment, run deactivate.

Re-activating the environment

After the initial install, you only need to re-run the activation command each time you start a new shell session (e.g. after logging out and back in):

cd /your/path/to/CoRAL/
source $(poetry env info --path)/bin/activate

Run this from the CoRAL repository directory. To avoid doing this manually each session, you can add it to your ~/.bashrc or ~/.zshrc.

Download a Gurobi optimizer license (free for academic use)
- Place the gurobi.lic file in $HOME/gurobi.lic.

Finish installing CNVkit dependencies (recommended)

Rscript -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
Rscript -e 'BiocManager::install("DNAcopy")'

Getting copy number calls

Before running CoRAL, you will need genome-wide copy number (CN) calls generated from your long-read data.

If you have these already, simply ensure that they are in a .bed format like so:

chrom start end CN
If you don't have these then you can run CNVkit (installed as a dependency) to generate them, by running

./scripts/call_cnvs.sh <input.bam> ./reference/hg38full_ref_5k.cnn <output_dir>

This will create a file called [input].cns, which you can feed to CoRAL for it's --cn_segs argument.

Command line arguments to run CoRAL

CoRAL and its various run-modes can by used in the following manner

coral [mode] [mode arguments]

The run modes are as follows:

seed: Identify seed amplified intervals (copy number gain regions) with input genome-wide CN calls.
reconstruct: The main running mode of CoRAL. Perform a complete reconstuction of breakpoint graph and cycle extraction from seed amplified intervals.
cycle: Run cycle extraction on a previously generated breakpoint graph. NOTE: This requires the breakpoint graph to be generated with CoRAL v2.1.0 or later, as we require path constraints and amplicon intervals to be included in the provided *_graph.txt file.
plot: Create plots of cycles/paths from cycle extraction and/or breakpoint graph sashimi plot.
hsr: Identify candidate locations of chromosomal homogenously staining region (HSR) integration points for ecDNA.
cycle2bed: Convert the AmpliconArchitect (AA) style *_cycles.txt file to extended .bed format. The AA format is also supported by CoRAL.

1. `seed`

As the seed amplification intervals are required by the main script reconstruct mode, it is suggested the user first run seed mode to generate seed amplification intervals.

Usage: coral seed <Required arguments> <Optional arguments>

Required arguments:

--cn-seg <file> - Long read segmented whole genome CN calls. Accepts CNVkit .cns format, or a tab-separated .bed file where the last column is the copy number value.
--output-prefix <string> - Prefix of the output *_CNV_SEEDS.bed file.
--lr-bam <file> - Coordinate sorted BAM file, used to read chromosome lengths from the header.
--centromere-file <file> - Centromere BED file for the reference genome used.

The file must be tab-separated with columns chr, start, end (BED3 format; additional columns are ignored). Each contig may have one entry, or two entries that overlap or directly abut (they will be merged). Multiple distinct non-adjacent regions for the same contig will raise an error. Contigs absent from the file are treated as having no centromere (all segments on that contig are eligible as seeds). Example:
```
chr1	121700000	125100000
chr2	91800000	100200000
```
The UCSC centromeres or gap track (available via the Table Browser for most assemblies) is a convenient source for this file.

Optional arguments:

--gain <float> - A minimum CN threshold (with the assumption of diploid genome) for a particular CN segment to be considered as a seed. Default is 6.0.
--min-seed-size <int> - Minimum size (in bp) for a CN segment to be considered as a seed. Default is 100000.
--max-seg-gap <int> - Maximum gap size (in bp) to merge two proximal CN segments to be considered as seed intervals. If at least two segments are merged, then they will be treated as a single candidate to be filtered with --min-seed-size, and their aggregate size will be compared with the value. Default is 300000.
--extra-contigs <file> - Plain-text file of additional contig names (one per line) to include alongside standard chromosomes when reading chromosome sizes from the BAM header.

2. `reconstruct`

Usage: reconstruct <Required arguments> <Optional arguments>

2.1 Required arguments:

--lr-bam <file> - Coordinate sorted *.BAM file, with *.bai index (mapped to the provided reference genome) in the same directory.
--cnv-seed <file> - *.bed file with a putative list of seed amplification intervals. The seed amplification intervals can be obtained through running seed mode, or provided manually.
--output-prefix <path> - Prefix (including directory) to which the output graph.txt and cycles.txt files will be written.
--cn-seg <file> - Long read segmented whole genome CN calls. Accepts CNVkit .cns format, or a tab-separated .bed file where the last column is the copy number value.

2.2 Optional arguments:

--min-bp-support <float> - Filter out breakpoints with less than (min_bp_support * normal coverage) long read support in breakpoint graph construction. The default value is set to 1.0, meaning to filter out breakpoints supported by less than diploid coverage, but it is highly recommended to specify a much larger value, e.g. 10.0 to obtain a cleaner breakpoint graph and the dominating ecDNA cycle(s).
--skip-cycle-decomp - If specified, will stop by only outputting the breakpoint graph files *_graph.txt (see Expected output below) for all amplicons and not extract cycles from the graph and output *_cycles.txt.
--output-path_constraints <all|longest|none> - Options to output path constraints given by long reads. By default, CoRAL will only output the longest path constraints in *_graph.txt and *_cycles.txt files (see "Expected output" below). If all or none is specified, output all or no path constraints in both graph and cycles files.
--cycle-decomp-mode <min_cycles|max_weight> - min_cycles: minimize the number of cycles/paths, if solver could not find a feasible solution within time limit, switch to greedy cycle extraction; max_weight: greedy cycle extraction by iteratively extracting cycles or paths with maximum length weighted CN and satisfying the most path constraints. Default mode is max_weight.
--cycle-decomp-alpha <float between [0, 1]> - Parameter used to balance CN weight and path constraints in the objective function of greedy cycle extraction. Default value is 0.01, higher values favor the satisfaction of more path constraints.
--solver-time-limit <int> - Maximum running time (in seconds) reserved for solving a single quadratic program using the chosen integer program solver (e.g. Gurobi, SCIP). The solver would return the best solution(s) it currently found, regardless of the optimality status, when reaching this time limit. Default value is 7200 (i.e., 2 hours).
--solver-threads <int> - Number of threads reserved for for solving the quadratic program with Gurobi (integer program solver). If not specified (and by default), the solver would attempt to use up all available cores in the working machine.
--solver <choice> - Solver for cycle extraction. Must be one of [gurobi_direct, scip].
--global-time-limit <int> - Maximum running time (in seconds) reserved for the entire cycle extraction process. Default value is 21600 (i.e., 6 hours).
--postprocess-greedy-sol - If specified, automatically postprocess the cycles/paths returned in greedy cycle extraction, by solving the full quadratic program to minimize the number of cycles/paths starting with the greedy cycle extraction solution (as an initial solution).
--log-file <file> - Name of the main *.log file, which can be used to trace the status of reconstruct run(s).

2.3 Expected output:

CoRAL may identify and reconstruct a few distinct focal amplifications in the input *.BAM sample, each will be organized as an amplicon, which includes a connected component of amplified intervals and their connections by discordant edges. CoRAL writes the following files to the directory specified with --output_dir.

Graph fil-e: For each amplicon, a tab-separated text file named output_dir/amplicon*_graph.txt describing the sequence edges, concordant edges and discordant edges in the graph and their predicted copy count. Note that the graph files outputted by CoRAL have the same format as those outputted by AmpliconArchitect (and therefore the files can be used interchangeably with AmpliconArchitect). Here is an example graph file from GBM39, a cell line with EGFR amplified on ecDNA.
- As of version 2.1.0, CoRAL additionally includes path constraints and amplicon intervals in the *_graph.txt file. This results in the graph being fully self-contained and able to be passed to cycle extraction without re-parsing the BAM file. For more information on how to interpret this metadata, visit our wiki.

SequenceEdge: StartPosition, EndPosition, PredictedCN, AverageCoverage, Size, NumberOfLongReads
sequence	chr7:54659673-	chr7:54763281+	4.150534	45.907363	103609	576
sequence	chr7:54763282-	chr7:55127266+	89.340352	1052.714362	363985	40637
sequence	chr7:55127267-	chr7:55155020+	2.843655	32.729552	27754	172
sequence	chr7:55155021-	chr7:55609190+	89.340352	1013.182857	454170	49675
sequence	chr7:55609191-	chr7:55610094+	2.868261	31.027655	904	915
sequence	chr7:55610095-	chr7:56049369+	89.340352	1023.280633	439275	49106
sequence	chr7:56049370-	chr7:56149664+	4.150534	49.623899	100295	562
BreakpointEdge: StartPosition->EndPosition, PredictedCN, NumberOfLongReads
concordant	chr7:54763281+->chr7:54763282-	4.150534	26
concordant	chr7:55127266+->chr7:55127267-	2.843655	36
concordant	chr7:55155020+->chr7:55155021-	2.843655	32
concordant	chr7:55609190+->chr7:55609191-	2.697741	38
concordant	chr7:55610094+->chr7:55610095-	2.697741	41
concordant	chr7:56049369+->chr7:56049370-	4.150534	45
discordant	chr7:55610095-->chr7:55609190+	86.642611	869
discordant	chr7:56049369+->chr7:54763282-	85.189818	981
discordant	chr7:55155021-->chr7:55127266+	86.496697	978
...
PathConstraint: Path, Support
path_constraint e2+:1,c2-:1,e3+:1,c3-:1,e4+:1   6
path_constraint e4+:1,c4-:1,e5+:1,c5-:1,e6+:1   34
AmpliconIntervals: chr, start, end
interval        chr7    54659673        56149664

Cycles file: For each amplicon, a tab-separated text file named output_dir_amplicon*_cycles.txt describing the list of cycles and paths returned from cycle extraction. Note that the cycles files output by CoRAL have mostly the same format as those output by AmpliconArchitect (and therefore the files can be used interchangeably with AmpliconArchitect in most cases). Specifically a cycles file includes (i) the list of amplified intervals; (ii) the list of sequence edges; (iii) the list of cycles and paths, where an entry starts with 0+ and ends with 0- in Segments indicates a path - these lines have the same format as AmpliconArchitect output. CoRAL's cycles files additionally include (iv) a list of longest (i.e., there are no paths that can form a sub/super-path to each other) path constraint indicated by long reads, and used in CoRAL's cycle extraction. Here is an example cycles file corresponding to the above graph file from GBM39.

Interval	1	chr7	54659673	56149664
List of cycle segments
Segment	1	chr7	54659673	54763281
Segment	2	chr7	54763282	55127266
Segment	3	chr7	55127267	55155020
Segment	4	chr7	55155021	55609190
Segment	5	chr7	55609191	55610094
Segment	6	chr7	55610095	56049369
Segment	7	chr7	56049370	56149664
List of longest subpath constraints
Path constraint	1	2+,3+,4+	Support<=6	Satisfied
Path constraint	2	4+,5+,6+	Support<=34	Satisfied
Cycle=1;Copy_count=82.34616279663038;Segments=2+,4+,6+;Path_constraints_satisfied=
Cycle=2;Copy_count=2.8436550275157644;Segments=0+,2+,3+,4+,5+,6+,0-;Path_constraints_satisfied=1,2

Note that if --output-all-path-constraints is specified, then all path constraints given by long reads will be written to in *.cycles file.

Other outputs include the output_dir_amplicon*_model.lp file(s) and output_dir_amplicon*_model.log file(s) given by Gurobi (integer program solver), for each amplicon, respectively describing the quadratic (constrained) program in a human readable format, and the standard output produced by Gurobi.

3. `cycle`

Usage:

coral cycle <Required arguments> <Optional arguments> for cycle extraction from a single amplicon (breakpoint graph);
coral cycle_all <Required arguments> <Optional arguments> for cycle extraction from all amplicons in a directory.

3.1 Required arguments:

Argument	Descripion
`--graph <file>` (for `cycle` mode)	AA or CoRAL-formatted `_graph.txt` file.
`--bp-dir <directory>` (for `cycle_all` mode)	Directory containing AA or CoRAL-formatted `_graph.txt` files for different amplicons.
`--output-prefix <path>`	Prefix (including directory) of the output `cycles.txt` files.

3.2 Optional arguments:

Argument	Default	Description
`--alpha <float>`	0.01	Parameter used to balance CN weight and path constraints in the objective function of greedy cycle extraction. Default value is 0.01, higher values favor the satisfaction of more path constraints.
`--solver-time-limit <int>`	7200	Time limit for cycle extraction (in seconds)
`--threads <int>`	-1	Number of threads for cycle extraction. If not specified, use all available cores.
`--solver <choice>`	gurobi_direct	Solver for cycle extraction. Must be one of `[gurobi_direct, scip]`.
`--cycle-decomp-mode <choice>`	max_weight	Cycle extraction mode. Must be one of `[min_cycles, max_weight]` (see `reconstruct` mode).
`--output-path-constraints <choice>`	longest	Options for output path constraints in graph and cycle files. Must be one of `[all, longest, none]` (see `reconstruct` mode).
`--postprocess-greedy-sol`	False	If specified, automatically postprocess the cycles/paths returned in greedy cycle extraction, by solving the full quadratic program to minimize the number of cycles/paths starting with the greedy cycle extraction solution (as an initial solution).

4. `plot`

Usage: coral plot <Required arguments> <Optional arguments>

4.1 Required arguments:

Argument	Description
`--ref <choice>`	Reference genome. Must be one of `[hg19, hg38, mm10]`
`--output-prefix <str>`	Prefix of output files

At least one of --graph or --cycles must also be provided.

4.2 Optional arguments:

Argument	Default	Description
`--graph <file>`		AA-formatted `_graph.txt` file
`--cycles <file>`		AA-formatted `_cycles.txt` file
`--bam <file>`		Sorted indexed BAM file (required for coverage track in graph plot)
`--only-cyclic-paths`		Only visualize the cyclic paths in the cycles file
`--num-cycles <int>`	`[all]`	Only plot the first `[arg]` cycles from the cycles file
`--max-coverage <float>`	`[1.25x max coverage in region]`	Do not extend coverage plot in graph sashimi plot above `[arg]` value
`--min-mapq <int>`	15	Do not use alignment in coverage plot with MAPQ value below `[arg]`
`--gene-subset-list <str> <str> <str> ...`	`[all]`	Only indicate positions of the gene names in this list
`--hide-genes`		Do not plot positions of genes
`--gene-fontsize <float>`	12	Adjust fontsize of gene names
`--bushman-genes`		Only plot genes found in the Bushman lab cancer-related gene list ('Bushman group allOnco').
`--region <chrom:pos1-pos2>`	`[entire amplicon]`	Only plot genome region in the interval given by `chrom:start-end`

5. `hsr`

Usage: coral hsr <Required arguments> <Optional arguments>

5.1 Required arguments:

Argument	Descripion
`--lr-bam <file>`	Coordinate-sorted and indexed long read .bam file
`--cycles <file>`	AA-formatted `_cycles.txt` file
`--cn-seg <file>`	Long read segmented whole genome CN calls. Accepts CNVkit `.cns` format, or a tab-separated `.bed` file where the last column is the copy number value.
`--normal-cov <float>`	Estimated coverage of diploid genome regions

5.2 Optional arguments:

Argument	Default	Description
`--bp-match-cutoff <int>`	100	Breakpoint matching cutoff distance (bp)
`--bp-match-cutoff-clustering <int>`	2000	Crude breakpoint matching cutoff distance (bp) for clustering
`--extra-contigs <file>`		Plain-text file of additional contig names (one per line) to include alongside standard chromosomes when reading chromosome sizes from the BAM header.

6. `cycle2bed`

CoRAL provides an option to convert its cycles output in AmpliconArchitect format *_cycles.txt into *.bed format (similar to Decoil), which makes it easier for downstream analysis of these cycles.

Usage: coral cycle2bed <Required arguments> <Optional arguments>

6.1 Required arguments:

--cycle-file <file> - Input cycles file in AmpliconArchitect format.
--output-file <file> - Output cycles file in *.bed format.

6.2 Optional arguments:

--num-cycles <int> - If specified, only convert the first NUM_CYCLES cycles.

Here is an example output of cycle2bed given by the above cycles file from GBM39.

#chr	start	end	orientation	cycle_id	iscyclic	weight
chr7	54763282	55127266	+	1	True	82.346163
chr7	55155021	55609190	+	1	True	82.346163
chr7	55610095	56049369	+	1	True	82.346163
chr7	54763282	56049369	+	2	False	2.843655

FAQs

call_cnvs.sh didn't produce segmented CN calls in a .cns file?
- cnvkit.py batch contains multiple steps detailed in their documentation. The errors from a particular stage don't always percolate up when running the complete pipeline via batch, so try running each stage separately to pinpoint the root cause.

Name		Name	Last commit message	Last commit date
Latest commit History 243 Commits
coral		coral
reference		reference
sample_data		sample_data
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
poetry.lock		poetry.lock
poetry.toml		poetry.toml
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CoRAL - Complete Reconstruction of Amplifications with Long reads

Reference

Installation

Re-activating the environment

Getting copy number calls

Command line arguments to run CoRAL

1. `seed`

2. `reconstruct`

3. `cycle`

4. `plot`

5. `hsr`

6. `cycle2bed`

FAQs

About

Uh oh!

Releases

Packages

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

CoRAL - Complete Reconstruction of Amplifications with Long reads

Reference

Installation

Re-activating the environment

Getting copy number calls

Command line arguments to run CoRAL

1. seed

2. reconstruct

3. cycle

4. plot

5. hsr

6. cycle2bed

FAQs

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

1. `seed`

2. `reconstruct`

3. `cycle`

4. `plot`

5. `hsr`

6. `cycle2bed`

Packages