Skip to content

AmpliconSuite/CoRAL

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

243 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

CoRAL - Complete Reconstruction of Amplifications with Long reads

Reference

CoRAL is a tool which utilizes aligned, single-molecule long-read data (.bam) as input, and identifies candidate ecDNA structures. The original Genome Research '24 paper is available here: https://genome.cshlp.org/content/34/9/1344.

CoRAL only works on long-read whole-genome sequencing data (PacBio, Oxford Nanopore, etc.) - not targeted sequencing!

Installation

CoRAL can be installed and run on most modern Unix-like operating systems (e.g. Ubuntu 18.04+, CentOS 7+, macOS). Python >= 3.10 is required.

  1. Clone the repository

    git clone https://github.com/AmpliconSuite/CoRAL
    cd CoRAL
  2. Install Poetry (if not already installed)

    pip install --user pipx
    pipx install poetry

    pipx installs Poetry in its own isolated environment, preventing conflicts with system Python packages. Do not use pip install poetry directly on a system Python.

  3. Install CoRAL dependencies

    poetry install

    Poetry creates an isolated virtual environment automatically. If pysam fails to build, install htslib first (sudo apt install libhtslib-dev on Debian/Ubuntu, or brew install htslib on macOS), then re-run poetry install.

  4. Activate the environment and verify the installation

    source $(poetry env info --path)/bin/activate
    coral --help

    This works with all versions of Poetry. To deactivate the environment, run deactivate.

Re-activating the environment

After the initial install, you only need to re-run the activation command each time you start a new shell session (e.g. after logging out and back in):

cd /your/path/to/CoRAL/
source $(poetry env info --path)/bin/activate

Run this from the CoRAL repository directory. To avoid doing this manually each session, you can add it to your ~/.bashrc or ~/.zshrc.

  1. Download a Gurobi optimizer license (free for academic use)

    • Place the gurobi.lic file in $HOME/gurobi.lic.
  2. Finish installing CNVkit dependencies (recommended)

    Rscript -e 'if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager")'
    Rscript -e 'BiocManager::install("DNAcopy")'

Getting copy number calls

Before running CoRAL, you will need genome-wide copy number (CN) calls generated from your long-read data.

  • If you have these already, simply ensure that they are in a .bed format like so:

    chrom start end CN

  • If you don't have these then you can run CNVkit (installed as a dependency) to generate them, by running

    ./scripts/call_cnvs.sh <input.bam> ./reference/hg38full_ref_5k.cnn <output_dir>

    This will create a file called [input].cns, which you can feed to CoRAL for it's --cn_segs argument.

Command line arguments to run CoRAL

CoRAL and its various run-modes can by used in the following manner

coral [mode] [mode arguments]

The run modes are as follows:

  1. seed: Identify seed amplified intervals (copy number gain regions) with input genome-wide CN calls.
  2. reconstruct: The main running mode of CoRAL. Perform a complete reconstuction of breakpoint graph and cycle extraction from seed amplified intervals.
  3. cycle: Run cycle extraction on a previously generated breakpoint graph. NOTE: This requires the breakpoint graph to be generated with CoRAL v2.1.0 or later, as we require path constraints and amplicon intervals to be included in the provided *_graph.txt file.
  4. plot: Create plots of cycles/paths from cycle extraction and/or breakpoint graph sashimi plot.
  5. hsr: Identify candidate locations of chromosomal homogenously staining region (HSR) integration points for ecDNA.
  6. cycle2bed: Convert the AmpliconArchitect (AA) style *_cycles.txt file to extended .bed format. The AA format is also supported by CoRAL.

1. seed

As the seed amplification intervals are required by the main script reconstruct mode, it is suggested the user first run seed mode to generate seed amplification intervals.

Usage: coral seed <Required arguments> <Optional arguments>

Required arguments:

  • --cn-seg <file> - Long read segmented whole genome CN calls. Accepts CNVkit .cns format, or a tab-separated .bed file where the last column is the copy number value.

  • --output-prefix <string> - Prefix of the output *_CNV_SEEDS.bed file.

  • --lr-bam <file> - Coordinate sorted BAM file, used to read chromosome lengths from the header.

  • --centromere-file <file> - Centromere BED file for the reference genome used.

    The file must be tab-separated with columns chr, start, end (BED3 format; additional columns are ignored). Each contig may have one entry, or two entries that overlap or directly abut (they will be merged). Multiple distinct non-adjacent regions for the same contig will raise an error. Contigs absent from the file are treated as having no centromere (all segments on that contig are eligible as seeds). Example:

    chr1	121700000	125100000
    chr2	91800000	100200000
    

    The UCSC centromeres or gap track (available via the Table Browser for most assemblies) is a convenient source for this file.

Optional arguments:

  • --gain <float> - A minimum CN threshold (with the assumption of diploid genome) for a particular CN segment to be considered as a seed. Default is 6.0.
  • --min-seed-size <int> - Minimum size (in bp) for a CN segment to be considered as a seed. Default is 100000.
  • --max-seg-gap <int> - Maximum gap size (in bp) to merge two proximal CN segments to be considered as seed intervals. If at least two segments are merged, then they will be treated as a single candidate to be filtered with --min-seed-size, and their aggregate size will be compared with the value. Default is 300000.
  • --extra-contigs <file> - Plain-text file of additional contig names (one per line) to include alongside standard chromosomes when reading chromosome sizes from the BAM header.

2. reconstruct

Usage: reconstruct <Required arguments> <Optional arguments>

2.1 Required arguments:

  • --lr-bam <file> - Coordinate sorted *.BAM file, with *.bai index (mapped to the provided reference genome) in the same directory.
  • --cnv-seed <file> - *.bed file with a putative list of seed amplification intervals. The seed amplification intervals can be obtained through running seed mode, or provided manually.
  • --output-prefix <path> - Prefix (including directory) to which the output graph.txt and cycles.txt files will be written.
  • --cn-seg <file> - Long read segmented whole genome CN calls. Accepts CNVkit .cns format, or a tab-separated .bed file where the last column is the copy number value.

2.2 Optional arguments:

  • --min-bp-support <float> - Filter out breakpoints with less than (min_bp_support * normal coverage) long read support in breakpoint graph construction. The default value is set to 1.0, meaning to filter out breakpoints supported by less than diploid coverage, but it is highly recommended to specify a much larger value, e.g. 10.0 to obtain a cleaner breakpoint graph and the dominating ecDNA cycle(s).
  • --skip-cycle-decomp - If specified, will stop by only outputting the breakpoint graph files *_graph.txt (see Expected output below) for all amplicons and not extract cycles from the graph and output *_cycles.txt.
  • --output-path_constraints <all|longest|none> - Options to output path constraints given by long reads. By default, CoRAL will only output the longest path constraints in *_graph.txt and *_cycles.txt files (see "Expected output" below). If all or none is specified, output all or no path constraints in both graph and cycles files.
  • --cycle-decomp-mode <min_cycles|max_weight> - min_cycles: minimize the number of cycles/paths, if solver could not find a feasible solution within time limit, switch to greedy cycle extraction; max_weight: greedy cycle extraction by iteratively extracting cycles or paths with maximum length weighted CN and satisfying the most path constraints. Default mode is max_weight.
  • --cycle-decomp-alpha <float between [0, 1]> - Parameter used to balance CN weight and path constraints in the objective function of greedy cycle extraction. Default value is 0.01, higher values favor the satisfaction of more path constraints.
  • --solver-time-limit <int> - Maximum running time (in seconds) reserved for solving a single quadratic program using the chosen integer program solver (e.g. Gurobi, SCIP). The solver would return the best solution(s) it currently found, regardless of the optimality status, when reaching this time limit. Default value is 7200 (i.e., 2 hours).
  • --solver-threads <int> - Number of threads reserved for for solving the quadratic program with Gurobi (integer program solver). If not specified (and by default), the solver would attempt to use up all available cores in the working machine.
  • --solver <choice> - Solver for cycle extraction. Must be one of [gurobi_direct, scip].
  • --global-time-limit <int> - Maximum running time (in seconds) reserved for the entire cycle extraction process. Default value is 21600 (i.e., 6 hours).
  • --postprocess-greedy-sol - If specified, automatically postprocess the cycles/paths returned in greedy cycle extraction, by solving the full quadratic program to minimize the number of cycles/paths starting with the greedy cycle extraction solution (as an initial solution).
  • --log-file <file> - Name of the main *.log file, which can be used to trace the status of reconstruct run(s).

2.3 Expected output:

CoRAL may identify and reconstruct a few distinct focal amplifications in the input *.BAM sample, each will be organized as an amplicon, which includes a connected component of amplified intervals and their connections by discordant edges. CoRAL writes the following files to the directory specified with --output_dir.

  • Graph fil-e: For each amplicon, a tab-separated text file named output_dir/amplicon*_graph.txt describing the sequence edges, concordant edges and discordant edges in the graph and their predicted copy count. Note that the graph files outputted by CoRAL have the same format as those outputted by AmpliconArchitect (and therefore the files can be used interchangeably with AmpliconArchitect). Here is an example graph file from GBM39, a cell line with EGFR amplified on ecDNA.
    • As of version 2.1.0, CoRAL additionally includes path constraints and amplicon intervals in the *_graph.txt file. This results in the graph being fully self-contained and able to be passed to cycle extraction without re-parsing the BAM file. For more information on how to interpret this metadata, visit our wiki.
SequenceEdge: StartPosition, EndPosition, PredictedCN, AverageCoverage, Size, NumberOfLongReads
sequence	chr7:54659673-	chr7:54763281+	4.150534	45.907363	103609	576
sequence	chr7:54763282-	chr7:55127266+	89.340352	1052.714362	363985	40637
sequence	chr7:55127267-	chr7:55155020+	2.843655	32.729552	27754	172
sequence	chr7:55155021-	chr7:55609190+	89.340352	1013.182857	454170	49675
sequence	chr7:55609191-	chr7:55610094+	2.868261	31.027655	904	915
sequence	chr7:55610095-	chr7:56049369+	89.340352	1023.280633	439275	49106
sequence	chr7:56049370-	chr7:56149664+	4.150534	49.623899	100295	562
BreakpointEdge: StartPosition->EndPosition, PredictedCN, NumberOfLongReads
concordant	chr7:54763281+->chr7:54763282-	4.150534	26
concordant	chr7:55127266+->chr7:55127267-	2.843655	36
concordant	chr7:55155020+->chr7:55155021-	2.843655	32
concordant	chr7:55609190+->chr7:55609191-	2.697741	38
concordant	chr7:55610094+->chr7:55610095-	2.697741	41
concordant	chr7:56049369+->chr7:56049370-	4.150534	45
discordant	chr7:55610095-->chr7:55609190+	86.642611	869
discordant	chr7:56049369+->chr7:54763282-	85.189818	981
discordant	chr7:55155021-->chr7:55127266+	86.496697	978
...
PathConstraint: Path, Support
path_constraint e2+:1,c2-:1,e3+:1,c3-:1,e4+:1   6
path_constraint e4+:1,c4-:1,e5+:1,c5-:1,e6+:1   34
AmpliconIntervals: chr, start, end
interval        chr7    54659673        56149664
  • Cycles file: For each amplicon, a tab-separated text file named output_dir_amplicon*_cycles.txt describing the list of cycles and paths returned from cycle extraction. Note that the cycles files output by CoRAL have mostly the same format as those output by AmpliconArchitect (and therefore the files can be used interchangeably with AmpliconArchitect in most cases). Specifically a cycles file includes (i) the list of amplified intervals; (ii) the list of sequence edges; (iii) the list of cycles and paths, where an entry starts with 0+ and ends with 0- in Segments indicates a path - these lines have the same format as AmpliconArchitect output. CoRAL's cycles files additionally include (iv) a list of longest (i.e., there are no paths that can form a sub/super-path to each other) path constraint indicated by long reads, and used in CoRAL's cycle extraction. Here is an example cycles file corresponding to the above graph file from GBM39.
Interval	1	chr7	54659673	56149664
List of cycle segments
Segment	1	chr7	54659673	54763281
Segment	2	chr7	54763282	55127266
Segment	3	chr7	55127267	55155020
Segment	4	chr7	55155021	55609190
Segment	5	chr7	55609191	55610094
Segment	6	chr7	55610095	56049369
Segment	7	chr7	56049370	56149664
List of longest subpath constraints
Path constraint	1	2+,3+,4+	Support<=6	Satisfied
Path constraint	2	4+,5+,6+	Support<=34	Satisfied
Cycle=1;Copy_count=82.34616279663038;Segments=2+,4+,6+;Path_constraints_satisfied=
Cycle=2;Copy_count=2.8436550275157644;Segments=0+,2+,3+,4+,5+,6+,0-;Path_constraints_satisfied=1,2

Note that if --output-all-path-constraints is specified, then all path constraints given by long reads will be written to in *.cycles file.

  • Other outputs include the output_dir_amplicon*_model.lp file(s) and output_dir_amplicon*_model.log file(s) given by Gurobi (integer program solver), for each amplicon, respectively describing the quadratic (constrained) program in a human readable format, and the standard output produced by Gurobi.

3. cycle

Usage:

  • coral cycle <Required arguments> <Optional arguments> for cycle extraction from a single amplicon (breakpoint graph);
  • coral cycle_all <Required arguments> <Optional arguments> for cycle extraction from all amplicons in a directory.

3.1 Required arguments:

Argument Descripion
--graph <file> (for cycle mode) AA or CoRAL-formatted _graph.txt file.
--bp-dir <directory> (for cycle_all mode) Directory containing AA or CoRAL-formatted _graph.txt files for different amplicons.
--output-prefix <path> Prefix (including directory) of the output cycles.txt files.

3.2 Optional arguments:

Argument Default Description
--alpha <float> 0.01 Parameter used to balance CN weight and path constraints in the objective function of greedy cycle extraction. Default value is 0.01, higher values favor the satisfaction of more path constraints.
--solver-time-limit <int> 7200 Time limit for cycle extraction (in seconds)
--threads <int> -1 Number of threads for cycle extraction. If not specified, use all available cores.
--solver <choice> gurobi_direct Solver for cycle extraction. Must be one of [gurobi_direct, scip].
--cycle-decomp-mode <choice> max_weight Cycle extraction mode. Must be one of [min_cycles, max_weight] (see reconstruct mode).
--output-path-constraints <choice> longest Options for output path constraints in graph and cycle files. Must be one of [all, longest, none] (see reconstruct mode).
--postprocess-greedy-sol False If specified, automatically postprocess the cycles/paths returned in greedy cycle extraction, by solving the full quadratic program to minimize the number of cycles/paths starting with the greedy cycle extraction solution (as an initial solution).

4. plot

Usage: coral plot <Required arguments> <Optional arguments>

4.1 Required arguments:

Argument Description
--ref <choice> Reference genome. Must be one of [hg19, hg38, mm10]
--output-prefix <str> Prefix of output files

At least one of --graph or --cycles must also be provided.

4.2 Optional arguments:

Argument Default Description
--graph <file> AA-formatted _graph.txt file
--cycles <file> AA-formatted _cycles.txt file
--bam <file> Sorted indexed BAM file (required for coverage track in graph plot)
--only-cyclic-paths Only visualize the cyclic paths in the cycles file
--num-cycles <int> [all] Only plot the first [arg] cycles from the cycles file
--max-coverage <float> [1.25x max coverage in region] Do not extend coverage plot in graph sashimi plot above [arg] value
--min-mapq <int> 15 Do not use alignment in coverage plot with MAPQ value below [arg]
--gene-subset-list <str> <str> <str> ... [all] Only indicate positions of the gene names in this list
--hide-genes Do not plot positions of genes
--gene-fontsize <float> 12 Adjust fontsize of gene names
--bushman-genes Only plot genes found in the Bushman lab cancer-related gene list ('Bushman group allOnco').
--region <chrom:pos1-pos2> [entire amplicon] Only plot genome region in the interval given by chrom:start-end

5. hsr

Usage: coral hsr <Required arguments> <Optional arguments>

5.1 Required arguments:

Argument Descripion
--lr-bam <file> Coordinate-sorted and indexed long read .bam file
--cycles <file> AA-formatted _cycles.txt file
--cn-seg <file> Long read segmented whole genome CN calls. Accepts CNVkit .cns format, or a tab-separated .bed file where the last column is the copy number value.
--normal-cov <float> Estimated coverage of diploid genome regions

5.2 Optional arguments:

Argument Default Description
--bp-match-cutoff <int> 100 Breakpoint matching cutoff distance (bp)
--bp-match-cutoff-clustering <int> 2000 Crude breakpoint matching cutoff distance (bp) for clustering
--extra-contigs <file> Plain-text file of additional contig names (one per line) to include alongside standard chromosomes when reading chromosome sizes from the BAM header.

6. cycle2bed

CoRAL provides an option to convert its cycles output in AmpliconArchitect format *_cycles.txt into *.bed format (similar to Decoil), which makes it easier for downstream analysis of these cycles.

Usage: coral cycle2bed <Required arguments> <Optional arguments>

6.1 Required arguments:

  • --cycle-file <file> - Input cycles file in AmpliconArchitect format.
  • --output-file <file> - Output cycles file in *.bed format.

6.2 Optional arguments:

  • --num-cycles <int> - If specified, only convert the first NUM_CYCLES cycles.

Here is an example output of cycle2bed given by the above cycles file from GBM39.

#chr	start	end	orientation	cycle_id	iscyclic	weight
chr7	54763282	55127266	+	1	True	82.346163
chr7	55155021	55609190	+	1	True	82.346163
chr7	55610095	56049369	+	1	True	82.346163
chr7	54763282	56049369	+	2	False	2.843655

FAQs

  • call_cnvs.sh didn't produce segmented CN calls in a .cns file?
    • cnvkit.py batch contains multiple steps detailed in their documentation. The errors from a particular stage don't always percolate up when running the complete pipeline via batch, so try running each stage separately to pinpoint the root cause.

About

CoRAL: Reconstruction of focal amplifications with long reads

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors