Commit c24186e

committed Jan 21, 2023
Readme
1 parent 563de6f commit c24186e

10 files changed

+193779
-65
lines changed
 

‎README.md

+97-48
@@ -72,71 +72,80 @@ ulimit -n 2048
7272

7373
## Quick start
7474

75-
Themisto takes as an input a set of sequences in FASTA or FASTQ format, and a file specifying the color (a non-negative integer) of each sequence. The i-th line of the color file contains the color of the i-th sequence in the sequence file. For optimal compression, use color numbers in the range [0, n-1], where n is the number of distinct colors. If no color file is given, the index is built without colors. This way, the user can later try multiple colorings without recomputing the de Bruijn graph.
76-
77-
There is an example dataset with sequences at `example_input/coli3.fna` and colors at `example_input/colors.txt`. To build the index with order k = 30, such that the index files are written to `my_index.tdbg` and `my_index.tcolors`, using the directory `temp` as temporary storage, using four threads and up to 2GB of memory.
75+
To build the Themisto index for a set of genomes, pass in a text file that contains the paths to the FASTA files of the genomes, one file per line. Each FASTA file is given a different color 0, 1, 2, 3, ... in the same order as the files appear in the list. There are three example genomes of E. coli in `example_input` and a file at `example_input/coli_file_list.txt` listing the file names. To build the index for this data, run the following command:
7876

7977
```
80-
./build/bin/themisto build --node-length 30 -i example_input/coli3.fna -c example_input/colors.txt --index-prefix my_index --temp-dir temp --mem-megas 2048 --n-threads 4
78+
./build/bin/themisto build -k 31 -i example_input/coli_file_list.txt --index-prefix my_index --temp-dir temp --mem-megas 2048 --n-threads 4 --file-colors --reverse-complements
8179
```
8280

83-
We recommend to use a fast SSD drive for the temporary directory. With a reasonable desktop workstation and an SSD drive, the program should take less than one minute on this example input. Beware: for inputs that are in the range of tens of gigabytes, the index construction may need over a terabyte of temporary disk space.
81+
This builds an index with k = 31, writing the index files to `my_index.tdbg` and `my_index.tcolors`, using the directory `temp` as temporary storage, four threads and up to 2 GB of memory. The flag `--file-colors` gives each file in the list its own color, and the flag `--reverse-complements` adds the reverse complements of all k-mers to the index. We recommend running this on a reasonable desktop workstation with a fast SSD drive for the temporary directory.
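The file list passed to `-i` is plain text with one path per line, and with `--file-colors` the position of a path in the list becomes its color. The following is a minimal sketch of generating such a list; the function name `write_file_list` and the directory layout are assumptions made for illustration and are not part of Themisto. The extension set is the list of recognized FASTA extensions from the `build` help text.

```python
from pathlib import Path

# FASTA extensions listed as recognized in the `themisto build` help text.
FASTA_EXTENSIONS = {".fasta", ".fna", ".ffn", ".faa", ".frn"}


def write_file_list(genome_dir, list_path):
    """Write one FASTA path per line and return the paths in color order.

    Hypothetical helper: the i-th returned path is the file that would
    receive color i when the list is used with --file-colors.
    """
    paths = sorted(
        p for p in Path(genome_dir).iterdir() if p.suffix in FASTA_EXTENSIONS
    )
    Path(list_path).write_text("".join(f"{p}\n" for p in paths))
    return paths


# Example usage (the directory name is an assumption):
#   for color, path in enumerate(write_file_list("my_genomes", "file_list.txt")):
#       print(color, path)
```

Sorting the paths keeps the color assignment reproducible between runs, which matters because colors are identified only by their position in the list.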
8482

8583
To align the four sequences in `example_input/queries.fna` against the index we just built, writing the output to `out.txt`, run:
8684

8785
```
88-
./build/bin/themisto pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp --out-file out.txt --n-threads 4
86+
./build/bin/themisto pseudoalign --query-file example_input/queries.fna --index-prefix my_index --temp-dir temp --out-file out.txt --n-threads 4 --threshold 0.7 --ignore-unknown-kmers
8987
```
9088

89+
This reports all colors such that at least a fraction 0.7 of the k-mers of the query are in the reference genome of the color, ignoring k-mers that are not found in any reference.
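To make the two reporting rules concrete, here is a small sketch of their semantics on toy data. It is not Themisto's implementation: the function names are hypothetical, the per-k-mer color sets are assumed to have been looked up already, an empty set stands for an unknown k-mer, and returning no colors when no k-mer is considered is an assumption of this sketch.

```python
from collections import Counter


def thresholded_colors(kmer_color_sets, threshold, ignore_unknown=True):
    """Colors present in at least `threshold` of the considered k-mers.

    kmer_color_sets holds one set of colors per k-mer of the query;
    an empty set represents a k-mer that was not found or has no colors.
    """
    if ignore_unknown:
        considered = [s for s in kmer_color_sets if s]
    else:
        considered = list(kmer_color_sets)
    if not considered:
        return []  # assumption of this sketch: nothing to report
    counts = Counter(color for s in considered for color in s)
    return sorted(c for c, n in counts.items() if n >= threshold * len(considered))


def intersection_colors(kmer_color_sets):
    """Colors shared by every known k-mer; unknown k-mers are always ignored."""
    known = [s for s in kmer_color_sets if s]
    if not known:
        return []
    return sorted(set.intersection(*known))


# Toy query with five k-mers, one of them unknown.
toy = [{0, 2}, {0, 2}, {2}, set(), {0, 1, 2}]
print(thresholded_colors(toy, 0.7))                        # [0, 2]
print(thresholded_colors(toy, 0.7, ignore_unknown=False))  # [2]
print(intersection_colors(toy))                            # [2]
```

In the toy data, color 0 appears in 3 of the 4 known k-mers (0.75), so it passes the 0.7 threshold only when the unknown k-mer is ignored.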
90+
9191
This should produce the following output file:
9292

9393
```
94-
0 43 748
95-
1 524
96-
2 855
97-
3 787
94+
0 0 2
95+
1 0 1 2
96+
2 2
97+
3 2
9898
```
9999

100-
There is one line for each query sequence. The lines may appear in a different order if parallelism was used. The first integer on a line is the 0-based rank of a query sequence in the query file, and the rest of the integers are the colors that are pseudoaligned with the query. For example, here the query with rank 2 (i.e. the 3rd sequence in the query file) pseudoaligns to color 855.
100+
There is one line for each query sequence. The lines may appear in a different order if parallelism was used. The first integer on a line is the 0-based rank of a query sequence in the query file, and the rest of the integers are the colors that are pseudoaligned with the query. For example, here the query with rank 1 (i.e. the second sequence in the query file) pseudoaligns to colors 0, 1 and 2.
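Because the lines may arrive in any order, it can be convenient to load the output into a mapping keyed by query rank. The parser below is a minimal sketch of the format described above; the function name `parse_pseudoalignment_output` is hypothetical, and the input is the example output shown earlier.

```python
def parse_pseudoalignment_output(text):
    """Map each 0-based query rank to the list of colors on its line."""
    results = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        rank = int(fields[0])
        # Everything after the rank is a color; a query with no hits
        # yields an empty list.
        results[rank] = [int(color) for color in fields[1:]]
    return results


example_output = "0 0 2\n1 0 1 2\n2 2\n3 2\n"
hits = parse_pseudoalignment_output(example_output)
for rank in sorted(hits):
    print(rank, hits[rank])
```

Iterating over `sorted(hits)` restores the order of the query file regardless of how the lines were written.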
101101

102102
## Full instructions for index construction
103103

104-
This command builds an index consisting of compact de Bruijn graph using the BOSS data structure (implemented as a [Wheeler graph](https://www.sciencedirect.com/science/article/pii/S0304397517305285)) and color information. The input is a set of reference sequences in a single file in fasta or fastq format, and a colorfile, which is a plain text file containing the colors (integers) of the reference sequences in the same order as they appear in the reference sequence file, one line per sequence.
105-
106104
```
107105
Usage:
108106
build [OPTION...]
109107
110-
-k, --node-length arg The k of the k-mers.
108+
-k, --node-length arg The k of the k-mers. (default: 0)
111109
-i, --input-file arg The input sequences in FASTA or FASTQ
112110
format. The format is inferred from the
113111
file extension. Recognized file extensions
114112
for fasta are: .fasta, .fna, .ffn, .faa and
115113
.frn . Recognized extensions for fastq are:
116-
.fastq and .fq . If the file ends with .gz,
117-
it is uncompressed into a temporary
118-
directory and the temporary file is deleted
119-
after use.
120-
-c, --color-file arg One color per sequence in the fasta file,
121-
one color per line. If not given, the
122-
sequences are given colors 0,1,2... in the
123-
order they appear in the input file.
114+
.fastq and .fq. (default: "")
115+
-c, --manual-colors arg A file containing one integer color per
116+
sequence, one color per line. If there are
117+
multiple sequence files, then this file
118+
should be a text file containing the
119+
corresponding color filename for each
120+
sequence file, one filename per line.
124121
(default: "")
122+
-f, --file-colors Creates a distinct color 0,1,2,... for each
123+
file in the input file list, in the order
124+
the files appear in the list
125+
-e, --sequence-colors Creates a distinct color 0,1,2,... for each
126+
sequence in the input, in the order the
127+
sequences are processed. This is the
128+
default behavior if no other color options
129+
are given.
130+
--no-colors Build only the de Bruijn graph without
131+
colors.
125132
-o, --index-prefix arg The de Bruijn graph will be written to
126133
[prefix].tdbg and the color structure to
127134
[prefix].tcolors.
135+
-r, --reverse-complements Also add reverse complements of the k-mers
136+
to the index.
128137
--temp-dir arg Directory for temporary files. This
129138
directory should have fast I/O operations
130139
and should have as much space as possible.
131140
-m, --mem-megas arg Number of megabytes allowed for external
132-
memory algorithms. Default: 1000 (default:
133-
1000)
141+
memory algorithms (must be at least 2048).
142+
(default: 2048)
134143
-t, --n-threads arg Number of parallel execution threads.
135144
Default: 1 (default: 1)
136145
--randomize-non-ACGT Replace non-ACGT letters with random
137146
nucleotides. If this option is not given,
138-
(k+1)-mers containing a non-ACGT character
139-
are deleted instead.
147+
k-mers containing a non-ACGT character are
148+
deleted instead.
140149
-d, --colorset-pointer-tradeoff arg
141150
This option controls a time-space tradeoff
142151
for storing and querying color sets. If
@@ -148,13 +157,26 @@ Usage:
148157
if the number of distinct color sets is
149158
small and the graph is large and has long
150159
unitigs. (default: 1)
151-
--no-colors Build only the de Bruijn graph without
152-
colors.
153160
--load-dbg If given, loads a precomputed de Bruijn
154161
graph from the index prefix. If this is
155-
given, the parameter -k must not be given
162+
given, the value of parameter -k is ignored
156163
because the order k is defined by the
157164
precomputed de Bruijn graph.
165+
-s, --coloring-structure-type arg
166+
Type of coloring structure to build
167+
("sdsl-hybrid", "roaring"). (default:
168+
sdsl-hybrid)
169+
--from-index arg Take as input a pre-built Themisto index.
170+
Builds a new index in the format specified
171+
by --coloring-structure-type. This is
172+
currently implemented by decompressing the
173+
distinct color sets in memory before
174+
re-encoding them, so this might take a lot
175+
of RAM. (default: "")
176+
-v, --verbose More verbose progress reporting into
177+
stderr.
178+
--silent Print as little as possible to stderr (only
179+
errors).
158180
-h, --help Print usage
159181
```
160182

@@ -170,25 +192,52 @@ The query file(s) should be in fasta of fastq format. The format is inferred fro
170192
Usage:
171193
pseudoalign [OPTION...]
172194
173-
-q, --query-file arg Input file of the query sequences (default:
174-
"")
175-
--query-file-list arg A list of query filenames, one line per
176-
filename (default: "")
177-
-o, --out-file arg Output filename. Print results
178-
if no output filename is given. (default: "")
179-
--out-file-list arg A file containing a list of output filenames,
180-
one per line. (default: "")
181-
-i, --index-prefix arg The index prefix that was given to the build
182-
command.
183-
--temp-dir arg Directory for temporary files.
184-
--rc Whether to to consider the reverse complement
185-
k-mers in the pseudoalignemt.
186-
-t, --n-threads arg Number of parallel exectuion threads. Default:
187-
1 (default: 1)
188-
--gzip-output Compress the output files with gzip.
189-
--sort-output Sort the lines of the out files by sequence
190-
rank in the input files.
191-
-h, --help Print usage
195+
-q, --query-file arg Input file of the query sequences (default:
196+
"")
197+
--query-file-list arg A list of query filenames, one line per
198+
filename (default: "")
199+
-o, --out-file arg Output filename. Print results if no output
200+
filename is given. (default: "")
201+
--out-file-list arg A file containing a list of output
202+
filenames, one per line. (default: "")
203+
-i, --index-prefix arg The index prefix that was given to the build
204+
command.
205+
--temp-dir arg Directory for temporary files.
206+
--threshold arg Run a thresholded pseudoalignment, i.e.
207+
report all colors that match to at least the
208+
given fraction of k-mers in the query. If not
209+
given, runs intersection pseudoalignment.
210+
(default: -1.0)
211+
--ignore-unknown-kmers Ignore in thresholded pseudoalignment all
212+
k-mers that are not found in the de Bruijn
213+
graph, or that have no colors. The
214+
intersection pseudoalignment always ignores
215+
unknown k-mers.
216+
--rc Also pseudoalign against the reverse
217+
complement of the query. Note: If the
218+
reverse complements were added to the index
219+
with the option --reverse-complements in
220+
themisto build, then this option has no
221+
effect on the pseudoalignment and the
222+
program does unnecessary work.
223+
-t, --n-threads arg Number of parallel execution threads.
224+
Default: 1 (default: 1)
225+
--gzip-output Compress the output files with gzip.
226+
--sort-output Sort the lines of the out files by sequence
227+
rank in the input files.
228+
--buffer-size-megas arg Size of the input buffer in megabytes in
229+
each thread. If this is larger than the
230+
number of nucleotides in the input divided
231+
by the number of threads, then some threads
232+
will be idle. So if your input files are
233+
really small and you have a lot of threads,
234+
consider using a small buffer. (default:
235+
8.0)
236+
-v, --verbose More verbose progress reporting into stderr.
237+
--silent Print as little as possible to stderr (only
238+
errors).
239+
-h, --help Print usage
240+
192241
```
193242

194243
Examples:
