README.md (+97 -48)
@@ -72,71 +72,80 @@ ulimit -n 2048
72
72
73
73
## Quick start
74
74
75
-
Themisto takes as an input a set of sequences in FASTA or FASTQ format, and a file specifying the color (a non-negative integer) of each sequence. The i-th line of the color file contains the color of the i-th sequence in the sequence file. For optimal compression, use color numbers in the range [0, n-1], where n is the number of distinct colors. If no color file is given, the index is built without colors. This way, the user can later try multiple colorings without recomputing the de Bruijn graph.
76
-
77
-
There is an example dataset with sequences at `example_input/coli3.fna` and colors at `example_input/colors.txt`. To build the index with order k = 30, such that the index files are written to `my_index.tdbg` and `my_index.tcolors`, using the directory `temp` as temporary storage, using four threads and up to 2GB of memory.
75
+
To build the Themisto index for a set of genomes, you need to pass in a text file that contains the paths to the FASTA files of the genomes, one file per line. Each FASTA file is given a different color 0, 1, 2, 3, ... in the same order as they appear in the list. There are three example genomes of E. coli in `example_input` and a file at `example_input/coli_file_list.txt` listing the file names. To build the index for this data, run the following command:
We recommend using a fast SSD drive for the temporary directory. With a reasonable desktop workstation and an SSD drive, the program should take less than one minute on this example input. Beware: for inputs in the range of tens of gigabytes, the index construction may need over a terabyte of temporary disk space.
81
+
This builds an index with k = 31, such that the index files are written to `my_index.tdbg` and `my_index.tcolors`, using the directory `temp` as temporary storage, four threads and up to 2GB of memory. The flag `--reverse-complements` adds the reverse complements of all k-mers to the index. We recommend using a fast SSD drive for the temporary directory.
84
82
85
83
To align the four sequences in `example_input/queries.fna` against the index we just built, writing the output to `out.txt`, run:
This reports all colors such that at least a fraction 0.7 of the k-mers of the query are in the reference genome of the color, ignoring k-mers that are not found in any reference.
90
+
91
91
This should produce the following output file:
92
92
93
93
```
94
-
0 43 748
95
-
1 524
96
-
2 855
97
-
3 787
94
+
0 0 2
95
+
1 0 1 2
96
+
2 2
97
+
3 2
98
98
```
99
99
100
-
There is one line for each query sequence. The lines may appear in a different order if parallelism was used. The first integer on a line is the 0-based rank of a query sequence in the query file, and the rest of the integers are the colors that are pseudoaligned with the query. For example, here the query with rank 2 (i.e. the 3rd sequence in the query file) pseudoaligns to color 855.
100
+
There is one line for each query sequence. The lines may appear in a different order if parallelism was used. The first integer on a line is the 0-based rank of a query sequence in the query file, and the rest of the integers are the colors that are pseudoaligned with the query. For example, here the query with rank 1 (i.e. the second sequence in the query file) pseudoaligns to colors 0, 1 and 2.
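
Because the lines may arrive in any order, it can be convenient to load the output into a mapping keyed by query rank. The following is a minimal Python sketch of that idea, not part of Themisto itself; the function name `parse_pseudoalignment` is a hypothetical helper, and the input lines are the example output shown above.

```python
from typing import Dict, List


def parse_pseudoalignment(lines: List[str]) -> Dict[int, List[int]]:
    """Map each 0-based query rank to the list of colors reported for it."""
    results: Dict[int, List[int]] = {}
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        # First integer is the query rank, the remaining integers are colors.
        results[int(fields[0])] = [int(c) for c in fields[1:]]
    return results


# The example output from this section, one string per line of the file.
example_output = ["0 0 2", "1 0 1 2", "2 2", "3 2"]
hits = parse_pseudoalignment(example_output)

# Iterate in rank order regardless of the order of the lines in the file.
for rank in sorted(hits):
    print(rank, hits[rank])
```

A query whose line contains only its rank ends up with an empty color list in this sketch.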
101
101
102
102
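
The thresholded pseudoalignment used in the query above (fraction 0.7, ignoring unknown k-mers) can be pictured with a toy model. The sketch below is only an illustration under stated assumptions: it uses a plain Python dictionary from k-mers to color sets instead of Themisto's actual index, the names `kmers`, `threshold_pseudoalign` and `toy_index` are hypothetical, and reverse complements are not modeled.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """All length-k substrings of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def threshold_pseudoalign(query: str, k: int,
                          kmer_colors: Dict[str, Set[int]],
                          threshold: float,
                          ignore_unknown: bool = True) -> List[int]:
    """Report colors present in at least `threshold` of the query's k-mers."""
    color_sets = [kmer_colors.get(x, set()) for x in kmers(query, k)]
    if ignore_unknown:
        # Drop k-mers that are missing from the index or have no colors.
        color_sets = [s for s in color_sets if s]
    if not color_sets:
        return []
    counts: Dict[int, int] = {}
    for colors in color_sets:
        for c in colors:
            counts[c] = counts.get(c, 0) + 1
    total = len(color_sets)
    return sorted(c for c, n in counts.items() if n >= threshold * total)


# Toy index with k = 3 (an assumption for illustration only).
toy_index = {"ACG": {0, 1}, "CGT": {0}, "GTT": {1}}

# The query ACGTA has k-mers ACG, CGT, GTA; GTA is not in the toy index.
print(threshold_pseudoalign("ACGTA", 3, toy_index, 0.7, ignore_unknown=True))   # [0]
print(threshold_pseudoalign("ACGTA", 3, toy_index, 0.7, ignore_unknown=False))  # []
```

With unknown k-mers ignored, color 0 covers 2 of 2 known k-mers and is reported; when they are counted, it covers only 2 of 3, which is below 0.7.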
## Full instructions for index construction
103
103
104
-
This command builds an index consisting of compact de Bruijn graph using the BOSS data structure (implemented as a [Wheeler graph](https://www.sciencedirect.com/science/article/pii/S0304397517305285)) and color information. The input is a set of reference sequences in a single file in fasta or fastq format, and a colorfile, which is a plain text file containing the colors (integers) of the reference sequences in the same order as they appear in the reference sequence file, one line per sequence.
105
-
106
104
```
107
105
Usage:
108
106
build [OPTION...]
109
107
110
-
-k, --node-length arg The k of the k-mers.
108
+
-k, --node-length arg The k of the k-mers. (default: 0)
111
109
-i, --input-file arg The input sequences in FASTA or FASTQ
112
110
format. The format is inferred from the
113
111
file extension. Recognized file extensions
114
112
for fasta are: .fasta, .fna, .ffn, .faa and
115
113
.frn . Recognized extensions for fastq are:
116
-
.fastq and .fq . If the file ends with .gz,
117
-
it is uncompressed into a temporary
118
-
directory and the temporary file is deleted
119
-
after use.
120
-
-c, --color-file arg One color per sequence in the fasta file,
121
-
one color per line. If not given, the
122
-
sequences are given colors 0,1,2... in the
123
-
order they appear in the input file.
114
+
.fastq and .fq. (default: "")
115
+
-c, --manual-colors arg A file containing one integer color per
116
+
sequence, one color per line. If there are
117
+
multiple sequence files, then this file
118
+
should be a text file containing the
119
+
corresponding color filename for each
120
+
sequence file, one filename per line.
124
121
(default: "")
122
+
-f, --file-colors Creates a distinct color 0,1,2,... for each
123
+
file in the input file list, in the order
124
+
the files appear in the list
125
+
-e, --sequence-colors Creates a distinct color 0,1,2,... for each
126
+
sequence in the input, in the order the
127
+
sequences are processed. This is the
128
+
default behavior if no other color options
129
+
are given.
130
+
--no-colors Build only the de Bruijn graph without
131
+
colors.
125
132
-o, --index-prefix arg The de Bruijn graph will be written to
126
133
[prefix].tdbg and the color structure to
127
134
[prefix].tcolors.
135
+
-r, --reverse-complements Also add reverse complements of the k-mers
136
+
to the index.
128
137
--temp-dir arg Directory for temporary files. This
129
138
directory should have fast I/O operations
130
139
and should have as much space as possible.
131
140
-m, --mem-megas arg Number of megabytes allowed for external
132
-
memory algorithms. Default: 1000 (default:
133
-
1000)
141
+
memory algorithms (must be at least 2048).
142
+
(default: 2048)
134
143
-t, --n-threads arg Number of parallel execution threads.
135
144
Default: 1 (default: 1)
136
145
--randomize-non-ACGT Replace non-ACGT letters with random
137
146
nucleotides. If this option is not given,
138
-
(k+1)-mers containing a non-ACGT character
139
-
are deleted instead.
147
+
k-mers containing a non-ACGT character are
148
+
deleted instead.
140
149
-d, --colorset-pointer-tradeoff arg
141
150
This option controls a time-space tradeoff
142
151
for storing and querying color sets. If
@@ -148,13 +157,26 @@ Usage:
148
157
if the number of distinct color sets is
149
158
small and the graph is large and has long
150
159
unitigs. (default: 1)
151
-
--no-colors Build only the de Bruijn graph without
152
-
colors.
153
160
--load-dbg If given, loads a precomputed de Bruijn
154
161
graph from the index prefix. If this is
155
-
given, the parameter -k must not be given
162
+
given, the value of parameter -k is ignored
156
163
because the order k is defined by the
157
164
precomputed de Bruijn graph.
165
+
-s, --coloring-structure-type arg
166
+
Type of coloring structure to build
167
+
("sdsl-hybrid", "roaring"). (default:
168
+
sdsl-hybrid)
169
+
--from-index arg Take as input a pre-built Themisto index.
170
+
Builds a new index in the format specified
171
+
by --coloring-structure-type. This is
172
+
currently implemented by decompressing the
173
+
distinct color sets in memory before
174
+
re-encoding them, so this might take a lot
175
+
of RAM. (default: "")
176
+
-v, --verbose More verbose progress reporting into
177
+
stderr.
178
+
--silent Print as little as possible to stderr (only
179
+
errors).
158
180
-h, --help Print usage
159
181
```
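
The difference between `--file-colors` and `--sequence-colors` in the usage text above can be illustrated with a short sketch. This is not Themisto code: the helper names and the `seqs_per_file` list (the number of sequences in each input file) are hypothetical, and the sketch assumes that sequences are processed file by file in list order.

```python
from typing import List


def file_colors(seqs_per_file: List[int]) -> List[int]:
    """One color per file: every sequence in the i-th file gets color i."""
    return [file_idx
            for file_idx, n_seqs in enumerate(seqs_per_file)
            for _ in range(n_seqs)]


def sequence_colors(seqs_per_file: List[int]) -> List[int]:
    """One color per sequence, numbered in processing order."""
    return list(range(sum(seqs_per_file)))


# Three hypothetical input files holding 2, 1 and 3 sequences.
counts = [2, 1, 3]
print(file_colors(counts))      # [0, 0, 1, 2, 2, 2]
print(sequence_colors(counts))  # [0, 1, 2, 3, 4, 5]
```

Each returned list gives the color of every sequence in processing order, which is the same information a manual color file would carry, one color per line.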
160
182
@@ -170,25 +192,52 @@ The query file(s) should be in fasta of fastq format. The format is inferred fro
170
192
Usage:
171
193
pseudoalign [OPTION...]
172
194
173
-
-q, --query-file arg Input file of the query sequences (default:
174
-
"")
175
-
--query-file-list arg A list of query filenames, one line per
176
-
filename (default: "")
177
-
-o, --out-file arg Output filename. Print results
178
-
if no output filename is given. (default: "")
179
-
--out-file-list arg A file containing a list of output filenames,
180
-
one per line. (default: "")
181
-
-i, --index-prefix arg The index prefix that was given to the build
182
-
command.
183
-
--temp-dir arg Directory for temporary files.
184
-
--rc Whether to to consider the reverse complement
185
-
k-mers in the pseudoalignemt.
186
-
-t, --n-threads arg Number of parallel exectuion threads. Default:
187
-
1 (default: 1)
188
-
--gzip-output Compress the output files with gzip.
189
-
--sort-output Sort the lines of the out files by sequence
190
-
rank in the input files.
191
-
-h, --help Print usage
195
+
-q, --query-file arg Input file of the query sequences (default:
196
+
"")
197
+
--query-file-list arg A list of query filenames, one line per
198
+
filename (default: "")
199
+
-o, --out-file arg Output filename. Print results if no output
200
+
filename is given. (default: "")
201
+
--out-file-list arg A file containing a list of output
202
+
filenames, one per line. (default: "")
203
+
-i, --index-prefix arg The index prefix that was given to the build
204
+
command.
205
+
--temp-dir arg Directory for temporary files.
206
+
--threshold arg Run a thresholded pseudoalignment, i.e.
207
+
report all colors that match to at least the
208
+
given fraction of k-mers in the query. If not
209
+
given, runs intersection pseudoalignment.
210
+
(default: -1.0)
211
+
--ignore-unknown-kmers Ignore in thresholded pseudoalignment all
212
+
k-mers that are not found in the de Bruijn
213
+
graph, or that have no colors. The
214
+
intersection pseudoalignment always ignores
215
+
unknown k-mers.
216
+
--rc Also pseudoalign against the reverse
217
+
complement of the query. Note: If the
218
+
reverse complements were added to the index
219
+
with the option --reverse-complements in
220
+
themisto build, then this option has no
221
+
effect on the pseudoalignment and the
222
+
program does unnecessary work.
223
+
-t, --n-threads arg Number of parallel execution threads.
224
+
Default: 1 (default: 1)
225
+
--gzip-output Compress the output files with gzip.
226
+
--sort-output Sort the lines of the out files by sequence
227
+
rank in the input files.
228
+
--buffer-size-megas arg Size of the input buffer in megabytes in
229
+
each thread. If this is larger than the
230
+
number of nucleotides in the input divided
231
+
by the number of threads, then some threads
232
+
will be idle. So if your input files are
233
+
really small and you have a lot of threads,
234
+
consider using a small buffer. (default:
235
+
8.0)
236
+
-v, --verbose More verbose progress reporting into stderr.
237
+
--silent Print as little as possible to stderr (only