Commit 81dedfb

clarifying data format and memory usage in docs
1 parent dcbf94a commit 81dedfb

2 files changed, with 12 additions and 3 deletions

README.md

Lines changed: 11 additions & 3 deletions
@@ -45,7 +45,13 @@ DeepXML supports multiple feature architectures such as Bag-of-embedding/Astec,
 ```txt
 * Download the (zipped file) BoW features from XML repository.
 * Extract the zipped file into data directory.
-* The following files should be available in <work_dir>/data/<dataset>
+* The following files should be available in <work_dir>/data/<dataset> for new datasets (ignore the next step)
+  - trn_X_Xf.txt
+  - trn_X_Y.txt
+  - tst_X_Xf.txt
+  - tst_X_Y.txt
+  - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
+* The following files should be available in <work_dir>/data/<dataset> if the dataset is in old format (please refer to next step to convert the data to new format)
   - train.txt
   - test.txt
   - fasttextB_embeddings_300d.npy or fasttextB_embeddings_512d.npy
@@ -89,8 +95,8 @@ An ensemble can be trained as follows. A json file is used to specify architectu
 
 * framework
   - DeepXML: Divides the XML problems in 4 modules as proposed in the paper.
-  - DeepXML-OVA: Train the method in 1-vs-all fashion [4][5], i.e., loss is computed for each label in each iteration.
-  - DeepXML-ANNS: Train the method using a label shortlist. Support is available for a fixed graph or periodic training of the ANNS graph.
+  - DeepXML-OVA: Train the architecture in 1-vs-all fashion [4][5], i.e., loss is computed for each label in each iteration.
+  - DeepXML-ANNS: Train the architecture using a label shortlist. Support is available for a fixed graph or periodic training of the ANNS graph.
 
 * dataset
   - Name of the dataset.
@@ -117,6 +123,8 @@ An ensemble can be trained as follows. A json file is used to specify architectu
 * Other file formats such as npy, npz, pickle are also supported.
 * Initializing with token embeddings (computed from FastText) leads to noticeable accuracy gain in Astec. Please ensure that the token embedding file is available in data directory, if 'init=token_embeddings', otherwise it'll throw an error.
 * Config files are made available in deepxml/configs/<framework>/<method> for datasets in XC repository. You can use them when trying out Astec/DeepXML on new datasets.
+* We conducted our experiments on a 24-core Intel Xeon 2.6 GHz machine with 440GB RAM with a single Nvidia P40 GPU. 128GB memory should suffice for most datasets.
+* Astec makes use of CPU (mainly for nmslib) as well as GPU.
 ```
 
 ## Cite as
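
As a rough illustration of the data-layout bullets in the README hunk above, the following sketch reports which format a dataset directory appears to be in. The file names come from the README text; the function name `check_dataset_dir` and its return convention are assumptions made for this example and are not part of DeepXML.

```python
from pathlib import Path

# File names as listed in the README diff above.
NEW_FORMAT = ["trn_X_Xf.txt", "trn_X_Y.txt", "tst_X_Xf.txt", "tst_X_Y.txt"]
OLD_FORMAT = ["train.txt", "test.txt"]
EMBEDDINGS = ["fasttextB_embeddings_300d.npy", "fasttextB_embeddings_512d.npy"]


def check_dataset_dir(data_dir):
    """Hypothetical helper: return (format, has_embeddings).

    format is "new", "old", or "unknown"; has_embeddings is True when
    either of the two embedding files is present.
    """
    root = Path(data_dir)
    if not root.is_dir():
        return "unknown", False
    present = {p.name for p in root.iterdir() if p.is_file()}
    if all(name in present for name in NEW_FORMAT):
        fmt = "new"
    elif all(name in present for name in OLD_FORMAT):
        fmt = "old"  # the README says this format needs converting first
    else:
        fmt = "unknown"
    has_embeddings = any(name in present for name in EMBEDDINGS)
    return fmt, has_embeddings
```

Calling `check_dataset_dir("<work_dir>/data/<dataset>")` with the placeholders filled in would return, for example, `("old", True)` for a directory holding `train.txt`, `test.txt`, and one embedding file.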

deepxml/configs/DeepXML/LF-AmazonTitles-1.3M.json

Lines changed: 1 addition & 0 deletions
@@ -11,6 +11,7 @@
 "surrogate_method": 1,
 "embedding_dims": 300,
 "top_k": 350,
+"save_top_k": 100,
 "beta": 0.60,
 "save_predictions": true,
 "trn_label_fname": "trn_X_Y.txt",
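
The diff adds `save_top_k` next to `top_k` but does not say how the two interact. One plausible reading, assumed here and not confirmed by the source, is that `top_k` labels are shortlisted per test point and only the `save_top_k` highest-scoring ones are kept when predictions are saved, which would reduce memory and disk usage. The sketch below illustrates that reading with made-up scores; `keep_top_k` and `shortlist` are hypothetical names.

```python
import heapq


def keep_top_k(scores, k):
    """Return the k highest-scoring (label, score) pairs, best first."""
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Values copied from the config hunk above.
config = {"top_k": 350, "save_top_k": 100}

# Made-up scores for a shortlist of `top_k` labels for one test point.
shortlist = {label: 1.0 / (label + 1) for label in range(config["top_k"])}

# Under the assumed semantics, only `save_top_k` entries would be stored.
saved = keep_top_k(shortlist, config["save_top_k"])
```

Here `saved` holds 100 of the 350 shortlisted labels, so the stored predictions are under a third of the shortlist size.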
