Distributed running nytimes through mpi #69
Description
After using dump_library to split the libsvm file into 2 parts, I sent block.1, vocab.1, vocab.1.txt, and vocab.nytimes.txt.1 to the second node.
Then I executed this command on the first node:
mpiexec -machinefile mpi_machine_file ../bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir ./data/nytimes/ -data_capacity 800
My mpi_machine_file is
10.107.14.100
10.107.14.70
10.107.14.100 is the first node and 10.107.14.70 is the second node.
But I don't think it is running correctly. Here is the log. Could anyone help?
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
...
Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
...
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
...
[INFO] [2018-04-02 03:13:14] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 03:13:17] Rank = 0, Alias Time used: 7.33 s
[INFO] [2018-04-02 02:01:15] Rank = 0, Training Time used: 7593.99 s
[INFO] [2018-04-02 02:01:16] Rank = 0, sampling throughput: 13075.037526 (tokens/thread/sec)
[INFO] [2018-04-02 02:01:17] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:01:17] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:01:17] Rank = 0, Evaluation Time used: 1100.00 s
[DEBUG] [2018-04-02 02:01:17] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:01:46] Rank = 0, Training Time used: 7615.72 s
[INFO] [2018-04-02 02:01:46] Rank = 0, sampling throughput: 13037.700213 (tokens/thread/sec)
[INFO] [2018-04-02 02:01:49] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:01:49] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:01:49] Rank = 0, Evaluation Time used: 969.25 s
[DEBUG] [2018-04-02 02:01:49] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:02:03] Rank = 0, Training Time used: 7651.74 s
[INFO] [2018-04-02 02:02:03] Rank = 0, sampling throughput: 12976.338971 (tokens/thread/sec)
[INFO] [2018-04-02 03:14:18] doc likelihood : -6.422194e+08
[INFO] [2018-04-02 02:02:15] Rank = 0, Training Time used: 7662.92 s
[INFO] [2018-04-02 02:02:15] Rank = 0, sampling throughput: 12957.428332 (tokens/thread/sec)
[INFO] [2018-04-02 03:14:43] word likelihood : 5.329020e+08
[INFO] [2018-04-02 03:14:43] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:14:43] Rank = 0, Evaluation Time used: 985.55 s
[DEBUG] [2018-04-02 03:14:44] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:14:56] doc likelihood : -6.421577e+08
[INFO] [2018-04-02 03:15:26] Rank = 0, Training Time used: 8809.39 s
[INFO] [2018-04-02 03:15:26] Rank = 0, sampling throughput: 11270.751379 (tokens/thread/sec)
[INFO] [2018-04-02 03:15:45] word likelihood : 5.329053e+08
[INFO] [2018-04-02 03:15:45] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:15:45] Rank = 0, Evaluation Time used: 1008.38 s
[INFO] [2018-04-02 02:03:32] Rank = 0, Training Time used: 8703.96 s
[INFO] [2018-04-02 02:03:32] Rank = 0, sampling throughput: 11407.660278 (tokens/thread/sec)
[INFO] [2018-04-02 02:03:55] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 02:03:56] Rank = 0, Training Time used: 8760.07 s
[INFO] [2018-04-02 02:03:56] Rank = 0, sampling throughput: 11334.575356 (tokens/thread/sec)
[INFO] [2018-04-02 02:03:59] Rank = 0, Alias Time used: 5.45 s
[INFO] [2018-04-02 02:04:07] doc likelihood : -6.422155e+08
[INFO] [2018-04-02 02:04:17] Rank = 0, Training Time used: 7770.12 s
[INFO] [2018-04-02 02:04:17] Rank = 0, sampling throughput: 12778.687178 (tokens/thread/sec)
[DEBUG] [2018-04-02 03:16:33] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:04:26] word likelihood : 5.328867e+08
[INFO] [2018-04-02 02:04:26] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:04:26] Rank = 0, Evaluation Time used: 830.64 s
[INFO] [2018-04-02 03:16:41] Rank = 0, Iter = 1, Block = 0, Slice = 0
[DEBUG] [2018-04-02 02:04:26] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:16:45] Rank = 0, Alias Time used: 7.17 s
[INFO] [2018-04-02 03:16:59] Rank = 0, Training Time used: 8880.24 s
[INFO] [2018-04-02 03:16:59] Rank = 0, sampling throughput: 11181.214254 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:02] doc likelihood : -6.421963e+08
[INFO] [2018-04-02 03:17:17] doc likelihood : -6.422194e+08
[INFO] [2018-04-02 02:05:09] Rank = 0, Training Time used: 7802.58 s
[INFO] [2018-04-02 02:05:09] Rank = 0, sampling throughput: 12725.500550 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:17] doc likelihood : -6.421963e+08
[INFO] [2018-04-02 02:05:20] Rank = 0, Training Time used: 7819.40 s
[INFO] [2018-04-02 02:05:20] [INFO] [2018-04-02 03:17:40] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 03:17:43] Rank = 0, Alias Time used: 5.78 s
Rank = 0, sampling throughput: 12698.067672 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:30] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:05:30] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:05:30] Rank = 0, Evaluation Time used: 515.49 s
[INFO] [2018-04-02 02:05:33] Rank = 0, Training Time used: 8845.28 s
[INFO] [2018-04-02 02:05:33] Rank = 0, sampling throughput: 11225.364458 (tokens/thread/sec)
[DEBUG] [2018-04-02 02:05:45] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:18:07] word likelihood : 5.329020e+08
[INFO] [2018-04-02 03:18:07] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:18:07] Rank = 0, Evaluation Time used: 1170.88 s
[INFO] [2018-04-02 02:05:58] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:05:58] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:05:58] Rank = 0, Evaluation Time used: 509.73 s
...
- The log is full of repeated lines and never shows a second rank: every line reports rank 0 (Rank 0/1).
- A single iteration takes several hours, much slower than running on just one node. I checked resource usage, and both nodes are using 400G of memory.
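
To put a number on the repetition, here is a minimal sketch that tallies the `Rank <rank>/<size>:` prefixes in a log. The helper name `tally_ranks` is hypothetical, and the regex is only assumed from the log lines pasted above, not taken from LightLDA or Multiverso. If `(0, 1)` is the only pair that ever appears, each process is reporting itself as rank 0 of a world of size 1, which would be consistent with the processes not sharing a single MPI job. That reading is my interpretation of the log format, not something confirmed by the project.

```python
import re
from collections import Counter

# Assumed from the pasted log: lines contain a "Rank <rank>/<size>:" prefix.
RANK_RE = re.compile(r"Rank (\d+)/(\d+):")


def tally_ranks(log_text):
    """Count how often each (rank, world size) pair appears in the log text."""
    return Counter(
        (int(rank), int(size)) for rank, size in RANK_RE.findall(log_text)
    )


# Three lines copied from the log above, used as sample input.
sample = """\
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
"""

counts = tally_ranks(sample)
for (rank, size), n in sorted(counts.items()):
    print(f"rank {rank} of {size}: {n} lines")
```

To run it on a full log, read the file and pass its contents in, e.g. `tally_ranks(open("lightlda.log").read())`, where `lightlda.log` is a placeholder file name.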