This repository was archived by the owner on Jan 26, 2021. It is now read-only.

Distributed running nytimes through mpi #69

@Abigale001

After using dump_binary to split the libsvm file into 2 parts, I sent block.1, vocab.1, vocab.1.txt, and vocab.nytimes.txt.1 to the second node.
Then I executed this command on the first node:
mpiexec -machinefile mpi_machine_file ../bin/lightlda -num_vocabs 111400 -num_topics 1000 -num_iterations 100 -alpha 0.1 -beta 0.01 -mh_steps 2 -num_local_workers 1 -num_blocks 1 -max_num_document 300000 -input_dir ./data/nytimes/ -data_capacity 800
My mpi_machine_file is:
10.107.14.100
10.107.14.70
10.107.14.100 is the first node and 10.107.14.70 is the second node.
But I don't think it runs correctly. Here is the log. Could anyone help?

[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-02 00:45:49] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-02 00:45:49] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
...
Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-02 00:45:49] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] INFO: block = 0, the number of slice = 1
[INFO] [2018-04-01 23:33:34] Server 0 starts: num_workers=1 endpoint=inproc://server
[INFO] [2018-04-01 23:33:34] Server 0: Worker registratrion completed: workers=1 trainers=1 servers=1
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
...
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:50] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:36] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:51] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-02 00:45:52] Rank 0/1: Begin of configuration and initialization.
[INFO] [2018-04-01 23:33:37] [INFO] [2018-04-01 23:33:37] Rank 0/1: Begin of configuration and initialization.
Rank 0/1: Begin of configuration and initialization.
...
[INFO] [2018-04-02 03:13:14] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 03:13:17] Rank = 0, Alias Time used: 7.33 s
[INFO] [2018-04-02 02:01:15] Rank = 0, Training Time used: 7593.99 s
[INFO] [2018-04-02 02:01:16] Rank = 0, sampling throughput: 13075.037526 (tokens/thread/sec)
[INFO] [2018-04-02 02:01:17] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:01:17] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:01:17] Rank = 0, Evaluation Time used: 1100.00 s
[DEBUG] [2018-04-02 02:01:17] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:01:46] Rank = 0, Training Time used: 7615.72 s
[INFO] [2018-04-02 02:01:46] Rank = 0, sampling throughput: 13037.700213 (tokens/thread/sec)
[INFO] [2018-04-02 02:01:49] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:01:49] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:01:49] Rank = 0, Evaluation Time used: 969.25 s
[DEBUG] [2018-04-02 02:01:49] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:02:03] Rank = 0, Training Time used: 7651.74 s
[INFO] [2018-04-02 02:02:03] Rank = 0, sampling throughput: 12976.338971 (tokens/thread/sec)
[INFO] [2018-04-02 03:14:18] doc likelihood : -6.422194e+08
[INFO] [2018-04-02 02:02:15] Rank = 0, Training Time used: 7662.92 s
[INFO] [2018-04-02 02:02:15] Rank = 0, sampling throughput: 12957.428332 (tokens/thread/sec)
[INFO] [2018-04-02 03:14:43] word likelihood : 5.329020e+08
[INFO] [2018-04-02 03:14:43] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:14:43] Rank = 0, Evaluation Time used: 985.55 s
[DEBUG] [2018-04-02 03:14:44] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:14:56] doc likelihood : -6.421577e+08
[INFO] [2018-04-02 03:15:26] Rank = 0, Training Time used: 8809.39 s
[INFO] [2018-04-02 03:15:26] Rank = 0, sampling throughput: 11270.751379 (tokens/thread/sec)
[INFO] [2018-04-02 03:15:45] word likelihood : 5.329053e+08
[INFO] [2018-04-02 03:15:45] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:15:45] Rank = 0, Evaluation Time used: 1008.38 s
[INFO] [2018-04-02 02:03:32] Rank = 0, Training Time used: 8703.96 s
[INFO] [2018-04-02 02:03:32] Rank = 0, sampling throughput: 11407.660278 (tokens/thread/sec)
[INFO] [2018-04-02 02:03:55] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 02:03:56] Rank = 0, Training Time used: 8760.07 s
[INFO] [2018-04-02 02:03:56] Rank = 0, sampling throughput: 11334.575356 (tokens/thread/sec)
[INFO] [2018-04-02 02:03:59] Rank = 0, Alias Time used: 5.45 s
[INFO] [2018-04-02 02:04:07] doc likelihood : -6.422155e+08
[INFO] [2018-04-02 02:04:17] Rank = 0, Training Time used: 7770.12 s
[INFO] [2018-04-02 02:04:17] Rank = 0, sampling throughput: 12778.687178 (tokens/thread/sec)
[DEBUG] [2018-04-02 03:16:33] Request params. start = 0, end = 101635
[INFO] [2018-04-02 02:04:26] word likelihood : 5.328867e+08
[INFO] [2018-04-02 02:04:26] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:04:26] Rank = 0, Evaluation Time used: 830.64 s
[INFO] [2018-04-02 03:16:41] Rank = 0, Iter = 1, Block = 0, Slice = 0
[DEBUG] [2018-04-02 02:04:26] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:16:45] Rank = 0, Alias Time used: 7.17 s
[INFO] [2018-04-02 03:16:59] Rank = 0, Training Time used: 8880.24 s
[INFO] [2018-04-02 03:16:59] Rank = 0, sampling throughput: 11181.214254 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:02] doc likelihood : -6.421963e+08
[INFO] [2018-04-02 03:17:17] doc likelihood : -6.422194e+08
[INFO] [2018-04-02 02:05:09] Rank = 0, Training Time used: 7802.58 s
[INFO] [2018-04-02 02:05:09] Rank = 0, sampling throughput: 12725.500550 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:17] doc likelihood : -6.421963e+08
[INFO] [2018-04-02 02:05:20] Rank = 0, Training Time used: 7819.40 s
[INFO] [2018-04-02 02:05:20] [INFO] [2018-04-02 03:17:40] Rank = 0, Iter = 1, Block = 0, Slice = 0
[INFO] [2018-04-02 03:17:43] Rank = 0, Alias Time used: 5.78 s
Rank = 0, sampling throughput: 12698.067672 (tokens/thread/sec)
[INFO] [2018-04-02 02:05:30] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:05:30] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:05:30] Rank = 0, Evaluation Time used: 515.49 s
[INFO] [2018-04-02 02:05:33] Rank = 0, Training Time used: 8845.28 s
[INFO] [2018-04-02 02:05:33] Rank = 0, sampling throughput: 11225.364458 (tokens/thread/sec)
[DEBUG] [2018-04-02 02:05:45] Request params. start = 0, end = 101635
[INFO] [2018-04-02 03:18:07] word likelihood : 5.329020e+08
[INFO] [2018-04-02 03:18:07] Normalized likelihood : -1.561649e+09
[INFO] [2018-04-02 03:18:07] Rank = 0, Evaluation Time used: 1170.88 s
[INFO] [2018-04-02 02:05:58] word likelihood : 5.329021e+08
[INFO] [2018-04-02 02:05:58] Normalized likelihood : -1.561648e+09
[INFO] [2018-04-02 02:05:58] Rank = 0, Evaluation Time used: 509.73 s
...

  1. The log contains a lot of repetition and never shows the other node; every line reports Rank 0.
  2. An iteration takes several hours, much slower than running on a single node. I have checked the resource usage, and both nodes are using 400G of memory.
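
The first observation can be checked mechanically rather than by eye. Below is a minimal sketch in Python that counts the distinct rank/size pairs reported at startup. It assumes the log has been saved as text and that "Rank r/n" means rank r out of n processes; that reading of the message is an assumption, not something the log itself confirms. The function name `summarize_ranks` and the `sample` string are illustrative only. If the only pair found is (0, 1), every process seems to consider itself the sole rank, which would be consistent with the processes not joining one shared MPI job.

```python
import re
from collections import Counter

# Matches startup lines such as:
# [INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
INIT_RE = re.compile(r"Rank (\d+)/(\d+): Multiverso initialized successfully")


def summarize_ranks(log_text):
    """Count how many startup lines reported each (rank, size) pair."""
    return Counter((int(r), int(s)) for r, s in INIT_RE.findall(log_text))


# A few lines copied from the log above; in practice, read the saved log file.
sample = """\
[INFO] [2018-04-02 00:45:49] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
[INFO] [2018-04-01 23:33:34] Rank 0/1: Multiverso initialized successfully.
"""

counts = summarize_ranks(sample)
for (rank, size), n in sorted(counts.items()):
    print(f"rank {rank} of {size}: reported by {n} process(es)")
```

For a healthy two-process run one would expect to see two different ranks and a size of 2, again under the assumption about the message format stated above.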
