This repository was archived by the owner on Jan 26, 2021. It is now read-only.

terminate called after throwing an instance of 'zmq::error_t' #79

@airlsyn

I started training on a single machine with 256 GB of memory and 40 CPU cores. After 23 iterations, the process was killed.

[INFO] [2019-04-18 15:15:54] Rank = 0, Training Time used: 824.51 s
[INFO] [2019-04-18 15:15:54] Rank = 0, sampling throughput: 675.029212 (tokens/thread/sec)
[INFO] [2019-04-18 15:15:55] Rank = 0, Iter = 23, Block = 0, Slice = 94
[DEBUG] [2019-04-18 15:15:55] Request params. start = 636142, end = 958620
[INFO] [2019-04-18 15:15:55] Rank = 0, Alias Time used: 1.36 s
Assertion failed in file src/mpid/ch3/channels/nemesis/netmod/tcp/socksm.c at line 591: hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_ID_INFO || hdr.pkt_type == MPIDI_NEM_TCP_SOCKSM_PKT_TMPVC_INFO
internal ABORT - process 0
terminate called after throwing an instance of 'zmq::error_t'
  what():  Context was terminated

My configuration:

num_vocabs=833w (8.33 million; "w" = 万 = 10,000)
num_topics=10w (100,000)
num_iterations=100
alpha=0.0005
beta=0.01
mh_steps=2
num_local_workers=30
num_blocks=1
max_num_document=13916w (139.16 million; block size 71G in lightLDA format)
data_capacity=73G
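
To put the configuration in absolute numbers, here is a minimal sketch. It assumes the `w` suffix means 万 (10,000), and the helper `parse_w` is hypothetical, not part of lightLDA. The dense-table figure is only an upper bound under an assumed 4-byte counter; a sparse layout could use far less, so it should not be read as the actual memory footprint.

```python
# Assumption: the "w" suffix is the Chinese shorthand for 10,000.
def parse_w(value: str) -> int:
    """Convert strings such as '833w' to an absolute count (hypothetical helper)."""
    if value.endswith("w"):
        return round(float(value[:-1]) * 10_000)
    return int(value)

num_vocabs = parse_w("833w")
num_topics = parse_w("10w")
max_num_document = parse_w("13916w")

# Upper bound only: a fully dense word-topic table with an assumed 4 bytes per cell.
dense_cells = num_vocabs * num_topics
dense_bytes = dense_cells * 4

print(f"num_vocabs       = {num_vocabs:,}")
print(f"num_topics       = {num_topics:,}")
print(f"max_num_document = {max_num_document:,}")
print(f"dense word-topic table upper bound: {dense_bytes / 1e12:.2f} TB")
```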

So what is the problem? Could anyone help me? Thanks very much.
