After RDMAClient::Connect() calls fi_connect and fi_eq_sread, it takes about 18.5s before RDMAServer::GetEvent(VineyardEventEntry& vineyard_entry) receives the message. Why? #2023

@hsh258

Describe your problem


After RDMAClient::Connect() calls fi_connect and fi_eq_sread, it takes about 18.5s before RDMAServer::GetEvent(VineyardEventEntry& vineyard_entry) receives the message. That is far too long.
Also, how can I set the client reconnect interval? Thanks.

Status RDMAClient::Connect() {
  CHECK_ERROR(!fi_connect(ep, fi->dest_addr, NULL, 0), "fi_connect failed.");
  fi_eq_cm_entry entry;
  uint32_t event;
  CHECK_ERROR(
      fi_eq_sread(eq, &event, &entry, sizeof(entry), -1, 0) == sizeof(entry),
      "fi_eq_sread failed.");
  if (event != FI_CONNECTED || entry.fid != &ep->fid) {
    return Status::Invalid("Unexpected event:" + std::to_string(event));
  }
  return Status::OK();
}
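
To narrow down where the 18.5s is spent, each step on the client side (fi_connect, then fi_eq_sread) could be timed separately. The following is only a minimal sketch of such a measurement: `TimeCallMs` and `FakeConnect` are hypothetical names that are not part of vineyard or libfabric, and `FakeConnect` merely sleeps in place of the real call.

```cpp
#include <chrono>
#include <thread>
#include <utility>

// Hypothetical helper (not a vineyard or libfabric API): runs `fn` once and
// returns the elapsed wall-clock time in whole milliseconds.
template <typename Fn>
long long TimeCallMs(Fn&& fn) {
  const auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  const auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::milliseconds>(end - start)
      .count();
}

// Hypothetical stand-in for the call being measured; it only sleeps so that
// the sketch can run without an RDMA device.
inline void FakeConnect() {
  std::this_thread::sleep_for(std::chrono::milliseconds(20));
}
```

Calling `TimeCallMs(FakeConnect)` returns the duration of the wrapped call. In the real client, the same wrapper could be placed around fi_connect and around fi_eq_sread individually to see which of the two accounts for the delay.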

Status RDMAServer::GetEvent(VineyardEventEntry& vineyard_entry) {
  struct fi_eq_cm_entry entry;
  uint32_t event;
  while (true) {
    int rd = fi_eq_sread(eq, &event, &entry, sizeof entry, 500, 0);
    if (rd < 0 && (rd != -FI_ETIMEDOUT && rd != -FI_EAGAIN)) {
      return Status::IOError("fi_eq_sread broken. ret:" + std::to_string(rd));
    }
    if (rd == -FI_ETIMEDOUT || rd == -FI_EAGAIN) {
      if (state == STOPED) {
        return Status::Invalid("Server is stoped.");
      }
      continue;
    }
    if (event == FI_SHUTDOWN) {
      fid_ep* closed_ep = container_of(entry.fid, fid_ep, fid);
      RemoveClient(closed_ep);
      continue;
    }
    vineyard_entry.fi = entry.info;
    vineyard_entry.event_id = event;
    vineyard_entry.fid = entry.fid;
    return Status::OK();
  }
}
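
On the reconnect interval: I do not know whether vineyard exposes a setting for it, so the behaviour I have in mind is sketched below as a caller-side retry loop. `ConnectWithRetry` is a hypothetical helper, not an existing vineyard API, and `try_connect` stands for any callable that reports success, for example a lambda wrapping RDMAClient::Connect().

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper (not a vineyard API): calls `try_connect` up to
// `max_attempts` times and sleeps for `interval` between failed attempts.
// Returns the 1-based number of the attempt that succeeded, or 0 if every
// attempt failed.
inline int ConnectWithRetry(const std::function<bool()>& try_connect,
                            int max_attempts,
                            std::chrono::milliseconds interval) {
  for (int attempt = 1; attempt <= max_attempts; ++attempt) {
    if (try_connect()) {
      return attempt;
    }
    // Do not sleep after the final failed attempt.
    if (attempt < max_attempts) {
      std::this_thread::sleep_for(interval);
    }
  }
  return 0;
}
```

With this shape, the interval and the attempt limit are plain arguments chosen by the caller, e.g. `ConnectWithRetry(connect_fn, 5, std::chrono::milliseconds(200))`, where `connect_fn` is assumed to be supplied by the application.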
