KeyError in pg leave_remote/3 #10872

@jamilbk

Describe the bug

We hit what appears to be an edge case or race condition that crashed pg in our production system. A group that pg expected to exist was missing from its map, and leave_remote/3 looks the group up with a call that raises on a missing key instead of a crash-safe variant (for example maps:get/3 with a default).

A bit about our system:

  • 3 regions: Australia, Central US, North Europe
  • Several Erlang (Elixir) nodes in each, joined into a global cluster consisting of a couple dozen total cluster nodes
  • Our workload uses pg to maintain a cluster-wide mapping of uuid -> pid in many single-member groups - one for each client
  • When a client connects, we join its pid to a group keyed by the client's uuid; when it disconnects, we remove that member. The pid is the only member of its group.
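For readers unfamiliar with this pattern, the workload above can be sketched roughly as follows. This is a minimal illustration, not the code from our application: the module name `ClientRegistrySketch` and the scope name `:client_registry_sketch` are hypothetical, and only the group key shape `{Portal.Channels, :client, uuid}` is taken from the crash trace below.

```elixir
defmodule ClientRegistrySketch do
  # Hypothetical scope name, chosen for this sketch only.
  @scope :client_registry_sketch

  # Starts the pg scope process. A real application would normally
  # place this under a supervisor instead of calling it directly.
  def start_scope, do: :pg.start_link(@scope)

  # Group key shaped like the one that appears in the crash trace.
  def group(uuid), do: {Portal.Channels, :client, uuid}

  # On connect: join the client's pid to its own single-member group.
  def register(uuid, pid \\ self()), do: :pg.join(@scope, group(uuid), pid)

  # On disconnect: remove the pid again, which leaves the group empty.
  def unregister(uuid, pid \\ self()), do: :pg.leave(@scope, group(uuid), pid)

  # Cluster-wide uuid -> pid lookup.
  def lookup(uuid) do
    case :pg.get_members(@scope, group(uuid)) do
      [pid | _] -> {:ok, pid}
      [] -> :error
    end
  end
end
```

Each connect/disconnect cycle therefore creates and then removes one group, which is the code path that ran hundreds of thousands of times before the crash appeared.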

To Reproduce

We have so far seen only one instance of this occur across hundreds of thousands of client connect/disconnect cycles (the workload that inserts and removes groups).

Expected behavior

We would expect this not to crash. Perhaps the crash-safe version of maps:get should be used here instead?
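To make the suggestion concrete, here is a small sketch of the difference between the raising and non-raising map lookups, called from Elixir. It only illustrates the lookup behaviour; it is not a patch for pg.erl, and the module name `MapLookupSketch` is made up for this example.

```elixir
defmodule MapLookupSketch do
  # :maps.get/2 raises when the key is absent. Elixir surfaces the
  # Erlang {badkey, Key} error as a KeyError, as in the crash trace.
  def strict(map, key) do
    {:ok, :maps.get(key, map)}
  rescue
    KeyError -> :raised
  end

  # :maps.get/3 returns the supplied default instead of raising.
  def with_default(map, key), do: :maps.get(key, map, :not_found)

  # :maps.find/2 returns :error for a missing key and {:ok, value} otherwise.
  def find(map, key), do: :maps.find(key, map)
end
```

With an empty map and the key from the trace, `strict/2` hits the raising path, while the other two return a value the caller can branch on.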

As a workaround, we need to have each client process monitor the pg scope process so that it can detect another crash and re-register itself. Otherwise, when pg crashes, all group memberships are lost and the clients lose their ability to reach each other.
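A rough sketch of that workaround is shown below, assuming the scope is restarted by a supervisor after it crashes. The module name `ClientSessionSketch`, the scope name, and the retry interval are all hypothetical; this is not the code we run in production.

```elixir
defmodule ClientSessionSketch do
  use GenServer

  # Hypothetical scope name and retry interval for this sketch.
  @scope :client_session_sketch_scope
  @retry_ms 100

  def start_link(uuid), do: GenServer.start_link(__MODULE__, uuid)

  @impl true
  def init(uuid) do
    state = %{group: {Portal.Channels, :client, uuid}, ref: nil}
    {:ok, join_and_monitor(state)}
  end

  @impl true
  # The monitored scope process died, so every membership is gone: re-register.
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    {:noreply, join_and_monitor(%{state | ref: nil})}
  end

  # Retry only while we are not currently registered.
  def handle_info(:retry_join, %{ref: nil} = state) do
    {:noreply, join_and_monitor(state)}
  end

  def handle_info(_other, state), do: {:noreply, state}

  defp join_and_monitor(state) do
    # pg registers the scope process locally under the scope name.
    case Process.whereis(@scope) do
      nil ->
        # Scope has not been restarted yet; try again shortly.
        Process.send_after(self(), :retry_join, @retry_ms)
        state

      scope_pid ->
        ref = Process.monitor(scope_pid)

        try do
          :ok = :pg.join(@scope, state.group, self())
          %{state | ref: ref}
        catch
          # The scope died between whereis/1 and join/3: drop the monitor and retry.
          :exit, _reason ->
            Process.demonitor(ref, [:flush])
            Process.send_after(self(), :retry_join, @retry_ms)
            %{state | ref: nil}
        end
    end
  end
end
```

The monitor is taken before the join so that a crash in between is never missed. The trade-off is that every client process now carries one extra monitor on the scope process.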

Affected versions

We are using OTP 28.3, but the bug appears to be present on the latest master as well.

Additional context

Crash trace:

KeyError: key {Portal.Channels, :client, "<uuid>"} not found in:

    %{}

    ?erlang.:erlang.map_get/2
    pg.erl:822: pg.anonymous fn/3 in :pg.leave_remote/3
    lists.erl:2466: lists.:lists.foldl/3
    pg.erl:542: pg.:pg.handle_info/2
    gen_server.erl:2434: gen_server.:gen_server.try_handle_info/3
    gen_server.erl:2420: gen_server.:gen_server.handle_msg/3
    proc_lib.erl:333: proc_lib.:proc_lib.init_p_do_apply/3

Here is our application module:

https://github.com/firezone/firezone/blob/f086bf3e3d52ac2281e34d547dfad05c18cb2ede/elixir/lib/portal/channels.ex

Labels: bug, team:VM
