Description
Describe the bug
We hit what appears to be an edge case or race condition that crashed pg in our production system. It seems that a group that was expected to exist did not, and leave_remote/3 crashed because it does not use the crash-safe version of maps:get.
A bit about our system:
- 3 regions: Australia, Central US, North Europe
- Several Erlang (Elixir) nodes in each, joined into a global cluster of a couple dozen nodes in total
- Our workload uses `pg` to maintain a cluster-wide mapping of `uuid -> pid` in many single-member groups, one for each client
- When a client connects, we add a group member consisting of the client's `uuid -> pid`, and on disconnect we remove this member. This is the only member of the group.
To Reproduce
We have so far seen only one instance of this occur across hundreds of thousands of client connect/disconnect cycles (the workload that inserts and removes groups).
Expected behavior
We would expect this not to crash. Perhaps the crash-safe version of maps:get should be used here instead?
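To illustrate the difference we mean, here is a minimal sketch, not the actual pg code: `maps:get/2` raises on a missing key, while `maps:get/3` returns a default. The key below is a made-up example shaped like the one in our crash trace, and the `[]` default is only an assumption about what a sensible fallback might be.

```elixir
# Hypothetical group key, shaped like the one in the crash trace.
key = {Portal.Channels, :client, "example-uuid"}
groups = %{}

# maps:get/2 raises when the key is missing: an Erlang {badkey, Key}
# error, which Elixir surfaces as KeyError.
strict =
  try do
    :maps.get(key, groups)
  rescue
    e in KeyError -> {:raised, e.key}
  end

# maps:get/3 returns the supplied default instead of raising.
lenient = :maps.get(key, groups, [])
```

Whether silently tolerating a missing group is the right fix inside pg, or whether it would hide a deeper ordering problem, is something we cannot judge from the outside.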
As a workaround, we need to have the client pids monitor the pg scope pid to detect if this crashes again, and then re-register themselves. Otherwise, when pg crashes, all group memberships are lost and the clients lose their ability to reach each other.
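A rough sketch of that workaround, under a few assumptions: the scope is started elsewhere (for example under a supervisor that restarts it), and `ClientRegistration`, `:example_scope`, the `{:client, uuid}` group key and the 100 ms retry interval are all hypothetical names and values, not our real code. It relies on the pg scope process being registered locally under the scope name.

```elixir
defmodule ClientRegistration do
  use GenServer

  # Hypothetical scope name and retry interval.
  @scope :example_scope
  @retry_ms 100

  def start_link(uuid), do: GenServer.start_link(__MODULE__, uuid)

  @impl true
  def init(uuid), do: {:ok, register(%{uuid: uuid, ref: nil})}

  # The scope process died: all memberships are gone, so join again.
  @impl true
  def handle_info({:DOWN, ref, :process, _pid, _reason}, %{ref: ref} = state) do
    {:noreply, register(%{state | ref: nil})}
  end

  def handle_info(:retry, state), do: {:noreply, register(state)}
  def handle_info(_msg, state), do: {:noreply, state}

  # Monitor the scope pid and join our single-member group.
  # If the scope is not running (yet), try again shortly.
  defp register(state) do
    case Process.whereis(@scope) do
      nil ->
        schedule_retry(state)

      scope_pid ->
        ref = Process.monitor(scope_pid)

        try do
          :ok = :pg.join(@scope, {:client, state.uuid}, self())
          %{state | ref: ref}
        catch
          # The scope died between whereis/1 and join/3.
          :exit, _reason ->
            Process.demonitor(ref, [:flush])
            schedule_retry(state)
        end
    end
  end

  defp schedule_retry(state) do
    Process.send_after(self(), :retry, @retry_ms)
    %{state | ref: nil}
  end
end
```

Each client process pays for one extra monitor, and there is still a window between the crash and the re-join during which lookups for that client return nothing.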
Affected versions
We are using OTP 28.3, but the bug appears to be present on the latest master as well.
Additional context
Crash trace:
KeyError: key {Portal.Channels, :client, "<uuid>"} not found in:
%{}
?erlang.:erlang.map_get/2
pg.erl:822: pg.anonymous fn/3 in :pg.leave_remote/3
lists.erl:2466: lists.:lists.foldl/3
pg.erl:542: pg.:pg.handle_info/2
gen_server.erl:2434: gen_server.:gen_server.try_handle_info/3
gen_server.erl:2420: gen_server.:gen_server.handle_msg/3
proc_lib.erl:333: proc_lib.:proc_lib.init_p_do_apply/3
Here is our application module: