Falcosidekick crashes with concurrent map iteration and map write error #746

Open

sethstuart opened this issue Sep 25, 2024 · 8 comments

Describe the bug

Falcosidekick crashes frequently with a fatal error: concurrent map iteration and map write. The failures persist after updating Falcosidekick to a multi-replica configuration (replicaCount=3), enabling buffered output (falco.buffered_outputs=true), and adjusting the rate and burst limits (rate: 10, burst: 20).

Initially, the WebUI output was enabled, which caused instability. After disabling the WebUI output, there was some improvement in stability, but the application continues to crash when handling events.

Errors logged during operation:

fatal error: concurrent map iteration and map write
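
For context, this fatal error is the Go runtime aborting because one goroutine is ranging over a map while another goroutine writes to it; plain Go maps are not safe for concurrent use. A minimal sketch of the failure mode and the usual mutex-based fix (illustrative only, not Falcosidekick's actual code):

package main

import "sync"

// stats is a shared counter map. Guarding both the write and the
// iteration with the same RWMutex is the standard fix; remove the
// locks and this program dies with exactly the fatal error above.
type stats struct {
    mu sync.RWMutex
    m  map[string]int
}

func (s *stats) inc(key string) {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.m[key]++
}

func (s *stats) snapshot() map[string]int {
    s.mu.RLock()
    defer s.mu.RUnlock()
    out := make(map[string]int, len(s.m))
    for k, v := range s.m { // iteration is safe while the read lock is held
        out[k] = v
    }
    return out
}

func main() {
    s := &stats{m: make(map[string]int)}
    var wg sync.WaitGroup
    for i := 0; i < 8; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < 10000; j++ {
                s.inc("falco.events")
                _ = s.snapshot()
            }
        }()
    }
    wg.Wait()
}

Running the unguarded variant under go run -race reports the race deterministically; without -race, the runtime only aborts when a collision is actually detected, which is why the crash looks intermittent.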

How to reproduce it

  1. Deploy Falcosidekick with the following configuration (a Helm values sketch follows this list):
    • replicaCount=3
    • falco.buffered_outputs=true
    • rate: 10
    • burst: 20
  2. Enable Slack and Elasticsearch outputs.
  3. Disable the WebUI output.
  4. Trigger multiple Falco events to observe Falcosidekick crashes.
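
For reference, the setup above corresponds roughly to the following Helm values. This is a sketch: the key names reflect my reading of the falcosecurity/falco chart (with the bundled falcosidekick subchart) and should be checked against your chart version, and the endpoint values are placeholders.

falco:
  buffered_outputs: true     # falco.buffered_outputs=true
  outputs:
    rate: 10                 # rate: 10
    max_burst: 20            # burst: 20 (Falco's key is max_burst)

falcosidekick:
  enabled: true
  replicaCount: 3            # replicaCount=3
  webui:
    enabled: false           # step 3: WebUI output disabled
  config:
    slack:
      webhookurl: "https://hooks.slack.com/services/REDACTED"  # placeholder
    elasticsearch:
      hostport: "https://opensearch.example.internal:9200"     # placeholder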

Expected behavior

Falcosidekick should remain stable and handle a high volume of Falco events in a multi-replica setup without crashing.

Environment

  • Falco version: 0.38.2
  • Falcosidekick version: 2.29.0
  • Kubernetes version: v1.28.11
  • System info:
    • Machine architecture: x86_64
    • Kernel: 6.1.0-23-amd64 (Debian 6.1.99-1, built 2024-07-15)
    • Operating system: Debian GNU/Linux 11 (bullseye)
  • Cloud provider: AWS EC2 t3.large instances
  • Installation method: Deployed via Helm

Additional context

Despite disabling the WebUI (which initially caused instability), the system continues to crash when forwarding events to OpenSearch (via the elasticsearch output) and to Slack. Buffered outputs were enabled to optimize performance, but the issue persists across all replicas.

Stack traces and further details can be found in the attached log and in the original conversation with Thomas Labarussias:
https://kubernetes.slack.com/archives/CMWH3EH32/p1726712928741829

falcologs1.txt

sethstuart added the kind/bug (Something isn't working) label on Sep 25, 2024
@Issif (Member) commented Sep 25, 2024

I will try to reproduce the issue. Can you paste your falcosidekick settings, please (obfuscating all sensitive data, of course)? Thanks

@Issif (Member) commented Sep 25, 2024

I ran tests with a vanilla config.yaml and only Slack enabled as an output (using a mock server to avoid Slack's rate limiting), and I wasn't able to replicate the issue with 2.29.0 at ~100 req/s (which is an absurd rate for security alerts in real life).
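
For anyone else trying to reproduce: the mock receiver for this kind of test can be as small as the sketch below (an illustration, not the exact harness used here). Pointing falcosidekick's Slack webhookurl at http://localhost:8080/ then lets you replay events without hitting real rate limits.

package main

import (
    "io"
    "log"
    "net/http"
)

// A stand-in for the Slack webhook endpoint: accept every POST,
// drain the body, and answer 200.
func main() {
    http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
        _, _ = io.Copy(io.Discard, r.Body) // drain so connections are reused
        w.WriteHeader(http.StatusOK)
    })
    log.Fatal(http.ListenAndServe(":8080", nil))
}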

@sethstuart (Author) commented Oct 3, 2024

> I ran tests with a vanilla config.yaml and only Slack enabled as an output (using a mock server to avoid Slack's rate limiting), and I wasn't able to replicate the issue with 2.29.0 at ~100 req/s (which is an absurd rate for security alerts in real life).

Hello again, Thomas.

Let me know what else I can provide. I will say that, looking at our Elasticsearch, I see a pattern of several minutes with only 5-10 logs, then a few hundred thousand logs arriving all at once. This makes no sense to me: the only alerting rules I have apply to a cluster of some 30 nodes, and they are just the k8s audit rules plus our own custom rule for SSH intrusion detection. There should not be this insane volume.

[screenshot: Elasticsearch log volume over time, showing quiet stretches of 5-10 logs punctuated by bursts of several hundred thousand]

What can I provide to help reproduce the error?

@Issif (Member) commented Oct 7, 2024

Do you have more details about which rule is triggered?

@sethstuart (Author) commented

Hello Thomas,

I saw your Slack message and the other bug report. The alerts I am seeing are:

Critical Executing binary not part of base image (proc_exe=/usr/bin/c_rehash proc_sname=sh gparent=ca-certificates ....)
Critical Executing binary not part of base image (proc_exe=/usr/sbin/update-ca-certificates proc_sname=sh gparent=apk ....)
Critical Executing binary not part of base image (proc_exe=curl proc_sname=sh gparent=containerd-shim ....)

So they all appear to be triggered by the rule:
rule: Drop and execute new binary in container

Recently I tried redeploying my Falco instances with a rules list that includes only our own custom rules and the k8s audit rules. This has "fixed" the issue by removing the problematic rules, cutting our count from ~120,000 per 5 minutes to ~1,000 per hour, which is much more tolerable, but it also removes most of the default Falco monitoring for us.

@Issif (Member) commented Oct 7, 2024

Your rates are really high; that's definitely noisy. Falco is a security agent: you have to fine-tune the rules to fit your environment, and it's not supposed to fire that many alerts. I would still like to reproduce the bug, though. I tried with a heavily resource-constrained container, to see whether a lack of resources could lead to a race condition, but had no success.

@sethstuart (Author) commented Oct 8, 2024

I have re-enabled falco_rules and then disabled the following rules in my config:

    - /etc/falco/k8s_audit_rules.yaml
    - /etc/falco/falco_rules.yaml
    - /etc/falco/rules.d
  rules:
    - disable:
        rule: Drop and execute new binary in container
    - disable:
        rule: Contact K8s API Server From Container

But I am still observing the same failure and restart of falcosidekick. Additionally, two hosts in my cluster do not seem to pick up that the rules are disabled: they are still alerting, while none of the other ~40 hosts in this cluster are. As for getting you logs, I'm not even sure where to start; let me know what would be useful.

@Issif (Member) commented Oct 9, 2024

Are you using Helm? If so, the rules field is not used; here's the syntax to disable those rules:

customRules:
  override-rules.yaml: |-
    - rule: Drop and execute new binary in container
      enabled: false
      override:
        enabled: replace
    - rule: Contact K8s API Server From Container
      enabled: false
      override:
        enabled: replace
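
As I understand it, each entry under customRules becomes an extra rules file loaded alongside the bundled ones, and the override: enabled: replace stanza tells Falco to replace the enabled flag of the already-defined rule rather than redefine the rule itself; it has been the recommended way to toggle bundled rules since the override syntax was introduced (Falco 0.36, if I remember correctly).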
