[BUG] HTTP Metrics + Shutdown race condition resulting in SEGFAULTs (0.43.0) #2830

@shawnboutilier-tech

Description

A race condition exists during process shutdown when Prometheus metrics are enabled. Based on my (very rusty) C++ debugging skills, I believe the cause is the following:

The HTTP server thread iterates the thread table (threadinfo_map_t::loop()) while the event processing thread is removing entries (sinsp_thread_manager::remove_thread()). The underlying std::unordered_map is not thread safe, so in what appears to be a rare case this can result in a nullptr dereference (sinsp_threadinfo::get_fd_table(this=0x0)), which causes a SEGFAULT and a crashed process.
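To illustrate the kind of synchronization that would rule out this interleaving, here is a minimal sketch. It is not the libsinsp code: `thread_record`, `guarded_thread_table`, and `scan_while_removing` are hypothetical names invented for this example, and the real thread table may be structured differently. The idea is only that readers iterate under a shared lock while removals take an exclusive lock, so a scan never observes an entry that is being erased.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <memory>
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <unordered_map>

// Hypothetical stand-in for a per-thread record; NOT the real sinsp_threadinfo.
struct thread_record {
    std::int64_t tid = 0;
    std::size_t open_fds = 0;
};

// Hypothetical thread table that serializes writers against readers.
class guarded_thread_table {
public:
    void put(std::int64_t tid, std::size_t open_fds) {
        std::unique_lock lock(m_mutex);  // exclusive: blocks readers and writers
        m_threads[tid] = std::make_shared<thread_record>(thread_record{tid, open_fds});
    }

    void remove(std::int64_t tid) {
        std::unique_lock lock(m_mutex);
        m_threads.erase(tid);
    }

    // Visits every entry under a shared lock; the callback returns false to stop early.
    // The callback must not call put() or remove() on this table (it would deadlock).
    bool loop(const std::function<bool(const thread_record&)>& callback) const {
        std::shared_lock lock(m_mutex);
        for (const auto& [tid, rec] : m_threads) {
            if (rec && !callback(*rec)) {
                return false;
            }
        }
        return true;
    }

private:
    mutable std::shared_mutex m_mutex;
    std::unordered_map<std::int64_t, std::shared_ptr<thread_record>> m_threads;
};

// Simulates the two threads from the report: one removes entries while the
// other repeatedly scans the table, as a metrics snapshot would.
// Returns the number of entries left after the writer has finished.
std::size_t scan_while_removing(std::size_t entries) {
    guarded_thread_table table;
    for (std::size_t i = 0; i < entries; ++i) {
        table.put(static_cast<std::int64_t>(i), 1);
    }

    std::thread writer([&] {
        for (std::size_t i = 0; i < entries; ++i) {
            table.remove(static_cast<std::int64_t>(i));
        }
    });

    std::thread reader([&] {
        for (int pass = 0; pass < 100; ++pass) {
            std::size_t fds = 0;
            table.loop([&](const thread_record& rec) {
                fds += rec.open_fds;
                return true;
            });
            assert(fds <= entries);  // each pass sees a consistent subset
        }
    });

    writer.join();
    reader.join();

    std::size_t remaining = 0;
    table.loop([&](const thread_record&) { ++remaining; return true; });
    return remaining;
}
```

Calling `scan_while_removing(1000)` runs the scan and the removals concurrently and returns 0 once every entry has been removed. A reader/writer lock is only one option; taking the snapshot on the event processing thread, or stopping the webserver before the thread table is torn down, would avoid the shared access altogether.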

This results in noisy process monitoring metrics, and false alarms during normal k8s cluster churn at scale.

Steps to reproduce

  1. Use the Falco 0.43.0 amd64 container on Kubernetes.
  2. Launch a large DaemonSet with the following falco.yml webserver and metrics configuration:
    webserver:
      enabled: true
      threadiness: 0
      listen_address: 0.0.0.0
      listen_port: 8765
      ssl_enabled: false
      k8s_healthz_endpoint: /healthz
      prometheus_metrics_enabled: true

    metrics:
      enabled: true
      interval: 1m
      output_rule: false
      output_file: /dev/stdout
      rules_counters_enabled: true
      resource_utilization_enabled: true
      state_counters_enabled: true
      kernel_event_counters_enabled: true
      libbpf_stats_enabled: true
      plugins_metrics_enabled: true
      convert_memory_to_mb: true
      include_empty_values: false
  3. In a loop, continually restart the DaemonSet rollout:
while true; do kubectl rollout restart ds/falco; sleep 10; done
  4. Monitor the pods' process exit status for SEGFAULTs.

Expected Behaviour

We expect a clean shutdown, with no SEGFAULT on exit.

Additional information

(gdb) bt
#0  sinsp_threadinfo::get_fd_table (this=0x0) at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/threadinfo.h:438
#1  libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0::operator()(sinsp_threadinfo&) const (this=0x78a4743efe08, tinfo=...)
    at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/metrics_collector.cpp:260
#2  std::__1::__invoke[abi:ne200100]<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&>(libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&) (__f=..., __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:179
#3  std::__1::__invoke_void_return_wrapper<bool, false>::__call[abi:ne200100]<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&>(libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&) (__args=..., __args=...)
    at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:243
#4  std::__1::__invoke_r[abi:ne200100]<bool, libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&>(libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0&, sinsp_threadinfo&) (__args=..., __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:273
#5  std::__1::__function::__alloc_func<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0, std::__1::allocator<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0>, bool (sinsp_threadinfo&)>::operator()[abi:ne200100](sinsp_threadinfo&) (this=0x78a4743efe08, __arg=...)
    at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:167
#6  std::__1::__function::__func<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0, std::__1::allocator<libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0>, bool (sinsp_threadinfo&)>::operator()(sinsp_threadinfo&) (this=0x78a4743efe00, __arg=...)
    at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:319
#7  0x0000000001c89bb7 in std::__1::__function::__value_func<bool (sinsp_threadinfo&)>::operator()[abi:ne200100](sinsp_threadinfo&) const (this=0x78a4743efe00, __args=...)
    at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:436
#8  std::__1::function<bool (sinsp_threadinfo&)>::operator()(sinsp_threadinfo&) const (this=0x78a4743efe00, __arg=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:995
#9  threadinfo_map_t::loop(std::__1::function<bool (sinsp_threadinfo&)>) (this=<optimized out>, callback=...)
    at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/threadinfo.h:612
#10 libs::metrics::libs_state_counters::libs_state_counters (this=<optimized out>, sinsp_stats_v2=..., thread_manager=<optimized out>)
    at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/metrics_collector.cpp:259
#11 0x0000000001c8b887 in libs::metrics::libs_metrics_collector::snapshot (this=0x78a4743efe90)
    at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/metrics_collector.cpp:425
#12 0x000000000179e4b6 in falco_metrics::sources_to_text_prometheus (state=..., prometheus_metrics_converter=..., additional_wrapper_metrics=...) at /home/runner/work/falco/falco/userspace/falco/falco_metrics.cpp:316
#13 0x00000000017a0976 in falco_metrics::to_text_prometheus (state=...) at /home/runner/work/falco/falco/userspace/falco/falco_metrics.cpp:534
#14 0x0000000001775f4b in falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0::operator()(httplib::Request const&, httplib::Response&) const (this=<optimized out>, res=...)
    at /home/runner/work/falco/falco/userspace/falco/webserver.cpp:108
#15 std::__1::__invoke[abi:ne200100]<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&>(falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&) (__f=..., __args=..., __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:179
#16 std::__1::__invoke_void_return_wrapper<void, true>::__call[abi:ne200100]<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&>(falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&) (__args=..., __args=..., __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:251
#17 std::__1::__invoke_r[abi:ne200100]<void, falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&>(falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0&, httplib::Request const&, httplib::Response&) (__args=..., __args=..., __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:273
#18 std::__1::__function::__alloc_func<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0, std::__1::allocator<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0>, void (httplib::Request const&, httplib::Response&)>::operator()[abi:ne200100](httplib::Request const&, httplib::Response&) (this=<optimized out>, __arg=..., __arg=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:167
#19 std::__1::__function::__func<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0, std::__1::allocator<falco_webserver::enable_prometheus_metrics(falco::app::state const&)::$_0>, void (httplib::Request const&, httplib::Response&)>::operator()(httplib::Request const&, httplib::Response&) (this=<optimized out>, __arg=..., __arg=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:319
#20 0x000000000177db26 in std::__1::__function::__value_func<void (httplib::Request const&, httplib::Response&)>::operator()[abi:ne200100](httplib::Request const&, httplib::Response&) const (this=<optimized out>, __args=...,
    __args=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:436
#21 std::__1::function<void (httplib::Request const&, httplib::Response&)>::operator()(httplib::Request const&, httplib::Response&) const (this=<optimized out>, __arg=..., __arg=...)
    at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:995
#22 httplib::Server::dispatch_request(httplib::Request&, httplib::Response&, std::__1::vector<std::__1::pair<std::__1::unique_ptr<httplib::detail::MatcherBase, std::__1::default_delete<httplib::detail::MatcherBase> >, std::__1::function<void (httplib::Request const&, httplib::Response&)> >, std::__1::allocator<std::__1::pair<std::__1::unique_ptr<httplib::detail::MatcherBase, std::__1::default_delete<httplib::detail::MatcherBase> >, std::__1::function<void (httplib::Request const&, httplib::Response&)> > > > const&) const (this=<optimized out>, req=..., res=..., handlers=...) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:7894
#23 0x000000000177db26 in httplib::Server::routing (this=0x78a4743efe88, this@entry=0xa, req=..., res=..., strm=...)
#24 0x000000000177b0d2 in httplib::Server::process_request(httplib::Stream&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, int, bool, bool&, std::__1::function<void (httplib::Request&)> const&) (this=0x78a4cb13c900, strm=..., remote_addr=..., remote_port=<optimized out>, local_addr=..., local_port=8765,
    close_connection=<optimized out>, connection_closed=@0x78a4743f18c7: false, setup_request=...) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:8145
#25 0x000000000177a2f3 in httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}::operator()(httplib::Stream&, bool, bool&) const (this=<optimized out>, strm=..., close_connection=false,
    connection_closed=@0x78a4743efe90: 216) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:8239
#26 httplib::detail::process_server_socket<httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}>(std::__1::atomic<int> const&, int, unsigned long, long, long, long, long, long, httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1})::{lambda(bool, bool&)#1}::operator()(bool, bool&) const (this=this@entry=0x78a4743f1948, close_connection=false, connection_closed=@0x78a4743efe90: 216)
    at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:3452
#27 0x0000000001778244 in httplib::detail::process_server_socket_core<httplib::detail::process_server_socket<httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}>(std::__1::atomic<int> const&, int, unsigned long, long, long, long, long, long, httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1})::{lambda(bool, bool&)#1}>(std::__1::atomic<int> const&, int, unsigned long, long, httplib::detail::process_server_socket<httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}>(std::__1::atomic<int> const&, int, unsigned long, long, long, long, long, long, httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1})::{lambda(bool, bool&)#1}) (svr_sock=..., sock=1950285456, keep_alive_max_count=<optimized out>, keep_alive_timeout_sec=132647720256536, callback=...) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:3433
#28 httplib::detail::process_server_socket<httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}>(std::__1::atomic<int> const&, int, unsigned long, long, long, long, long, long, httplib::Server::process_and_close_socket(int)::{lambda(httplib::Stream&, bool, bool&)#1}) (svr_sock=..., sock=26, keep_alive_max_count=<optimized out>, keep_alive_timeout_sec=132647720256536, read_timeout_sec=5, read_timeout_usec=0, write_timeout_sec=5, write_timeout_usec=0, callback=...) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:3447
#29 httplib::Server::process_and_close_socket (this=<optimized out>, sock=1950285456) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:8234
#30 0x000000000179ac58 in std::__1::__function::__value_func<void ()>::operator()[abi:ne200100]() const (this=0x78a4743f1da0) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:436
#31 std::__1::function<void ()>::operator()() const (this=0x78a4743f1da0) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__functional/function.h:995
#32 httplib::ThreadPool::worker::operator() (this=0x78a4743efe88, this@entry=0x78a4743efe98) at /home/runner/work/falco/falco/build/_deps/cpp-httplib-src/httplib.h:927
#33 0x000000000179a9de in std::__1::__invoke[abi:ne200100]<httplib::ThreadPool::worker>(httplib::ThreadPool::worker&&) (__f=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__type_traits/invoke.h:179
#34 _ZNSt3__116__thread_executeB8ne200100INS_10unique_ptrINS_15__thread_structENS_14default_deleteIS2_EEEEN7httplib10ThreadPool6workerEJETpTnmJEEEvRNS_5tupleIJT_T0_DpT1_EEENS_15__tuple_indicesIJXspT2_EEEE (__t=...) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__thread/thread.h:199
#35 std::__1::__thread_proxy[abi:ne200100]<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, httplib::ThreadPool::worker> >(void*) (__vp=0x78a4743efe90) at /home/runner/work/falco/falco/zig/lib/libcxx/include/__thread/thread.h:208
#36 0x000078a4cb60b51d in start_thread () from /debug/usr/lib/libc.so.6
#37 0x000078a4cb690f6c in __clone3 () from /debug/usr/lib/libc.so.6
(gdb) frame 0
#0  sinsp_threadinfo::get_fd_table (this=0x0) at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/threadinfo.h:438
438	/home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/threadinfo.h: No such file or directory.
(gdb) info args
this = 0x0
(gdb) info locals
root = <optimized out>
(gdb) frame 1
#1  libs::metrics::libs_state_counters::libs_state_counters(std::__1::shared_ptr<sinsp_stats_v2> const&, sinsp_thread_manager*)::$_0::operator()(sinsp_threadinfo&) const (this=0x78a4743efe08, tinfo=...)
    at /home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/metrics_collector.cpp:260
260	/home/runner/work/falco/falco/build/falcosecurity-libs-repo/falcosecurity-libs-prefix/src/falcosecurity-libs/userspace/libsinsp/metrics_collector.cpp: No such file or directory.
(gdb) info args
this = 0x78a4743efe08
tinfo = <error reading variable: Cannot access memory at address 0x0>
(gdb) info locals
fdtable = <optimized out>
(gdb)

Environment

  • Falco version: 0.43.0 amd64
  • System info:
{
  "machine": "x86_64",
  "nodename": "gke-trust-staging-us-default-custom-1-e2802b56-mpk9",
  "release": "6.6.113+",
  "sysname": "Linux",
  "version": "#1 SMP Sat Nov 29 10:43:19 UTC 2025"
}
  • Cloud provider or hardware configuration: GCP/GKE
  • OS: Multiple
  • Kernel: Multiple
  • Installation method: Kubernetes

Metadata

Assignees

No one assigned

    Labels

    kind/bug (Something isn't working)
