Description
Bug Report
Describe the bug
When filter_multiline runs in threaded mode, segmentation faults or deadlocks occur at random, especially under high load.
I assume the cause is that the flb_log_event_encoder functions are not thread-safe.
There is also the auto-closed issue #6728 and an open but outdated PR #6765 from @nokute78. Both describe a similar problem, which is evidently still not fixed.
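To make the suspected problem class concrete, here is a minimal sketch of two threads driving one shared encoder-like object, with every access serialized by a mutex. It is not Fluent Bit code: the struct `demo_encoder` and all `demo_*` functions are hypothetical stand-ins, and the assumption that the real encoder is shared between an input thread and the pipeline flush timer comes only from the stack traces in this report.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for shared encoder state (NOT the real Fluent Bit struct). */
struct demo_encoder {
    pthread_mutex_t lock;
    long committed;    /* records committed so far */
    int open_scopes;   /* scopes currently entered */
};

struct demo_worker_args {
    struct demo_encoder *enc;
    int iterations;
};

static void demo_encoder_init(struct demo_encoder *enc)
{
    assert(enc != NULL);
    pthread_mutex_init(&enc->lock, NULL);
    enc->committed = 0;
    enc->open_scopes = 0;
}

/* Enter a scope, commit one record, leave the scope, all under one lock. */
static void demo_encoder_commit(struct demo_encoder *enc)
{
    pthread_mutex_lock(&enc->lock);
    enc->open_scopes++;
    enc->committed++;
    enc->open_scopes--;
    pthread_mutex_unlock(&enc->lock);
}

static void *demo_worker(void *arg)
{
    struct demo_worker_args *args = arg;
    int i;

    for (i = 0; i < args->iterations; i++) {
        demo_encoder_commit(args->enc);
    }
    return NULL;
}

/* Runs two threads against one encoder; returns the commit count, or -1 on error. */
static long demo_run_two_threads(int iterations)
{
    struct demo_encoder enc;
    struct demo_worker_args args;
    pthread_t tail_thread;
    pthread_t flush_thread;
    long total;

    demo_encoder_init(&enc);
    args.enc = &enc;
    args.iterations = iterations;

    if (pthread_create(&tail_thread, NULL, demo_worker, &args) != 0) {
        pthread_mutex_destroy(&enc.lock);
        return -1;
    }
    if (pthread_create(&flush_thread, NULL, demo_worker, &args) != 0) {
        pthread_join(tail_thread, NULL);
        pthread_mutex_destroy(&enc.lock);
        return -1;
    }
    pthread_join(tail_thread, NULL);
    pthread_join(flush_thread, NULL);

    total = enc.committed;
    pthread_mutex_destroy(&enc.lock);
    return total;
}
```

With the lock in place, `demo_run_two_threads(n)` always yields `2 * n` commits. Without it, the counter updates (and, in the real code, the list manipulation seen in the crash trace) would race. Whether a lock, per-thread encoders, or another design fits Fluent Bit best is a question for the maintainers.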
Example deadlock stacktraces:
flb_log_event_encoder_commit_record
Thread 57 (Thread 0x7fbe132dc6c0 (LWP 113) "flb-in-tail.47-"):
#0 futex_wait (private=0, expected=2, futex_word=0x7fbe4ec16708) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7fbe4ec16708, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007fbe505a90f1 in lll_mutex_lock_optimized (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:93
#4 0x00005648f0d551d1 in ?? ()
#5 0x00005648f0d625b0 in ?? ()
#6 0x00005648f0cf2417 in ?? ()
#7 0x00005648f0dd4436 in flb_log_event_encoder_dynamic_field_scope_leave ()
#8 0x00005648f0dd465d in flb_log_event_encoder_dynamic_field_flush ()
#9 0x00005648f0dd2ac6 in flb_log_event_encoder_commit_record ()
#10 0x00005648f0db459d in flb_ml_flush_stream_group ()
#11 0x00005648f0dd6627 in flb_ml_rule_process ()
#12 0x00005648f0db4f9b in ?? ()
#13 0x00005648f0db5458 in ?? ()
#14 0x00005648f0db573d in flb_ml_append_object ()
#15 0x00005648f0eb7963 in ?? ()
#16 0x00005648f0da95bb in flb_processor_run ()
#17 0x00005648f0dcc8e7 in ?? ()
#18 0x00005648f0dcca6c in flb_input_log_append_skip_processor_stages ()
#19 0x00005648f0ebe3dc in ?? ()
#20 0x00005648f0da95bb in flb_processor_run ()
#21 0x00005648f0dcc8e7 in ?? ()
#22 0x00005648f0dcca9d in flb_input_log_append_records ()
#23 0x00005648f0e0b516 in flb_tail_file_chunk ()
#24 0x00005648f0e05c57 in in_tail_collect_event ()
flb_log_event_encoder_dynamic_field_reset
Thread 153 (Thread 0x7fbe4f67f6c0 (LWP 17) "flb-pipeline"):
#0 futex_wait (private=0, expected=2, futex_word=0x7fbe4ec16708) at ../sysdeps/nptl/futex-internal.h:146
#1 __GI___lll_lock_wait (futex=futex@entry=0x7fbe4ec16708, private=0) at ./nptl/lowlevellock.c:49
#2 0x00007fbe505a90f1 in lll_mutex_lock_optimized (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:48
#3 ___pthread_mutex_lock (mutex=0x7fbe4ec16708) at ./nptl/pthread_mutex_lock.c:93
#4 0x00005648f0d551d1 in ?? ()
#5 0x00005648f0d625b0 in ?? ()
#6 0x00005648f0cf2417 in ?? ()
#7 0x00005648f0dd4436 in flb_log_event_encoder_dynamic_field_scope_leave ()
#8 0x00005648f0dd46aa in flb_log_event_encoder_dynamic_field_reset ()
#9 0x00005648f0dd2891 in flb_log_event_encoder_reset_record ()
#10 0x00005648f0dd2979 in flb_log_event_encoder_emit_record ()
#11 0x00005648f0db459d in flb_ml_flush_stream_group ()
#12 0x00005648f0db4cd5 in flb_ml_flush_parser_instance ()
#13 0x00005648f0db4d91 in flb_ml_flush_pending ()
#14 0x00005648f0da0446 in flb_sched_event_handler ()
#15 0x00005648f0d9c7c8 in flb_engine_start ()
#16 0x00005648f0d79268 in ?? ()
#17 0x00007fbe505a5a94 in start_thread (arg=) at ./nptl/pthread_create.c:447
#18 0x00007fbe50632c3c in clone3 () at ../sysdeps/unix/sysv/linux/x86_64/clone3.S:78
Similar stack traces show up for other flb_log_event_encoder functions.
Example stacktrace for segmentation fault crash:
[2025/01/09 08:36:09] [engine] caught signal (SIGSEGV)
[2025/01/09 08:36:09] [engine] caught signal (SIGSEGV)
#0 0x55a8643027c8 in cfl_list_add_before() at lib/cfl/include/cfl/cfl_list.h:130
#1 0x55a864302832 in cfl_list_prepend() at lib/cfl/include/cfl/cfl_list.h:154
#2 0x55a8643063f2 in flb_log_event_encoder_dynamic_field_scope_enter() at src/flb_log_event_encoder_dynamic_field.c:67
#3 0x55a864306524 in flb_log_event_encoder_dynamic_field_begin_array() at src/flb_log_event_encoder_dynamic_field.c:124
#4 0x55a8642fbab2 in flb_log_event_encoder_emit_record() at src/flb_log_event_encoder.c:168
#5 0x55a8642fbd1c in flb_log_event_encoder_commit_record() at src/flb_log_event_encoder.c:267
#6 0x55a8642806a0 in flb_ml_flush_stream_group() at src/multiline/flb_ml.c:1505
#7 0x55a86427d92a in flb_ml_flush_parser_instance() at src/multiline/flb_ml.c:117
#8 0x55a86427d9e0 in flb_ml_flush_pending() at src/multiline/flb_ml.c:137
#9 0x55a86427da93 in cb_ml_flush_timer() at src/multiline/flb_ml.c:163
#10 0x55a864225b73 in flb_sched_event_handler() at src/flb_scheduler.c:624
#11 0x55a864216cf7 in flb_engine_start() at src/flb_engine.c:1044
#12 0x55a8641ae5d4 in flb_lib_worker() at src/flb_lib.c:763
#13 0x7f2ac7abaa93 in start_thread() at c:447
#14 0x7f2ac7b47c3b in clone3() at inux/x86_64/clone3.S:78
#15 0xffffffffffffffff in ???() at ???:0
@nokute78 (cc @edsiper) Was there a reason why #6765 was not merged (or updated to the current code base)?
To Reproduce
- Use tail input plugin (we use globs for multiple files)
- Use multiline filter with threaded mode enabled
- Put enough load on it and watch it crash, or inspect the deadlock in gdb, e.g. with:
gdb -p <pid> --batch -ex "thread apply all bt" -ex "detach" -ex "quit"
Your Environment
- Version used: 3.2.4 (but the issue has existed for many versions)
Maybe related:
As I read in the announcement of v2.0.2, the memory ring buffer mem_buf_limit should be no less than 20M in size. As far as I understand the code, the in_emitter is used with memrb in the case of the threaded multiline filter.
However, as I've already mentioned in #8473, there is this strange (and most probably wrong) assignment:
fluent-bit/plugins/in_emitter/emitter.c
Line 245 in 9652b0d
The default value for the flush frequency is 2000, so I assume this would set the ring buffer size to only 2k. Can you please verify this, @nokute78 @edsiper @leonardo-albertovich @pwhelan?
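The size mismatch suspected above can be put into numbers with a small sketch. Everything here is an assumption for illustration: the `DEMO_*` constants and `demo_*` helpers are hypothetical, "20M" is read as 20 * 1000 * 1000 bytes, and the idea that the value 2000 ends up as a byte count is the unverified reading of the assignment from this report.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: the flush-frequency default (2000) is used as a size in bytes. */
#define DEMO_ASSUMED_BUFFER_BYTES   ((size_t) 2000)

/* Assumption: "20M" from the v2.0.2 announcement, read as decimal megabytes. */
#define DEMO_RECOMMENDED_MIN_BYTES  ((size_t) 20 * 1000 * 1000)

/* Returns 1 if the buffer meets the recommended minimum, 0 otherwise. */
static int demo_size_is_sufficient(size_t size_bytes)
{
    return size_bytes >= DEMO_RECOMMENDED_MIN_BYTES ? 1 : 0;
}

/* How many times smaller than the recommended minimum the buffer is. */
static size_t demo_shortfall_factor(size_t size_bytes)
{
    assert(size_bytes > 0);
    return DEMO_RECOMMENDED_MIN_BYTES / size_bytes;
}
```

Under these assumptions, `demo_shortfall_factor(DEMO_ASSUMED_BUFFER_BYTES)` is 10000, meaning the ring buffer would be four orders of magnitude smaller than the recommended minimum.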