Conversation
With a setting of multiple cfs and WriteBufferManager with allow_stall, the DB can enter a deadlock when the WBM initiates a stall. This happens since only the oldest cf is picked for flush when HandleWriteBufferManagerFlush is called to flush the data and prevent the stall. When using multiple CFs, this does not ensure the FreeMem will evict enough memory to prevent a stall and no other flush is scheduled. To fix this, add cfs to the flush queue so that we'll be below the mutable_limit_.
ofriedma
left a comment
There was a problem hiding this comment.
There is still a permanent stall using this command:
./db_stress --acquire_snapshot_one_in=10000 --adaptive_readahead=1 --allow_concurrent_memtable_write=0 --allow_data_in_errors=True --allow_wbm_stalls=1 --async_io=0 --avoid_flush_during_recovery=1 --avoid_unnecessary_blocking_io=1 --backup_max_size=104857600 --backup_one_in=100000 --batch_protection_bytes_per_key=0 --block_size=16384 --bloom_bits=1.3037923446511857 --bottommost_compression_type=none --bytes_per_sync=0 --cache_index_and_filter_blocks=0 --cache_size=8388608 --cache_type=lru_cache --charge_compression_dictionary_building_buffer=1 --charge_file_metadata=1 --charge_filter_construction=0 --charge_table_reader=0 --checkpoint_one_in=0 --checksum_type=kxxHash --clear_column_family_one_in=0 --compact_files_one_in=1000000 --compact_range_one_in=1000000 --compaction_pri=0 --compaction_ttl=1000 --compare_full_db_state_snapshot=0 --compression_max_dict_buffer_bytes=511 --compression_max_dict_bytes=16384 --compression_parallel_threads=1 --compression_type=xpress --compression_use_zstd_dict_trainer=0 --compression_zstd_max_train_bytes=0 --continuous_verification_interval=0 --create_timestamped_snapshot_one_in=20 --customopspercent=0 --data_block_index_type=0 --db=/tmp/rocksdb_crashtest_blackboxod86ddmf --db_write_buffer_size=67108864 --delpercent=19 --delrangepercent=0 --destroy_db_initially=0 --detect_filter_construct_corruption=0 --disable_wal=0 --enable_compaction_filter=0 --enable_pipelined_write=0 --expected_values_dir=/tmp/rocksdb_crashtest_expected_vy0z2l_r --fail_if_options_file_error=0 --fifo_allow_compaction=0 --file_checksum_impl=crc32c --flush_one_in=1000000 --format_version=5 --get_current_wal_file_one_in=0 --get_live_files_one_in=100000 --get_property_one_in=1000000 --get_sorted_wal_files_one_in=0 --index_block_restart_interval=4 --index_type=0 --ingest_external_file_one_in=0 --initial_auto_readahead_size=16384 --initiate_wbm_flushes=0 --iterpercent=7 --key_len_percent_dist=100 --level_compaction_dynamic_level_bytes=True --lock_wal_one_in=1000000 --long_running_snapshots=0 --manual_wal_flush_one_in=1000 --mark_for_compaction_one_file_in=10 --max_auto_readahead_size=16384 --max_background_compactions=20 --max_bytes_for_level_base=10485760 --max_key=102400 --max_key_len=1 --max_manifest_file_size=1073741824 --max_write_batch_group_size_bytes=64 --max_write_buffer_number=3 --max_write_buffer_size_to_maintain=10485760000 --memtable_prefix_bloom_size_ratio=0 --memtable_protection_bytes_per_key=2 --memtable_whole_key_filtering=0 --memtablerep=skip_list --min_write_buffer_number_to_merge=2 --mmap_read=1 --mock_direct_io=False --nooverwritepercent=30 --num_file_reads_for_auto_readahead=2 --num_iterations=25 --open_files=-1 --open_metadata_write_fault_one_in=0 --open_read_fault_one_in=0 --open_write_fault_one_in=0 --ops_per_thread=100000000 --optimize_filters_for_memory=1 --paranoid_file_checks=1 --partition_filters=0 --partition_pinning=3 --pause_background_one_in=1000000 --periodic_compaction_seconds=10 --pinning_policy=speedb_scoped_pinning_policy --prefix_size=-1 --prefixpercent=0 --prepopulate_block_cache=1 --preserve_internal_time_seconds=36000 --progress_reports=0 --read_fault_one_in=32 --readahead_size=0 --readpercent=28 --recycle_log_file_num=0 --reopen=0 --ribbon_starting_level=999 --secondary_cache_fault_one_in=32 --secondary_cache_uri= --seed=3618268266 --set_options_one_in=0 --snapshot_hold_ops=100000 --sst_file_manager_bytes_per_sec=104857600 --sst_file_manager_bytes_per_truncate=1048576 --start_delay_percent=22 --stats_dump_period_sec=600 --subcompactions=2 --sync=0 --sync_fault_injection=1 --sync_wal_one_in=100000 --target_file_size_base=2097152 --target_file_size_multiplier=2 --test_batches_snapshots=0 --top_level_index_pinning=3 --txn_write_policy=0 --unordered_write=0 --unpartitioned_pinning=3 --use_direct_io_for_flush_and_compaction=0 --use_direct_reads=0 --use_dynamic_delay=1 --use_full_merge_v1=False --use_get_entity=1 --use_merge=1 --use_multiget=1 --use_put_entity_one_in=0 --use_txn=1 --user_timestamp_size=0 --value_size_mult=32 --verify_before_write=False --verify_checksum=1 --verify_checksum_one_in=1000000 --verify_db_one_in=100000 --verify_sst_unique_id_in_manifest=1 --wal_bytes_per_sync=0 --wal_compression=none --write_buffer_size=1073741824 --write_dbid_to_manifest=1 --writepercent=46
and memory_usage() instead of mutable_memtable_memory_usage()
With a setting of multiple cfs and WriteBufferManager with allow_stall, the DB can enter a deadlock when the WBM initiates a stall. This happens since only the oldest cf is picked for flush when HandleWriteBufferManagerFlush is called to flush the data and prevent the stall. When using multiple CFs, this does not ensure the FreeMem will evict enough memory to prevent a stall and no other flush is scheduled.
To fix this, add cfs to the flush queue so that we'll be below the mutable_limit_.
closes #857