Fix segmentation fault in mvn update_shapes #27263

Open
yuanxion wants to merge 3 commits into master from fix-segment-fault

Conversation

yuanxion (Contributor)

Details:

  • Add a mutex around the std::vector push_back in mvn update_shapes to avoid contention between multiple threads

Tickets:

github-actions bot added the category: GPU (OpenVINO GPU plugin) label Oct 27, 2024
sys-openvino-ci added the ExternalPR (External contributor) label Oct 27, 2024
yuanxion (Contributor Author)

After applying commit 838043e, the test_cross-lingual-books-alignment.ipynb notebook now passes without a segmentation fault.

yuanxion marked this pull request as ready for review October 27, 2024 09:26
yuanxion requested review from a team as code owners October 27, 2024 09:26
@@ -300,7 +300,7 @@ inline void update_shapes(kernel_selector::Params& p, const kernel_impl_params&
         auto& fd = bp.fused_ops[i];
         fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
         for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
-            fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
+            fd.AddTensor(convert_data_tensor(impl_param.get_input_layout(i)));
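For context, a minimal sketch of what a mutex-guarded AddTensor helper could look like (the actual helper from this PR is not shown in this excerpt, so the member names and mutex placement below are assumptions):

#include <mutex>
#include <vector>

struct DataTensor {};  // placeholder for kernel_selector's DataTensor

// Hypothetical sketch only, not the PR's actual implementation.
struct fused_operation_desc_sketch {
    std::vector<DataTensor> tensors;  // stands in for MultiDataTensor
    std::mutex tensors_mutex;         // assumed: serializes concurrent appends

    void AddTensor(const DataTensor& t) {
        std::lock_guard<std::mutex> lock(tensors_mutex);  // block concurrent push_back
        tensors.push_back(t);
    }
};

Note that a std::mutex member makes the struct non-copyable, so a real implementation would also have to handle copying of the descriptor explicitly.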
Contributor

I think a data race is unexpected here. The kernel_impl_params object should be unique for each primitive in each thread. Could you describe in a bit more detail what causes the data race?

Contributor Author

Hi @vladimir-paramuzov, previously I found from the debug log that update_shapes was called twice:

--> primitive_inst: multiply:Multiply_22381, execute update_impl start
--> primitive_inst: multiply:Multiply_22381, execute update_impl start
[+] update_shapes
[+] update_shapes

and it fails at fd.tensors.push_back, since one of the calls updated the fd.tensors data pointer
(which is why the second "[after ] ***" line is never printed in the log):

[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[before] fd.tensors.size: 0x10, fd.tensors.data: 0x7ffd2484f960
[after ] fd.tensors.size: 0x11, fd.tensors.data: 0x7ffd24a88840

I think this may be caused by the machine having two Arc A770 cards.
After switching to another machine with only one iGPU, I can no longer reproduce the double call to update_shapes in my debug log, so update_shapes being called only once (per primitive) should not be the problem. However, the segmentation fault still occurs.

Having excluded update_shapes, I will next track fd to see whether it is used simultaneously somewhere else (i.e., the potential cause of the data race).
BTW, the commit in this PR also fixes the issue on the iGPU (as well as on the dGPU A770 used earlier for debugging).
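To illustrate the failure mode described above, here is a standalone sketch (not OpenVINO code): when two threads call push_back on the same std::vector, one call can reallocate and move the underlying buffer while the other thread is still writing through the old storage, which is undefined behavior and typically crashes.

// Standalone illustration of the race on a shared std::vector.
// This program has undefined behavior; it usually crashes or corrupts data.
#include <thread>
#include <vector>

int main() {
    std::vector<int> shared;  // plays the role of fd.tensors
    auto writer = [&shared]() {
        for (int i = 0; i < 100000; ++i) {
            // push_back may reallocate and move shared.data() while the
            // other thread is also appending -> data race / segfault.
            shared.push_back(i);
        }
    };
    std::thread t1(writer);
    std::thread t2(writer);
    t1.join();
    t2.join();
}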

Contributor

> update_shapes is only called once (per primitive) and should not be the problem. However, the segmentation fault still occurs.

Could you dig deeper into why it happens?

Contributor Author

@p-durandin Yes, this issue happens when running "start_async()" multiple times in the test_cross-lingual-books-alignment.ipynb notebook.
Replacing the repeated "start_async()" calls with "infer()" shows no issue, so the problem is related to running multiple asynchronous inferences.
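The notebook drives this through the Python AsyncInferQueue, but roughly the same pattern can be written with the C++ API; the sketch below (model path and request count are placeholders) shows the combination that triggers the issue: the THROUGHPUT hint plus several asynchronous requests in flight at once.

// Rough sketch of the triggering pattern using the C++ API; the notebook
// itself uses the Python AsyncInferQueue. Paths and counts are placeholders.
#include <openvino/openvino.hpp>
#include <vector>

int main() {
    ov::Core core;
    auto model = core.read_model("model.xml");  // placeholder model path
    // THROUGHPUT hint -> the GPU plugin creates multiple streams (NUM_STREAMS: 2 here)
    auto compiled = core.compile_model(
        model, "GPU",
        ov::hint::performance_mode(ov::hint::PerformanceMode::THROUGHPUT));

    // Several requests in flight at once, each served by a different stream/thread.
    std::vector<ov::InferRequest> requests;
    for (int i = 0; i < 2; ++i)
        requests.push_back(compiled.create_infer_request());

    for (auto& req : requests)
        req.start_async();  // asynchronous submission, as in the notebook
    for (auto& req : requests)
        req.wait();
}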

Contributor

Added a log to detect whether concurrent access happens. Yes, it indeed happens when running the notebook. benchmark_app doesn't reproduce the issue even with many nstreams.

for (size_t i = 0; i < bp.fused_ops.size(); i++) {
    const auto& fused_prim = impl_param.fused_desc[i];
    auto& fd = bp.fused_ops[i];
    // Diagnostic only: if try_lock fails, another thread is already in this section.
    bool acquired = vector_mutex.try_lock();
    if (!acquired) {
        log_message("WARNING: Concurrent access detected! Another thread is accessing the vector.");
    }
    fd.output_tensor = convert_data_tensor(fused_prim.output_layout);
    for (size_t i = fd.dep_idx_start; i < fd.dep_idx_start + fd.dep_size; i++) {
        fd.tensors.push_back(convert_data_tensor(impl_param.get_input_layout(i)));
    }
    if (acquired) {  // release only if this thread actually took the lock
        vector_mutex.unlock();
    }
}

Another suggestion for the fix: instead of using a mutex, change the std::vector to a concurrent_vector.

src/plugins/intel_gpu/src/kernel_selector/kernel_selector_params.h

#include <tbb/concurrent_vector.h>

//using MultiDataTensor = std::vector<DataTensor>;
using MultiDataTensor = tbb::concurrent_vector<DataTensor>;
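As a standalone illustration (outside the OpenVINO code base) of why this helps: tbb::concurrent_vector::push_back is safe to call from multiple threads, and growing the container never relocates existing elements, unlike std::vector.

// Standalone sketch: concurrent push_back is safe on tbb::concurrent_vector.
#include <tbb/concurrent_vector.h>
#include <thread>

int main() {
    tbb::concurrent_vector<int> values;
    auto writer = [&values]() {
        for (int i = 0; i < 100000; ++i)
            values.push_back(i);  // thread-safe append
    };
    std::thread t1(writer);
    std::thread t2(writer);
    t1.join();
    t2.join();
    // values.size() == 200000, with no data race on the container itself.
}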

clee30 (Contributor) commented Nov 6, 2024

When the warnings below appear, the segfault is triggered.

[2024-11-06 18:51:51.444] [Thread 135250088822336] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.
[2024-11-06 18:51:51.449] [Thread 135245015811648] WARNING: Concurrent access detected! Another thread is accessing the vector.

Contributor

Do I understand correctly that you are referring to the AsyncInferQueue and its start_async() calls in the test_cross-lingual-books-alignment.ipynb notebook?
It seems that this segmentation fault is a side effect of another issue. Could you please check why and how the single network (and the impl_params of its primitives) is being used simultaneously by two different threads? That is not expected behavior.

Contributor Author

Thanks @clee30 & @sshlyapn for your suggestions.

I dumped the model properties, and it seems the numbers of infer requests and streams are different:
(left: with the performance hint, failed; right: without the performance hint, succeeded)

yuanxion (Contributor Author) commented Nov 13, 2024

  1. With config={hints.performance_mode(): hints.PerformanceMode.THROUGHPUT}, the NUM_STREAMS will be 2.
PERFORMANCE_HINT: PerformanceMode.THROUGHPUT
EXECUTION_MODE_HINT: ExecutionMode.PERFORMANCE
COMPILATION_NUM_THREADS: 20
NUM_STREAMS: 2
PERFORMANCE_HINT_NUM_REQUESTS: 0
INFERENCE_PRECISION_HINT: <Type: 'float16'>
DYNAMIC_QUANTIZATION_GROUP_SIZE: 32
ACTIVATIONS_SCALE_FACTOR: 0.0
DEVICE_ID: 0
EXECUTION_DEVICES: ['GPU.0']

And 2 different threads (729e28c00640 and 729e17e00640) are used for start_async:

--> sentence idx = 0 start_async
--> cpu_streams_executor Impl run task with thread: 729e28c00640
--> sentence idx = 1 start_async
--> cpu_streams_executor Impl run task with thread: 729e17e00640

These 2 threads run different networks:

// 1st thread
#8  0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fcf3450,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x5555647b64d0)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250

// 2nd thread
#8  0x00007ffedcf31007 in cldnn::network::execute[abi:cxx11](this=0x55556fdf5370,
#10 ov::intel_gpu::SyncInferRequest::infer (this=0x55556fac2240)
#25 ov::threading::CPUStreamsExecutor::Impl::Execute 0x55556467b250

But they both access the same fd (fused_operation_desc: 0x729bdc004490) and push_back new elements into the same fd.tensors (std::vector: 0x729c0c040850).

--> update_shapes[729e28c00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 3, ptr: 0x729c0c040850
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 4, ptr: 0x729c0c040850

Thus there is a data race on fd.tensors in this THROUGHPUT mode.

  2. BTW, for different mvn primitive_inst instances, the fd (and fd.tensors) are different, so there should be no data race between primitive_inst within the same thread:
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bdc004490, id: mvn:__module.encoder.layer.0.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bdc003bd0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bb9fe0f40, id: mvn:__module.encoder.layer.0.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb83128c0
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bf40106e0, id: mvn:__module.encoder.layer.1.attention.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bf4002690
--> update_shapes[729e17e00640] i: 0, &fd: 0x729bba26f320, id: mvn:__module.encoder.layer.1.output.LayerNorm/aten::layer_norm/MVN, fd.tensors size: 1, ptr: 0x729bb8685ad0

geunhwan added this to the 2024.5 milestone Oct 29, 2024
geunhwan removed this from the 2024.5 milestone Oct 30, 2024
yuanxion force-pushed the fix-segment-fault branch 2 times, most recently from b8d8db9 to 27ddd4d, November 15, 2024 03:41
yuanxion (Contributor Author) commented Nov 15, 2024

Reverted the mutex for fd.tensors (std::vector) and instead copied the _impl KernelData params for each network when running start_async() with multiple threads (THROUGHPUT mode). With commit 2f4b607 there is no segmentation fault anymore for the test_cross-lingual-books-alignment.ipynb notebook, on both the iGPU and the Arc A770.

auto prim_impl = make_unique<permute_impl>(*this);
// The cloned impl otherwise keeps referring to the same KernelData params as the
// source impl, so give the clone its own deep copy to stop the fused_ops tensors
// from being shared across networks.
kernel_params_t* params_ptr = dynamic_cast<kernel_params_t*>(prim_impl->_kernel_data.params.get());
prim_impl->_kernel_data.params = make_unique<kernel_params_t>(*params_ptr);
return prim_impl;
Contributor

I think we can have a template helper to minimize code copy-paste.
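For example, a hedged sketch of what such a helper might look like (the helper name and the use of the _kernel_data/params members are illustrative assumptions mirroring the snippet above, not the actual code base):

// Illustrative template helper to clone an impl together with a private copy of
// its kernel params; names are assumptions, make_unique as in the snippet above.
template <typename ImplType, typename KernelParamsType>
std::unique_ptr<ImplType> clone_with_params_copy(const ImplType& src) {
    auto cloned = make_unique<ImplType>(src);
    if (auto* params = dynamic_cast<KernelParamsType*>(cloned->_kernel_data.params.get())) {
        // give the clone its own params so fused_ops tensors are not shared across networks
        cloned->_kernel_data.params = make_unique<KernelParamsType>(*params);
    }
    return cloned;
}

// Hypothetical usage in each impl's clone code:
// return clone_with_params_copy<permute_impl, kernel_params_t>(*this);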
