[vm-node] Dead lock caused by VMNode::reset() while executing QemuFSM_::connect_vm() #14

Closed
likebreath opened this issue Jan 2, 2017 · 14 comments

@likebreath
Collaborator

While QemuFSM_::connect_vm() is waiting in "server->open_connection_wait()", a reset of the vm-node (triggered by the 'packet_type::cluster_reset' signal from dispatch) causes a deadlock.

The deadlock happens while executing "vms_.clear();" in VMNode::reset(). That call tries to destroy every QemuFSM_ within the vm-node, which in turn tries to destroy the async_task that is still executing QemuFSM_::connect_vm().
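To make the mechanism concrete, here is a minimal sketch of a task whose destructor joins its worker thread. It is not the crete AsyncTask; `JoiningTask`, `stop_`, and `done_` are hypothetical names invented for illustration. The worker loop stands in for a wait that may never complete on its own, and the stop flag is one assumed way to let the destructor return.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical task wrapper (not crete::AsyncTask) whose destructor joins
// its worker thread.
class JoiningTask
{
public:
    // `done` is set by the worker right before it exits, so callers can
    // observe that the worker really finished.
    explicit JoiningTask(std::atomic<bool>& done)
        : stop_(false)
        , done_(done)
        , worker_([this] { run(); })
    {}

    ~JoiningTask()
    {
        // Without this stop request the worker would loop forever and the
        // join() below would block the destroying thread indefinitely.
        stop_ = true;
        worker_.join();
    }

private:
    void run()
    {
        // Stands in for a blocking wait on an event that may never arrive.
        while (!stop_.load())
            std::this_thread::sleep_for(std::chrono::milliseconds(1));
        done_ = true;
    }

    // Declaration order matters: the flags must be initialized before the
    // worker thread starts.
    std::atomic<bool> stop_;
    std::atomic<bool>& done_;
    std::thread worker_;
};
```

If the `stop_ = true;` line is removed, destroying a `JoiningTask` hangs in `join()`, which is the same shape as frames #2 to #5 of the backtrace in this issue.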

@likebreath likebreath added the bug label Jan 2, 2017
@likebreath
Collaborator Author

likebreath commented Jan 2, 2017

#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00002b317c4f96bb in boost::condition_variable::wait(boost::unique_lock<boost::mutex>&) () from /home/chenbo/MSoftware/libs/boost_1_59_0/stage/lib/libboost_thread.so.1.59.0
**#2  0x00002b317c4f4c5c in boost::thread::join_noexcept() () from /home/chenbo/MSoftware/libs/boost_1_59_0/stage/lib/libboost_thread.so.1.59.0
#3  0x000000000054697f in boost::thread::join (this=0x32f8820) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/thread/detail/thread.hpp:765
#4  0x000000000054680e in crete::AsyncTask::~AsyncTask (this=0x32f8820) at /home/chenbo/crete/crete-dev/lib/include/crete/async_task.h:103
#5  0x0000000000546685 in crete::AsyncTask::~AsyncTask (this=0x32f8820) at /home/chenbo/crete/crete-dev/lib/include/crete/async_task.h:82**
#6  0x00002b317b8f815e in std::default_delete<crete::AsyncTask>::operator() (this=0x32a7bc8, __ptr=0x32f8820)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/unique_ptr.h:67
#7  0x00002b317b8f80c6 in std::unique_ptr<crete::AsyncTask, std::default_delete<crete::AsyncTask> >::~unique_ptr (this=0x32a7bc8)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/unique_ptr.h:184
#8  0x00002b317b8f8075 in std::unique_ptr<crete::AsyncTask, std::default_delete<crete::AsyncTask> >::~unique_ptr (this=0x32a7bc8)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/unique_ptr.h:181
**#9  0x00002b317b9477a5 in crete::cluster::node::vm::fsm::QemuFSM_::NextTest::~NextTest (this=0x32a7bc0) at /home/chenbo/crete/crete-dev/lib/cluster/vm_node_fsm.cpp:455**
#10 0x00002b317b947575 in crete::cluster::node::vm::fsm::QemuFSM_::NextTest::~NextTest (this=0x32a7bc0) at /home/chenbo/crete/crete-dev/lib/cluster/vm_node_fsm.cpp:455
#11 0x00002b317b947403 in boost::fusion::vector_data15<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start>::~vector_data15 (this=0x32a7ba0)
    at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/vector/detail/cpp03/preprocessed/vector20.hpp:739
#12 0x00002b317b947355 in boost::fusion::vector15<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start>::~vector15 (this=0x32a7ba0)
    at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/vector/detail/cpp03/preprocessed/vector20.hpp:806
#13 0x00002b317b947335 in boost::fusion::vector15<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start>::~vector15 (this=0x32a7ba0)
    at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/vector/detail/cpp03/preprocessed/vector20.hpp:806
#14 0x00002b317b947315 in boost::fusion::vector<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_>::~vector (this=0x32a7ba0) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/vector/detail/cpp03/preprocessed/vvector30.hpp:14
#15 0x00002b317b9472f5 in boost::fusion::vector<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_>::~vector (this=0x32a7ba0) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/vector/detail/cpp03/preprocessed/vvector30.hpp:14
#16 0x00002b317b9472d5 in boost::fusion::set<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_>::~set (this=0x32a7ba0) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/set/detail/cpp03/preprocessed/set30.hpp:14
#17 0x00002b317b946125 in boost::fusion::set<crete::cluster::node::vm::fsm::QemuFSM_::Error, crete::cluster::node::vm::fsm::QemuFSM_::Terminated, crete::cluster::node::vm::fsm::QemuFSM_::Valid, crete::cluster::node::vm::fsm::QemuFSM_::Active, crete::cluster::node::vm::fsm::QemuFSM_::Finished, crete::cluster::node::vm::fsm::QemuFSM_::StoreTrace, crete::cluster::node::vm::fsm::QemuFSM_::Testing, crete::cluster::node::vm::fsm::QemuFSM_::NextTest, crete::cluster::node::vm::fsm::QemuFSM_::GuestDataRxed, crete::cluster::node::vm::fsm::QemuFSM_::RxGuestData, crete::cluster::node::vm::fsm::QemuFSM_::ConnectVM, crete::cluster::node::vm::fsm::QemuFSM_::StartVM, crete::cluster::node::vm::fsm::QemuFSM_::UpdateImage, crete::cluster::node::vm::fsm::QemuFSM_::ValidateImage, crete::cluster::node::vm::fsm::QemuFSM_::Start, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_, boost::fusion::void_>::~set (this=0x32a7ba0) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/fusion/container/set/detail/cpp03/preprocessed/set30.hpp:14
#18 0x00002b317b9460d5 in boost::msm::back::state_machine<crete::cluster::node::vm::fsm::QemuFSM_, boost::parameter::void_, boost::parameter::void_, boost::parameter::void_, boost::parameter::void_>::~state_machine (this=0x32a7800) at /home/chenbo/MSoftware/libs/boost_1_59_0/boost/msm/back/state_machine.hpp:148
**#19 0x00002b317b9460a5 in crete::cluster::node::vm::fsm::QemuFSM::~QemuFSM (this=0x32a7800) at /home/chenbo/crete/crete-dev/lib/cluster/vm_node_fsm.cpp:1093**
#20 0x00002b317b946085 in crete::cluster::node::vm::fsm::QemuFSM::~QemuFSM (this=0x32a7800) at /home/chenbo/crete/crete-dev/lib/cluster/vm_node_fsm.cpp:1093
#21 0x00002b317b9c2b8e in std::_Sp_counted_ptr<crete::cluster::node::vm::fsm::QemuFSM*, (__gnu_cxx::_Lock_policy)1>::_M_dispose (this=0x32f8c80)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr_base.h:290
#22 0x000000000055ddda in std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)1>::_M_release (this=0x32f8c80)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr_base.h:144
#23 0x000000000055dd7d in std::__shared_count<1>::~__shared_count (this=0x1e258a8) at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr_base.h:553
#24 0x000000000055dd45 in std::__shared_count<1>::~__shared_count (this=0x1e258a8) at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr_base.h:551
#25 0x000000000055dd1c in std::__shared_ptr<crete::cluster::node::vm::fsm::QemuFSM, 1>::~__shared_ptr (this=0x1e258a0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr_base.h:810
#26 0x000000000055dcf5 in std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>::~shared_ptr (this=0x1e258a0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr.h:93
#27 0x000000000055dcd5 in std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>::~shared_ptr (this=0x1e258a0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/shared_ptr.h:93
#28 0x000000000055dca5 in std::_Destroy<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM> > (__pointer=0x1e258a0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_construct.h:93
#29 0x000000000055dc6f in std::_Destroy_aux<false>::__destroy<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>*> (__first=0x1e258a0, __last=0x1e258b0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_construct.h:103
#30 0x000000000055dc2d in std::_Destroy<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>*> (__first=0x1e258a0, __last=0x1e258b0)
    at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_construct.h:126
#31 0x000000000055daa1 in std::_Destroy<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>*, std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM> > (__first=0x1e258a0, 
    __last=0x1e258b0) at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_construct.h:151
#32 0x00002b317b95e713 in std::vector<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>, std::allocator<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM> > >::_M_erase_at_end (this=0x7ffc98f1eb38, __pos=0x1e258a0) at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_vector.h:1352
#33 0x00002b317b93fa74 in std::vector<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM>, std::allocator<std::shared_ptr<crete::cluster::node::vm::fsm::QemuFSM> > >::clear (
    this=0x7ffc98f1eb38) at /usr/lib/gcc/x86_64-linux-gnu/4.8/../../../../include/c++/4.8/bits/stl_vector.h:1126
**#34 0x00002b317b93e3e6 in crete::cluster::VMNode::reset (this=0x7ffc98f1e900) at /home/chenbo/crete/crete-dev/lib/cluster/vm_node.cpp:257**
#35 0x000000000054ac79 in crete::cluster::process_default<crete::AtomicGuard<crete::cluster::VMNode> > (node=..., request=...)
    at /home/chenbo/crete/crete-dev/lib/include/crete/cluster/node_driver.h:237
#36 0x0000000000545d9f in crete::cluster::NodeDriver<crete::cluster::VMNode>::run_listener (this=0x7ffc98f1e8c0)
    at /home/chenbo/crete/crete-dev/lib/include/crete/cluster/node_driver.h:156

@moralismercatus
Collaborator

I was aware of this deadlock.

The only reasonable way to address it, as far as I know, is to use a timeout on get_connection_wait(). Boost.ASIO has support for timers, but I never quite got them working. I'll revisit.
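As a rough illustration of what a bounded wait looks like, here is a sketch built on the standard library's `std::condition_variable::wait_for`. It does not use Boost.ASIO timers and is not the crete server API; `ConnectionGate`, `signal_connected()`, and `wait_connected()` are hypothetical names.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for a blocking "wait until the guest connects" call.
class ConnectionGate
{
public:
    // Called by the accepting side once a connection has been established.
    void signal_connected()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            connected_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until a connection is signalled or `timeout` elapses.
    // Returns true if connected, false if the wait timed out.
    bool wait_connected(std::chrono::milliseconds timeout)
    {
        std::unique_lock<std::mutex> lock(mutex_);
        return cv_.wait_for(lock, timeout, [this] { return connected_; });
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    bool connected_ = false;
};
```

A caller that gets `false` back can give up and let its thread exit, so a destructor that joins the thread is no longer blocked forever.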

likebreath added a commit to likebreath/crete-dev that referenced this issue Jan 3, 2017
@likebreath
Collaborator Author

Hi @moralismercatus, I came up with a quick fix right after I posted this issue.

The idea is to make "QemuFSM_::connect_vm()" hold the lock on "QemuFSM_::child_" (the running qemu instance) for its whole duration, which prevents "QemuFSM_::terminate()" (fired by "VMNode::reset()") from killing the running qemu instance at the same time.

This should fix the deadlock in this particular situation, although the potential deadlock in "get_connection_wait()" remains a concern.

Please let me know your thoughts. Thanks.

@moralismercatus
Collaborator

@likebreath Are you saying that there are two bugs here?

  1. QemuFSM_::terminate() is killing QEMU before it has a chance to connect, so a connection is never made.
  2. vms_.clear() is blocked because the AsyncTask for connect_vm() isn't completed.

If your fix is sufficient for you now, that's fine, but a timeout would solve (1) and (2), I think.

@likebreath
Collaborator Author

@moralismercatus Yes, you are right. I would say (1) is a particular case of (2).

@moralismercatus
Collaborator

@likebreath I've implemented a timeout mechanism for open_connection_wait(). Before I do a pull request (just for the feature, not the fix), I need to know if we're still restricting guest libraries to pre-C++11. If so, I need to make a few changes.

@likebreath
Collaborator Author

@moralismercatus Thanks.

I think the only reason to restrict guest libraries to pre-C++11 is to make setting up the crete guest libraries easier on various guest OSes, such as older versions of Ubuntu and Debian that do not come with a C++11-compatible compiler by default. Besides this, I can't think of any other reason. So the real question is whether we want to keep the restriction or not.

Let me know your thoughts. No hurry on the pull request, as it is not urgent.

@moralismercatus
Collaborator

My vote is to lift the restriction. Ubuntu 14.04 (which is 2 years old now) comes with gcc-4.8, which is practically C++11 compliant. For the older systems, I think the convenience of development outweighs the inconvenience of downloading a clang binary.

@likebreath
Collaborator Author

@moralismercatus It seems that the deadlock is still here after my fix (1fa4c88).

I am surprised that even though I made 'QemuFSM_::connect_vm()' hold the lock on 'QemuFSM_::child_', 'QemuFSM_::terminate' was still processed and killed qemu while connect_vm() was running. I thought 'QemuFSM_::terminate' would be blocked by this line of code:
"auto pid = fsm.child_->acquire()->get_id();"

Please let me know your thoughts on this.

@moralismercatus
Collaborator

@likebreath Could it be that terminate() is called, acquires the lock, kills QEMU, and releases the lock before connect_vm() acquires the lock?

@likebreath
Collaborator Author

likebreath commented Jan 6, 2017

@moralismercatus

Not likely. If terminate() had killed QEMU before connect_vm() acquired the lock, an exception would have been thrown in connect_vm():

```cpp
struct QemuFSM_::connect_vm
{
    template <class EVT, class FSM, class SourceState, class TargetState>
    auto operator()(EVT const&, FSM& fsm, SourceState&, TargetState& ts) -> void
    {
        auto lock = fsm.child_->acquire();
        auto pid = lock->get_id();

        if(!process::is_running(pid))
        {
            BOOST_THROW_EXCEPTION(VMException{} << err::process_exited{"pid_"});
        }

        ...
    }
};
```

@moralismercatus
Collaborator

I may know the problem. The lock is acquired outside the async_task block, is it not? The worker thread that actually connects to the VM is spun off, and the lock is then released on scope exit while the worker is still running.
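To illustrate the scoping point, here is a minimal sketch in which the lock is taken inside the worker's own body, so it stays held for as long as the work runs. It is not the crete code: `Guarded` and `start_worker` are hypothetical names, and the `started` / `may_finish` handshake exists only so the effect can be observed from outside.

```cpp
#include <future>
#include <mutex>
#include <thread>

// Hypothetical guarded resource standing in for the child process handle.
struct Guarded
{
    std::mutex mutex;
};

// Starts a worker that holds `res.mutex` for the whole duration of its work.
// `started` is fulfilled once the worker owns the lock; the worker then waits
// on `may_finish`, which stands in for the actual connection work.
std::thread start_worker(Guarded& res,
                         std::promise<void>& started,
                         std::shared_future<void> may_finish)
{
    // A lock_guard declared here, in the launching function, would be
    // released as soon as start_worker() returns, while the worker below
    // might still be running.
    return std::thread([&res, &started, may_finish] {
        std::lock_guard<std::mutex> lock(res.mutex); // acquired on the worker
        started.set_value();
        may_finish.wait();
    });                                              // released when the work ends
}
```

While the worker is inside its body, any other thread that tries to take `res.mutex` has to wait, which is the behaviour the original fix was aiming for.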

likebreath added a commit to likebreath/crete-dev that referenced this issue Jan 6, 2017
@likebreath
Collaborator Author

@moralismercatus
I revised my fix for this issue in cd3e304. Please review this change. Thanks.

@moralismercatus
Collaborator

@likebreath Looks good to me.

likebreath added a commit to likebreath/crete-dev that referenced this issue Feb 10, 2017
likebreath added a commit to likebreath/crete-dev that referenced this issue Feb 10, 2017
likebreath added a commit that referenced this issue Feb 16, 2017
likebreath added a commit that referenced this issue Feb 16, 2017