-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[vm-node] Dead lock caused by VMNode::reset() while executing QemuFSM_::connect_vm() #14
Comments
|
I was aware of this deadlock. The only reasonable way to address it, as far as I know, is to use a timeout on get_connection_wait(). Boost.ASIO has support for timers, but I never quite got it working. I'll revisit. |
Hi @moralismercatus , I came up with a quick fix right after I posted this issue. The idea is that making "QemuFSM_::connect_vm()" hold the lock of "QemuFSM_::child_" (the running qemu instance) all the time, which can prevent "QemuFSM_::terminate()" (fired by "VMNode::reset()") to kill the running qemu instance at the same time. This should fix the deadlock for this particular situation, while the potential deadlock from the "get_connection_wait()" is still a concern. Please let me know your thoughts. Thanks. |
@likebreath Are you saying that there are two bugs here?
If your fix is sufficient for you now, that's fine, but a timeout would solve (1) and (2), I think. |
@moralismercatus Yes, you are right. I would say (1) is a particular case of (2). |
@likebreath I've implemented a timeout mechanism for open_connection_wait(). Before I do a pull request (just for the feature, not the fix), I need to know if we're still restricting guest libraries to pre-C++11. If so, I need to make a few changes. |
@moralismercatus Thanks. I think the only reason to restrict guest libraries to pre-C++11 only is to make the setup of crete guest libraries easier for various guest OS, like older version of Ubuntu and debian that do not comes with a c++11 compatible compiler by default. Besides this, I can't think of any other reasons. So the real question is do we want to keep it or not. Let me know your thoughts. No hurry on the pull request, as it is not urgent. |
My vote is to lift the restriction. Ubuntu 14.04 (which is 2 years old now) comes with gcc-4.8 which is practically C++11 compliant. For the older systems, I think the convenience of development outweighs the inconvenience of downloading a clang binary. |
@moralismercatus It seems that the deadlock is still here after my fix (1fa4c88). I am surprised that even I made 'QemuFSM_::connect_vm()' hold the lock on 'QemuFSM_::child_', 'QemuFSM_::terminate' was still processed and killed qemu while connect_vm() was running. I thought ''QemuFSM_::terminate'' should be blocked by this line of code: Please let me know your thoughts on this. |
@likebreath Could it be that terminate() is called, acquires, the lock, kills QEMU, and releases the lock before connect_vm() acquires the lock? |
Not likely. If termniate() has killed QEMU before connect_vm() acquires the lock, there should be an exception thrown in connect_vm(): `
... |
I may know the problem. The lock is acquired outside the async_task block, is it not? The worker thread that actually connects to the VM is spun off and the lock would then be released upon scope exit. |
@moralismercatus |
@likebreath Looks good to me. |
While QemuFSM_::connect_vm() is waiting for "server->open_connection_wait()", a reset on vm-node ( signal 'packet_type::cluster_reset' from dispatch) will cause deadlock.
The deadlock happened while executing "vms_.clear();" in VMNode::rest(). It tries to destroy all QemuFSM_ within the vm-node, which finally tries to destory the async_task that is executing QemuFSM_::connect_vm().
The text was updated successfully, but these errors were encountered: