[BUG] PDLP throws exception in concurrent mode #967

@Iroy30

Describe the bug
When run_concurrent is called, run_pdlp intermittently throws an exception, which terminates the barrier thread before it is joined. PR #966 catches the exception to avoid a crash, but the root cause of the exception still needs to be investigated. The caught exception is shown below:

===================================================================================================== FAILURES ======================================================================================================
___________________________________________________________________________________ test_incumbent_get_callback[/mip/swath1.mps] ____________________________________________________________________________________
file_name = '/mip/swath1.mps'
    @pytest.mark.parametrize(
        "file_name",
        [
            ("/mip/swath1.mps"),
            ("/mip/neos5-free-bound.mps"),
        ],
    )
    def test_incumbent_get_callback(file_name):
>       _run_incumbent_solver_callback(file_name, include_set_callback=False)
python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py:112: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
python/cuopt/cuopt/tests/linear_programming/test_incumbent_callbacks.py:87: in _run_incumbent_solver_callback
    solution = solver.Solve(data_model_obj, settings)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
/raid/iroy/miniforge3/envs/py313/lib/python3.13/site-packages/cuopt/utilities/exception_handler.py:48: in func
    raise e
/raid/iroy/miniforge3/envs/py313/lib/python3.13/site-packages/cuopt/utilities/exception_handler.py:24: in func
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
/raid/iroy/miniforge3/envs/py313/lib/python3.13/site-packages/cuopt/linear_programming/solver/solver.py:98: in Solve
    s = solver_wrapper.Solve(
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
>   ???
E   RuntimeError: CUDA error encountered at: file=/home/nfs/iroy/cuopt-1/cpp/src/pdlp/utilities/ping_pong_graph.cu line=56: call='cudaStreamEndCapture(stream_view_.value(), &even_graph)', Reason=cudaErrorStreamCaptureInvalidated:operation failed due to a previous error during capture
E   Obtained 49 stack frames
E   #1 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so(+0x2b66b1) [0x7f31a28fc6b1]
E   #2 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::ping_pong_graph_t<int>::end_capture(int) +0xa9e [0x7f31a2aecaee]
E   #3 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::pdhg_solver_t<int, double>::compute_next_primal_dual_solution_reflected(rmm::device_uvector<double>&, rmm::device_uvector<double>&, bool) +0x4cc [0x7f31a29e045c]
E   #4 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::pdhg_solver_t<int, double>::take_step(rmm::device_uvector<double>&, rmm::device_uvector<double>&, int, bool, int, bool) +0x8b [0x7f31a29e370b]
E   #5 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::pdlp_solver_t<int, double>::run_solver(cuopt::timer_t const&) +0xbdc [0x7f31a29ca19c]
E   #6 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so(+0x3265a2) [0x7f31a296c5a2]
E   #7 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::optimization_problem_solution_t<int, double> cuopt::linear_programming::run_pdlp<int, double>(cuopt::linear_programming::detail::problem_t<int, double>&, cuopt::linear_programming::pdlp_solver_settings_t<int, double> const&, cuopt::timer_t const&, bool) +0xcd [0x7f31a297093d]
E   #8 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::optimization_problem_solution_t<int, double> cuopt::linear_programming::run_concurrent<int, double>(cuopt::linear_programming::detail::problem_t<int, double>&, cuopt::linear_programming::pdlp_solver_settings_t<int, double> const&, cuopt::timer_t const&, bool) +0x323 [0x7f31a2972203]
E   #9 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::optimization_problem_solution_t<int, double> cuopt::linear_programming::solve_lp_with_method<int, double>(cuopt::linear_programming::detail::problem_t<int, double>&, cuopt::linear_programming::pdlp_solver_settings_t<int, double> const&, cuopt::timer_t const&, bool) +0x35 [0x7f31a2973285]
E   #10 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::diversity_manager_t<int, double>::run_solver() +0x1184 [0x7f31a2d83844]
E   #11 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::detail::mip_solver_t<int, double>::run_solver() +0x1f8b [0x7f31a2d7272b]
E   #12 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::mip_solution_t<int, double> cuopt::linear_programming::run_mip<int, double>(cuopt::linear_programming::detail::problem_t<int, double>&, cuopt::linear_programming::mip_solver_settings_t<int, double> const&, cuopt::timer_t&) +0x1186 [0x7f31a2d64d86]
E   #13 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::linear_programming::mip_solution_t<int, double> cuopt::linear_programming::solve_mip<int, double>(cuopt::linear_programming::optimization_problem_t<int, double>&, cuopt::linear_programming::mip_solver_settings_t<int, double> const&) +0xcec [0x7f31a2d6645c]
E   #14 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: std::unique_ptr<cuopt::linear_programming::mip_solution_interface_t<int, double>, std::default_delete<cuopt::linear_programming::mip_solution_interface_t<int, double> > > cuopt::linear_programming::solve_mip<int, double>(cuopt::linear_programming::optimization_problem_interface_t<int, double>*, cuopt::linear_programming::mip_solver_settings_t<int, double> const&) +0x176 [0x7f31a2d6a8d6]
E   #15 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::cython::call_solve_mip(cuopt::linear_programming::optimization_problem_interface_t<int, double>*, cuopt::linear_programming::mip_solver_settings_t<int, double>&) +0x61 [0x7f31a2aedb51]
E   #16 in /raid/iroy/miniforge3/envs/py313/lib/libcuopt.so: cuopt::cython::call_solve(cuopt::mps_parser::data_model_view_t<int, double>*, cuopt::linear_programming::solver_settings_t<int, double>*, unsigned int, bool) +0x7e3 [0x7f31a2aee843]
E   #17 in /raid/iroy/miniforge3/envs/py313/lib/python3.13/site-packages/cuopt/linear_programming/solver/solver_wrapper.cpython-313-x86_64-linux-gnu.so(+0x529ff) [0x7f31a88de9ff]
E   #18 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: PyObject_Vectorcall +0x2e [0x557521595e6e]
E   #19 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x9245 [0x5575215ad375]
E   #20 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x27b2e7) [0x5575216672e7]
E   #21 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2cac98) [0x5575216b6c98]
E   #22 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyObject_MakeTpCall +0x27c [0x557521593c5c]
E   #23 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x9245 [0x5575215ad375]
E   #24 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x27b2e7) [0x5575216672e7]
E   #25 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2cac98) [0x5575216b6c98]
E   #26 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x28a699) [0x557521676699]
E   #27 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x3df7 [0x5575215a7f27]
E   #28 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x27b2e7) [0x5575216672e7]
E   #29 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2cac98) [0x5575216b6c98]
E   #30 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyObject_MakeTpCall +0x27c [0x557521593c5c]
E   #31 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x9245 [0x5575215ad375]
E   #32 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x27b2e7) [0x5575216672e7]
E   #33 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2cac98) [0x5575216b6c98]
E   #34 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyObject_MakeTpCall +0x27c [0x557521593c5c]
E   #35 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x9245 [0x5575215ad375]
E   #36 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x27b2e7) [0x5575216672e7]
E   #37 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2cac98) [0x5575216b6c98]
E   #38 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyObject_MakeTpCall +0x27c [0x557521593c5c]
E   #39 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: _PyEval_EvalFrameDefault +0x9245 [0x5575215ad375]
E   #40 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: PyEval_EvalCode +0x9f [0x55752166903f]
E   #41 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2bc5a3) [0x5575216a85a3]
E   #42 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2b96ac) [0x5575216a56ac]
E   #43 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2b64b6) [0x5575216a24b6]
E   #44 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2b6173) [0x5575216a2173]
E   #45 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x2b5f2c) [0x5575216a1f2c]
E   #46 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: Py_RunMain +0x3b4 [0x5575216a08e4]
E   #47 in /raid/iroy/miniforge3/envs/py313/bin/python3.13: Py_BytesMain +0x37 [0x557521654947]
E   #48 in /lib/x86_64-linux-gnu/libc.so.6: __libc_start_main +0xf3 [0x7f31b6eb8083]
E   #49 in /raid/iroy/miniforge3/envs/py313/bin/python3.13(+0x267cdd) [0x557521653cdd]
cuopt/linear_programming/solver/solver_wrapper.pyx:519: RuntimeError
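
To make the failure mode in the description easier to discuss, here is a minimal sketch of the general pattern: one solver runs in the calling thread, a second runs in a worker thread, and an exception from the first is caught so the worker can still be stopped and joined. This is not cuOpt code. The names run_concurrent_sketch, failing_primary, and patient_secondary are hypothetical, and the sketch is in Python for brevity even though the real code path is C++. It does not address the root cause of the cudaErrorStreamCaptureInvalidated error, which remains open.

```python
import threading


def run_concurrent_sketch(primary, secondary):
    """Run `secondary` in a worker thread and `primary` in the caller.

    Hypothetical stand-in for a concurrent solve; not the cuOpt API.
    Returns (primary_value, secondary_value, primary_error).
    """
    stop = threading.Event()  # asks the worker to halt early
    worker_result = {}

    def worker():
        worker_result["value"] = secondary(stop)

    t = threading.Thread(target=worker)
    t.start()

    primary_value = None
    primary_error = None
    try:
        primary_value = primary()
    except RuntimeError as exc:
        # Keep the error instead of letting it unwind past the join.
        primary_error = exc
    finally:
        # Always signal and join the worker, even if the primary failed.
        stop.set()
        t.join()

    return primary_value, worker_result.get("value"), primary_error


def failing_primary():
    # Simulates the intermittent failure; the message is made up.
    raise RuntimeError("simulated capture failure")


def patient_secondary(stop):
    # Waits until asked to stop, with a timeout as a safety net.
    stop.wait(timeout=5.0)
    return "stopped" if stop.is_set() else "timed out"


value, other, error = run_concurrent_sketch(failing_primary, patient_secondary)
print(value, other, error)
```

The caller can then decide whether to use the worker's result or report the stored error, which is the kind of decision that catching the exception makes possible.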

Labels

  • awaiting response: This expects a response from maintainer or contributor depending on who requested in last comment.
  • bug: Something isn't working
