Improve crossover dual simplex#948

Open
rg20 wants to merge 7 commits intoNVIDIA:mainfrom
rg20:improve_crossover_dual_simplex

Conversation

@rg20
Contributor

@rg20 rg20 commented Mar 10, 2026

Description

This PR includes two performance improvements:

  1. Take advantage of hyper-sparsity when computing reduced costs during dual pushes in the crossover code.
  2. Optimize the right-looking LU factorization by replacing the linear degree-bucket search with an O(1) removal.

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

rg20 added 2 commits March 10, 2026 11:42
…Replace linear degree-bucket search with O(1) swap-with-last removal using col_pos/row_pos position arrays, and eliminate O(row_degree) pre-traversal in schur_complement via a persistent last_in_row[] array
@rg20 rg20 requested a review from a team as a code owner March 10, 2026 18:47
@rg20 rg20 requested review from aliceb-nv and chris-maes March 10, 2026 18:47
@copy-pr-bot

copy-pr-bot bot commented Mar 10, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@rg20 rg20 added improvement Improves an existing functionality non-breaking Introduces a non-breaking change labels Mar 10, 2026
@rg20 rg20 added this to the 26.04 milestone Mar 10, 2026
@rg20
Contributor Author

rg20 commented Mar 10, 2026

/ok to test 5f377a2

@coderabbitai

coderabbitai bot commented Mar 10, 2026

📝 Walkthrough

Walkthrough

Introduced a sparse row representation (Arow) and sparse delta_y handling in the dual-simplex crossover; added per-bucket position-tracking state (col_pos, row_pos, last_in_row) and propagated signature changes through right-looking LU factorization routines.

Changes

Sparse Dual Computation
cpp/src/dual_simplex/crossover.cpp
Added Arow (CSR) parameter to dual_push and call sites; constructed Arow via lp.A.to_compressed_row(Arow) in crossover. Implemented sparse vs dense delta_y strategy, introduced delta_zN, delta_expanded, and delta_y_dense work buffers, sparse expansion using Arow, and updated y/delta_zN updates to apply only nonzero sparse contributions. Minor control-flow/formatting adjustments around the new paths.

LU Factorization Position Tracking
cpp/src/dual_simplex/right_looking_lu.cpp
Added initialize_bucket_positions and per-bucket position arrays col_pos, row_pos; extended load_elements to accept last_in_row. Threaded col_pos, row_pos, and last_in_row through update_Cdegree_and_col_count, update_Rdegree_and_row_count, schur_complement, remove_pivot_col, and top-level LU routines (right_looking_lu, right_looking_lu_row_permutation_only) with signature updates and state maintenance for O(1) bucket removals.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Docstring Coverage — ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Title check — ✅ Passed: The title "Improve crossover dual simplex" directly relates to the main changes: performance improvements in crossover/dual-simplex code including sparsity optimization and LU factorization enhancement.
Description check — ✅ Passed: The description clearly outlines the two performance improvements implemented (hyper-sparsity optimization in dual pushes and O(1) degree-bucket search in right-looking LU), matching the file changes provided.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (2)
cpp/src/dual_simplex/crossover.cpp (1)

418-436: Consider hoisting delta_expanded allocation outside the while loop.

Currently delta_expanded is allocated as a new std::vector<f_t>(n, 0.) on every iteration of the while (superbasic_list.size() > 0) loop (starting at line 391). For large problems with many superbasic variables, this results in repeated O(n) allocations and deallocations.

Moving the allocation before the loop and using std::fill to reset would avoid this overhead.

♻️ Suggested refactor
+  std::vector<f_t> delta_expanded(n);
   while (superbasic_list.size() > 0) {
     // ... existing code ...
     
     // delta_zN = -N^T delta_y
     std::vector<f_t> delta_zN(n - m);
-    std::vector<f_t> delta_expanded(n, 0.);
+    std::fill(delta_expanded.begin(), delta_expanded.end(), 0.);
     
     // Iterate directly over sparse delta_y instead of checking zeros
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/crossover.cpp` around lines 418 - 436, Hoist the
allocation of delta_expanded (and optionally delta_zN) out of the while
(superbasic_list.size() > 0) loop: create std::vector<f_t> delta_expanded(n)
once before the loop and inside each iteration reset its contents with
std::fill(delta_expanded.begin(), delta_expanded.end(), 0.0) (and similarly
reset delta_zN or ensure it is resized once and overwritten). Update the loop
body that accumulates into delta_expanded and the subsequent assignment to
delta_zN[k] = -delta_expanded[nonbasic_list[k]] to rely on the reused buffers;
if n can change, ensure you call delta_expanded.resize(n) before the fill.
cpp/src/dual_simplex/right_looking_lu.cpp (1)

649-661: Extract the bucket-position bootstrap into one helper.

The col_pos / row_pos initialization is now duplicated in both factorization entry points, with only the bucket bounds differing. This is subtle invariant code; a small helper next to initialize_degree_data() would reduce drift the next time the bucket bookkeeping changes.

Also applies to: 1049-1062


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 6a904205-beca-40f2-bce0-a3dcdb15be1e

📥 Commits

Reviewing files that changed from the base of the PR and between 2b21118 and 5f377a2.

📒 Files selected for processing (2)
  • cpp/src/dual_simplex/crossover.cpp
  • cpp/src/dual_simplex/right_looking_lu.cpp

Comment on lines 405 to 409
// Initialize row_last_workspace from last_in_row (O(1) per row, no full row traversal)
for (i_t p1 = first_in_col[pivot_j]; p1 != kNone; p1 = elements[p1].next_in_column) {
element_t<i_t, f_t>* e = &elements[p1];
const i_t i = e->i;
i_t row_last = kNone;
for (i_t p3 = first_in_row[i]; p3 != kNone; p3 = elements[p3].next_in_row) {
row_last = p3;
}
work_estimate += 2 * Rdegree[i];
row_last_workspace[i] = row_last;
row_last_workspace[elements[p1].i] = last_in_row[elements[p1].i];
}
work_estimate += 4 * Cdegree[pivot_j];

⚠️ Potential issue | 🟡 Minor

Fix work_estimate accounting after pivot elimination.

Line 409, Line 499, and Line 509 still read Cdegree[pivot_j] / Rdegree[pivot_i], but those were already set to -1 by update_Cdegree_and_col_count() and update_Rdegree_and_row_count() before this helper is called. That makes work_estimate decrease on every pivot instead of approximating the actual loop cost.

Also applies to: 499-509

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/right_looking_lu.cpp` around lines 405 - 409,
work_estimate is using Cdegree[pivot_j] and Rdegree[pivot_i] after those entries
have been set to -1 by update_Cdegree_and_col_count() /
update_Rdegree_and_row_count(), causing work_estimate to decrease incorrectly;
fix by using the actual degree before it is clobbered — either capture a local
variable like int deg_j = Cdegree[pivot_j] and/or deg_i = Rdegree[pivot_i]
before calling the update functions, or compute the degree from the
corresponding count arrays (e.g. col_count[pivot_j] / row_count[pivot_i]) or by
traversing first_in_col/first_in_row if counts are not available — then use
deg_j/deg_i when updating work_estimate instead of reading Cdegree/Rdegree after
they were set to -1.

elements[nz].x = A.x[p];
elements[nz].next_in_column = kNone;
if (p > col_start) { elements[nz - 1].next_in_column = nz; }
elements[nz].next_in_row = kNone; // set the current next in row to None (since we don't know
Contributor

Would you mind restoring the comments here? They are helpful.

}
work_estimate += 2 * Rdegree[i];
row_last_workspace[i] = row_last;
row_last_workspace[elements[p1].i] = last_in_row[elements[p1].i];
Contributor

Nit: I think this would read more clearly if we did

const i_t i = elements[p1].i;
row_last_workspace[i] = last_in_row[i]; 

initialize_degree_data(A, column_list, Cdegree, Rdegree, col_count, row_count, work_estimate);

// Position arrays for O(1) degree-bucket removal
std::vector<i_t> col_pos(n);
Contributor

We should document what col_pos stores, as we do with other variables. I think it is:

std::vector<i_t> col_pos(n);  // if Cdegree[j] = nz, then j is in col_count[nz][col_pos[j]]

col_pos[col_count[d][pos]] = pos;
}
}
std::vector<i_t> row_pos(n);
Contributor

Could we add the comment:

std::vector<i_t> row_pos(n); // If Rdegree[i] = nz, then i is in row_count[nz][row_pos[i]]

std::vector<i_t>& Cdegree,
std::vector<std::vector<i_t>>& row_count,
std::vector<std::vector<i_t>>& col_count,
std::vector<i_t>& last_in_row,
Contributor

What's the difference between row_last_workspace and last_in_row?


// Position arrays for O(1) degree-bucket removal
// col_count has m+1 buckets, row_count has n+1 buckets
std::vector<i_t> col_pos(n);
Contributor

Same here: can we add a comment explaining col_pos?

col_pos[col_count[d][pos]] = pos;
}
}
std::vector<i_t> row_pos(m);
Contributor

Same here: can we add a comment explaining row_pos?

element_t<i_t, f_t>* e = &elements[p1];
const i_t i = e->i;
i_t last = kNone;
i_t last_surviving = kNone;
Contributor

What's the difference between last and last_surviving? I'm not sure last is used. Maybe we can use last for last_surviving?

f_t dot = 0.0;
for (i_t p = col_start; p < col_end; ++p) {
dot += lp.A.x[p] * delta_y[lp.A.i[p]];
std::vector<f_t> delta_expanded(n, 0.);
Contributor

This always computes delta_z using delta_z = -sum_{i : delta_y_i != 0} A(i, non_basic).

However, when delta_y is dense it can be better to compute it using the previous method.

I dynamically decide which method to use in dual simplex by looking at the sparsity of delta_y: if delta_y has fewer than 30% nonzeros I use this approach, otherwise I use the previous method.

Also, we should probably pull the code for computing this out into a function that can be used in both dual simplex and crossover.

But if you want you can merge this as is, and I can make those changes.

rg20 added 4 commits March 16, 2026 07:40
Allocate buffers once before the superbasic loop and reset with std::fill
each iteration to avoid repeated O(n) allocations (PR NVIDIA#948 review).

Made-with: Cursor
@rg20
Contributor Author

rg20 commented Mar 16, 2026

/ok to test cc54886

@coderabbitai coderabbitai bot left a comment

♻️ Duplicate comments (1)
cpp/src/dual_simplex/right_looking_lu.cpp (1)

371-372: ⚠️ Potential issue | 🟡 Minor

Preserve pivot degrees until Schur work accounting is finished.

Line 371 and Line 408 set pivot degrees to -1, but schur_complement() still uses those pivot degrees for work_estimate terms, which can push the estimate downward incorrectly. Please cache pivot degrees (or defer sentinel assignment) before those bookkeeping terms are computed.

Also applies to: 408-409

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/right_looking_lu.cpp` around lines 371 - 372, The code
currently sets Cdegree[pivot_j] = -1 (in the block around pivot handling) before
schur_complement() computes work_estimate terms, which causes those terms to use
the sentinel -1 and undercount work; preserve the original pivot degree value
until after all Schur work accounting is finished by caching the degree (e.g.,
save old_degree = Cdegree[pivot_j]) or by deferring the sentinel assignment
until after schur_complement() and any work_estimate calculations complete;
update both spots where Cdegree[pivot_j] is set to -1 (the one near pivot_j and
the one around the other referenced location) to use the cached value approach
and only assign -1 after work_estimate is computed.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: eb075e7a-fa49-46ae-b96a-caf985a7b449

📥 Commits

Reviewing files that changed from the base of the PR and between 5f377a2 and cc54886.

📒 Files selected for processing (2)
  • cpp/src/dual_simplex/crossover.cpp
  • cpp/src/dual_simplex/right_looking_lu.cpp

@rg20
Contributor Author

rg20 commented Mar 16, 2026

/ok to test 3e2352b

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/src/dual_simplex/crossover.cpp (1)

421-453: Consider extracting sparse/dense delta_z computation into a shared utility.

As noted in prior review discussion, the sparse vs. dense computation of delta_z = -A^T * delta_y could be factored into a shared function for use in both dual simplex and crossover. This would reduce code duplication and ensure consistent threshold tuning.

This can be deferred to a follow-up PR as suggested by the reviewer.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/crossover.cpp` around lines 421 - 453, Extract the
sparse vs dense computation of delta_z (currently computing delta_zN using
delta_y_sparse / delta_y_dense, delta_expanded, nonbasic_list, lp.A and Arow)
into a shared utility function (e.g., compute_delta_z_from_delta_y) that accepts
delta_y in sparse or dense form, A (or Arow), nonbasic_list and outputs
delta_zN; move the sparse path (loop over delta_y_sparse.i/x and Arow to build
delta_expanded) and the dense path (delta_y_sparse.to_dense → dot product over
lp.A columns) into the single function, preserve the existing 30% sparsity
threshold outside the helper, and update callers in dual simplex and crossover
to call the new utility to avoid duplication and centralize threshold/tuning.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cecd1046-6c96-4a21-a4a0-df1a91062ace

📥 Commits

Reviewing files that changed from the base of the PR and between cc54886 and 3e2352b.

📒 Files selected for processing (1)
  • cpp/src/dual_simplex/crossover.cpp

Comment on lines +462 to 466
// Optimized: Only update non-zero elements from sparse representation
for (i_t nnz_idx = 0; nnz_idx < delta_y_sparse.i.size(); ++nnz_idx) {
const i_t i = delta_y_sparse.i[nnz_idx];
y[i] += step_length * delta_y_sparse.x[nnz_idx];
}

⚠️ Potential issue | 🟡 Minor

Add explicit cast for consistency and to avoid signed/unsigned comparison warning.

Line 428 uses static_cast<i_t>(delta_y_sparse.i.size()) for the same pattern, but line 463 omits it. This inconsistency may trigger compiler warnings under strict settings.

Proposed fix
-    for (i_t nnz_idx = 0; nnz_idx < delta_y_sparse.i.size(); ++nnz_idx) {
+    for (i_t nnz_idx = 0; nnz_idx < static_cast<i_t>(delta_y_sparse.i.size()); ++nnz_idx) {
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Optimized: Only update non-zero elements from sparse representation
for (i_t nnz_idx = 0; nnz_idx < delta_y_sparse.i.size(); ++nnz_idx) {
const i_t i = delta_y_sparse.i[nnz_idx];
y[i] += step_length * delta_y_sparse.x[nnz_idx];
}
// Optimized: Only update non-zero elements from sparse representation
for (i_t nnz_idx = 0; nnz_idx < static_cast<i_t>(delta_y_sparse.i.size()); ++nnz_idx) {
const i_t i = delta_y_sparse.i[nnz_idx];
y[i] += step_length * delta_y_sparse.x[nnz_idx];
}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/dual_simplex/crossover.cpp` around lines 462 - 466, The loop over
sparse entries in crossover.cpp uses delta_y_sparse.i.size() as the upper bound
without casting, causing a signed/unsigned comparison warning and inconsistency
with the earlier loop; change the loop condition to use
static_cast<i_t>(delta_y_sparse.i.size()) so the type of nnz_idx (i_t) matches
the cast size, keeping the body unchanged (variables: delta_y_sparse, nnz_idx,
i_t, y, step_length).
