C++: Divide number of bounds between branches for phi nodes#21329
C++: Divide number of bounds between branches for phi nodes#21329jketema merged 6 commits intogithub:mainfrom
Conversation
f041b27 to
e5c8f38
Compare
e5c8f38 to
d0681c6
Compare
There was a problem hiding this comment.
Pull request overview
This PR improves the heuristic for calculating the number of bounds for guard phi nodes in C++ range analysis. The change addresses an exponential growth issue that occurred when analyzing series of if-else statements with guards on the same variable.
Changes:
- Modified
nrOfBoundsPhiGuardfunction to use a new heuristic:(varBounds + 1) / 2instead of special-casing certain phi nodes - Added test case
repeated_if_else_statementsdemonstrating the fixed scenario with 11 consecutive if-else statements - Added helper predicates
countNrOfLowerBoundsandcountNrOfUpperBoundsinSimpleRangeAnalysisInternalmodule - Updated test query
nrOfBounds.qlto output actual bounds counts alongside estimates
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
cpp/ql/lib/semmle/code/cpp/rangeanalysis/SimpleRangeAnalysis.qll |
Changed bounds estimation heuristic in nrOfBoundsPhiGuard from special-casing to dividing by 2; added helper predicates for counting actual bounds |
cpp/ql/test/library-tests/rangeanalysis/SimpleRangeAnalysis/test.c |
Added repeated_if_else_statements function demonstrating the exponential growth scenario that is now fixed |
cpp/ql/test/library-tests/rangeanalysis/SimpleRangeAnalysis/nrOfBounds.ql |
Enhanced query to also output actual lower and upper bounds counts for validation |
cpp/ql/test/library-tests/rangeanalysis/SimpleRangeAnalysis/*.expected |
Updated expected test outputs to reflect the improved bounds estimates (line numbers shifted due to new test case) |
MathiasVP
left a comment
There was a problem hiding this comment.
The changes looks good to me! I think it would be a good idea to add a consistency check to check whether the estimated number of bounds is an upper bound on the actual number of bounds computed in the recursion. I realize this measure isn't genuinely an upper bound in all cases, but I still think it would be a good check to have. In nothing else, it makes it easy to validate changes to the measurements by running that consistency check on e.g., MRVA
| if (rhs < 19) { rhs << 1; } else { rhs << 2; } | ||
| if (rhs < 20) { rhs << 1; } else { rhs << 2; } | ||
| return rhs; // rhs has 12 bounds | ||
| } |
There was a problem hiding this comment.
FWIW, I would have appreciated putting this function in the bottom of the file to avoid conflating the diff in the .expected file (since these tests still aren't using inline expectations). No biggie! Just thought I would mention it
There was a problem hiding this comment.
I think it fits here as it's related to the repeated_if_statements function just above. You right that the diff gets bigger, but at least the QL changes are in a separate commit with a clean .expected diff :)
cpp/ql/lib/semmle/code/cpp/rangeanalysis/SimpleRangeAnalysis.qll
Outdated
Show resolved
Hide resolved
Co-authored-by: Mathias Vorreiter Pedersen <mathiasvp@github.com>
jketema
left a comment
There was a problem hiding this comment.
One question and one typo.
cpp/ql/lib/semmle/code/cpp/rangeanalysis/SimpleRangeAnalysis.qll
Outdated
Show resolved
Hide resolved
| // by the condition. In this case all lower bounds flow to `{ e1 }` and only | ||
| // lower bounds that are smaller than `c` flow to `{ e2 }`. |
There was a problem hiding this comment.
My knowledge of the simple range analysis library, so excuse my ignorance.
In this case all lower bounds flow to
{ e1 }and only
lower bounds that are smaller thancflow to{ e2 }.
This I don't understand, and it's not clear how relate this with the division by 2 you do below. Why do all lower bounds flow to e1 and why is the "smaller than" condition there in the case of e2?
There was a problem hiding this comment.
Good question. Maybe it's easiest to explain by making things a bit more concrete.
Suppose the lower bounds for x are {2, 11, 22} and c is the constant 5.
In the true branch we know x < 5. This is an upper bound and thus doesn't affect the lower bounds, hence the "all lower bounds flow to { e1 }".
In the false branch we know x >= 5 which allow us to prevent the lower bounds 11 and 22 from flowing to x (done here). So x's bounds will be {2, 5} hence the "only lower bounds that are smaller than c flow to { e2 }"
At the phi node after the if statement the bounds from both branches are joined and we end up with 5 lower bounds: {2, 5, 11, 22}.
It's important to get that final estimate correct, as inaccuracies there can compound (as in the coding standards test with a gazillion if statements after each other) and dividing by 2 and adding a half gets us there.
In general we can't know how many bounds will exist inside each branch. But the number of bounds will be at most the number of bounds on x and at least 1, so guessing right in the middle reduces how wrong we can be in the worst case.
Does that make sense and should I try and expand on the comment?
There was a problem hiding this comment.
Thanks for the explanation. I think it's fine to leave it as is. I clearly just don't have enough background on how the lower and upper bounds interact.
I'm happy to approve once you've fixed the typo I spotted.
Co-authored-by: Jeroen Ketema <93738568+jketema@users.noreply.github.com>
This PR changes the heuristic for how the number of bounds is calculated for a guard phi node.
This is an improvement in a few ways:
It avoids a special case where nodes that are both normal phi nodes and guard phi nodes needed special treatment.
I think this heuristic makes sense intuitively.
It fixes a problem where a series of if-else statements with guards on the same variable caused the number of bounds to exponentially increase while the actual range analysis did in fact not introduce any bounds.
The first commit adds an example of this. In the last commit the change to the expected file around
test.c:453totest.c:463show how we now handle this case better.This fixes a recent problem observed over in coding standards. See: Revert "C++: Accept test changes after github/codeql#21313." codeql-coding-standards#1041.
For reviewing the comment inside
nrOfBoundsPhiGuardshould help explain what is going on.