-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Shared: Generate more value-preserving flow summaries #19443
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Shared: Generate more value-preserving flow summaries #19443
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a new value-preserving summary fallback between content-sensitive and taint-based flows. It updates test fixtures and QL modules to:
- Change
heuristic-summary
annotations fromtaint
tovalue
in Java, C#, and C++ test data. - Update
captureFlow
calls to the newcaptureHeuristicFlow
and expandedcaptureFlow(c,_,_)
signatures in inline test configurations. - Switch debug imports and modules from
PropagateFlow
toPropagateTaintFlow
, and removeexists()
guards in path queries.
Reviewed Changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated no comments.
Show a summary per file
File | Description |
---|---|
java/ql/test/utils/modelgenerator/dataflow/p/MultiPaths.java | Switched heuristic-summary tag from taint to value . |
java/ql/test/utils/modelgenerator/dataflow/p/InnerClasses.java | Updated nested-class heuristic-summary tags to value . |
java/ql/test/utils/modelgenerator/dataflow/p/Inheritance.java | Updated inheritance-test heuristic-summary tags to value . |
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java | Updated POJO heuristic-summary tag to value . |
java/ql/test/utils/modelgenerator/dataflow/p/FinalClass.java | Updated final-class heuristic-summary tag to value . |
java/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call. |
java/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload with two parameters. |
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql | Switched to PropagateTaintFlow::PathGraph import. |
java/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql | Updated FlowExplorationFwd to use PropagateTaintFlow . |
java/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in core module. |
csharp/ql/test/utils/modelgenerator/dataflow/Summaries.cs | Switched C# fixtures’ heuristic-summary tags to value . |
csharp/ql/test/utils/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call in C#. |
csharp/ql/test/utils/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in C#. |
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPath.ql | Switched to PropagateTaintFlow::PathGraph import in C#. |
csharp/ql/src/utils/modelgenerator/debug/CaptureSummaryModelsPartialPath.ql | Updated FlowExplorationFwd to use PropagateTaintFlow in C#. |
csharp/ql/src/utils/modelgenerator/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in core C# module. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp | Switched C++ fixtures’ heuristic-summary tags to value . |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureHeuristicSummaryModels.ql | Swapped captureFlow to captureHeuristicFlow call in C++. |
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/CaptureContentSummaryModels.ql | Expanded captureFlow to new overload in C++. |
Comments suppressed due to low confidence (3)
java/ql/test/utils/modelgenerator/dataflow/p/ImmutablePojo.java:28
- The heuristic-summary entry for
Argument[this]
still usestaint
; it should be updated tovalue
to match the new value-preserving fallback.
// heuristic-summary=p;ImmutablePojo;false;or;(String);;Argument[this];ReturnValue;taint;df-generated
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:212
- This
heuristic-summary
annotation still usestaint
; it should usevalue
to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*0];taint;df-generated
cpp/ql/test/library-tests/dataflow/modelgenerator/dataflow/summaries.cpp:213
- This
heuristic-summary
annotation still usestaint
; it should usevalue
to align with the updated fallback behavior.
// heuristic-summary=;;true;copy_struct;(HasInt *,const HasInt *);;Argument[1];Argument[*1];taint;df-generated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a really good idea @MathiasVP! 😄
I quickly checked locally, that this doesn't affect idempotency (tried on .NET Runtime) 🎉 .
Besides the inline questions/comments: Before the PR is merged, I would also like to at least re-generate the models for .NET Runtime and try and run our queries with the new models included.
string asLiftedTaintModel(Printing::SummaryApi api, string input, string output) { | ||
result = asModel(api, input, output, false, true) | ||
bindingset[input, output, preservesValue] | ||
string asLiftedTaintModel( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to asLiftedModel
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 54f0eed
string captureFlow(DataFlowSummaryTargetApi api) { | ||
result = captureQualifierFlow(api) or | ||
result = captureThroughFlow(api) | ||
string captureHeuristicFlow(DataFlowSummaryTargetApi api, boolean preservesValue) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe keep the name captureFlow
(the predicate is in the Heuristic
module - rename was probably done before the re-factor you just rebased on top of :-) )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed in 4d2f2b8. I had to change a few tests to work around some ambiguities, but hopefully that's okay 🤞
result = Heuristic::captureFlow(api) and | ||
lift = true | ||
not exists(DataFlowSummaryTargetApi api0 | | ||
// If the heuristic summary is value-preserving then we keep both |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any cases where it is relevant to keep both?
If we go with this approach, we need to eliminate duplicates - as it is now the heuristic flow and content based flow will generate some identical summaries (from input to output) where the only difference is provenance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did you find any cases where it is relevant to keep both?
I didn't, no. It's certainly not required for what I need it for (in OpenSSL). I'm happy to make this more restrictive 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds fine to me 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If it is ok with you - then maybe we should only use the heuristic models in case there is no content based. If we find a use-case where both is required, then we can consider changing it again.
That sounds fine to me 👍
Hm, wait a second. I'm looking at these three models we generate for the EVP_CIPHER_CTX_copy
function in OpenSSL:
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[*1]", "Argument[*0]", "value", "df-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*0]", "taint", "dfc-generated"]
- ["", "", True, "EVP_CIPHER_CTX_copy", "(EVP_CIPHER_CTX *,const EVP_CIPHER_CTX *)", "", "Argument[1]", "Argument[*1]", "taint", "dfc-generated"]
(the value-preserving model here is one of the models that motivated this work.)
As you can see, we get two taint models using the content-based approach and one value-preserving model using the heuristic-based approach. The heuristic one is the one I really care about. However, if we only keep the heuristic model if there is no content-based then we end up throwing the value-based one away 😭
So I think what I would prefer is:
- Taint-based heuristic-model and a taint-based content-model: Only use the content-based model
- Taint-based heuristic-model and a value-preserving content-model: Only use the content-based model
- Value-preserving heuristic-model and a taint-based content-model: Use both (we will need to deduplicate this)
- Value-preserving heuristic-model and a value-preserving content-model: Only use the content-based model
How does this sound, @michaelnebel ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, let's give that a try 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've done the deduplication in 37bc2bf now. Let me know what you think of it! I'll rerun the model generation experiments 🤞
Thanks! Do you want me to re-generate the models for .NET runtime as part of this PR? Or open another PR on top of this one that re-generates the models? I'm happy to follow whatever PR structure you think is best. ... I'm also happy to leave the .NET model generation to you, of course 😄 |
Usually I just make a separate PR with the update to the .NET Runtime models - as we also need to supply a change note. |
It would be fantastic if you could do it 👍 Thinking about it I'm not 100% sure which parameters you pass to |
result = ContentSensitive::captureFlow(api, _, _, lift, _) | ||
or | ||
exists(boolean preservesValue, string input, string output | | ||
not exists( | ||
DataFlowSummaryTargetApi api0, string input0, string output0, boolean preservesValue0 | ||
| | ||
// If the heuristic summary is taint-based, and we can generate a content-sensitive | ||
// summary that is value-preserving then we omit generating any heuristic summary. | ||
preservesValue = false and | ||
preservesValue0 = true | ||
or | ||
// However, if they're both value-preserving (or both taint-based) then we only | ||
// generate a heuristic summary if we didn't generate a content-sensitive summary. | ||
preservesValue = preservesValue0 and | ||
input0 = input and | ||
output0 = output | ||
| | ||
(api0 = api or api.lift() = api0) and | ||
exists(ContentSensitive::captureFlow(api0, false, _)) | ||
exists(ContentSensitive::captureFlow(api0, input0, output0, false, preservesValue0)) | ||
or | ||
api0.lift() = api.lift() and | ||
exists(ContentSensitive::captureFlow(api0, true, _)) | ||
exists(ContentSensitive::captureFlow(api0, input0, output0, true, preservesValue0)) | ||
) and | ||
result = Heuristic::captureFlow(api, preservesValue) and | ||
result = Heuristic::captureFlow(api, input, output, preservesValue) and |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change does a bit more than de-duplicating the generated summaries.
Prior to the change, a heuristic summary (value or taint) wouldn't be generated (at all) in a case there existed any content sensitive summary for the API. With this change a heuristic summary is generated on an input/output pair in case there doesn't exist a content sensitive summary for the same input/output pair for the given API.
My original comment merely stated that we should try and remove duplicate entries (where the only difference was the provenance df-generated
/dfc-generated
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, that's a good point. I think the behavior you want is what we get if we get rid of this disjunct:
// If the heuristic summary is taint-based, and we can generate a content-sensitive
// summary that is value-preserving then we omit generating any heuristic summary.
preservesValue = false and
preservesValue0 = true
right? Then we'd only generate a heuristic summary if we fail to generate a content-sensitive with exactly the same (input, output, kind)
tuple.
I'm happy to do that if you'd rather want this behavior?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I realised that my previous comment is not entirely correct.
The issue only pertains to taint
like summaries and the cause is the other disjunkt:
preservesValue = preservesValue0 and
input0 = input and
output0 = output
That is, we only exclude a summary df-generated
taint summary in case there doesn't exist a dfc-generated
generated taint summary on the same input/output (string) pair. However, this causes a problem as the df-generated
taint summaries truncates field/property accesses in the input/output strings (but the dataflow configuration allows a "limited" set of field/property reads/stores - which means we start producing field conflation issues with taint
summaries).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh! Hm, yeah I can see how that's a problem. Should we delay deduplication until we see whether it's actually a problem in practice? I'd really like to somehow get this working as the taint summaries we get for OpenSSL isn't sufficient for what we need them for 😭
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried re-generating the .NET runtime models - and we do get some models that we don't want.
Also, I noticed that the printing of heuristic models might need to change for the exact value preserving models.
Is it ok if I experiment a bit and make some changes to the PR?
Considering to make the filtering like this:
// If the heuristic summary is taint-based, and we can generate a content-sensitive
// summary then we omit generating the heuristic summary.
preservesValue = false
or
// If they're both value-preserving then we only generate a heuristic summary if
// we didn't generate a content-sensitive summary on the same input/output pair.
preservesValue = true and
preservesValue0 = true and
input0 = input and
output0 = output
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks a lot for regenerating the .NET runtime models 🙇
Yeah, feel free to experiment on the PR all you want. Let me know if I can do something to help out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I should also add: I think the filtering you propose there makes a lot of sense, and I also just checked that it still gives us the models we care about 🥳
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! 😄
I couldn't push the changes to this branch - could you cherry pick the last six commits from (when CI is green): #19481
The commits does the following:
- Filter summaries based on the idea proposed above, which means that
- For
value
based summaries we keep both heuristic and content-based (but de-duplicating based on provenance and keep thedfc-generated
). - For
taint
based heuristic summaries we only keep these in case there doesn't exist any (value
ortaint
) content-based summary for the same api.
- For
- Adjust the printing of
value
heuristic models. Taint steps allow (at least for C#) reading from a collection (as a taint step), which means that we used the type of the input to "guess" whether it was likely that we just read an element of a collection. However, this doesn't apply forvalue
models. - Fix an unrelated issue of the printing of source models (as these don't support
.Element
etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fantastic, Michael! I've pushed your commits now 🙇
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Excellent - and thank you very much for contributing to the model generator @MathiasVP - I really appreciate it!
I will start some DCA runs to check whether this impacted performance.
…nt sensitive summary can be generated.
…inor issue with output printing in captureSink).
Ran DCA on .NET runtime for C# and Open JDK 17 and commons-io for Java. Performance doesn't appear to to be impacted. There are some changes in the produced tuples (but this is expected). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the this great contribution @MathiasVP !!
The failing integration tests are unrelated to the changes in this PR. |
Ideally, we want generated flow summaries to be content-sensitive. That is, a summary for a function such as:
should specify that we read the content
f
fromArgument[*0]
.However, if a function is super complex it may read/write many access paths, and this could cause an explosion in the number of summaries we generate.
To mitigate this, the flow summary generation library puts various restrictions on which callables receive content-sensitive summaries.
When a content-sensitive summary isn't generated, we currently fall back to a taint-configuration-based summary which means we only generate a taint summary and not a value-preserving summary.
This PR adds a "midpoint" in between the content-sensitive value-preserving summary and the taint-based summary so that we now:
This seems to generate much better models on OpenSSL in particular.