Add new query level trigger evaluation that doesn't depend on ScriptingService or _scripts API #2079
Add new query level trigger evaluation that doesn't depend on ScriptingService or _scripts API #2079eirsep wants to merge 8 commits intoopensearch-project:mainfrom
Conversation
45c682c to
c1faf0b
Compare
engechas
left a comment
There was a problem hiding this comment.
No blocking comments but it'd be great to have the test coverage include all existing cases for the non-remote monitor equivalents. See comment on parametrized test
| * These tests verify that trigger scripts are correctly evaluated remotely via filter aggregations | ||
| * on the customer's cluster instead of locally via ScriptService. | ||
| */ | ||
| class RemoteQueryLevelTriggerIT : AlertingRestTestCase() { |
There was a problem hiding this comment.
Is there a way we can run the full set of QueryLevelTriggerIT with and without the remote flag enabled? Something similar to Junit's @ParametrizedTest annotation. Since the input/output should be the same for both modes, that sort of test would give confidence that the correctness is fully intact in the new mode
There was a problem hiding this comment.
The OpenSearch test framework uses JUnit 4 (via RandomizedRunner), not JUnit 5, so @ParameterizedTest isn't available. I have added regression tests in this same class where I toggle the setting on and off and compare the results and validate that they are the same.
| monitorResult.inputResults.results.isNotEmpty() | ||
| ) { | ||
| val searchInput = monitor.inputs[0] as SearchInput | ||
| val queryLevelTriggers = monitor.triggers.filterIsInstance<QueryLevelTrigger>() |
There was a problem hiding this comment.
Is it possible for a trigger type to be non-QueryLevelTrigger if the monitor type is QueryLevel?
Wondering why the filterIsInstance is needed/if it excludes any valid triggers
There was a problem hiding this comment.
well ideologically it can't be anything but query level triggers but just for sanity leaving this here to guardrail against missing this translation fi we add any newer trigger types.
I added this because there is a possilibity we need to support a DataFusion engine instead of lucene engine and they have some limitations which might force us to introduce new triggers.
alerting/src/main/kotlin/org/opensearch/alerting/QueryLevelMonitorRunner.kt
Show resolved
Hide resolved
| * They are sent to the customer's cluster via a filter aggregation. | ||
| * The script context on the customer's cluster uses `params` instead of `ctx`, |
Register new boolean cluster setting to gate multi-tenant trigger evaluation code. Setting is dynamic, node-scoped, defaults to false. - Define MULTI_TENANT_TRIGGER_EVAL_ENABLED in AlertingSettings - Register in AlertingPlugin.getSettings() - Add multiTenantTriggerEvalEnabled field to MonitorRunnerExecutionContext - Wire initial read + dynamic update consumer in MonitorRunnerService - Add unit tests for default value and registration 🤖 Assisted by the code-assist SOP Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
…luator Add TriggerScriptRewriter that replaces ctx.results[0] with params.results_0 for remote Painless evaluation on customer's cluster. Add RemoteQueryLevelTriggerEvaluator that builds a filter-agg search request with one agg per trigger, sends it to the customer's cluster, and parses doc_count to determine trigger results. Multiple triggers are batched in a single request. Errors default to triggered (fail-open). 🤖 Assisted by the code-assist SOP Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
When multiTenantTriggerEvalEnabled is true and monitor type is QUERY_LEVEL_MONITOR, batch-evaluate all triggers via RemoteQueryLevelTriggerEvaluator before the trigger loop. Results are looked up per trigger from the pre-computed map. When flag is false, existing ScriptService-based evaluation path is completely unchanged. 🤖 Assisted by the code-assist SOP Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Add comprehensive integration tests for multi-tenant query-level trigger evaluation covering all trigger script patterns from the design doc: - Simple threshold (hits.total.value comparison) - Aggregation value (avg agg comparison) - Boolean logic (AND and OR conditions) - Loop over hits (for loop checking _source fields) - Multiple triggers (mixed fire/no-fire) - Script error (malformed Painless triggers on error) - Large response (50 docs, non-trivial response) - Dry run (verify no alerts persisted) Each test enables the multi_tenant_trigger_eval_enabled flag, creates an index with test data, builds a monitor with the appropriate trigger script, executes via the _execute API, and verifies trigger results. Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Add 3 integration tests to RemoteQueryLevelTriggerIT verifying that existing query-level alerting behavior is preserved when the multi_tenant_trigger_eval_enabled flag is disabled: - test query level trigger flag disabled: monitor executes correctly via ScriptService path when flag is off (default) - test query level trigger flag default is false: verify setting defaults to false via cluster settings REST API - test query level trigger toggle flag during execution: both ScriptService and remote eval paths produce correct results when flag is toggled between executions Also adds includeDefaults param to AlertingRestTestCase.getSettings() helper to support querying default cluster settings. 🤖 Assisted by the code-assist SOP Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
…igger_ Replace internal jargon in the filter aggregation key prefix used by RemoteQueryLevelTriggerEvaluator. The prefix _query_trigger_ better describes its scope (query-level monitor triggers). Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
On evaluation error or missing trigger results, default to triggered=false instead of triggered=true. Errors are still logged at WARN/ERROR level and surfaced in the trigger run result. This prevents false-positive alerts when the remote evaluation encounters issues. Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Use open source terminology consistently throughout the codebase. Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
f7b83de to
007a4a7
Compare
| val searchSource = buildEvalSearchSource(triggerData, searchResponse) | ||
|
|
||
| return try { | ||
| val evalRequest = SearchRequest(*indices.toTypedArray()).source(searchSource) |
There was a problem hiding this comment.
Not blocking:
Do we want to make use of the plugins.alerting.request_timeout or one of our other timeout-related cluster settings here?
| Monitor.MonitorType.valueOf(monitor.monitorType.uppercase(Locale.ROOT)) == Monitor.MonitorType.QUERY_LEVEL_MONITOR && | ||
| monitorResult.inputResults.results.isNotEmpty() | ||
| ) { | ||
| val searchInput = monitor.inputs[0] as SearchInput |
There was a problem hiding this comment.
nit: Query-level monitors should always have SearchInput, a ClassCastException here would be unhandled
Adds remote query-level trigger evaluation behind the plugins.alerting.multi_tenant_trigger_eval_enabled feature flag (default: false).
When enabled, query-level trigger scripts are evaluated remotely via filter aggregations on the data index instead of locally via ScriptService. This
rewrites Painless trigger conditions (e.g. ctx.results[0].hits.total.value > 5) into filter aggregations, batches all triggers into a single _search request
with size: 0, and determines trigger state from doc_count. On evaluation error, triggers default to triggered=true (fail-open) so users are notified.
When disabled (default), the existing ScriptService evaluation path is completely unchanged.
Changes:
AlertingSettings / AlertingPlugin / MonitorRunnerExecutionContext / MonitorRunnerService — register and wire the new boolean cluster setting
TriggerScriptRewriter — rewrites ctx.results[0] → params.results_0 in trigger scripts
RemoteQueryLevelTriggerEvaluator — builds filter-agg search request, parses response into QueryLevelTriggerRunResult
QueryLevelMonitorRunner — branches on feature flag: remote eval path vs existing ScriptService path
AlertingRestTestCase — adds includeDefaults param to getSettings() helper
10 integration tests covering: simple threshold, aggregation value, boolean logic (AND/OR), loop over hits, multiple triggers, script error (fail-open),
large response, dry run
3 regression tests verifying flag=false preserves existing behavior