Skip to content

Add new query level trigger evaluation that doesn't depend on ScriptingService or _scripts API #2079

Open
eirsep wants to merge 8 commits intoopensearch-project:mainfrom
eirsep:feat/multi-tenant-trigger-eval
Open

Add new query level trigger evaluation that doesn't depend on ScriptingService or _scripts API #2079
eirsep wants to merge 8 commits intoopensearch-project:mainfrom
eirsep:feat/multi-tenant-trigger-eval

Conversation

@eirsep
Copy link
Copy Markdown
Member

@eirsep eirsep commented Apr 1, 2026

Adds remote query-level trigger evaluation behind the plugins.alerting.multi_tenant_trigger_eval_enabled feature flag (default: false).

When enabled, query-level trigger scripts are evaluated remotely via filter aggregations on the data index instead of locally via ScriptService. This
rewrites Painless trigger conditions (e.g. ctx.results[0].hits.total.value > 5) into filter aggregations, batches all triggers into a single _search request
with size: 0, and determines trigger state from doc_count. On evaluation error, triggers default to triggered=true (fail-open) so users are notified.

When disabled (default), the existing ScriptService evaluation path is completely unchanged.

Changes:

  • AlertingSettings / AlertingPlugin / MonitorRunnerExecutionContext / MonitorRunnerService — register and wire the new boolean cluster setting

  • TriggerScriptRewriter — rewrites ctx.results[0] → params.results_0 in trigger scripts

  • RemoteQueryLevelTriggerEvaluator — builds filter-agg search request, parses response into QueryLevelTriggerRunResult

  • QueryLevelMonitorRunner — branches on feature flag: remote eval path vs existing ScriptService path

  • AlertingRestTestCase — adds includeDefaults param to getSettings() helper

  • 10 integration tests covering: simple threshold, aggregation value, boolean logic (AND/OR), loop over hits, multiple triggers, script error (fail-open),
    large response, dry run

  • 3 regression tests verifying flag=false preserves existing behavior

Copy link
Copy Markdown
Collaborator

@engechas engechas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No blocking comments but it'd be great to have the test coverage include all existing cases for the non-remote monitor equivalents. See comment on parametrized test

* These tests verify that trigger scripts are correctly evaluated remotely via filter aggregations
* on the customer's cluster instead of locally via ScriptService.
*/
class RemoteQueryLevelTriggerIT : AlertingRestTestCase() {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way we can run the full set of QueryLevelTriggerIT with and without the remote flag enabled? Something similar to Junit's @ParametrizedTest annotation. Since the input/output should be the same for both modes, that sort of test would give confidence that the correctness is fully intact in the new mode

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The OpenSearch test framework uses JUnit 4 (via RandomizedRunner), not JUnit 5, so @ParameterizedTest isn't available. I have added regression tests in this same class where I toggle the setting on and off and compare the results and validate that they are the same.

monitorResult.inputResults.results.isNotEmpty()
) {
val searchInput = monitor.inputs[0] as SearchInput
val queryLevelTriggers = monitor.triggers.filterIsInstance<QueryLevelTrigger>()
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible for a trigger type to be non-QueryLevelTrigger if the monitor type is QueryLevel?

Wondering why the filterIsInstance is needed/if it excludes any valid triggers

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

well ideologically it can't be anything but query level triggers but just for sanity leaving this here to guardrail against missing this translation fi we add any newer trigger types.

I added this because there is a possilibity we need to support a DataFusion engine instead of lucene engine and they have some limitations which might force us to introduce new triggers.

Comment on lines +10 to +11
* They are sent to the customer's cluster via a filter aggregation.
* The script context on the customer's cluster uses `params` instead of `ctx`,
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: customer -> user

eirsep added 8 commits April 2, 2026 00:55
Register new boolean cluster setting to gate multi-tenant trigger
evaluation code. Setting is dynamic, node-scoped, defaults to false.

- Define MULTI_TENANT_TRIGGER_EVAL_ENABLED in AlertingSettings
- Register in AlertingPlugin.getSettings()
- Add multiTenantTriggerEvalEnabled field to MonitorRunnerExecutionContext
- Wire initial read + dynamic update consumer in MonitorRunnerService
- Add unit tests for default value and registration

🤖 Assisted by the code-assist SOP

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
…luator

Add TriggerScriptRewriter that replaces ctx.results[0] with
params.results_0 for remote Painless evaluation on customer's cluster.

Add RemoteQueryLevelTriggerEvaluator that builds a filter-agg search
request with one agg per trigger, sends it to the customer's cluster,
and parses doc_count to determine trigger results. Multiple triggers
are batched in a single request. Errors default to triggered (fail-open).

🤖 Assisted by the code-assist SOP

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
When multiTenantTriggerEvalEnabled is true and monitor type is
QUERY_LEVEL_MONITOR, batch-evaluate all triggers via
RemoteQueryLevelTriggerEvaluator before the trigger loop. Results
are looked up per trigger from the pre-computed map.

When flag is false, existing ScriptService-based evaluation path
is completely unchanged.

🤖 Assisted by the code-assist SOP

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Add comprehensive integration tests for multi-tenant query-level trigger
evaluation covering all trigger script patterns from the design doc:

- Simple threshold (hits.total.value comparison)
- Aggregation value (avg agg comparison)
- Boolean logic (AND and OR conditions)
- Loop over hits (for loop checking _source fields)
- Multiple triggers (mixed fire/no-fire)
- Script error (malformed Painless triggers on error)
- Large response (50 docs, non-trivial response)
- Dry run (verify no alerts persisted)

Each test enables the multi_tenant_trigger_eval_enabled flag, creates
an index with test data, builds a monitor with the appropriate trigger
script, executes via the _execute API, and verifies trigger results.

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Add 3 integration tests to RemoteQueryLevelTriggerIT verifying that
existing query-level alerting behavior is preserved when the
multi_tenant_trigger_eval_enabled flag is disabled:

- test query level trigger flag disabled: monitor executes correctly
  via ScriptService path when flag is off (default)
- test query level trigger flag default is false: verify setting
  defaults to false via cluster settings REST API
- test query level trigger toggle flag during execution: both
  ScriptService and remote eval paths produce correct results when
  flag is toggled between executions

Also adds includeDefaults param to AlertingRestTestCase.getSettings()
helper to support querying default cluster settings.

🤖 Assisted by the code-assist SOP

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
…igger_

Replace internal jargon in the filter aggregation key prefix used by
RemoteQueryLevelTriggerEvaluator. The prefix _query_trigger_ better
describes its scope (query-level monitor triggers).

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
On evaluation error or missing trigger results, default to triggered=false
instead of triggered=true. Errors are still logged at WARN/ERROR level
and surfaced in the trigger run result. This prevents false-positive
alerts when the remote evaluation encounters issues.

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
Use open source terminology consistently throughout the codebase.

Signed-off-by: Surya Sashank Nistala <snistala@amazon.com>
@eirsep eirsep force-pushed the feat/multi-tenant-trigger-eval branch from f7b83de to 007a4a7 Compare April 2, 2026 07:55
val searchSource = buildEvalSearchSource(triggerData, searchResponse)

return try {
val evalRequest = SearchRequest(*indices.toTypedArray()).source(searchSource)
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not blocking:
Do we want to make use of the plugins.alerting.request_timeout or one of our other timeout-related cluster settings here?

Monitor.MonitorType.valueOf(monitor.monitorType.uppercase(Locale.ROOT)) == Monitor.MonitorType.QUERY_LEVEL_MONITOR &&
monitorResult.inputResults.results.isNotEmpty()
) {
val searchInput = monitor.inputs[0] as SearchInput
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Query-level monitors should always have SearchInput, a ClassCastException here would be unhandled

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants