@hashroute3 hashroute3 commented Oct 9, 2025

Description

This PR introduces time-based cluster scheduling to Trino Gateway, allowing administrators to automatically activate and deactivate clusters based on cron schedules. It is implemented as a separate, optional module that is not included by default; it can be added on demand for deployments that require time-based scheduling.

This feature is particularly useful for:

  • Cost optimization by dynamically activating or deactivating clusters during off-hours. While Trino Gateway itself does not shut down clusters, it can stop routing queries to them, allowing external autoscaling mechanisms to scale the clusters down naturally.
  • Workload management by activating specific clusters during peak hours
  • Maintenance windows by scheduling periods when clusters are excluded from routing.
  • Timezone-aware schedules for global deployments

Key features:

  • Cron-based scheduling with UNIX cron expression support
  • Configurable timezone support (defaults to America/Los_Angeles)
  • Flexible activation logic with activeDuringCron flag
  • Configurable check intervals for schedule evaluation
  • Graceful error handling for invalid configurations
  • Comprehensive logging for visibility into scheduling decisions and state transitions
  • Modular design — provided as a standalone module that is not enabled by default and can be included only when needed

Example configuration:

scheduleConfiguration:
  enabled: true
  checkInterval: 5m
  timezone: America/New_York  # Override default PST timezone if needed
  schedules:
    - clusterName: production-cluster
      cronExpression: "* 9-17 * * 1-5"  # Active during business hours (M-F)
      activeDuringCron: true
    - clusterName: dev-cluster
      cronExpression: "* 8-20 * * *"    # Active 08:00 through 20:59 daily
      activeDuringCron: true
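
To make the window covered by `"* 9-17 * * 1-5"` concrete, here is a minimal sketch that checks the same hour and weekday ranges with `java.time`. It is a hand-written stand-in, not the PR's code: the PR parses expressions with cron-utils, and the class name `BusinessHoursWindow` is hypothetical. It assumes the schedule is evaluated against wall-clock time in the configured timezone.

```java
import java.time.DayOfWeek;
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

// Hypothetical stand-in for the cron expression "* 9-17 * * 1-5".
// The PR itself relies on cron-utils for parsing; that library is not used here.
final class BusinessHoursWindow
{
    private BusinessHoursWindow() {}

    // True when the instant, viewed in the given zone, falls on Monday-Friday
    // with an hour field between 9 and 17 inclusive (09:00 through 17:59).
    static boolean matches(Instant instant, ZoneId zone)
    {
        ZonedDateTime local = instant.atZone(zone);
        DayOfWeek day = local.getDayOfWeek();
        boolean weekday = day != DayOfWeek.SATURDAY && day != DayOfWeek.SUNDAY;
        int hour = local.getHour();
        return weekday && hour >= 9 && hour <= 17;
    }

    public static void main(String[] args)
    {
        // Wednesday 2025-10-15 13:30 UTC is 09:30 in New York and 06:30 in Los Angeles.
        Instant instant = Instant.parse("2025-10-15T13:30:00Z");
        System.out.println(matches(instant, ZoneId.of("America/New_York")));    // true
        System.out.println(matches(instant, ZoneId.of("America/Los_Angeles"))); // false
    }
}
```

The same instant matches in one zone and not in the other, which is why the `timezone` setting matters. Note also that an hour range of `9-17` includes the whole 17:xx hour.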

Additional context and related issues

The scheduler works by:

  1. Parsing cron expressions at startup using the cronutils library
  2. Periodically checking each cluster's schedule against the current time
  3. Activating/deactivating clusters based on:
    • Whether the current time matches the cron schedule
    • The activeDuringCron flag (determines if cluster should be active during or outside the schedule)
    • The cluster's current state
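
The three inputs in step 3 can be combined as in the sketch below. This is one plausible reading of the description, not the PR's actual code. The names `ScheduleDecision`, `shouldBeActive`, and `decide` are hypothetical.

```java
// Hypothetical sketch of the activation decision described above.
final class ScheduleDecision
{
    enum Action { ACTIVATE, DEACTIVATE, NONE }

    private ScheduleDecision() {}

    // The cluster should be active when the cron match agrees with the flag:
    // inside the window if activeDuringCron is true, outside it if false.
    static boolean shouldBeActive(boolean cronMatches, boolean activeDuringCron)
    {
        return cronMatches == activeDuringCron;
    }

    // Only emit an action when the desired state differs from the current one.
    static Action decide(boolean cronMatches, boolean activeDuringCron, boolean currentlyActive)
    {
        boolean desired = shouldBeActive(cronMatches, activeDuringCron);
        if (desired == currentlyActive) {
            return Action.NONE;
        }
        return desired ? Action.ACTIVATE : Action.DEACTIVATE;
    }
}
```

With `activeDuringCron: false`, the logic is inverted: a cluster inside its cron window is routed away from, which suits the maintenance-window use case.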

Implementation details:

  • Implemented as a separate module that is not loaded by default; deployments can include it only when needed
  • Uses ScheduledExecutorService for reliable scheduling
  • Thread-safe using ConcurrentHashMap for execution times
  • Graceful handling of configuration changes and errors
  • Comprehensive unit tests covering various scenarios
  • Default timezone set to America/Los_Angeles (PST/PDT) for backward compatibility
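
As a rough illustration of the `ScheduledExecutorService` and `ConcurrentHashMap` combination listed above, the sketch below runs a periodic check and records when each cluster was last evaluated. It uses only the JDK. The class `PeriodicCheckSketch` and its methods are hypothetical and do not reproduce the PR's `ClusterScheduler`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of a periodic schedule check; not the PR's ClusterScheduler.
final class PeriodicCheckSketch implements AutoCloseable
{
    private final ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
    // Last evaluation time per cluster; safe to read from other threads.
    private final Map<String, Instant> lastChecked = new ConcurrentHashMap<>();

    // Runs the check immediately and then once per checkInterval.
    void start(List<String> clusterNames, Duration checkInterval)
    {
        executor.scheduleAtFixedRate(
                () -> runCheck(clusterNames), 0, checkInterval.toMillis(), TimeUnit.MILLISECONDS);
    }

    void runCheck(List<String> clusterNames)
    {
        for (String name : clusterNames) {
            try {
                // A real scheduler would evaluate the cron schedule and toggle routing here.
                lastChecked.put(name, Instant.now());
            }
            catch (RuntimeException e) {
                // An exception escaping the task would suppress all later runs.
                System.err.println("Schedule check failed for " + name + ": " + e);
            }
        }
    }

    Map<String, Instant> lastChecked()
    {
        return Map.copyOf(lastChecked);
    }

    @Override
    public void close()
    {
        executor.shutdownNow();
    }
}
```

Catching exceptions inside the task matters because `scheduleAtFixedRate` stops rescheduling a task once one of its executions throws.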

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
(x) Release notes are required, with the following suggested text:

* Add time-based cluster scheduling as an optional module, allowing clusters to be automatically activated or deactivated based on cron schedules.
This feature is not enabled by default and can be included on-demand for deployments that need time-based scheduling.
It enables cost optimization by aligning cluster routing availability with predictable usage patterns — while Trino Gateway does not directly shut down clusters, disabling routing allows external autoscaling systems to scale resources down naturally.
  Supports:
  - UNIX cron expressions for flexible scheduling
  - Configurable timezones (defaults to America/Los_Angeles)
  - Per-cluster schedules with customizable activation logic
  - Configurable check intervals

Summary by Sourcery

Implement an optional cron-based cluster scheduler that parses schedules from configuration, evaluates them at fixed intervals, and uses the backend manager to toggle routing for clusters according to defined time windows.

New Features:

  • Introduce an optional time-based cluster scheduling module to automatically activate or deactivate clusters based on cron expressions
  • Support configurable timezones for schedule evaluation (defaulting to America/Los_Angeles)
  • Allow per-cluster schedules with flexible activation logic (activeDuringCron flag) and customizable check intervals

Enhancements:

  • Integrate the scheduler into the application via a Guice module and lifecycle hooks
  • Extend HaGatewayConfiguration to include ScheduleConfiguration for optional scheduling support

Build:

  • Add cron-utils, jakarta.inject-api, and slf4j-api dependencies to the POM

Documentation:

  • Provide an example schedule-config.yaml demonstrating time-based activation settings

Tests:

  • Add comprehensive unit tests for ClusterScheduler covering activation, deactivation, timezone handling, invalid configurations, and multiple clusters

cla-bot bot commented Oct 9, 2025

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to [email protected]. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

sourcery-ai bot commented Oct 9, 2025

Reviewer's Guide

This PR introduces an optional time-based scheduling module that parses UNIX cron expressions at startup, uses a ScheduledExecutorService to periodically evaluate each cluster’s schedule against the current time (with configurable timezones), and invokes the GatewayBackendManager to activate or deactivate clusters based on cron matches and an activeDuringCron flag. It integrates via Guice providers, updates the main configuration API, and is packaged as a standalone module.

Sequence diagram for periodic cluster activation/deactivation

sequenceDiagram
    participant Scheduler as ClusterScheduler
    participant BackendManager as GatewayBackendManager
    participant Config as ScheduleConfiguration
    participant Cluster as ProxyBackendConfiguration
    Scheduler->>Config: Get schedules and timezone
    Scheduler->>Scheduler: Parse cron expressions
    Scheduler->>Scheduler: Start periodic check (ScheduledExecutorService)
    loop Every checkInterval
        Scheduler->>Config: For each ClusterSchedule
        Scheduler->>Scheduler: Evaluate cron match for current time
        Scheduler->>BackendManager: Get cluster by name
        alt Cluster found
            Scheduler->>Cluster: Check current active state
            alt State needs change
                Scheduler->>BackendManager: activateBackend()/deactivateBackend()
            else State unchanged
                Scheduler->>Scheduler: No action
            end
        else Cluster not found
            Scheduler->>Scheduler: Log warning
        end
    end
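
The loop in the diagram can be approximated in Java as follows. This is a sketch under stated assumptions: `Backends` is a hypothetical stand-in for `GatewayBackendManager`, and only the method names `activateBackend` and `deactivateBackend` come from the diagram. The `isActive` lookup, the `Schedule` holder, and `InMemoryBackends` are invented for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch of one pass of the periodic check shown in the diagram.
final class ReconcileSketch
{
    private ReconcileSketch() {}

    // Stand-in for GatewayBackendManager; isActive is an assumed lookup.
    interface Backends
    {
        Optional<Boolean> isActive(String clusterName); // empty when the cluster is unknown
        void activateBackend(String clusterName);
        void deactivateBackend(String clusterName);
    }

    // One schedule entry with the cron match for the current time already evaluated.
    static final class Schedule
    {
        final String clusterName;
        final boolean cronMatchesNow;
        final boolean activeDuringCron;

        Schedule(String clusterName, boolean cronMatchesNow, boolean activeDuringCron)
        {
            this.clusterName = clusterName;
            this.cronMatchesNow = cronMatchesNow;
            this.activeDuringCron = activeDuringCron;
        }
    }

    // Walks all schedules once and returns how many clusters changed state.
    static int reconcile(List<Schedule> schedules, Backends backends)
    {
        int changes = 0;
        for (Schedule schedule : schedules) {
            Optional<Boolean> current = backends.isActive(schedule.clusterName);
            if (current.isEmpty()) {
                System.err.println("WARN: no cluster named " + schedule.clusterName);
                continue;
            }
            boolean desired = schedule.cronMatchesNow == schedule.activeDuringCron;
            if (desired == current.get()) {
                continue; // state unchanged, no action
            }
            if (desired) {
                backends.activateBackend(schedule.clusterName);
            }
            else {
                backends.deactivateBackend(schedule.clusterName);
            }
            changes++;
        }
        return changes;
    }

    // In-memory Backends used only to exercise the sketch.
    static final class InMemoryBackends implements Backends
    {
        final Map<String, Boolean> active = new HashMap<>();

        @Override
        public Optional<Boolean> isActive(String clusterName)
        {
            return Optional.ofNullable(active.get(clusterName));
        }

        @Override
        public void activateBackend(String clusterName)
        {
            active.put(clusterName, true);
        }

        @Override
        public void deactivateBackend(String clusterName)
        {
            active.put(clusterName, false);
        }
    }
}
```

Skipping the call when the state already matches keeps the pass idempotent, so running it every `checkInterval` does not generate redundant backend updates.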

File-Level Changes

Change: Build and example configuration updates
Details:
  • Add cron-utils, jakarta.inject-api, and slf4j-api dependencies to the POM
  • Introduce an example YAML (schedule-config.yaml) demonstrating scheduling settings
Files:
  • gateway-ha/pom.xml
  • examples/schedule-config.yaml

Change: Configuration API for scheduling
Details:
  • Define ScheduleConfiguration and ClusterSchedule for cron and timezone settings
  • Update HaGatewayConfiguration to include ScheduleConfiguration
  • Add ClusterSchedulerModule for Guice bindings based on configuration
  • Add ClusterSchedulerConfiguration to manage scheduler lifecycle
Files:
  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ScheduleConfiguration.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/config/HaGatewayConfiguration.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/module/ClusterSchedulerModule.java
  • gateway-ha/src/main/java/io/trino/gateway/ha/config/ClusterSchedulerConfiguration.java

Change: ClusterScheduler implementation
Details:
  • Use cronutils to parse cron expressions and compute next execution times
  • Schedule periodic checks via ScheduledExecutorService in the configured timezone
  • Compare cron matches and activeDuringCron to determine desired cluster state
  • Invoke activateBackend or deactivateBackend on the GatewayBackendManager
  • Handle invalid cron expressions, configuration errors, and graceful shutdown
Files:
  • gateway-ha/src/main/java/io/trino/gateway/ha/scheduler/ClusterScheduler.java

Change: Comprehensive unit tests for scheduling logic
Details:
  • Add TestClusterScheduler covering activation, deactivation, no-op, disabled state, timezones, invalid cron, multiple clusters, and inverted logic
  • Use Mockito and AssertJ to verify interactions with GatewayBackendManager
Files:
  • gateway-ha/src/test/java/io/trino/gateway/ha/scheduler/TestClusterScheduler.java

@sourcery-ai sourcery-ai bot left a comment

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:

## Individual Comments

### Comment 1
<location> `gateway-ha/src/main/java/io/trino/gateway/ha/config/ScheduleConfiguration.java:74-77` </location>
<code_context>
+    }
+
+    @JsonProperty
+    public void setSchedules(List<ClusterSchedule> schedules)
+    {
+        this.schedules = schedules;
</code_context>

<issue_to_address>
**suggestion:** Consider defensive copying of the schedules list in the setter.

Assigning the list directly exposes internal state. Creating a copy (e.g., new ArrayList<>(schedules)) protects against external changes.

```suggestion
    public void setSchedules(List<ClusterSchedule> schedules)
    {
        this.schedules = (schedules == null) ? null : new ArrayList<>(schedules);
    }
```
</issue_to_address>
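
To show what the defensive copy in Comment 1 protects against, here is a small self-contained sketch. `DefensiveCopyDemo` is a hypothetical class that stores plain strings instead of `ClusterSchedule` objects. It contrasts a setter that keeps the caller's list with one that copies it.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; it is not part of the PR.
final class DefensiveCopyDemo
{
    private List<String> direct;
    private List<String> copied;

    // Keeps a reference to the caller's list, so later caller edits leak in.
    void setDirect(List<String> schedules)
    {
        this.direct = schedules;
    }

    // Copies the list, so later caller edits do not affect internal state.
    void setCopied(List<String> schedules)
    {
        this.copied = (schedules == null) ? null : new ArrayList<>(schedules);
    }

    List<String> direct()
    {
        return direct;
    }

    List<String> copied()
    {
        return copied;
    }

    public static void main(String[] args)
    {
        List<String> callerList = new ArrayList<>(List.of("production-cluster"));
        DefensiveCopyDemo demo = new DefensiveCopyDemo();
        demo.setDirect(callerList);
        demo.setCopied(callerList);

        callerList.add("dev-cluster"); // mutation after the setters ran

        System.out.println(demo.direct().size()); // 2
        System.out.println(demo.copied().size()); // 1
    }
}
```

The copy is shallow: it guards the list's membership, not the state of the elements inside it.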

@hashroute3 hashroute3 force-pushed the feature/cluster-scheduler-v2 branch from de5cb2b to ffbc046 Compare October 9, 2025 03:13

Chaho12 commented Oct 15, 2025

@andythsu

dang, the sequence diagram from sourcery.ai is 💯

andythsu commented Oct 15, 2025

To address some of the motivations behind this change:

> Cost optimization by dynamically activating or deactivating clusters during off-hours. While Trino Gateway itself does not shut down clusters, it can stop routing queries to them, allowing external autoscaling mechanisms to scale the clusters down naturally.

We used to let Trino Gateway handle activation/deactivation of the clusters, but we ran into some issues:

  1. If clusters are deactivated, they are considered gone from Trino Gateway's perspective, and admins have to intervene to reactivate them. That step is easily forgotten, which causes more problems (no queries can be routed).
  2. If clusters are activated/deactivated automatically by Trino Gateway (whether intentionally or unintentionally), we often don't get any notification, and even when we do, the notifications tend to be overlooked. This causes problems for query routing as well.

Therefore, we decided to disable automatic activation/deactivation in Trino Gateway completely, since it's a very important switch that only admins should flip.

Even if we fixed these issues in an ideal world, it would still be easier and more straightforward to just shut down the clusters instead of "deactivating" them in Trino Gateway, because at the end of the day Trino Gateway is just routing requests to the clusters. If you want to optimize cost, you should just turn off the clusters. Hiding a cluster from Trino Gateway isn't going to optimize your cost.

> Workload management by activating specific clusters during peak hours

We talked about dynamically bringing up Trino clusters during peak hours and killing them during non-peak hours, but that feature was never implemented. It may not even be a good fit for Trino Gateway. IMO, the best approach is to have a monitoring system that watches your traffic, brings up more Trino clusters when needed, and registers them with Trino Gateway.

For some background, Trino Gateway currently has a health check that monitors each Trino cluster's health. The status can be HEALTHY, UNHEALTHY, or PENDING. The definition of each status can be found at https://trinodb.github.io/trino-gateway/routing-rules/?h=healthy#trinostatus

tl;dr: if you shut down a Trino cluster, Trino Gateway will not deactivate it but will mark it as UNHEALTHY. Trino Gateway does not route requests to UNHEALTHY clusters but continues probing them. However, if you deactivate a cluster in Trino Gateway, the gateway loses the ability to "revive" it without an admin's intervention.

@hashroute3

Thanks for the detailed context; it helps clarify the rationale behind disabling automatic activation/deactivation in Gateway.

To clarify, this feature does not start or stop Trino clusters. It simply provides time-based routing control, letting Gateway temporarily include or exclude clusters from routing based on configurable schedules. This aligns Gateway behavior with autoscaling systems that already bring clusters up/down during predictable windows, reducing unnecessary routing attempts or failures.

A few points for clarification:

  • The module is optional and disabled by default, preserving existing behavior.
  • It doesn't modify cluster activation state, only routing eligibility.

From testing cluster lifecycle management that relies solely on HEALTHY/UNHEALTHY states, I observed a few challenges:

  • Graceful termination only applies to workers; shutting down a cluster kills in-flight queries because the coordinator also goes down.

  • When a cluster restarts, Gateway often marks it HEALTHY before all workers are ready, briefly increasing latency compared to other clusters.

We are already using this mechanism, and it has helped mitigate both issues by coordinating routing availability with known cluster schedules.
