Skip to content
/ epoch Public

A time based container job scheduler for Drove

Notifications You must be signed in to change notification settings

PhonePe/epoch

Repository files navigation

E P O C H

Quality Gate Status

The task scheduler on Drove

Description

Epoch is task runner that allows you to run stand-alone Tasks on DROVE
Think of it being very similar to Chronos, but on Drove and not Mesos.

Features

  • Self-serve UI for creating and managing tasks
  • Ability to create a topology and schedule a recurring task
  • Ability to run the above topology instantly

Internals

Terminology

TASK: The primary unit of execution in Epoch is a Task. A task is a stand-alone unit of execution that run on Drove.
TOPOLOGY: A topology is a definition of how the task is run - it can be scheduled to run at a particular time or at a recurring intervals.

  • Instant Run: A topology can be run instantly at a particular time
  • Scheduled Run: Every topology is created with a QUARTZ cron expression. This expression determines the schedule of the topology.

⚠️ Tasks On Drove: Today, Drove only supports running tasks that are packaged as docker images. So, the task that you want to run on Drove should be packaged as a docker image and pushed to a Docker registry.

ℹ️ Quartz Cron: Epoch uses Quartz cron expressions to schedule tasks. You can read more about Quartz cron expressions here

Architecture

The first important aspect is to understand Leader Election and how requests are routed across different epoch instances.
A single Epoch node is the leader to all requests from the UI, and is responsible for running all Topologies/Tasks, and doing the various state management.
The remaining nodes route requests to the leader node.
This is done using a simple Zookeeper based leader election, and a Routing Servlet Filter.

Why did we choose to do all of this on a single leader node, isn't it less scalable ?

The actual execution of the task is done on Drove. Epoch is only responsible for managing the state of the task, and the topology.
So the resource intensive operation (of actually running the docker task) is outside of Epoch. Since only tracking these tasks through a scheduler is the only thing that Epoch does, one node of epoch should scale for 10s of thousands of task runs. Should we need to scale this further, we can always distribute the tasks across multiple nodes.
But until then, with the choice of not distributing, tracking all of it through a single instance is more scalable.
Internally, Epoch uses Kaal for scheduling the status checks and the topology runs in memory.

A TLDR version of the request routing is represented below

Understanding the various states

As explained earlier, a task is a stand-alone unit of execution.
The following diagram shows the various states of a task. This is inline with the states of a task in Drove.

The state of the task determines the state of a topology run.
The following diagram shows the various states of a specific run of the Topology

And finally, the above is only applicable if the Topology is not PAUSED. This is purely determined by the state of the Topology, set using the UI
The following shows the various states of a topology

What does a full create flow look like?

Zookeeper for storing tasks, runs and topologies

Epoch uses Zookeeper to store the tasks and topologies. The following diagram shows the structure of the data in Zookeeper

Usage

The Epoch server container is available at ghcr.io.

The container is intended to be run on a Drove cluster.

Environment Variables

The following environment variables are understood by the container:

Variable Name Required without external config Description
ZK_CONNECTION_STRING Yes. Unnecessary if config is being injected. Connection String for the Zookeeper Cluster
DROVE_ENDPOINT Yes. Unnecessary if config is being injected. HTTP(S) endpoint for the Drove cluster
DROVE_APP_NAME Injected by Drove App name for the container on the Drove cluster.
Do not keep changing this as it will lose stored job context.
CONFIG_FILE_PATH To use custom config file. Can be put on some executor and volume mounted in the container. By default config file in /home/default/config.yml is used.
ADMIN_PASSWORD Optional but Recommended. Unnecessary if config is injected. Password for the user admin which has read/write permissions. Default value is admin.
GUEST_PASSWORD Optional but Recommended. Unnecessary if config is injected. Password for the user guest which has read only permissions. Default value is guest.
GC_ALGO Optional GC to be used by JVM. By default G1GC is used.
JAVA_PROCESS_MIN_HEAP Optional Minimum Java Heap size. Set to 1 GB by default.
JAVA_PROCESS_MAX_HEAP Optional Maximum Java Heap size. Set to 1 GB by default.
JAVA_OPTS Optional Additional java options.
DEBUG Optional Set to a non-zero value to print environment variables etc. Note: This will print all env variables.

Deploying Epoch on Drove

The following is a sample app specification for deploying the epoch container on drove.

{
  "name": "epoch",
  "version": "1",
  "executable": {
    "type": "DOCKER",
    "url": "ghcr.io/phonepe/epoch:latest",
    "dockerPullTimeout": "2 minute"
  },
  "exposedPorts": [
    {
      "name": "main",
      "port": 8080,
      "type": "HTTP"
    },
    {
      "name": "admin",
      "port": 8081,
      "type": "HTTP"
    }
  ],
  "volumes": [],
  "type": "SERVICE",
  "logging": {
    "type": "LOCAL",
    "maxSize": "10m",
    "maxFiles": 3,
    "compress": true
  },
  "resources": [
    {
      "type": "MEMORY",
      "sizeInMB": 4096
    },
    {
      "type": "CPU",
      "count": 1
    }
  ],
  "placementPolicy": {
    "type": "ANY"
  },
  "healthcheck": {
    "mode": {
      "type": "HTTP",
      "protocol": "HTTP",
      "portName": "admin",
      "path": "/healthcheck",
      "verb": "GET",
      "successCodes": [
        200
      ],
      "payload": "",
      "connectionTimeout": "20 seconds"
    },
    "timeout": "5 seconds",
    "interval": "20 seconds",
    "attempts": 3,
    "initialDelay": "0 seconds"
  },
  "readiness": {
    "mode": {
      "type": "HTTP",
      "protocol": "HTTP",
      "portName": "admin",
      "path": "/healthcheck",
      "verb": "GET",
      "successCodes": [
        200
      ],
      "payload": "",
      "connectionTimeout": "3 seconds"
    },
    "timeout": "3 seconds",
    "interval": "10 seconds",
    "attempts": 3,
    "initialDelay": "10 seconds"
  },
  "tags": {},
  "env": {
    "JAVA_PROCESS_MIN_HEAP": "2g",
    "JAVA_PROCESS_MAX_HEAP": "2g",
    "ADMIN_PASSWORD" : "adminpassword",
    "GUEST_PASSWORD" : "guestpassword",
    "ZK_CONNECTION_STRING" : "<YOUR_ZK_CONNECTION_STRING>",
    "DROVE_ENDPOINT" : "<YOUR_DROVE_ENDPOINT>"
  },
  "exposureSpec": {
    "vhost": "epoch.<YOUR_DOMAIN>",
    "portName": "main",
    "mode": "ALL"
  },
  "preShutdown": {
    "hooks": [],
    "waitBeforeKill": "30 seconds"
  },
  "instances" : 1
}

Replace the following in the above:

  • YOUR_ZK_CONNECTION_STRING - Connection string for your zookeeper cluster
  • YOUR_DROVE_ENDPOINT - HTTP(S) endpoint for Drove
  • YOUR_DOMAIN - Domain on which you want drove-nixy to configure nginx vhost

Save the json in some file say epoch.json

Deploy to drove using the following command:

drove -c testing apps create epoch.json

This will create an app called epoch-1.

Scale the app to one instance.

drove -c oss apps deploy epoch-1 1 -w

The drove app should be available at http://epoch.<YOU_DOMAIN>.

License

Apache 2