**Is your feature request related to a problem? Please describe.**
Pathway's persistence layer stores state in a KV storage backend (e.g. S3). Over time, older state snapshots become obsolete as they are superseded by newer ones. A compaction worker is responsible for identifying and deleting this stale state (obsolete objects can be identified by their naming conventions).
Currently, a compaction worker is started alongside every persistent operator as part of the main pipeline process. This means compaction runs wherever the pipeline runs. When the pipeline and the storage bucket are in different availability zones (a common production setup, e.g. pipeline on cheap compute, state on S3 in another zone or region), every compaction cycle generates cross-zone traffic, which is expensive. As a result, compaction frequency is intentionally kept low to control costs, which means stale state accumulates and storage costs grow over time. This is a lose-lose tradeoff.
**Describe the solution you'd like**
Add a `spawn_compaction_worker` command to the Pathway CLI, alongside the existing `spawn` command. When invoked, it takes the same pipeline file as input but starts only the compaction worker — no pipeline operators, source connectors, or data processing are initialized. It is purely a storage maintenance process that connects to the configured KV backend, inspects persisted objects, and deletes stale ones using the same naming-convention logic already used by the embedded worker.
Usage would look like:
```sh
# Start the pipeline as usual
pathway spawn python pipeline.py

# Start only the compaction worker, e.g. as a cron job or daemon co-located with the bucket
pathway spawn_compaction_worker python pipeline.py
```
Both commands share the same pipeline file, so no separate configuration format or duplication is needed. The intended deployment pattern is to run `spawn_compaction_worker` in the same availability zone as the storage bucket (e.g. same AZ as the S3 bucket), while the pipeline itself runs wherever compute is cheapest. This means the compaction worker can run frequently and aggressively at minimal cross-zone traffic cost, while the pipeline is not penalized for storage operations.
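To make the intended behavior concrete, here is a minimal sketch of the kind of naming-convention logic the standalone worker would reuse. The key format (`<operator_id>/snapshot-<version>`, higher version supersedes lower) is purely a hypothetical placeholder — the actual convention is whatever Pathway's embedded compaction worker already uses:

```python
# Hypothetical sketch of stale-snapshot detection by naming convention.
# Assumed key format: "<operator_id>/snapshot-<version>", where a higher
# version supersedes all lower versions for the same operator. The real
# Pathway naming scheme may differ; this only illustrates the idea.
from collections import defaultdict


def stale_keys(keys: list[str]) -> list[str]:
    """Return keys that are superseded by a newer snapshot of the same prefix."""
    latest: dict[str, int] = defaultdict(int)
    for key in keys:
        prefix, _, version = key.rpartition("-")
        latest[prefix] = max(latest[prefix], int(version))
    # Everything older than the latest version for its prefix is stale.
    return [
        key
        for key in keys
        if int(key.rpartition("-")[2]) < latest[key.rpartition("-")[0]]
    ]


keys = ["op1/snapshot-1", "op1/snapshot-2", "op2/snapshot-5"]
print(stale_keys(keys))  # only op1/snapshot-1 is superseded
```

The standalone worker would run this kind of scan against the configured bucket and issue deletes for the stale keys, exactly as the embedded worker does today, just from a different process and location.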
**Describe alternatives you've considered**
The current approach of running the compaction worker inside the pipeline process is the only available option. Reducing compaction frequency is the only cost mitigation today, but it merely trades cross-zone traffic costs for storage bloat. There is no way to get both low cost and low storage bloat with the current architecture.
**Additional context**
- The compaction worker process should be safe to run concurrently with the main pipeline. Any necessary coordination (e.g. ensuring the worker does not delete state that the pipeline is currently writing) should be handled correctly — this likely requires reviewing the existing locking or object-naming guarantees already in place for the embedded worker.
- The worker should be deployable as a simple cron job or a long-running daemon, depending on the user's preference.
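As one possible cron-based deployment (assuming the proposed command exists and the pipeline file lives at a path like `/opt/app/pipeline.py` — both placeholders), the crontab entry on a host co-located with the bucket could look like:

```shell
# Hypothetical crontab entry: run the standalone compaction worker every
# 15 minutes from a host in the same AZ as the S3 bucket. Paths and
# schedule are illustrative only.
*/15 * * * * pathway spawn_compaction_worker python /opt/app/pipeline.py >> /var/log/pathway-compaction.log 2>&1
```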
- No changes to the persistence storage format or the main pipeline's behavior are required; this is purely an operational improvement.