20 commits
469219b
Add example_atproto_plugins: live JetStream sample
haileyok Apr 30, 2026
2c7edba
Address PR review: align with atproto-ruleset structure
haileyok Apr 30, 2026
1f6a8de
Fix mypy: split per-connection streaming into helper
haileyok Apr 30, 2026
b28092f
Use JetStream-native paths in sample rules and Action shape
haileyok Apr 30, 2026
c2c0746
Address review feedback on JetStream sample
haileyok May 6, 2026
7cccb4b
Fix _stream_one_connection test: catch close exception, use _item attr
haileyok May 6, 2026
232c449
Tighten _event_to_action time_us check to int only
haileyok May 6, 2026
0c8f462
Re-lock uv.lock at revision 3 to match main and fix Docker build
haileyok May 6, 2026
260e462
Merge remote-tracking branch 'origin/main' into hailey/atproto-jetstr…
haileyok May 7, 2026
6bf2841
Per-collection+operation action_names; UI default features per event
haileyok May 7, 2026
74ae20d
Mint action_ids in plugin to fix UI duplication
haileyok May 7, 2026
b650d94
Unify FollowSubject + LikeSubjectUri into one Subject feature
haileyok May 7, 2026
58d1bcf
wording tweaks
haileyok May 7, 2026
853aa74
fix
haileyok May 7, 2026
2cdb462
add a changelog
haileyok May 8, 2026
a591e1c
Switch to WebSocketApp.run_forever for keepalive
haileyok May 8, 2026
bd44b1d
Exponential reconnect backoff in JetStream input stream
haileyok May 8, 2026
ff18575
Drop Sentry usage and split JSON-decode error handling
haileyok May 8, 2026
fbee1a8
Use osprey.worker.lib.backoff.Backoff instead of hand-rolled state
haileyok May 8, 2026
f5b23ff
Merge branch 'main' into hailey/atproto-jetstream-sample
cassidyjames May 15, 2026
4 changes: 3 additions & 1 deletion AGENTS.md
@@ -12,14 +12,16 @@ Top-level modules:
- `osprey_coordinator/` — Rust gRPC coordinator (tokio, tonic, etcd, rdkafka). Rust code belongs here.
- `proto/osprey/rpc/` — protobuf source of truth for `osprey_rpc` and `osprey_coordinator` types.
- `example_plugins/` — reference plugins (UDFs, output sinks, labels service) using the pluggy-based plugin system. Do not add production code here.
- `example_atproto_plugins/` — reference plugin demonstrating a custom input stream that consumes the Bluesky firehose. Stack `docker-compose.atproto.yaml` on top of the main compose file (or use `./run-atproto.sh`) to run Osprey against live ATProto traffic. Do not add production code here.
- `example_rules/` — sample SML rules and YAML config.
- `example_atproto_rules/` — sample SML rules paired with `example_atproto_plugins/`.

Reference files: `docs/DEVELOPMENT.md` (setup), `example_plugins/src/register_plugins.py` (plugin patterns), `example_plugins/src/services/labels_service.py` (labels service example).

## Design

- API: gRPC between `osprey_coordinator` and workers; HTTP/Flask for `osprey-ui-api` (port 5004); protobuf definitions under `proto/osprey/rpc/` are authoritative.
- Rules: SML (Osprey's rule language) with user-defined functions registered via pluggy hooks (`@hookimpl_osprey`): `register_udfs`, `register_output_sinks`, `register_labels_service_or_provider`.
- Rules: SML (Osprey's rule language) with user-defined functions registered via pluggy hooks (`@hookimpl_osprey`): `register_udfs`, `register_output_sinks`, `register_labels_service_or_provider`, `register_input_stream` (custom event source; see `example_atproto_plugins/`).
- Data model conventions: Pydantic for models, SQLAlchemy for persistence (versions pinned in `pyproject.toml`).

## Build and run
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -21,6 +21,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- Add `ParseInt` UDF — converts a numeric string to an integer ([#190](https://github.com/roostorg/osprey/pull/190) by [@bealsbe](https://github.com/bealsbe))
- Add `StringSlice` UDF which extracts a substring by index range ([#189](https://github.com/roostorg/osprey/pull/189) by [@bealsbe](https://github.com/bealsbe))
- Add `InExperiment` UDF which checks if an entity is in an experiment ([#203](https://github.com/roostorg/osprey/pull/203) by [@bealsbe](https://github.com/bealsbe))
- Add ATProto JetStream example plugins and rules ([#236](https://github.com/roostorg/osprey/pull/236) by [@haileyok](https://github.com/haileyok))

### 🐛 Bug fixes
- Default to selecting all for event stream ([#194](https://github.com/roostorg/osprey/pull/194) by [@chimosky](https://github.com/chimosky))
21 changes: 21 additions & 0 deletions docker-compose.atproto.yaml
@@ -0,0 +1,21 @@
# Override that swaps the synthetic Kafka producer for the live Bluesky JetStream
# firehose, via the example_atproto_plugins package. Stack on top of the main
# compose file:
#
# docker compose -f docker-compose.yaml -f docker-compose.atproto.yaml up
#
# Or use the convenience wrapper: ./run-atproto.sh
services:
  osprey-worker:
    environment:
      OSPREY_INPUT_STREAM_SOURCE: plugin
      OSPREY_RULES_PATH: ./example_atproto_rules
    volumes:
      - ./example_atproto_rules:/osprey/example_atproto_rules
      - ./example_atproto_plugins:/osprey/example_atproto_plugins

  osprey-ui-api:
    environment:
      OSPREY_RULES_PATH: /osprey/example_atproto_rules
    volumes:
      - ./example_atproto_rules:/osprey/example_atproto_rules
72 changes: 72 additions & 0 deletions example_atproto_plugins/README.md
@@ -0,0 +1,72 @@
# example_atproto_plugins

A sample Osprey plugin that consumes ATProto's [JetStream](https://docs.bsky.app/blog/jetstream) as the input event source. It gives you:

- a `register_input_stream` hook implementation that subscribes to JetStream over WebSocket and yields Osprey `Action`s with the JetStream JSON event passed through as-is,
- realistic per-second event volume from the live Bluesky network, which is useful for load and soak testing changes that the synthetic 1-event/second producer doesn't exercise,
- a companion `example_atproto_rules/` tree showing how to organize rules against ATProto event shapes, with file structure modeled on [haileyok/atproto-ruleset](https://github.com/haileyok/atproto-ruleset).

## Running

From the repo root:

```sh
./run-atproto.sh
```

This brings up the full Osprey local stack (Druid, Postgres, MinIO, Kafka) along with a JetStream websocket override, and swaps the worker's input source from Kafka to the JetStream plugin, pointing it at `example_atproto_rules` instead of `example_rules`. First-run startup takes a few minutes.

## Configuration

| Env var | Default | Description |
| --- | --- | --- |
| `OSPREY_INPUT_STREAM_SOURCE` | (must be) `plugin` | Selects the plugin-provided stream. |
| `OSPREY_JETSTREAM_ENDPOINT` | `wss://jetstream2.us-west.bsky.network/subscribe` | JetStream WebSocket URL. |
| `OSPREY_JETSTREAM_WANTED_COLLECTIONS` | `app.bsky.feed.post,app.bsky.feed.like,app.bsky.feed.repost,app.bsky.graph.follow,app.bsky.actor.profile` | Comma-separated collections to subscribe to (server-side filter). |
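
As a rough illustration of how the two JetStream settings above end up in the subscription URL, the sketch below splits the comma-separated collection list and repeats `wantedCollections` once per entry. The helper names `parse_collections` and `build_subscribe_url` are made up for this example; the plugin's own logic lives in `register_plugins.py` and `JetStreamInputStream._build_url`.

```python
from typing import List, Optional
from urllib.parse import urlencode

DEFAULT_ENDPOINT = 'wss://jetstream2.us-west.bsky.network/subscribe'


def parse_collections(raw: Optional[str]) -> List[str]:
    """Split a comma-separated value, dropping blanks and surrounding whitespace."""
    if not raw:
        return []
    return [c.strip() for c in raw.split(',') if c.strip()]


def build_subscribe_url(endpoint: str, collections: List[str], cursor: Optional[int] = None) -> str:
    """Repeat wantedCollections once per collection; add cursor only when resuming."""
    params = [('wantedCollections', c) for c in collections]
    if cursor is not None:
        params.append(('cursor', str(cursor)))
    return f'{endpoint}?{urlencode(params)}'


# Example value as it might appear in OSPREY_JETSTREAM_WANTED_COLLECTIONS.
raw_value = 'app.bsky.feed.post, app.bsky.feed.like'
print(build_subscribe_url(DEFAULT_ENDPOINT, parse_collections(raw_value)))
```

Passing `cursor=<time_us>` appends a `cursor` query parameter, which is how a reconnect can resume from the last event seen.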

## Action shape

The JetStream JSON event is passed through unchanged as the Action's `data` dict, so rules read JetStream-native paths directly. `action_name` is `<operation>_<short>` for commit events (`create_post`, `delete_like`, `update_profile`, …) using the short names defined in `COLLECTION_NAMES`, or `identity` for identity events.

### Commit events (e.g. `create_post`, `delete_like`)

```
{
  "did": "did:plc:...",
  "time_us": 1714500000000000,
  "kind": "commit",
  "commit": {
    "rev": "...",
    "operation": "create" | "update" | "delete",
    "collection": "app.bsky.feed.post",
    "rkey": "...",
    "cid": "...",
    "record": { ... raw ATProto record ... }
  }
}
```

### Identity events (`action_name='identity'`)

```
{
  "did": "did:plc:...",
  "time_us": ...,
  "kind": "identity",
  "identity": {"did": "...", "handle": "...", "seq": ..., "time": "..."}
}
```

Account events, commits for collections not in `COLLECTION_NAMES`, and commits with operations other than `create` / `update` / `delete` are skipped.
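
The naming and skip rules above can be sketched in a few lines. This is an illustrative stand-alone version, not the plugin's code: `action_name_for` and `event_time` are hypothetical helpers, and the real `_event_to_action` additionally rejects events whose `time_us` is not a positive integer.

```python
from datetime import datetime, timezone
from typing import Any, Dict, Optional

COLLECTION_NAMES = {
    'app.bsky.feed.post': 'post',
    'app.bsky.feed.like': 'like',
    'app.bsky.feed.repost': 'repost',
    'app.bsky.graph.follow': 'follow',
    'app.bsky.actor.profile': 'profile',
}


def action_name_for(event: Dict[str, Any]) -> Optional[str]:
    """Return '<operation>_<short>' for commits, 'identity' for identity events, else None."""
    kind = event.get('kind')
    if kind == 'identity':
        return 'identity'
    if kind != 'commit':
        return None  # account events and unknown kinds are skipped
    commit = event.get('commit') or {}
    short = COLLECTION_NAMES.get(commit.get('collection', ''))
    operation = commit.get('operation')
    if short is None or operation not in ('create', 'update', 'delete'):
        return None
    return f'{operation}_{short}'


def event_time(event: Dict[str, Any]) -> datetime:
    """Convert JetStream's microsecond time_us into a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(event['time_us'] / 1_000_000, tz=timezone.utc)


sample = {
    'did': 'did:plc:example',
    'time_us': 1714500000000000,
    'kind': 'commit',
    'commit': {'operation': 'create', 'collection': 'app.bsky.feed.post'},
}
print(action_name_for(sample), event_time(sample).isoformat())
```

A rule that only cares about new posts can then match on `create_post` rather than inspecting `$.commit.collection` and `$.commit.operation` separately.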

### UI default features

`example_atproto_rules/config/ui_config.yaml` declares the per-action default features the Osprey UI surfaces in the event stream — e.g. `PostText` for `create_post`, `IdentityHandle` for `identity`, `Subject` for like / repost / follow events. Add new entries there to expose more fields without touching rule code.

`action_id` is minted from `snowflake-id-worker` in batches of 250. The plugin therefore needs `SNOWFLAKE_API_ENDPOINT` to be set (the local docker-compose stack provides it).
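
To make the batching concrete, here is a minimal sketch of the buffer-and-refill pattern, with a local counter standing in for the ID service. `BatchedIdMinter` and `fake_fetch` are hypothetical names for this example; the plugin itself calls `generate_snowflake_batch` and keeps the buffer on the input stream instance.

```python
import itertools
from typing import Callable, List


class BatchedIdMinter:
    """Hands out IDs one at a time, refilling from fetch_batch when the buffer is empty."""

    def __init__(self, fetch_batch: Callable[[int], List[int]], batch_size: int = 250) -> None:
        self._fetch_batch = fetch_batch
        self._batch_size = batch_size
        self._buffer: List[int] = []
        self.fetch_calls = 0

    def next_id(self) -> int:
        if not self._buffer:
            # One round trip to the ID service buys batch_size IDs.
            self._buffer = list(self._fetch_batch(self._batch_size))
            self.fetch_calls += 1
        # pop() takes from the end, so IDs within a batch are handed out newest-first.
        return self._buffer.pop()


# Stand-in for the ID service: a process-local counter (assumption for the demo).
_counter = itertools.count(1)


def fake_fetch(count: int) -> List[int]:
    return [next(_counter) for _ in range(count)]


minter = BatchedIdMinter(fake_fetch)
ids = [minter.next_id() for _ in range(600)]
print(minter.fetch_calls, len(set(ids)))
```

With a batch size of 250, minting 600 IDs costs three fetches instead of 600, which matters at firehose event rates.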

## Caveats

- **Not production-ready.** No durable cursor on process restart, no zstd compression, no DID-level filtering. Good for sample / load-testing purposes; not a drop-in for a real ATProto deployment.
- **No event enrichment.** JetStream only carries what's in the commit itself; rulesets that depend on handle / profile / account age (such as much of [atproto-ruleset](https://github.com/haileyok/atproto-ruleset)) are fed by a separate enrichment pipeline, not JetStream directly. This plugin emits JetStream-native paths (`$.did`, `$.commit.collection`, etc.); enrichment-fed rulesets would need an enrichment service in front of this one or a different plugin.
- **Connection health.** WebSocket-level PING/PONG keepalive runs every 20s with a 10s pong timeout (`websocket-client`'s `WebSocketApp.run_forever(ping_interval, ping_timeout)`). A stalled or dead connection is detected within ~30s and triggers a reconnect from the last seen `time_us` cursor.
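
For the reconnect step mentioned in the last caveat, the plugin waits between attempts with a delay bounded by its constructor defaults (2 seconds minimum, 60 seconds maximum). The sketch below shows one common way such a delay can behave. `ReconnectDelay` is a hypothetical class written for this example, and the doubling policy is an assumption; the plugin delegates to `osprey.worker.lib.backoff.Backoff`, whose exact behaviour may differ.

```python
from typing import List


class ReconnectDelay:
    """Capped exponential delay: grows on consecutive failures, resets on success."""

    def __init__(self, min_delay: float = 2.0, max_delay: float = 60.0) -> None:
        self._min = min_delay
        self._max = max_delay
        self.current = min_delay

    def fail(self) -> float:
        """Return the delay to wait now, then double it (up to the cap) for next time."""
        delay = self.current
        self.current = min(self.current * 2, self._max)
        return delay

    def succeed(self) -> None:
        """A connection that delivered events resets the delay to the minimum."""
        self.current = self._min


delay = ReconnectDelay()
waits: List[float] = [delay.fail() for _ in range(7)]
print(waits)  # [2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
delay.succeed()
print(delay.current)  # 2.0
```

The cap keeps a long outage from pushing the wait beyond a minute, while the reset means a healthy connection that later drops reconnects quickly.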
Empty file.
18 changes: 18 additions & 0 deletions example_atproto_plugins/pyproject.toml
@@ -0,0 +1,18 @@
[project]
name = "example_atproto_plugins"
version = "0.1.0"
description = "Example Osprey plugin that consumes Bluesky's ATProto JetStream firehose"
requires-python = ">=3.11"
dependencies = [
    "pluggy==1.5.0",
    "websocket-client==1.8.0",
]

[tool.setuptools]
package-dir = {"" = "src"}

[tool.setuptools.packages.find]
where = ["src"]

[project.entry-points.osprey_plugin]
atproto_plugins = "atproto_plugin.register_plugins"
Empty file.
187 changes: 187 additions & 0 deletions example_atproto_plugins/src/atproto_plugin/jetstream_input_stream.py
@@ -0,0 +1,187 @@
import json
import time
from datetime import datetime, timezone
from typing import Any, Dict, Iterator, List, Optional
from urllib.parse import urlencode

import gevent
import websocket
from gevent.queue import Queue
from osprey.engine.executor.execution_context import Action
from osprey.worker.lib.backoff import Backoff
from osprey.worker.lib.instruments import metrics
from osprey.worker.lib.osprey_shared.logging import get_logger
from osprey.worker.lib.snowflake import generate_snowflake_batch
from osprey.worker.sinks.sink.input_stream import BaseInputStream
from osprey.worker.sinks.utils.acking_contexts import BaseAckingContext, NoopAckingContext

logger = get_logger()


DEFAULT_ENDPOINT = 'wss://jetstream2.us-west.bsky.network/subscribe'
COLLECTION_NAMES = {
    'app.bsky.feed.post': 'post',
    'app.bsky.feed.like': 'like',
    'app.bsky.feed.repost': 'repost',
    'app.bsky.graph.follow': 'follow',
    'app.bsky.actor.profile': 'profile',
}
DEFAULT_COLLECTIONS = tuple(COLLECTION_NAMES.keys())
SNOWFLAKE_BATCH_SIZE = 250
PING_INTERVAL_SECONDS = 20
PING_TIMEOUT_SECONDS = 10


class JetStreamInputStream(BaseInputStream[BaseAckingContext[Action]]):
Member Author

this is the main thing i'd love to get eyes on if anyone has python websocket experience...it doesn't need to be production stable since it's really a testing/example guy (not to mention jetstream shouldn't be used for real-world moderation tasks anyway) but it should be at least mostly stable...

Member Author

i did want to opt for a non-kafka option in here, because doing this helps break the misconception that osprey can only be used with kafka

    """An Osprey event input stream that subscribes to the ATProto JetStream websocket and yields
    Osprey actions.

    The JetStream JSON event is passed through unchanged as the Action's data dict, so rules may
    target the JetStream-native paths directly, like $.did, $.kind, or $.commit.operation.
    """

    def __init__(
        self,
        endpoint: Optional[str] = None,
        wanted_collections: Optional[List[str]] = None,
        reconnect_seconds: float = 2.0,
        max_reconnect_seconds: float = 60.0,
    ):
        super().__init__()
        self._endpoint = endpoint or DEFAULT_ENDPOINT
        self._wanted_collections = list(wanted_collections) if wanted_collections else list(DEFAULT_COLLECTIONS)
        self._backoff = Backoff(min_delay=reconnect_seconds, max_delay=max_reconnect_seconds)
        self._last_time_us: Optional[int] = None
        self._snowflake_buffer: List[int] = []

    def _next_action_id(self) -> int:
        if not self._snowflake_buffer:
            batch = generate_snowflake_batch(count=SNOWFLAKE_BATCH_SIZE, retries=3)
            self._snowflake_buffer = [s.to_int() for s in batch]
        return self._snowflake_buffer.pop()

    def _build_url(self) -> str:
        params = [('wantedCollections', c) for c in self._wanted_collections]
        if self._last_time_us is not None:
            params.append(('cursor', str(self._last_time_us)))
        return f'{self._endpoint}?{urlencode(params)}'

    def _gen(self) -> Iterator[BaseAckingContext[Action]]:
        while True:
Comment on lines +69 to +70
Member

We should probably add a backoff. If JetStream is down or rate-limiting, you reconnect every 2 seconds in a tight loop and trigger escalating errors, potentially stricter rate limits, and a Sentry capture on every event. We can start with a simple one:

backoff = self._reconnect_seconds

while True:
.....
    backoff = min(backoff * 2, 60.0)

            had_event = False
            try:
                url = self._build_url()
                for ctx in self._stream_one_connection(url):
                    had_event = True
                    yield ctx
            except Exception as e:
                logger.exception(f'JetStream stream error: {e}')

            if had_event:
                self._backoff.succeed()
                delay = self._backoff.current
            else:
                delay = self._backoff.fail()
            logger.info(f'Reconnecting in {delay:.1f}s')
            time.sleep(delay)

    def _stream_one_connection(self, url: str) -> Iterator[BaseAckingContext[Action]]:
        # WebSocketApp drives PING/PONG keepalive on its own greenlet; we bridge its
        # callback API into this generator via a gevent.queue.Queue. The 'done' sentinel
Comment on lines +89 to +90
Member Author

man websocket with gevent scares me

        # is pushed by on_close/on_error, and also as a safety net by greenlet.link in
        # case run_forever exits without firing on_close (e.g., uncaught exception).
        queue: 'Queue[tuple[str, Optional[bytes]]]' = Queue()

        def on_open(ws: Any) -> None:
            logger.info('JetStream connection established')

        def on_message(ws: Any, raw: Any) -> None:
            queue.put(('message', raw))

        def on_close(ws: Any, status: Any, msg: Any) -> None:
            logger.info(f'JetStream connection closed (status={status}); will reconnect')
            queue.put(('done', None))

        def on_error(ws: Any, err: Any) -> None:
            logger.warning(f'JetStream connection error: {err}; will reconnect')
            queue.put(('done', None))

        logger.info(f'Connecting to JetStream at {url}')
        app = websocket.WebSocketApp(
            url,
            on_open=on_open,
            on_message=on_message,
            on_close=on_close,
            on_error=on_error,
        )
        runner = gevent.spawn(
            app.run_forever, ping_interval=PING_INTERVAL_SECONDS, ping_timeout=PING_TIMEOUT_SECONDS
        )
        runner.link(lambda _g: queue.put(('done', None)))

        try:
            while True:
                kind, raw = queue.get()
                if kind == 'done':
                    return
                if not raw:
                    continue
                try:
                    event = json.loads(raw)
                except json.JSONDecodeError as e:
                    raw_bytes = raw if isinstance(raw, bytes) else str(raw).encode('utf-8', errors='replace')
                    logger.warning(
                        f'JetStream payload was not valid JSON ({e}); '
                        f'first 200 bytes: {raw_bytes[:200]!r}'
                    )
                    continue
                if not isinstance(event, dict):
                    logger.warning(
                        f'JetStream payload parsed to non-object JSON '
                        f'(got {type(event).__name__}); skipping'
                    )
                    continue
                try:
                    action_id = self._next_action_id()
                except Exception:
                    logger.exception('failed to mint action_id from snowflake-id-worker; skipping event')
                    continue
Comment on lines +146 to +148
Member

Suggested change
- except Exception:
-     logger.exception('skipping malformed JetStream event')
-     sentry_sdk.capture_exception()
-     continue
+ except json.JSONDecodeError:
+     logger.warning('skipping malformed JetStream JSON')
+     continue

We should be more specific about the error.

                action = _event_to_action(event, action_id=action_id)
                if action is None:
                    continue
                time_us = event.get('time_us')
                if time_us and isinstance(time_us, int) and time_us > 0:
                    self._last_time_us = time_us
                metrics.increment('jetstream_input_stream.events', tags=[f'action_name:{action.action_name}'])
                yield NoopAckingContext(action)
        finally:
            try:
                app.close()
            except Exception:
                logger.info('ignored error while closing JetStream WebSocketApp', exc_info=True)
            runner.join(timeout=5)


def _event_to_action(event: Dict[str, Any], action_id: int) -> Optional[Action]:
    """Wraps a JetStream event as an Osprey action, or returns None if it should be skipped."""
    kind = event.get('kind')
    if kind not in ('commit', 'identity'):
        return None
    time_us = event.get('time_us')
    if not isinstance(time_us, int) or time_us <= 0:
        return None
    if kind == 'commit':
        commit = event.get('commit') or {}
        operation = commit.get('operation')
        short = COLLECTION_NAMES.get(commit.get('collection', ''))
        if short is None or operation not in ('create', 'update', 'delete'):
            return None
        action_name = f'{operation}_{short}'
    else:
        action_name = 'identity'
    return Action(
        action_id=action_id,
        action_name=action_name,
        data=event,
        timestamp=datetime.fromtimestamp(time_us / 1_000_000, tz=timezone.utc),
    )
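
The callback-to-generator bridge used in `_stream_one_connection` can be tried in isolation. The sketch below is a simplified analogue under stated assumptions: it uses the standard library's `threading` and `queue` in place of gevent and `websocket-client`, and `stream_from_callbacks`, `fake_producer`, and `crashing_producer` are hypothetical names. It keeps the same shape: callbacks push onto a queue, a `'done'` sentinel ends the generator, and a safety net pushes the sentinel even if the producer dies without closing.

```python
import queue
import threading
from typing import Callable, Iterator, Optional, Tuple

Item = Tuple[str, Optional[str]]
Producer = Callable[[Callable[[str], None], Callable[[], None]], None]


def stream_from_callbacks(run: Producer) -> Iterator[str]:
    """Turn a callback-driven producer into a generator via a queue and a 'done' sentinel."""
    q: 'queue.Queue[Item]' = queue.Queue()

    def on_message(raw: str) -> None:
        q.put(('message', raw))

    def on_close() -> None:
        q.put(('done', None))

    def runner() -> None:
        try:
            run(on_message, on_close)
        finally:
            # Safety net: always unblock the consumer, even if run() raised
            # before it could call on_close.
            q.put(('done', None))

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    try:
        while True:
            kind, raw = q.get()
            if kind == 'done':
                return
            if raw:  # skip empty payloads
                yield raw
    finally:
        worker.join(timeout=5)


def fake_producer(on_message: Callable[[str], None], on_close: Callable[[], None]) -> None:
    for payload in ('{"kind": "commit"}', '', '{"kind": "identity"}'):
        on_message(payload)
    on_close()


def crashing_producer(on_message: Callable[[str], None], on_close: Callable[[], None]) -> None:
    on_message('{"kind": "commit"}')
    raise RuntimeError('connection dropped')  # never calls on_close


print(list(stream_from_callbacks(fake_producer)))
print(list(stream_from_callbacks(crashing_producer)))
```

The duplicate sentinel that the safety net pushes after a clean close is harmless, because the generator returns on the first one it sees.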
15 changes: 15 additions & 0 deletions example_atproto_plugins/src/atproto_plugin/register_plugins.py
@@ -0,0 +1,15 @@
from osprey.engine.executor.execution_context import Action
from osprey.worker.adaptor.plugin_manager import hookimpl_osprey
from osprey.worker.lib.config import Config
from osprey.worker.sinks.sink.input_stream import BaseInputStream
from osprey.worker.sinks.utils.acking_contexts import BaseAckingContext

from atproto_plugin.jetstream_input_stream import JetStreamInputStream


@hookimpl_osprey
def register_input_stream(config: Config) -> BaseInputStream[BaseAckingContext[Action]]:
    endpoint = config.get_optional_str('OSPREY_JETSTREAM_ENDPOINT')
    raw_collections = config.get_optional_str('OSPREY_JETSTREAM_WANTED_COLLECTIONS')
    wanted = [c.strip() for c in raw_collections.split(',') if c.strip()] if raw_collections else None
    return JetStreamInputStream(endpoint=endpoint, wanted_collections=wanted)
Empty file.