Collection directory #929

brianolson · 2025-02-04T03:10:55Z

New service for indexing (did,collection) pairs.
The primary query is "which repos have any data for some collection?"
It also keeps statistics on how many repos contain data for each collection, and how many repos see traffic on each collection in a day (per-collection daily active users across the atproto firehose).

lookup repos by collection (who has app.bsky.feed.post records?)
firehose consumer and crawl PDS by listRepos, describeRepo
daily-active-users collections

list-collections adds badwords filter and hide below 5 users

dau fix

bnewbold

Pretty far along, left a bunch of comments.

The main thing is to get the primary endpoint over to /xrpc/com.atproto.sync.listReposByCollection. You can do that just by implementing it as an HTTP endpoint; we did that for the search service (IIRC). Also need to get that Lexicon endpoint written and reviewed in the atproto (typescript) repo. I can take that task if you want.
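
A minimal sketch of what "implementing it as an HTTP endpoint" could look like with only the standard library. The in-memory `collectionIndex`, the `collection` query parameter name, and the error body shape are assumptions for illustration; the real parameters and output schema are whatever the Lexicon ends up defining, and the real service reads from its own store.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
)

// collectionIndex is a hypothetical in-memory stand-in for the real store:
// collection NSID -> set of DIDs that have records in it.
type collectionIndex map[string]map[string]struct{}

// reposFor returns the DIDs with data in the collection, sorted for stable output.
func (ix collectionIndex) reposFor(collection string) []string {
	dids := make([]string, 0, len(ix[collection]))
	for did := range ix[collection] {
		dids = append(dids, did)
	}
	sort.Strings(dids)
	return dids
}

// writeJSON sends v as a JSON body with the given status code.
func writeJSON(w http.ResponseWriter, status int, v any) {
	w.Header().Set("Content-Type", "application/json")
	w.WriteHeader(status)
	_ = json.NewEncoder(w).Encode(v)
}

// handler serves GET /xrpc/com.atproto.sync.listReposByCollection.
// The "collection" parameter name is an assumption, not taken from the PR.
func (ix collectionIndex) handler(w http.ResponseWriter, r *http.Request) {
	if r.Method != http.MethodGet {
		writeJSON(w, http.StatusMethodNotAllowed, map[string]string{"error": "MethodNotAllowed"})
		return
	}
	collection := r.URL.Query().Get("collection")
	if collection == "" {
		writeJSON(w, http.StatusBadRequest, map[string]string{
			"error":   "InvalidRequest",
			"message": "collection is required",
		})
		return
	}
	// "dids" mirrors the json tag on the response struct in this PR.
	writeJSON(w, http.StatusOK, map[string][]string{"dids": ix.reposFor(collection)})
}

func main() {
	ix := collectionIndex{
		"app.bsky.feed.post": {"did:plc:example2": {}, "did:plc:example1": {}},
	}
	mux := http.NewServeMux()
	mux.HandleFunc("/xrpc/com.atproto.sync.listReposByCollection", ix.handler)

	// Exercise the route in-process instead of binding a port.
	req := httptest.NewRequest(http.MethodGet,
		"/xrpc/com.atproto.sync.listReposByCollection?collection=app.bsky.feed.post", nil)
	rec := httptest.NewRecorder()
	mux.ServeHTTP(rec, req)
	fmt.Println(rec.Code, rec.Body.String())
}
```

Registering the handler at the `/xrpc/<NSID>` path is all that is needed for clients to treat it as an XRPC query; no generated server code is required for a single endpoint.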

bnewbold · 2025-02-11T06:04:04Z

cmd/collectiondir/serve.go

+			EnvVars: []string{"COLLECTIONS_METRICS_LISTEN"},
+		},
+		&cli.StringFlag{
+			Name:     "pebble",


maybe database-dir? use of pebble seems like an internal/implementation detail you could hide.

would be good to have an env var for this also... and a default value?

working on scylla-relay got me in the habit of naming the back end database type used in the argument; it's not just the path to 'the database' but the path to 'the pebble database'

this leaves more room if we decide to add a PostgreSQL backend etc
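
For reference, a rough sketch of the "env var plus default value" behaviour being asked for, written against the standard `flag` package rather than the urfave/cli flags used in this PR. The env var name `COLLECTIONS_PEBBLE_DIR` and the default path are made up for illustration; only the `pebble` flag name comes from the diff.

```go
package main

import (
	"flag"
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key,
// or fallback when the variable is unset or empty.
func envOr(key, fallback string) string {
	if v, ok := os.LookupEnv(key); ok && v != "" {
		return v
	}
	return fallback
}

// parseDBDir resolves the database directory with the precedence
// command-line flag > environment variable > built-in default.
func parseDBDir(args []string) (string, error) {
	fs := flag.NewFlagSet("collectiondir", flag.ContinueOnError)
	dir := fs.String(
		"pebble",
		envOr("COLLECTIONS_PEBBLE_DIR", "./data/pebble"), // hypothetical names
		"path to the pebble database directory",
	)
	if err := fs.Parse(args); err != nil {
		return "", err
	}
	return *dir, nil
}

func main() {
	dir, err := parseDBDir(os.Args[1:])
	if err != nil {
		os.Exit(2)
	}
	fmt.Println("pebble dir:", dir)
}
```

With a default in place the flag no longer has to be marked as required, which makes local development runs shorter.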

bnewbold · 2025-02-11T06:04:48Z

cmd/collectiondir/serve.go

+			Required: true,
+		},
+		&cli.StringFlag{
+			Name:    "upstream",


we call this relay-host and ATP_RELAY_HOST in a few places. firehose-host or sync-host could also work? don't feel super strong about it, but I don't think we use "upstream" as an arg/variable for any other services.

bnewbold · 2025-02-11T06:05:57Z

cmd/collectiondir/serve.go

+		},
+		&cli.Float64Flag{
+			Name:  "crawl-qps",
+			Usage: "per-PDS crawl queries-per-second limit",


I couldn't tell what this meant at first. "pds-backfill-rate-limit"?

it's the limit on the number of queries per second we do to a PDS while crawling it
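
To make the meaning concrete, here is a small standard-library sketch of a per-host queries-per-second limit. It is not the limiter used in this PR; `hostLimiter` and its methods are hypothetical names, and the approach (spacing requests `1/qps` apart per host) is just one simple way to enforce such a limit.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// hostLimiter spaces out requests so that each host sees at most
// qps requests per second. Hosts are limited independently.
type hostLimiter struct {
	mu       sync.Mutex
	interval time.Duration        // minimum gap between requests to one host
	next     map[string]time.Time // earliest time the next request may go out
	clock    func() time.Time     // injectable for tests; defaults to time.Now
}

// newHostLimiter builds a limiter; a non-positive qps disables limiting.
func newHostLimiter(qps float64) *hostLimiter {
	var interval time.Duration
	if qps > 0 {
		interval = time.Duration(float64(time.Second) / qps)
	}
	return &hostLimiter{
		interval: interval,
		next:     make(map[string]time.Time),
		clock:    time.Now,
	}
}

// reserve books the next request slot for host and returns how long
// the caller should wait, measured from now, before sending it.
func (l *hostLimiter) reserve(host string, now time.Time) time.Duration {
	l.mu.Lock()
	defer l.mu.Unlock()
	at := l.next[host]
	if at.Before(now) {
		at = now
	}
	l.next[host] = at.Add(l.interval)
	return at.Sub(now)
}

// wait blocks until the caller may query host.
func (l *hostLimiter) wait(host string) {
	time.Sleep(l.reserve(host, l.clock()))
}

func main() {
	lim := newHostLimiter(5) // 5 queries per second per PDS
	now := lim.clock()
	for i := 0; i < 3; i++ {
		fmt.Println("pds-a wait:", lim.reserve("pds-a.example.com", now))
	}
	fmt.Println("pds-b wait:", lim.reserve("pds-b.example.com", now))
}
```

The point the flag name should convey is that the limit applies to each PDS separately, so crawling many hosts in parallel is not slowed by any single one.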

bnewbold · 2025-02-11T06:11:29Z

cmd/collectiondir/serve.go

+	cs.log = log
+	cs.ctx = cctx.Context
+	cs.AdminToken = cctx.String("admin-token")
+	cs.ExepctedAuthHeader = "Bearer " + cs.AdminToken


the way we usually do "admin auth" is HTTP Basic with the username admin and the token as the password.

https://atproto.com/specs/xrpc#authentication

there is a client-side code snippet here:
https://github.com/bluesky-social/indigo/blob/main/xrpc/xrpc.go#L188

this is copied from relay auth
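
A sketch of the HTTP Basic variant described above, using only the standard library: username "admin", with the admin token as the password. The function names and the request URL are hypothetical; the constant-time comparison is an extra precaution added here, not something taken from the PR.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
)

// checkAdminAuth reports whether the request carries HTTP Basic credentials
// with username "admin" and the admin token as the password.
// An empty configured token never authenticates.
func checkAdminAuth(r *http.Request, adminToken string) bool {
	user, pass, ok := r.BasicAuth()
	if !ok || adminToken == "" {
		return false
	}
	userOK := subtle.ConstantTimeCompare([]byte(user), []byte("admin")) == 1
	passOK := subtle.ConstantTimeCompare([]byte(pass), []byte(adminToken)) == 1
	return userOK && passOK
}

// newAdminRequest shows the client side: it builds a request that sets
// the Authorization header via SetBasicAuth. The URL is a placeholder.
func newAdminRequest(token string) *http.Request {
	req, err := http.NewRequest(http.MethodPost, "http://localhost:8080/admin/example", nil)
	if err != nil {
		panic(err) // the URL above is a constant, so this is not expected
	}
	req.SetBasicAuth("admin", token)
	return req
}

func main() {
	req := newAdminRequest("secret-token")
	fmt.Println(checkAdminAuth(req, "secret-token"))
	fmt.Println(checkAdminAuth(req, "some-other-token"))
}
```

Compared with a precomputed "Bearer " header string, this keeps the admin endpoints compatible with tooling that already sends admin credentials as Basic auth.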


bnewbold · 2025-02-11T06:53:00Z

cmd/collectiondir/collectiondir.go

@@ -0,0 +1,348 @@
+package main
+
+import (


I usually add _ "github.com/joho/godotenv/autoload" to imports to pick up .env in development

bnewbold · 2025-02-11T07:10:16Z

cmd/collectiondir/serve.go

+const statsCacheDuration = time.Second * 300
+
+type GetDidsForCollectionResponse struct {
+	Dids   []string `json:"dids"`


returning "just a list of strings" is the simple thing here. I think we should consider an array of objects instead, each with a "did" field. This is less efficient today, but we have found that we tend to want to add flags or other metadata later, and that is much easier if we start with arrays of objects.

I would maybe lean towards "repos" as the top level key? "dids" reads weird.

(we can re-review and tweak API schema details in the atproto lexicon PR instead of here)

"dids" is just a list of strings, "repos" invites further fields. could leave it as "dids" for now, and replace/extend it to "repos" objects later?

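To illustrate the array-of-objects shape under discussion, a small sketch follows. The type and function names (`RepoRef`, `ListReposResponse`, `encodeRepos`) are hypothetical, and the final schema is left to the atproto Lexicon PR; only the "repos" key and "did" field come from the comments above.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// RepoRef is one entry in the response. Today it only carries the DID,
// but further fields can be added later without breaking clients.
type RepoRef struct {
	Did string `json:"did"`
}

// ListReposResponse uses "repos" as the top-level key instead of "dids".
type ListReposResponse struct {
	Repos []RepoRef `json:"repos"`
}

// encodeRepos wraps plain DID strings into objects and marshals the response.
func encodeRepos(dids []string) ([]byte, error) {
	// A non-nil slice makes an empty result encode as [] rather than null.
	repos := make([]RepoRef, 0, len(dids))
	for _, did := range dids {
		repos = append(repos, RepoRef{Did: did})
	}
	return json.Marshal(ListReposResponse{Repos: repos})
}

func main() {
	out, err := encodeRepos([]string{"did:plc:example1", "did:plc:example2"})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(out))
}
```

The cost is a few extra bytes per entry; the benefit is that adding a field to `RepoRef` later is a backwards-compatible change, whereas changing `[]string` to objects is not.
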
bnewbold · 2025-02-11T07:15:09Z

The main things I looked at are the public APIs (the CLI "interface" and API endpoints), and a skim for concurrency stuff. I didn't dig in too deeply on the pebble schema yet.

brianolson added 2 commits February 3, 2025 22:08

collection directory service

69cdd71

lookup repos by collection (who has app.bsky.feed.post records?)
firehose consumer and crawl PDS by listRepos, describeRepo
daily-active-users collections

export command; dau logging; comments

e8dfa5e

brianolson requested review from bnewbold and devinivy February 5, 2025 15:46

brianolson added 3 commits February 7, 2025 15:34

limit dids to 1000 collections each

143d99c

list-collections adds badwords filter and hide below 5 users

better crawl stats endpoint

0e417f0

admin crawl command (start PDS crawls from text or csv list of hosts)

f98e654

dau fix

bnewbold reviewed Feb 11, 2025

PR feedback

f163377

bnewbold mentioned this pull request Feb 12, 2025

com.atproto.sync.listReposByCollection Lexicon, for collections directory bluesky-social/atproto#3524

Open
