Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Collection directory #929

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

brianolson
Copy link
Contributor

New service for indexing (did,collection) pairs.
The primary query is "which repos have any data for some collection?"
Also keeps statistics on how many repos contain data for any collection, and how many repos see traffic in a day on some collection. (collection daily-active-users across atproto firehose)

lookup repos by collection (who has app.bsky.feed.post records ?)
firehose consumer and crawl PDS by listRepos,describeRepo
daily-active-users collections
Copy link
Collaborator

@bnewbold bnewbold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty far along, left a bunch of comments.

The main thing is to get the primary endpoint over to /xrpc/com.atproto.sync.listReposByCollection. You can do that just by implementing it as an HTTP endpoint; we did that for the search service (IIRC). Also need to get that Lexicon endpoint written and reviewed in the atproto (typescript) repo. I can take that task if you want.

EnvVars: []string{"COLLECTIONS_METRICS_LISTEN"},
},
&cli.StringFlag{
Name: "pebble",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe database-dir? use of pebble seems like an internal/implementation detail you could hide.

would be good to have an env var for this also... and a default value?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on scylla-relay got me in the habit of naming the back end database type used in the argument; it's not just the path to 'the database' but the path to 'the pebble database'

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this leaves more room if we decide to add a PostgreSQL backend etc

Required: true,
},
&cli.StringFlag{
Name: "upstream",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we call this relay-host and ATP_RELAY_HOST in a few places. firehose-host or sync-host could also work? don't feel super strong about it, but I don't think we use "upstream" as an arg/variable for any other services.

},
&cli.Float64Flag{
Name: "crawl-qps",
Usage: "per-PDS crawl queries-per-second limit",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't tell what this meant at first. "pds-backfill-rate-limit"?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's the limit on the number of queries per second we do to a PDS while crawling it

cs.log = log
cs.ctx = cctx.Context
cs.AdminToken = cctx.String("admin-token")
cs.ExepctedAuthHeader = "Bearer " + cs.AdminToken
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the way we usually do "admin auth" is HTTP Basic with the username admin and the token as the password.

https://atproto.com/specs/xrpc#authentication

there is a client-side code snippet here:
https://github.com/bluesky-social/indigo/blob/main/xrpc/xrpc.go#L188

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is copied from relay auth

cmd/collectiondir/serve.go Outdated Show resolved Hide resolved
cmd/collectiondir/crawl.go Outdated Show resolved Hide resolved
cmd/collectiondir/pebble.go Show resolved Hide resolved
cmd/collectiondir/pebble.go Show resolved Hide resolved
@@ -0,0 +1,348 @@
package main

import (
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I usually add _ "github.com/joho/godotenv/autoload" to imports to pick up .env in development

const statsCacheDuration = time.Second * 300

type GetDidsForCollectionResponse struct {
Dids []string `json:"dids"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

returning "just a list of strings" is the simple thing here. I think we should consider doing an array of objects, each with the field did though. This is less efficient today, but we have found that we tend to want to add flags or other metadata in the future, and that is way easier if we do arrays-of-objects to start.

I would maybe lean towards "repos" as the top level key? "dids" reads weird.

(we can re-review and tweak API schema details in the atproto lexicon PR instead of here)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"dids" is just a list of strings, "repos" invites further fields. could leave it as "dids" for now, and replace/extend it to "repos" objects later?

@bnewbold
Copy link
Collaborator

The main things I looked at are the public APIs (the CLI "interface" and API endpoints), and a skim for concurrency stuff. I didn't dig in too deeply on the pebble schema yet.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants