-
Notifications
You must be signed in to change notification settings - Fork 152
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Collection directory #929
base: main
Are you sure you want to change the base?
Collection directory #929
Conversation
lookup repos by collection (who has app.bsky.feed.post records ?) firehose consumer and crawl PDS by listRepos,describeRepo daily-active-users collections
list-collections adds badwords filter and hide below 5 users
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty far along, left a bunch of comments.
The main thing is to get the primary endpoint over to /xrpc/com.atproto.sync.listReposByCollection
. You can do that just by implementing it as an HTTP endpoint; we did that for the search service (IIRC). Also need to get that Lexicon endpoint written and reviewed in the atproto
(typescript) repo. I can take that task if you want.
EnvVars: []string{"COLLECTIONS_METRICS_LISTEN"}, | ||
}, | ||
&cli.StringFlag{ | ||
Name: "pebble", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe database-dir
? use of pebble seems like an internal/implementation detail you could hide.
would be good to have an env var for this also... and a default value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
working on scylla-relay got me in the habit of naming the back end database type used in the argument; it's not just the path to 'the database' but the path to 'the pebble database'
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this leaves more room if we decide to add a PostgreSQL backend etc
Required: true, | ||
}, | ||
&cli.StringFlag{ | ||
Name: "upstream", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we call this relay-host
and ATP_RELAY_HOST
in a few places. firehose-host
or sync-host
could also work? don't feel super strong about it, but I don't think we use "upstream" as an arg/variable for any other services.
}, | ||
&cli.Float64Flag{ | ||
Name: "crawl-qps", | ||
Usage: "per-PDS crawl queries-per-second limit", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't tell what this meant at first. "pds-backfill-rate-limit"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's the limit on the number of queries per second we do to a PDS while crawling it
cs.log = log | ||
cs.ctx = cctx.Context | ||
cs.AdminToken = cctx.String("admin-token") | ||
cs.ExepctedAuthHeader = "Bearer " + cs.AdminToken |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the way we usually do "admin auth" is HTTP Basic with the username admin
and the token as the password.
https://atproto.com/specs/xrpc#authentication
there is a client-side code snippet here:
https://github.com/bluesky-social/indigo/blob/main/xrpc/xrpc.go#L188
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is copied from relay auth
@@ -0,0 +1,348 @@ | |||
package main | |||
|
|||
import ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I usually add _ "github.com/joho/godotenv/autoload"
to imports to pick up .env
in development
const statsCacheDuration = time.Second * 300 | ||
|
||
type GetDidsForCollectionResponse struct { | ||
Dids []string `json:"dids"` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
returning "just a list of strings" is the simple thing here. I think we should consider doing an array of objects, each with the field did
though. This is less efficient today, but we have found that we tend to want to add flags or other metadata in the future, and that is way easier if we do arrays-of-objects to start.
I would maybe lean towards "repos" as the top level key? "dids" reads weird.
(we can re-review and tweak API schema details in the atproto
lexicon PR instead of here)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"dids" is just a list of strings, "repos" invites further fields. could leave it as "dids" for now, and replace/extend it to "repos" objects later?
The main things I looked at are the public APIs (the CLI "interface" and API endpoints), and a skim for concurrency stuff. I didn't dig in too deeply on the pebble schema yet. |
New service for indexing (did,collection) pairs.
The primary query is "which repos have any data for some collection?"
Also keeps statistics on how many repos contain data for any collection, and how many repos see traffic in a day on some collection. (collection daily-active-users across atproto firehose)