Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single call to notification.useData results in 100+ queries to the database #27

Open
lauri865 opened this issue Nov 8, 2024 · 1 comment

Comments

@lauri865
Copy link

lauri865 commented Nov 8, 2024

We had a development environment with a hosted database, and we noticed how we somehow ploughed throught 1.5gb of egress per day per single developer only for this Cord instance.

Mind you, this database has not much data at all, mostly short testing messages:

  • 2 users
  • 13 threads
  • 77 messages
  • 24 thread participants

Seems like a single call to notification.useData from the frontend results in ~100 queries (30 queries of threads, even though the number of total threads is less) to the database, only to return 10 notifications back. On every single page load.

Now, I wonder if anyone else has witnessed this, or have we misconfigured something and missing some caching or anything else?

A single page load with an empty cord context (completely empty page, no re-renderings or anything), results in 7 queries of the below to the database:

SELECT
	"id",
	"name",
	"sharedSecret",
	"createdTimestamp",
	"customEmailTemplate",
	"enableEmailNotifications",
	"customLinks",
	"customS3Bucket",
	"segmentWriteKey",
	"customNUX",
	"iconURL",
	"type",
	"environment",
	"supportOrgID",
	"supportBotID",
	"supportSlackChannelID",
	"redirectURI",
	"customerID",
	"slackConnectAllOrgs",
	"eventWebhookURL",
	"eventWebhookSubscriptions",
	"customSlackAppID",
	"customSlackAppDetails"
FROM
	"cord"."applications" AS "ApplicationEntity"
WHERE
	"ApplicationEntity"."id" = 'REDACTED'
@jwatzman
Copy link

jwatzman commented Feb 5, 2025

Just saw this. Former Cord engineer here with a few thoughts.

Yeah, this sort of issue is one that Cord ran into from time to time. We'd typically fix it as problems presented themselves -- and since we were running on a pretty beefy AWS RDS instance which could take a lot of abuse before things got bad, and notification.useData was a relatively less-used API, it's likely we never got around to flushing out all the issues like this.

GraphQL, when naively used, causes issues like this (and the actual API that the Cord server speaks is largely GraphQL) -- each notif has an application field (or has an author field who has an application field or something like that) and so you load the same app O(n) times.

GraphQL has a standard solution to this, which is dataloader, which Cord uses. These dataloaders coalesce queries together so that the O(n) requests for data become a single DB query, with the results split back up to the calling code. This... mostly works, with a few problems that Cord consistently ran into:

  1. Our code is not consistent about actually using dataloaders. This is entirely Cord's fault, and we'd fix it where it mattered, and most places do use them consistently -- the one that comes to mind that I didn't get around to fixing is around orgs, which I don't think is the issue here, but could be.
  2. Our dataloaders (almost) all have caching disabled, since it's not clear when it's safe to drop the cache. There's a bunch of custom logic I built into the Apollo wrappers (even including a patch to the graphql library itself!) which tries to keep some notion of "request" going, so you can drop the loaders (and their caches) at the end of the current query -- but you also need to drop them within a query if something gets written which would affect that loader, so that read-after-write works correctly. This last bit was a pain to wire up and was only done for a handful of important loaders. So there isn't a lot of caching going on, but also things should still not exhibit the behaviour you're seeing even without caching.
  3. The dataloader only coalesces queries together within the same "tick" of the JS event loop. This is a bit hard to explain. But imagine you're loading O(n) notifications, and each notif has an author, and each author has an application. First tick, single query for notifications, single DB query, O(n) results. Second tick, each of those O(n) results asks for the author, O(n) author requests coalesced by user data loader into 1 DB query. Third tick, same for application. All is good. But -- what if there isn't a user data loader for some reason? (A few spots don't/can't have data loaders because the queries don't coalesce well -- it's difficult to stick "interesting threads at this location" together, for example, to return useful results for multiple locations. Can be done but a huge pain, especially when the "interesting" logic is in JS and not SQL.) Then, on that second tick, the O(n) pieces of work going on get "out of sync" since they need different amounts of information, and then when they all request their next thing, that doesn't get coalesced by the dataloader since it's in different ticks of the event loop. This is a huge pain and we were only starting to get a grasp on the problem when Cord shut down. There's a dataloader somewhere which doesn't actually coalesce queries, it just serves to keep things in lockstep! (I don't remember which one.) This is a more esoteric problem, but could be going on here.

Hopefully that is useful context on where to start looking at this. My guess would be the problem is (1), we're just not dataloader'ing something somewhere that we should be. But, it could be the more esoteric (3). And I mention (2) as background here, in case you see it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants