Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change order by clause, add multi-column index to speed up queries #172

Merged
merged 1 commit into from
Aug 2, 2021

Conversation

nick-jones-gov
Copy link
Contributor

Current state

There is a bunch of discussion around this here: #161

The tl;dr is that all endpoints are extremely slow right now. Based on the investigation above, I feel pretty confident that it's an issue with DB queries being extremely slow. According to New Relic, in the last week:

  • get /v1.1/domain/:domain/reports/:reportName/data averages 112 seconds response time
  • get /v1.1/agencies/:reportAgency/reports/:reportName/data averages 7 seconds response time
  • get /v1.1/reports/:reportName/data averages 4 seconds response time

The diagnosing of the source of the slowness is mostly documented in the issue linked to above. The ordering used in all queries now is:

order by "date" desc NULLS LAST,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc

The tl;dr is that this ordering isn't able to use any existing indexes, and there's no way (from what I could find) to add a multi-column index that includes specific JSON fields (i.e. the data->>'total_events' and data->>'visits'` fields).

Proposed changes

  1. Changes the order by clause to use id as a secondary sort key (after date)
  2. Adds a multi column index to support the ORDER BY date desc NULLS LAST, id clause

Note that changing the order by clause can alter the ordering of data from what happens today. In practice, it seems like the data order won't change, because data happens to be inserted into the database in decreasing order by visits or total_events.

I am open to other ideas of addressing these performance issues, but this seemed to be the most promising. In the longer term, I would suggest we consider moving data out of the generic data JSON column into specific columns for each field (i.e. total_events and visits should each have their own columns).

Note that this change is reversible; if for any reason we want to go back to using the previous ORDER BY clause, we can do that with no problem.

Copy link
Member

@mgwalker mgwalker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good and the comments make sense. As long as it's correct that rows are (coincidentally) inserted already ordered according to total_events and visits, this seems like it ought to dramatically simplify the database lookup. I also strongly agree with your longer-term solution to break those out into distinct fields.

@tdlowden tdlowden merged commit 72e6e49 into master Aug 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants