
Investigate slow endpoints / performance #161

Closed
nick-jones-gov opened this issue Jun 29, 2021 · 6 comments

@nick-jones-gov
Contributor

As part of looking into #122 and reviewing metrics in New Relic, it appears that many standard queries (i.e. ones that run as part of handling the 3 core endpoints) are much slower than they should be, even though seemingly appropriate indexes are in place.

The 502s reported in the issue linked above are likely just the worst-case scenarios for queries that are slow in general; so if we can figure out why those queries are slow, we'll likely also address the 502s issue.

@nick-jones-gov
Contributor Author

nick-jones-gov commented Jun 29, 2021

(Adding some notes from investigation I did the past few days)

I looked into the 502s specifically, pulling logs from cloud.gov - there doesn't seem to be any obvious pattern to which endpoints return 502s, suggesting that it's a widespread problem (and not specific to a report type or an agency).

While we don't have New Relic set up to track DB queries (because of an issue with NR and Knex, our query library - see #158), the app code for these endpoints is not very complicated, so it seems very likely that any performance issues are coming from the database. We can see what queries the app generates by running it locally with Knex debugging enabled:

export DEBUG=knex:query
npm start
# ... in another tab ...
curl localhost:4444/v1.1/domain/benefits.gov/reports/site/data
# ... back in the original tab, we see ...
knex:query select * from "analytics_data" where "report_name" = $1 and data->> 'domain' = $2 and true order by date desc NULLS LAST, CAST(data->>'total_events' AS INTEGER) desc, CAST(data->>'visits' AS INTEGER) desc limit $3 undefined +0ms
# ... some more irrelevant output ...

If we format it nicely, the query looks like:

select * from "analytics_data"
where "report_name" = $1
and data->> 'domain' = $2
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc 
limit $3 

and

$1 = 'site'
$2 = 'benefits.gov'
$3 = 1000

(note that this endpoint and query return nothing, because my local DB has no data in it)

@nick-jones-gov
Contributor Author

The DB's schema looks like this (the result of running the 3 migrations in the migrations/ directory):

\d analytics_data
                                        Table "public.analytics_data"
    Column     |           Type           | Collation | Nullable |                  Default
---------------+--------------------------+-----------+----------+--------------------------------------------
 id            | integer                  |           | not null | nextval('analytics_data_id_seq'::regclass)
 report_name   | character varying(255)   |           |          |
 report_agency | character varying(255)   |           |          |
 date          | date                     |           |          |
 data          | jsonb                    |           |          |
 created_at    | timestamp with time zone |           | not null | now()
 updated_at    | timestamp with time zone |           | not null | now()
Indexes:
    "analytics_data_pkey" PRIMARY KEY, btree (id)
    "analytics_data_date_time_desc" btree (date DESC NULLS LAST)
    "analytics_data_report_name_report_agency_index" btree (report_name, report_agency)

Notably, there is a multi-column index on (report_name, report_agency), and an index on date.
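For reference, the index definitions above correspond to DDL roughly like the following (a sketch reconstructed from the \d output, not copied from the actual migrations - names and syntax in the real migration files may differ):

```sql
-- Reconstructed from the \d output above, for illustration only.
CREATE INDEX analytics_data_date_time_desc
    ON analytics_data (date DESC NULLS LAST);

CREATE INDEX analytics_data_report_name_report_agency_index
    ON analytics_data (report_name, report_agency);
```

The DESC NULLS LAST in the first definition turns out to matter later.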

A few things stick out when looking back at the DB query from above (copied here):

select * from "analytics_data"
where "report_name" = $1
and data->> 'domain' = $2
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc 
limit $3 

  1. report_name is in the where clause and is covered by an index (good!)
  2. data->> 'domain' is in the where clause but has no index
  3. date is used in the order by, and has an index
  4. two fields within the data column (a JSONB column) are used in the order by as well, but have no index

@nick-jones-gov
Contributor Author

My initial thought was that the data->> 'domain' part of the where clause was the issue, because it has no index. I ran a few EXPLAIN (ANALYZE, BUFFERS) queries to see what Postgres is doing, and it turns out the issue is actually with the order by clause. Here's the ANALYZE output for the exact type of query used by the API:

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
where "report_name" = 'site'
and data->> 'domain' = 'benefits.gov'
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2677295.43..2677295.46 rows=10 width=185) (actual time=135377.280..135377.286 rows=10 loops=1)
   Buffers: shared hit=918 read=231127
   ->  Sort  (cost=2677295.43..2677376.67 rows=32496 width=185) (actual time=135377.278..135377.282 rows=10 loops=1)
         Sort Key: date DESC, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=918 read=231127
         ->  Bitmap Heap Scan on analytics_data  (cost=231105.31..2676593.21 rows=32496 width=185) (actual time=24760.030..135366.619 rows=3051 loops=1)
               Recheck Cond: ((report_name)::text = 'site'::text)
               Rows Removed by Index Recheck: 3253069
               Filter: ((data ->> 'domain'::text) = 'benefits.gov'::text)
               Rows Removed by Filter: 6289638
               Heap Blocks: exact=31732 lossy=166695
               Buffers: shared hit=918 read=231127
               ->  Bitmap Index Scan on analytics_data_report_name_report_agency_index  (cost=0.00..231097.19 rows=6499283 width=0) (actual time=24709.043..24709.043 rows=6320763 loops=1)
                     Index Cond: ((report_name)::text = 'site'::text)
                     Buffers: shared hit=1 read=33617
 Planning time: 0.143 ms
 Execution time: 135377.652 ms
(18 rows)

The main thing to note is that the outermost blocks deal with sorting, not with satisfying the WHERE clause. One might expect the date index to be used for that sorting, but it doesn't seem to be.

It turns out that the index on date (shown in the previous comment) is defined not just on the date column, but more specifically on (date DESC NULLS LAST). This stackoverflow question gets at the crux of the issue - for descending order, Postgres defaults to NULLS FIRST, so a plain order by "date" desc does not match an index defined with NULLS LAST, and Postgres won't use it. We can see that behavior with a simpler query - just fetching the 2 latest rows in the analytics_data table, i.e.

select * from "analytics_data"
order by "date" desc
limit 2;

The query plans/analyses for that query, and for the same one specifying NULLS LAST, are shown below - note that the index is only used when we specify NULLS LAST.

  1. just order by "date" desc
EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc
limit 2;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4065175.66..4065175.67 rows=2 width=177) (actual time=491505.347..491505.349 rows=2 loops=1)
   Buffers: shared hit=113945 read=2217318
   ->  Sort  (cost=4065175.66..4281914.74 rows=86695633 width=177) (actual time=491505.345..491505.347 rows=2 loops=1)
         Sort Key: date DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=113945 read=2217318
         ->  Seq Scan on analytics_data  (cost=0.00..3198219.33 rows=86695633 width=177) (actual time=0.748..464555.338 rows=86566753 loops=1)
               Buffers: shared hit=113945 read=2217318
 Planning time: 0.081 ms
 Execution time: 491505.375 ms
(10 rows)

Postgres is sorting the entire table here! That is, unsurprisingly, extremely slow and resource intensive: the query takes 491 seconds (a little over 8 minutes) just to fetch the two most recent rows (as defined by the date column) in the table.

  2. with NULLS LAST
EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc NULLS LAST
limit 2;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..0.68 rows=2 width=177) (actual time=80.347..80.350 rows=2 loops=1)
   Buffers: shared read=5
   ->  Index Scan using analytics_data_date_time_desc on analytics_data  (cost=0.57..4882568.86 rows=86695633 width=177) (actual time=80.345..80.347 rows=2 loops=1)
         Buffers: shared read=5
 Planning time: 0.085 ms
 Execution time: 80.376 ms
(6 rows)

With NULLS LAST, the query uses the index, and instead takes just 80ms.

This clearly helps a lot! But the query used by the API is more complicated, and just adding the NULLS LAST doesn't fix everything. Will explain more on that later today.

@nick-jones-gov
Contributor Author

If the only change we make to the query is adding NULLS LAST, the resulting query is still slow - my guess/interpretation of the query plan is that sorting on multiple fields is the problem.

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
where "report_name" = 'site'
and data->> 'domain' = 'benefits.gov'
and true
order by "date" desc NULLS LAST,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2670011.53..2670011.55 rows=10 width=184) (actual time=150771.126..150771.134 rows=10 loops=1)
   Buffers: shared read=231878
   ->  Sort  (cost=2670011.53..2670091.33 rows=31920 width=184) (actual time=150771.124..150771.128 rows=10 loops=1)
         Sort Key: date DESC NULLS LAST, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared read=231878
         ->  Bitmap Heap Scan on analytics_data  (cost=227125.27..2669321.75 rows=31920 width=184) (actual time=22811.950..150759.986 rows=3049 loops=1)
               Recheck Cond: ((report_name)::text = 'site'::text)
               Rows Removed by Index Recheck: 3251542
               Filter: ((data ->> 'domain'::text) = 'benefits.gov'::text)
               Rows Removed by Filter: 6284507
               Heap Blocks: exact=31603 lossy=166690
               Buffers: shared read=231878
               ->  Bitmap Index Scan on analytics_data_report_name_report_agency_index  (cost=0.00..227117.29 rows=6384096 width=0) (actual time=22752.132..22752.132 rows=6315629 loops=1)
                     Index Cond: ((report_name)::text = 'site'::text)
                     Buffers: shared read=33585
 Planning time: 0.125 ms
 Execution time: 150771.542 ms
(18 rows)

If we simplify the query a little bit by removing the WHERE clause altogether, it's still slow due to the sort on multiple fields - Postgres can't use the index on date:

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc NULLS LAST,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6397464.20..6397464.23 rows=10 width=185) (actual time=500519.300..500519.307 rows=10 loops=1)
   Buffers: shared hit=98232 read=2242361
   ->  Sort  (cost=6397464.20..6615065.50 rows=87040517 width=185) (actual time=500519.296..500519.298 rows=10 loops=1)
         Sort Key: date DESC NULLS LAST, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=98232 read=2242361
         ->  Seq Scan on analytics_data  (cost=0.00..4516549.92 rows=87040517 width=185) (actual time=0.670..448189.035 rows=86932072 loops=1)
               Buffers: shared hit=98226 read=2242361
 Planning time: 16.648 ms
 Execution time: 500519.938 ms
(10 rows)

To speed this up, I think the ORDER BY needs to be backed by a multi-column index on the same fields, in the same order - i.e. if we have ORDER BY x, y then there should be an index on (x, y).
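One hypothetical way to keep the current ORDER BY would be an expression index covering all three sort keys. This is only a sketch (the index name is made up, and I haven't tested it against this schema); note that the expressions, including the casts, have to match the query term-for-term for the planner to consider the index:

```sql
-- Hypothetical multi-column expression index matching the full ORDER BY.
-- The expressions (including casts) must match the query exactly.
CREATE INDEX analytics_data_order_by_idx
    ON analytics_data (
        date DESC NULLS LAST,
        (CAST(data->>'total_events' AS INTEGER)) DESC,
        (CAST(data->>'visits' AS INTEGER)) DESC
    );
```

Whether an index this wide (with two JSONB expression keys) is worth the write overhead is a separate question, which motivates the simpler option below.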

@nick-jones-gov
Contributor Author

One (relatively simple) option would be to change the second part of the ORDER BY, and add a new index. As a reminder, right now the ORDER BY is as follows:

order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc

It's important that we have some stable ordering, because users can paginate through results by specifying the page query param. If we just dropped the last two fields (data->>'total_events' and data->>'visits'), the query result might not be stable, since there would be no explicit tie-breaker when two rows have the same date.

Assuming that the ordering by total_events and visits is arbitrary (I don't see anything in the API docs suggesting that ordering matters), we could instead order by (date, id); rather than using total_events and visits as sorting tie-breakers, just use the id of the row. This would change the order of results returned today, but only for entries that share the same date.

To speed up that ordering, we'd add a new multi-column index on (date DESC NULLS LAST, id DESC). So in total, the changes would be: 1. a few-line change in the API code (to use id as the tie-breaker in sorting) and 2. a DB migration to add the new multi-column index.
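Sketched out, the proposal would look something like this (the index name is illustrative, not from an actual migration):

```sql
-- Proposed index backing the simplified ordering (name is illustrative).
CREATE INDEX analytics_data_date_desc_id_desc
    ON analytics_data (date DESC NULLS LAST, id DESC);

-- ...and the API query's ORDER BY would become:
-- ORDER BY date DESC NULLS LAST, id DESC
```

Because id is the primary key, (date, id) is guaranteed unique, so the ordering is fully stable for pagination.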

@tdlowden - do you have thoughts on the impact of changing the ordering of results returned to be arbitrary (beyond the date)? I can explore adding a different index that would keep the ordering the same as it works today, but I think that may get a bit messier because the total_events and visits fields are stored in a JSON column. Happy to discuss more if the explanation of user impact doesn't make sense!

@levinmr
Contributor

levinmr commented Aug 21, 2024

Fixed this for all domain queries except the downloads query by adding an index on the jsonb column. Added a new ticket for the domain download query, as it's a little different and needs a different implementation.
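The exact index isn't shown in the comment, but an expression index on the domain key of the jsonb column would look roughly like this (illustrative only - the actual migration may define it differently):

```sql
-- Illustrative expression index on the JSONB 'domain' key;
-- the actual fix may use a different definition.
CREATE INDEX analytics_data_data_domain_idx
    ON analytics_data ((data->>'domain'));
```

An index like this lets Postgres satisfy the data->>'domain' = $2 filter without the lossy bitmap heap scan seen in the plans above.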

@levinmr levinmr closed this as completed Aug 21, 2024