
Investigate slow endpoints / performance #161

Closed
nick-jones-gov opened this issue Jun 29, 2021 · 6 comments

@nick-jones-gov
Contributor

As part of looking into #122 and reviewing metrics in New Relic, it appears that many standard queries (i.e. ones that run as part of handling the 3 core endpoints) are much slower than they should be, even though seemingly appropriate indexes are in place.

The 502s reported in the issue linked above are likely just the worst-case scenarios for queries that are slow in general; so if we can figure out why those queries are slow, we'll likely also address the 502s issue.

@nick-jones-gov
Contributor Author

nick-jones-gov commented Jun 29, 2021

(Adding some notes from investigation I did the past few days)

I looked into the 502s specifically, pulling logs from cloud.gov - there doesn't seem to be any obvious pattern to which endpoints return 502s, suggesting that it's a widespread problem (and not specific to a report type or an agency).

While we don't have New Relic set up to track DB queries (because of an issue with NR and Knex, our query library - see #158), the app code for these endpoints is not very complicated, so it seems very likely that any performance issues are coming from the database. We can see what queries the app generates by running it locally with Knex debugging enabled:

export DEBUG=knex:query
npm start
# ... in another tab ...
curl localhost:4444/v1.1/domain/benefits.gov/reports/site/data
# ... back in the original tab, we see ...
knex:query select * from "analytics_data" where "report_name" = $1 and data->> 'domain' = $2 and true order by date desc NULLS LAST, CAST(data->>'total_events' AS INTEGER) desc, CAST(data->>'visits' AS INTEGER) desc limit $3 undefined +0ms
# ... some more irrelevant output ...

If we format it nicely, the query looks like:

select * from "analytics_data"
where "report_name" = $1
and data->> 'domain' = $2
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc 
limit $3 

and

$1 = 'site'
$2 = 'benefits.gov'
$3 = 1000

(note that this endpoint and query return nothing, because my local DB has no data in it)

@nick-jones-gov
Contributor Author

The DB's schema looks like this (the result of running the 3 migrations in the migrations/ directory):

\d analytics_data
                                        Table "public.analytics_data"
    Column     |           Type           | Collation | Nullable |                  Default
---------------+--------------------------+-----------+----------+--------------------------------------------
 id            | integer                  |           | not null | nextval('analytics_data_id_seq'::regclass)
 report_name   | character varying(255)   |           |          |
 report_agency | character varying(255)   |           |          |
 date          | date                     |           |          |
 data          | jsonb                    |           |          |
 created_at    | timestamp with time zone |           | not null | now()
 updated_at    | timestamp with time zone |           | not null | now()
Indexes:
    "analytics_data_pkey" PRIMARY KEY, btree (id)
    "analytics_data_date_time_desc" btree (date DESC NULLS LAST)
    "analytics_data_report_name_report_agency_index" btree (report_name, report_agency)

Notably, there is a multi-column index on (report_name, report_agency), and an index on date.
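For reference, the index definitions above correspond to DDL roughly like the following (a sketch reconstructed from the \d output, not copied from the actual migrations - names and syntax in the real migration files may differ):

```sql
-- Reconstructed from the \d output above, for illustration only.
CREATE INDEX analytics_data_date_time_desc
    ON analytics_data (date DESC NULLS LAST);

CREATE INDEX analytics_data_report_name_report_agency_index
    ON analytics_data (report_name, report_agency);
```

The DESC NULLS LAST in the first definition turns out to matter later.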

A few things stick out when looking back at the DB query from above (copied here):

select * from "analytics_data"
where "report_name" = $1
and data->> 'domain' = $2
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc 
limit $3 

  1. report_name is in the where clause and is covered by an index (good!)
  2. data->> 'domain' is in the where clause but has no index
  3. date is used in the order by, and has an index
  4. two fields within the data column (a JSONB column) are used in the order by as well, but have no index

@nick-jones-gov
Contributor Author

My initial thought was that the data->> 'domain' part of the where clause was the issue, because it has no index. I ran a few EXPLAIN (ANALYZE, BUFFERS) queries to see what Postgres is doing, and it turns out the issue is actually with the order by clause. Here's the ANALYZE output for the exact type of query used by the API:

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
where "report_name" = 'site'
and data->> 'domain' = 'benefits.gov'
and true
order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2677295.43..2677295.46 rows=10 width=185) (actual time=135377.280..135377.286 rows=10 loops=1)
   Buffers: shared hit=918 read=231127
   ->  Sort  (cost=2677295.43..2677376.67 rows=32496 width=185) (actual time=135377.278..135377.282 rows=10 loops=1)
         Sort Key: date DESC, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=918 read=231127
         ->  Bitmap Heap Scan on analytics_data  (cost=231105.31..2676593.21 rows=32496 width=185) (actual time=24760.030..135366.619 rows=3051 loops=1)
               Recheck Cond: ((report_name)::text = 'site'::text)
               Rows Removed by Index Recheck: 3253069
               Filter: ((data ->> 'domain'::text) = 'benefits.gov'::text)
               Rows Removed by Filter: 6289638
               Heap Blocks: exact=31732 lossy=166695
               Buffers: shared hit=918 read=231127
               ->  Bitmap Index Scan on analytics_data_report_name_report_agency_index  (cost=0.00..231097.19 rows=6499283 width=0) (actual time=24709.043..24709.043 rows=6320763 loops=1)
                     Index Cond: ((report_name)::text = 'site'::text)
                     Buffers: shared hit=1 read=33617
 Planning time: 0.143 ms
 Execution time: 135377.652 ms
(18 rows)

The main thing to note is that the outermost blocks deal with sorting, not with satisfying the WHERE clause. One might expect the date index to be used for that sorting, but it doesn't seem to be.

It turns out that the index on date (shown in the previous comment) is defined not just on the date column, but more specifically on (date DESC NULLS LAST). This stackoverflow question gets at the crux of the issue - for descending order, Postgres defaults to NULLS FIRST, so a plain order by "date" desc does not match an index defined with NULLS LAST, and Postgres won't use it. We can see that behavior with a simpler query - just fetching the 2 latest rows in the analytics_data table, i.e.

select * from "analytics_data"
order by "date" desc
limit 2;

The query plans/analyses for that query, and for the same one specifying NULLS LAST, are shown below - note that the index is only used when we specify NULLS LAST.

  1. just order by "date" desc
EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc
limit 2;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=4065175.66..4065175.67 rows=2 width=177) (actual time=491505.347..491505.349 rows=2 loops=1)
   Buffers: shared hit=113945 read=2217318
   ->  Sort  (cost=4065175.66..4281914.74 rows=86695633 width=177) (actual time=491505.345..491505.347 rows=2 loops=1)
         Sort Key: date DESC
         Sort Method: top-N heapsort  Memory: 25kB
         Buffers: shared hit=113945 read=2217318
         ->  Seq Scan on analytics_data  (cost=0.00..3198219.33 rows=86695633 width=177) (actual time=0.748..464555.338 rows=86566753 loops=1)
               Buffers: shared hit=113945 read=2217318
 Planning time: 0.081 ms
 Execution time: 491505.375 ms
(10 rows)

Postgres is sorting the entire table here! That is, unsurprisingly, extremely slow and resource intensive: the query takes 491 seconds (a little over 8 minutes) just to fetch the two most recent rows (as defined by the date column) in the table.

  2. with NULLS LAST
EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc NULLS LAST
limit 2;
                                                                             QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.57..0.68 rows=2 width=177) (actual time=80.347..80.350 rows=2 loops=1)
   Buffers: shared read=5
   ->  Index Scan using analytics_data_date_time_desc on analytics_data  (cost=0.57..4882568.86 rows=86695633 width=177) (actual time=80.345..80.347 rows=2 loops=1)
         Buffers: shared read=5
 Planning time: 0.085 ms
 Execution time: 80.376 ms
(6 rows)

With NULLS LAST, the query uses the index, and instead takes just 80ms.

This clearly helps a lot! But the query used by the API is more complicated, and just adding the NULLS LAST doesn't fix everything. Will explain more on that later today.

@nick-jones-gov
Contributor Author

If the only change we make to the query is adding NULLS LAST, the resulting query is still slow - my guess/interpretation of the query plan is that sorting on multiple fields is the problem.

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
where "report_name" = 'site'
and data->> 'domain' = 'benefits.gov'
and true
order by "date" desc NULLS LAST,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                                         QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=2670011.53..2670011.55 rows=10 width=184) (actual time=150771.126..150771.134 rows=10 loops=1)
   Buffers: shared read=231878
   ->  Sort  (cost=2670011.53..2670091.33 rows=31920 width=184) (actual time=150771.124..150771.128 rows=10 loops=1)
         Sort Key: date DESC NULLS LAST, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared read=231878
         ->  Bitmap Heap Scan on analytics_data  (cost=227125.27..2669321.75 rows=31920 width=184) (actual time=22811.950..150759.986 rows=3049 loops=1)
               Recheck Cond: ((report_name)::text = 'site'::text)
               Rows Removed by Index Recheck: 3251542
               Filter: ((data ->> 'domain'::text) = 'benefits.gov'::text)
               Rows Removed by Filter: 6284507
               Heap Blocks: exact=31603 lossy=166690
               Buffers: shared read=231878
               ->  Bitmap Index Scan on analytics_data_report_name_report_agency_index  (cost=0.00..227117.29 rows=6384096 width=0) (actual time=22752.132..22752.132 rows=6315629 loops=1)
                     Index Cond: ((report_name)::text = 'site'::text)
                     Buffers: shared read=33585
 Planning time: 0.125 ms
 Execution time: 150771.542 ms
(18 rows)

If we simplify the query a little bit by removing the WHERE clause altogether, it's still slow due to the sort on multiple fields - Postgres can't use the index on date:

EXPLAIN (ANALYZE, BUFFERS)
select * from "analytics_data"
order by "date" desc NULLS LAST,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc
limit 10;
                                                                  QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=6397464.20..6397464.23 rows=10 width=185) (actual time=500519.300..500519.307 rows=10 loops=1)
   Buffers: shared hit=98232 read=2242361
   ->  Sort  (cost=6397464.20..6615065.50 rows=87040517 width=185) (actual time=500519.296..500519.298 rows=10 loops=1)
         Sort Key: date DESC NULLS LAST, (((data ->> 'total_events'::text))::integer) DESC, (((data ->> 'visits'::text))::integer) DESC
         Sort Method: top-N heapsort  Memory: 26kB
         Buffers: shared hit=98232 read=2242361
         ->  Seq Scan on analytics_data  (cost=0.00..4516549.92 rows=87040517 width=185) (actual time=0.670..448189.035 rows=86932072 loops=1)
               Buffers: shared hit=98226 read=2242361
 Planning time: 16.648 ms
 Execution time: 500519.938 ms
(10 rows)

To speed this up, I think the ORDER BY needs to be backed by a multi-column index on the same fields, in the same order - i.e. if we have ORDER BY x, y then there should be an index on (x, y).
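One hypothetical way to keep the current ORDER BY would be an expression index covering all three sort keys. This is only a sketch (the index name is made up, and I haven't tested it against this schema); note that the expressions, including the casts, have to match the query term-for-term for the planner to consider the index:

```sql
-- Hypothetical multi-column expression index matching the full ORDER BY.
-- The expressions (including casts) must match the query exactly.
CREATE INDEX analytics_data_order_by_idx
    ON analytics_data (
        date DESC NULLS LAST,
        (CAST(data->>'total_events' AS INTEGER)) DESC,
        (CAST(data->>'visits' AS INTEGER)) DESC
    );
```

Whether an index this wide (with two JSONB expression keys) is worth the write overhead is a separate question, which motivates the simpler option below.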

@nick-jones-gov
Contributor Author

One (relatively simple) option would be to change the second part of the ORDER BY, and add a new index. As a reminder, right now the ORDER BY is as follows:

order by "date" desc,
CAST(data->>'total_events' AS INTEGER) desc,
CAST(data->>'visits' AS INTEGER) desc

It's important that we have some stable ordering, because users can paginate through results by specifying the page query param. If we just dropped the last two fields (data->>'total_events' and data->>'visits'), the query result might not be stable, since there would be no explicit tie-breaker when two rows have the same date.

Assuming that the ordering by total_events and visits is arbitrary (I don't see anything in the API docs suggesting that ordering matters), we could instead order by (date, id); rather than using total_events and visits as sorting tie-breakers, just use the id of the row. This would change the order of results returned today, but only for entries that share the same date.

To speed up that ordering, we'd add a new multi-column index on (date DESC NULLS LAST, id DESC). So in total, the changes would be: 1. a few-line change in the API code (to use id as the tie-breaker in sorting) and 2. a DB migration to add the new multi-column index.
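Sketched out, the proposal would look something like this (the index name is illustrative, not from an actual migration):

```sql
-- Proposed index backing the simplified ordering (name is illustrative).
CREATE INDEX analytics_data_date_desc_id_desc
    ON analytics_data (date DESC NULLS LAST, id DESC);

-- ...and the API query's ORDER BY would become:
-- ORDER BY date DESC NULLS LAST, id DESC
```

Because id is the primary key, (date, id) is guaranteed unique, so the ordering is fully stable for pagination.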

@tdlowden - do you have thoughts on the impact of changing the ordering of results returned to be arbitrary (beyond the date)? I can explore adding a different index that would keep the ordering the same as it works today, but I think that may get a bit messier because the total_events and visits fields are stored in a JSON column. Happy to discuss more if the explanation of user impact doesn't make sense!

@levinmr
Contributor

levinmr commented Aug 21, 2024

Fixed this for all domain queries except the downloads query by adding an index on the jsonb column. Added a new ticket for the domain download query, as it's a little different and needs a different implementation.
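The exact index isn't shown in the comment, but an expression index on the domain key of the jsonb column would look roughly like this (illustrative only - the actual migration may define it differently):

```sql
-- Illustrative expression index on the JSONB 'domain' key;
-- the actual fix may use a different definition.
CREATE INDEX analytics_data_data_domain_idx
    ON analytics_data ((data->>'domain'));
```

An index like this lets Postgres satisfy the data->>'domain' = $2 filter without the lossy bitmap heap scan seen in the plans above.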

@levinmr levinmr closed this as completed Aug 21, 2024