Disconnection of pageserver causes assertion failure in endpoint #5734

Open

alexanderlaw opened this issue Oct 31, 2023 · 5 comments · May be fixed by #10666
Labels: c/compute (Component: compute, excluding postgres itself), c/PostgreSQL (Component: PostgreSQL features and bugs), m/good_first_issue (Moment: when doing your first Neon contributions), t/bug (Issue Type: Bug)

Comments

@alexanderlaw (Contributor)

Steps to reproduce

cargo neon init --force
cargo neon start
cargo neon tenant create --pg-version 16 --set-default
cargo neon endpoint create --pg-version 16 main

echo "shared_buffers = 1024MB" >> ".neon/endpoints/main/postgresql.conf"
cargo neon endpoint start main

createdb test
( sleep 1; cargo neon pageserver stop ) &
echo "
select pg_sleep(3);

create table t(b box);
create index t_idx on t using gist(b);
" | psql test

Result

CREATE TABLE
ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
ERROR:  [NEON_SMGR] could not establish connection to pageserver
DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
WARNING:  AbortTransaction while in ABORT state
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
connection to server was lost

Logs, links

.neon/endpoints/main/compute.log contains:

...
2023-10-31 07:49:43.264 GMT [459970] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:49:43.264 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:49:43.264 GMT [459970] STATEMENT:  create index t_idx on t using gist(b);
...
2023-10-31 07:50:42.377 GMT [468868] LOG:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:42.377 GMT [468868] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.307 GMT [459970] ERROR:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:43.307 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.307 GMT [459970] STATEMENT:  create index t_idx on t using gist(b);
2023-10-31 07:50:43.308 GMT [459970] ERROR:  [NEON_SMGR] could not establish connection to pageserver
2023-10-31 07:50:43.308 GMT [459970] DETAIL:  connection to server at "127.0.0.1", port 64000 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?
2023-10-31 07:50:43.308 GMT [459970] WARNING:  AbortTransaction while in ABORT state
TRAP: failed Assert("TransactionIdIsValid(proc->xid)"), File: ".../neon//vendor/postgres-v16/src/backend/storage/ipc/procarray.c", Line: 680, PID: 459970
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(__interceptor_backtrace+0x5e)[0x55e5c9cab08e]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(ExceptionalCondition+0x1c9)[0x55e5cce73dd9]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(ProcArrayEndTransaction+0xf0)[0x55e5cc17e910]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x2208aa6)[0x55e5ca349aa6]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(AbortCurrentTransaction+0x30f)[0x55e5ca35174f]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(PostgresMain+0xe57)[0x55e5cc2d8e67]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c269c1)[0x55e5cbd679c1]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c1c840)[0x55e5cbd5d840]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(+0x3c1619a)[0x55e5cbd5719a]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(PostmasterMain+0x3b2e)[0x55e5cbd53dde]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(main+0xd1e)[0x55e5cb3aacce]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7f41a15aed90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7f41a15aee40]
postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX(_start+0x25)[0x55e5c9c65615]
...
2023-10-31 07:50:56.860 GMT [459679] LOG:  server process (PID 459970) was terminated by signal 6: Aborted
2023-10-31 07:50:56.860 GMT [459679] DETAIL:  Failed process was running: create index t_idx on t using gist(b);
...

2023-10-31T07:50:57.204194Z  INFO checking for core dumps in .neon/endpoints/main/pgdata
2023-10-31T07:50:57.204321Z  WARN core dump found: .neon/endpoints/main/pgdata/core.459970, collecting backtrace
2023-10-31T07:50:58.038844Z  WARN core dump backtrace: [New LWP 459970]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: cloud_admin test 127.0.0.1(49600) CREATE INDEX            '.
Program terminated with signal SIGABRT, Aborted.
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:44
#0  __pthread_kill_implementation (no_tid=0, signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:44
#1  __pthread_kill_internal (signo=6, threadid=139919872374720) at ./nptl/pthread_kill.c:78
#2  __GI___pthread_kill (threadid=139919872374720, signo=signo@entry=6) at ./nptl/pthread_kill.c:89
#3  0x00007f41a15c7476 in __GI_raise (sig=sig@entry=6) at ../sysdeps/posix/raise.c:26
#4  0x00007f41a15ad7f3 in __GI_abort () at ./stdlib/abort.c:79
#5  0x000055e5cce73e5b in ExceptionalCondition (conditionName=0x55e5cd52cda0 <str> "TransactionIdIsValid(proc->xid)", fileName=0x55e5cd52bea7 ".../neon//vendor/postgres-v16/src/backend/storage/ipc/procarray.c", lineNumber=680) at .../neon/vendor/postgres-v16/src/backend/utils/error/assert.c:66
#6  0x000055e5cc17e910 in ProcArrayEndTransaction (proc=0x7f419c7dd1f0, latestXid=1026) at .../neon/vendor/postgres-v16/src/backend/storage/ipc/procarray.c:680
#7  0x000055e5ca349aa6 in AbortTransaction () at .../neon/vendor/postgres-v16/src/backend/access/transam/xact.c:2834
#8  0x000055e5ca35174f in AbortCurrentTransaction () at .../neon/vendor/postgres-v16/src/backend/access/transam/xact.c:3338
#9  0x000055e5cc2d8e67 in PostgresMain (dbname=0x52900001b378 "test", username=0x52900001b358 "cloud_admin") at .../neon/vendor/postgres-v16/src/backend/tcop/postgres.c:4364
#10 0x000055e5cbd679c1 in BackendRun (port=0x515000003500) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:4464
#11 0x000055e5cbd5d840 in BackendStartup (port=0x515000003500) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:4192
#12 0x000055e5cbd5719a in ServerLoop () at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:1782
#13 0x000055e5cbd53dde in PostmasterMain (argc=3, argv=0x503000000a60) at .../neon/vendor/postgres-v16/src/backend/postmaster/postmaster.c:1466
#14 0x000055e5cb3aacce in main (argc=3, argv=0x503000000a60) at .../neon/vendor/postgres-v16/src/backend/main/main.c:238

With backtrace_functions = 'pageserver_connect' I see the following call stacks:
For the first error: ... index_build -> index_update_stats -> RelationGetNumberOfBlocksInFork -> smgrnblocks -> neon_nblocks -> page_server_request -> pageserver_send -> pageserver_connect.
For the second error: ... AbortCurrentTransaction -> AbortTransaction -> smgrDoPendingDeletes -> smgrdounlinkall -> DropRelationsAllBuffers -> smgrexists -> neon_exists -> page_server_request -> pageserver_send -> pageserver_connect.
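
For reference, a minimal sketch of how those call stacks can be captured with the repro above (backtrace_functions is a standard PostgreSQL setting that logs a backtrace for errors raised in the named C functions; the endpoint path and restart commands follow the steps earlier in this issue and are otherwise an assumption):

# Log a backtrace whenever an error is raised inside pageserver_connect.
echo "backtrace_functions = 'pageserver_connect'" >> ".neon/endpoints/main/postgresql.conf"

# Restart the endpoint so the setting takes effect, then re-run the repro;
# the backtraces are written to .neon/endpoints/main/compute.log.
cargo neon endpoint stop main
cargo neon endpoint start main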

@alexanderlaw added the t/bug (Issue Type: Bug) label on Oct 31, 2023
@kelvich (Contributor)

kelvich commented Oct 31, 2023

thank you @alexanderlaw! cc @save-buffer

@hlinnaka (Contributor)

Here's the repro as a python test:

diff --git a/test_runner/regress/test_pageserver_restart.py b/test_runner/regress/test_pageserver_restart.py
index 4ce53df21..4f2882335 100644
--- a/test_runner/regress/test_pageserver_restart.py
+++ b/test_runner/regress/test_pageserver_restart.py
@@ -218,3 +218,23 @@ def test_pageserver_chaos(
         # Check that all the updates are visible
         num_updates = endpoint.safe_psql("SELECT sum(updates) FROM foo")[0][0]
         assert num_updates == i * 100000
+
+# Aborting out of a transaction that has created new relations causes
+# a PANIC, if the pageserver cannot be reached. (And if the pageserver cannot
+# be reached, that causes the creation to abort in the first place.)
+#
+# repro for https://github.com/neondatabase/neon/issues/5734
+def test_pageserver_repro_5734(neon_env_builder: NeonEnvBuilder):
+    env = neon_env_builder.init_start()
+
+    endpoint = env.endpoints.create_start("main", config_lines=["shared_buffers='1GB'"])
+
+    with closing(endpoint.connect()) as conn:
+        with conn.cursor() as cur:
+            cur.execute("CREATE DATABASE test")
+
+    with closing(endpoint.connect(dbname="test")) as conn:
+        with conn.cursor() as cur:
+            env.pageserver.stop()
+            cur.execute("create table t(b box);")
+            cur.execute("create index t_idx on t using gist(b);")

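To run the new test locally, something like this should work, assuming the repository's usual pytest wrapper and standard pytest -k filtering (the exact invocation is an assumption, not from this thread):

./scripts/pytest test_runner/regress/test_pageserver_restart.py -k test_pageserver_repro_5734

The test stops the pageserver before issuing CREATE TABLE and CREATE INDEX, mirroring the shell repro above.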
@hlinnaka (Contributor)

When the pageserver cannot be reached, the CREATE INDEX fails while trying to update relpages/reltuples after the index has otherwise already been built. The error causes the transaction to abort. During abort processing, the backend checks whether the relation exists on disk:

#0  errstart (elevel=21, domain=0x0) at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/utils/error/elog.c:364
#1  0x00007ff20b1f332c in pageserver_connect (shard_no=0, elevel=21) at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:458
#2  0x00007ff20b1f40ec in pageserver_send (shard_no=0, request=0x7fff7139ba30)
    at /home/heikki/git-sandbox/neon//pgxn/neon/libpagestore.c:746
#3  0x00007ff20b1faee2 in page_server_request (req=0x7fff7139ba30) at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:972
#4  0x00007ff20b1fca63 in neon_exists (reln=0x561c1136fab8, forkNum=MAIN_FORKNUM)
    at /home/heikki/git-sandbox/neon//pgxn/neon/pagestore_smgr.c:1940
#5  0x0000561c0fa71ab8 in smgrexists (reln=0x561c1136fab8, forknum=MAIN_FORKNUM)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:262
#6  0x0000561c0fa29893 in DropRelationsAllBuffers (smgr_reln=0x561c11349ba8, nlocators=1)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/buffer/bufmgr.c:3821
#7  0x0000561c0fa71e53 in smgrdounlinkall (rels=0x561c11349ba8, nrels=1, isRedo=false)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/storage/smgr/smgr.c:445
#8  0x0000561c0f6722ce in smgrDoPendingDeletes (isCommit=false)
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/catalog/storage.c:707
#9  0x0000561c0f5ef884 in AbortTransaction () at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:2861
#10 0x0000561c0f5f0196 in AbortCurrentTransaction ()
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/access/transam/xact.c:3339
#11 0x0000561c0fa7a1eb in PostgresMain (dbname=0x561c112f24c8 "test", username=0x561c112f24a8 "cloud_admin")
    at /home/heikki/git-sandbox/neon//vendor/postgres-v16/src/backend/tcop/postgres.c:4364

That also fails because the pageserver cannot be reached. The error during abort processing causes the assertion failure. In a release build with assertions disabled, it is only a warning:

PG:2024-06-14 18:56:16.851 GMT [1023354] WARNING:  AbortTransaction while in ABORT state

This could in theory happen with vanilla Postgres too, if a disk failure caused the check for whether the just-created file exists to fail. That's highly unlikely with a local disk, though.

This would be nice to fix somehow, but it's pretty low priority.

@hlinnaka added the c/PostgreSQL (Component: PostgreSQL features and bugs) and c/cloud/compute labels on Jun 14, 2024
@jcsp added the c/compute (Component: compute, excluding postgres itself) label and removed the c/cloud/compute label on Jun 19, 2024
@ololobus added the m/good_first_issue (Moment: when doing your first Neon contributions) label on Jan 7, 2025
@ololobus (Member)

ololobus commented Jan 28, 2025

This week:

  • @myrrc plans to investigate this week

@ololobus (Member)

ololobus commented Feb 4, 2025

This week:

  • @myrrc managed to repro, going to open a PR with fix + test
