-
Notifications
You must be signed in to change notification settings - Fork 488
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(neon): Add tests clarifying postgres sigabrt on pageserver unavailability #10666
base: main
Are you sure you want to change the base?
Conversation
7458 tests run: 7077 passed, 0 failed, 381 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
2918c89 at 2025-02-12T01:24:53.615Z :recycle: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that we should not only track the transaction abort but also transaction commit.
The following test illustrates the problem:
def test_pageserver_stop_and_commit_transaction(neon_env_builder: NeonEnvBuilder):
"""
If pageserver is unavailable during a transaction, the transaction is aborted.
If relation that's part of transaction is not found in cache, we query pageserver during
transaction abort to delete it which in turn triggers a sigabrt.
"""
env = neon_env_builder.init_start()
endpoint = env.endpoints.create_start("main", config_lines=["neon.relsize_hash_size=0"])
with closing(endpoint.connect()) as conn, conn.cursor() as cur:
cur.execute("CREATE DATABASE test")
with endpoint.connect(dbname="test") as conn, conn.cursor() as cur:
cur.execute("create table t(b box)")
try:
with endpoint.connect(dbname="test") as conn, conn.cursor() as cur:
cur.execute("drop table t")
env.pageserver.stop()
except pgerr.InternalError:
pass # pageserver is unavailable
except pgerr.InterfaceError as err:
raise AssertionError("Postgres aborted") from err
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe pretending that the relation doesn't exist during abort is wrong. This happens when abort processing calls DropRelationsAllBuffers() for relations that were created in the same transaction. If we pretend that the relation doesn't exist, DropRelationsAllBuffers() will not drop the buffers belonging to the relation.
I'm not sure if 'true' is safe either; I could imagine that going wrong elsewhere.
Overall, I feel that this is not worth the effort. I believe we have pretty long timeouts if the pageserver connection is lost, before it becomes an ERROR. And if the pageserver is unreachable, the compute is pretty much stuck anyway, and PANIC is not much worse than merely rolling back the transaction. It might even be better to PANIC. PANIC causes it to restart, which just might unwedge the compute <-> pageserver connection.
I suspect that there is wring approach to solving the problem. This PR propose workaround: do not send request to PS the. we transaction is in abort state. The problem is that operation which can not fail inVanilla Postgres (like smgrexists), can actually fail in Neon. What we can do?
|
Option 3: Ensure that the smgrexists() calls that happen in this codepath are always cached. We have a relsize cache, and IIRC we use it for smgrexists() calls too. When you create a new relation, we could ensure that the cache contains all the forks of the new relation, including negative cache entries for the forks that don't exist. That way, the smgrexists() calls at end-of-transaction would never need to make a request to the pageserver. Not sure it's worth the trouble to do that just for this issue, but might be worth exploring a little. |
adde5ac
to
6d99bd4
Compare
After internal discussion we decided to stick with (2) |
d28f94e
to
e070ff6
Compare
e070ff6
to
90928cd
Compare
Resolves: #5734
When we query pageserver and it's unavailable after some retries,
postgres sigabrt's. This is intended behavior so I've added tests checking it