Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(neon): Add tests clarifying postgres sigabrt on pageserver unavailability #10666

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

myrrc
Copy link
Contributor

@myrrc myrrc commented Feb 5, 2025

Resolves: #5734

When we query pageserver and it's unavailable after some retries,
postgres sigabrt's. This is intended behavior so I've added tests checking it

@myrrc myrrc requested review from a team as code owners February 5, 2025 12:02
Copy link

github-actions bot commented Feb 5, 2025

7458 tests run: 7077 passed, 0 failed, 381 skipped (full report)


Code coverage* (full report)

  • functions: 33.2% (8583 of 25872 functions)
  • lines: 49.1% (72292 of 147344 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2918c89 at 2025-02-12T01:24:53.615Z :recycle:

Copy link
Contributor

@knizhnik knizhnik left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that we should not only track the transaction abort but also transaction commit.
The following test illustrates the problem:

def test_pageserver_stop_and_commit_transaction(neon_env_builder: NeonEnvBuilder):
    """
    If pageserver is unavailable during a transaction, the transaction is aborted.
    If relation that's part of transaction is not found in cache, we query pageserver during
    transaction abort to delete it which in turn triggers a sigabrt.
    """
    env = neon_env_builder.init_start()
    endpoint = env.endpoints.create_start("main", config_lines=["neon.relsize_hash_size=0"])
    with closing(endpoint.connect()) as conn, conn.cursor() as cur:
        cur.execute("CREATE DATABASE test")
    with endpoint.connect(dbname="test") as conn, conn.cursor() as cur:
        cur.execute("create table t(b box)")
    try:
        with endpoint.connect(dbname="test") as conn, conn.cursor() as cur:
            cur.execute("drop table t")
            env.pageserver.stop()
    except pgerr.InternalError:
        pass  # pageserver is unavailable
    except pgerr.InterfaceError as err:
        raise AssertionError("Postgres aborted") from err

Copy link
Contributor

@hlinnaka hlinnaka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe pretending that the relation doesn't exist during abort is wrong. This happens when abort processing calls DropRelationsAllBuffers() for relations that were created in the same transaction. If we pretend that the relation doesn't exist, DropRelationsAllBuffers() will not drop the buffers belonging to the relation.

I'm not sure if 'true' is safe either; I could imagine that going wrong elsewhere.

Overall, I feel that this is not worth the effort. I believe we have pretty long timeouts if the pageserver connection is lost, before it becomes an ERROR. And if the pageserver is unreachable, the compute is pretty much stuck anyway, and PANIC is not much worse than merely rolling back the transaction. It might even be better to PANIC. PANIC causes it to restart, which just might unwedge the compute <-> pageserver connection.

pgxn/neon/pagestore_smgr.c Outdated Show resolved Hide resolved
@knizhnik
Copy link
Contributor

I suspect that there is wring approach to solving the problem.
So what is the actual problem? During transaction commit or abort we have to perform some actions which need to access PS. And PS may be not available at this moment and after some attempts to reconnect we give-up and throw error.
And it cause panic because we are try to abort already aborted or committed transaction.

This PR propose workaround: do not send request to PS the. we transaction is in abort state.
But it seems to be
a) wrong - we have not complete some required actions in this case
b) not enough - the similar problem with commit

The problem is that operation which can not fail inVanilla Postgres (like smgrexists), can actually fail in Neon.

What we can do?

  1. Never give-up:) Or at least if transaction is not in progress. What PS forever until it is restarted. I wonder if we can do it always: no returning error after N attempts to reconnect. Yes, request and client application can be blocked for longer time in this case. But even now it will wait more than several minutes. It seems to be too much for most applications. So not limiting wait time unlikely change something.
  2. Report fatal error and exist. This is what actually happen now. It is assumed to happen very rarely (we had to disabler resize cache to be able to reproduce the problem). And I do not see something criminal in abnormal Postgres termination if it can not cause data corruption.

@hlinnaka
Copy link
Contributor

Option 3: Ensure that the smgrexists() calls that happen in this codepath are always cached.

We have a relsize cache, and IIRC we use it for smgrexists() calls too. When you create a new relation, we could ensure that the cache contains all the forks of the new relation, including negative cache entries for the forks that don't exist. That way, the smgrexists() calls at end-of-transaction would never need to make a request to the pageserver.

Not sure it's worth the trouble to do that just for this issue, but might be worth exploring a little.

@myrrc myrrc force-pushed the myrrc/5734-abort-transaction branch from adde5ac to 6d99bd4 Compare February 11, 2025 16:50
@myrrc myrrc changed the title fix(neon): transaction double abort on pageserver missing fix(neon): Add tests clarifying postgres sigabrt on pageserver unavailability Feb 11, 2025
@myrrc
Copy link
Contributor Author

myrrc commented Feb 11, 2025

After internal discussion we decided to stick with (2)

@myrrc myrrc requested review from knizhnik and hlinnaka February 11, 2025 17:24
@myrrc myrrc force-pushed the myrrc/5734-abort-transaction branch from d28f94e to e070ff6 Compare February 11, 2025 18:38
@myrrc myrrc force-pushed the myrrc/5734-abort-transaction branch from e070ff6 to 90928cd Compare February 11, 2025 19:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Disconnection of pageserver causes assertion failure in endpoint
3 participants