
Connecting to the database blocks the main thread #19115

@MadLittleMods

Description

Problem

If Synapse starts while the PostgreSQL database is unavailable, the process can go unresponsive (Ctrl + C fails to send a SIGINT and kill the process) until the connection attempt times out after 2+ minutes. This happens to me locally whenever my Postgres database isn't up and running yet. It's probably a bad interaction with the fact that I have IPv6 disabled (see details below).

The psycopg2.connect(...) call blocks the main thread until it connects or times out.

This problem is more relevant to Synapse Pro for small hosts (multiple instances of Synapse in the same Python process) as we don't want to block the main thread for 2+ minutes waiting for the timeout (which would block other homeserver tenants running in the same Python process).

homeserver.log

```
synapse.config.logger - 377 - WARNING - main - ***** STARTING SERVER *****
synapse.config.logger - 378 - WARNING - main - Server /home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.13/bin/synapse_homeserver version 1.141.0rc1 (b=develop,c0b9437ab6)
synapse.config.logger - 383 - WARNING - main - Copyright (c) 2023 New Vector, Inc
synapse.config.logger - 384 - WARNING - main - Licensed under the AGPL 3.0 license. Website: https://github.com/element-hq/synapse
synapse.config.logger - 387 - INFO - main - Server hostname: my.synapse.linux.server
synapse.config.logger - 388 - INFO - main - Public Base URL: http://localhost:8008/
synapse.config.logger - 389 - INFO - main - Instance name: master
synapse.config.logger - 390 - INFO - main - Twisted reactor: EPollReactor
synapse.app.homeserver - 412 - INFO - main - Setting up server
jaeger_tracing - 463 - INFO - main - Initializing Jaeger Tracer with UDP reporter
asyncio - 64 - DEBUG - sentinel - Using selector: EpollSelector
jaeger_tracing - 398 - INFO - main - Using sampler ConstSampler(True)
jaeger_tracing - 452 - INFO - main - opentracing.tracer initialized to <jaeger_client.tracer.Tracer object at 0x7fd4e2b0b0e0>[app_name=my.synapse.linux.server master]
synapse.server - 608 - INFO - main - Setting up.
synapse.app._base - 247 - ERROR - main - Exception during startup
Traceback (most recent call last):
  File "synapse/synapse/app/homeserver.py", line 418, in setup
    hs.setup()
    ~~~~~~~~^^
  File "synapse/synapse/server.py", line 610, in setup
    self.datastores = Databases(self.DATASTORE_CLASS, self)
                      ~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "synapse/synapse/storage/databases/__init__.py", line 86, in __init__
    with make_conn(
         ~~~~~~~~~^
        db_config=database_config,
        ^^^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        server_name=server_name,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ) as db_conn:
    ^
  File "synapse/synapse/storage/database.py", line 188, in make_conn
    native_db_conn = engine.module.connect(**db_params)
  File "/home/eric/.cache/pypoetry/virtualenvs/matrix-synapse-xCtC9ulO-py3.13/lib/python3.13/site-packages/psycopg2/__init__.py", line 122, in connect
    conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused
	Is the server running on that host and accepting TCP/IP connections?
connection to server at "localhost" (::1), port 5432 failed: Connection timed out
	Is the server running on that host and accepting TCP/IP connections?
```

Timeout

psycopg2 is a libpq wrapper, and the docs say the default is to "wait indefinitely":

> connect_timeout - Maximum wait for connection, in seconds. Zero or not specified means wait indefinitely.
>
> -- https://pkg.go.dev/github.com/lib/pq#hdr-Connection_String_Parameters

Even in the more verbose Postgres libpq docs, the default is to "wait indefinitely":

> connect_timeout
>
> Maximum time to wait while connecting, in seconds (write as a decimal integer, e.g., 10). Zero, negative, or not specified means wait indefinitely. This timeout applies separately to each host name or IP address. For example, if you specify two hosts and connect_timeout is 5, each host will time out if no connection is made within 5 seconds, so the total time spent waiting for a connection might be up to 10 seconds.
>
> -- Postgres libpq database connection docs

It's possible to configure connect_timeout in the Synapse homeserver config which does make things timeout quicker:

homeserver.yaml

```yaml
database:
  name: psycopg2
  args:
    user: postgres
    database: synapse
    host: localhost
    sslmode: disable
    # The maximum number of seconds to wait while connecting
    connect_timeout: 10
```
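
Everything under `args` is passed straight through as keyword arguments to `psycopg2.connect(...)`, which forwards them to libpq. A rough sketch of where `connect_timeout` ends up (plain Python, nothing Synapse-specific; the DSN rendering is illustrative, not psycopg2's actual implementation):

```python
# The `args` mapping from homeserver.yaml. These are standard libpq
# connection parameters (psycopg2 translates `database` to libpq's `dbname`).
args = {
    "user": "postgres",
    "database": "synapse",
    "host": "localhost",
    "sslmode": "disable",
    "connect_timeout": 10,  # cap each per-host connection attempt at 10s
}

# Conceptually, psycopg2 hands libpq the equivalent keyword/value DSN:
key_map = {"database": "dbname"}
dsn = " ".join(f"{key_map.get(key, key)}={value}" for key, value in args.items())
print(dsn)
# user=postgres dbname=synapse host=localhost sslmode=disable connect_timeout=10
```

Because `connect_timeout` applies per host, a `localhost` that resolves to both an IPv4 and an IPv6 address can still take up to 2× the configured value.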

But since we weren't configuring connect_timeout at all, why are we seeing our connection timeout?

From the logs, we can see that the IPv4 attempt failed fast: `connection to server at "localhost" (127.0.0.1), port 5432 failed: Connection refused`.

While the IPv6 attempt took a while to time out: `connection to server at "localhost" (::1), port 5432 failed: Connection timed out`.

If I change my database config to use only the IPv4 address (host: 127.0.0.1), Synapse is able to abort immediately. So we can narrow the culprit down to the IPv6 attempts.

homeserver.yaml

```yaml
database:
  name: psycopg2
  args:
    user: postgres
    database: synapse
    host: 127.0.0.1
    sslmode: disable
```

For reference, I have IPv6 disabled on my system via `net.ipv6.conf.all.disable_ipv6 = 1`
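
libpq resolves the host name itself and tries each returned address in turn, which is why `localhost` produces both an IPv4 and an IPv6 attempt. You can see what the resolver hands back (output varies by `/etc/hosts`; on a typical dual-stack setup it includes both `127.0.0.1` and `::1` even when IPv6 is disabled at the kernel level):

```python
import socket

# Ask the resolver the same question libpq does: every TCP address
# that "localhost" maps to on the Postgres port.
infos = socket.getaddrinfo("localhost", 5432, proto=socket.IPPROTO_TCP)
addresses = [sockaddr[0] for _family, _type, _proto, _canon, sockaddr in infos]
print(addresses)
```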

Asking an LLM why we're timing out gives a plausible explanation about exponential backoff behavior and points to `net.ipv4.tcp_syn_retries = 6`. So for 6 retries (7 attempts), we get 2^7 - 1 = 127s (though it's slightly more complicated than that, based on the descriptions below):

```
$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6
```

> tcp_syn_retries - INTEGER
>
> Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. Default value is 6, which corresponds to 67 seconds (with tcp_syn_linear_timeouts = 4) till the last retransmission with the current initial RTO of 1 second. With this the final timeout for an active TCP connection attempt will happen after 131 seconds.
>
> -- https://docs.kernel.org/networking/ip-sysctl.html

> tcp_syn_linear_timeouts - INTEGER
>
> The number of times for an active TCP connection to retransmit SYNs with a linear backoff timeout before defaulting to an exponential backoff timeout. This has no effect on SYNACK at the passive TCP side.
>
> With an initial RTO of 1 and tcp_syn_linear_timeouts = 4 we would expect SYN RTOs to be: 1, 1, 1, 1, 1, 2, 4, ... (4 linear timeouts, and the first exponential backoff using 2^0 * initial_RTO). Default: 4
>
> -- https://docs.kernel.org/networking/ip-sysctl.html

These are IPv4 settings, but I'm going to assume they also apply to the IPv6 path somewhere, since I don't see separate IPv6 settings for this.
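
The numbers in the kernel docs can be reproduced with a small model (my own reading of the docs, not kernel code: an initial RTO of 1s, `tcp_syn_linear_timeouts` fixed-RTO retransmissions that don't count against `tcp_syn_retries`, then exponential backoff):

```python
def syn_timeline(syn_retries=6, linear_timeouts=4, initial_rto=1):
    # RTO before each retransmission: a fixed wait during the linear phase,
    # then 2**0, 2**1, ... times the initial RTO for the counted retries.
    rtos = [initial_rto] * linear_timeouts
    rtos += [initial_rto * 2**i for i in range(syn_retries)]
    last_retransmission = sum(rtos)
    # The attempt is abandoned one further (doubled) RTO after the last SYN.
    final_timeout = last_retransmission + initial_rto * 2**syn_retries
    return last_retransmission, final_timeout

print(syn_timeline())                    # → (67, 131), matching the kernel docs
print(syn_timeline(linear_timeouts=0))   # → (63, 127), the classic figures
```

The `(63, 127)` case (no linear timeouts) is where the 2^7 - 1 = 127s estimate above comes from; the linear phase adds 4 seconds on top.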

Workaround

For me locally, I can work around the problem by configuring an IPv4 address (host: 127.0.0.1) instead of localhost.

homeserver.yaml

```yaml
database:
  name: psycopg2
  args:
    user: postgres
    database: synapse
    host: 127.0.0.1
    sslmode: disable
```

While this reduces the problem immensely, we still block the main thread until we connect to the database. We don't want to block the main thread unnecessarily at all (especially in the context of running multiple Synapse instances in the same process, c.f. Synapse Pro for small hosts).
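
One mitigation that works with the current blocking driver is to run the connect on a worker thread and cap how long the main thread waits for it. A hypothetical sketch (`slow_connect` stands in for `psycopg2.connect(...)`; note the worker thread itself still runs until libpq gives up — we just stop waiting on it, so the main thread stays free):

```python
import concurrent.futures
import time

def slow_connect():
    # Stand-in for psycopg2.connect(...) hanging on an unreachable host.
    time.sleep(1)
    return "connection"

executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
future = executor.submit(slow_connect)
try:
    conn = future.result(timeout=0.1)  # the main thread waits at most 0.1s
except concurrent.futures.TimeoutError:
    conn = None  # fail fast; the main thread can still handle SIGINT etc.
print(conn)  # → None
```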

Dev notes

DB-API 2.0 interface/specification

In Synapse, we use the DB-API 2.0 interface specification defined in PEP 249, which provides a consistent way to interact with different databases.

```python
class BaseDatabaseEngine(Generic[ConnectionType, CursorType], metaclass=abc.ABCMeta):
    def __init__(self, module: DBAPI2Module, config: Mapping[str, Any]):
        self.module = module
```

For SQLite, Python has a builtin sqlite3 library that provides a DB-API 2.0 interface for SQLite databases.

```python
class Sqlite3Engine(BaseDatabaseEngine[sqlite3.Connection, sqlite3.Cursor]):
    def __init__(self, database_config: Mapping[str, Any]):
        super().__init__(sqlite3, database_config)
```

For Postgres, we use psycopg2 which complies with the Python DB API 2.0 specification.

```python
class PostgresEngine(
    BaseDatabaseEngine[psycopg2.extensions.connection, psycopg2.extensions.cursor]
):
    def __init__(self, database_config: Mapping[str, Any]):
        super().__init__(psycopg2, database_config)
```

Synapse startup sequence -> database connection

  1. `main()`
  2. `setup(hs)`
  3. `hs.setup()`
  4. `Databases(self.DATASTORE_CLASS, self)`
  5. `make_conn(...)`
  6. `engine.module.connect(**db_params)`
  7. Which calls either `psycopg2.connect(...)` or `sqlite3.connect(...)` depending on which DB-API 2.0 database module we provided when we created the engine. This blocks until it connects or times out.
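
Step 6 is the same synchronous DB-API 2.0 call regardless of engine; with the stdlib `sqlite3` module standing in for `psycopg2`:

```python
import sqlite3

# `engine.module` is whichever DB-API 2.0 module the engine wraps
# (sqlite3 here, psycopg2 for Postgres).
engine_module = sqlite3
db_params = {"database": ":memory:"}

# DB-API connect() is synchronous: it returns only once the connection is
# established (or raises), blocking the calling thread in the meantime.
db_conn = engine_module.connect(**db_params)
cursor = db_conn.cursor()
cursor.execute("SELECT 1")
print(cursor.fetchone())  # → (1,)
```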

Connect asynchronously

Postgres does have "asynchronous support" via `psycopg2.connect(async_=True)`, but this comes with the baggage of then needing to handle everything asynchronously.

psycopg3 has an `AsyncConnection.connect(...)` method alongside the normal `psycopg.connect(...)` method.

#18999 is adding support for psycopg3 but doesn't change how we're connecting at all.
