```
docker-compose -f build/docker/docker-compose.yml up
```
The docker-compose file has profiles that can be used to bring up only the relevant containers. For example, if you only want to run PostgreSQL to PostgreSQL pgstream replication, you can use the `pg2pg` profile as follows:
```
docker-compose -f build/docker/docker-compose.yml --profile pg2pg up
```
You can also run multiple profiles. For example, to start two PostgreSQL instances and Kafka:
```
docker-compose -f build/docker/docker-compose.yml --profile pg2pg --profile kafka up
```
List of supported docker profiles:
- pg2pg
- pg2os
- pg2webhook
- kafka
#### Prepare the database
This will create the `pgstream` schema in the configured Postgres database, along with the tables/functions/triggers required to keep track of the schema changes. See [Tracking schema changes](#tracking-schema-changes) section for more details. It will also create a replication slot for the configured database which will be used by the pgstream service.
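
To sanity-check the initialisation, you can verify on the configured Postgres database that the schema and the replication slot were created. A minimal sketch using standard PostgreSQL catalog queries (not a pgstream command):

```
-- The pgstream schema created by the initialisation step should be listed.
SELECT schema_name FROM information_schema.schemata WHERE schema_name = 'pgstream';

-- The replication slot created for the configured database should show up here.
SELECT slot_name, plugin, database FROM pg_replication_slots;
```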
```
pgstream run -c kafka2os.env --log-level trace
```
Example running pgstream with PostgreSQL -> PostgreSQL with initial snapshot enabled:
```
pgstream run -c pg2pg.env --log-level trace
```
Example running pgstream with PostgreSQL in snapshot-only mode -> PostgreSQL:
```
pgstream run -c snapshot2pg.env --log-level trace
```
The run command will parse the configuration provided, and initialise the configured modules. It requires at least one listener and one processor.
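
As a rough illustration of that requirement, a `pg2pg.env` style configuration pairs a Postgres listener with the Postgres batch writer processor. The sketch below is illustrative only: `PGSTREAM_POSTGRES_LISTENER_URL` is an assumed variable name for the listener side, while `PGSTREAM_POSTGRES_WRITER_TARGET_URL` is documented in the configuration tables below.

```
# Listener: source PostgreSQL database (variable name assumed for illustration).
PGSTREAM_POSTGRES_LISTENER_URL="postgres://postgres:postgres@localhost:5432?sslmode=disable"
# Processor: target PostgreSQL database (see the configuration section below).
PGSTREAM_POSTGRES_WRITER_TARGET_URL="postgres://postgres:postgres@localhost:7654?sslmode=disable"
```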
## Configuration
<details>
<summary>Postgres batch writer</summary>

| Environment Variable | Default | Required | Description |
| -------------------- | ------- | -------- | ----------- |
| PGSTREAM_POSTGRES_WRITER_TARGET_URL | N/A | Yes | URL for the PostgreSQL store to connect to |
| PGSTREAM_POSTGRES_WRITER_BATCH_TIMEOUT | 1s | No | Max time interval at which the batch sending to PostgreSQL is triggered. |
| PGSTREAM_POSTGRES_WRITER_BATCH_SIZE | 100 | No | Max number of messages to be sent per batch. When this size is reached, the batch is sent to PostgreSQL. |
| PGSTREAM_POSTGRES_WRITER_MAX_QUEUE_BYTES | 100MiB | No | Max memory used by the postgres batch writer for inflight batches. |
| PGSTREAM_POSTGRES_WRITER_BATCH_BYTES | 1572864 | No | Max size in bytes for a given batch. When this size is reached, the batch is sent to PostgreSQL. |
| PGSTREAM_POSTGRES_WRITER_SCHEMALOG_STORE_URL | N/A | No | URL of the store containing the pgstream schemalog table, which keeps track of schema changes. |
| PGSTREAM_POSTGRES_WRITER_DISABLE_TRIGGERS | False | No | Option to disable triggers on the target PostgreSQL database while performing the snapshot/replication streaming. |
</details>
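
As a hedged example of how the batch writer settings combine, a snippet using only the variables from the table above (values are illustrative, not recommendations):

```
PGSTREAM_POSTGRES_WRITER_TARGET_URL="postgres://postgres:postgres@localhost:7654?sslmode=disable"
PGSTREAM_POSTGRES_WRITER_BATCH_TIMEOUT=5s
PGSTREAM_POSTGRES_WRITER_BATCH_SIZE=500
PGSTREAM_POSTGRES_WRITER_BATCH_BYTES=1572864
PGSTREAM_POSTGRES_WRITER_DISABLE_TRIGGERS=true
```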
<details>
<summary>Injector</summary>
The schema and data changes are part of the same linear stream - the downstream consumers always observe the schema changes as soon as they happen, before any data arrives that relies on the new schema. This avoids data loss and the need for manual intervention.
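
As a concrete illustration of the ordering guarantee (table and column names are made up), consider a source that adds a column and immediately writes to it; the downstream consumer will always see the DDL event before the insert that depends on it:

```
-- Source database: the schema change is streamed first...
ALTER TABLE users ADD COLUMN email text;
-- ...and the data relying on the new column only arrives after it.
INSERT INTO users (id, email) VALUES (1, 'ada@example.com');
```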
## Snapshots
`pgstream` can handle the generation of PostgreSQL snapshots, including both schema and data. The current implementations for each are:
- Schema: depending on the configuration, it can use either the pgstream `schema_log` table to get the schema view and process it as events downstream, or rely on the `pg_dump`/`pg_restore` PostgreSQL utilities.
- Data: it relies on transaction snapshot ids to obtain a stable view of the database, and parallelises the read of all the rows by dividing them into ranges using the `ctid`, as sketched below.
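
For intuition, here is a sketch of the underlying PostgreSQL primitives rather than pgstream's actual queries; the table name and range bounds are made up, and the `ctid` range comparison assumes a reasonably recent PostgreSQL version:

```
-- Coordinating session: open a transaction and export its snapshot id,
-- keeping the transaction open while the workers read.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT pg_export_snapshot(); -- e.g. '00000003-0000001B-1'

-- Worker session: attach to the same snapshot for a stable view of the data,
-- then read one ctid range; other workers read disjoint ranges in parallel.
BEGIN ISOLATION LEVEL REPEATABLE READ;
SET TRANSACTION SNAPSHOT '00000003-0000001B-1';
SELECT * FROM my_table WHERE ctid >= '(0,0)'::tid AND ctid < '(1000,0)'::tid;
```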
## Architecture
257
314
258
315
`pgstream` is constructed as a streaming pipeline, where data from one module streams into the next, eventually reaching the configured output plugins. It keeps track of schema changes and replicates them along with the data changes to ensure a consistent view of the source data downstream. This modular approach makes adding and integrating output plugin implementations simple and painless.

There are currently three implementations of the listener:
- **Postgres listener**: listens to WAL events directly from the replication slot. Since the WAL replication slot is sequential, the Postgres WAL listener is limited to running as a single process. The associated Postgres checkpointer will sync the LSN so that the replication lag doesn't grow indefinitely (a query to inspect that lag is sketched after this list). It can be configured to perform an initial snapshot when pgstream is first connected to the source PostgreSQL database (see details in the [snapshots section](#snapshots)).
- **Postgres Snapshoter**: produces events by performing a snapshot of the configured PostgreSQL database, as described in the [snapshots section](#snapshots). It doesn't start continuous replication, so once all the snapshotted data has been processed, the pgstream process will stop.
- **Kafka reader**: reads WAL events from a Kafka topic. It can be configured to run concurrently by using partitions and Kafka consumer groups, applying a fan-out strategy to the WAL events. The data will be partitioned by database schema by default, but this can be configured when using `pgstream` as a library. The associated Kafka checkpointer will commit the message offsets per topic/partition so that the consumer group doesn't process the same message twice.
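
To make the Postgres checkpointer's effect observable, the replication lag of the slot can be inspected directly on the source database. A small sketch using standard catalog views (no pgstream-specific slot name is assumed):

```
-- WAL bytes between the server's current position and what the slot has confirmed.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS replication_lag
FROM pg_replication_slots;
```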
- **Webhook notifier**: it sends a notification to any webhooks that have subscribed to the relevant WAL event. It relies on a subscription HTTP server that receives the subscription requests and stores them in the shared subscription store, which is accessed whenever a WAL event is processed. It sends the notifications to the different subscribed webhook URLs in parallel, based on a configurable number of workers (client timeouts apply). Similar to the two previous processor implementations, it uses a memory-guarded buffering system internally, which allows it to separate the WAL event processing from the webhook URL sending, optimising the processor latency.
- **Postgres batch writer**: it writes the WAL events into a PostgreSQL-compatible database. It implements the same kind of mechanism as the Kafka and search batch writers to ensure continuous processing from the listener, and it also uses a batching mechanism to minimise PostgreSQL IO traffic.
346
+
286
347

In addition to the implementations described above, there are optional processor decorators, which work in conjunction with one of the main processor implementations. Their goal is to act as modifiers that enrich the WAL event being processed.
There are currently two implementations of the processor that act as decorators: