Document/Improve: Continuous batching/chunking of a stream #118
Comments
Regarding the "antipattern" that is mentioned in #118 (comment): calling the next COPY and piping into the new rotated stream without waiting until the previous stream emits 'finish'. In the code, what happens is that the next COPY is started immediately, when what you want, in order to leave more breathing room to the backend, may be to start the new COPY only after the previous stream has emitted 'finish'.
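A minimal sketch of that sequencing, assuming a connected `pg` Client named `client`, a readable `source` of COPY text lines, and an illustrative `logs` table (none of these names come from the library): unpipe the source, end the current COPY stream, and only start the next COPY once the previous one has emitted 'finish'.

```js
const { from: copyFrom } = require('pg-copy-streams')

// Rotate from the current COPY stream to a fresh one, waiting for 'finish'
// so the backend gets its breathing room before the next COPY starts.
// `source`, `client` and the `logs` table are illustrative names.
function rotateCopy(source, currentCopyStream, client, onRotated) {
  source.unpipe(currentCopyStream) // pauses the source (no pipe destinations left)
  currentCopyStream.end()          // flush and terminate the current COPY
  currentCopyStream.once('finish', () => {
    // the previous COPY is done; its rows are now visible
    const nextCopyStream = client.query(copyFrom('COPY logs FROM STDIN'))
    source.pipe(nextCopyStream)    // resume the source into the new COPY
    onRotated(nextCopyStream)
  })
}
```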
Just to make it clear, this issue is for discussing a pattern that attracts some people because of the possibilities given by COPY FROM STDIN. Some people want to create a long-lived ingestion mechanism to ingest data into postgresql. For example, you could imagine that you have a never-ending stream of logs that you want to ingest into postgresql; this could be done by piping that stream into successive COPY FROM STDIN operations.

There are probably other ways to do that with postgresql, including mechanisms linked with replication, but still it can be interesting to see if and how COPY can help for this kind of scenario.

First it is necessary to understand that a COPY FROM operation ingests rows. During the operation, the rows are not visible; they will become visible only once the COPY operation is terminated. Once this is said, the idea of rotated COPY operations appears: ingest N rows via COPY and then create a new COPY operation (a sketch of this loop is given below).

It could be interesting to agree on the advantages and disadvantages of the different solutions that can be used to do that, and to see how each impacts the performance of postgresql depending on the throughput of the ingested data, as well as the throughput of out-of-band queries that need to be executed on the data. I will leave this issue open for several months to see if people want to share their experience, have ideas around this, and discuss this further.
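As a concrete illustration of the rotated-COPY idea, here is a minimal sketch assuming a `logs` table, a batch size of 10000 rows, and an async-iterable `lines` source that yields one COPY text line (ending in '\n') per iteration; these names and numbers are illustrative, not part of the library.

```js
const { Client } = require('pg')
const { from: copyFrom } = require('pg-copy-streams')
const { once } = require('node:events')

const ROWS_PER_COPY = 10000 // illustrative batch size

// Ingest a never-ending async iterable of COPY text lines into `logs`,
// rotating to a new COPY FROM STDIN every ROWS_PER_COPY rows so that
// the ingested rows become visible at regular intervals.
async function ingest(lines) {
  const client = new Client()
  await client.connect()
  try {
    let copyStream = null
    let rowsInCurrentCopy = 0
    for await (const line of lines) {
      if (!copyStream) {
        copyStream = client.query(copyFrom('COPY logs FROM STDIN'))
        rowsInCurrentCopy = 0
      }
      if (!copyStream.write(line)) {
        await once(copyStream, 'drain') // respect backpressure
      }
      rowsInCurrentCopy += 1
      if (rowsInCurrentCopy >= ROWS_PER_COPY) {
        copyStream.end()
        await once(copyStream, 'finish') // rows of this batch are now visible
        copyStream = null
      }
    }
    if (copyStream) { // flush the last, possibly partial, batch
      copyStream.end()
      await once(copyStream, 'finish')
    }
  } finally {
    await client.end()
  }
}
```

Error handling is omitted for brevity; a real implementation would also listen for 'error' on the client and on each COPY stream.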
@jeromew: Howdy, here I am bulk ingesting data into PostgreSQL again 😃 ... Multi-INSERT commands seem slow compared to COPY, but I haven't benchmarked it yet. I'm realizing that a library that wraps this one and handles the continuous batching could be useful. If there was any interest in linking to it from this library, I'd gladly consider implementing it. Thoughts? @jeromew @brianc
Hello, I understand the need to find a nice streaming ingestion mechanism for long-lived streams and mega-huge ingested files with regular visibility on the data being ingested (using one COPY operation only gives visibility after the COPY is committed). It would indeed be interesting to benchmark COPY vs Multi-INSERT. By Multi-INSERT for N rows, I do not mean N INSERT operations ("1 insert per row"), but the ability that postgres has to insert several rows as a batch in one operation - cf https://www.postgresql.org/docs/current/dml-insert.html - and thus send 1 INSERT with N rows (see the sketch below).
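For reference, such a multi-row INSERT can be built with `pg` roughly as follows; the `logs (ts, msg)` table and the batch shape are assumptions for the example.

```js
// Send one INSERT carrying N rows, instead of N single-row INSERTs.
// `client` is a connected pg Client; `batch` is an array of [ts, msg] pairs.
async function insertBatch(client, batch) {
  const values = []
  const placeholders = batch.map(([ts, msg], i) => {
    values.push(ts, msg)
    return `($${i * 2 + 1}, $${i * 2 + 2})`
  })
  await client.query(
    `INSERT INTO logs (ts, msg) VALUES ${placeholders.join(', ')}`,
    values
  )
}
```

Note that the wire protocol limits the number of bind parameters per statement to 65535, so very large batches have to be split accordingly.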
The TIP on this page says to consider using COPY when inserting a lot of data at the same time, since it is less flexible than INSERT but more efficient.
I fully agree for 1 COPY versus INSERTs, but if the regular visibility is important to you, I would be surprised if the time spent setting up a COPY operation for every N rows outperforms a regular INSERT with N rows, and even more so if the INSERT of N rows is a [...]. A specificity also when using COPY is that [...]. Note also that if the stream is an object stream with 1 row per chunk, it is important performance-wise to recombine the rows into bigger network chunks, as per #127 (see the sketch below).
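A minimal sketch of such a recombination step, assuming the source is an object-mode stream where each chunk is one COPY text line; the `RowBatcher` name and the 64 KB target size are illustrative choices, not part of the library.

```js
const { Transform } = require('node:stream')

// Accumulate 1-row object-mode chunks into larger Buffers before they
// reach the COPY stream, to avoid sending one tiny network chunk per row.
class RowBatcher extends Transform {
  constructor(targetBytes = 64 * 1024) { // illustrative target chunk size
    super({ writableObjectMode: true })  // rows in, Buffers out
    this.targetBytes = targetBytes
    this.pending = []
    this.pendingBytes = 0
  }
  _transform(row, _encoding, callback) {
    const buf = Buffer.from(row) // row is a COPY text line ending in '\n'
    this.pending.push(buf)
    this.pendingBytes += buf.length
    if (this.pendingBytes >= this.targetBytes) {
      this.push(Buffer.concat(this.pending))
      this.pending = []
      this.pendingBytes = 0
    }
    callback()
  }
  _flush(callback) {
    if (this.pendingBytes > 0) this.push(Buffer.concat(this.pending))
    callback()
  }
}

// usage: rowObjectStream.pipe(new RowBatcher()).pipe(copyStream)
```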
I'm creating this issue to discuss solutions (documentation or otherwise) to help folks avoid some of the non-memory-related pitfalls from #116 when continuously batching COPY commands.
See:
#116 (comment)