
Support large tables which don't fit in RAM #13

Merged
akaihola merged 14 commits into master from large-tables on Apr 22, 2024

Conversation

@akaihola (Owner) commented Jul 4, 2020

Fixes #1

Use merge sort with temporary files to put a cap on memory usage.
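
In other words, this is essentially the classic external merge sort. A minimal sketch of the idea, with illustrative names rather than the actual implementation:

```python
import heapq
import tempfile

def external_sort(lines, max_buffer_bytes=10**8):
    """Sort an arbitrarily large iterable of newline-terminated strings
    while keeping memory usage bounded."""
    runs = []        # temporary files, each holding one sorted run
    buffer = []      # in-memory lines not yet flushed
    buffered_bytes = 0
    for line in lines:
        buffer.append(line)
        buffered_bytes += len(line)
        if buffered_bytes >= max_buffer_bytes:
            # Flush the full buffer to disk as a sorted run.
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(sorted(buffer))
            run.seek(0)
            runs.append(run)
            buffer = []
            buffered_bytes = 0
    # Iterating a text file yields its lines, so heapq.merge() can lazily
    # combine the in-memory remainder with all on-disk runs.
    return heapq.merge(sorted(buffer), *runs)
```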

@akaihola (Owner, Author) commented Jul 4, 2020

@oldcai could you check whether this branch

  • sorts your large database without getting killed for using too much memory
  • has acceptable performance

@akaihola akaihola linked an issue Sep 12, 2021 that may be closed by this pull request
@akaihola akaihola force-pushed the large-tables branch 2 times, most recently from fdc0909 to 4a13dea on September 12, 2021 08:49
@akaihola akaihola force-pushed the large-tables branch 2 times, most recently from d9685d7 to 4fbe76f on April 19, 2024 20:32
@akaihola akaihola added this to the 1.0.1 milestone Apr 20, 2024
@akaihola (Owner, Author)

@oldcai I've now finalized this fix. Would you accept my invitation to become a collaborator on this repository and review the pull request?

@oldcai (Collaborator) commented Apr 21, 2024

> @oldcai I've now finalized this fix. Would you accept my invitation to become a collaborator on this repository and review the pull request?

Thank you for your efforts. I'm honoured to do that.
I'll review the code and do some manual tests.

@akaihola akaihola requested a review from oldcai April 21, 2024 13:55
@oldcai (Collaborator) commented Apr 21, 2024

✅ I downloaded the code and ran it on my dev environment with some test data successfully.

⚠️ The Python version requirement is too high. When I tried to test it in the production environment, I found that the minimum required Python version is 3.7, or even 3.8 once the other MR is merged, which is much higher than the CentOS stable version 3.6.8.

Then I removed the annotation code and successfully ran it.

I'm currently doing a benchmark on a 167G file and will continue reviewing the code after it finishes.

❓ Additionally, I noticed an interesting fact: although the memory limit is set to 10**8, which should be 100 MB, the actual memory use and the temporary file sizes are only about 50 MB. But I don't think it's an issue, since the file count won't affect the performance of heapq.merge() too much.

@akaihola (Owner, Author)

> ✅ I downloaded the code and ran it on my dev environment with some test data successfully.

Good to hear that it works at least in a simple case!

> ⚠️ The Python version requirement is too high. When I tried to test it in the production environment, I found that the minimum required Python version is 3.7, or even 3.8 once the other MR is merged, which is much higher than the CentOS stable version 3.6.8.

Ah, sorry about that. I'm trying to be a good Python community citizen and encourage upgrading from unsupported Python versions, but I do recognize that in RedHat/CentOS land the versions drag a bit behind... Would a Docker image help you?

> ❓ Additionally, I noticed an interesting fact: although the memory limit is set to 10**8, which should be 100 MB, the actual memory use and the temporary file sizes are only about 50 MB. But I don't think it's an issue, since the file count won't affect the performance of heapq.merge() too much.

Interesting! The documentation for sys.getsizeof() clearly says that it should return the size of objects in bytes, and this patch counts both each string stored in the buffer and the size overhead of the buffer list object itself.
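
For reference, that accounting amounts to something like the following (a simplified sketch, not the exact code in the patch):

```python
import sys

def buffer_size(buffer):
    """Approximate bytes held by the buffer: the list object itself plus
    every string stored in it, as reported by sys.getsizeof()."""
    return sys.getsizeof(buffer) + sum(sys.getsizeof(line) for line in buffer)
```

Since sys.getsizeof() includes CPython's per-object overhead (roughly 50 bytes of header per ASCII str, plus an 8-byte pointer per list slot), short lines can account for considerably more than their raw character data, which would be consistent with temporary files of about 50 MB under a 100 MB limit.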

Are you going to experiment with different memory limits? Would it make sense to make the limit configurable via a command line option?

@oldcai (Collaborator) commented Apr 22, 2024

I changed the benchmark file size to around 20G and it finished in 53m54.040s with the default setting.

When I increased the memory limit tenfold to 1G, it finished in 26m53.996s, about half the time taken with the 100M limit. This is as expected, since the number of files affects heapq.merge() in a log(n) manner.
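
As a rough sanity check of that log(n) intuition, assuming one temporary file per memory-limit-sized chunk (the chunk counts below are approximate):

```python
import math

runs_100m = 20_000 // 100    # ~20 GB input with a 100 MB limit: ≈ 200 run files
runs_1g = 20_000 // 1_000    # ~20 GB input with a 1 GB limit:   ≈ 20 run files

# heapq.merge() keeps one head element per run file in a heap, so each
# output line costs O(log k) comparisons for k files.
print(math.log2(runs_100m) / math.log2(runs_1g))  # ≈ 1.8
```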

Later, I found an issue that might make the sort result unstable and slightly affect performance: the sorting in MergeSort would not use self._key when the data fits in memory.

Since in-memory sorting is fast, I assume it won't add too much computing time. I fixed this issue and modified the unit test for this case, and I'll run another benchmark later to see if there is a difference in performance.
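
To illustrate why the key matters for consistent results (a hypothetical example; str.lower stands in for whatever self._key actually is):

```python
lines = ["INSERT INTO b ...\n", "insert into a ...\n"]

# A plain sort orders by raw byte values, putting the uppercase line first:
print(sorted(lines))
# A keyed sort (matching what the on-disk merge path uses) orders them the
# other way around, so both code paths must apply the same key:
print(sorted(lines, key=str.lower))
```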

Update: 52m16.218s for the 100M memory limit, slightly faster. 29m5.635s for the 1G memory limit. Setting the key slightly slows down the in-memory sorting, but it ensures the sorting result is the same in all cases.

I think that's all I got for now.

Thank you for your great project.

@oldcai (Collaborator) left a comment

I believe the memory inaccuracy is due to the sys.getsizeof API and won't affect the performance too much.

I made a patch for the unstable sorting issue, and it needs your review, @akaihola.

@akaihola (Owner, Author)

Thanks @oldcai, looks good! I changed the command line to use an optional -m/--max-memory argument, which also supports memory units like k, mb, GiB etc.
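
For the record, a unit-aware parser along these lines could back such an option (a hypothetical sketch, not necessarily the actual implementation):

```python
import argparse
import re

# Plain bytes plus decimal (k/M/G) and binary (Ki/Mi/Gi) units, case-insensitive.
UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9,
         "ki": 2**10, "mi": 2**20, "gi": 2**30}

def parse_memory(value):
    match = re.fullmatch(r"\s*(\d+)\s*([kmg]i?)?b?\s*", value, re.IGNORECASE)
    if not match:
        raise argparse.ArgumentTypeError(f"invalid memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[(unit or "").lower()]

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--max-memory", type=parse_memory, default=10**8)
print(parser.parse_args(["-m", "1 GiB"]).max_memory)  # 1073741824
```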

I rebased the commits a bit and force pushed.

@oldcai (Collaborator) left a comment

Nice implementation! Tested successfully on small data.

Thank you for your great work!

@akaihola akaihola merged commit 230787a into master Apr 22, 2024
19 checks passed
@akaihola akaihola deleted the large-tables branch April 22, 2024 13:35
@akaihola akaihola modified the milestones: 1.0.1, 1.1.0 Apr 22, 2024
@oldcai (Collaborator) commented Apr 23, 2024

I tested it with a 444G .sql file and it finished in 523m22.189s.

The archive size of a full backup with compression is 64.585 GB, while an incremental backup is only 7.948 GB, which is a significant reduction. @akaihola Thank you for your excellent work!

@akaihola (Owner, Author)

> I tested it with a 444G .sql file and it finished in 523m22.189s.

I wonder how much performance we could squeeze out of Python with some tricks. It would also be interesting to see how much running with PyPy, or compiling with mypyc, Nuitka, Shed Skin, or even Cython, could improve performance.

> The archive size of a full backup with compression is 64.585 GB, while an incremental backup is only 7.948 GB, which is a significant reduction.

Cool! Could I mention this as an example in the README?

@oldcai (Collaborator) commented Apr 24, 2024

> I wonder how much performance we could squeeze out of Python with some tricks. It would also be interesting to see how much running with PyPy, or compiling with mypyc, Nuitka, Shed Skin, or even Cython, could improve performance.

It's already acceptable to me since I only do a couple of backups a week.

I believe the Python 3.12 per-interpreter GIL feature would help more, but it will be a long time before it comes to a third-world OS. 😆

> Cool! Could I mention this as an example in the README?

Sure, I'd be glad if it could help convince more people to use it and reduce their carbon footprint.

Successfully merging this pull request may close these issues:

  • Use Too Much Memory, Killed by System