
Support large tables which don't fit in RAM #13

Merged
akaihola merged 14 commits into master from large-tables on Apr 22, 2024

Conversation

@akaihola (Owner) commented Jul 4, 2020

Fixes #1

Use merge sort with temporary files to put a cap on memory usage.
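
In other words, this is essentially the classic external merge sort. A minimal sketch of the idea, with illustrative names rather than the actual implementation:

```python
import heapq
import tempfile

def external_sort(lines, max_buffer_bytes=10**8):
    """Sort an arbitrarily large iterable of newline-terminated strings
    while keeping memory usage bounded."""
    runs = []        # temporary files, each holding one sorted run
    buffer = []      # in-memory lines not yet flushed
    buffered_bytes = 0
    for line in lines:
        buffer.append(line)
        buffered_bytes += len(line)
        if buffered_bytes >= max_buffer_bytes:
            # Flush the full buffer to disk as a sorted run.
            run = tempfile.TemporaryFile(mode="w+")
            run.writelines(sorted(buffer))
            run.seek(0)
            runs.append(run)
            buffer = []
            buffered_bytes = 0
    # Iterating a text file yields its lines, so heapq.merge() can lazily
    # combine the in-memory remainder with all on-disk runs.
    return heapq.merge(sorted(buffer), *runs)
```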

@akaihola (Owner, Author) commented Jul 4, 2020

@oldcai could you check whether this branch

  • sorts your large database without getting killed for using too much memory
  • has acceptable performance

@akaihola akaihola linked an issue Sep 12, 2021 that may be closed by this pull request
@akaihola akaihola force-pushed the large-tables branch 2 times, most recently from fdc0909 to 4a13dea on September 12, 2021 08:49
@akaihola akaihola force-pushed the large-tables branch 2 times, most recently from d9685d7 to 4fbe76f on April 19, 2024 20:32
@akaihola akaihola added this to the 1.0.1 milestone Apr 20, 2024
@akaihola (Owner, Author)

@oldcai I've now finalized this fix. Would you accept my invitation to become a collaborator on this repository and review the pull request?

@oldcai (Collaborator) commented Apr 21, 2024

> @oldcai I've now finalized this fix. Would you accept my invitation to become a collaborator on this repository and review the pull request?

Thank you for your efforts. I'm honoured to do that.
I'll review the code and do some manual tests.

@akaihola akaihola requested a review from oldcai April 21, 2024 13:55
@oldcai (Collaborator) commented Apr 21, 2024

✅ I downloaded the code and ran it on my dev environment with some test data successfully.

⚠️ The Python version requirement is too high. When I tried to test it in the production environment, I found that the minimum required Python version is 3.7, or even 3.8 once the other MR is merged, which is much higher than the CentOS stable version 3.6.8.

Then I removed the annotation code and successfully ran it.

I'm currently doing a benchmark on a 167G file and will continue reviewing the code after it finishes.

❓ Additionally, I noticed an interesting fact: although the memory limit is set to 10**8, which should be 100 MB, the actual memory use and the temporary file sizes are only about 50 MB. But I don't think it's an issue, since the file count won't affect the performance of heapq.merge() too much.

@akaihola (Owner, Author)

> ✅ I downloaded the code and ran it on my dev environment with some test data successfully.

Good to hear that it works at least in a simple case!

> ⚠️ The Python version requirement is too high. When I tried to test it in the production environment, I found that the minimum required Python version is 3.7, or even 3.8 once the other MR is merged, which is much higher than the CentOS stable version 3.6.8.

Ah, sorry about that. I'm trying to be a good Python community citizen and encourage upgrading from unsupported Python versions, but I do recognize that in RedHat/CentOS land the versions drag a bit behind... Would a Docker image help you?

> ❓ Additionally, I noticed an interesting fact: although the memory limit is set to 10**8, which should be 100 MB, the actual memory use and the temporary file sizes are only about 50 MB. But I don't think it's an issue, since the file count won't affect the performance of heapq.merge() too much.

Interesting! The documentation for sys.getsizeof() clearly says that it should return the size of objects in bytes, and this patch counts both each string stored in the buffer and the size overhead of the buffer list object itself.
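
For reference, that accounting amounts to something like the following (a simplified sketch, not the exact code in the patch):

```python
import sys

def buffer_size(buffer):
    """Approximate bytes held by the buffer: the list object itself plus
    every string stored in it, as reported by sys.getsizeof()."""
    return sys.getsizeof(buffer) + sum(sys.getsizeof(line) for line in buffer)
```

Since sys.getsizeof() includes CPython's per-object overhead (roughly 50 bytes of header per ASCII str, plus an 8-byte pointer per list slot), short lines can account for considerably more than their raw character data, which would be consistent with temporary files of about 50 MB under a 100 MB limit.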

Are you going to experiment with different memory limits? Would it make sense to make the limit configurable via a command line option?

@oldcai (Collaborator) commented Apr 22, 2024

I changed the benchmark file size to around 20G and it finished in 53m54.040s with the default setting.

When I increased the memory limit tenfold to 1G, it finished in 26m53.996s, about half the time taken with the 100M limit. This is as expected, since the number of files affects heapq.merge() in a log(n) manner.
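
As a rough sanity check of that log(n) intuition, assuming one temporary file per memory-limit-sized chunk (the chunk counts below are approximate):

```python
import math

runs_100m = 20_000 // 100    # ~20 GB input with a 100 MB limit: ≈ 200 run files
runs_1g = 20_000 // 1_000    # ~20 GB input with a 1 GB limit:   ≈ 20 run files

# heapq.merge() keeps one head element per run file in a heap, so each
# output line costs O(log k) comparisons for k files.
print(math.log2(runs_100m) / math.log2(runs_1g))  # ≈ 1.8
```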

Later, I found an issue that might make the sort result unstable and slightly affect performance: the sorting in MergeSort would not use self._key when the data fits in memory.

Since in-memory sorting is fast, I assume it won't add too much computing time. I fixed this issue and modified the unit test for this case, and I'll run another benchmark later to see if there is a difference in performance.
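
To illustrate why the key matters for consistent results (a hypothetical example; str.lower stands in for whatever self._key actually is):

```python
lines = ["INSERT INTO b ...\n", "insert into a ...\n"]

# A plain sort orders by raw byte values, putting the uppercase line first:
print(sorted(lines))
# A keyed sort (matching what the on-disk merge path uses) orders them the
# other way around, so both code paths must apply the same key:
print(sorted(lines, key=str.lower))
```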

Update: 52m16.218s for the 100M memory limit, slightly faster. 29m5.635s for the 1G memory limit. Setting the key slightly slows down the in-memory sorting, but it ensures the sorting result is the same in all cases.

I think that's all I got for now.

Thank you for your great project.

@oldcai (Collaborator) left a comment

I believe the memory inaccuracy is due to the sys.getsizeof API and won't affect the performance too much.

I made a patch for the unstable sorting issue, and it needs your review, @akaihola.

@akaihola (Owner, Author)

Thanks @oldcai, looks good! I changed the command line to use an optional -m/--max-memory argument, which also supports memory units like k, mb, GiB etc.
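
For the record, a unit-aware parser along these lines could back such an option (a hypothetical sketch, not necessarily the actual implementation):

```python
import argparse
import re

# Plain bytes plus decimal (k/M/G) and binary (Ki/Mi/Gi) units, case-insensitive.
UNITS = {"": 1, "k": 10**3, "m": 10**6, "g": 10**9,
         "ki": 2**10, "mi": 2**20, "gi": 2**30}

def parse_memory(value):
    match = re.fullmatch(r"\s*(\d+)\s*([kmg]i?)?b?\s*", value, re.IGNORECASE)
    if not match:
        raise argparse.ArgumentTypeError(f"invalid memory size: {value!r}")
    number, unit = match.groups()
    return int(number) * UNITS[(unit or "").lower()]

parser = argparse.ArgumentParser()
parser.add_argument("-m", "--max-memory", type=parse_memory, default=10**8)
print(parser.parse_args(["-m", "1 GiB"]).max_memory)  # 1073741824
```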

I rebased the commits a bit and force pushed.

@oldcai (Collaborator) left a comment

Nice implementation! Tested successfully on small data.

Thank you for your great work!

@akaihola akaihola merged commit 230787a into master Apr 22, 2024
19 checks passed
@akaihola akaihola deleted the large-tables branch April 22, 2024 13:35
@akaihola akaihola modified the milestones: 1.0.1, 1.1.0 Apr 22, 2024
@oldcai (Collaborator) commented Apr 23, 2024

I tested it with a 444G .sql file and it finished in 523m22.189s.

The archive size of a full backup with compression is 64.585 GB, while an incremental backup is only 7.948 GB, which is a significant reduction. @akaihola Thank you for your excellent work!

@akaihola (Owner, Author)

> I tested it with a 444G .sql file and it finished in 523m22.189s.

I wonder how much performance we could squeeze out of Python with some tricks. It would also be interesting to see how much running with PyPy, or compiling with mypyc, Nuitka, Shed Skin, or even Cython, could improve performance.

> The archive size of a full backup with compression is 64.585 GB, while an incremental backup is only 7.948 GB, which is a significant reduction.

Cool! Could I mention this as an example in the README?

@oldcai (Collaborator) commented Apr 24, 2024

> I wonder how much performance we could squeeze out of Python with some tricks. It would also be interesting to see how much running with PyPy, or compiling with mypyc, Nuitka, Shed Skin, or even Cython, could improve performance.

It's already acceptable to me since I only do a couple of backups a week.

I believe the Python 3.12 per-interpreter GIL feature would help more, but it will be a long time before it comes to a third-world OS. 😆

> Cool! Could I mention this as an example in the README?

Sure, I'd be glad if it could help convince more people to use it and reduce their carbon footprint.

Successfully merging this pull request may close these issues:

  • Use Too Much Memory, Killed by System