Support large tables which don't fit in RAM #13
Conversation
@oldcai could you try whether this branch …
Force-pushed from fdc0909 to 4a13dea.
Force-pushed from d9685d7 to 4fbe76f.
@oldcai I've now finalized this fix. Would you accept my invitation to become a collaborator on this repository and review the pull request?
Uses up to 100 MB of memory by default before splitting table data into temporary files and merge sorting them (a sketch of this approach follows below).
Also sorted imports.
In case of failure, this provides us with the actual output before checking the position.
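As a rough illustration of the approach described above (buffer lines up to a memory limit, spill sorted chunks to temporary files, then merge them), here is a minimal sketch. The function name `sort_lines_external`, the byte accounting, and the assumption that input lines are newline-terminated are illustrative assumptions, not the project's actual implementation.

```python
import heapq
import sys
import tempfile
from typing import IO, Iterable, Iterator, List


def sort_lines_external(lines: Iterable[str], max_memory: int = 100 * 2**20) -> Iterator[str]:
    """Yield ``lines`` in sorted order while buffering only ~``max_memory`` bytes.

    Assumes every line is newline-terminated.  Whenever the rough size estimate
    of the buffer exceeds the limit, the buffer is sorted and spilled to a
    temporary file; the sorted chunk files are then combined with a k-way merge.
    """
    chunk_files: List[IO[str]] = []
    buffer: List[str] = []
    buffered_bytes = 0

    def spill() -> None:
        """Sort the current buffer and write it out as one sorted chunk file."""
        buffer.sort()
        chunk = tempfile.TemporaryFile(mode="w+")
        chunk.writelines(buffer)
        chunk.seek(0)
        chunk_files.append(chunk)
        buffer.clear()

    for line in lines:
        buffer.append(line)
        # Rough accounting: the shallow size of each buffered string.
        buffered_bytes += sys.getsizeof(line)
        if buffered_bytes > max_memory:
            spill()
            buffered_bytes = 0

    if not chunk_files:
        # Everything fit in memory -- no temporary files needed.
        buffer.sort()
        yield from buffer
        return

    if buffer:
        spill()
    # heapq.merge lazily merges the already-sorted chunk files line by line.
    yield from heapq.merge(*chunk_files)
    for chunk in chunk_files:
        chunk.close()
```

Usage would look roughly like `for line in sort_lines_external(open("dump.sql")): sys.stdout.write(line)`, where `dump.sql` is a hypothetical input file.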
Thank you for your efforts. I'm honoured to do that.
✅ I downloaded the code and ran it on my dev environment with some test data successfully. Then I removed the annotation code and ran it successfully as well. I'm currently running a benchmark on a 167G file and will continue reviewing the code after it finishes. ❓ Additionally, I found an interesting fact: although the memory limit is set to …
Good to hear that it works at least in a simple case!
Ah, sorry about that. I'm trying to be a good Python community citizen and encourage upgrading from unsupported Python versions, but I do recognize that in RedHat/CentOS land the versions drag a bit behind... Would a Docker image help you?
Interesting! The documentation for sys.getsizeof() clearly says that it should return the size of objects in bytes, and this patch counts both each string stored in the buffer and the overhead size of the buffer list object itself. Are you going to experiment with different memory limits? Would it make sense to make the limit configurable by a command line option?
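To make the accounting discussed here concrete, the hypothetical helper below estimates a buffer's footprint the same way: the list's own shallow size plus the shallow size of each string. The helper name and sample data are illustrative, not taken from the patch.

```python
import sys


def estimated_buffer_size(buffer: list) -> int:
    """Estimate the buffer footprint: the list's shallow size plus each string's size.

    sys.getsizeof() reports only shallow object sizes and ignores interpreter
    and allocator overhead, so the real process memory will exceed this figure.
    """
    return sys.getsizeof(buffer) + sum(sys.getsizeof(line) for line in buffer)


buffer = ["INSERT INTO `t` VALUES (1);\n", "INSERT INTO `t` VALUES (2);\n"]
print(estimated_buffer_size(buffer))  # a few hundred bytes on CPython
```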
add max_memory command line option.
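The commit above adds a max_memory command line option. As a hedged sketch of how such an option could be wired up with argparse; the flag spelling, size-suffix parsing, and program description are assumptions and may differ from the project's actual command line:

```python
import argparse


def parse_size(value: str) -> int:
    """Turn strings like '100M' or '2G' into a byte count (suffix handling is assumed)."""
    units = {"K": 2**10, "M": 2**20, "G": 2**30}
    suffix = value[-1].upper() if value else ""
    if suffix in units:
        return int(float(value[:-1]) * units[suffix])
    return int(value)


parser = argparse.ArgumentParser(description="Sort table data using a bounded amount of memory")
parser.add_argument(
    "--max-memory",
    type=parse_size,
    default=100 * 2**20,  # 100 MB, matching the default mentioned above
    help="approximate buffer size before sorted chunks are spilled to temporary files",
)
args = parser.parse_args()
print(f"Buffering up to {args.max_memory} bytes before using temporary files")
```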
Later, I found an issue that might make the sort result unstable and slightly affect performance: the sorting in MergeSort would not use … Since in-memory sorting is fast, I assume it won't add much computing time. I fixed this issue and modified the unit test for this case (see the sketch below), and I'll run another benchmark later to see if there is a difference in performance.
I think that's all I've got for now. Thank you for your great project.
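The sketch below shows one common way to keep a k-way merge stable, which is the kind of fix described above; it is not the actual patch. Each line is decorated with the index of the chunk it came from so that ties between equal lines are broken by chunk order, making the tie-breaking explicit rather than relying on implementation details of heapq.merge.

```python
import heapq
from typing import IO, Iterator, List


def stable_merge(chunk_files: List[IO[str]]) -> Iterator[str]:
    """Merge pre-sorted chunk files so that equal lines keep chunk order.

    Decorating each line as (line, chunk_index) means equal lines compare by
    the chunk they were written from, so the external sort behaves like a
    stable sort overall.
    """
    decorated = [
        ((line, index) for line in chunk)
        for index, chunk in enumerate(chunk_files)
    ]
    for line, _index in heapq.merge(*decorated):
        yield line
```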
I believe the memory inaccuracy is due to the sys.getsizeof API and won't affect performance too much.
I made a patch for the unstable sorting issue, and it needs your review, @akaihola.
Thanks @oldcai, looks good! I changed the command line to use an optional … I rebased the commits a bit and force-pushed.
Nice implementation! Tested successfully on small data.
Thank you for your great work!
I tested it with a 444G … The archive size of a full backup with compression is …
I wonder how much performance we could squeeze out of Python with some tricks. It would also be interesting to look at how much running under PyPy, or compiling with mypyc, Nuitka, Shed Skin, or even Cython, could improve performance.
Cool! Could I mention this as an example in the README?
It's already acceptable to me since I only do a couple of backups a week. I believe the per-interpreter GIL feature in Python 3.12 would help more, but it will surely be a long time before it makes its way to my outdated OS. 😆
Sure, I'd be glad if it could help convince more people to use it and save a bit of carbon footprint.
Fixes #1
Use merge sort with temporary files to put a cap on memory usage.