Git Compatible Mastodon Storage Format #21
Comments
Usage of the ZARR file format has been checked by @maarzt. There are too few existing Java libraries to use ZARR for the purpose of this issue.
A simple counter seems to be the better solution for the implementation. Reasons are mentioned above.
Compared file reading performance: UUID vs. int32. The results were very counterintuitive. Reading a 128-bit from a file plus lookup in a ...
This is a part of: #12
Currently Mastodon provides two storage formats: a folder or a *.mastodon file. Both formats may cause problems when used inside a git repository:
Possible solutions:
Text based
Binary files divided into blocks
Introduce a key
Mastodon rewrites the spot ids when saving a project. Non-constant spot ids are problematic: a small change in the ModelGraph can easily change a large number of spot ids, which prevents efficient storage (delta compression) of multiple versions with git. It is therefore necessary to have a key value that normally doesn't change.
The data in a Mastodon project could be expressed in two tables
Additionally, tables for properties should be created. These cover tagsets and features.
These tables could be stored in a simple chunked binary format (the easiest solution would use DataOutputStream, which is Java-specific) or as ZARR tables (which could potentially also be read from Python scripts). A table should be sorted by the key and be chunked. For example, keys 0-999 are written into the first file, keys 1000-1999 into the second file, etc.
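A minimal sketch of the chunked binary idea, assuming integer keys, chunks of 1000 consecutive keys, and a row that consists only of the key followed by a few doubles. The class name `ChunkedTableSketch`, the row layout, and the chunk size are hypothetical and only illustrate the technique; this is not Mastodon's actual format.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: encodes a key-sorted table into fixed-range binary chunks. */
class ChunkedTableSketch {

    /** Number of consecutive keys covered by one chunk (assumed value). */
    static final int CHUNK_SIZE = 1000;

    /** Index of the chunk that holds the given non-negative key. */
    static int chunkIndex(int key) {
        return key / CHUNK_SIZE;
    }

    /**
     * Encodes each row as an int key followed by its double values.
     * The TreeMap iterates in key order, so rows inside a chunk stay sorted.
     * Returns a map from chunk index to the bytes of that chunk.
     */
    static TreeMap<Integer, byte[]> encode(TreeMap<Integer, double[]> table) {
        TreeMap<Integer, ByteArrayOutputStream> buffers = new TreeMap<>();
        try {
            for (Map.Entry<Integer, double[]> row : table.entrySet()) {
                ByteArrayOutputStream buffer = buffers.computeIfAbsent(
                        chunkIndex(row.getKey()), i -> new ByteArrayOutputStream());
                DataOutputStream out = new DataOutputStream(buffer);
                out.writeInt(row.getKey());
                for (double value : row.getValue()) {
                    out.writeDouble(value);
                }
                out.flush();
            }
        } catch (IOException e) {
            // Not expected for in-memory streams; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
        TreeMap<Integer, byte[]> chunks = new TreeMap<>();
        buffers.forEach((index, buffer) -> chunks.put(index, buffer.toByteArray()));
        return chunks;
    }
}
```

Each returned byte array could then be written to its own file. As long as keys are stable, editing one spot changes only the chunk that contains its key, which is what keeps the diff between two git versions small.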
Using UUID as key value
unevenly distributed
Possible alternative to UUID: use a simple counter (integer or long). Each new spot receives an id from this counter. When a spot is deleted, its id is never assigned again.
If spots are deleted, the resulting "holes" are not filled. Deleted spots must subsequently be removed from the link, tag and feature tables.
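The counter approach could look roughly like the following sketch. The class `SpotKeyAllocator` and its methods are hypothetical names, not part of Mastodon. The sketch also assumes that the counter value is saved with the project and passed back in on load, since otherwise ids of deleted spots could be handed out again after reopening.

```java
import java.util.TreeSet;

/** Hypothetical sketch: hands out stable spot keys from a counter and never reuses them. */
class SpotKeyAllocator {

    /** Next key to hand out; only ever increases. */
    private long nextKey;

    /** Keys of spots that currently exist. */
    private final TreeSet<Long> liveKeys = new TreeSet<>();

    /** firstUnusedKey is the counter value stored with the project (0 for a new project). */
    SpotKeyAllocator(long firstUnusedKey) {
        this.nextKey = firstUnusedKey;
    }

    /** Creates a spot and returns its new key. */
    long createSpot() {
        long key = nextKey++;
        liveKeys.add(key);
        return key;
    }

    /** Deletes a spot; its key leaves a hole that is never filled. */
    boolean deleteSpot(long key) {
        return liveKeys.remove(key);
    }

    /** Counter value that would have to be persisted alongside the tables. */
    long firstUnusedKey() {
        return nextKey;
    }

    /** Copy of the keys of all existing spots, in ascending order. */
    TreeSet<Long> liveKeys() {
        return new TreeSet<>(liveKeys);
    }
}
```

Removing the deleted key from the link, tag and feature tables is left out here; in a real implementation `deleteSpot` would have to trigger that cleanup as well.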
Text format using UUID (not-preferred)
Each spot and link gets a UUID. The spot table and link tables are stored in text files with rows:
The text format has the advantages:
and disadvantages:
It's probably still necessary to divide the text file into chunks in order to save memory when using git.
TODOs: