
validate dir_archive #56

Open
koreiter opened this issue Jan 5, 2018 · 4 comments
@koreiter

koreiter commented Jan 5, 2018

Great job and thanks for this lib!
Short description:
I need a tool that will validate a dir_archive, remove invalid key-value pairs, and put the whole archive into a valid state (it may lose some data).

Long description:
I use dir_archive to store my app's cache.
The cache holds about 4k elements and takes around 400MB in total, so dumping takes several minutes.
If something happens to the application during dumping (e.g. a computer shutdown), the archive may be left in an invalid state.
This could be exposed as an additional constructor parameter that is False by default (e.g. remove_invalid_records_silently=False).
Of course, changing the library so that the archive is always in a valid state would be appreciated, but I guess that is not so easy.
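
A minimal sketch of how the proposed flag might look in use (the parameter below is my proposal, not an existing part of klepto's API):

from klepto.archives import dir_archive

# Proposed (not existing) parameter: when True, loading would silently
# drop any key-value pair whose on-disk record is corrupt or incomplete,
# instead of raising an error.
archive = dir_archive('path', remove_invalid_records_silently=True)
archive.load()  # corrupt records are dropped; the rest load normally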

@mmckerns
Member

@koreiter: can you give some more details of what you are looking for by providing some examples or pseudocode or something like that?

What would an "invalid" key-value pair look like? Would it always be that the file is somehow invalid? You'd like to somehow detect the file is in an invalid state (e.g. the file is corrupt or missing) -- maybe by using metadata about the file, or by reading it, or something... and then do what?

@koreiter
Author

My code can be simplified to:

from klepto.archives import dir_archive

class LargeLists:
    def __init__(self, number):
        self.load = [number] * 10000

lists = [LargeLists(i/13) for i in range(1000)]
archive = dir_archive('path')
for key, obj in enumerate(lists):
    archive[key] = obj
archive.dump()  # writes the in-memory cache to disk; this is the slow step

I stopped execution partway through (simulating an unexpected app crash/stop, etc.).
I've attached a zip of the last successful object and the one that was being written when execution stopped.
path.zip

Here we can use the folder name to spot an invalid object (we have to ensure that no one uses the "I_*" pattern for their keys). Of course, we also have to make sure we don't delete an object that is currently being written.
But this is not the only way an object can fail. Right now I can't remember what it looked like, but it always raised KeyErrors. I've written code that protects me from such errors:

import os
import re
import shutil
from time import time

from klepto.archives import dir_archive

def remove_key_from_archive(key, archive_path):
    # A record for `key` is stored on disk as a "K_<key>" folder.
    dir_name = "K_{}".format(key)
    full_path = os.path.join(archive_path, dir_name)
    try:
        shutil.rmtree(full_path)
    except FileNotFoundError:
        pass

def find_invalid_folders_from_archive(archive_path):
    # Half-written records show up as folders matching "K_I_*".
    pattern = "K_I_.*"
    prog = re.compile(pattern, flags=re.IGNORECASE)
    try:
        for elem in os.listdir(archive_path):
            if prog.fullmatch(elem) is not None:
                yield elem
    except FileNotFoundError:
        return  # no archive directory yet, so nothing to yield

def remove_old_invalid_folders(archive_path, seconds_old=10):
    # Only delete invalid folders older than `seconds_old`, so we don't
    # remove a record that is currently being written.
    cnt = 0
    invalid_folders = find_invalid_folders_from_archive(archive_path)
    for folder in invalid_folders:
        full_path = os.path.join(archive_path, folder)
        try:
            time_diff = time() - os.path.getmtime(full_path)
            if time_diff > seconds_old:
                shutil.rmtree(full_path)
                cnt += 1
        except FileNotFoundError:
            continue
    return cnt

def get_loaded_and_validated_archive(path):
    # Repeatedly try a full load, removing any key that raises KeyError.
    for i in range(10000):
        remove_old_invalid_folders(path)
        archive = dir_archive(path)
        try:
            archive.load()
            return archive
        except KeyError as e:
            remove_key_from_archive(e.args[0], path)
    raise IOError("could not load a valid archive")

In my situation (since I'm using klepto for caching purposes), removing invalid objects is a great solution, because I can regenerate everything that was deleted.
Unfortunately, this solution is very slow, as I have to try to load the large archive multiple times (in my case a single load can take 10-15 seconds).
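
A sketch of a possibly faster approach, assuming an uncached dir_archive (cached=False) supports keys() and per-key reads from disk; then each corrupt record costs only one failed read instead of a full reload:

from klepto.archives import dir_archive

def load_valid_entries(path):
    # Direct (uncached) view: each raw[key] access reads that single
    # entry from disk, so one corrupt record doesn't abort a full load.
    raw = dir_archive(path, cached=False)
    good, bad = {}, []
    for key in list(raw.keys()):
        try:
            good[key] = raw[key]
        except Exception:  # e.g. KeyError or an unpickling error
            bad.append(key)
    return good, bad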

@mmckerns
Member

So, let me distill this -- and you tell me if I'm correct. The idea is that if there's a file that is corrupt in any way... basically, if any of the files result in a KeyError (or other error) when loading, they can be ignored/removed.

So, something like a validate_archive method, to check the integrity of the archive -- returning any bad keys. Once you know which keys are bad, they can be removed. There's also the potential to provide a list of keys to ignore on read.
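
For concreteness, a rough sketch of what I mean (validate_archive doesn't exist in klepto today; this assumes per-key reads work through an uncached archive):

from klepto.archives import dir_archive

def validate_archive(path):
    """Return the keys whose stored values cannot be read back."""
    raw = dir_archive(path, cached=False)  # read entries directly from disk
    bad_keys = []
    for key in list(raw.keys()):
        try:
            raw[key]  # attempt to read the stored value
        except Exception:  # corrupt or missing record
            bad_keys.append(key)
    return bad_keys

# The returned keys could then be removed (e.g. with the
# remove_key_from_archive helper above) before a normal load().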

This makes sense, especially if there's some way to do it faster than trying to read all the keys, then failing... seeing which key is bad... removing it, and then repeating this until all keys read successfully.

Is that your request, boiled down to the essentials? Any other thoughts?

@koreiter
Author

koreiter commented Jun 8, 2018

That's exactly right.
