
validate dir_archive #56

Open
koreiter opened this issue Jan 5, 2018 · 4 comments
@koreiter

koreiter commented Jan 5, 2018

Great job and thanks for this lib!
Short description:
I need a tool that will validate a dir_archive, remove invalid key-value pairs, and put the whole archive into a valid state (it may lose some data).

Long description:
I use dir_archive to store my app's cache.
The cache holds about 4k elements and takes around 400MB in total, so dumping takes several minutes.
If something happens to the application during dumping (e.g. a computer shutdown), the archive may be left in an invalid state.
This could be exposed as an additional constructor parameter that is False by default (e.g. remove_invalid_records_silently=False).
Of course, changing the library so that the archive is always in a valid state would be appreciated, but I guess that is not so easy.
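
A minimal sketch of how the proposed flag might look in use (the parameter below is my proposal, not an existing part of klepto's API):

from klepto.archives import dir_archive

# Proposed (not existing) parameter: when True, loading would silently
# drop any key-value pair whose on-disk record is corrupt or incomplete,
# instead of raising an error.
archive = dir_archive('path', remove_invalid_records_silently=True)
archive.load()  # corrupt records are dropped; the rest load normally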

@mmckerns
Member

@koreiter: can you give some more details of what you are looking for by providing some examples or pseudocode or something like that?

What would an "invalid" key-value pair look like? Would it always be that the file is somehow invalid? You'd like to somehow detect the file is in an invalid state (e.g. the file is corrupt or missing) -- maybe by using metadata about the file, or by reading it, or something... and then do what?

@koreiter
Author

My code can be simplified to:

from klepto.archives import dir_archive

class LargeLists:
    def __init__(self, number):
        self.load = [number] * 10000

lists = [LargeLists(i/13) for i in range(1000)]
archive = dir_archive('path')
for key, obj in enumerate(lists):
    archive[key] = obj
archive.dump()  # writes the in-memory cache to disk; this is the slow step

I stopped execution partway through (simulating an unexpected app crash/stop, etc.).
I've attached a zip of the last successful object and the one that was being written when execution stopped.
path.zip

Here we can use the folder name to spot an invalid object (we have to ensure that no one uses the "I_*" pattern for their keys). Of course, we also have to make sure we don't delete an object that is currently being written.
But this is not the only way an object can fail. Right now I can't remember what it looked like, but it always raised KeyErrors. I've written code that protects me from such errors:

import os
import re
import shutil
from time import time

from klepto.archives import dir_archive

def remove_key_from_archive(key, archive_path):
    # A record for `key` is stored on disk as a "K_<key>" folder.
    dir_name = "K_{}".format(key)
    full_path = os.path.join(archive_path, dir_name)
    try:
        shutil.rmtree(full_path)
    except FileNotFoundError:
        pass

def find_invalid_folders_from_archive(archive_path):
    # Half-written records show up as folders matching "K_I_*".
    pattern = "K_I_.*"
    prog = re.compile(pattern, flags=re.IGNORECASE)
    try:
        for elem in os.listdir(archive_path):
            if prog.fullmatch(elem) is not None:
                yield elem
    except FileNotFoundError:
        return  # no archive directory yet, so nothing to yield

def remove_old_invalid_folders(archive_path, seconds_old=10):
    # Only delete invalid folders older than `seconds_old`, so we don't
    # remove a record that is currently being written.
    cnt = 0
    invalid_folders = find_invalid_folders_from_archive(archive_path)
    for folder in invalid_folders:
        full_path = os.path.join(archive_path, folder)
        try:
            time_diff = time() - os.path.getmtime(full_path)
            if time_diff > seconds_old:
                shutil.rmtree(full_path)
                cnt += 1
        except FileNotFoundError:
            continue
    return cnt

def get_loaded_and_validated_archive(path):
    # Repeatedly try a full load, removing any key that raises KeyError.
    for i in range(10000):
        remove_old_invalid_folders(path)
        archive = dir_archive(path)
        try:
            archive.load()
            return archive
        except KeyError as e:
            remove_key_from_archive(e.args[0], path)
    raise IOError("could not load a valid archive")

In my situation (since I'm using klepto for caching purposes), removing invalid objects is a great solution, because I can regenerate everything that was deleted.
Unfortunately, this solution is very slow, as I have to try to load the large archive multiple times (in my case a single load can take 10-15 seconds).
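
A sketch of a possibly faster approach, assuming an uncached dir_archive (cached=False) supports keys() and per-key reads from disk; then each corrupt record costs only one failed read instead of a full reload:

from klepto.archives import dir_archive

def load_valid_entries(path):
    # Direct (uncached) view: each raw[key] access reads that single
    # entry from disk, so one corrupt record doesn't abort a full load.
    raw = dir_archive(path, cached=False)
    good, bad = {}, []
    for key in list(raw.keys()):
        try:
            good[key] = raw[key]
        except Exception:  # e.g. KeyError or an unpickling error
            bad.append(key)
    return good, bad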

@mmckerns
Member

So, let me distill this -- and you tell me if I'm correct. The idea is that if there's a file that is corrupt in any way... basically, if any of the files result in a KeyError (or other error) when loading, they can be ignored/removed.

So, something like a validate_archive method, to check the integrity of the archive -- returning any bad keys. Once you know which keys are bad, they can be removed. There's also the potential to provide a list of keys to ignore on read.
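
For concreteness, a rough sketch of what I mean (validate_archive doesn't exist in klepto today; this assumes per-key reads work through an uncached archive):

from klepto.archives import dir_archive

def validate_archive(path):
    """Return the keys whose stored values cannot be read back."""
    raw = dir_archive(path, cached=False)  # read entries directly from disk
    bad_keys = []
    for key in list(raw.keys()):
        try:
            raw[key]  # attempt to read the stored value
        except Exception:  # corrupt or missing record
            bad_keys.append(key)
    return bad_keys

# The returned keys could then be removed (e.g. with the
# remove_key_from_archive helper above) before a normal load().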

This makes sense, especially if there's some way to do it faster than trying to read all the keys, then failing... seeing which key is bad... removing it, and then repeating this until all keys read successfully.

Is that your request, boiled down to the essentials? Any other thoughts?

@koreiter
Author

koreiter commented Jun 8, 2018

That's exactly right.
