Compression algorithms

Implemented most popular compression algorithms and presented results
Each algorithm is implemented in eponymous files

LZ77

The main idea is that you look backward in data to find repeating of the following symbols in the buffer. Save compressed data in a sequence of tuples (offset, length, next symbol).

offset - shows how many symbols back you should go back
length - shows how many symbols you must copy from the offset marker. If it is more than offset, then it goes into a cycle
next symbol - a letter that you should add after copping

Example:

Initial: 'a a b b b c a b a b c b c'

Encoded: (0, 0, 'a'), (1, 1, 'b'), (1, 2, 'c'), (5, 2, 'a'), (5, 2, 'b'), (0, 0, 'c')

LZW

The feature of this algorithm is that we create a dictionary to encode/decode text.
We are looking from the current symbol to the following and find if the string that they form is in the dictionary. If yes, we add its key to the result. If not, we add the substring to the dictionary.

Example:

Initial: 'a b c a b a c b c b a b c'

Encoded: '0 1 2 3 0 2 4 1 3 2'

{0: 'a', 1: 'b', 2: 'c', 3: 'ab', 4: 'bc', 5: 'ca', 6: 'aba', 7: 'ac', 8: 'cb', 9: 'bcb', 10: 'ba', 11: 'abc'})

Huffman

It is a simple but effective algorithm.

Find how many times each symbol is repeated in the text
Create Huffman binary tree
Match letters with their codes

So that symbols with a higher appearance rate will be encoded with shorter code in the prefix code set, and the average length will be the least.

Example:

Initial: 'a b c d d d b c b b b b a'

Encoded: '000100101010110011111000'

{'a': '000', 'c': '001', 'd': '01', 'b': '1'})

Deflate

It is a widely distributed, modified algorithm that combines LZ77 and Huffman coding. That is why deflate encoding provides a higher ratio for compression.

Encode data with the Huffman algorithm
Use LZ77 to encode the sequence of { 0|1 }

Initial: 'd b c b a b c b b c a b a a a'

Huffman: '0111010100101011010001000000'

{'a': '00', 'c': '010', 'd': '011', 'b': '1'}

Deflate: (0, 0, '0'), (0, 0, '1'), (1, 2, '0'), (2, 4, '0'), (7, 5, '1'), (10, 4, '0'), (4, 4, '0'), (1, 1, '0')

Tests

To test scripts were created, more than 1000 tests were in text format with strings of different lengths and formats. Related functions to create test strings are implemented in generate_texts.py and saved in directory Test_cases.

There are cases of 5 types:

Full random - letters chouses absolutely randomly
Typical - with distribution for letter occurrence possibility
Real - text from real documents, books
Cyclic - all text consists of one string repeated many times
Repetitive - there are words in a string that repeats many times among randomly chosen letters.

Results

Algorithms were tested on different strings and its results were saved in directory: Tests_results in .csv files

Functions to run tests and write results, are implemented in run_tests.py.
Visualised results were created with module visualization.py

Compress

As we can see, all algorithms have linear time complexity to compress and decompress text string O(n), n=length. But with different constants.
LZ77 and, consequently, Deflate algorithms have the bigger constant and take more time to compress data.
The fastest compress algorithm is Huffman. It works ~10 times faster than others.

Decompress

In contrast, Huffman and, consequently, Deflate algorithms have the bigger constant to decompress data.
LZ77 and LZW work 3-5 times faster than Huffman and Deflate
LZW is the fastest algorithm to compress and decompress data.

Compressing rate

All algorithms have a higher compression rate for cyclic text and a few higher for repetitive, with a slight difference between real and random tests.
Huffman algorithm has the less difference beetwen cyclic test cases and other

On the contrary, LZW is extremely effective for cyclic tests

By types

Full random texts take the most time to encode, while cyclic are
Repetitive tests are easier to compress but not so much as real and typical tests

- In decompression the difference is even less, but cyclic tests still are easier to decompress in ~1.5 times

More plots are saved in Plots repositoty.

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
Plots		Plots
Test_cases		Test_cases
Tests_results		Tests_results
README.md		README.md
deflate.py		deflate.py
generate_texts.py		generate_texts.py
huffman.py		huffman.py
lz77.py		lz77.py
lzw.py		lzw.py
run_tests.py		run_tests.py
visualization.py		visualization.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Compression algorithms

LZ77

LZW

Huffman

Deflate

Tests

Results

Compress

Decompress

Compressing rate

By types

About

Uh oh!

Releases

Packages

Languages

sviat-l/Compression-algorithms

Folders and files

Latest commit

History

Repository files navigation

Compression algorithms

LZ77

LZW

Huffman

Deflate

Tests

Results

Compress

Decompress

Compressing rate

By types

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages