Warning: There is not enough compatibility testing, so do not save important data using this package yet.
A consistent Julia interface for lossless encoding and decoding of bytes in memory.
This monorepo holds a number of different Julia packages:
ChunkCodecCore
: defines the interface.ChunkCodecTests
: defines tests for the interface.
There are also a number of packages with glue code to implement the interface for various C libraries.
LibBlosc
LibBzip2
LibLz4
LibSnappy
LibZlib
LibZstd
Each package contains basic tests.
There is also a test
directory for slower compatibility tests.
Let say you want to encode some data into the gzip (.gz) format.
julia> data = zeros(UInt8, 1000);
First load the package with the encoding options you want to use.
In this case ChunkCodecLibZlib
julia> using ChunkCodecLibZlib
Next create a GzipEncodeOptions
and select options to tune performance.
julia> e = GzipEncodeOptions(;level=4)
GzipEncodeOptions(4)
julia> e isa ChunkCodecCore.EncodeOptions
true
Most of the parameters in an EncodeOptions
are not needed to be able to
decode back to the original data.
The Codec
object in the codec
property has the meta data required for decoding.
julia> gz_codec = e.codec
GzipCodec()
julia> gz_codec isa ChunkCodecCore.Codec
true
GzipCodec
is empty because gzip decoding doesn't require additional meta data.
You can look at the docstring help?> GzipCodec
to learn more about the format.
Finally you can encode the data:
julia> gzipped_data = encode(e, data)
29-element Vector{UInt8}:
encode
will throw an error if the following conditions aren't met.
- The input data is stored in contiguous memory.
- The input length is in
ChunkCodecCore.decoded_size_range(e)
For example:
julia> encode(ChunkCodecLibBlosc.BloscEncodeOptions(), @view(zeros(UInt8, 8)[1:2:end]))
ERROR: ArgumentError: vector is not contiguous in memory
julia> encode(ChunkCodecLibBlosc.BloscEncodeOptions(), zeros(UInt8, Int64(2)^32))
ERROR: ArgumentError: src_size ∈ 0:1:1073741824 must hold. Got
src_size => 4294967296
julia> encode(ChunkCodecLibBlosc.BloscEncodeOptions(;typesize=3), zeros(UInt8,7))
ERROR: ArgumentError: src_size ∈ 0:3:1073741823 must hold. Got
src_size => 7
Lets say you have trusted gzipped data you want to decode.
julia> using ChunkCodecLibZlib
julia> data = zeros(UInt8, 1000);
julia> gzipped_data = encode(GzipEncodeOptions(;level=4), data);
julia> un_gzipped_data = decode(GzipCodec(), gzipped_data);
julia> un_gzipped_data == data
true
There are many potential issues when decoding untrusted data.
If decoding fails because the input data is not valid,
decode
throws a DecodingError
.
For example:
julia> bad_gzipped_data = copy(gzipped_data);
julia> bad_gzipped_data[end-4] ⊻= 0x01;
julia> decode(GzipCodec(), bad_gzipped_data);
ERROR: LibzDecodingError: incorrect data check
julia> LibzDecodingError <: ChunkCodecCore.DecodingError
true
The encoded input may also contain a zip bomb.
The max_size
keyword argument causes decode
to throw a ChunkCodecCore.DecodedSizeError
if decoding fails because the output size would be greater than max_size
. By default max_size
is typemax(Int64)
.
For example:
julia> decode(GzipCodec(), gzipped_data; max_size=100);
ERROR: DecodedSizeError: decoded size is greater than max size: 100
If you have a good idea of what the decoded size is, using the size_hint
keyword argument
can greatly improve performance.
For example:
julia> @time decode(GzipCodec(), gzipped_data; size_hint=1000, max_size=1000);
0.000016 seconds (4 allocations: 8.227 KiB)
julia> @time decode(GzipCodec(), gzipped_data; max_size=1000);
0.000018 seconds (9 allocations: 42.195 KiB)
If ChunkCodecCore.is_thread_safe(::Union{Codec, DecodeOptions, EncodeOptions})
returns true
it is safe to use the options to encode or decode concurrently in multiple threads.
https://github.com/JuliaIO/TranscodingStreams.jl
Filters in https://github.com/JuliaIO/HDF5.jl