Added CRAM_codecs document to describe the proposed CRAM3.1 codecs. #433

Merged (1 commit into samtools:master on Mar 15, 2023)

Conversation

@jkbonfield (Contributor) commented Jul 19, 2019

This also has the CRAM3.0 rANS codec in it. ~~I haven't yet removed this from CRAMv3.tex.~~ Edit: CRAMv3.tex also now updated.

A prebuilt copy of the documents is attached.

Code implementing these codecs can be seen in https://github.com/samtools/htscodecs

The C code is a bit gnarly - it's suffering from many iterations of changes and it's also heavily optimised, making it hard to read. The JavaScript code however is reasonably straightforward (if not always optimised for speed) and tries to be a vanilla implementation of the spec pseudocode, for purposes of validating it and acting as additional spec clarification. (That means it's not production quality and deliberately lacks a lot of niceties like testing return codes.)

CRAMcodecs.pdf
CRAMv3.pdf


Edit 13th Feb 2022. To test this version, feel free to use the current release of samtools/htslib (1.16 at time of writing).
You can evaluate CRAM v3.0 vs v3.1 using:

samtools view -T ref.fa -o out.30.cram -O cram,version=3.0 in.bam
samtools view -T ref.fa -o out.31.cram -O cram,version=3.1 in.bam

That's the default "normal" mode. You may also wish to try the fast, small and archive profiles for each, plus compression level can be edited. For example:

samtools view -T ref.fa -o out.30s7.cram -O cram,version=3.0,small,level=7 in.bam
samtools view -T ref.fa -o out.31s7.cram -O cram,version=3.1,small,level=7 in.bam

Demonstration files are available at ftp://ftp.sanger.ac.uk/pub/users/jkb/CRAM3.1/

@jkbonfield (Contributor Author)

Things up for discussion.

  • Is the adaptive range coder (sections 4.2 to 4.4) actually worth it? The impact is small and it's considerably slower than static rANS4x16. However note the range coder / modelling itself is needed (4.0 to 4.1) as they are used within the fqzcomp codec. I'll update this comment with hard numbers at some point.

  • Should DecodeX4 be changed to DecodeInterleaved with the "4" being a parameter? 4 was chosen as it's a common unit stride for 32-bit integer encoding, which occurs in a number of places, and it also works OK albeit suboptimally for encoding 16-bit integers. However I could perhaps see it may be useful for arbitrary structs or fixed-length strings.

@jkbonfield (Contributor Author)

If you want to actually try it in action, it's currently in the Scramble program, obtained by cloning io_lib:

git clone --recursive https://github.com/jkbonfield/io_lib
cd io_lib
./bootstrap
mkdir build
cd build
../configure && make

You'll need to specify "-V3.1" when running Scramble, and to get some of the newer codecs you can specify -X profile where profile is fast, normal, small or archive. For example:

./progs/scramble -t4 ~/scratch/data/novaseq.10m.cram /tmp/a.cram;ls -l /tmp/a.cram
# 208404582 bytes

./progs/scramble -t4 -V3.1 ~/scratch/data/novaseq.10m.cram /tmp/a.cram;ls -l /tmp/a.cram
# 175721206 bytes, adds use of rANS4x16pr and name tokeniser

./progs/scramble -t4 -V3.1 -X small ~/scratch/data/novaseq.10m.cram /tmp/a.cram;ls -l /tmp/a.cram
# 166815221 bytes, also enables fqzcomp and increases slice size to 25k seq.

./progs/scramble -t4 -V3.1 -X small -s 100000 ~/scratch/data/novaseq.10m.cram /tmp/a.cram;ls -l /tmp/a.cram
# 164338266 bytes.  As above, but with 100k seq/slice (to allow comparison vs archive)

./progs/scramble -t4 -V3.1 -X archive ~/scratch/data/novaseq.10m.cram /tmp/a.cram;ls -l /tmp/a.cram
# 162891038 bytes, also enables adaptive arithmetic coder for entropy encoding and within name tokeniser. Also bumps 100k seq per slice.

Figures for 10 million records from an old Illumina HiSeq run (40 quals):

./progs/scramble -N 10000000 -t4 ~/scratch/data/9827_2#49.cram /tmp/a.cram;ls -l /tmp/a.cram
# 573183206 bytes

./progs/scramble -N 10000000 -t4 -V3.1 ~/scratch/data/9827_2#49.cram /tmp/a.cram;ls -l /tmp/a.cram
# 551499530 bytes

./progs/scramble -N 10000000 -t4 -V3.1 -X small ~/scratch/data/9827_2#49.cram /tmp/a.cram;ls -l /tmp/a.cram
# 390685723 bytes

./progs/scramble -N 10000000 -t4 -V3.1 -X small -s 100000 ~/scratch/data/9827_2#49.cram /tmp/a.cram;ls -l /tmp/a.cram
# 377206097 bytes

./progs/scramble -N 10000000 -t4 -V3.1 -X archive ~/scratch/data/9827_2#49.cram /tmp/a.cram;ls -l /tmp/a.cram
# 374221295 bytes

Some codecs are clearly very justified, but the adaptive arithmetic coder (-X archive vs -X small -s 100000) may be an unnecessary level of complexity for the small gains.

@jmarshall (Member)

Mostly TeX-related trivia: IMHO this would be better as CRAMcodecs.pdf (without the underscore) to parallel SAMtags.pdf. Typesetting it fails because (1) the line that should be \input{CRAM_codecs.ver} still says CRAMv3.ver and (2) the image file wasn't committed.

@jkbonfield (Contributor Author)

Applied the other fixes too. Thanks.

@jkbonfield (Contributor Author)

A very minor query for the other maintainers.

The CRAM spec assigns a numeric ID to codecs (see Chapter 13 of the CRAMv3 spec). Should we keep those numeric IDs there, or should they be replicated in this document too? This document doesn't describe the encapsulation, which is where the ID comes from.

@jkbonfield (Contributor Author)

One other thought here is maybe we should add Zstd, and recommend it over Zlib (which is old hat) and over lzma, unless we're really pushing the boat out or keeping to lzma -1 to -3, which can be surprisingly effective.

Mid level zstd can be a good speed vs size tradeoff, almost always better than gzip or libdeflate.

@jmarshall (Member)

Sounds like the numeric ID is specific to the CRAM encapsulation so it'd be appropriate to just have it in the CRAM document — as long as the names of distinct codecs are obviously matched up between the two documents.

No firm thoughts either way: only in the CRAM document reduces duplication, but in both makes the correspondence crystal clear.

@jkbonfield (Contributor Author) commented Sep 20, 2019

That was my inclination too.

I updated it in the CRAM3 document and also removed the now duplicated rANS information. I'm keeping both changes in this PR for now as we won't want to cull this from the existing CRAM spec until the codec doc is signed off too.

@jkbonfield (Contributor Author)

Given this has been hanging around for a while and I'm thinking on format-breaking things that I want to fix for CRAM 4 (starting with 64-bit references), I started looking at the ideas in #144 (comment)

That mentions splitting up the compound rans4x16pr codec which has RLE + bit-packing + entropy encoding into separate data transform encodings and data compression codecs. It dawns on me that perhaps this is also something that fits in 3.1.

Initially, I wanted to try hacking the BETA, GAMMA, SUBEXP, etc encodings to not output to the CORE block but to be able to specify which external block ID they use. That's trivial, but then it dawned on me BETA is basically identical to my PACK method. Given values 0,1,2,3 we encode using BETA(offset=0,nbits=2) and pack 4 to a byte. That's already half of the new rans done. So my proposal is something like this:

Outputting QS data series using EXTERNAL(11), codec RANS4x16 order 193 (PACK+RLE+O1) is equiv to:

MAP("!-3E"->0123, XBETA(offset=2, nbits=2, RLE(EXTERNAL(50), EXTERNAL(51)))
with RANS0 on external block 50 (run lengths) and RANS1 on external block 51.

For this to be efficient though, it cannot be done a few bits at a time daisy chaining it through a series of function pointers. Instead we need to gather up the whole set of qualities, apply the map transform, apply the beta transform, apply the RLE transform, and then entropy encode the two resultant RLE buffers. That's easier to code up. However it also adds a difference to CRAM 3. If these new MAP, XBETA and RLE codecs are to be used then the data series involved must be "pure" and never interleaved with another data series. (In practice we do this already.)
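
To make the MAP + XBETA step concrete, here is a minimal C sketch for a 4-symbol quality alphabet: map the characters "!-3E" to 0..3 and pack four 2-bit values per output byte. Names and layout are illustrative only, not the htscodecs API or the serialised CRAM encoding.

#include <stdint.h>
#include <string.h>

/* Illustrative only: map a 4-symbol quality alphabet to 0..3 and pack
 * four 2-bit values per output byte (the MAP + 2-bit BETA/PACK step).
 * A real codec would also record the symbol map and the input length. */
static size_t pack_qs(const uint8_t *in, size_t n, uint8_t *out)
{
    uint8_t map[256];
    memset(map, 0, sizeof map);
    map['!'] = 0; map['-'] = 1; map['3'] = 2; map['E'] = 3;

    size_t o = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint8_t b = 0;
        for (size_t j = 0; j < 4 && i + j < n; j++)
            b |= map[in[i + j]] << (2 * j);
        out[o++] = b;
    }
    return o;       /* packed bytes, ready for RLE / entropy encoding */
}

The packed buffer would then feed the RLE transform and the two external blocks described above.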

I'm coding it up to give it a go, but please bear this in mind when reviewing this PR. If it works, we'll probably want to go with the more flexible algorithm. I'll also add DELTA in there too as this is needed for some machine specific data series (signals, traces, etc).

@jkbonfield (Contributor Author) commented Nov 12, 2019

Outputting QS data series using EXTERNAL(11), codec RANS4x16 order 193 (PACK+RLE+O1) is equiv to:

MAP("!-3E"->0123, XBETA(offset=2, nbits=2, RLE(EXTERNAL(50), EXTERNAL(51)))
with RANS0 on external block 50 (run lengths) and RANS1 on external block 51.

I coded this up, albeit rolling MAP and XBETA into a single "PACK" function, and it works as expected. A marginal slow down (about 7% hit encoding, 3% hit decoding), but gives identical sizes.

eg old (cram_dump snippets):

         QS =>         EXTERNAL {12}
...
        Block 3/30
            Size:         90532 comp / 1500000 uncomp
            Method:       RANS_PR193
            Content type: EXTERNAL
            Content id:   12
            Keys:         QS 
E-----3E3-3333E3-E3-33-33E33-E3EE3EEE33333E333E3EE3333E3EEEE33EEE3E3E3E333333333333E3E33EE3E3EEEEE33

New (cram_dump, manually edited for clarity):

         QS =>            PACK {2, 4[0, 12, 18, 36], RLE(1[255], EXTERNAL(44), EXTERNAL(12)))
...
        Block 3/31
            Size:         70165 comp / 157958 uncomp   # RLE literals
            Method:       RANS_PR1
            Content type: EXTERNAL
            Content id:   12
            Keys:         
...
        Block 23/31
            Size:         20330 comp / 44826 uncomp # RLE run-lengths
            Method:       RANS_PR0
            Content type: EXTERNAL
            Content id:   44
            Keys:         

Whether or not this is a better solution than putting all the logic into the compression codec is questionable. It feels cleaner though to me, although it opens up more work on how to optimise things.

It would also mean changing the definition of the name tokeniser and moving some of the pack and rle functions from rans into calls within the name tokeniser itself as for space-efficiency it serialises its own format rather than via a many-way CRAM external block structure.

@cmnbroad (Contributor) left a comment

I'm going to comment in tranches - a few inline to start, with more to come.

Also, we should think a bit about how we want to partition the two docs. I'd propose that it would make sense to keep all of the actual file format specification, including the serialized format of the codecs, in the main doc, so that the entirety of the file format is specified in one place. Then the supplementary doc can be used for the background, pseudo-code, and additional detail.

CRAMcodecs.tex Outdated
some (potentially many) symbols will have zero frequency. To reduce
the Order-1 table size an additional zero run-length encoding step is
used. Finally the Order-1 frequencies may optionally be compressed
using the Order-0 rANS 4x16 codec.
Contributor:

Is there a reason these encodings should be fixed and prescribed by the specification, rather than just defined to be nested sub-encodings (BYTE_ARRAY_LEN style) ?

@jkbonfield (Contributor Author):

Do you mean is there a reason the rANS 4x16 order-1 frequency table is explicitly encoded using order-0 rANS?

It's done to make the codec a single piece of code that may be used elsewhere and frankly it's as good an encoding as any I could find. It's possible we could split it up and have frequencies in one block, encoded with whatever method we wish, and data in a second. I don't really know if it buys us anything meaningful though.

There is a further question, originally raised by Vadim (and I wished I'd listened more at the time tbh too) of whether the rANS entropy encoder should have the frequencies stored in the preservation map and the "external" block part is purely the encoded data itself. Doing so would permit one frequency table and many blocks of data compressed using the same table, which is appropriate if we wish to use multi-slice containers. Potentially there is room here for better random access via such mechanisms. I forgot about that. If I was to change anything it'd be via that style, but it's still a bit complex and messy.

@jkbonfield (Contributor Author)

Fixed barring the discussion on 4x16 O1 table using 4x16 O0 encoder.

I've been experimenting with externalising most of the rANS4x16 data transforms to be separate CRAM encodings which can be nested. This works well, but we'll need to rewrite the name tokeniser in a way that can use these transforms too before running the rANS part.

The newer rANS is still worth it over 4x16, due to better encoding of frequency tables (especially order-1), but I think we can also improve on it by making it an Nx16 where N is 1, 4 or 32. This allows it to be efficient in size on small blocks and efficient in speed on very large blocks (where ~128 bytes worth of state isn't a problem). I already have SIMD vectorised versions at 32-way interleaving that run 2-3x faster than the original 4-way code so it may be good to permit exploitation of this. Writing a generic non-SIMD version is no harder for any N-way unrolling than for 4-way unrolling, so it's not needless overhead, and the spec as it's written just has a lot of for j = 0 to 3 style loops which become for j = 0 to N-1 instead.

Similarly the X4 4-way interleaving should probably be N-way interleaving, so we can try encoding any fixed structure size (eg 16-bit ints, longs, 12 char barcodes, etc).

@jkbonfield (Contributor Author)

Rebased, and also updated with newer ideas.

The rANS4x16 codec now has a choice of 10-bit or 12-bit frequencies for the order-1 frequency table. The 4x8 rANS codec has a fixed frequency total of 12 bits (4096). For speed, via reduced cache-miss pressure, this was changed to 10-bit for the 4x16 variant. However it was found to perform poorly on some extreme frequency distributions. Hence it's now configurable, so we get speed when it doesn't impact compression ratio.

The code (jkbonfield/htscodecs@40b6eeb) and spec (4f39e32) changes are trivial.

The 4-way interleaving (X4) transformation in rANS4x16 and arith codecs is now a generalised N-way stripe function. It's the same binary format bar a single extra byte to specify N. It's a bit more complex to describe - I struggled with how best to document it - but it's more generic and should in theory permit efficient compression of 16-bit quantities as well as 32-bit ones. (Currently it only really comes into play for encoding integers in the name tokeniser.)
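
For illustration, a minimal C sketch of the N-way stripe idea (every Nth byte gathered into its own sub-stream, e.g. N=4 for 32-bit integers). The real transform additionally serialises N and the compressed size of each sub-stream, and the names here are made up.

#include <stdint.h>
#include <stddef.h>

/* Gather every Nth byte into its own sub-stream so each byte position of a
 * fixed-width record (e.g. the 4 bytes of a 32-bit int) is compressed
 * separately.  destripe() reverses it. */
static void stripe(const uint8_t *in, size_t len, int N, uint8_t *out)
{
    size_t o = 0;
    for (int j = 0; j < N; j++)                  /* sub-stream j            */
        for (size_t i = j; i < len; i += (size_t)N)
            out[o++] = in[i];                    /* bytes j, j+N, j+2N, ... */
}

static void destripe(const uint8_t *in, size_t len, int N, uint8_t *out)
{
    size_t o = 0;
    for (int j = 0; j < N; j++)
        for (size_t i = j; i < len; i += (size_t)N)
            out[i] = in[o++];
}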

@jkbonfield (Contributor Author)

Note to @yfarjoun: this is killing latexdiff for reasons currently unknown.

CRAMv3.tex in master and this PR both build and view fine. The diff subdirectory produced by our circleci process however bombs out.

$ cd diff
$ TEXINPUTS=:..:../new pdflatex CRAMv3.tex
[... volumous latex waffle...]
l.2561 
       ] [31]
! Extra }, or forgotten $.
\textdef@ ...th {#1}\let \f@size #2\selectfont #3}
                                                  }
l.2916 }
        %DIFDELCMD < }
? 

The offending block of generated latex around 2901-2917 is:

%DIFDELCMD < {                                                                                                                                                  
%DIFDELCMD < \setlength{\parindent}{1cm}                                                                                                                        
%DIFDELCMD < %%%                                                                                                                                                
\DIFdel{%                                                                                                                                                       
\mbox{%DIFAUXCMD                                                                                                                                                
$ cfreq_{i} = \left\{                                                                                                                                           
\begin{array}{l l}                                                                                                                                              
0 & \quad \textrm{if $                                                                                                                                          
}%DIFAUXCMD                                                                                                                                                     
i < 1$\end{array} \\                                                                                                                                            
cfreq_{i-1} + freq_{i-1} & \quad \textrm{if $i \geq 1$}                                                                                                         
}                                                                                                                                                               
\right. $                                                                                                                                                       
%DIF <  \\*[8pt]                                                                                                                                                
%DIF <  \indent $cfreq_{i} = \displaystyle \sum_{j=0}^{i-1} freq_j$                                                                                             
}%DIFDELCMD < }                                                                                                                                                 
%DIFDELCMD <                                                                                                                                                    

Deleting that no longer causes the error.

@jkbonfield (Contributor Author)

The offending LaTeX that breaks the diff is this:

{
\setlength{\parindent}{1cm}
$ cfreq_{i} = \left\{
\begin{array}{l l}
0 & \quad \textrm{if $i < 1$} \\
cfreq_{i-1} + freq_{i-1} & \quad \textrm{if $i \geq 1$}
\end{array}
\right. $
% \\*[8pt]
% \indent $cfreq_{i} = \displaystyle \sum_{j=0}^{i-1} freq_j$
}

It's pretty ugly, using an array within math mode to do a table style layout. It could be done as an outer table and lots of open/close math mode within each cell to avoid this latexdiff bug I guess, but I'm inclined to leave it as-is. I can't fix it trivially without needing another PR (which itself would also break) first, before relaunching this diff again. The diffs are a nicety, but if they fail it's just a bit more work for us manually inspecting instead.
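
For reference, the "outer table" style mentioned above might look something like the following (an untested sketch; it drops the single \left\{ brace, but avoids the array-inside-math construct that trips latexdiff):

{
\setlength{\parindent}{1cm}
\begin{tabular}{l l}
$cfreq_{i} = 0$                        & if $i < 1$    \\
$cfreq_{i} = cfreq_{i-1} + freq_{i-1}$ & if $i \geq 1$ \\
\end{tabular}
}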

@jkbonfield (Contributor Author)

The CRAMv4.tex document has now been hived off into its own PR #502, given it's likely a lot more contentious and harder to implement than the new codecs in v3.1 alone.

Both have been rebased off master and diverge from that point, but touching separate files so it shouldn't be problematic.

@jkbonfield (Contributor Author) commented May 24, 2021

Sorry this is long, but it's hard to describe without some basic understanding of the implementation choices.
However in summary:

Do we want to potentially gain 5-30% speed when we have AVX2 or similar available, at a cost of 5-15% slow down when not? The spec change is minimal, and the use of the new variant is optional and only expected to be used on very large blocks of data - eg quality scores or "signal" aux tags.

In CRAM 3.1 we introduce a newer rANS codec. The primary differences are 16-bit renormalisation instead of 8-bit (for speed), and the addition of in-built RLE and bit-packing transformations. Ideally they would be separate transformation layers implemented as CRAM encodings (I have this IIRC in CRAM 4.0, but it's a bigger change and frankly a small standalone entropy encoder is a smaller pill to swallow than more significant format changes).

However I'm thinking of another change to this. Both the old "rANS 4x8" and new "rANS 4x16" have 4-way interleaving. Here's the pseudocode for 4x16:

[image: pseudocode for the order-0 rANS 4x16 decode loop]

Naively this is described as a mod on the i-th value to rotate through the 4 rANS states. The reason for doing this is that updating that state has various multiplies, which come with a latency. We cannot immediately execute the next CPU instruction if it has a dependency on the previous result. By interleaving multiple states, we drive the CPU at the frequency it can issue instructions and ignore the instruction latency. perf stat tells me decoding a block of data is executing 3.7 instructions per clock cycle. That's pretty optimal for scalar code.
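
As a rough C sketch of that interleaving (the table/IO helpers and constants are placeholders, not the exact rANSNx16 arithmetic): the four states carry no dependency on each other, so their multiplies can overlap.

#include <stdint.h>
#include <stddef.h>

/* Placeholders standing in for the frequency tables and input stream;
 * these are assumptions for illustration, not htscodecs functions. */
extern uint8_t  symbol_for(uint32_t f);        /* cumulative freq -> symbol */
extern uint32_t freq(uint8_t s);               /* freq[s]                   */
extern uint32_t cfreq(uint8_t s);              /* cumulative freq below s   */
extern uint32_t get_u16(const uint8_t **inp);  /* next 16 bits of input     */

/* Order-0 decode with 4 interleaved rANS states.  "i & 3" rotates through
 * the states so consecutive iterations have no data dependency. */
void decode_o0_4way(uint32_t R[4], const uint8_t **inp, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int j = i & 3;
        uint32_t f = R[j] & 0xfff;             /* 12-bit frequency slot     */
        out[i] = symbol_for(f);
        R[j] = freq(out[i]) * (R[j] >> 12) + f - cfreq(out[i]);
        while (R[j] < (1u << 15))              /* 16-bit renormalisation    */
            R[j] = (R[j] << 16) | get_u16(inp);
    }
}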

The order-1 entropy encoder is more complex, but follows a similar style:

[image: pseudocode for the order-1 rANS 4x16 decode loop]

Now it's clear 4-way is just an arbitrary amount. I chose it because 2 way doesn't entirely remove all instruction latency, while more than 4 starts to add other inefficiencies.

However... that assumes we can't use SIMD instructions. Try as I might, I never really got SSE4 to work well on rANS. That's not true for AVX2 (now becoming standard) and AVX512 (still pretty rare). I have an AVX2 variant of rANS that works on 32-way unrolling. That means it's updating 32 states at a time, which is basically 4 interleaved AVX2 registers (for the same latency reasons as above). An example on a PacBio "ip:B:C," data aux tag.

rans4x8 -o0 (4-way)
273.2 MB/s enc, 517.1 MB/s dec   100000000 bytes -> 77254127 bytes

rans4x16 -o0 (4-way)
414.1 MB/s enc, 778.2 MB/s dec   100000000 bytes -> 77250511 bytes

rans4x16 -o0x604 (scalar 32-way)
255.0 MB/s enc, 621.5 MB/s dec   100000000 bytes -> 77258629 bytes

rans4x16 -o0x4 (SIMD 32-way)
512.3 MB/s enc, 1738.8 MB/s dec  100000000 bytes -> 77258629 bytes

We see here the 4x8 and 4x16 4-way implementations as described in this PR, plus the 32-way variant implemented using pure scalar code (basically an optimised version of the above pseudocode with the 4s changed to 32s), and a hand written AVX2 SIMD variant. [In my htscodecs test harness, -o0 / -o1 is order0/1 4-way, +4 is 32-way auto-SIMD, and +0x600 is 32-way with forced scalar only implementations.]

The speed gain for the encoder is slight, but vast for the decoder. The scalar implementation however is a bit slower. Note this was gcc, but clang has 377MB/s and 364MB/s enc for -o0 vs -o0x604, so they're far closer together and it indicates maybe we can improve the code to make the 32-way scalar encoder work better on gcc too.

Similarly with a block of unaligned base-calls, we get the following benchmarks:

rans4x8 -o1
244.7 MB/s enc, 382.7 MB/s dec	 100000000 bytes -> 24297283 bytes

rans4x8 -o0
338.3 MB/s enc, 575.8 MB/s dec   100000000 bytes -> 24603575 bytes

rans4x16 -o0 (4-way)
420.6 MB/s enc, 776.4 MB/s dec   100000000 bytes -> 24599287 bytes

rans4x16 -o4 (SIMD 32-way)
514.3 MB/s enc, 1784.1 MB/s dec  100000000 bytes -> 24607353 bytes

rans4x16 -o0x80 (PACK+O0 4-way)
760.1 MB/s enc, 2133.4 MB/s dec  100000000 bytes -> 24321803 bytes

rans4x16 -o0x84 (PACK+O0 SIMD 32-way)
827.0 MB/s enc, 3666.5 MB/s dec  100000000 bytes -> 24329815 bytes

rans4x16 -o0x684 (PACK+O0 scalar 32-way)
580.8 MB/s enc, 1844.7 MB/s dec  100000000 bytes -> 24329815 bytes

Here CRAM 3.0 ends up choosing an order-1 encoder as it's smaller than the order-0, but I include both for comparison. CRAM 3.1 uses the bit-packing: with 4 base types we end up packing 4 to a byte before encoding, which makes order-0 almost as good as order-1 but with a fraction of the CPU time. Given the bit-packing hasn't changed, going from PACK+4way to PACK+32waySIMD isn't so major, but it's still a benefit. As before, PACK+32wayScalar however is a bit slower.

So what does all this actually mean? Well these data types are for some 100,000 unaligned PacBio reads. After converting to CRAM and putting through io_lib's cram_size tool we see:

Block CORE              , total size           0
Block content_id      11, total size      535736  RN
Block content_id      12, total size        1928  QS
Block content_id      25, total size      216319  RL
Block content_id      30, total size   293569283  BA   PACK+rANS0 (-o128)
Block content_id      32, total size       22025  TL
Block content_id 6518851, total size       24264  cxC
Block content_id 6910018, total size   931608282  ipB  rANS0
Block content_id 7237699, total size        1446  npC
Block content_id 7370562, total size   854999601  pwB  rANS0
Block content_id 7431491, total size         343  qeC
Block content_id 7431497, total size      150242  qeI
Block content_id 7431507, total size       75834  qeS
Block content_id 7435075, total size        1851  qsC
Block content_id 7435081, total size      136288  qsI
Block content_id 7435091, total size       70902  qsS
Block content_id 7500134, total size        4097  rqf
Block content_id 7564866, total size      246257  snB
Block content_id 8023363, total size         301  zmC
Block content_id 8023369, total size       34626  zmI
Block content_id 8023379, total size        4647  zmS

The files are dominated by base calls and two aux arrays. (There are no quality values in this data set for some reason.)
Profiling CRAM 3.0 decode on the CRAM I see:

    55.33%         27010  scram_flagstat  [.] rans_uncompress_O0
    22.56%         10839  scram_flagstat  [.] rans_uncompress_O1
     8.63%          4272  libc-2.27.so    [.] __memmove_avx_unaligned_erms
     7.96%          3821  libdeflate.so.0 [.] libdeflate_crc32
     3.64%          1788  scram_flagstat  [.] bam_construct_seq

With CRAM 3.1 we see the O1 decoder has mostly dropped away, as we're now using bit-PACKing plus O0 for the base calls:

    48.82%         13767  scram_flagstat  [.] rans_uncompress_O0_4x16
    16.73%          4804  libc-2.27.so    [.] __memmove_avx_unaligned_erms
    13.21%          3849  libdeflate.so.0 [.] libdeflate_crc32
     7.18%          2131  scram_flagstat  [.] rans_uncompress_O1_4x16
     6.15%          1739  scram_flagstat  [.] bam_construct_seq

It's still dominated by the rANS decoding, so with 32-way AVX2 we see:

    30.80%          6324  scram_flagstat  [.] rans_uncompress_O0_32x16_avx2
    23.74%          4916  libc-2.27.so    [.] __memmove_avx_unaligned_erms
    18.47%          3920  libdeflate.so.0 [.] libdeflate_crc32
     8.73%          1777  scram_flagstat  [.] bam_construct_seq
     6.67%          1436  scram_flagstat  [.] rans_uncompress_O1_32x16_avx2

Still 30%, so the AVX512 version (50% quicker still, but not yet ported to htscodecs) would bring this down further.

In terms of wall clock timings, Scramble shows:

Format    Variant       Size    Encode          Decode
BAM       libdeflate    2270MB  1m4.158s        0m12.820s
CRAM3.0   4x8           2082MB  0m47.136s       0m12.594s
CRAM3.1   4x16          2082MB  0m43.821s       0m7.821s
CRAM3.1   32x16 scalar  2082MB  0m48.187s       0m8.901s   +10/14%
CRAM3.1   32x16 SIMD    2082MB  0m41.951s       0m5.742s   -4/27%

This is perhaps an abnormal data set as it shows the new codec in the best light. Aligned NovaSeq shows really marginal differences; 3-5% gain basically. Aligned ONT data without any signal data (so primarily gaining in QS compression) shows about 8% speed up decoding (and 2% slow down in scalar mode). Old style Illumina with 40-quals is similar.

My proposal is, like the RLE, bitPACK, order-0/1, to make 4-way vs 32-way a parameter stored in the data stream. It means the decode reads this and either naively uses it as a variable in the decoding loops, or has hand-tuned variants for the 4-way case and the 32-way case. Spec wise it's a minimal change, but implementation wise it's more questionable.
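
In decoder terms the proposal is roughly the following (a hypothetical sketch; the entry points and the idea of reading the interleave width from the front of the stream are placeholders, not the actual serialisation):

#include <stdint.h>
#include <stddef.h>

/* Hypothetical entry points; a naive decoder treats N as a loop variable,
 * an optimised one special-cases N==32 with a SIMD path. */
extern int rans_decode_nway_generic(const uint8_t *in, size_t in_len,
                                    uint8_t *out, size_t out_len, int N);
extern int rans_decode_32way_avx2(const uint8_t *in, size_t in_len,
                                  uint8_t *out, size_t out_len);
extern int have_avx2(void);

int rans_decode_dispatch(const uint8_t *in, size_t in_len,
                         uint8_t *out, size_t out_len)
{
    int N = in[0];                 /* interleave width read from the stream */
    if (N == 32 && have_avx2())
        return rans_decode_32way_avx2(in + 1, in_len - 1, out, out_len);
    return rans_decode_nway_generic(in + 1, in_len - 1, out, out_len, N);
}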

It adds complexity and may penalise CRAM performance on very naive implementations or old hardware (note TODO: get access to ARM and do some Neon code), but it brings noticeable benefits to smart implementations on modern hardware.

Thank you to all who bothered to read this far. What do people think about this?

@jkbonfield (Contributor Author) commented Aug 9, 2021

I decided to go for it and add the documentation of 32-way vs 4-way into the CRAM 3.1 codec. I've done this because the penalties of not explicitly vectorising aren't huge (nil for the test JavaScript code, but maybe 10-20% for C) while the potential benefits are very significant. Plus it's not mandatory.

The last commit has these changes in. You can see that the impact on the pseudocode and specification is really quite minimal in the grand scheme of things. (There is a bit more changing there because I wanted to distinguish between interleaving N rans states vs N-way (now X) data-striping, eg shuffling data so every 4th byte is collated together prior to compression.)

See samtools/htscodecs#26 for the C and JavaScript implementations of this idea. The JavaScript changes to support selection of N-way rANS state interleaving demonstrate how easy it could be (rather than how complex it can be if you look at the optimised C code!). jkbonfield/htscodecs@6a5e014

CRAMcodecs.tex Outdated
\State $last\_sym \gets s$
\State $rle \gets 0$
\Repeat
\State $A \gets (A,s)$ \Comment{Append $s$ to the symbol list $A$}
Member:

This currently implies that A holds a list of symbols in the order that they are read from the file, which has implications for the order of frequency values read later. It might be useful to add a restriction that s is always increasing, or maybe that A should either be sorted into order before returning it, or should be treated as an ordered set rather than a list.

See also https://github.com/samtools/htscodecs/pull/26/files#r814844095

@jkbonfield (Contributor Author) commented Feb 28, 2022

The fact this is a run-length encoded list of symbols with the ability to auto-increment symbol numbers at least very heavily implies that they are stored sequentially and in ascending order; however I agree it needs to be an explicit statement, as technically it could be encoded as a mix of out-of-order symbols interspersed with ordered clumps of adjacent values. I'll amend.

@jkbonfield (Contributor Author)

As this is still kicking around in a non-adopted state, mainly due to a combination of lack of review and lack of a complete 2nd implementation, I decided it's still valid to consider changing it.

Both existing long-read technologies and some of the newer instruments (such as Ultima Genomics) exhibit a strong correlation between the local sequence motif and the quality stream. We explored such correlations in the earlier era of Solexa / Illumina, but the correlation was so slight it was deemed inadvisable to create a dependency between distinct types of data. However this decision should be revisited, as has been demonstrated by tools such as ENANO.

An example: 1 million fastq lines from some bleeding-edge Ultima Genomics data, quality size only:

htscodecs/fqzcomp_qual
-s0                  61261377
-s1                  64589653
-s2                  60522937
-s3                  61196878
-x 0xf600000f0002    59726230

These show the standard quality parameter sets using the test harness in https://github.com/samtools/htscodecs/.
The -x method is a way of explicitly coding a set of parameters rather than using the predefined parameter sets. I manually optimised it so you can see the slight gain with a hand tuned configuration.

I then used the fork of htscodecs in fqzcomp5 (https://github.com/jkbonfield/fqzcomp5) which adds the ability to define the sequence motif as part of the quality context. The parameters involved are the length of the motif (2-bit encoded) and the backwards offset relative to the quality value being encoded. So we could have a 4bp motif (8 bits) with an offset of 2, meaning 2bp prior to this qual, the bp for this qual itself, and 1bp after.
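
A rough sketch of how such a context could be built (an illustrative bit layout, not the exact fqzcomp5 parameter encoding): take mlen bases from a window starting off bases before the current position, 2 bits each, and combine them with the existing quality context.

#include <stdint.h>
#include <stddef.h>

/* seq_2bit[] holds bases already coded as 0..3.  For mlen=4, off=2 the
 * window is pos-2 .. pos+1: two bases before the quality, the base under
 * it, and one after.  The final shift/OR against the quality context is an
 * assumed layout for illustration only. */
static uint32_t seq_context(const uint8_t *seq_2bit, size_t seq_len, size_t pos,
                            int mlen, int off, uint32_t qual_ctx, int qbits)
{
    uint32_t motif = 0;
    for (int i = 0; i < mlen; i++) {
        long p = (long)pos - off + i;
        uint32_t base = (p >= 0 && p < (long)seq_len) ? seq_2bit[p] : 0;
        motif = (motif << 2) | (base & 3);    /* 2 bits per base */
    }
    return (motif << qbits) | qual_ctx;       /* e.g. 8 seq bits + qual bits */
}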

Using the FQZ+SEQ encoder previously tuned for ONT on UG data gave this:

-x 0xf600000f0002	59726230 # From above, using QUAL only
-Q3			52094372 Profile for ONT, using QUAL+SEQ in context
-x 0x660000000000a63	51839253 Manually tuned; O1 encode + 10-bits of seq context

It's clear the profile previously tuned on ONT was sufficient here too. This is around a 14% reduction in quality size, so really very significant. We see a similar impact with ONT and PacBio data sets too, so this has mainstream benefits.

Practically speaking, this has an impact on CRAM implementations as they can no longer assume total independence of data series, and would need to consider FQZ+SEQ encoding as a special case requiring all data streams associated with sequence to be decoded as a prerequisite for obtaining qualities. While independence of data streams is very useful for rapid partial decoding, obtaining quality without the corresponding sequence bases has extremely limited value (unlike vice-versa, which is a common operation). Furthermore CRAM decoders, if they wish to permit partial decoding, already must track both data dependence and code-path dependence as the specification permits multiple data-series to be encoded within the same block. (An example of data dependence would be the DL data series (deletion length) being in the same block as quality values, while code-path dependence would be the knowledge that the DL data series can only be decoded if the FC and FN data series have also been decoded.) However to facilitate efficient implementations, it's probably best if the codec types are separated into FQZ (qual only) and FQZS (qual+seq context) so tools can rapidly determine data dependencies without needing to decode the FQZ parameter blocks themselves.

--

The second thing to consider for archive CRAMs is increasing the maximum context size from 16 bits to 20 bits. I chose 16 bits because CRAM is designed around fine-grained random access and it's hard for large models to learn all the probabilities; however, if we have very large block sizes (for archival) then this may be too restrictive. The above parameters use 8 or 10 bits of sequence motif in the context, leaving very few bits left for quality encoding. Thus the quality encoder is reduced from order-2 context down to order-1 in order to free up bits in the model index.

By increasing the model size to 20 bits, we get the following:

-x 0xc600000000008c2    48780866; O2 encode + 8 bits seq context: 5.5% smaller

This is basically O(2) qual encoding (6-bits at a time to cope with the Q0 to Q40 inclusive range) and 8 bits of sequence context (offset 2). The memory used for the quality encoder increases considerably, but fqzcomp5 running with 4 threads was still only around 2GB so it's not an insurmountable amount. Combined with the seq context extension, these two changes amount to an 18% reduction in quality size for this test data set, and the qualities are the dominant space in an aligned UG CRAM.

As hinted at above, the 16 to 20-bit gains are much lower when using smaller block sizes. The above was the entire 1 million reads in a single block (considerably larger than our current "archive" profile for CRAM in htslib), but spread across 7 blocks going from 16 to 20 bits only saved around 1.5%. It is likely we wouldn't use the larger context size except when aiming for maximum compression in archive mode, where a coarse degree of random access is acceptable. So this is a less significant change than adding sequence contexts.

For reference, current CRAM 3.0 uses 63097604 bytes for these qualities, and current CRAM 3.1 in archive mode uses 61192567 bytes (not too dissimilar to the values seen in the original plot above).

Given the lack of progress elsewhere in this PR, I'm inclined to amend it.

Comments?

@jkbonfield (Contributor Author) commented Jun 14, 2022

An update, keeping to 16-bit context size at the moment and tested on a single large block of data (varying between 100M and 1G by machine type). This is the benefit of adding SEQ into the context, with hand-tuned fqzcomp params per data type.

Machine            Size reduction   fqzcomp -x param
Illumina NovaSeq   2.9%             0x6244220e8c04261
ONT                8.2%             0x660000000000a63
PacBio CLR         8.1%             0x660000000000a63
Ultima Genomics    15.3%            0x660000000000a63

It's clear sequence context is the strongest for Ultima Genomics, but it's a brand new platform and it's possible this will change as base-callers further improve.

TODO: PacBio HiFi, more modern ONT (this is old GIAB data), other new platforms.

Going to 20-bit context size didn't help ONT or PB; it helped Illumina only marginally (0.6%), but Ultima Genomics gained a further 5.9% (compared to 16-bit with SEQ context). So I'm unsure if larger contexts are worth it overall yet, given the increased memory and slower times, but it could be an elective thing.

CRAMcodecs.tex Outdated

The symbol frequency table indicates which symbols are present and
what their relative frequencies are. The total sum of symbol
frequencies are normalised to add up to 4095.
@zaeleus commented Oct 4, 2022

Both htscodecs and HTSJDK add up to 4096, not 4095.

Edit: To clarify, these implementations will readjust to 4095, but it's confusing that they initially target and scale with 4096 rather than 4095, which is not how it's described in the document.

I did see that the JavaScript implementation both scales and adds to 4096, which may be a bug?

@jkbonfield (Contributor Author) commented Oct 7, 2022

I think this is a legacy thing from CRAM 3.0, which wrongly stated 4095 as I believed 4096 would cause wrap-around issues. That wasn't true, and the code works just fine. I'm a little cautious about explicitly permitting 4096 in case people have implemented a decoder with an assumption that it matches what the spec originally stated, but I think we can enumerate the implementations pretty easily to double check this.

I'll investigate. It would definitely be cleaner to recommend 4096 (with a comment that 4095 is permitted for compatibility with historic versions of the spec).

Edit: I do see this though which appears to require the sum to be <4096:

https://github.com/GMOD/cram-js/blob/master/src/rans/frequencies.js#L53

I haven't worked out how to test this on the command line yet.

@jkbonfield (Contributor Author) commented Oct 7, 2022

I can verify that the output from Scramble (io_lib's original CRAM implementation), htslib, and htsjdk all produce sums of 4095 for their O0 and O1 frequency tables.

The TOTFREQ definition in Java just copied my C version, so it all stems from the same silly issue back then I guess. 4096 is the size of the array used in the decode function, and it's that simply because it permits AND operations and lookup tables (which is probably why I originally confused myself with 4095 being the maximum freq). The renormalisation of frequencies is non-trivial and isn't covered by the specification either as it describes the at-rest file format rather than a specific implementation for determining how to get your frequencies summing correctly. So it's inevitable that we start with a target value and then have to rescale after summing to get things adding up.
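
For illustration, one way an encoder might rescale raw counts to a target total (4095 here, matching what existing encoders emit). This is only a sketch, since the spec fixes the serialised result but not the rescaling strategy.

#include <stdint.h>

/* Rescale raw symbol counts so they sum to `target`, keeping every present
 * symbol at frequency >= 1.  Rounding residue is dumped on the most common
 * symbol; a production version also has to handle the case where forcing the
 * minimum of 1 overshoots the target (cf. "normalise_harder" in htscodecs). */
static void normalise_freqs(uint32_t freq[256], uint32_t target)
{
    uint64_t total = 0, sum = 0;
    int biggest = 0;

    for (int i = 0; i < 256; i++)
        total += freq[i];
    if (total == 0)
        return;

    for (int i = 0; i < 256; i++) {
        if (!freq[i])
            continue;
        uint32_t f = (uint32_t)((uint64_t)freq[i] * target / total);
        freq[i] = f ? f : 1;
        sum += freq[i];
        if (freq[i] > freq[biggest])
            biggest = i;
    }

    if (sum <= target)
        freq[biggest] += (uint32_t)(target - sum);
}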

I agree it looks like the javascript implementation is wrong here. It needs fixing, although that's an htscodecs issue rather than an hts-specs one.

CRAMcodecs.tex Outdated
Comment on lines 361 to 365
One last caveat is that we have no context for the first byte in the
data stream (in fact for 4 equally spaced starting points, see
``interleaving" below). We use the ASCII value (`\textbackslash0') as
the starting context and so need to consider this in our frequency
table.
Review comment:

The output of the rANS implementations in htscodecs also includes the final context, which is neither described here nor in the following example. (The final context in the case of the following example is ('a', '\0')).

@jkbonfield (Contributor Author) commented Oct 7, 2022

Initially I didn't get this - I dumped out the contents of out_buf holding the O1 frequency table for the example input and it matches (bar the frequencies themselves, which don't due to changes in the frequency renormalisation algorithm, but as stated before how you get the frequencies to add up to 4095 is up to the implementation).

However, if I change it to abracadabraabracadabraabracadabraabracadabrx so the last letter occurs only once, then I see that initially 'x' isn't in the F[256][256] frequency table:

(gdb) p F['r']
$56 = {0 <repeats 97 times>, 7, 0 <repeats 22 times>, 1, 0 <repeats 135 times>}
(gdb) p F['r']['a']
$57 = 7
(gdb) p F['r']['x']
$58 = 1
(gdb) p F['x'][0]
$59 = 0
(gdb) p F['x']   
$60 = {0 <repeats 256 times>}

However F['x'][0..255] gets renormalised because T['x'] != 0. We could work around this special case where the last character in the buffer appears once and once only, meaning it doesn't have any following contexts of its own, with this minor patch:

diff --git a/htscodecs/rANS_static.c b/htscodecs/rANS_static.c
index 9865022..9a37eb0 100644
--- a/htscodecs/rANS_static.c
+++ b/htscodecs/rANS_static.c
@@ -420,6 +420,12 @@ unsigned char *rans_compress_O1(unsigned char *in, unsigned int in_size,
         if (T[i] == 0)
             continue;
 
+        if (T[i] == 1 && in[in_size-1] == i) {
+            // last symbol can be ignored as it doesn't have following
+            // contexts.
+            continue;
+        }
+
         //uint64_t p = (TOTFREQ * TOTFREQ) / t;
         double p = ((double)TOTFREQ)/T[i];
     normalise_harder:

It changes the output as expected:

@ seq4c[samtools.../htscodecs]; hd /tmp/_o1.orig 
00000000  01 48 00 00 00 01 3f 00  00 00 2c 00 00 00 00 61  |.H....?...,....a|
00000010  8f ff 00 61 61 82 86 62  02 86 bd 83 5e 83 5e 00  |...aa..b....^.^.|
00000020  62 02 72 8f ff 00 61 8f  ff 00 61 8f ff 00 72 61  |b.r...a...a...ra|
00000030  8d ff 78 82 00 00 78 00  8f ff 00 00 47 98 48 53  |..x...x.....G.HS|
00000040  47 98 48 53 47 98 48 53  34 d7 46 02 ad           |G.HSG.HS4.F..|
0000004d
@ seq4c[samtools.../htscodecs]; hd /tmp/_o1.new
00000000  01 43 00 00 00 01 3a 00  00 00 2c 00 00 00 00 61  |.C....:...,....a|
00000010  8f ff 00 61 61 82 86 62  02 86 bd 83 5e 83 5e 00  |...aa..b....^.^.|
00000020  62 02 72 8f ff 00 61 8f  ff 00 61 8f ff 00 72 61  |b.r...a...a...ra|
00000030  8d ff 78 82 00 00 00 47  98 48 53 47 98 48 53 47  |..x....G.HSG.HSG|
00000040  98 48 53 34 d7 46 02 ad                           |.HS4.F..|
00000048

It's subtle, but there are two 'x's in the o1.orig file, with the first showing r->x and the second being x->nul.

Note however this second transition is not necessary for decode and the decoder works fine. This is an implementation tweak (maybe bug), but not something that the specification should require. Basically we can add more entries to the frequency table if we really wish and it'll have no bearing on the decoding of the data as they're unused. I'm not sure how best to explain that in the formal specification.

@@ -1721,7 +1740,7 @@ \subsection{Statistical Modelling}
frequency tables in the data stream covering all possible contexts.

To do this we use a statistical model, containing an array of symbols $S$ and their frequencies $F$.
The sum of these frequences must be less than $2^{16}-32$.
The sum of these frequences must be less than $2^{16}-16$ so after adjusting the frequencies it never go above the maximum unsigned 16-bit integer.
@yash-puligundla commented Jan 18, 2023

The sum of these frequencies must be less than or equal to $2^{16}-16$ so after adjusting the frequencies it never goes above the maximum unsigned 16-bit integer.

@jkbonfield (Contributor Author):

We add 16 to the frequency, and $2^{16}-1$ is the maximum value stored in a uint16_t, not $2^{16}$. Hence less than instead of less than or equal.

NB: it's incremented by 16 instead of 1 because we start with all symbols having a minimum probability of 1/N. We want to distinguish the symbols seen from those not yet seen, so the jump from 1/N (1/256 for a full alphabet) to 17/272 and 33/288 is a more rapid learning of probabilities than a simple +1. The price though is more frequent renormalisation, but that's little more than a bunch of bit shifts. It's better for smaller N too, so limited alphabets benefit from less overhead for the as-yet unseen symbols.
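
As a sketch of that update rule in C (field names and the exact renormalisation threshold are illustrative; only the +16 step and the halving follow the description above):

#include <stdint.h>

#define MAX_TOTAL ((1u << 16) - 16)   /* keep total + 16 within a uint16_t */

struct model {
    uint16_t freq[256];               /* all present symbols start at 1    */
    uint32_t total;
};

/* Bump the coded symbol by 16; when the total gets too close to 2^16,
 * halve everything (rounding up so no seen symbol drops back to zero). */
static void model_update(struct model *m, uint8_t sym)
{
    m->freq[sym] += 16;
    m->total     += 16;
    if (m->total >= MAX_TOTAL) {
        m->total = 0;
        for (int i = 0; i < 256; i++) {
            m->freq[i] = (m->freq[i] + 1) / 2;
            m->total  += m->freq[i];
        }
    }
}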

Some range coders use an escape code and freq 0 for all unused symbols. When we add a symbol that currently has freq 0, the escape code is emitted, then the new symbol itself, and then it's incremented as before. Ie we start with 1/1 (P(escape)=1), then emit escape + 'x' and we have P(escape)=1/2 and P(x)=1/2. Then another x => P(escape)=1/3, P(x)=2/3. (Escape and literal symbols can be adjusted further based on the chance of new symbols vs how many emitted so far given a finite alphabet, but it's relatively minor). This method is a better estimation of probabilities, but it's also more complex and I didn't deem the speed hit to be worth the small size difference, especially when we have a small alphabet. It only really helps significantly on small blocks of data or those with non-stationary probabilities, where the size / speed of adaption matters the most.

@jkbonfield (Contributor Author) commented Jan 23, 2023

The X_EXT part of arith-dynamic compression to use bzip2 after RLE / PACK ops instead of an arithmetic coder seems unwieldy and unjustified. Htslib doesn't use this feature, and nor does io_lib. I can't remember why I put it there, but probably it worked on one specific data set somewhere. Commenting out the bzip2 decoder in htscodecs arith_dynamic.c doesn't cause it to fail any tests either, so I'm obviously not exercising this route.

On the name test data from https://github.com/jkbonfield/htscodecs-corpus I explicitly added bzip2 into the mix of things to try to see what impact on compressed size it gave.

File        -9      -9+bz2   -9+xz
01.names    168399  168399   168399
02.names    415262  414730*  415262
03.names    49156   49156    49156
05.names    280156  277696*  280156
08.names    42106   42106    42106
09.names    254490  254490   254490
10.names    300635  300635   300635
20.names    139712  139608*  139712
nv.names    323763  302544*  323763
rr.names    88696   88374*   88696

The only one of these that changed significantly was nv.names (NovaSeq names), which shrunk by 7%. This doesn't happen on smaller CRAM slices of 10k reads / slice, but does at 25k slice and above. 05.names (HiSeq) was close to 1%, but still not that major.

On a NovaSeq cram I see names are about 25% of storage, so that bzip2 saving comes out at about 1.7% total file size reduction. It seems very specific to this one naming scheme. I'll research further, but am interested in the thoughts of others who have implemented this spec as to whether they did this small corner case, and whether they have any intention to use it?

@yash-puligundla @zaeleus

Edit: The reason for the benefit with bzip2 on NovaSeq comes down to the X and Y coordinates being better compressed with bzip2 than any of the rans or arithmetic coder offerings. This in turn appears to be due to the distribution of values. The coordinates are strongly clumped every 15 or 16 pixels apart in one dimension, and every 9 or 10 in the other. Likely due to the gridded cluster layout. This means a sorting algorithm, such as the BWT in bzip2, can do a better job of representing the data than a naive byte based entropy encoder.

The jury is still out on whether it's worth using. It's not possible to produce CRAMs using this right now from htslib, but it isn't hard to enable and has been tested as seen above.

@jkbonfield (Contributor Author)

I rebased this PR and did a full re-read and final round of minor updates (mostly typographical or language improvements).

I'm happy with it, barring the open question on whether all of the spec is actually needed (see bzip2 above). That doesn't particularly harm though as it's a minimal change in complexity compared to everything else.

Given we have two full implementations for both reading and writing, which have been cross-validated, I don't see any reason for delaying this further.

Please shout now if you feel otherwise!

@jkbonfield force-pushed the cram_codecs branch 10 times, most recently from eced033 to 0872692 on January 30, 2023
@zaeleus commented Jan 31, 2023

thoughts of others who have implemented this spec as to whether they did this small corner case, and whether they have any intention to use it?

The implementation to support the EXT flag certainly isn't the hard part. For encoding in noodles, it is disabled by default, and there are no heuristics to automatically select it. It is straightforward for the user to enable it manually, though.

I agree it's hard to justify keeping it when your test case is for a specific pattern of read names. I'd be curious how it compares to the name tokenizer codec instead.

@jkbonfield (Contributor Author) commented Jan 31, 2023

thoughts of others who have implemented this spec as to whether they did this small corner case, and whether they have any intention to use it?

The implementation to support the EXT flag certainly isn't the hard part. For encoding in noodles, it is disabled by default, and there are no heuristics to automatically select it. It is straightforward for the user to enable it manually, though.

Agreed, it's such a trivial thing to add compared to the huge complexity of the rest that it's a very minor point, so I'm not inclined to worry too much.

I agree it's hard to justify keeping it when your test case is for a specific pattern of read names. I'd be curious how it compares to the name tokenizer codec instead.

Htslib doesn't use the arithmetic coder by default, but can be coaxed into using it with e.g. samtools view -O cram,version=3.1,archive; even then it won't use the bzip2 part of the arith coder on the name tokeniser.

It's a trivial patch to enable this though for level 7 and level 9 name tokeniser:

diff --git a/htscodecs/tokenise_name3.c b/htscodecs/tokenise_name3.c
index 2497f59..4588f05 100644
--- a/htscodecs/tokenise_name3.c
+++ b/htscodecs/tokenise_name3.c
@@ -1253,8 +1253,9 @@ static int compress(uint8_t *in, uint64_t in_len, int level, int use_arith,
         {2,   0,      128},                                   // 1
         {2,   0,                         192+8},              // 3
         {3,   0,  128,                   193+8},              // 5
-        {6,   0,1,    129,   65,    193, 193+8},              // 7
-        {9,   0,1,128,129,64,65,192,193, 193+8},              // 9
+        //{6,   0,1,    129,   65,    193, 193+8},              // 7
+        {9, 0,1,  128,129,64,65,192,193, 193+8},              // 7
+        {10,0,1,4,128,129,64,65,192,193, 193+8},              // 9
     };
 
     // 1-9 to 0-4

The above replaces the tokenise names arithmetic level 7 (17) with the old level 9, and adds a new bzip2 mode to the new level 9, so we can compare -17 vs -19 and see what impact it'd make at highest compression levels.

The figures shown above are for the test data compressed with htscodecs' tests/tokenise_name3 tool. An example on read names from reads 9 million to 10 million of my novaseq test data:

@ seq4c[/tmp]130; samtools view ~/scratch/data/novaseq.10m.cram|tail -1000000|awk '{print $1}'|~/work/samtools_master/htscodecs/tests/tokenise_name3 -19 |wc -c
2800694

@ seq4c[/tmp]; ^19^17
samtools view ~/scratch/data/novaseq.10m.cram|tail -1000000|awk '{print $1}'|~/work/samtools_master/htscodecs/tests/tokenise_name3 -17 |wc -c
3020372

So that's a 7.3% saving on RN. Given on the whole CRAM the RN field was ~19% of the file size, this equates to ~1.3% total cram size reduction. However that's only on the most aggressive "archive" mode as we don't even use the arithmetic coder in any lesser compression profiles as it's not worth the speed vs size tradeoff normally (rANS is so much faster, especially the SIMD vectorised variants).

It's trickier to get it used directly by htslib/samtools though, due to a bug in the auto-strat selection for TOK3+RANS vs TOK3+ARITH. Fixing that though I see:

# With bzip2 in tok3
$ samtools view -@8 -O cram,version=3.1,archive,level=9 ~/scratch/data/novaseq.10m.cram -o /tmp/_1.cram
$ samtools cram-size -v /tmp/_1.cram
#   Content_ID  Uncomp.size    Comp.size   Ratio Method      Data_series
BLOCK     CORE            0            0 100.00% raw        
BLOCK       11    394734019     26432333   6.70% tok3-arith  RN
BLOCK       12   1504781763     79676722   5.29% fqzcomp     QS
BLOCK       13       330065        40458  12.26% gzip-max    IN
BLOCK       14     16549410      4024890  24.32% arith-o1PR  SC
BLOCK       14     10076192      2475367  24.57% gzip-max    SC
BLOCK       15     15002423      2792408  18.61% arith-o1    BF
...
Number of containers                 100
Number of slices                     100
Number of sequences             10000000
Number of bases               1504781763
Total file size                158420463
Format overhead size               77543

vs

# without bzip2 in tok3-arith
#   Content_ID  Uncomp.size    Comp.size   Ratio Method      Data_series
BLOCK     CORE            0            0 100.00% raw        
BLOCK       11    394734019     28611420   7.25% tok3-arith  RN
BLOCK       12   1504781763     79676722   5.29% fqzcomp     QS
BLOCK       13       330065        40458  12.26% gzip-max    IN
BLOCK       14     26316465      6401586  24.33% arith-o1PR  SC
BLOCK       14       309137        71869  23.25% gzip-max    SC
BLOCK       15     15002423      2792408  18.61% arith-o1    BF
...
Number of containers                 100
Number of slices                     100
Number of sequences             10000000
Number of bases               1504781763
Total file size                160474246
Format overhead size               77543

So the ratios are similar to the test tool - 1.3% overall file saving, and 7-8% for RN block.

As an example, samtools speeds vs sizes (8 threads, wall clock time):

cram,version=3.1,archive,level=9	1m10.263s	158285989  # with tok3-arith-bz2 enabled
cram,version=3.1,archive		0m39.254s	158513296  # with tok3-arith-bz2 enabled
cram,version=3.1,archive		0m36.987s	160663427  # without tok3+arith-bz2 enabled
cram,version=3.1,small,use_arith	0m26.861s	162763860  # with tok3-arith-bz2 enabled
cram,version=3.1,small(x)		0m23.084s	163420366  # see comment below - tok3 name tweak only
cram,version=3.1,small			0m17.428s	166275499  # adds use of fqzcomp
cram,version=3.1			0m7.315s	175612417  # uses newer rans and name tokeniser
cram,version=3.0			0m6.451s	207758029  # 18% larger than v3.1
bam					0m11.045s	565605910

(x) The normal "small" profile uses rANS everywhere; this hacks the code to use rANS for everything except the name tokeniser, so the only extra cost and saving is in the RN blocks.

So it's pretty much overkill. Even with a modified profile so "small" mode uses tok3+arith for names, we're still 33% slower for a 1.7% size reduction. Maybe beneficial? Maybe not. Hard to say.

Archive level 9 is the extreme scenario. I'm not expecting anyone to really use archive mode as it's slow. It's there mainly as a demonstration for what the format can do if we throw CPU at it, and to act as a baseline when people compare other (potentially even slower) formats against CRAM.

If we decide to keep it in the spec (and as you say it's minimal extra complexity vs everything else), then I guess the normal compression level of archive mode is a reasonable tradeoff to add bz2 in there: 6% slow down for 1.3% size reduction is pretty decent given the profile is aiming at high compression.

Edit: remember though this is NovaSeq, which benefits here due to the very specific case of bzip2 working well on X and Y coords (likely due to the gridded layout). On other data sets we're not going to see much size change at all, but we may still pay the CPU overheads, although that could be improved by better learning strategies.

Anyway, to summarise a long ramble:

My instinct is to keep it there and enable this feature as it's beneficial in at least some (pretty common) situations.

@jkbonfield (Contributor Author) commented Jan 31, 2023

Note there may also be other places where this is beneficial. Eg the ability to do bit-packing + bzip2 may help considerably for some aux tags.

It's definitely not a clean way of doing things. For CRAM 3.1 I wanted to keep the changes solely focused on "external" codecs, ie replacements for gzip, bzip2 and lzma.

However the correct approach (in the CRAM 4.0 draft) is to lift the transforms out from rans/arith into the CRAM encoding methods so we can use nested layers, such as putting a data series through PACK, then through RLE, and then storing to EXTERNAL block 12, which is then compressed using BZ2. I also added better variable-sized integer encoding as a transform. This is a much more flexible strategy, but it needs more aggressive changes to the format, and the slow progress of this PR showed me that we'd need to accept smaller piecemeal updates, even if ultimately they're not the best solution.

@jkbonfield (Contributor Author)

Squashed, rebased, and attached the prebuilt PDFs to the top post.

@jkbonfield (Contributor Author) commented Feb 9, 2023

To all concerned, I am now starting the timer on acceptance. 1 month from now (14th March) if there are no reasonable objections then CRAM 3.1 will become the current official specification version.

@github-actions

Changed PDFs as of 4e4cbbe: CRAMcodecs (diff), CRAMv3.

jkbonfield added a commit to jkbonfield/hts-specs that referenced this pull request Mar 15, 2023
- Added CRAM_codecs document to describe the proposed CRAM3.1 codecs.

  This also has the CRAM3.0 rANS codec in it which has been removed from
  CRAMv3.tex.

- Update CRAMv3.tex to include both CRAM 3.0 and 3.1.  The differences
  are so minor it's essentially just expanding the list of legal values
  in Block compression methods and updating a few 3.0s to be more general.

The new EXTERNAL compression codecs added for CRAM 3.1 are:

- rANSNx16.  16-bit renormalisation instead of the old 8-bit
  renormalisation, for extra speed.  This now permits either 4-way or
  32-way unrolling, with the latter being a good target for SIMD
  optimisations.

  The newer rANS also includes PACK, RLE and STRIPE (interleaving)
  transformations to be applied prior to entropy encoding.

- Adaptive arithmetic coder.  This has all the same transformations of
  rANSNx16, but with slightly better ratio and better performing on
  data sets with non-stationary probabilities.  It's slower however.

- FQZComp quality codec.  A generalised version of the old fqzcomp-4.6
  tool.

- Name tokeniser.  A compression tool for any data series that has
  repeated structured text elements.  It breaks the line down into a
  series of tokens and optionally deltas each against a previous
  line.  Each token then forms a new series of compression streams,
  which are in turn compressed and serialised into a single block.

With thanks to Michael Macias and Yash Puligundla for review and
corrections.
@github-actions

Changed PDFs as of 97a1e20: CRAMcodecs (diff), CRAMv3.

@github-actions

Changed PDFs as of 7efd204: CRAMcodecs (diff), CRAMv3.

@jkbonfield jkbonfield merged commit 7efd204 into samtools:master Mar 15, 2023