Added CRAM_codecs document to describe the proposed CRAM3.1 codecs. #433
Conversation
Things up for discussion.
If you want to actually try it in action, it's currently in the Scramble program, obtained by cloning io_lib:
You'll need to specify "-V3.1" when running Scramble, and to get some of the newer codecs you can specify -X profile where profile is fast, normal, small or archive. For example:
Figures for 10 million records from an old Illumina HiSeq run (40 quals):
Some codecs are clearly very justified, but the adaptive arithmetic coder (…
Mostly TeX-related trivia: IMHO this would be better as CRAMcodecs.pdf (without the underscore) to parallel SAMtags.pdf. Typesetting it fails because (1) the line that should be …
Applied the other fixes too. Thanks.
A very minor query for the other maintainers. The CRAM spec assigns a numeric ID to codecs (see Chapter 13 of the CRAMv3 spec). Should we keep those numeric IDs there, or should they be replicated in this document too? This document doesn't describe the encapsulation, which is where the ID comes from.
One other thought here is that maybe we should add Zstd, and recommend it over Zlib (which is old hat) and over lzma, unless we're really pushing the boat out or keeping to lzma -1 to -3, which can be surprisingly effective. Mid-level zstd can be a good speed vs size tradeoff, almost always better than gzip or libdeflate.
Sounds like the numeric ID is specific to the CRAM encapsulation so it'd be appropriate to just have it in the CRAM document, as long as the names of distinct codecs are obviously matched up between the two documents. No firm thoughts either way: only in the CRAM document reduces duplication, but in both makes the correspondence crystal clear.
That was my inclination too. I updated it in the CRAM3 document and also removed the now duplicated rANS information. I'm keeping both changes in this PR for now as we won't want to cull this from the existing CRAM spec until the codec doc is signed off too.
Given this has been hanging around for a while and I'm thinking on format-breaking things that I want to fix for CRAM 4 (starting with 64-bit references), I started looking at the ideas in #144 (comment). That mentions splitting up the compound rans4x16pr codec, which has RLE + bit-packing + entropy encoding, into separate data transform encodings and data compression codecs. It dawns on me that perhaps this is also something that fits in 3.1.

Initially, I wanted to try hacking the BETA, GAMMA, SUBEXP, etc. encodings to not output to the CORE block but to be able to specify which external block ID they use. That's trivial, but then it dawned on me that BETA is basically identical to my PACK method. Given values 0,1,2,3 we encode using BETA(offset=0,nbits=2) and pack 4 to a byte. That's already half of the new rans done.

So my proposal is something like this: outputting the QS data series using EXTERNAL(11), codec RANS4x16 order 193 (PACK+RLE+O1), is equivalent to:

MAP("!-3E"->0123, XBETA(offset=2, nbits=2, RLE(EXTERNAL(50), EXTERNAL(51))))

For this to be efficient though, it cannot be done a few bits at a time, daisy chaining it through a series of function pointers. Instead we need to gather up the whole set of qualities, apply the map transform, apply the beta transform, apply the RLE transform, and then entropy encode the two resultant RLE buffers. That's easier to code up.

However it also adds a difference to CRAM 3. If these new MAP, XBETA and RLE codecs are to be used then the data series involved must be "pure" and never interleaved with another data series. (In practice we do this already.) I'm coding it up to give it a go, but please bear this in mind when reviewing this PR. If it works, we'll probably want to go with the more flexible algorithm. I'll also add DELTA in there too as this is needed for some machine-specific data series (signals, traces, etc).
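As a concrete illustration of the PACK / BETA(nbits=2) equivalence described above, here is a minimal sketch in C of packing a 4-symbol alphabet four values to a byte. It assumes the symbols have already been mapped to 0–3 (the MAP step); the bit order and the header/symbol-map serialisation of the real rANSNx16 PACK transform are not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: pack values 0..3 four-to-a-byte, low bits first.
 * The real PACK transform in the codec also records the symbol map
 * and supports 1-, 2- and 4-bit widths; that framing is omitted. */
static size_t pack2bit(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i += 4) {
        uint8_t b = 0;
        for (size_t j = 0; j < 4 && i + j < n; j++)
            b |= (uint8_t)((in[i + j] & 3) << (2 * j));
        out[o++] = b;
    }
    return o;   /* roughly n/4 packed bytes */
}
```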
I coded this up, albeit rolling MAP and XBETA into a single "PACK" function, and it works as expected. A marginal slowdown (about a 7% hit encoding, 3% hit decoding), but it gives identical sizes. E.g. old (cram_dump snippets):
New (cram_dump, manually edited for clarity):
Whether or not this is a better solution than putting all the logic into the compression codec is questionable. It feels cleaner to me though, although it opens up more work on how to optimise things. It would also mean changing the definition of the name tokeniser and moving some of the pack and RLE functions from rans into calls within the name tokeniser itself, as for space-efficiency it serialises its own format rather than going via a many-way CRAM external block structure.
I'm going to comment in tranches - a few inline to start, with more to come.
Also, we should think a bit about how we want to partition the two docs. I'd propose that it would make sense to keep all of the actual file format specification, including the serialized format of the codecs, in the main doc, so that the entirety of the file format is specified in one place. Then the supplementary doc can be used for the background, pseudo-code, and additional detail.
CRAMcodecs.tex
some (potentially many) symbols will have zero frequency. To reduce
the Order-1 table size an additional zero run-length encoding step is
used. Finally the Order-1 frequencies may optionally be compressed
using the Order-0 rANS 4x16 codec.
Is there a reason these encodings should be fixed and prescribed by the specification, rather than just defined to be nested sub-encodings (BYTE_ARRAY_LEN style)?
Do you mean is there a reason the rANS 4x16 order-1 frequency table is explicitly encoded using order-0 rANS?
It's done to make the codec a single piece of code that may be used elsewhere and frankly it's as good an encoding as any I could find. It's possible we could split it up and have frequencies in one block, encoded with whatever method we wish, and data in a second. I don't really know if it buys us anything meaningful though.
There is a further question, originally raised by Vadim (and tbh I wish I'd listened more at the time), of whether the rANS entropy encoder should have the frequencies stored in the preservation map, with the "external" block part being purely the encoded data itself. Doing so would permit one frequency table and many blocks of data compressed using the same table, which is appropriate if we wish to use multi-slice containers. Potentially there is room here for better random access via such mechanisms. I forgot about that. If I were to change anything it'd be via that style, but it's still a bit complex and messy.
Fixed, barring the discussion on the 4x16 O1 table using the 4x16 O0 encoder.

I've been experimenting with externalising most of the rANS4x16 data transforms to be separate CRAM encodings which can be nested. This works well, but we'll need to rewrite the name tokeniser in a way that can use these transforms too before running the rANS part.

The newer rANS is still worth it over 4x16, due to better encoding of frequency tables (especially order-1), but I think we can also improve on it by making it an Nx16 where N is 1, 4 or 32. This allows it to be efficient in size on small blocks and efficient in speed on very large blocks (where ~128 bytes worth of state isn't a problem). I already have SIMD vectorised versions at 32-way interleaving that run 2-3x faster than the original 4-way code, so it may be good to permit exploitation of this. Writing a generic non-SIMD version is no harder for 4-way unrolling than any N-way unrolling, so it's not needless overhead, and the spec as it's written just has a lot of …

Similarly the X4 4-way interleaving should probably be N-way interleaving, so we can try encoding any fixed structure size (eg 16-bit ints, longs, 12 char barcodes, etc).
Rebased, and also updated with newer ideas.

The rANS4x16 codec now has a choice of 10-bit or 12-bit frequencies for the order-1 frequency table. The 4x8 rans codec has a fixed frequency total of 12 bits (4096). For speed, via reduced cache miss pressure, this was changed to 10-bit for the 4x16 variant. However it was found to perform poorly on some extreme frequency distributions. Hence it's now configurable, so we get speed when it doesn't impact compression ratio. The code (jkbonfield/htscodecs@40b6eeb) and spec (4f39e32) changes are trivial.

The 4-way interleaving (X4) transformation in the rANS4x16 and arith codecs is now a generalised N-way stripe function. It's the same binary format bar a single extra byte to specify N. It's a bit more complex to describe - I struggled with how best to document it - but it's more generic and should in theory permit efficient compression of 16-bit quantities as well as 32-bit ones. (Currently it only really comes into play for encoding integers in the name tokeniser.)
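For reference, a minimal sketch of the N-way stripe idea described above: byte i goes to lane i mod N and each lane is compressed separately. The value of N and the per-lane lengths would be recorded in the codec header; that framing (and the corresponding un-stripe on decode) is omitted here.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: split `in` into N interleaved lanes, concatenated into
 * `out` (same total length).  Byte i of the input lands in lane i % N,
 * so e.g. with N=4 all the top bytes of little-endian 32-bit values end
 * up together, which usually compresses better.  N is assumed to be
 * between 1 and 255. */
static void stripe_encode(const uint8_t *in, size_t len, int N, uint8_t *out)
{
    size_t lane_len[256] = {0}, cursor[256];

    for (size_t i = 0; i < len; i++)
        lane_len[i % N]++;

    cursor[0] = 0;
    for (int j = 1; j < N; j++)
        cursor[j] = cursor[j - 1] + lane_len[j - 1];

    for (size_t i = 0; i < len; i++)
        out[cursor[i % N]++] = in[i];
}
```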
Note to @yfarjoun: this is killing latexdiff for reasons currently unknown. CRAMv3.tex in master and this PR both build and view fine. The diff subdirectory produced by our circleci process however bombs out.
The offending block of generated LaTeX, around lines 2901-2917, is:
Deleting that block makes the error go away.
The offending LaTeX that breaks the diff is this:
It's pretty ugly, using an array within math mode to do a table-style layout. It could be done as an outer table with lots of open/close math mode within each cell to avoid this latexdiff bug I guess, but I'm inclined to leave it as-is. I can't fix it trivially without needing another PR (which itself would also break) first, before relaunching this diff again. The diffs are a nicety, but if they fail it's just a bit more work for us manually inspecting instead.
The CRAMv4.tex document has now been hived off into its own PR #502, given it's likely a lot more contentious and harder to implement than the new codecs in v3.1 alone. Both have been rebased off master and diverge from that point, but they touch separate files so it shouldn't be problematic.
Sorry this is long, but it's hard to describe without some basic understanding of the implementation choices. Do we want to potentially gain 5-30% speed when we have AVX2 or similar available, at a cost of a 5-15% slowdown when not? The spec change is minimal, and the use of the new variant is optional and only expected to be used on very large blocks of data - eg quality scores or "signal" aux tags.

In CRAM 3.1 we introduce a newer rANS codec. The primary differences are 16-bit renormalisation instead of 8-bit (for speed), and the addition of in-built RLE and bit-packing transformations. Ideally they would be separate transformation layers implemented as CRAM encodings (I have this IIRC in CRAM 4.0, but it's a bigger change and frankly a small standalone entropy encoder is a smaller pill to swallow than more significant format changes). However I'm thinking of another change to this.

Both the old "rANS 4x8" and new "rANS 4x16" have 4-way interleaving. Here's the pseudocode for 4x16:

Naively this is described as a mod on the i-th value to rotate through the 4 rANS states. The reason for doing this is that updating the state involves various multiplies, which come with a latency. We cannot immediately execute the next CPU instruction if it has a dependency on the previous result. By interleaving multiple states, we drive the CPU at the frequency it can issue instructions and ignore the instruction latency. The order-1 entropy encoder is more complex, but follows a similar style:

Now it's clear 4-way is just an arbitrary amount. I chose it because 2-way doesn't entirely remove all instruction latency, while more than 4 starts to add other inefficiencies. However... that assumes we can't use SIMD instructions. Try as hard as I could, I never really got SSE4 to work well on rANS. That's not true for AVX2 (now becoming standard) and AVX512 (still pretty rare). I have an AVX2 variant of rANS that works on 32-way unrolling. That means it's updating 32 states at a time, which is basically 4 interleaved AVX2 registers (for the same latency reasons as above).

An example on a PacBio "ip:B:C," data aux tag:
We see here the 4x8 and 4x16 4-way implementations as described in this PR, plus the 32-way variant implemented using pure scalar code (basically an optimised version of the above pseudocode with the 4s changed to 32s), and a hand-written AVX2 SIMD variant. [In my htscodecs test harness, -o0 / -o1 is order-0/1 4-way, +4 is 32-way auto-SIMD, and +0x600 is 32-way with forced scalar-only implementations.] The speed gain for the encoder is slight, but vast for the decoder. The scalar implementation however is a bit slower. Note this was gcc, but clang has 377MB/s and 364MB/s enc for -o0 vs -o0x604, so they're far closer together and it indicates maybe we can improve the code to make the 32-way scalar encoder work better on gcc too. Similarly with a block of unaligned base-calls, we get the following benchmarks:
Here CRAM 3.0 ends up choosing an order-1 encoder as it's smaller than the order-0, but I include both for comparison. CRAM 3.1 uses the bit-packing: with 4 base types we end up packing 4 to a byte before encoding, which makes order-0 almost as good as order-1 but with a fraction of the CPU time. Given the bit-packing hasn't changed, going from PACK+4way to PACK+32waySIMD isn't so major, but it's still a benefit. As before, PACK+4wayScalar however is a bit slower.

So what does all this actually mean? Well, these data types are for some 100,000 unaligned PacBio reads, after converting to CRAM and putting through io_lib's …
The files are dominated by base calls and two aux arrays. (There are no quality values in this data set for some reason.)
With CRAM 3.1 we see the O1 decoder has mostly dropped away, as we're now using bit-PACKing plus O0 for the base calls:
It's still dominated by the rANS decoding though, so with 32-way AVX2 we see:
Still 30%, so the AVX512 version (50% quicker still, but not yet ported to htscodecs) would bring this down further. In terms of wall clock timings, Scramble shows:
This is perhaps an abnormal data set as it shows the new codec in the best light. Aligned NovaSeq shows really marginal differences; a 3-5% gain basically. Aligned ONT data without any signal data (so primarily gaining in QS compression) shows about an 8% speed up decoding (and a 2% slowdown in scalar mode). Old-style Illumina with 40 quals is similar.

My proposal is, like the RLE, bit-PACK and order-0/1 options, to make 4-way vs 32-way a parameter stored in the data stream. It means the decoder reads this and either naively uses it as a variable in the decoding loops, or has hand-tuned variants for the 4-way case and the 32-way case. Spec-wise it's a minimal change, but implementation-wise it's more questionable. It adds complexity and may penalise CRAM performance on very naive implementations or old hardware (note TODO: get access to ARM and do some Neon code), but it brings noticeable benefits to smart implementations on modern hardware.

Thank you to all who bothered to read this far. What do people think about this?
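To make the interleaving argument above concrete, here is a hedged sketch of the loop shape with the number of rANS states as a run-time parameter. The per-state work is a stand-in (a hypothetical decode_one_symbol); a real rANSNx16 decoder would do the frequency lookup and renormalisation there. The point is only that the i-th output symbol comes from state i mod N, so N independent multiply chains are in flight and no single chain's latency stalls the CPU.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint32_t x; } RansState;

/* Stand-in for the real per-state update (frequency lookup, state
 * advance, renormalisation).  The multiply here merely mimics the
 * latency-bound work a real decoder performs per symbol. */
static inline uint8_t decode_one_symbol(RansState *r)
{
    uint8_t s = (uint8_t)(r->x & 0xff);
    r->x = r->x * 0x9e3779b1u + 12345u;
    return s;
}

/* N-way interleaved decode loop: symbol i is produced by state i % N.
 * With N=4 or N=32 the independent state updates can overlap, and the
 * N=32 case maps naturally onto AVX2/AVX512 lanes. */
static void decode_interleaved(RansState *R, int N, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = decode_one_symbol(&R[i % (size_t)N]);
}
```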
I decided to go for it and add the documentation of 32-way vs 4-way into the CRAM 3.1 codec. I've done this because the penalties of not explicitly vectorising aren't huge (nil for the test JavaScript code, but maybe 10-20% for C) while the potential benefits are very significant. Plus it's not mandatory.

The last commit has these changes in. You can see that the impact on the pseudocode and specification is really quite minimal in the grand scheme of things. (There is a bit more changing there because I wanted to distinguish between interleaving N rANS states vs N-way (now X) data-striping, eg shuffling data so every 4th byte is collated together prior to compression.)

See samtools/htscodecs#26 for the C and JavaScript implementations of this idea. The JavaScript changes to support selection of N-way rANS state interleaving demonstrate how easy it could be (rather than how complex it can be if you look at the optimised C code!). jkbonfield/htscodecs@6a5e014
CRAMcodecs.tex
\State $last\_sym \gets s$
\State $rle \gets 0$
\Repeat
\State $A \gets (A,s)$ \Comment{Append $s$ to the symbol list $A$}
This currently implies that A holds a list of symbols in the order that they are read from the file, which has implications for the order of frequency values read later. It might be useful to add a restriction that s is always increasing, or maybe that A should either be sorted into order before returning it, or should be treated as an ordered set rather than a list.
See also https://github.com/samtools/htscodecs/pull/26/files#r814844095
The fact this is a run-length encoded list of symbols with the ability to auto-increment symbol numbers is at least very heavily implying that they are stored sequentially and in ascending order; however I agree it needs to be an explicit statement, as technically it could be encoded as a mix of out-of-order symbols interspersed with ordered clumps of adjacent values. I'll amend.
As this is still kicking around in a non-adopted state, mainly due to a combination of lack of review and lack of a complete second implementation, I decided it's still valid to consider changing it.

Both existing long-read technology and some of the newer instruments (such as Ultima Genomics) exhibit a strong correlation between the local sequence motif and the quality stream. We explored such correlations in the earlier era of Solexa / Illumina, but the correlation was so slight it was deemed inadvisable to create a dependency between distinct types of data. However this decision should be revisited, as has been demonstrated by tools such as ENANO.

An example: 1 million fastq lines from some bleeding-edge Ultima Genomics data, quality size only:
These show the standard quality parameter sets using the test harness in https://github.com/samtools/htscodecs/. I then used the fork of htscodecs in fqzcomp5 (https://github.com/jkbonfield/fqzcomp5) which adds the ability to define the sequence motif as part of the quality context. The parameters involved are the length of the motif (2-bit encoded) and the backwards offset relative to the quality value being encoded. So we could have a 4bp motif (8 bits) with an offset of 2, meaning 2bp prior to this qual, the bp for this qual itself, and 1bp after. Using the FQZ+SEQ encoder previously tuned for ONT on UG data gave this:
It's clear the profile previously tuned on ONT was sufficient here too. This is around a 14% reduction in quality size, so really very significant. We see a similar impact with ONT and PacBio data sets too, so this has mainstream benefits.

Practically speaking, this has an impact on CRAM implementations as they can no longer assume total independence of data series, and would need to consider FQZ+SEQ encoding as a special case where decoding all data streams associated with sequence is a prerequisite for obtaining qualities. While independence of data streams is very useful for rapid partial decoding, obtaining quality without the corresponding sequence bases has extremely limited value (unlike vice-versa, which is a common operation). Furthermore CRAM decoders, if they wish to permit partial decoding, already must track both data dependence and code-path dependence, as the specification permits multiple data series to be encoded within the same block. (An example of data dependence would be the DL data series (deletion length) being in the same block as quality values, while code-path dependence would be the knowledge that the DL data series can only be decoded if the FC and FN data series have also been decoded.) However to facilitate efficient implementations, it's probably best if the codec types are separated into FQZ (qual only) and FQZS (qual+seq context) so tools can rapidly determine data dependencies without needing to decode the FQZ parameter blocks themselves.

The second thing to consider for archive CRAMs is increasing the maximum context size from 16 bits to 20 bits. I chose 16 bits because CRAM is designed around fine-grained random access and it's hard for large models to learn all the probabilities, however if we have very large block sizes (for archival) then this may be too restrictive. The above parameters use 8 or 10 bits of sequence motif in the context, leaving very few bits left for quality encoding. Thus the quality encoder is reduced from an order-2 context down to order-1 in order to free up bits in the model index. By increasing the model size to 20 bits, we get the following:
This is basically O(2) qual encoding (6 bits at a time to cope with the Q0 to Q40 inclusive range) and 8 bits of sequence context (offset 2). The memory used for the quality encoder increases considerably, but fqzcomp5 running with 4 threads was still only around 2GB so it's not an insurmountable amount. Combined with the seq context extension, these two changes amount to an 18% reduction in quality size for this test data set, and the qualities are the dominant space in an aligned UG CRAM.

As hinted at above, the 16 to 20-bit gains are much lower when using smaller block sizes. The above was the entire 1 million reads in a single block (considerably larger than our current "archive" profile for CRAM in htslib), but spread across 7 blocks going from 16 to 20 bits only saved around 1.5%. It is likely we wouldn't use the larger context size except when aiming for maximum compression in archive mode, where a coarse degree of random access is acceptable. So this is a less significant change than adding sequence contexts.

For reference, current CRAM 3.0 uses 63097604 bytes for these qualities, and current CRAM 3.1 in archive mode uses 61192567 bytes (not too dissimilar to the values seen in the original plot above).

Given the lack of progress elsewhere in this PR, I'm inclined to amend it. Comments?
An update, keeping to a 16-bit context size at the moment and tested on a single large block of data (varying between 100M and 1G by machine type). This shows the benefit of adding SEQ into the context, with hand-tuned fqzcomp params per data type.
It's clear the sequence context is strongest for Ultima Genomics, but it's a brand new platform and it's possible this will change as base-callers further improve. TODO: PacBio HiFi, more modern ONT (this is old GIAB data), other new platforms. Going to a 20-bit context size didn't help ONT or PB, and it helped Illumina only marginally (0.6%), but Ultima Genomics gained a further 5.9% (compared to 16-bit with SEQ context). So I'm unsure if larger contexts are worth it overall yet, given the increased memory and slower times, but it could be an elective thing.
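To illustrate the idea of folding a sequence motif into the quality-model context, here is a hedged sketch. The function name and parameter set (number of previous quals, motif length, backwards offset, context bits) are mine, chosen to mirror the knobs described above; the actual bit layout used by fqzcomp5 / the FQZComp codec differs in detail.

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch only: build a model index from (a) the previous quality
 * values, 6 bits each (enough for Q0..Q40), and (b) a 2-bit-per-base
 * sequence motif starting `offset` bases before the current position.
 * ctx_bits is the model size (16 in the current draft, 20 in the
 * "archive" experiment discussed above). */
static uint32_t qual_seq_context(const uint8_t *qual, const uint8_t *seq2bit,
                                 size_t len, size_t pos,
                                 int prev_quals, int motif_len, int offset,
                                 int ctx_bits)
{
    uint32_t ctx = 0;

    for (int i = 1; i <= prev_quals; i++) {
        uint32_t q = (pos >= (size_t)i) ? (qual[pos - i] & 0x3f) : 0;
        ctx = (ctx << 6) | q;
    }

    for (int i = 0; i < motif_len; i++) {
        long p = (long)pos - offset + i;
        uint32_t b = (p >= 0 && (size_t)p < len) ? (seq2bit[p] & 3) : 0;
        ctx = (ctx << 2) | b;
    }

    return ctx & ((1u << ctx_bits) - 1);   /* clamp to the model size */
}
```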
CRAMcodecs.tex
The symbol frequency table indicates which symbols are present and
what their relative frequencies are. The total sum of symbol
frequencies are normalised to add up to 4095.
Both htscodecs and HTSJDK add up to 4096, not 4095.
- https://github.com/samtools/htscodecs/blob/843d4f63b1c64905881b4648916a4d027baa1a1c/htscodecs/rANS_static.c#L37-L38
- https://github.com/samtools/htsjdk/blob/4a4024a97ee3e87096df6ad9b22c8260bd527772/src/main/java/htsjdk/samtools/cram/compression/rans/Constants.java#L4-L5
Edit: To clarify, these implementations will readjust to 4095, but it's confusing that they initially target and scale with 4096 rather than 4095, which is not how it's described in the document.
I did see that the JavaScript implementation both scales and adds to 4096, which may be a bug?
I think this is a legacy thing from CRAM 3.0, which wrongly stated 4095 as I believed 4096 would cause wrap-around issues. That wasn't true, and the code works just fine. I'm a little cautious about explicitly permitting 4096 in case people have implemented a decoder with an assumption that it matches what the spec originally stated, but I think we can enumerate the implementations pretty easily to double check this.
I'll investigate. It would definitely be cleaner to recommend 4096 (with a comment that 4095 is permitted for compatibility with historic versions of the spec).
Edit: I do see this though which appears to require the sum to be <4096:
https://github.com/GMOD/cram-js/blob/master/src/rans/frequencies.js#L53
I haven't worked out how to test this on the command line yet.
I can verify that the output from Scramble (io_lib's original CRAM implementation), htslib, and htsjdk all produce sums of 4095 for their O0 and O1 frequency tables.
The TOTFREQ definition in Java just copied my C version, so it all stems from the same silly issue back then, I guess. 4096 is the size of the array used in the decode function, and it's that value simply because it permits AND operations and lookup tables (which is probably why I originally confused myself with 4095 being the maximum freq). The renormalisation of frequencies is non-trivial and isn't covered by the specification either, as the spec describes the at-rest file format rather than a specific implementation for determining how to get your frequencies summing correctly. So it's inevitable that we start with a target value and then have to rescale after summing to get things adding up.
I agree it looks like the javascript implementation is wrong here. It needs fixing, although that's an htscodecs issue rather than an hts-specs one.
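Since the discussion above notes that the spec only constrains the at-rest sum and leaves the renormalisation strategy to the encoder, here is a minimal, hedged sketch of one such strategy: scale raw counts to a fixed total (4095 here, per the historic wording) with a floor of 1 for every observed symbol, then dump any rounding remainder onto the last used symbol. Real encoders (htscodecs, htsjdk) use more careful redistribution; this is illustrative only.

```c
#include <stdint.h>

#define TOTAL 4095   /* target sum; the 4096 question is discussed above */

/* Sketch only: normalise raw counts so the stored frequencies sum to
 * (approximately) TOTAL.  Every observed symbol keeps freq >= 1.  The
 * degenerate case where the floors overshoot TOTAL is merely clamped
 * here; a production encoder redistributes more carefully. */
static void normalise_freqs(const uint32_t cnt[256], uint16_t freq[256])
{
    uint64_t sum = 0;
    uint32_t assigned = 0;
    int last = -1;

    for (int i = 0; i < 256; i++)
        sum += cnt[i];
    if (sum == 0)
        return;

    for (int i = 0; i < 256; i++) {
        if (cnt[i] == 0) { freq[i] = 0; continue; }
        uint32_t f = (uint32_t)((uint64_t)cnt[i] * TOTAL / sum);
        if (f == 0) f = 1;
        freq[i] = (uint16_t)f;
        assigned += f;
        last = i;
    }

    /* Push the rounding remainder onto the last used symbol, keeping it
     * at least 1 so the table stays decodable. */
    int adjust = (int)TOTAL - (int)assigned;
    if ((int)freq[last] + adjust < 1)
        adjust = 1 - (int)freq[last];
    freq[last] = (uint16_t)((int)freq[last] + adjust);
}
```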
CRAMcodecs.tex
One last caveat is that we have no context for the first byte in the
data stream (in fact for 4 equally spaced starting points, see
``interleaving" below). We use the ASCII value (`\textbackslash0') as
the starting context and so need to consider this in our frequency
table.
The output of the rANS implementations in htscodecs also includes the final context, which is neither described here nor in the following example. (The final context in the case of the following example is ('a', '\0').)
Initially I didn't get this - I dumped out the contents of out_buf holding the O1 frequency table for the example input and it matches (bar the frequencies themselves, which don't match due to changes in the frequency renormalisation algorithm, but as stated before how you get the frequencies to add up to 4095 is up to the implementation).

However, if I change it to abracadabraabracadabraabracadabraabracadabrx, so the last letter occurs only once, then I see that initially 'x' isn't in the F[256][256] frequency table:
(gdb) p F['r']
$56 = {0 <repeats 97 times>, 7, 0 <repeats 22 times>, 1, 0 <repeats 135 times>}
(gdb) p F['r']['a']
$57 = 7
(gdb) p F['r']['x']
$58 = 1
(gdb) p F['x'][0]
$59 = 0
(gdb) p F['x']
$60 = {0 <repeats 256 times>}
However F['x'][0..255] gets renormalised because T['x'] != 0. We could work around this special case, where the last character in the buffer appears once and once only (meaning it doesn't have any following contexts of its own), with this minor patch:
diff --git a/htscodecs/rANS_static.c b/htscodecs/rANS_static.c
index 9865022..9a37eb0 100644
--- a/htscodecs/rANS_static.c
+++ b/htscodecs/rANS_static.c
@@ -420,6 +420,12 @@ unsigned char *rans_compress_O1(unsigned char *in, unsigned int in_size,
if (T[i] == 0)
continue;
+ if (T[i] == 1 && in[in_size-1] == i) {
+ // last symbol can be ignored as it doesn't have following
+ // contexts.
+ continue;
+ }
+
//uint64_t p = (TOTFREQ * TOTFREQ) / t;
double p = ((double)TOTFREQ)/T[i];
normalise_harder:
It changes the output as expected:
@ seq4c[samtools.../htscodecs]; hd /tmp/_o1.orig
00000000 01 48 00 00 00 01 3f 00 00 00 2c 00 00 00 00 61 |.H....?...,....a|
00000010 8f ff 00 61 61 82 86 62 02 86 bd 83 5e 83 5e 00 |...aa..b....^.^.|
00000020 62 02 72 8f ff 00 61 8f ff 00 61 8f ff 00 72 61 |b.r...a...a...ra|
00000030 8d ff 78 82 00 00 78 00 8f ff 00 00 47 98 48 53 |..x...x.....G.HS|
00000040 47 98 48 53 47 98 48 53 34 d7 46 02 ad |G.HSG.HS4.F..|
0000004d
@ seq4c[samtools.../htscodecs]; hd /tmp/_o1.new
00000000 01 43 00 00 00 01 3a 00 00 00 2c 00 00 00 00 61 |.C....:...,....a|
00000010 8f ff 00 61 61 82 86 62 02 86 bd 83 5e 83 5e 00 |...aa..b....^.^.|
00000020 62 02 72 8f ff 00 61 8f ff 00 61 8f ff 00 72 61 |b.r...a...a...ra|
00000030 8d ff 78 82 00 00 00 47 98 48 53 47 98 48 53 47 |..x....G.HSG.HSG|
00000040 98 48 53 34 d7 46 02 ad |.HS4.F..|
00000048
It's subtle, but there are two 'x's in the o1.orig file, with the first showing r->x and the second being x->nul.
Note however this second transition is not necessary for decode and the decoder works fine. This is an implementation tweak (maybe a bug), but not something that the specification should require. Basically we can add more entries to the frequency table if we really wish and it'll have no bearing on the decoding of the data, as they're unused. I'm not sure how best to explain that in the formal specification.
@@ -1721,7 +1740,7 @@ \subsection{Statistical Modelling}
 frequency tables in the data stream covering all possible contexts.

 To do this we use a statistical model, containing an array of symbols $S$ and their frequencies $F$.
-The sum of these frequences must be less than $2^{16}-32$.
+The sum of these frequences must be less than $2^{16}-16$ so after adjusting the frequencies it never go above the maximum unsigned 16-bit integer.
The sum of these frequencies must be less than or equal to …
We add 16 to the frequency, and …

NB: it's incremented by 16 instead of 1 because we start with all symbols having a minimum probability of 1/N. We want to distinguish between symbols seen and those not yet seen, so the jump from 1/N (1/256 for a full alphabet) to 17/272 and 33/288 is a more rapid learning of probabilities than a simple +1. The price though is more frequent renormalisation, but that's little more than a bunch of bit shifts. It's better for smaller N too, so limited alphabets benefit from less overhead for the as-yet unseen symbols.
Some range coders use an escape code and freq 0 for all unused symbols. When we add a symbol that currently has freq 0, the escape code is emitted, then the new symbol itself, and then it's incremented as before. I.e. we start with 1/1 (P(escape)=1), then emit escape + 'x' and we have P(escape)=1/2 and P(x)=1/2. Then another x => P(escape)=1/3, P(x)=2/3. (Escape and literal symbols can be adjusted further based on the chance of new symbols vs how many have been emitted so far given a finite alphabet, but it's relatively minor.) This method is a better estimation of probabilities, but it's also more complex and I didn't deem the speed hit to be worth the small size difference, especially when we have a small alphabet. It only really helps significantly on small blocks of data or those with non-stationary probabilities, where the size / speed of adaption matters the most.
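A minimal sketch of the adaptive-frequency scheme being described: every symbol starts at 1, a coded symbol is bumped by 16, and when the running total nears the 16-bit ceiling everything is halved with a floor of 1. The constants follow the figures in the discussion above (increment 16, total kept below 2^16-16); the exact model layout and renormalisation points in the arithmetic coder spec may differ.

```c
#include <stdint.h>

#define NSYM      256
#define STEP      16
#define MAX_TOTAL ((1u << 16) - STEP)

typedef struct {
    uint32_t freq[NSYM];
    uint32_t total;
} Model;

/* All symbols start with frequency 1 (probability 1/N), so unseen
 * symbols remain codable. */
static void model_init(Model *m)
{
    for (int i = 0; i < NSYM; i++)
        m->freq[i] = 1;
    m->total = NSYM;
}

/* After coding `sym`, bump it by STEP.  When the total would exceed
 * the 16-bit budget, halve every frequency (never below 1) and
 * recompute the total - the "bunch of bit shifts" mentioned above. */
static void model_update(Model *m, int sym)
{
    m->freq[sym] += STEP;
    m->total     += STEP;

    if (m->total >= MAX_TOTAL) {
        m->total = 0;
        for (int i = 0; i < NSYM; i++) {
            m->freq[i] -= m->freq[i] >> 1;   /* ceil(f/2), floor of 1 */
            m->total   += m->freq[i];
        }
    }
}
```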
The X_EXT part of arith-dynamic compression, which uses bzip2 after the RLE / PACK ops instead of an arithmetic coder, seems unwieldy and unjustified. Htslib doesn't use this feature, and nor does io_lib. I can't remember why I put it there, but probably it worked on one specific data set somewhere. Commenting out the bzip2 decoder in htscodecs arith_dynamic.c doesn't cause it to fail any tests either, so I'm obviously not exercising this route. On the name test data from https://github.com/jkbonfield/htscodecs-corpus I explicitly added bzip2 into the mix of things to try, to see what impact on compressed size it gave.
The only one of these that changed significantly was nv.names (NovaSeq names), which shrank by 7%. This doesn't happen on smaller CRAM slices of 10k reads / slice, but does at 25k per slice and above. 05.names (HiSeq) was close to 1%, but still not that major. On a NovaSeq cram I see names are about 25% of storage, so that bzip2 saving comes out at about 1.7% total file size reduction. It seems very specific to this one naming scheme. I'll research further, but am interested in the thoughts of others who have implemented this spec as to whether they did this small corner case, and whether they have any intention to use it?

Edit: The reason for the benefit with bzip2 on NovaSeq comes down to the X and Y coordinates being better compressed with bzip2 than any of the rans or arithmetic coder offerings. This in turn appears to be due to the distribution of values. The coordinates are strongly clumped every 15 or 16 pixels apart in one dimension, and every 9 or 10 in the other, likely due to the gridded cluster layout. This means a sorting algorithm, such as the BWT in bzip2, can do a better job of representing the data than a naive byte-based entropy encoder. The jury is still out on whether it's worth using. It's not possible to produce CRAMs using this right now from htslib, but it isn't hard to enable and it has been tested, as seen above.
I rebased this PR and did a full re-read and final round of minor updates (mostly typographical or language improvements). I'm happy with it, barring the open question of whether all of the spec is actually needed (see bzip2 above). That doesn't particularly hurt though, as it's a minimal change in complexity compared to everything else. Given we have two full implementations for both reading and writing, which have been cross-validated, I don't see any reason for delaying this further. Please shout now if you feel otherwise!
The implementation to support the …

I agree it's hard to justify keeping it when your test case is for a specific pattern of read names. I'd be curious how it compares to the name tokenizer codec instead.
Agreed it's such a trivial thing to add compared to the huge complexity of the rest of it, that it's a very minor point, so I'm not inclined to worry too much.
Htslib doesn't use the arithmetic coder by default, but can be coaxed into using it with e.g. …

It's a trivial patch to enable this though for the level 7 and level 9 name tokeniser:
The above replaces the tokenise-names arithmetic level 7 (17) with the old level 9, and adds a new bzip2 mode to the new level 9, so we can compare -17 vs -19 and see what impact it'd make at the highest compression levels. The figures shown above are for the test data compressed with htscodecs' tests/tokenise_name3 tool. An example on read names from 9 million to 10 million records into my NovaSeq test data:
So that's a 7.3% saving on RN. Given that on the whole CRAM the RN field was ~19% of the file size, this equates to a ~1.3% total CRAM size reduction. However that's only in the most aggressive "archive" mode, as we don't even use the arithmetic coder in any lesser compression profiles; it's not worth the speed vs size tradeoff normally (rANS is so much faster, especially the SIMD vectorised variants). It's trickier to get it used directly by htslib/samtools though, due to a bug in the auto-strat selection for TOK3+RANS vs TOK3+ARITH. Fixing that, I see:
vs
So the ratios are similar to the test tool - 1.3% overall file saving, and 7-8% for RN block. As an example, samtools speeds vs sizes (8 threads, wall clock time):
(x) Using the normal "small" profile is rANS everywhere, but this hacks the code to be rANS for all except the name tokeniser, so the only extra cost and saving is in the RN blocks. So it's pretty much overkill. Even with a modified profile so "small" mode uses tok3+arith for names, we're still 33% slower for a 1.7% size reduction. Maybe beneficial? Maybe not. Hard to say.

Archive level 9 is the extreme scenario. I'm not expecting anyone to really use archive mode as it's slow. It's there mainly as a demonstration of what the format can do if we throw CPU at it, and to act as a baseline when people compare other (potentially even slower) formats against CRAM. If we decide to keep it in the spec (and as you say it's minimal extra complexity vs everything else), then I guess the normal compression level of archive mode is a reasonable tradeoff to add bz2 in there: a 6% slowdown for a 1.3% size reduction is pretty decent given the profile is aiming at high compression.

Edit: remember though this is NovaSeq, which benefits here due to the very specific case of bzip2 working well on X and Y coords (likely due to the gridded layout). On other data sets we're not going to see much size change at all, but may still pay the CPU overheads, although it could be improved by better learning strategies.

Anyway, to summarise a long ramble: my instinct is to keep it there and enable this feature as it's beneficial in at least some (pretty common) situations.
Note there may also be other places where this is beneficial. E.g. the ability to do bit-packing + bzip2 may help considerably for some aux tags. It's definitely not a clean way of doing things. For CRAM 3.1 I wanted to keep the changes solely focused on "external" codecs, i.e. replacements for gzip, bzip2 and lzma. However the correct approach (in the CRAM 4.0 draft) is to lift the transforms out from rans/arith into the CRAM encoding methods so we can use nested layers, such as putting a data series through PACK, then through RLE, and then storing to EXTERNAL block 12, which is then compressed using BZ2. I also added better variable-sized integer encoding as a transform too. This is a much more flexible strategy, but it needs more aggressive changes to the format, and the slow progress of this PR showed me that we'd need to accept smaller piecemeal updates, even if ultimately they're not the best solution.
Squashed, rebased, and attached the prebuilt PDFs to the top post.
To all concerned, I am now starting the timer on acceptance. One month from now (14th March), if there are no reasonable objections, CRAM 3.1 will become the current official specification version.
Changed PDFs as of 4e4cbbe: CRAMcodecs (diff), CRAMv3.
- Added CRAM_codecs document to describe the proposed CRAM3.1 codecs. This also has the CRAM3.0 rANS codec in it, which has been removed from CRAMv3.tex.
- Update CRAMv3.tex to include both CRAM 3.0 and 3.1. The differences are so minor it's essentially just expanding the list of legal values in Block compression methods and updating a few 3.0s to be more general.

The new EXTERNAL compression codecs added for CRAM 3.1 are:

- rANSNx16. 16-bit renormalisation instead of the old 8-bit renormalisation, for extra speed. This now permits either 4-way or 32-way unrolling, with the latter being a good target for SIMD optimisations. The newer rANS also includes PACK, RLE and STRIPE (interleaving) transformations to be applied prior to entropy encoding.
- Adaptive arithmetic coder. This has all the same transformations as rANSNx16, but with a slightly better ratio and better performance on data sets with non-stationary probabilities. It's slower however.
- FQZComp quality codec. A generalised version of the old fqzcomp-4.6 tool.
- Name tokeniser. A compression tool for any data series that has repeated structured text elements. It breaks the line down into a series of tokens and optionally deltas each against a previous line. Each token then forms a new series of compression streams, which are in turn compressed and serialised into a single block.

With thanks to Michael Macias and Yash Puligundla for review and corrections.
Changed PDFs as of 97a1e20: CRAMcodecs (diff), CRAMv3.
Changed PDFs as of 7efd204: CRAMcodecs (diff), CRAMv3.
This also has the CRAM3.0 rANS codec in it. ~~I haven't yet removed this from CRAMv3.tex~~ Edit: CRAMv3.tex also now updated.
A prebuilt copy of the documents is attached.
Code implementing these codecs can be seen in https://github.com/samtools/htscodecs
The C code is a bit gnarly - it's suffering from many iterations of changes and it's also heavily optimised, making it hard to read. The JavaScript code however is reasonably straightforward (if not always optimised for speed) and tries to be a vanilla implementation of the spec pseudocode, for purposes of validating it and acting as additional spec clarification. (That means it's not production quality and deliberately lacks a lot of niceties like testing return codes.)
CRAMcodecs.pdf
CRAMv3.pdf
Edit 13th Feb 2022. To test this version, feel free to use the current release of samtools/htslib (1.16 at time of writing).
You can evaluate CRAM v3.0 vs v3.1 using:
That's the default "normal" mode. You may also wish to try the fast, small and archive profiles for each, plus the compression level can be adjusted. For example:
Demonstration files are available at ftp://ftp.sanger.ac.uk/pub/users/jkb/CRAM3.1/