A na\"ive implementation of a rANS decoder follows.
605
652
This pseudocode is for clarity only and is not expected to be performant; a practical implementation would normally use lookup tables for efficiency.
606
-
The function \textsc{ReadUint8} below is undefined, but is expected to fetch the next single unsigned byte from an unspecified input source. Similarly for \textsc{ReadITF8} (variable size inetger) and \textsc{ReadUint32} (32-bit unsigned integer in little endian format).
653
+
The function \textsc{ReadUint8} fetches the next single unsigned byte from an unspecified input source. Similarly for \textsc{ReadITF8} (variable size integer) and \textsc{ReadUint32} (32-bit unsigned integer in little endian format).
\For{$j\gets0\algorithmicto3$}\Comment{Initialise the 4 interleaved streams}
694
740
\State$R_j \gets$\Call{ReadUint32}{}\Comment{Unsigned 32-bit little endian}
695
741
\EndFor
696
742
\For{$i\gets0\algorithmicto nbytes-1$}
@@ -707,7 +753,7 @@ \subsubsection*{rANS order-0}
707
753
\subsubsection*{rANS order-1}
708
754
709
755
As described above, the decode logic is very similar to rANS Order-0 except we have a two dimensional array of frequencies to read and the decode uses the last character as the context for decoding the next one.
710
-
In the pseudocode we demonstrate this by using two dimensional vectors $C_{i,j}$ and $F_{i,j}$.
756
+
In the pseudocode we illustrate this by using two dimensional vectors $C_{i,j}$ and $F_{i,j}$.
711
757
For simplicity, we reuse the Order-0 code by referring to $C_i$ and $F_i$ of the 2D vectors to get a single dimensional vector that operates in the same manner as the Order-0 code.
712
758
This is not necessarily the most efficient implementation.
713
759
@@ -717,9 +763,9 @@ \subsubsection*{rANS order-1}
717
763
\vskip 0.5cm
718
764
719
765
\begin{algorithmic}[1]
720
-
\Statex (Reads a table of Order-1 symbol frequencies $F_{i,j}$
766
+
\Statex (Reads a table of Order-1 symbol frequencies $F_{i,j}$)
721
767
\Statex (and sets the cumulative frequency table $C_{i,j+1} = C_{i,j}+F_{i,j}$)
\State$R_j \gets$\Call{ReadUint32}{}\Comment{Unsigned 32-bit little endian}
747
793
\State$L_j \gets0$\Comment{Last symbol}
748
794
\EndFor
@@ -791,15 +837,59 @@ \section{rANS Nx16}
791
837
Frequencies are now stored using uint7 format instead of ITF8. The
792
838
tables are also stored differently, separating the list of symbols
793
839
present in the alphabet (those with frequency greater than zero) from
794
-
the frequencies themselves. The symbol list must be stored in
795
-
ascending ASCII order, with their frequency values in the same
796
-
ordering as their corresponding symbols. For the Order-1 frequency
797
-
table this list of symbols is those used in any context, thus we only
798
-
have one alphabet recorded for all contexts. This means in some
799
-
contexts some (potentially many) symbols will have zero frequency. To
800
-
reduce the Order-1 table size an additional zero run-length encoding
801
-
step is used. Finally the Order-1 frequencies may optionally be
802
-
compressed using the Order-0 rANS Nx16 codec.
840
+
the frequencies themselves.
841
+
842
+
Finally transformations may be applied to the data prior to
843
+
compression (or after decompression). These consist of stripe, for
844
+
structured data where every Nth byte is sent to one of N separate
845
+
compression streams, Run Length Encoding replacing repeated strings of
846
+
symbols with a symbol and count, and bit-packing where reduced
847
+
alphabets can combine multiple symbols into a byte prior to entropy
848
+
encoding.
849
+
850
+
The initial ``Order'' byte is expanded with additional bits to list
851
+
the transformations to be applied. The specifics of each sub-format
852
+
are listed below, in the order they are applied.
853
+
854
+
\begin{itemize}
855
+
\item{\textbf{\textsc{Stripe}}:}
856
+
rANS Nx16 with multi-way interleaving (see Section~\ref{sec:ransstripe}).
857
+
858
+
\item{\textbf{\textsc{NoSize}}:}
859
+
Do not store the size of the uncompressed data stream.
860
+
This information is not required when the data stream is one of the four sub-streams in the \textsc{Stripe} format.
861
+
862
+
\item{\textbf{\textsc{Cat}}:}
863
+
If present, the order bit flag is ignored.
864
+
865
+
The uncompressed data stream is the same as the compressed stream.
866
+
This is useful for very short data where the overheads of compressing are too high.
867
+
868
+
\item{\textbf{\textsc{N32}}:}
869
+
Flag indicating whether to interleave 4 or 32 rANS states.
870
+
871
+
\item{\textbf{\textsc{Order}}:}
872
+
Bit field defining order-0 (unset) or order-1 (set) entropy encoding, as described above by the \textsc{RansDecodeNx16\_0} and \textsc{RansDecodeNx16\_1} functions.
873
+
874
+
\item{\textbf{\textsc{RLE}}:}
875
+
Bit field defining whether Run Length Encoding has been applied to the data. If set, the reverse transform will be applied using \textsc{DecodeRLE} after Order-0 or Order-1 uncompression (see Section~\ref{sec:ransRLE}).
876
+
877
+
\item{\textbf{\textsc{Pack}}:}
878
+
Bit field indicating the data was packed prior to compression (see Section~\ref{sec:ranspack}). If set, unpack the bits after any RLE decoding has been applied (if required) using the \textsc{DecodePack} function.
879
+
\end{itemize}
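The decode-side ordering implied by the list above can be sketched as follows. This is an illustration only: the \texttt{Nx16Flags} container and the callables passed to \texttt{decode\_payload} are hypothetical stand-ins, bit positions within the flag byte are deliberately not modelled, \textsc{Stripe} handling is omitted, and we assume that RLE and Pack reversal still apply when \textsc{Cat} is set.

```python
from dataclasses import dataclass
from typing import Callable

Transform = Callable[[bytes], bytes]


@dataclass
class Nx16Flags:
    """Hypothetical decoded view of the flag byte (no bit values modelled)."""
    cat: bool = False
    order1: bool = False
    rle: bool = False
    pack: bool = False


def decode_payload(flags: Nx16Flags, cdata: bytes,
                   decode_order0: Transform, decode_order1: Transform,
                   decode_rle: Transform, decode_pack: Transform) -> bytes:
    """Undo the transforms in reverse order of application."""
    if flags.cat:
        # Cat: the stored bytes are the data; the order flag is ignored.
        data = cdata
    elif flags.order1:
        data = decode_order1(cdata)
    else:
        data = decode_order0(cdata)
    if flags.rle:
        # RLE is reversed after the entropy decoding step.
        data = decode_rle(data)
    if flags.pack:
        # Unpacking happens after any RLE decoding.
        data = decode_pack(data)
    return data
```

Passing the individual decoders as arguments keeps the sketch independent of any particular implementation of \textsc{DecodeRLE} or \textsc{DecodePack}.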
880
+
881
+
\subsection{Frequency tables}
882
+
883
+
Frequency tables in rANS Nx16 separate the list of symbols from their
884
+
frequencies. The symbol list must be stored in ascending ASCII order,
885
+
with their frequency values in the same ordering as their
886
+
corresponding symbols. For the Order-1 frequency table this list of
887
+
symbols is those used in any context, thus we only have one alphabet
888
+
recorded for all contexts. This means in some contexts some
889
+
(potentially many) symbols will have zero frequency. To reduce the
890
+
Order-1 table size an additional zero run-length encoding step is
891
+
used. Finally the Order-1 frequencies may optionally be compressed
892
+
using the Order-0 rANS Nx16 codec.
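To illustrate the separation of the symbol list from the frequencies, the following sketch builds both from in-memory tables. The function names and the dictionary representation are assumptions for illustration, and ``used in any context'' is taken here to mean a non-zero frequency in at least one context; the serialised byte layout is not modelled.

```python
from typing import Dict, List, Tuple


def split_order0(freq: Dict[int, int]) -> Tuple[List[int], List[int]]:
    """Return (symbols, frequencies): symbols ascending, frequencies in step."""
    symbols = sorted(s for s, f in freq.items() if f > 0)
    return symbols, [freq[s] for s in symbols]


def order1_alphabet(freq: Dict[int, Dict[int, int]]) -> List[int]:
    """One shared alphabet: every symbol with non-zero frequency anywhere."""
    used = {s for row in freq.values() for s, f in row.items() if f > 0}
    return sorted(used)


def order1_rows(freq: Dict[int, Dict[int, int]]) -> Dict[int, List[int]]:
    """Per-context frequency lists over the shared alphabet (zeros included)."""
    alphabet = order1_alphabet(freq)
    return {ctx: [row.get(s, 0) for s in alphabet] for ctx, row in freq.items()}


# Context 'A' never sees 'T', and context 'C' only sees 'G'.
table = {
    ord("A"): {ord("C"): 3, ord("G"): 1},
    ord("C"): {ord("G"): 4},
    ord("G"): {ord("A"): 2, ord("T"): 2},
}
print(order1_alphabet(table))       # [65, 67, 71, 84]
print(order1_rows(table)[ord("C")])  # [0, 0, 4, 0]
```

The zeros in the second printed row are the entries that the additional zero run-length step is intended to shrink.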
803
893
804
894
Frequencies must always add up to a power of 2, but do not necessarily
805
895
have to match the final power of two used in the Order-0 (4096) and
@@ -808,8 +898,6 @@ \section{rANS Nx16}
808
898
This is required as the Order-1 frequencies may be scaled differently
809
899
for each context.
810
900
811
-
\subsection{Frequency tables}
812
-
813
901
\begin{algorithmic}[1]
814
902
\Statex (Reads a set of symbols $A$ used in our alphabet)
815
903
\Function{ReadAlphabet}{}
@@ -837,9 +925,9 @@ \subsection{Frequency tables}
837
925
\vskip 0.5cm
838
926
839
927
\begin{algorithmic}[1]
840
-
\Statex (Reads a table of Order-0 symbol frequencies $F_i$
928
+
\Statex (Reads a table of Order-0 symbol frequencies $F_i$)
841
929
\Statex (and sets the cumulative frequency table $C_{i+1} = C_i+F_i$)
842
-
\Procedure{ReadFrequenciesNx16\_0}{$F, C$}
930
+
\Procedure{ReadFrequenciesNx16\_0}{$F,\ C$}
843
931
\State$F \gets (0,\ ...)$\Comment(Set to zero for all $i \in\{0, 1,
844
932
..., 255\}$)
845
933
\State$A \gets$\Call{ReadAlphabet}{}
@@ -858,7 +946,7 @@ \subsection{Frequency tables}
858
946
859
947
\begin{algorithmic}[1]
860
948
\Statex (Normalises a table of frequencies $F_i$ to sum to a specified power of 2.)
@@ -1125,7 +1217,7 @@ \subsection{rANS Nx16 Bit Packing}
1125
1217
\hline
1126
1218
1 & byte & $nsym$ & Number of distinct symbols\\
1127
1219
$nsym$ & byte[] & $P$ & Symbol map \\
1128
-
-? & uint7 & $len$ & Length of packed data
1220
+
? & uint7 & $len$ & Length of packed data
1129
1221
\end{tabular}
1130
1222
\end{table}
1131
1223
@@ -1157,7 +1249,7 @@ \subsection{rANS Nx16 Bit Packing}
1157
1249
data as described above.
1158
1250
1159
1251
\begin{algorithmic}[1]
1160
-
\Function{DecodePack}{$data$, $P$, $nsym$, $len$}
1252
+
\Function{DecodePack}{$data,\ P,\ nsym,\ len$}
1161
1253
\State$j \gets0$\Comment{Index into $data$; $i$ is index into output}
1162
1254
\If{$nsym \le1$} \Comment{Constant value}
1163
1255
\For{$i \gets0$ to $len-1$}
@@ -1206,33 +1298,33 @@ \subsection{rANS Nx16 Bit Packing}
1206
1298
\subsection{Striped rANS Nx16}
1207
1299
\label{sec:ransstripe}
1208
1300
1209
-
If we have a series of 32-bit values, we can get better compression by
1301
+
If we have a series of 32-bit values, we can often get better compression by
1210
1302
treating it as a series of 4 8-bit values representing the first to
1211
1303
last bytes in each 32-bit word, than we can by simply processing it as
1212
1304
a stream of 8-bit values.
1213
1305
Each $4^{th}$ byte is sent to its own stream producing 4 interleaved streams, so the $1^{st}$ stream will hold data from bytes 0, 4, 8, etc.\ while the $2^{nd}$ stream will hold data from bytes 1, 5, 9, etc.
1214
1306
Each of those four streams is then itself compressed using this compression format.
1215
1307
1216
-
For example an input block of small unsigned 32-bit little-endian numbers may use RLE for the first three streams as they are mostly zero, and a non-RLE Order-0 entropy encoder of the last stream.
1308
+
For example an input block of small unsigned 32-bit little-endian numbers may use RLE for the first three streams as they are mostly zero, and a non-RLE Order-0 entropy encoder for the last stream.
1217
1309
1218
-
In the general case we describe this as $X$-way interleaved streams.
1310
+
In the general case we describe this as $N$-way interleaved streams.
1219
1311
We can consider this interleaving process to be equivalent to a table
1220
-
transpose of $Y$ rows by $X$ columns to $X$ rows by $Y$ columns,
1221
-
followed by compressing each $X$ row independently.
1312
+
transpose of $M$ rows by $N$ columns to $N$ rows by $M$ columns,
1313
+
followed by compressing each $N$ row independently.
1222
1314
1223
1315
The byte stream consists of a 7-bit encoded uncompressed combined
1224
-
length, a byte holding the value of $X$, followed by $X$ compressed
1316
+
length, a byte holding the value of $N$, followed by $N$ compressed
1225
1317
lengths also 7-bit encoded. Finally the data sub-streams themselves,
1226
1318
each a valid $cdata$ stream, follow.
1227
1319
1228
1320
Normally our $cdata$ format will include the decoded size, but with
1229
1321
\textsc{Stripe} we can omit this from the internal compressed sub-streams
1230
-
as given the total length we know how to compute the sub-lengths.
1322
+
(using the \textsc{NoSize} flag), since the sub-lengths can be computed from the total length.
1231
1323
1232
-
Reproducing the original uncompressed data involves decoding the $X$
1324
+
Reproducing the original uncompressed data involves decoding the $N$
1233
1325
sub-streams and interleaving them together again (reversing the table
1234
1326
transpose). The uncompressed data length may not necessarily be an exact
1235
-
multiple of $X$, in which case the latter uncompressed sub-streams may
1327
+
multiple of $N$, in which case the latter uncompressed sub-streams may
1236
1328
be 1 byte shorter.
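The transpose and its reversal described above can be sketched as follows, ignoring the compression of each sub-stream. The function names are illustrative; \texttt{sub\_lengths} shows how the uncompressed sub-stream lengths follow from the total length alone.

```python
from typing import List


def stripe(data: bytes, n: int) -> List[bytes]:
    """Send every n-th byte to its own stream (table transpose)."""
    return [data[j::n] for j in range(n)]


def sub_lengths(total_len: int, n: int) -> List[int]:
    """Uncompressed length of each sub-stream, derived from the total length."""
    base, extra = divmod(total_len, n)
    # The first `extra` streams hold one more byte than the later ones.
    return [base + (1 if j < extra else 0) for j in range(n)]


def unstripe(streams: List[bytes], total_len: int) -> bytes:
    """Interleave the sub-streams again (reverse the transpose)."""
    n = len(streams)
    out = bytearray(total_len)
    for j, t in enumerate(streams):
        for i, value in enumerate(t):
            out[i * n + j] = value
    return bytes(out)


data = bytes(range(10))
streams = stripe(data, 4)
print([list(s) for s in streams])  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
print(sub_lengths(len(data), 4))   # [3, 3, 2, 2]
```

With 10 input bytes and $N = 4$, the last two sub-streams are one byte shorter than the first two, matching the case described in the text.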
1237
1329
1238
1330
As an example starting with input data $D$ we define the transposed data $T$ as:
The specifics of each sub-format are described below, in the order (minus meta-data specific shuffling) they are applied.
1503
-
1504
-
\begin{itemize}
1505
-
\item{\textbf{\textsc{Stripe}}:}
1506
-
rANS Nx16 with multi-way interleaving (see Section~\ref{sec:ransstripe}).
1507
-
1508
-
\item{\textbf{\textsc{NoSize}}:}
1509
-
Do not store the size of the uncompressed data stream.
1510
-
This information is not required when the data stream is one of the four sub-streams in the \textsc{Stripe} format.
1511
-
1512
-
\item{\textbf{\textsc{Cat}}:}
1513
-
If present, the order bit flag is ignored.
1514
-
1515
-
The uncompressed data stream is the same as the compressed stream.
1516
-
This is useful for very short data where the overheads of compressing are too high.
1517
-
1518
-
\item{\textbf{\textsc{N32}}:}
1519
-
Flag indicating whether to interleave 4 or 32 rANS states.
1520
-
1521
-
\item{\textbf{\textsc{Order}}:}
1522
-
Bit field defining order-0 (unset) or order-1 (set) entropy encoding, as described above by the \textsc{RansDecodeNx16\_0} and \textsc{RansDecodeNx16\_1} functions.
1523
-
1524
-
\item{\textbf{\textsc{RLE}}:}
1525
-
Bit field defining whether Run Length Encoding has been applied to the data. If set, the reverse transorm will be applied using \textsc{DecodeRLE} after Order-0 or Order-1 uncompression (see Section~\ref{sec:ransRLE}).
1526
-
1527
-
\item{\textbf{\textsc{Pack}}:}
1528
-
Bit field indicating the data was packed prior to compression (see Section~\ref{sec:ranspack}). If set, unpack the bits after any RLE decoding has been applied (if required) using the \textsc{DecodePack} function.
1529
-
1530
-
\end{itemize}
1531
-
1532
1593
\section{Range coding}
1533
1594
1534
1595
The range coder is a byte-wise arithmetic coder that operates by
1535
1596
repeatedly reducing a probability range (for example 0.0 to 1.0) one
1536
-
symbol (byte) at a time with the complete compressed data can be
1597
+
symbol (byte) at a time, with the complete compressed data being
1537
1598
represented by any value within the final range.
1538
1599
1539
1600
This is most easily demonstrated with a worked example, so let us imagine
\caption{A pictorial demonstration of range reduction.}
1645
+
\end{figure}
1587
1646
1588
1647
Decoding is simply the reverse of this. In the above picture we can see that 0.45 would read off `c', `a' and `t' by repeatedly comparing the symbol ranges to the current range and using those to identify the symbol and produce a new range.
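The decoding walk described above can be sketched with exact rational arithmetic. The symbol ranges in \texttt{RANGES} are an assumption made for this illustration (chosen so that 0.45 decodes to `c', `a', `t'), and the number of symbols to decode is passed in explicitly; a real range coder works with integer ranges and renormalisation instead.

```python
from fractions import Fraction

# Assumed fixed symbol ranges (low inclusive, high exclusive) for illustration.
RANGES = {
    "t": (Fraction(0), Fraction(1, 5)),
    "c": (Fraction(1, 5), Fraction(1, 2)),
    "g": (Fraction(1, 2), Fraction(4, 5)),
    "a": (Fraction(4, 5), Fraction(1)),
}


def range_decode(value: Fraction, nsym: int) -> str:
    """Decode nsym symbols by repeatedly narrowing the current range."""
    low, high = Fraction(0), Fraction(1)
    out = []
    for _ in range(nsym):
        width = high - low
        for sym, (s_lo, s_hi) in RANGES.items():
            # Map the symbol's range into the current range.
            new_low = low + width * s_lo
            new_high = low + width * s_hi
            if new_low <= value < new_high:
                out.append(sym)
                low, high = new_low, new_high
                break
    return "".join(out)


print(range_decode(Fraction(9, 20), 3))  # cat
```

Under these assumed ranges the current range narrows from $[0.2, 0.5)$ to $[0.44, 0.5)$ and finally to $[0.44, 0.452)$, which contains 0.45.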
The \textsc{RangeEncode} function is a straightforward reversal of the \textsc{RangeDecode}, with the exception of the special code for shifting the top byte out of the $low$ variable.
The probabilities passed to the range coder may be fixed for all scenarios (as we had in the ``cat'' example), or they may be adaptive and context aware.
1736
1795
For example the letter `u' occurs around 3\% of the time in English text, but if the previous letter was `q' it is close to 100\% and if the previous letter was `u' it is close to 0\%.
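A context-aware adaptive model of this kind can be sketched as a table of counts indexed by the previous symbol. The \texttt{Order1Model} class below is a hypothetical illustration, not the model defined by this specification: every count starts at 1 so that no symbol has zero probability, and counts are incremented after each symbol is seen.

```python
from collections import defaultdict
from typing import Dict


class Order1Model:
    """Hypothetical adaptive model: symbol counts keyed by previous symbol."""

    def __init__(self, alphabet: str) -> None:
        self.alphabet = list(alphabet)
        # One row of counts per context, created on first use.
        self.counts: Dict[str, Dict[str, int]] = defaultdict(
            lambda: {s: 1 for s in self.alphabet}
        )

    def probability(self, context: str, symbol: str) -> float:
        row = self.counts[context]
        return row[symbol] / sum(row.values())

    def update(self, context: str, symbol: str) -> None:
        self.counts[context][symbol] += 1


def train(model: Order1Model, text: str) -> None:
    """Adapt the model using each symbol's predecessor as its context."""
    for prev, cur in zip(text, text[1:]):
        model.update(prev, cur)


model = Order1Model("qua")
train(model, "ququququa")
print(model.probability("q", "u") > model.probability("u", "u"))  # True
```

After training on this short string the estimate for `u' following `q' is $5/7$, while for `u' following `u' it is $1/7$.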
@@ -1861,10 +1920,9 @@ \subsection{RLE with Order-0 and Order-1 Encoding}
1861
1920
(if $\ge4$) and 257 for any further continuation runs. Thus encoding
1862
1921
10 `A' characters would first store symbol `A' followed by run length
@@ -2086,7 +2143,7 @@ \subsection{RLE with Order-0 and Order-1 Encoding}
2086
2143
their own algorithm.
2087
2144
2088
2145
\begin{algorithmic}[1]
2089
-
\Function{RansDecodeStripe}{$len$}
2146
+
\Function{DecodeStripe}{$len$}
2090
2147
\State$N \gets$\Call{ReadUint8}{}
2091
2148
\For{$j \gets0$ to $N - 1$} \Comment{Fetch N compressed lengths}
2092
2149
\State$clen_j \gets$\Call{ReadUint7}{}
@@ -2101,7 +2158,7 @@ \subsection{RLE with Order-0 and Order-1 Encoding}
2101
2158
% \For{$i \gets 0$ to $len - 1$} \Comment{Interleave}
2102
2159
% \State $out_i \gets T_{(i \bmod N),\ (i \bdiv N)}$
2103
2160
% \EndFor
2104
-
\For{$j \gets0$ to $N - 1$} \Comment{Interleave}
2161
+
\For{$j \gets0$ to $N - 1$} \Comment{Stripe}
2105
2162
\For{$i \gets0$ to $ulen_j - 1$}
2106
2163
\State$out_{i \times N + j} \gets T_{j,i}$
2107
2164
\EndFor
@@ -2161,7 +2218,7 @@ \subsection{RLE with Order-0 and Order-1 Encoding}
2161
2218
\section{Name tokenisation codec}
2162
2219
2163
2220
Sequence names (identifiers) typically follow a structured pattern and compression based on columns within those structures usually leads to smaller sizes.
2164
-
The sequence name (identifier) tokenisation relies heavily on the General Purpose Entropy Encoder described above.
2221
+
The sequence name (identifier) tokenisation relies heavily on the rANS Nx16 and Adaptive arithmetic coders described above.
A few tricks are used to remove some byte streams. In addition to the explicit marking of duplicate byte streams, if a byte stream of token types is entirely MATCH apart from the very first value it is discarded. It is possible to regenerate this during decode by observing the other byte streams. For example if we have a byte stream $B_{5,DIGITS}$ but no $B_{5,TYPE}$ then we assume the contents of $B_{5,TYPE}$ consist of one DIGITS type followed by as many MATCH types as are needed.
2436
2493
2437
-
The $cdata$ stream itself is as described in the General Purpose Entropy Encoder section above, with the \textsc{ArithDecode} function.
2494
+
The $cdata$ stream itself is as described in the relevant entropy encoder section above (rANS or arithmetic coding).
2438
2495
2439
2496
\begin{algorithmic}[1]
2440
2497
\Statex
2441
-
\Statex\textit{(Decodes and uncompressed the serialised token byte streams)}
2498
+
\Statex\textit{(Decodes and uncompresses the serialised token byte streams)}
@@ -2762,7 +2824,7 @@ \subsection{FQZComp Data Stream}
2762
2824
The start of an FQZComp data stream consists of the parameters used by
2763
2825
the decoder. The data layout is as follows.
2764
2826
2765
-
\begin{table}
2827
+
\begin{table}[H]
2766
2828
\centering
2767
2829
\begin{tabular}{|r|r|r|r|r|p{8cm}|l|l|}
2768
2830
\hline
@@ -2773,7 +2835,7 @@ \subsection{FQZComp Data Stream}
2773
2835
\multicolumn{3}{|r|}{8} & uint8 & $gflags$ & \multicolumn{3}{p{8.8cm}|}{Global FQZcomp bit-flags. From lowest bit to highest:}\\
2774
2836
\multicolumn{3}{|r|}{} & & & \multicolumn{3}{p{8.8cm}|}{1: $multi\_param$: indicates more than one parameter block is present. Otherwise set $nparam = 1$} \\
2775
2837
\multicolumn{3}{|r|}{} & & & \multicolumn{3}{p{8.8cm}|}{2: $have\_stab$: indicates the parameter selector is mapped through $stab$. Otherwise set $stab_i = i$} \\
2776
-
\multicolumn{3}{|r|}{} & & & \multicolumn{3}{p{8.8cm}|}{4: $do\_rev$: $model\_revcomp$ will be used. (CRAM v3.1)} \\
2838
+
\multicolumn{3}{|r|}{} & & & \multicolumn{3}{p{8.8cm}|}{4: $do\_rev$: $model\_revcomp$ will be used (CRAM v3.1)} \\
2777
2839
\hline
2778
2840
2779
2841
\multicolumn{8}{|l|}{}\\[-0.7em]
@@ -2799,8 +2861,8 @@ \subsection{FQZComp Data Stream}
2799
2861
& \multicolumn{2}{r|}{8} & uint8 & $pflags$ & \multicolumn{2}{p{8.4cm}|}{Per-parameter block bit-flags. From lowest bit to highest:} & \\