
I’ve been looking into algorithms for compressing floating-point data in columnar databases. Here’s a family tree for a few of them.
Gorilla: The original algorithm, based on XORing each value with the previous one. Introduced by Facebook in the 2015 paper on their Gorilla time-series database.
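To make the XOR idea concrete, here is a minimal Python sketch (the helper names and sample values are mine, not from the paper): reinterpret each double as a 64-bit integer, XOR it with the previous one, and look at the leading and trailing zero bits. Gorilla's bit-level format then spends a few control bits on the zero runs and stores only the "meaningful" middle bits.

```python
import struct

def float_to_bits(x: float) -> int:
    # Reinterpret the IEEE 754 double as an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def leading_zeros(x: int) -> int:
    return 64 - x.bit_length()

def trailing_zeros(x: int) -> int:
    return (x & -x).bit_length() - 1 if x else 64

prev = None
for value in [15.5, 14.0625, 3.25, 8.625, 13.1]:
    bits = float_to_bits(value)
    if prev is None:
        print(f"first value stored verbatim: {bits:064b}")
    else:
        xor = bits ^ prev
        # Similar consecutive values share the sign, exponent and top mantissa
        # bits, so the XOR has long zero runs that Gorilla encodes cheaply.
        print(f"xor={xor:064b} lz={leading_zeros(xor)} tz={trailing_zeros(xor)}")
    prev = bits
```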
Chimp: Improves upon Gorilla by using a more complicated but more efficient encoding for the XORed values. Paper from 2022.
Chimp128: Instead of XORing the current value with the previous value, it looks at the previous 128 values and XORs the current value with the one that produces the most trailing zeros. Published along with Chimp.
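A hedged sketch of that reference-selection step (my own helper names; as I understand it, the real implementation avoids comparing against all 128 candidates by indexing them on their low bits, but a plain scan shows the idea):

```python
import struct
from collections import deque

def float_to_bits(x: float) -> int:
    return struct.unpack("<Q", struct.pack("<d", x))[0]

def trailing_zeros(x: int) -> int:
    return (x & -x).bit_length() - 1 if x else 64

def pick_reference(history: deque, bits: int) -> tuple[int, int]:
    """Return (index, xor) for the stored value whose XOR with `bits` has the
    most trailing zeros; the index goes into the output so the decoder can
    pick the same reference."""
    best_idx, best_xor, best_tz = 0, bits ^ history[0], -1
    for idx, prev in enumerate(history):
        xor = bits ^ prev
        tz = trailing_zeros(xor)
        if tz > best_tz:
            best_idx, best_xor, best_tz = idx, xor, tz
    return best_idx, best_xor

history = deque(maxlen=128)  # raw bits of the 128 most recent values
for value in [15.5, 14.0625, 15.5, 8.625, 14.0625]:
    bits = float_to_bits(value)
    if history:
        idx, xor = pick_reference(history, bits)
        print(f"reference #{idx}, xor has {trailing_zeros(xor)} trailing zeros")
    history.append(bits)
```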
Patas: The algorithms above produce a stream of bits, which makes I/O inefficient. Patas is a simplification of Chimp128 that produces a stream of bytes instead. The compression ratio suffers, but it is much faster. Developed for DuckDB in 2022.
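The byte-alignment point can be illustrated with a toy packer (this is not the actual Patas layout, just the general idea of emitting whole bytes instead of arbitrary-length bit runs):

```python
def pack_xor(xor: int) -> bytes:
    # Emit the XOR result as whole bytes: a one-byte length followed by the
    # 8-byte big-endian representation with its leading zero bytes dropped.
    # Everything stays byte-aligned, so decoding needs no bit-level shifting.
    payload = xor.to_bytes(8, "big").lstrip(b"\x00")
    return bytes([len(payload)]) + payload

print(pack_xor(0x0000_0000_00AB_CD00).hex())  # '03abcd00'
```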
Elf: A careful mathematical analysis yields an even more efficient encoding of the XORed values. The compression ratio is a bit better than with Chimp/Chimp128, but the performance is worse. Paper from 2023.
Camel: A new XOR-based algorithm. I haven’t studied this one enough to summarize it, but it is included here for the sake of completeness. Paper from 2024, but seemingly not open access.
Pseudodecimal Encoding (PDE): Unlike the previous algorithms, this one is not XOR-based. The key insight is that a lot of float data originates as fixed-point decimal data, and it is more efficient to store it as a tuple of significant digits and a decimal exponent. Published as part of the BtrBlocks columnar data format in 2023.
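A minimal sketch of the pseudodecimal idea (my own toy encoder, not BtrBlocks code): try to represent a double as digits * 10**-exponent with a small exponent, and fall back to storing the raw bits when no exact short decimal form exists.

```python
def to_pseudodecimal(value: float, max_exponent: int = 15):
    for exponent in range(max_exponent + 1):
        digits = round(value * 10 ** exponent)
        # Only accept the decimal form if it reproduces the original bits exactly.
        if digits / 10 ** exponent == value:
            return digits, exponent
    return None  # no exact decimal form found: store the value verbatim

print(to_pseudodecimal(1.23))      # (123, 2)
print(to_pseudodecimal(0.1))       # (1, 1)
print(to_pseudodecimal(2 ** 0.5))  # None -> stored verbatim
```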
Adaptive Lossless floating-point Compression (ALP): ALP starts from the same idea as PDE but is designed for vectorized execution, yielding much better performance. It also has an alternative scheme for storing values that do not compress well with the decimal encoding, hence "adaptive". Paper from 2023.
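And a hedged numpy sketch of the vectorized flavour, heavily simplified: it uses a single decimal exponent per vector and an exhaustive search, whereas the real ALP picks its exponents from a small sample, uses a second "factor" exponent, bit-packs the resulting integers, and switches to the alternative scheme mentioned above when a vector is not decimal-like.

```python
import numpy as np

def alp_like_encode(vec: np.ndarray, max_exponent: int = 15):
    best = None
    for e in range(max_exponent + 1):
        digits = np.round(vec * 10.0 ** e)
        # A value is an "exception" if it does not decode back bit-exactly;
        # exceptions are stored verbatim and patched in after decoding.
        exact = (digits / 10.0 ** e) == vec
        n_exceptions = int(np.count_nonzero(~exact))
        if best is None or n_exceptions < best[0]:
            best = (n_exceptions, e, digits.astype(np.int64), np.flatnonzero(~exact))
    n_exceptions, e, digits, exc_pos = best
    return e, digits, exc_pos, vec[exc_pos]

def alp_like_decode(e, digits, exc_pos, exc_values):
    out = digits.astype(np.float64) / 10.0 ** e
    out[exc_pos] = exc_values  # patch the values that did not round-trip
    return out

vec = np.array([1.23, 4.56, 0.1, 2 ** 0.5])
encoded = alp_like_encode(vec)
assert np.array_equal(alp_like_decode(*encoded), vec)
```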