ClickHouse: don't roll your own crypto

Published on 03/03/2024 in Technical, clickhouse, security, cryptography, rng, vulnerability, cpp

ClickHouse is a column-oriented database management system that is designed for high-performance analytics. It is known for its speed and efficiency in processing large volumes of data, making it a popular choice for companies looking to analyze massive datasets in real-time.

(the above text was written by some AI, because I couldn't be bothered to think of anything, but I promise I'll write the rest of the post myself, maybe)

A while ago, while working as a CH contributor, I stumbled upon an issue in the way the disk encryption feature was implemented.

Some context

When you use CH's disk encryption, every file that is written to disk is encrypted using AES in CTR mode. The same key is used for every file, and a different nonce is generated for each file and stored in the file header.

CTR mode

When you use a cipher in CTR mode, there is one thing you need to be very careful about: never re-use the same pair of key and nonce+counter. Wikipedia explains very well how CTR works, so I won't repeat it here. But, summarizing it, we can say that it works by generating a keystream that is XORed with the plain-text to produce the cipher-text. A keystream is simply an "infinite" sequence of pseudorandom bytes that is generated from the key and nonce+counter pair. So it will always be the same if said pair is the same. Re-using this infamous pair with two different plain-texts is dangerous because of an interesting property of the XOR operation.

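If you have C1 = P1 ^ K, where K is the keystream, it's impossible to recover P1 from C1 unless you know K; this is essentially a one-time pad. But if you also have C2 = P2 ^ K then, because K ^ K = 0, the keystream cancels out: C1 ^ C2 = P1 ^ P2. That alone leaks a lot about both plain-texts, and as soon as you know or can guess one of them (file formats tend to have predictable headers, for example) you get the keystream from K = C1 ^ P1 and can decrypt C2 as well.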
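To make the XOR argument concrete, here is a minimal sketch. It does not use AES at all: the keystream is just an arbitrary byte string, and the helper names xorBytes and recoverOtherPlaintext are made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// XOR two equal-length byte strings. This models both "encrypting" with a
// keystream and combining two cipher-texts.
std::string xorBytes(const std::string & a, const std::string & b)
{
    assert(a.size() == b.size());
    std::string out(a.size(), '\0');
    for (std::size_t i = 0; i < a.size(); ++i)
        out[i] = static_cast<char>(a[i] ^ b[i]);
    return out;
}

// Given two cipher-texts produced with the same keystream and one known
// plain-text, recover the other plain-text without ever seeing the key.
std::string recoverOtherPlaintext(
    const std::string & c1, const std::string & c2, const std::string & known_p1)
{
    // c1 ^ c2 == p1 ^ p2, because the shared keystream cancels out.
    return xorBytes(xorBytes(c1, c2), known_p1);
}
```

Encrypting "attack!!" and "retreat!" with the same 8-byte keystream and passing both cipher-texts plus the first plain-text to recoverOtherPlaintext gives back "retreat!".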

Long story short: never use the same key and nonce pair twice, or it will be very easy to decrypt your data.

The problem

We said before that CH uses the same key for every file (which is fine) and generates a new nonce for each file, which would be okay if the nonce were actually generated randomly. The code looked like this:

InitVector InitVector::random()
{
    std::random_device rd;
    std::mt19937 gen{rd()};
    std::uniform_int_distribution<UInt128::base_type> dis;
    UInt128 counter;
    for (auto & i : counter.items)
        i = dis(gen);
    return InitVector{counter};
}

There are some issues with this code:

The source of entropy is std::random_device, which the C++ standard doesn't require to return actually random numbers. In most cases it will, and it may even return cryptographically secure random numbers, but no portable solution should rely on it.
std::mt19937 is seeded with a single 32-bit integer, so it can only start from 2^32 different states, no matter how many bits of output you draw from it. Also, it isn't a CSPRNG.
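The second point can be seen with a small sketch that mirrors the code above but takes the seed as a parameter. The Nonce type and the nonceFromSeed name are hypothetical, and I'm assuming the 128-bit value is filled as two 64-bit words.

```cpp
#include <array>
#include <cstdint>
#include <random>

// Hypothetical stand-in for the 128-bit init vector: two 64-bit words.
using Nonce = std::array<std::uint64_t, 2>;

// Same construction as the original code, but with the 32-bit seed passed in
// instead of being drawn from std::random_device.
Nonce nonceFromSeed(std::uint32_t seed)
{
    std::mt19937 gen{seed};
    std::uniform_int_distribution<std::uint64_t> dis;
    Nonce nonce{};
    for (auto & word : nonce)
        word = dis(gen);
    return nonce;
}
```

The whole 128-bit result is a pure function of the 32-bit seed, so two files whose seeds happen to match get exactly the same nonce.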

The latter point is the most worrying. With only 2^32 different initial states there are at most 2^32 different nonces, so we can very easily end up re-using one. In fact, because of the birthday paradox, it only takes about 77,000 nonce generations to have a 50% probability of re-using a nonce, and with 65536 the probability is already about 39%. And that is not many files for CH.
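For the curious, the birthday numbers can be checked with the usual approximation p ≈ 1 - exp(-n^2 / (2N)), where N is the number of possible values and n the number of draws. This is only a sketch, and the function names are made up for this post.

```cpp
#include <cassert>
#include <cmath>

// Approximate probability of at least one collision after drawing n values
// uniformly at random from `space` possible values.
double collisionProbability(double n, double space)
{
    assert(space > 0.0);
    return 1.0 - std::exp(-(n * n) / (2.0 * space));
}

// Approximate number of draws needed to reach collision probability p,
// obtained by inverting the formula above.
double drawsForProbability(double p, double space)
{
    assert(p >= 0.0 && p < 1.0);
    return std::sqrt(2.0 * space * std::log(1.0 / (1.0 - p)));
}
```

With space = 2^32 = 4294967296, collisionProbability(65536, space) comes out at roughly 0.39, and drawsForProbability(0.5, space) at roughly 77,000.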

Conclusion

I reported the issue upstream and provided a fix that was merged and backported to various versions. This happened a long time ago, so I don't expect any vulnerable CH instance to still be around, which is why I thought it was a good idea to write about it now.
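I'm not reproducing the upstream fix here, so don't read the following as what CH does. It is only a minimal sketch of the general idea, which is to take all 128 bits of the nonce straight from the operating system's CSPRNG. It assumes a Unix-like system where /dev/urandom exists; Nonce and randomNonce are hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <fstream>
#include <stdexcept>

// Hypothetical 128-bit nonce as raw bytes.
using Nonce = std::array<std::uint8_t, 16>;

// Fill the nonce from the kernel CSPRNG (assumption: /dev/urandom is available).
Nonce randomNonce()
{
    Nonce nonce{};
    std::ifstream urandom("/dev/urandom", std::ios::in | std::ios::binary);
    if (!urandom)
        throw std::runtime_error("cannot open /dev/urandom");

    const auto wanted = static_cast<std::streamsize>(nonce.size());
    urandom.read(reinterpret_cast<char *>(nonce.data()), wanted);
    if (urandom.gcount() != wanted)
        throw std::runtime_error("short read from /dev/urandom");

    return nonce;
}
```

In real code you would normally call the random-bytes function of the crypto library you already depend on, rather than open the device yourself.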

But please note that simply upgrading CH is not enough to fix the problem, because any file that was encrypted with the buggy code is still in danger. So you also need to re-encrypt everything!

Never roll your own encryption! As you saw in this post, even the best developers can make mistakes with these things!

Kudos to the upstream CH team for being very responsive and helpful. They even offered me a bounty (which I had to decline).