
Memory ordering and atomic operations synchronization



Every time I need to play with atomic variables and I try to be clever and optimize them as much as possible, I need to re-learn what the various memory ordering options do. It doesn't help that memory ordering is easily one of the most complex topics I have ever worked with. For this reason I decided to finally write down what I (think I) know about memory ordering, making it easier for the future me to re-learn it the next time I need it. If you have read this far, maybe it will be useful for you as well.

In this post I'll talk about C++, but everything applies equally to C and Rust (and probably to other languages as well). I will also completely ignore the existence of release-consume ordering: Rust never supported it, its use has been discouraged in C++ since C++17, and compilers never actually made good use of it anyway. This is a pity, because the logic behind the release/consume model is actually very useful and it constitutes the basis for Linux's RCU synchronization mechanism, but we need to be patient and wait for ISO to revise it.

What is memory ordering anyway?

Whenever you perform an atomic operation (e.g. when you use std::atomic_int) your compiler will make sure that the instructions around it will behave according to the selected memory model's rules. Both the compiler (when it generates the instructions) and the CPU (at runtime) can, in general, change the order of execution of the memory accesses around an atomic operation. Of course we don't want these optimizations to change the expected behaviour of the program, so we want them to be applied only when it's safe to do so. But how can the computer know when it is safe to re-order some instructions around an atomic operation?

It doesn't know!

For this reason, by default, the model used in C++ is the sequentially-consistent one. This model is the strictest and the most expensive, but it's also the safest. If you are building a low-latency system or you are simply affected by premature optimization, you may want to speed up your program by using more relaxed models. They may be safe to use in your use case, while being much faster than the default.

The basic building blocks: memory barriers

The way to tell the compiler and the CPU what they cannot do with respect to re-ordering memory operations is to use a memory barrier. There are at least 7 different varieties of memory barriers, but we are only interested in 2 of them:

  • Acquire: reads and writes that appear after this operation cannot be moved before it. But anything that appears before the acquire can be moved past it. A mutex lock is an acquire operation.
  • Release: reads and writes that appear before this operation cannot be moved after it. But anything that appears after the release can be moved before it. A mutex unlock is a release operation.

Other than the ordering restrictions imposed by the above tools, you can also always count on the fact that operations on one specific atomic variable will never be reordered with respect to each other (of course!), but accesses to 2 different atomics may be reordered if you don't use the appropriate memory fencing. Also, thread-local variables can be moved around, ignoring these barriers, because they can't cause trouble in concurrent programs.
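
To make the two barriers more concrete, here is a minimal spinlock sketch (the spinlock class and its implementation are mine, purely for illustration, not a production-ready lock): lock() performs an acquire and unlock() performs a release, exactly like a mutex would.

#include <atomic>

class spinlock
{
  std::atomic<bool> locked{false};

public:
  void lock()
  {
    // Acquire: nothing inside the critical section can be
    // moved before the point where we obtain the lock.
    while (locked.exchange(true, std::memory_order_acquire))
      ;  // spin until we are the ones flipping false -> true
  }

  void unlock()
  {
    // Release: nothing inside the critical section can be
    // moved after the point where we drop the lock.
    locked.store(false, std::memory_order_release);
  }
};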

The main memory ordering modes

The main memory orderings we are going to analyze in this post are:

  • Sequential consistent ordering
  • Release/Acquire ordering
  • Relaxed ordering

Sequential consistent ordering

As we said, this is the default and the safest option. It basically behaves in the way you would expect the code to behave, if you didn't know that compilers and CPUs love to re-order things around. It also works well across threads.

In detail:

  • every read performs an acquire: every operation that appears after your read will actually be performed after it. In other words, an atomic read happens-before anything that appears after it in the code.
  • every write performs a release: every operation that appears before your write will actually be performed before it. In other words, anything that appears before your atomic write in the code, will happen-before the write itself.
  • every read-modify-write performs both an acquire and a release: the operation acts as a wall that prevents moving things across it from both sides.
  • a single total modification order is enforced across all threads for all atomic operations. In other words, all atomic operations will be observed in the same order by all threads.

The performance implications of the above rules are not just related to the inability to re-order instructions in the most effective way, but also to the cache coherency issues that, depending on the architecture, can force expensive cache operations on all cores to keep everything in sync.

e.g.

This example shows the acquire and release operations at work:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x = {false};
int y = 0;

void thread_1()
{
  y = 1;
  x.store(true, std::memory_order_seq_cst);
  // y = 1 always happens-before the store
}

void thread_2()
{
  // the load always happens-before the y == 1 check
  if (x.load(std::memory_order_seq_cst))
    assert(y == 1);  // This _never_ fails and there is no race
}

int main()
{
  std::thread a(thread_1);
  std::thread b(thread_2);
  a.join();
  b.join();
  return 0;
}

This example shows how multiple atomics work consistently across all threads:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> x = 0;
std::atomic<int> y = 0;

void thread_1()
{
  // The following stores will always happen
  // strictly in this order from the point
  // of view of every thread in the program
  y.store(2, std::memory_order_seq_cst);
  x.store(1, std::memory_order_seq_cst);
}

void thread_2()
{
  // If x were a regular variable this loop would be UB
  // and it would likely be optimized away.
  // But this is not the case.
  while (x.load(std::memory_order_seq_cst) != 1)
    ;
  assert(y.load(std::memory_order_seq_cst) == 2);  // This _never_ fails
  y.store(1, std::memory_order_seq_cst);
}

void thread_3()
{
  if (y.load(std::memory_order_seq_cst) == 1)
    assert(x.load(std::memory_order_seq_cst) == 1);  // This _never_ fails
}

int main()
{
  std::thread a(thread_1);
  std::thread b(thread_2);
  std::thread c(thread_3);
  a.join();
  b.join();
  c.join();
  return 0;
}

Please note that I'm using std::memory_order_seq_cst explicitly for clarity. Being the default mode, you don't really need to specify it.

Release/Acquire ordering

This is a slightly more relaxed mode compared to the previous one. The only actual difference is that there is no single total modification order. You still need to perform an acquire for reads and a release for writes. The interesting thing about this mode is that it can be almost completely free: the acquire/release semantics are implicit on some architectures (e.g. x86), so the compiler doesn't have to emit any memory barrier instruction. Re-ordering of non-atomic (or relaxed atomic) operations still follows the same rules as under sequentially consistent ordering; it's only operations on different atomics that can be observed as happening in different orders by different threads.
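
Before looking at the difference, note that the very first seq_cst example still works if we downgrade it to release/acquire. The sketch below (a minimal example, with names of my choosing) is the classic message-passing idiom: the producer publishes some data with a release store, and a consumer that observes the flag with an acquire load is guaranteed to also see the data.

#include <atomic>
#include <cassert>
#include <thread>

int data = 0;
std::atomic<bool> ready = {false};

void producer()
{
  data = 42;  // plain, non-atomic write
  // publish: the write to data cannot be moved after this store
  ready.store(true, std::memory_order_release);
}

void consumer()
{
  // wait until the producer has published
  while (!ready.load(std::memory_order_acquire))
    ;
  assert(data == 42);  // This _never_ fails: the acquire synchronizes with the release
}

int main()
{
  std::thread a(producer);
  std::thread b(consumer);
  a.join();
  b.join();
  return 0;
}

What release/acquire does not give you is agreement between threads on the order of operations on different atomics, as the next example shows.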

e.g.

This example shows the lack of total ordering:

#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> x = {false};
std::atomic<bool> y = {false};

void thread_1()
{
  x.store(true, std::memory_order_release);
}

void thread_2()
{
  y.store(true, std::memory_order_release);
}

void thread_3()
{
  // this assert _can_ succeed
  assert(x.load(std::memory_order_acquire) &&
         !y.load(std::memory_order_acquire));
}

void thread_4()
{
  // this assert _can_ also succeed in the same run
  // because different threads can see the effects
  // of the stores in different orders
  assert(!x.load(std::memory_order_acquire) &&
         y.load(std::memory_order_acquire));
}

int main()
{
  std::thread a(thread_1);
  std::thread b(thread_2);
  std::thread c(thread_3);
  std::thread d(thread_4);
  a.join();
  b.join();
  c.join();
  d.join();
  return 0;
}

Relaxed ordering

This is the simplest kind of atomic operation: no synchronization happens. Memory accesses around it can be re-ordered in any way. You can still always count on the fact that operations on one specific atomic variable will never be reordered with respect to each other. This is useful when what you want is just an atomic variable and you don't want to use it as a synchronization mechanism. Relaxed operations are not 100% free though: they may still cause some cache-coherency traffic.
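
A typical legitimate use is a plain statistics counter that is only read after all the writers have finished. Here is a minimal sketch (the counter and its name are mine, just for illustration):

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

std::atomic<long> hits = {0};

void worker()
{
  for (int i = 0; i < 100000; ++i)
    // Atomicity is all we need: no increment can be lost or torn,
    // but no ordering with the surrounding code is requested.
    hits.fetch_add(1, std::memory_order_relaxed);
}

int main()
{
  std::vector<std::thread> threads;
  for (int i = 0; i < 4; ++i)
    threads.emplace_back(worker);
  for (auto &t : threads)
    t.join();
  // Joining the threads synchronizes with them, so this prints 400000.
  std::printf("%ld\n", hits.load(std::memory_order_relaxed));
  return 0;
}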

e.g.

This example shows how everything can be re-ordered around relaxed ops:

#include <atomic>
#include <cassert>
#include <thread>

int x = 0;
std::atomic<bool> y = {false};

void thread_1()
{
  x = 1;
  y.store(true, std::memory_order_relaxed);
}

void thread_2()
{
  if (y.load(std::memory_order_relaxed))
    assert(x == 1); // this _can_ fail (and the unsynchronized access to x is a data race)
}

int main()
{
  std::thread a(thread_1);
  std::thread b(thread_2);
  a.join();
  b.join();
  return 0;
}

What about volatile?

Accesses to volatile glvalues (at some point I should write a post about C++ value categories) won't be re-ordered past each other (or past any other observable side effect, e.g. I/O). But that's it: anything else can be re-ordered around them, different threads can see changes to different volatiles happening in different orders, and they are not atomic. There are only 2 legitimate uses for volatile:

  • volatile static std::sig_atomic_t to set a flag from a signal handler (a minimal sketch follows below)
  • to access some memory-mapped I/O register, if you are doing some very low-level/embedded stuff.

If you are trying to use them for anything else, you have a subtle bug lurking in the shadows, waiting for the right moment to attack you.
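
For completeness, here is what the first legitimate use looks like in practice (a minimal sketch; the flag and handler names are mine):

#include <csignal>

// A volatile std::sig_atomic_t is the only non-atomic object that a
// signal handler can safely write to.
volatile std::sig_atomic_t stop_requested = 0;

void handle_sigint(int)
{
  stop_requested = 1;
}

int main()
{
  std::signal(SIGINT, handle_sigint);
  while (!stop_requested) {
    // do the actual work here
  }
  return 0;
}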

Adults only: intermingling modes

If you didn't have your coffee yet, maybe this is the moment to go and get a hot one: things are about to get hairy.

Mixing different synchronization primitives is very complex. The risks of getting everything wrong might outweigh the performance benefits. But if you are looking for new reasons to bang your head against the screen, I'll make you happy in a moment.

A quite common use case for intermixing different modes is reference counting. When you increment the counter, you don't need to know what value the counter had before or what value it has after the increment, you just want the value to be incremented. You need the operation to happen atomically, but ordering and synchronization don't matter. When it's time to decrement it, things are a bit different. Now you want to know what the resulting value is, because you need to free some resource when the counter reaches 0. This is a typical case where certain operations on an atomic are safe to perform in relaxed mode (i.e. the increment) while some other operation needs to use acquire/release mode (i.e. the decrement).

e.g.

This example shows a naive (and slightly buggy) reference counting system:

#include <atomic>
#include <cassert>
#include <thread>

char * ptr = nullptr;
std::atomic_size_t counter{0};

void increment()
{
  // We just care about atomicity here, nothing else
  counter.fetch_add(1, std::memory_order_relaxed);
}

void decrement_and_delete()
{
  // We need to use _release_ here to make sure that our previous
  // accesses to ptr won't be moved past this point: the thread that
  // ends up freeing ptr must be able to see them as completed.
  if (counter.fetch_sub(1, std::memory_order_release) == 1) {
    // If we reached this it means that counter is now 0
    // So we need to free ptr, but we need to make sure
    // this won't be moved before the decrement.
    // We could have obtained the same guarantee using
    // std::memory_order_acq_rel in fetch_sub.
    // But we only need to use _acquire_ when counter reaches 0
    // so it is more efficient to do it separately and only in
    // that case.
    // Also, because we don't have any other atomic operation
    // to perform, we can use std::atomic_thread_fence that
    // simply applies the memory fencing on its own.
    std::atomic_thread_fence(std::memory_order_acquire);
    delete ptr;
  }
}

void thread_1()
{
  increment();
  // do something with ptr
  decrement_and_delete();
}

void thread_2()
{
  increment();
  // do something with ptr
  decrement_and_delete();
}

int main()
{
  ptr = new char(10);
  increment();
  std::thread a(thread_1);
  std::thread b(thread_2);
  decrement_and_delete();
  a.join();
  b.join();
  return 0;
}

Kudos to you if you can spot the bug.

Conclusion

We made it! We reached the end and, hopefully, we also learned something useful! Feel free to let me know what you think about this post and please let me know if you find any errors!