N4455 No Sane Compiler Would Optimize Atomics

Author: JF Bastien
Contact: jfb@google.com
Date: 2015-04-10
URL: https://github.com/jfbastien/papers/blob/master/source/N4455.rst

Abstract

False.

Compilers do optimize atomics, memory accesses around atomics, and utilize architecture-specific knowledge. This paper illustrates a few such optimizations, and discusses their implications.

Sample Optimizations

We list optimizations that are either implemented in LLVM, or will be readily implemented. A general rule to keep in mind is that the compiler performs many of its optimizations on and around atomics based on the as-if rule. This implies that the compiler can make operations more atomic as long as it doesn’t violate forward progress requirements, and can make them less atomic as long as it doesn’t add non-benign races which weren’t already present in the original program. Put another way, correct programs must work under all executions an implementation is allowed to create.

Optimizations on Atomics

Atomics themselves can be optimized. A non-contentious example is merging adjacent constant increments of an atomic, after inlining, when there are no other intervening atomics:

void inc(std::atomic<int> *y) {
  *y += 1;
}

std::atomic<int> x;
void two() {
  inc(&x);
  inc(&x);
}

Becomes:

std::atomic<int> x;
void two() {
  x += 2;
}

The above optimization adds atomicity but cannot hinder forward progress, and is therefore correct. This leads to further optimizations such as using the locked inc/dec instructions instead of locked add/sub when adding or subtracting 1 to or from an atomic on x86:

std::atomic<int> x;
void inc(int val) {
  x += 1;
  x += val;
}

Becomes:

_Z3inci:
  lock incl x(%rip)
  lock addl %edi, x(%rip)
  retq

In a similar vein, some opportunities for strength reduction show up after non-trivial code gets inlined, exposing fairly silly code such as the following trivial example:

template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  x->compare_exchange_strong(expected, desired); // Inlined.
  return expected == desired;
}

Becomes:

template<typename T>
bool silly(std::atomic<T> *x, T expected, T desired) {
  return x->compare_exchange_strong(expected, desired);
}

The following works for any memory order but release and acq_rel:

template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  T expected = desired;
  return x->compare_exchange_strong(expected, desired,
    std::memory_order_seq_cst, std::memory_order_relaxed);
}

Becomes:

template<typename T>
bool optme(std::atomic<T> *x, T desired) {
  return x->load(std::memory_order_seq_cst) == desired;
}

The above optimization may require that the compiler mark the transformed load as a release sequence as defined in section 1.10 of the C++ standard.

Similarly, provided the resulting memory order is at least as strong as each of the individual ones, the following can occur:

template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  T z = x->load();
  x->store(y);
  return z;
}

Becomes:

template<typename T>
T optmetoo(std::atomic<T> *x, T y) {
  return x->exchange(y);
}

This may not always pay off! In particular, architectures with weaker memory models may benefit from having write-after-read operations to the same location instead of having an atomic exchange.

Other simple optimizations can also occur because of inlining and constant propagation, such as turning atomic<T>::fetch_and(~(T)0) into atomic<T>::load(). The same applies to fetch_or(0) and fetch_xor(0), and fetch_and(0) can become store(0) when its result is unused.
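
For example, a minimal sketch of the identity fetch_and case (the function name here is illustrative):

std::atomic<unsigned> x;
unsigned all_ones_rmw() {
  return x.fetch_and(~0u); // And-ing with all ones leaves the value unchanged.
}

Becomes:

std::atomic<unsigned> x;
unsigned all_ones_rmw() {
  return x.load(); // Same result, without a read-modify-write.
}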

As a slightly different example, the value of atomic<T>::is_lock_free can be determined at compile time for some architectures, but for others the compiler can’t know the value for all sub-architectures and cannot return a compile-time constant. The compiler may be given a specific sub-architecture flag to work around this (restricting which machines the code will execute correctly on), or must defer to feature detection followed by patching when the program is loaded. This is the case, for example, for x86’s LOCK CMPXCHG16B instruction, which is used to implement lock-free 16-byte operations.
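
As an illustration (not from the paper), whether a 16-byte type is lock-free on x86-64 hinges on CMPXCHG16B, which the compiler can only assume when told about the sub-architecture:

struct Pair { void *lo, *hi; }; // 16 bytes on x86-64.
std::atomic<Pair> p;

bool pair_is_lock_free() {
  // With a flag such as -mcx16 this may fold to a compile-time constant;
  // otherwise the answer is only known after runtime feature detection.
  return p.is_lock_free();
}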

These optimizations aren’t traditionally performed when using inline assembly, which showcases the strength of hoisting abstractions to the language level.

A seqlock’s reader bounds its ticket acquisition and release with a load and a fence. This lets the data reads, done with relaxed memory ordering, be reordered between the ticket acquire/release. The algorithm retries if the ticket changed or the writer was modifying the data:

std::tuple<T, T> reader() {
  T d1, d2;
  unsigned seq0, seq1;
  do {
    seq0 = seq.load(std::memory_order_acquire);
    d1 = data1.load(std::memory_order_relaxed);
    d2 = data2.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_acquire);
    seq1 = seq.load(std::memory_order_relaxed);
  } while (seq0 != seq1 || seq0 & 1);
  return {d1, d2};
}

void writer(T d1, T d2) {
  unsigned seq0 = seq.load(std::memory_order_relaxed);
  seq.store(seq0 + 1, std::memory_order_relaxed);
  data1.store(d1, std::memory_order_release);
  data2.store(d2, std::memory_order_release);
  seq.store(seq0 + 2, std::memory_order_release);
}

The reader’s last ticket load effectively acts as a release load, which doesn’t exist in the current memory model but would better express the intent of the code while allowing subsequent operations to be moved into the critical section if profitable. Hans Boehm suggests using a release fetch-add of zero, and shows that on x86 the code can be written as follows:

T d1, d2;
unsigned seq0, seq1;
do {
  seq0 = seq.load(std::memory_order_acquire);
  d1 = data1.load(std::memory_order_relaxed);
  d2 = data2.load(std::memory_order_relaxed);
  seq1 = seq.fetch_add(0, std::memory_order_release);
} while (seq0 != seq1 || seq0 & 1);

This rewritten code then generates the following x86 assembly:

.LBB0_1:
      movl    seq(%rip), %esi
      movl    data1(%rip), %ecx
      movl    data2(%rip), %eax
      mfence
      movl    seq(%rip), %edi
      movl    %esi, %edx
      andl    $1, %edx
      cmpl    %edi, %esi
      jne     .LBB0_1
      testl   %edx, %edx
      jne     .LBB0_1

This x86 assembly reduces contention by replacing fetch_add, an instruction requiring exclusive access to the cache line, with a simple movl. This optimization is only known to be correct on x86, is probably correct for other architectures, and is currently implemented in LLVM.

Similar to the above release fetch-add of zero serving as a release load, one could also use an acquire exchange when an acquire store is desired.
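
A minimal sketch of that idea, with an illustrative name (the standard offers no acquire store directly):

std::atomic<unsigned> seq;

void store_acquire(unsigned v) {
  seq.exchange(v, std::memory_order_acquire); // Result discarded: acts as an acquire store.
}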

Traditional compiler optimizations, such as dead store elimination, can be performed on atomic operations, even sequentially consistent ones. Optimizers have to be careful not to do so across synchronization points, because another thread of execution can observe or modify memory: traditional optimizations must therefore consider more intervening instructions than they usually would when applied to atomic operations. In the case of dead store elimination, it isn’t sufficient to prove that an atomic store post-dominates and aliases another in order to eliminate the earlier one.
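
A minimal sketch, not from the paper, of why post-dominance alone isn’t enough: the first store to x below is overwritten on every path, yet it cannot be removed because the intervening release store to flag may synchronize with another thread which is then guaranteed to observe x == 1.

std::atomic<int> x;
std::atomic<bool> flag;

void writer() {
  x.store(1, std::memory_order_relaxed);
  flag.store(true, std::memory_order_release);     // May synchronize with observer().
  while (flag.load(std::memory_order_acquire)) { } // Wait for the observer to finish.
  x.store(2, std::memory_order_relaxed);           // Post-dominates and aliases the first store.
}

int observer() {
  while (!flag.load(std::memory_order_acquire)) { }
  // This load must observe 1: the first store to x happens-before it (via the
  // release/acquire on flag), and it happens-before the second store to x.
  int seen = x.load(std::memory_order_relaxed);
  flag.store(false, std::memory_order_release); // Let the writer proceed.
  return seen;
}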

A trickier example is fusion of relaxed atomic operations, even when interleaved:

std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
}

Becomes:

std::atomic<int> x, y;
void relaxed() {
  x.fetch_add(2, std::memory_order_relaxed);
  y.fetch_add(2, std::memory_order_relaxed);
}

We aren’t aware of compilers performing this optimization yet, but it is being discussed. std::atomic_signal_fence could be used to prevent this reordering and fusion, or one could use a stronger memory ordering for the operations: this optimization is only valid on relaxed operations which aren’t ordered with respect to each other.
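
For instance, a sketch of keeping the four operations distinct with a compiler-only barrier:

std::atomic<int> x, y;
void keep_separate() {
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
  // Compiler barrier: emits no instruction, but prevents fusing across it.
  std::atomic_signal_fence(std::memory_order_seq_cst);
  x.fetch_add(1, std::memory_order_relaxed);
  y.fetch_add(1, std::memory_order_relaxed);
}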

A compiler can tag all functions according to whether or not they contain atomic instructions, and optimize around call sites accordingly. This could even be done for all virtual overrides when we can enumerate them, and can be used to carve out different interference-free regions.

Fence instructions are generated as a consequence of C++’s std::atomic_thread_fence as well as, on some architectures, atomic operations. Fence instructions tend to be expensive, and removing redundant ones as well as positioning them optimally leads to great performance gains, while keeping the code correct and simple. This is currently under review in LLVM.
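
As a sketch of what such a pass looks for (this example is illustrative, not taken from the LLVM patch), two adjacent acquire fences where one suffices:

std::atomic<int> x, y;
int two_fences() {
  int a = x.load(std::memory_order_relaxed);
  std::atomic_thread_fence(std::memory_order_acquire);
  std::atomic_thread_fence(std::memory_order_acquire); // Redundant: merging the two
                                                       // fences preserves all orderings.
  int b = y.load(std::memory_order_relaxed);
  return a + b;
}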

Not all compiler optimizations are valid on atomics; this topic is still under active research.

Optimizations Around Atomics

Compilers can optimize non-atomic memory accesses before and after atomic accesses. A somewhat surprising example is that the following code can be (and is!) transformed as shown, where x is a non-atomic global.

int x = 0;
std::atomic<int> y;
int dso() {
  x = 0;
  int z = y.load(std::memory_order_seq_cst);
  y.store(0, std::memory_order_seq_cst);
  x = 1;
  return z;
}

Becomes:

int x = 0;
std::atomic<int> y;
int dso() {
  // Dead store eliminated.
  int z = y.load(std::memory_order_seq_cst);
  y.store(0, std::memory_order_seq_cst);
  x = 1;
  return z;
}

The intuition behind eliminating the dead store to x is that another thread could only have observed the stored value without a race if a release operation in dso, performed after that store, synchronized with the observing thread, and that thread in turn synchronized with an acquire operation in dso before x is overwritten (see this paper for details). Sequentially consistent accesses are acquire/release accesses, but in dso the acquire load of y comes before the release store to y, so this pattern cannot arise. Contrast this with the last example below, where the release store does come before an acquire load in a loop, and the loop synchronizes with other threads by observing changes in y.

The following code, with a different store/load ordering and using release/acquire memory ordering, can also be transformed as shown (but currently isn’t, at least in LLVM).

int x = 0;
std::atomic<int> y;
int rlo() {
  x = 0;
  y.store(0, std::memory_order_release);
  int z = y.load(std::memory_order_acquire);
  x = 1;
  return z;
}

Becomes:

int x = 0;
std::atomic<int> y;
int rlo() {
  // Dead store eliminated.
  y.store(0, std::memory_order_release);
  // Redundant load eliminated.
  x = 1;
  return 0; // Stored value propagated here.
}

The above example’s load can be eliminated because no synchronization with another thread is guaranteed to occur: even though the release is followed by an acquire, the compiler is allowed to pick the execution in which the stored value wasn’t modified before the subsequent load, and the load is therefore redundant.

Whereas the following code must (and does!) remain the same:

int x = 0;
std::atomic<int> y;
int no() {
  x = 0;
  y.store(0, std::memory_order_release);
  int z;
  while (!(z = y.load(std::memory_order_acquire)));
  x = 1;
  return z;
}

Other optimizations, such as global value numbering across atomics, can also be applied.

Mutex: Safer than Atomics?

The same optimization potential applies to C++’s std::mutex: locking a mutex is equivalent to acquire memory ordering, and unlocking a mutex is equivalent to release memory ordering. Using a mutex correctly is slightly easier because its API is simpler than that of atomics.
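
A minimal sketch of that equivalence (the names are illustrative):

std::mutex m;
int counter; // Protected by m.

void locked_inc() {
  m.lock();   // Behaves as an acquire: later accesses can't move above it.
  ++counter;
  m.unlock(); // Behaves as a release: earlier accesses can't move below it.
}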

Some current implementations rely on pthread’s mutex, which may not expose all optimization opportunities because the compiler may not know how to handle the slow-path futex (usually a syscall), or because the implementation is in a different translation unit. The optimization difficulties can be overcome by teaching the compiler to treat std::mutex or pthread specially, or by making it practical to implement mutexes in pure C++. Optimization across translation units, such as through link-time optimizations, or optimizations relying on escape analysis, can also help expose more opportunities.

Optimizations without Atomics

Another interesting optimization is to use potentially shared memory locations (on the stack, heap and globals) as scratch storage, if the compiler can prove that they are not accessed in other threads concurrently. This is spelled out in the C++11 standard in section 1.10 ¶22. For example the following transformation could occur:

// Some code, but no synchronization.
*p = 1; // Can be on stack, heap or global.

Becomes:

// ...
*p = RAX; // Spill temporary value.
// ...
RAX = *p; // Restore temporary value.
// ...
*p = 1;

Since we write to *p and there are no synchronization operations, other threads cannot read or write *p without exercising undefined behavior. We can therefore use it as scratch storage, reducing stack frame size, without changing the observable behavior of the program. This requires escape analysis: the compiler must see the full scope of the memory location p, or must know that the leaf functions it is passed to don’t capture p and that p isn’t accessed concurrently, for this optimization to be valid.

Architecture and Implementation Specific Optimizations

Optimizations can sometimes be made per-architecture, or even per specific implementation of an architecture. Compilers can usually be told to target specific architectures, CPUs or attributes using flags such as -march, -mcpu, -mattr.

Spinloops are usually implemented with an acquire load, which is equivalent to a relaxed load followed by an acquire fence in the loop. On some implementations of an architecture it may make sense to hoist the fence outside the loop, but how and when to do this is architecture-specific. In a similar way, mutexes are usually best implemented as a spinloop with randomized exponential backoff, falling back to a futex. The right implementation of mutexes is highly platform-dependent.
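
A sketch of the equivalence and of the hoisted form (whether hoisting pays off is architecture-specific):

std::atomic<bool> ready;

void spin_acquire() {
  while (!ready.load(std::memory_order_acquire)) { } // Acquire load on every iteration.
}

void spin_hoisted() {
  while (!ready.load(std::memory_order_relaxed)) { }   // Relaxed loads while spinning...
  std::atomic_thread_fence(std::memory_order_acquire); // ...a single acquire fence afterwards.
}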

Instructions can also be lowered in ways that are nominally incorrect for the architecture in general, but happen to be correct for specific implementations of that architecture. For example, release fences should lower to dmb ish on ARM, but on Apple’s Swift processor they lower to dmb ishst instead, which would be incorrect on other ARM processors. Some ARM processors can go even further and remove all dmb instructions which aren’t system-wide, because their memory model is much stronger than ARM’s prescribed model.

Some architectures support transactional memory. A compiler can use this knowledge to make many consecutive atomic writes into a single atomic transaction, and retry on commit failure. It can also speculate that many reads and writes aren’t accessed concurrently, or that certain locks aren’t contended, and fall back to a slow path, or to smaller transactions, if a commit failure limit is reached. Such approaches have been implemented using Intel’s RTM and HLE extensions.
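
A hypothetical sketch of such a lowering, written by hand with Intel’s RTM intrinsics from <immintrin.h>; the grouping and fallback policy are illustrative, not what any particular compiler emits, and the code requires RTM-capable hardware and -mrtm:

#include <immintrin.h> // _xbegin, _xend, _XBEGIN_STARTED.

std::atomic<int> a, b, c;

void grouped_stores() {
  if (_xbegin() == _XBEGIN_STARTED) {
    // The three stores commit atomically, as a single transaction.
    a.store(1, std::memory_order_relaxed);
    b.store(2, std::memory_order_relaxed);
    c.store(3, std::memory_order_relaxed);
    _xend();
  } else {
    // The transaction aborted: fall back to the original individual stores.
    a.store(1);
    b.store(2);
    c.store(3);
  }
}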

Other architectures do dynamic binary translation behind the scenes, and also use transactional memory. This can lead to further in-hardware optimizations as well as fairly hard to predict behavior: sometimes races aren’t observed because big transactions commit, and other times they do occur because transactions are smaller. This certainly makes micro-benchmarking hard, if not impossible.

The same applies to simulators and emulators, which often just-in-time translate the code they execute (leading to hard-to-predict behavior) and which often emulate multi-core systems using cooperative thread switching (leading to predictable interleavings which are easier to optimize for).

Volatility

Atomic operations are unsuitable for expressing that memory locations can be modified externally. Indeed, volatile (or volatile atomic) should be used in these circumstances.

Shared memory isn’t explicitly defined by the C++ standard, yet programmers often use operating system APIs to map the same physical memory location onto multiple virtual addresses in the same process, or across processes. A sufficiently advanced compiler, performing some of the optimizations described above, can seriously harm code which uses shared memory naïvely.
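
A minimal sketch, assuming status points into a mapping shared with another process (the mapping itself is set up through OS APIs not shown here):

void wait_for_peer(volatile std::atomic<int> *status) {
  // volatile keeps the compiler from assuming it knows every writer of this
  // location; atomic keeps the concurrent accesses race-free and ordered.
  while (status->load(std::memory_order_acquire) == 0) {
    // Spin until the other process publishes a non-zero status.
  }
}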

The C++ standard says that lock-free atomic operations should be address-free, which addresses such issues, but this recommendation isn’t normative.

Takeaways

For the Standards Committee

Don’t assume that these optimizations don’t occur; rather, encourage them. Standardize more common practices that enable to-the-metal optimizations. Provide more libraries that make concurrency and parallelism easy to use and hard to get wrong.

For Developers

Drop assembly: it can’t be optimized as well and is only tuned to the architectures that existed when you originally wrote the code. File bugs when performance expectations aren’t met by the compiler. Suggest to the standards committee new idiomatic patterns which enable concurrency and parallelism. Use the tooling available to you, such as ThreadSanitizer, to find races in your code.

For Hardware Vendors

Showcase your hardware’s strengths.

For Compiler Writers

Get back to work, there’s so much more to optimize… and so much code to break! Help users write good code: the compiler should provide diagnostics when it detects anti-patterns or misuses of atomics.

Acknowledgements

Thanks to Robin Morisset, Dmitry Vyukov, Chandler Carruth, Jeffrey Yasskin, Paul McKenney, Lawrence Crowl, Hans Boehm and Torvald Riegel for their reviews, corrections and ideas.