Document Number: P2835R0
Date: 2023-03-13
Reply to: Gonzalo Brito Gadeschi <gonzalob _at_ nvidia.com>
Authors: Gonzalo Brito Gadeschi
Audience: Concurrency

Expose std::atomic_ref's object address

std::atomic_ref prevents applications from obtaining the address of the object referenced by *this and, therefore, from reasoning about contention on accesses to that object, which is crucial for performance (see the “Use cases” section).

Applications that need to reason about contention for performance cannot use std::atomic_ref, but may be able to use std::atomic<T>& or std::atomic<T>* instead.

That is not always possible, e.g., if the object’s type is outside the application’s control. In that case, a pair<atomic_ref<T>, T*> may be passed around instead. However, this is not ergonomic, and always having a raw pointer available slightly increases the hazard of accidentally accessing the object through that pointer while an atomic_ref to it is still live.
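
For illustration, a minimal sketch of that workaround (the function and parameter names are hypothetical), showing how the raw pointer must travel alongside the atomic_ref and remains available for plain accesses:

#include <atomic>
#include <utility>

// Hypothetical helper: the raw address must be passed alongside the
// atomic_ref because atomic_ref does not expose it.
void bump(std::pair<std::atomic_ref<int>, int*> counter) {
  int* addr = counter.second;  // needed, e.g., to reason about contention
  counter.first.fetch_add(1, std::memory_order_relaxed);
  // Hazard: nothing prevents a plain, non-atomic access through addr
  // while atomic_ref objects referencing *addr are still live,
  // e.g. `*addr = 0;` here would be a data race.
  (void)addr;
}

void caller(int& c) { bump({std::atomic_ref<int>(c), &c}); }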

This paper instead proposes adding a .data() member function to std::atomic_ref, which applications can use when they need the underlying object’s address, e.g., to reason about contention.

Tony tables

Before                         After
std::atomic<int>& ref;         std::atomic_ref<int> ref;
auto* addr = &ref;             auto* addr = ref.data();
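
As a further illustration, a minimal sketch of how the proposed .data() member could be used to reason about contention, here by checking whether two atomic_ref views may contend through false sharing. The helper name and the 64-byte cache-line size are assumptions, not part of the proposal:

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t assumed_cache_line = 64;  // assumption, not portable

// Two distinct objects whose addresses fall on the same cache line may
// contend via false sharing; data() exposes the addresses needed to check.
bool may_false_share(std::atomic_ref<int> a, std::atomic_ref<int> b) {
  auto la = reinterpret_cast<std::uintptr_t>(a.data()) / assumed_cache_line;
  auto lb = reinterpret_cast<std::uintptr_t>(b.data()) / assumed_cache_line;
  return la == lb && a.data() != b.data();
}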

Wording

Add the following to [atomics.ref.generic.general].

namespace std {
  template<class T> struct atomic_ref {
    // ...
    T const* data() const noexcept;
    // ...
  };
}

Add the following to [atomics.ref.ops]:

T const* data() const noexcept;

  Returns: The address of the object referenced by *this.

Use cases

WIP: collecting examples of real-world use cases for this feature that are small enough to include here.

Discovery Patterns

Some hardware architectures have instructions to “discover” different threads of the same program that are running on the same core and are executing the same “program step”.

In those hardware architectures, these instructions can be used to aggregate atomic operations performed by different threads into a single operation performed by one thread. The pattern looks like this:

void unsynchronized_aggregated_faa(atomic<int>& acc, int upd) {
  // Find all spatially-close threads executing this program step
  // with same values of "acc" and "upd".
  auto thread_mask = __discover_threads_with_same(acc, upd);
  auto thread_count = popcount(thread_mask);
  // These threads elect a leader, which aggregates their updates
  // and performs a single atomic RMW operation instead of one
  // per thread:
  if (__pick_one(thread_mask))
    acc.fetch_add(thread_count * upd, memory_order_relaxed);
}

On NVIDIA GPUs, this optimization can significantly increase the performance of certain algorithms, such as “arrive” operations on barriers. In this example (godbolt), speedups of ~1.25x are measured even with a small number of threads.
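
For comparison, a sketch of the same pattern when the accumulator is only reachable through an atomic_ref: the discovery step needs the accumulator’s address to group threads operating on the same object, which the proposed .data() member would expose. The __discover_threads_with_same and __pick_one intrinsics are the same placeholders as above, here assumed to accept an address:

void unsynchronized_aggregated_faa(atomic_ref<int> acc, int upd) {
  // Group spatially-close threads updating the same object with the
  // same value; the address comes from the proposed data() member.
  auto thread_mask = __discover_threads_with_same(acc.data(), upd);
  auto thread_count = popcount(thread_mask);
  // One leader performs the aggregated RMW on behalf of the group:
  if (__pick_one(thread_mask))
    acc.fetch_add(thread_count * upd, memory_order_relaxed);
}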