LLVM Atomic Instructions and Concurrency Guide

The basic 'load' and 'store' allow a variety of optimizations, but can lead to undefined results in a concurrent environment; see NotAtomic. This section specifically covers the one optimizer restriction that applies in concurrent environments. It gets a more extended description because any optimization dealing with stores needs to be aware of it.

From the optimizer's point of view, the rule is that if there are not any instructions with atomic ordering involved, concurrency does not matter, with one exception: if a variable might be visible to another thread or signal handler, a store cannot be inserted along a path where it might not execute otherwise. Take the following example, in which a pass such as LICM might want to promote the loads and stores of x in the loop to a register:

```c
/* C code, for readability; run through clang -O2 -S -emit-llvm to get
   equivalent IR */
int x;
void f(int* a) {
  for (int i = 0; i < 100; i++) {
    if (a[i])
      x += 1;
  }
}
```

The following is equivalent in non-concurrent situations:

```c
int x;
void f(int* a) {
  int xtemp = x;
  for (int i = 0; i < 100; i++) {
    if (a[i])
      xtemp += 1;
  }
  x = xtemp;
}
```

However, LLVM is not allowed to transform the former into the latter. The latter always stores to x, even when no a[i] is nonzero and the former never writes x. If another thread can access x at the same time, this can introduce a race, and therefore undefined behavior, that the original program did not have. (This example is particularly of interest because LLVM would perform this transformation before the concurrency model was implemented.)
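
One way to keep most of the benefit of promotion without introducing a store is to write x back only on paths where the loop actually stored to it. The following is a minimal hand-written sketch of that idea, not what LICM emits. The names `f_promoted` and `stored` are hypothetical. It also delays the first read of x until the point where the original loop would read it, so the C version does not add a read that the source lacked.

```c
#include <assert.h>

int x;

/* Hypothetical promoted form of f: x is read and written only on paths where
   the original loop would read and write it, so no new store is introduced. */
void f_promoted(const int *a) {
  int xtemp = 0;  /* register copy of x, valid once `stored` is set */
  int stored = 0; /* set the first time the original loop would touch x */
  for (int i = 0; i < 100; i++) {
    if (a[i]) {
      if (!stored) {
        xtemp = x; /* first read of x, where the original first read it */
        stored = 1;
      }
      xtemp += 1;
    }
  }
  if (stored)
    x = xtemp; /* write back only if the original performed a store */
}
```

If every a[i] is zero, `f_promoted` never touches x. That matches the original, so a concurrent writer of x cannot observe a store it would not otherwise have raced with.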

Note that speculative loads are allowed; a load which is part of a race returns undef, but does not have undefined behavior.
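
The pair of hypothetical functions below shows the asymmetry. An optimizer working on IR may hoist the load of y above the branch in `g`, producing the equivalent of `g_speculated`. If that load races, it yields undef, and the value is discarded whenever c is zero. The C spelling is only for readability. In C source, writing `g_speculated` by hand would make the read a data race whenever another thread writes y, because C does not have LLVM's "racing load returns undef" rule.

```c
#include <assert.h>

int y;

/* Original: y is read only when c is nonzero. */
int g(int c) {
  return c ? y : 0;
}

/* Speculated form (IR-level transformation written in C): the load of y is
   hoisted above the branch; its result is simply unused when c is zero. */
int g_speculated(int c) {
  int t = y;
  return c ? t : 0;
}
```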

For cases where simple loads and stores are not sufficient, LLVM provides various atomic instructions. The exact guarantees provided depend on the ordering; see Atomic orderings.

load atomic and store atomic provide the same basic functionality as non-atomic loads and stores, but provide additional guarantees in situations where threads and signals are involved.
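
From C, a frontend usually reaches these instructions through C11 `atomic_load_explicit` and `atomic_store_explicit`. The sketch below assumes Clang, which typically lowers relaxed C11 accesses to `load atomic`/`store atomic` with `monotonic` ordering. The names `limit`, `set_limit` and `get_limit` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* A 64-bit setting shared between threads. Because it is _Atomic, a reader
   never observes a torn (half-updated) value, even on targets where a plain
   64-bit store would take two instructions. */
static _Atomic uint64_t limit;

void set_limit(uint64_t v) {
  /* Typically emitted as `store atomic i64 ... monotonic`. */
  atomic_store_explicit(&limit, v, memory_order_relaxed);
}

uint64_t get_limit(void) {
  /* Typically emitted as `load atomic i64 ... monotonic`. */
  return atomic_load_explicit(&limit, memory_order_relaxed);
}
```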

cmpxchg and atomicrmw are essentially like an atomic load followed by an atomic store (where the store is conditional for cmpxchg), but no other memory operation can happen on any thread between the load and store. Note that LLVM's cmpxchg does not provide quite as many options as the C++0x version.
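
As an illustration, the following C11 compare-and-swap loop raises a shared value to a maximum. Clang typically lowers `atomic_compare_exchange_strong_explicit` to a single `cmpxchg`. The helper `atomic_max` is hypothetical. Because the exchange succeeds only if no other thread changed the value between its load and its store, the loop retries with the freshly observed value on failure.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical helper: raise *p to at least v, even while other threads
   update *p concurrently. */
void atomic_max(_Atomic int *p, int v) {
  int cur = atomic_load_explicit(p, memory_order_relaxed);
  /* Each compare-exchange is one indivisible load+compare+store. On failure
     it writes the current value of *p into cur, and we retry only while that
     value is still below v. */
  while (cur < v &&
         !atomic_compare_exchange_strong_explicit(p, &cur, v,
                                                  memory_order_relaxed,
                                                  memory_order_relaxed)) {
    /* retry with the updated cur */
  }
}
```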

A fence provides Acquire and/or Release ordering which is not part of another operation; it is normally used along with Monotonic memory operations.
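
The following sketch shows the usual fence pattern in C11. It assumes `atomic_thread_fence` is lowered to an LLVM `fence`, as Clang typically does. A flag is accessed with Monotonic (relaxed) operations, with a Release fence on the writer side and an Acquire fence on the reader side. Together these give the same synchronization as a release store paired with an acquire load. The names `payload`, `ready`, `publish` and `try_consume` are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

static int payload;      /* plain (NotAtomic) data */
static atomic_int ready; /* flag accessed only with Monotonic operations */

void publish(int v) {
  payload = v;
  /* Release fence, then a Monotonic store: the payload write is ordered
     before the flag becomes visible to a reader that uses an Acquire fence. */
  atomic_thread_fence(memory_order_release);
  atomic_store_explicit(&ready, 1, memory_order_relaxed);
}

/* Returns 1 and stores the value in *out once it has been published. */
int try_consume(int *out) {
  if (!atomic_load_explicit(&ready, memory_order_relaxed))
    return 0;
  /* A Monotonic load followed by an Acquire fence acts roughly like an
     Acquire load of `ready`, so reading payload here is race-free. */
  atomic_thread_fence(memory_order_acquire);
  *out = payload;
  return 1;
}
```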

In order to achieve a balance between performance and necessary guarantees, there are six levels of atomicity. They are listed in order of strength; each level includes all the guarantees of the previous level except for Acquire/Release. (See also LangRef.)
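
To relate these levels to a source language, the sketch below maps each C11 `memory_order` to the LLVM ordering a frontend would typically emit. The enum `llvm_ordering` and the function `ordering_for` are hypothetical and exist only for this illustration. Unordered has no C11 spelling. Treating `memory_order_consume` as Acquire reflects what Clang currently does, not a requirement of the model.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical names for LLVM's orderings, in the order this guide lists
   them (LLVM's C API spells them LLVMAtomicOrderingMonotonic and so on). */
enum llvm_ordering {
  ORD_NOT_ATOMIC,
  ORD_UNORDERED, /* no C11 counterpart; meant for languages like Java */
  ORD_MONOTONIC,
  ORD_ACQUIRE,
  ORD_RELEASE,
  ORD_ACQ_REL,
  ORD_SEQ_CST
};

/* The LLVM ordering usually chosen for a C11 memory_order argument. */
enum llvm_ordering ordering_for(memory_order mo) {
  switch (mo) {
  case memory_order_relaxed: return ORD_MONOTONIC;
  case memory_order_consume: /* Clang treats consume as acquire */
  case memory_order_acquire: return ORD_ACQUIRE;
  case memory_order_release: return ORD_RELEASE;
  case memory_order_acq_rel: return ORD_ACQ_REL;
  case memory_order_seq_cst: return ORD_SEQ_CST;
  }
  return ORD_SEQ_CST; /* not reached for valid memory_order values */
}
```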

NotAtomic

NotAtomic is the obvious one: a load or store which is not atomic. (This isn't really a level of atomicity, but is listed here for comparison.) This is essentially a regular load or store. If there is a race on a given memory location, loads from that location return undef.

Relevant standard: This is intended to match shared variables in C/C++, and to be used in any other context where memory access is necessary and a race is impossible. (The precise definition is in LangRef.)
Notes for frontends: The rule is essentially that all memory accessed with basic loads and stores by multiple threads should be protected by a lock or other synchronization; otherwise, you are likely to run into undefined behavior. If your frontend is for a "safe" language like Java, use Unordered to load and store any shared variable. Note that NotAtomic volatile loads and stores are not properly atomic; do not try to use them as a substitute. (Per the C/C++ standards, volatile does provide some limited guarantees around asynchronous signals, but atomics are generally a better solution.)
Notes for optimizers: Introducing loads to shared variables along a codepath where they would + not otherwise exist is allowed; introducing stores to shared variables + is not. See Optimization outside + atomic.
Notes for code generation: The one interesting restriction here is that it is not allowed to write + to bytes outside of the bytes relevant to a store. This is mostly + relevant to unaligned stores: it is not allowed in general to convert + an unaligned store into two aligned stores of the same width as the + unaligned store. Backends are also expected to generate an i8 store + as an i8 store, and not an instruction which writes to surrounding + bytes. (If you are writing a backend for an architecture which cannot + satisfy these restrictions and cares about concurrency, please send an + email to llvmdev.)
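
The byte-granularity rule matters because separate threads may own neighbouring bytes. The hypothetical sketch below is written in C only to show the shape of the machine code. `set_a` corresponds to the required single-byte store. `set_a_widened` is the kind of read-modify-write of the whole word that a backend must not emit for an i8 store. Single-threaded, both give the same result. However, if another thread stores to `f->b` between the widened read and write-back, that store is silently lost.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Two flags that different threads may write independently. */
struct flags {
  uint8_t a;      /* written only by thread A */
  uint8_t b;      /* written only by thread B */
  uint8_t pad[2];
};
_Static_assert(sizeof(struct flags) == 4, "expected a 4-byte struct");

/* Allowed lowering of `f->a = v`: touches exactly one byte. */
void set_a(struct flags *f, uint8_t v) {
  f->a = v;
}

/* Forbidden lowering: reads and rewrites all four bytes, including b. */
void set_a_widened(struct flags *f, uint8_t v) {
  uint8_t word[4];
  memcpy(word, f, sizeof word); /* read the whole word */
  word[0] = v;                  /* modify only byte 0 (field a) */
  memcpy(f, word, sizeof word); /* write the whole word back */
}
```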

[…]
which would make those optimizations useful.
Notes for code generation

For loads and stores, code generation is essentially the same as for Unordered. No fences are required. `cmpxchg` and `atomicrmw` are required to appear as a single operation.
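
A typical Monotonic use is an event counter, where only the final total matters and no other memory needs to be ordered against it. In the hypothetical sketch below, Clang would typically lower the relaxed `atomic_fetch_add_explicit` to a single `atomicrmw add ... monotonic`, with no surrounding fences. The names `events`, `count_event` and `events_seen` are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_ulong events; /* statistics counter shared between threads */

void count_event(void) {
  /* One indivisible read-modify-write; no increments are lost, but no
     ordering with other memory operations is implied. */
  atomic_fetch_add_explicit(&events, 1, memory_order_relaxed);
}

unsigned long events_seen(void) {
  return atomic_load_explicit(&events, memory_order_relaxed);
}
```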
[…]
SequentiallyConsistent operations may not be reordered.
Notes for code generation

SequentiallyConsistent loads minimally require the same barriers as Acquire operations, and SequentiallyConsistent stores require Release barriers. Additionally, the code generator must enforce ordering between SequentiallyConsistent stores followed by SequentiallyConsistent loads. This is usually done by emitting either a full fence before the loads or a full fence after the stores; which is preferred varies by architecture.
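
The store-then-load case is the classic store-buffering (Dekker-style) pattern. In the hypothetical sketch below, `side0` and `side1` are meant to run on different threads. With SequentiallyConsistent accesses, at least one of them must observe the other's flag, so both cannot return 0. If the stores were only Release and the loads only Acquire, each load could be satisfied before that thread's own earlier store became visible, and both could return 0. The full fence rules that out. On x86-64, for example, Clang typically implements a seq_cst store with `xchg`, which acts as the full fence after the store.

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int flag0, flag1;

/* Each side announces itself, then checks whether the other has too. */
int side0(void) {
  atomic_store_explicit(&flag0, 1, memory_order_seq_cst);
  return atomic_load_explicit(&flag1, memory_order_seq_cst);
}

int side1(void) {
  atomic_store_explicit(&flag1, 1, memory_order_seq_cst);
  return atomic_load_explicit(&flag0, memory_order_seq_cst);
}
```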
[…]

isSimple(): A load or store which is not volatile or atomic. This is what, for example, memcpyopt would check for operations it might transform.

isUnordered(): A load or store which is not volatile and at most Unordered. This would be checked, for example, by LICM before hoisting an operation.

mayReadFromMemory()/mayWriteToMemory(): Existing predicates, but note that they return true for any operation which is volatile or at least Monotonic.

Alias analysis: Note that AA will return ModRef for anything Acquire or Release, and for the address accessed by any Monotonic operation.
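
The first two predicates can be modelled in a few lines. The sketch below is a hypothetical standalone model, not LLVM's API. In LLVM the checks are the `isSimple()` and `isUnordered()` methods on load and store instructions. The struct `mem_access` and its enum exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of an access's ordering, weakest first. */
enum ordering { NOT_ATOMIC, UNORDERED, MONOTONIC, ACQUIRE, RELEASE,
                ACQ_REL, SEQ_CST };

/* Hypothetical model of a load or store. */
struct mem_access {
  bool is_volatile;
  enum ordering ord;
};

/* Neither volatile nor atomic: fair game for memcpyopt-style rewriting. */
bool is_simple(struct mem_access m) {
  return !m.is_volatile && m.ord == NOT_ATOMIC;
}

/* Not volatile and at most Unordered: what LICM checks before hoisting. */
bool is_unordered(struct mem_access m) {
  return !m.is_volatile && m.ord <= UNORDERED;
}
```

A pass that accepts Unordered accesses through `is_unordered` must still carry the ordering over to any load or store it creates, rather than emitting a NotAtomic one.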

To support optimizing around atomic operations, make sure you are using the right predicates; everything should work if that is done. If your pass should optimize some atomic operations (Unordered operations in particular), make sure it doesn't replace an atomic load or store with a non-atomic operation.

Some examples of how optimizations interact with various kinds of atomic operations:
