Date: Tue, 9 Aug 2011 21:07:10 +0000
Subject: [PATCH] First draft of the practical guide to atomics.

This is mostly descriptive of the intended state once atomic load and store have landed.

git-svn-id: https://llvm.org/svn/llvm-project/llvm/trunk@137145 91177308-0d34-0410-b5e6-96231b3b80d8
---
 docs/Atomics.html | 295 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 295 insertions(+)
 create mode 100644 docs/Atomics.html

diff --git a/docs/Atomics.html b/docs/Atomics.html
new file mode 100644
index 00000000000..92065a9b45e
--- /dev/null
+++ b/docs/Atomics.html
@@ -0,0 +1,295 @@
+ LLVM Atomic Instructions and Concurrency Guide

+ LLVM Atomic Instructions and Concurrency Guide +

+ +

Introduction
Load and store
Atomic orderings
Other atomic instructions
Atomics and IR optimization
Atomics and Codegen

+ +

Written by Eli Friedman

+ + +

+ Introduction +

+ + +

+ +

Historically, LLVM has not had very strong support for concurrency; some minimal intrinsics were provided, and volatile was used in some cases to achieve rough semantics in the presence of concurrency. However, this is changing; there are now new instructions which are well-defined in the presence of threads and asynchronous signals, and the model for existing instructions has been clarified in the IR.

+ +

The atomic instructions are designed specifically to provide readable IR and optimized code generation for the following:

The new C++0x <atomic> header.
Proper semantics for Java-style memory, for both volatile and regular shared variables.
gcc-compatible __sync_* builtins.
Other scenarios with atomic semantics, including static variables with non-trivial constructors in C++.

+ +

This document is intended as a guide for anyone either writing a frontend for LLVM or working on optimization passes for LLVM, explaining how to deal with instructions that have special semantics in the presence of concurrency. It is not intended to be a precise guide to the semantics; the details can get extremely complicated and unreadable, and are not usually necessary.

+ +

+ + +

+ Load and store +

+ + +

+ +

The basic 'load' and 'store' allow a variety of optimizations, but can have unintuitive results in a concurrent environment. For a frontend writer, the rule is essentially that all memory accessed with basic loads and stores by multiple threads should be protected by a lock or other synchronization; otherwise, you are likely to run into undefined behavior. (Do not use volatile as a substitute for atomics; it might work on some platforms, but does not provide the necessary guarantees in general.)

+ +

From the optimizer's point of view, the rule is that if there are not any instructions with atomic ordering involved, concurrency does not matter, with one exception: if a variable might be visible to another thread or signal handler, a store cannot be inserted along a path where it might not execute otherwise. Note that speculative loads are allowed; a load which is part of a race returns undef, but is not undefined behavior.
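The store-insertion rule can be illustrated at the source level. The following is a minimal C++ sketch, not taken from LLVM; the names `shared_value`, `store_if`, and `store_if_speculated` are hypothetical. It contrasts a conditional store with a rewritten form that looks equivalent in a single thread but introduces a store on a path where none existed.

```cpp
#include <cassert>

// Hypothetical shared variable; imagine another thread may read or write it.
int shared_value = 0;

// Original form: the store executes only when `cond` is true.
void store_if(bool cond) {
    if (cond)
        shared_value = 1;
}

// Rewritten form that introduces a store on the `!cond` path. In a single
// thread the result is the same, but when `cond` is false it writes back a
// possibly stale value and can overwrite a concurrent update. An optimizer
// must not turn store_if into this.
void store_if_speculated(bool cond) {
    int old = shared_value;   // speculative load: allowed
    shared_value = 1;         // inserted store: not allowed
    if (!cond)
        shared_value = old;   // "undo" store, also newly introduced
}

// Single-threaded check that the two forms agree when nothing else runs.
bool forms_agree_single_threaded(bool cond) {
    shared_value = 7;
    store_if(cond);
    int a = shared_value;
    shared_value = 7;
    store_if_speculated(cond);
    int b = shared_value;
    assert(a == (cond ? 1 : 7));
    return a == b;
}
```

The two forms agree only because no other thread runs between the load and the stores; that is exactly the assumption the optimizer is not allowed to make for a variable that might be visible elsewhere.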

+ +

For cases where simple loads and stores are not sufficient, LLVM provides atomic loads and stores with varying levels of guarantees.

+ +

+ + +

+ Atomic orderings +

+ + +

+ +

In order to achieve a balance between performance and necessary guarantees, there are six levels of atomicity. They are listed in order of strength; each level includes all the guarantees of the previous level except for Acquire/Release.

+ +

Unordered is the lowest level of atomicity. It essentially guarantees that races produce somewhat sane results instead of having undefined behavior. This is intended to match the Java memory model for shared variables. It cannot be used for synchronization, but is useful for Java and other "safe" languages which need to guarantee that the generated code never exhibits undefined behavior. Note that this guarantee is cheap on common platforms for loads of a native width, but can be expensive or unavailable for wider loads, like a 64-bit load on ARM. (A frontend for a "safe" language would normally split a 64-bit load on ARM into two 32-bit unordered loads.) In terms of the optimizer, this prohibits any transformation that transforms a single load into multiple loads, transforms a store into multiple stores, narrows a store, or stores a value which would not be stored otherwise. Some examples of unsafe optimizations are narrowing an assignment into a bitfield, rematerializing a load, and turning loads and stores into a memcpy call. Reordering unordered operations is safe, though, and optimizers should take advantage of that because unordered operations are common in languages that need them.

+ +

Monotonic is the weakest level of atomicity that can be used in synchronization primitives, although it does not provide any general synchronization. It essentially guarantees that if you take all the operations affecting a specific address, a consistent ordering exists. This corresponds to the C++0x/C1x memory_order_relaxed; see those standards for the exact definition. If you are writing a frontend, do not use the low-level synchronization primitives unless you are compiling a language which requires it or are sure a given pattern is correct. In terms of the optimizer, this can be treated as a read+write on the relevant memory location (and alias analysis will take advantage of that). In addition, it is legal to reorder non-atomic and Unordered loads around Monotonic loads. CSE/DSE and a few other optimizations are allowed, but Monotonic operations are unlikely to be used in ways which would make those optimizations useful.

+ +

Acquire provides a barrier of the sort necessary to acquire a lock to access other memory with normal loads and stores. This corresponds to the C++0x/C1x memory_order_acquire. This is a low-level synchronization primitive. In general, optimizers should treat this like a nothrow call.

+ +

Release is similar to Acquire, but with a barrier of the sort necessary to release a lock. This corresponds to the C++0x/C1x memory_order_release.

+ +

AcquireRelease (acq_rel in IR) provides both an Acquire and a Release barrier. This corresponds to the C++0x/C1x memory_order_acq_rel. In general, optimizers should treat this like a nothrow call.

+ +

SequentiallyConsistent (seq_cst in IR) provides Acquire and/or Release semantics (Acquire for loads, Release for stores, both for read-modify-write operations), and in addition guarantees that a total ordering exists with all other SequentiallyConsistent operations. This corresponds to the C++0x/C1x memory_order_seq_cst, and Java volatile. The intent of this ordering level is to provide a programming model which is relatively easy to understand. In general, optimizers should treat this like a nothrow call.
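As a rough source-level illustration of the orderings above, the following C++ sketch uses the C++0x <atomic> equivalents named in the text: memory_order_relaxed for a Monotonic counter, and a memory_order_release store paired with a memory_order_acquire load to publish ordinary data. The variable and function names (`hits`, `ready`, `payload`, `producer`, `consumer`) are hypothetical, and this is only one possible pattern, not a statement of how a frontend must lower anything.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<int> hits{0};        // counter: only atomicity is needed
std::atomic<bool> ready{false};  // flag used to publish `payload`
int payload = 0;                 // ordinary, non-atomic data

void producer() {
    payload = 42;                                   // plain store
    hits.fetch_add(1, std::memory_order_relaxed);   // Monotonic
    ready.store(true, std::memory_order_release);   // Release: publishes payload
}

int consumer() {
    // Acquire: once `true` is observed, the producer's earlier writes are visible.
    while (!ready.load(std::memory_order_acquire))
        std::this_thread::yield();
    hits.fetch_add(1, std::memory_order_relaxed);   // Monotonic
    return payload;                                 // plain load, no data race
}

int run_message_passing() {
    hits = 0;
    ready = false;
    payload = 0;
    int seen = 0;
    std::thread t1(producer);
    std::thread t2([&] { seen = consumer(); });
    t1.join();
    t2.join();
    // Both increments happened exactly once, regardless of interleaving.
    assert(hits.load(std::memory_order_relaxed) == 2);
    return seen;
}
```

Using memory_order_seq_cst for the store and load would also be correct here; it additionally places them in the single total order shared by all SequentiallyConsistent operations, which this particular pattern does not need.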

+ +

+ + +

+ Other atomic instructions +

+ + +

+ +

cmpxchg and atomicrmw are essentially like an atomic load followed by an atomic store (where the store is conditional for cmpxchg), but no other memory operation can happen between the load and store.
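The relationship between the two instructions can be sketched in C++ with a compare-exchange loop that builds a read-modify-write operation out of a conditional store. The helper name `fetch_max_sketch` is hypothetical and the choice of orderings is only an assumption for the example; it is not a description of how LLVM lowers atomicrmw.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: an atomic "max" read-modify-write built from a
// compare-exchange loop. Returns the value held before the update.
int fetch_max_sketch(std::atomic<int>& target, int candidate) {
    int observed = target.load(std::memory_order_relaxed);
    // compare_exchange_weak stores only if `target` still equals `observed`;
    // on failure it reloads `observed`, so the loop retries with fresh data.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_seq_cst,
                                         std::memory_order_relaxed)) {
    }
    return observed;
}

int demo_fetch_max() {
    std::atomic<int> v{5};
    int before = fetch_max_sketch(v, 9);   // 5 is replaced by 9
    assert(before == 5);
    before = fetch_max_sketch(v, 3);       // 3 is not larger, no store
    assert(before == 9);
    return v.load();
}
```

The loop shows why the conditional store matters: if another thread changes the value between the load and the exchange, the exchange fails and nothing is written.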

+ +

A fence provides Acquire and/or Release ordering which is not part of another operation; it is normally used along with Monotonic memory operations. A Monotonic load followed by an Acquire fence is roughly equivalent to an Acquire load.
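A minimal C++ sketch of this pairing, assuming std::atomic_thread_fence as the source-level counterpart of a fence and memory_order_relaxed as the counterpart of Monotonic; the names `fence_flag`, `fence_data`, `fence_writer`, and `fence_reader` are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

std::atomic<bool> fence_flag{false};
int fence_data = 0;   // ordinary, non-atomic data

void fence_writer() {
    fence_data = 17;
    std::atomic_thread_fence(std::memory_order_release);   // Release fence
    fence_flag.store(true, std::memory_order_relaxed);     // Monotonic store
}

int fence_reader() {
    while (!fence_flag.load(std::memory_order_relaxed))    // Monotonic load
        std::this_thread::yield();
    std::atomic_thread_fence(std::memory_order_acquire);   // Acquire fence
    return fence_data;   // ordered after the writer's store to fence_data
}

int run_fence_demo() {
    fence_flag = false;
    fence_data = 0;
    int seen = 0;
    std::thread w(fence_writer);
    std::thread r([&] { seen = fence_reader(); });
    w.join();
    r.join();
    assert(fence_flag.load(std::memory_order_relaxed));
    return seen;
}
```

Here the relaxed load followed by the acquire fence plays the role that a single acquire load would otherwise play.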

+ +

Frontends generating atomic instructions generally need to be aware of the target to some degree; atomic instructions are guaranteed to be lock-free, and therefore an instruction which is wider than the target natively supports can be impossible to generate.

+ +

+ + +

+ Atomics and IR optimization +

+ + +

+ +

Predicates for optimizer writers to query:

isSimple(): A load or store which is not volatile or atomic. This is what, for example, memcpyopt would check for operations it might transform.
isUnordered(): A load or store which is not volatile and at most Unordered. This would be checked, for example, by LICM before hoisting an operation.
mayReadFromMemory()/mayWriteToMemory(): Existing predicates, but note that they return true for any operation which is volatile or at least Monotonic.
Alias analysis: Note that AA will return ModRef for anything Acquire or Release, and for the address accessed by any Monotonic operation.

+ +

There are essentially two components to supporting atomic operations. The first is making sure to query isSimple() or isUnordered() instead of isVolatile() before transforming an operation. The other piece is making sure that a transform does not end up replacing, for example, an Unordered operation with a non-atomic operation. Most of the other necessary checks automatically fall out from existing predicates and alias analysis queries.

+ +

Some examples of how optimizations interact with various kinds of atomic operations:

memcpyopt: An atomic operation cannot be optimized into part of a memcpy/memset, including unordered loads/stores. The pass can still pull operations across some atomic operations.
LICM: Unordered loads/stores can be moved out of a loop. The pass just treats Monotonic operations like a read+write to a memory location, and anything stricter than that like a nothrow call.
DSE: Unordered stores can be DSE'ed like normal stores. Monotonic stores can be DSE'ed in some cases, but it's tricky to reason about, and not especially important.
Folding a load: Any atomic load from a constant global can be constant-folded, because it cannot be observed. Similar reasoning allows scalarrepl with atomic loads and stores.

+ +

+ + +

+ Atomics and Codegen +

+ + +

+ +

Atomic operations are represented in the SelectionDAG with ATOMIC_* opcodes. On architectures which use barrier instructions for all atomic ordering (like ARM), appropriate fences are split out as the DAG is built.

+ +

The MachineMemOperand for all atomic operations is currently marked as volatile; this is not correct in the IR sense of volatile, but CodeGen handles anything marked volatile very conservatively. This should get fixed at some point.

+ +

The implementation of atomics on LL/SC architectures (like ARM) is currently a bit of a mess; there is a lot of copy-pasted code across targets, and the representation is relatively unsuited to optimization (it would be nice to be able to optimize loops involving cmpxchg etc.).

+ +

On x86, all atomic loads generate a MOV. SequentiallyConsistent stores generate an XCHG; other stores generate a MOV. SequentiallyConsistent fences generate an MFENCE; other fences do not cause any code to be generated. cmpxchg uses the LOCK CMPXCHG instruction. atomicrmw xchg uses XCHG, atomicrmw add and atomicrmw sub use XADD, and all other atomicrmw operations generate a loop with LOCK CMPXCHG. Depending on the users of the result, some atomicrmw operations can be translated into operations like LOCK AND, but that does not work in general.

+ +

On ARM, MIPS, and many other RISC architectures, Acquire, Release, and SequentiallyConsistent semantics require barrier instructions for every such operation. Loads and stores generate normal instructions. atomicrmw and cmpxchg generate LL/SC loops.

+ +

+ + + +

+ + LLVM Compiler Infrastructure
+ Last modified: $Date: 2011-08-09 02:07:00 -0700 (Tue, 09 Aug 2011) $ +

+ + + -- 2.34.1