folly/docs/ThreadCachedInt.md

   1 `folly/ThreadCachedInt.h`
   2 ----------------------
   3
   4 High-performance atomic increment using thread caching.
   5
   6 `folly/ThreadCachedInt.h` introduces a integer class designed for high
   7 performance increments from multiple threads simultaneously without
   8 loss of precision.  It has two read modes, `readFast` gives a potentially stale
   9 value with one load, and `readFull` gives the exact value, but is much slower,
  10 as discussed below.
  11
  12
  13 ### Performance
  14 ***
  15
  16 Increment performance is up to 10x greater than `std::atomic_fetch_add` in high
  17 contention environments.  See `folly/test/ThreadCachedIntTest.h` for more
  18 comprehensive benchmarks.
  19
  20 `readFast` is as fast as a single load.
  21
  22 `readFull`, on the other hand, requires acquiring a mutex and iterating through
  23 a list to accumulate the values of all the thread local counters, so is
  24 significantly slower than `readFast`.
  25
  26
  27 ### Usage
  28 ***
  29
  30 Create an instance and increment it with `increment` or the operator overloads.
  31 Read the value with `readFast` for quick, potentially stale data, or `readFull`
  32 for a more expensive but precise result. There are additional convenience
  33 functions as well, such as `set`.
  34
  35 ``` Cpp
  36     ThreadCachedInt<int64_t> val;
  37     EXPECT_EQ(0, val.readFast());
  38     ++val;                        // increment in thread local counter only
  39     EXPECT_EQ(0, val.readFast()); // increment has not been flushed
  40     EXPECT_EQ(1, val.readFull()); // accumulates all thread local counters
  41     val.set(2);
  42     EXPECT_EQ(2, val.readFast());
  43     EXPECT_EQ(2, val.readFull());
  44 ```
  45
  46 ### Implementation
  47 ***
  48
  49 `folly::ThreadCachedInt` uses `folly::ThreadLocal` to store thread specific
  50 objects that each have a local counter.  When incrementing, the thread local
  51 instance is incremented.  If the local counter passes the cache size, the value
  52 is flushed to the global counter with an atomic increment.  It is this global
  53 counter that is read with `readFast` via a simple load, but will not count any
  54 of the updates that haven't been flushed.
  55
  56 In order to read the exact value, `ThreadCachedInt` uses the extended
  57 `readAllThreads()` API of `folly::ThreadLocal` to iterate through all the
  58 references to all the associated thread local object instances.  This currently
  59 requires acquiring a global mutex and iterating through the references,
  60 accumulating the counters along with the global counter.  This also means that
  61 the first use of the object from a new thread will acquire the mutex in order to
  62 insert the thread local reference into the list.  By default, there is one
  63 global mutex per integer type used in `ThreadCachedInt`.  If you plan on using a
  64 lot of `ThreadCachedInt`s in your application, considering breaking up the
  65 global mutex by introducing additional `Tag` template parameters.
  66
  67 `set` simply sets the global counter value, and marks all the thread local
  68 instances as needing to be reset.  When iterating with `readFull`, thread local
  69 counters that have been marked as reset are skipped.  When incrementing, thread
  70 local counters marked for reset are set to zero and unmarked for reset.
  71
  72 Upon destruction, thread local counters are flushed to the parent so that counts
  73 are not lost after increments in temporary threads.  This requires grabbing the
  74 global mutex to make sure the parent itself wasn't destroyed in another thread
  75 already.
  76
  77 ### Alternate Implementations
  78 ***
  79
  80 There are of course many ways to skin a cat, and you may notice there is a
  81 partial alternate implementation in `folly/test/ThreadCachedIntTest.cpp` that
  82 provides similar performance.  `ShardedAtomicInt` simply uses an array of
  83 `std::atomic<int64_t>`'s and hashes threads across them to do low-contention
  84 atomic increments, and `readFull` just sums up all the ints.
  85
  86 This sounds great, but in order to get the contention low enough to get similar
  87 performance as ThreadCachedInt with 24 threads, `ShardedAtomicInt` needs about
  88 2000 ints to hash across.  This uses about 20x more memory, and the lock-free
  89 `readFull` has to sum up all 2048 ints, which ends up being a about 50x slower
  90 than `ThreadCachedInt` in low contention situations, which is hopefully the
  91 common case since it's designed for high-write, low read access patterns.
  92 Performance of `readFull` is about the same speed as `ThreadCachedInt` in high
  93 contention environments.
  94
  95 Depending on the operating conditions, it may make more sense to use one
  96 implementation over the other.  For example, a lower contention environment will
  97 probably be able to use a `ShardedAtomicInt` with a much smaller array without
  98 hurting performance, while improving memory consumption and perf of `readFull`.