folly/docs/Benchmark.md

   1 `folly/Benchmark.h`
   2 -----------------
   3
   4 `folly/Benchmark.h` provides a simple framework for writing and
   5 executing benchmarks. Currently the framework targets only
   6 single-threaded testing (though you can internally use fork-join
   7 parallelism and measure total run time).
   8
   9 To use this library, you need to be using gcc 4.6 or later. Include
  10 `folly/Benchmark.h` and make sure `folly/benchmark.cpp` is part of the
  11 build (either directly or packaged with a library).
  12
  13 ### Overview
  14 ***
  15
  16 Using `folly/Benchmark.h` is very simple. Here's an example:
  17
  18 ``` Cpp
  19     #include <folly/Benchmark.h>
  20     #include <folly/Foreach.h>
  21     #include <vector>
  22     using namespace std;
  23     using namespace folly;
  24     BENCHMARK(insertFrontVector) {
  25       // Let's insert 100 elements at the front of a vector
  26       vector<int> v;
  27       FOR_EACH_RANGE (i, 0, 100) {
  28         v.insert(v.begin(), i);
  29       }
  30     }
  31     BENCHMARK(insertBackVector) {
  32       // Let's insert 100 elements at the back of a vector
  33       vector<int> v;
  34       FOR_EACH_RANGE (i, 0, 100) {
  35         v.insert(v.end(), i);
  36       }
  37     }
  38     int main() {
  39       runBenchmarks();
  40     }
  41 ```
  42
  43 Compiling and running this code produces to the standard output:
  44
  45 ```
  46     ===============================================================================
  47     test.cpp                                              relative ns/iter  iters/s
  48     ===============================================================================
  49     insertFrontVector                                                3.84K  260.38K
  50     insertBackVector                                                 1.61K  622.75K
  51     ===============================================================================
  52 ```
  53
  54 Let's worry about the empty column "relative" later. The table
  55 contains, for each benchmark, the time spent per call and the converse
  56 number of calls per second. Numbers are represented in metric notation
  57 (K for thousands, M for millions etc). As expected, in this example
  58 the second function is much faster (fewer ns/iter and more iters/s).
  59
  60 The macro `BENCHMARK` introduces a function and also adds it to an
  61 internal array containing all benchmarks in the system. The defined
  62 function takes no arguments and returns `void`.
  63
  64 The framework calls the function many times to collect statistics
  65 about it. Sometimes the function itself would want to do that
  66 iteration---for example how about inserting `n` elements instead of
  67 100 elements? To do the iteration internally, use `BENCHMARK` with two
  68 parameters. The second parameter is the number of iterations and is
  69 passed by the framework down to the function. The type of the count is
  70 implicitly `unsigned`. Consider a slightly reworked example:
  71
  72 ``` Cpp
  73     #include <folly/Benchmark.h>
  74     #include <folly/Foreach.h>
  75     #include <vector>
  76     using namespace std;
  77     using namespace folly;
  78     BENCHMARK(insertFrontVector, n) {
  79       vector<int> v;
  80       FOR_EACH_RANGE (i, 0, n) {
  81         v.insert(v.begin(), i);
  82       }
  83     }
  84     BENCHMARK(insertBackVector, n) {
  85       vector<int> v;
  86       FOR_EACH_RANGE (i, 0, n) {
  87         v.insert(v.end(), i);
  88       }
  89     }
  90     int main() {
  91       runBenchmarks();
  92     }
  93 ```
  94
  95 The produced numbers are substantially different:
  96
  97 ```
  98     ===============================================================================
  99     Benchmark                                             relative ns/iter  iters/s
 100     ===============================================================================
 101     insertFrontVector                                               39.92    25.05M
 102     insertBackVector                                                 3.46   288.89M
 103     ===============================================================================
 104 ```
 105
 106 Now the numbers indicate the speed of one single insertion because the
 107 framework assumed the user-defined function used internal iteration
 108 (which it does). So inserting at the back of a vector is more than 10
 109 times faster than inserting at the front! Speaking of comparisons...
 110
 111 ### Baselines
 112 ***
 113
 114 Choosing one or more good baselines is a crucial activity in any
 115 measurement. Without a baseline there is little information to derive
 116 from the sheer numbers. If, for example, you do experimentation with
 117 algorithms, a good baseline is often an established approach (e.g. the
 118 built-in `std::sort` for sorting). Essentially all experimental
 119 numbers should be compared against some baseline.
 120
 121 To support baseline-driven measurements, `folly/Benchmark.h` defines
 122 `BENCHMARK_RELATIVE`, which works much like `BENCHMARK`, except it
 123 considers the most recent lexically-ocurring `BENCHMARK` a baseline,
 124 and fills the "relative" column. Say, for example, we want to use
 125 front insertion for a vector as a baseline and see how back insertion
 126 compares with it:
 127
 128 ``` Cpp
 129     #include <folly/Benchmark.h>
 130     #include <folly/Foreach.h>
 131     #include <vector>
 132     using namespace std;
 133     using namespace folly;
 134     BENCHMARK(insertFrontVector, n) {
 135       vector<int> v;
 136       FOR_EACH_RANGE (i, 0, n) {
 137         v.insert(v.begin(), i);
 138       }
 139     }
 140     BENCHMARK_RELATIVE(insertBackVector, n) {
 141       vector<int> v;
 142       FOR_EACH_RANGE (i, 0, n) {
 143         v.insert(v.end(), i);
 144       }
 145     }
 146     int main() {
 147       runBenchmarks();
 148     }
 149 ```
 150
 151 This program prints something like:
 152
 153 ```
 154     ===============================================================================
 155     Benchmark                                             relative ns/iter  iters/s
 156     ===============================================================================
 157     insertFrontVector                                               42.65    23.45M
 158     insertBackVector                                     1208.24%    3.53   283.30M
 159     ===============================================================================
 160 ```
 161
 162 showing the 1208.24% relative speed advantage of inserting at the back
 163 compared to front. The scale is chosen in such a way that 100% means
 164 identical speed, numbers smaller than 100% indicate the benchmark is
 165 slower than the baseline, and numbers greater than 100% indicate the
 166 benchmark is faster. For example, if you see 42% that means the speed
 167 of the benchmark is 0.42 of the baseline speed. If you see 123%, it
 168 means the benchmark is 23% or 1.23 times faster.
 169
 170 To close the current benchmark group and start another, simply use
 171 `BENCHMARK` again.
 172
 173 ### Ars Gratia Artis
 174 ***
 175
 176 If you want to draw a horizontal line of dashes (e.g. at the end of a
 177 group or for whatever reason), use `BENCHMARK_DRAW_LINE()`. The line
 178 fulfills a purely aesthetic role; it doesn't interact with
 179 measurements in any way.
 180
 181 ``` Cpp
 182     BENCHMARK(foo) {
 183       Foo foo;
 184       foo.doSomething();
 185     }
 186
 187     BENCHMARK_DRAW_LINE();
 188
 189     BENCHMARK(bar) {
 190       Bar bar;
 191       bar.doSomething();
 192     }
 193 ```
 194
 195 ### Suspending a benchmark
 196 ***
 197
 198 Sometimes benchmarking code must do some preparation work that is
 199 physically inside the benchmark function, but should not take part to
 200 its time budget. To temporarily suspend the benchmark, use the
 201 pseudo-statement `BENCHMARK_SUSPEND` as follows:
 202
 203 ``` Cpp
 204     BENCHMARK(insertBackVector, n) {
 205       vector<int> v;
 206       BENCHMARK_SUSPEND {
 207         v.reserve(n);
 208       }
 209       FOR_EACH_RANGE (i, 0, n) {
 210         v.insert(v.end(), i);
 211       }
 212     }
 213 ```
 214
 215 The preallocation effected with `v.reserve(n)` will not count toward
 216 the total run time of the benchmark.
 217
 218 Only the main thread should call `BENCHMARK_SUSPEND` (and of course it
 219 should not call it while other threads are doing actual work). This is
 220 because the timer is application-global.
 221
 222 If the scope introduced by `BENCHMARK_SUSPEND` is not desired, you may
 223 want to "manually" use the `BenchmarkSuspender` type. Constructing
 224 such an object suspends time measurement, and destroying it resumes
 225 the measurement. If you want to resume time measurement before the
 226 destructor, call `dismiss` against the `BenchmarkSuspender`
 227 object. The previous example could have been written like this:
 228
 229 ``` Cpp
 230     BENCHMARK(insertBackVector, n) {
 231       BenchmarkSuspender braces;
 232       vector<int> v;
 233       v.reserve(n);
 234       braces.dismiss();
 235       FOR_EACH_RANGE (i, 0, n) {
 236         v.insert(v.end(), i);
 237       }
 238     }
 239 ```
 240
 241 ### `doNotOptimizeAway`
 242 ***
 243
 244 Finally, the small utility function `doNotOptimizeAway` prevents
 245 compiler optimizations that may interfere with benchmarking . Call
 246 doNotOptimizeAway(var) against variables that you use for
 247 benchmarking but otherwise are useless. The compiler tends to do a
 248 good job at eliminating unused variables, and this function fools it
 249 into thinking a variable is in fact needed. Example:
 250
 251 ``` Cpp
 252     BENCHMARK(fpOps, n) {
 253       double d = 1;
 254       FOR_EACH_RANGE (i, 1, n) {
 255         d += i;
 256         d -= i;
 257         d *= i;
 258         d /= i;
 259       }
 260       doNotOptimizeAway(d);
 261     }
 262 ```
 263
 264 ### A look under the hood
 265 ***
 266
 267 `folly/Benchmark.h` has a simple, systematic approach to collecting
 268 timings.
 269
 270 First, it organizes measurements in several large epochs, and takes
 271 the minimum over all epochs. Taking the minimum gives the closest
 272 result to the real runtime. Benchmark timings are not a regular random
 273 variable that fluctuates around an average. Instead, the real time
 274 we're looking for is one to which there's a variety of additive noise
 275 (i.e. there is no noise that could actually shorten the benchmark time
 276 below its real value). In theory, taking an infinite amount of samples
 277 and keeping the minimum is the actual time that needs
 278 measuring. That's why the accuracy of benchmarking increases with the
 279 number of epochs.
 280
 281 Clearly, in real functioning there will also be noise and a variety of
 282 effects caused by the running context. But the noise during the
 283 benchmark (straight setup, simple looping) is a poor model for the
 284 noise in the real application. So taking the minimum across several
 285 epochs is the most informative result.
 286
 287 Inside each epoch, the function measured is iterated an increasing
 288 number of times until the total runtime is large enough to make noise
 289 negligible. At that point the time is collected, and the time per
 290 iteration is computed. As mentioned, the minimum time per iteration
 291 over all epochs is the final result.
 292
 293 The timer function used is `clock_gettime` with the `CLOCK_REALTIME`
 294 clock id. Note that you must use a recent Linux kernel (2.6.38 or
 295 newer), otherwise the resolution of `CLOCK_REALTIME` is inadequate.