correctness-model/writeup/evaluation.tex

   1 \mysection{Evaluation}\label{sec:evaluation}
   2
   3 We have implemented \TOOL.  Our evaluation focuses on the following
   4 questions: (1) How expressive is \TOOL for specifying
   5 the correctness properties of real-world concurrent data structures? (2) How easy is it
   6 to use \TOOL? (3) What
   7 is the performance of \TOOL? (4) How effective was \TOOL
   8 in finding bugs?
   9
  10 In order to evaluate \TOOL, we have gathered a contention
  11 free lock, two types of concurrent queues, and a work stealing
  12 deque~\cite{ppoppworkstealing}.  As C/C++11 is relatively new
  13 there are no C/C++11 implementations for many concurrent data
  14 structures, thus we ported several data structures.  The Linux
  15 kernel's reader-writer spinlock and the Michael Scott queue were originally ported
  16 for the \cdschecker benchmark suite.  We also ported an RCU
  17 implementation and Cliff Click's hashtable from its Java
  18 implementation~\cite{clickhashtable}.  We report execution times on an
  19 Intel Core i7 3770.
  20
  21 \mysubsection{Expressiveness}\label{sec:expressiveness}
  22
  23 In this section, we evaluate the expressiveness of \TOOL by reporting
  24 our experiences writing specifications for a range of concurrent data
  25 structures.
  26
  27 \mypara{\bf Lockfree hashtable:} We ported Cliff Click's hashtable,
  28 which supports simultaneous lookups and updates by multiple threads as
  29 well as concurrent table resizing.  The implementation uses an array
  30 of atomic variables to store the key/value slots, and uses
  31 acquire/release synchronization to establish the synchronization
  32 between hashtable accesses.
  33
  34 Hashtable updates consist of two CAS operations --- one to claim
  35 the key slot and one to update the value.  When a \code{put} method
  36 invocation successfully updates both the key and value, the
  37 update is visible to other threads.  Thus, both CAS operations are
  38 ordering points for the \code{put} method, and we annotate both of
  39 them as potential ordering points. The \code{get} method is ordered after
  40 an invocation of the \code{put} only if it sees both the key and
  41 value updates.  Thus we annotate an ordering point for the
  42 key read only if the key is null.  We also
  43 annotate an ordering point for the value read if
  44 it reads the value slot. The test driver has two threads both of
  45 which update and read the value for the same key.
  46
  47 \mypara{\bf RCU:} As discussed in the example, this is a
  48 synchronization mechanism used in the Linux kernel that allows concurrent reads
  49 and updates. We ran this benchmark with four threads, two update the data
  50 structure and two read the data structure.
  51
  52 \mypara{\bf Chase-Lev Deque:} This is a bug-fixed version of a published C11 adaptation
  53 of the Chase-Lev deque~\cite{ppoppworkstealing}.
  54 It maintains a top and bottom index to a shared array of
  55 references. In terms of synchronization,
  56 when pushing an item into the sequential deque,
  57 we attach a unique ID tag to that element. When stealing or taking an item, we
  58 use that tag as the ID of the method call. Thus, we have (push, steal) or
  59 (push, take) pairs that have the same method call ID.
  60 In our test driver, one thread
  61 pushes 3 items and takes 2 items while the other one steals 1
  62 item.
  63
  64 \mypara{\bf Linux Reader-Writer Lock:} A reader-writer lock allows either
  65 multiple concurrent readers or one exclusive writer. We can abstract
  66 it with a boolean \code{writer\_lock} representing
  67 whether the writer lock is held and an integer \code{reader\_cnt}
  68 representing the number of threads that are reading.
  69 We test this benchmark with a single lock that
  70 protects shared variables. We have two threads that read and write
  71 the shared variables under the protection of a read lock and a write
  72 lock.
  73
  74 \mypara{\bf MCS Lock:} This benchmark is an implementation of
  75 the Mellor-Crummey and Scott lock~\cite{mcs-lock,
  76   mcs-lock-url}.  This lock queues waiting threads in a FIFO.
  77 Our test driver utilizes two
  78 threads that read and write shared variables with the protection of
  79 the lock.
  80
  81 \mypara{\bf M\&S Queue:} This benchmark is an adaptation of the
  82 Michael and Scott lock free queue~\cite{lockfreequeue} to the C/C++
  83 memory model.  We ran with two threads, one of which enqueues and the
  84 other of which dequeues an item.
  85
  86 \mypara{\bf SPSC Queue:} This is a lock-free single-producer,
  87 single-consumer queue.
  88 We used a test driver that has two
  89 threads --- one enqueues a value and the other  dequeues
  90 it.
  91
  92 \mypara{\bf MPMC Queue:} This is a multiple-producer, multiple-consumer
  93 queue. Producers call \code{write\_prepare} to obtain a free
  94 slot, update the slot, and call \code{write\_publish} to
  95 publish it. Consumers call \code{read\_fetch} to obtain
  96 a valid slot, read the slot, and call \code{read\_consume}
  97 to free it. The specification focuses on the synchronization
  98 properties which require \code{write\_publish} to synchronize with
  99 \code{read\_fetch} to ensure the data integrity and
 100 \code{read\_consume} to synchronize with \code{write\_prepare} to
 101 ensure that slots are not prematurely recycled.  The test driver
 102 contains two threads, each of which enqueue and dequeue
 103 an item.
 104
 105 \subsection{Ease of Use} In addition to expressiveness, it is also important for
 106 specification languages to be easy to use. In our experience using \TOOL
 107 to specify the real-world data
 108 structures in our benchmark set, we found that \TOOL was easy to use. \TOOL specifications have only three
 109 parts --- equivalent sequential data structures, ordering points, and
 110 synchronization properties, and we explain the reasons as follows. (1) Specifying
 111 sequential data structures is easy and straightforward, and developers can often
 112 just use an off-the-shelf implementation from a library.
 113 (2) When developers specify ordering points, they
 114 only need to know what operations order methods without needing to specify
 115 the subtle reasoning about the corner cases involving
 116 interleavings and reorderings introduced by relaxed memory models. Take the
 117 Michael \& Scott queue as an example, we can easily order enqueuers with the
 118 point when \code{enqueue} loads the \code{tail} pointer right before inserting the new
 119 node.
 120 (3) For synchronization, the fact that we allow specifying
 121 synchronization at the abstraction of methods makes it easy. For example, we only used
 122 36 lines to specify synchronization in a total 1,033 lines of
 123 code (omitting blanks and comments).
 124
 125 \begin{figure}[!htb]
 126 \vspace{-.2cm}
 127 {\footnotesize
 128         \centering
 129
 130         \resizebox{\columnwidth}{!}{
 131         \begin{tabular}{lrrr}
 132                 \hline
 133                 Benchmark & \# Executions & \# Feasible & Total Time (s) \\
 134                 \hline
 135                 Chase-Lev Deque & 1,365 & 232 & 0.15 \\
 136                 SPSC Queue & 19 & 15 & 0.01 \\
 137                 RCU & 1269 & 756 & 0.11 \\
 138                 Lockfree Hashtable & 30,941 & 25,731 & 11.39 \\
 139                 MCS Lock & 19,501 & 13,546 & 2.62 \\
 140                 MPMC Queue & 170,220 & 93,224 & 45.63 \\ % 1r2w
 141                 M\&S Queue & 168 & 114 & 0.05 \\
 142                 Linux RW Lock & 148,053 & 405 & 13.06 \\
 143                 % Results from: 2014-03-25-13:26
 144                 % Flags: -y -m 2 -u 3
 145                 \hline
 146         \end{tabular}
 147         }
 148 \vspace{-.1cm}
 149         \caption{\label{fig:benchmark_results} Benchmark results}
 150 \vspace{-.3cm}
 151 }
 152 \end{figure}
 153
 154 \mysubsection{Performance}\label{sec:performance}
 155
 156 Figure~\ref{fig:benchmark_results} presents performance results for
 157 \TOOL on our benchmark set.  We list the number of
 158 the total executions that \cdschecker has explored, the number of the feasible
 159 executions that we checked the specification for, and the time the benchmark
 160 took to complete.  All of our benchmarks complete within one minute
 161 and most take less than 3 seconds to complete.
 162
 163 \mysubsection{Finding Bugs}\label{sec:findbugs}
 164
 165 \begin{figure}[!htb]
 166 \vspace{-.2cm}
 167 {\footnotesize
 168         \centering
 169
 170         \resizebox{\columnwidth}{!}{
 171         \begin{tabular}{l|ccccc|c}
 172                 \hline
 173                 Benchmark & \# Injection & \# DR & \# UL & \# Correctness & \# Sync & Rate\\
 174                 \hline
 175                 Chase-Lev Deque & 10 & 0 & 2 & 3 & 2 & 70\%\\
 176                 SPSC Queue & 2 & 2 & 0 & 0 & 0 & 100\%\\
 177                 RCU & 3 & 0 & 0 & 1 & 2 & 100\%\\
 178                 Lockfree Hashtable & 5 & 0 & 0 & 0 & 2 & 40\%\\
 179                 MCS Lock & 4 & 0 & 0 & 0 & 4 & 100\%\\
 180                 MPMC Queue & 6 & 0 & 0 & 0 & 2 & 33\%\\
 181                 M\&S Queue & 11 & 0 & 6 & 3 & 0 & 82\%\\
 182                 Linux RW Lock & 8 & 0 & 0 & 0 & 8 & 100\%\\
 183                 \hline
 184                 Total & 49 & 2 & 8 & 7 & 20 & 76\%\\
 185                 \hline
 186
 187                 % Results from: 2014-03-25-17:08
 188                 % Flags: -y -m 2 -u 3
 189         \end{tabular}
 190         }
 191 \vspace{-.2cm}
 192         \caption{\label{fig:injection_results} Bug injection detection results}
 193 \vspace{-.2cm}
 194 }
 195 \end{figure}
 196
 197 The next component of the evaluation examines the effectiveness of \TOOL for finding bugs.
 198
 199 \mypara{\bf New Bugs:} In the M\&S queue benchmark used
 200 in~\cite{cdschecker}, the dequeue interface does not differentiate
 201 between dequeuing the integer zero and returning that no item is
 202 available, and it passed our initial specification.  However, after
 203 modifying the dequeue interface to match that in the original paper,
 204 \TOOL is able to find a new bug that \cdschecker did not find.  The
 205 original test driver for this benchmark performed the enqueues first
 206 to make it easy to write assertions that are valid for all executions.
 207 \TOOL allows specification assertions to capture the behavior of the specific execution
 208 and thus is able to discover the given bug.
 209
 210 \mypara {\bf Injected Bugs:} To further evaluate \TOOL,
 211 we injected bugs in our benchmarks by weakening the ordering
 212 parameter of the atomic operations. These include
 213 changing \code{release}, \code{acquire}, \code{acq\_rel} and
 214 \code{seq\_cst} to \code{relaxed}. We weakened one operation per
 215 each trial, and covered all of the atomic operations that our
 216 tests exercise.  While this injection strategy may not
 217 reproduce all types of errors that developers make, it does
 218 simulate errors that are caused by misunderstanding the
 219 complicated semantics of relaxed memory models.
 220
 221 This fault injection strategy will introduce one of two types of bugs.
 222 The first type is a specification-independent bug, which can be
 223 detected by the underlying \cdschecker infrastructure which includes
 224 internal data races and uninitialized loads. The second type is a
 225 specification-dependent bug, which passes the built-in checks
 226 but violates the \TOOL specification.  These include
 227 failed assertions and synchronization violations.  We
 228 classifying bugs as follows. If \cdschecker reports a data
 229 race or an uninitialized load, \TOOL reports the error and stops.
 230 If not, \TOOL continues to check the
 231 execution against the specification.  It first
 232 checks for violations of the preconditions and postconditions and then for
 233 violations of the synchronization specification.
 234
 235 Figure~\ref{fig:injection_results} shows the results of the injection
 236 detection.  The column \textit{DR} represents data races, \textit{UL} represents
 237 uninitialized loads, \textit{correctness} represents a failed precondition or
 238 postcondition, and
 239 \textit{sync} represents a synchronization violation. The detection rate is the
 240 number of injections for which we detected a bug divided by the total
 241 number of injections.
 242
 243 \mypara{\bf Linux Reader-Writer Lock:} Our initial specification for this
 244 benchmark did not allow \code{write\_trylock} to spuriously fail.
 245 However, when we checked this benchmark against that specification,
 246 \TOOL checker reports a correctness violation. We then analyzed the code
 247 and found that \code{write\_trylock} first subtracts a bias from the
 248 \code{lock} variable to attempt to acquire the lock, and restores that
 249 bias if the attempt to acquire the lock fails.  In the scenario where
 250 two \code{write\_trylock} are racing for the lock before the lock is
 251 released, one \code{write\_trylock} can first decrement the lock
 252 variable, the lock can be released by the original holder, and then
 253 the second \code{write\_trylock} can attempt to acquire the lock.
 254 Even though the history indicates that the lock is unlocked, it
 255 still holds a transient value due to the partially completed first
 256 \code{write\_trylock} invocation.  Thus, the second
 257 \code{write\_trylock} invocation will also fail.  As the second
 258 \code{write\_trylock} serializes after both the first unsuccessful
 259 \code{write\_trylock} and the \code{unlock} operation,
 260 the sequential specification would force it to succeed.  We then modified
 261 the specification
 262 of \code{write\_trylock} to allow spurious failures so that our correctness
 263 model fits this data structure.  This shows
 264 \TOOL can help developers iteratively refine the
 265 specifications of their data structures. By analyzing the \TOOL
 266 diagnostic report, developers can better understand
 267 any inconsistencies between the specification and the
 268 implementation.
 269
 270 \mypara{\bf MCS Lock:} Three of the weakened
 271 operations are not detected because they cause the execution
 272 to fail to terminate (and hit a trace bound).  We
 273 reviewed the code and found that weakening any of those three
 274 operations makes the lock spin forever.
 275
 276 \mypara{\bf M\&S Queue:} Our test driver does not cause
 277 an enqueue or dequeue thread to help another
 278 enqueue thread update the tail pointer, which corresponds to two of the
 279 undetected injections.
 280
 281 \mypara{\bf Lockfree Hashtable:} Our experiment only focuses on the two
 282 primitive methods, \code{get} and \code{put} without triggering the
 283 \code{resize}. We were able to successfully check all executions from a  test driver for the lockfree hash table that generates executions that have no program order preserving sequential histories.
 284
 285 \mypara{\bf MPMC Queue:} The undetected injections in this benchmark are
 286 primarily due to the limitation of our test driver. One
 287 synchronization property of this benchmark is
 288 that \code{read\_consume} should synchronize with the
 289 next \code{write\_prepare} to ensure that a slot cannot be reused
 290 before the consumer has finished with the slot.  Our test driver is
 291 unable to reach this case so those injections are not detected.
 292
 293 From our experiments on concurrent data structures, we can see that
 294 \TOOL checker can help detect incorrect memory orderings,
 295 help developers refine data structure specifications,
 296 and help determine whether strong memory orderings are really
 297 necessary. Since \TOOL checker is a unit testing tool, it is limited
 298 to small-scale tests to explore common usage scenarios of the data
 299 structures. As a unit testing tool, \TOOL was able to find 100\% of
 300 injections for many data structures and to find 76\% of the injections
 301 on average. For our 49 injections, 10 of them were detected by checks
 302 in \cdschecker, and 27 additional injections were detected by
 303 \TOOL. This shows that by writing specifications, we detect
 304 significantly more fault injections.