# folly/io/async: An object-oriented wrapper around libevent
----------------------------------------------------------

[libevent](https://github.com/libevent/libevent) is an excellent
cross-platform eventing library.  Folly's async provides C++ object
wrappers for fd callbacks and event_base, as well as providing
implementations for many common types of fd uses.

## EventBase

The main libevent / epoll loop.  Generally there is a single EventBase
per thread, and once started, nothing else happens on the thread
except fd callbacks.  For example:

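```
EventBase base;
auto thread = std::thread([&](){
  base.loopForever();  // blocks this thread until terminateLoopSoon() is called
});
// ... later, from any other thread:
base.terminateLoopSoon();
thread.join();
```
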
EventBase has built-in support for message passing between threads.
To send a function to be run in the EventBase thread, use
runInEventBaseThread().

```
EventBase base;
auto thread1 = std::thread([&](){
  base.loopForever();
});
// Can be called from another thread; the function is run by thread1.
base.runInEventBaseThread([&](){
  printf("This will be printed in thread1\n");
});
```

There are various ways to run the loop.  EventBase::loop() will return
when there are no more registered events.  EventBase::loopForever()
will loop until EventBase::terminateLoopSoon() is called.
EventBase::loopOnce() will only call epoll() a single time.

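To make the difference between these loop modes and the cross-thread
hand-off concrete, here is a minimal std-only sketch.  `MiniLoop` and
`runInLoopThread()` are hypothetical names, not folly's EventBase; the
sketch uses a mutex and condition variable where the real loop uses
libevent, and it chooses to drain pending work before stopping.

```cpp
#include <cassert>
#include <condition_variable>
#include <deque>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical stand-in for an event loop; not folly's EventBase.
class MiniLoop {
 public:
  // Thread-safe: queue fn to run on whichever thread is running the loop.
  void runInLoopThread(std::function<void()> fn) {
    assert(fn);
    {
      std::lock_guard<std::mutex> guard(mutex_);
      queue_.push_back(std::move(fn));
    }
    wakeup_.notify_one();
  }

  // Analogous to loop(): run queued work and return once none is left.
  void loop() {
    std::unique_lock<std::mutex> lock(mutex_);
    while (!queue_.empty()) {
      runOne(lock);
    }
  }

  // Analogous to loopForever(): sleep when idle and return only after
  // terminateLoopSoon().  This sketch drains pending work first.
  void loopForever() {
    std::unique_lock<std::mutex> lock(mutex_);
    for (;;) {
      wakeup_.wait(lock, [this] { return stop_ || !queue_.empty(); });
      if (queue_.empty()) {
        return;  // woken by terminateLoopSoon() with nothing left to run
      }
      runOne(lock);
    }
  }

  void terminateLoopSoon() {
    {
      std::lock_guard<std::mutex> guard(mutex_);
      stop_ = true;
    }
    wakeup_.notify_one();
  }

 private:
  // Pops one function and runs it with the lock released.
  void runOne(std::unique_lock<std::mutex>& lock) {
    std::function<void()> fn = std::move(queue_.front());
    queue_.pop_front();
    lock.unlock();
    fn();
    lock.lock();
  }

  std::mutex mutex_;
  std::condition_variable wakeup_;
  std::deque<std::function<void()>> queue_;
  bool stop_ = false;
};
```
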
Other useful methods include EventBase::runAfterDelay() to run a
callback after some delay, and EventBase::setMaxLatency(latency, callback) to
run some callback if the loop is running very slowly, i.e., there are
too many events in this loop, and some code should probably be running
in different threads.

EventBase always calls all callbacks inline - that is, there is no
explicit or implicit queuing.  The specific implications of this are:

* Tail-latency times (P99) are vastly better than with any queueing
  implementation
* The EventHandler implementation is responsible for not taking too
  long in any individual callback.  All of the EventHandlers in this
  implementation already do a good job of this, but if you are
  subclassing EventHandler directly, this is something to keep in mind.
* The callback cannot delete the EventBase or EventHandler directly,
  since it is still on the call stack.  See DelayedDestruction class
  description below, and use shared_ptrs appropriately.

## EventHandler

EventHandler is the object wrapper for fd's.  Any class you wish to
receive callbacks on will inherit from
EventHandler. `registerHandler(EventType)` will register to receive
events of a specific type.

Currently supported event types:

* READ - read and EOF events
* WRITE - write events, when the kernel write buffer has room
* READ_WRITE - both
* PERSIST - The event will remain registered even after the handlerReady() fires

Unsupported libevent event types, and why:

* TIMEOUT - this library has specific timeout support, instead of
  being attached to read/write fds.
* SIGNAL - similarly, signals are handled separately, see
  AsyncSignalHandler (TODO: currently in fbthrift)
* EV_ET - Currently all the implementations of EventHandler are set up
  for level triggered.  Benchmarking hasn't shown that edge triggered
  provides much improvement.

  Edge-triggered in this context means that libevent will provide only
  a single callback when an event becomes active, as opposed to
  level-triggered where as long as there is still data to read/write,
  the event will continually fire each time event_wait is called.
  Edge-triggered adds extra code complexity, since the library would
  need to maintain a list of active FDs between edge-triggering
  events, similar to the one libevent currently maintains.  The only
  advantage of edge-triggered is that you can use EPOLLONESHOT to
  ensure the event only gets called on a single event_base - but in
  this library, we assume each event is only registered on a single
  thread anyway.

* EV_FINALIZE - EventBase can only be used in a single thread,
  excepting a few methods.  To safely unregister an event from a
  different thread, it would have to be done through
  EventBase::runInEventBaseThread().  Most APIs already make this
  thread transition for you, or at least CHECK() that you've done it
  in the correct thread.
* EV_CLOSED - This is an optimization - instead of having to READ all
  the data and then get an EOF, EV_CLOSED would fire before all the
  data is read.  TODO: implement this.  Probably only useful in
  request/response servers.

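The level-triggered vs. edge-triggered distinction described under
EV_ET can be observed directly with epoll.  The following is a
Linux-only sketch that bypasses libevent and folly entirely;
`countReadableReports()` is a hypothetical helper written for this
illustration.  It writes one byte into a pipe, never reads it, and
counts how often epoll reports the fd as readable.  With level
triggering every poll reports it; with EPOLLET only the first one does.

```cpp
#include <sys/epoll.h>
#include <unistd.h>

#include <cassert>
#include <cstdint>

// Counts how many of `rounds` consecutive epoll_wait() calls report the
// read end of a pipe as readable while one unread byte sits in it.
// Returns -1 if any setup step fails.
int countReadableReports(bool edgeTriggered, int rounds) {
  assert(rounds > 0);
  int fds[2];
  if (pipe(fds) != 0) {
    return -1;
  }
  int reports = -1;
  const int epfd = epoll_create1(0);
  if (epfd >= 0) {
    std::uint32_t events = EPOLLIN;
    if (edgeTriggered) {
      events |= EPOLLET;
    }
    epoll_event ev{};
    ev.events = events;
    ev.data.fd = fds[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fds[0], &ev) == 0 &&
        write(fds[1], "x", 1) == 1) {
      reports = 0;
      for (int i = 0; i < rounds; ++i) {
        epoll_event ready{};
        // Timeout 0: poll without blocking.
        if (epoll_wait(epfd, &ready, 1, 0) == 1) {
          ++reports;
        }
      }
    }
    close(epfd);
  }
  close(fds[0]);
  close(fds[1]);
  return reports;
}
```
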
## Implementations of EventHandler

### AsyncSocket

A nonblocking socket implementation.  Writes are queued and written
asynchronously, even before connect() is successful.  The read API
consists of two methods: getReadBuffer() and readDataAvailable().
When the READ event is signaled, libevent has no way of knowing how
much data is available to read.  In some systems (Linux), we *could*
make another syscall to get the data size in the kernel read buffer,
but syscalls are slow.  Instead, most users will just want to provide
a fixed size buffer in getReadBuffer(), probably using the IOBufQueue
in folly/io.  readDataAvailable() will then describe exactly how much
data was read.

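The two-step read contract can be sketched without a real socket.
Everything below is a hypothetical stand-in: the `ReadCallback`
interface only mirrors the two method names above, `Collector` is an
example implementation with a fixed-size buffer, and `handleRead()`
plays the role of the READ handler, with a string standing in for the
kernel read buffer.  The `maxReads` cap illustrates the idea behind
setMaxReadsPerEvent().

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical interface mirroring the two-method read contract.
class ReadCallback {
 public:
  virtual ~ReadCallback() = default;
  // The callee says where to read into and how much room there is.
  virtual void getReadBuffer(void** buf, std::size_t* len) = 0;
  // The caller reports how many bytes were actually placed there.
  virtual void readDataAvailable(std::size_t len) = 0;
};

// Example callback: a fixed 8-byte scratch buffer, appended to a string.
class Collector : public ReadCallback {
 public:
  void getReadBuffer(void** buf, std::size_t* len) override {
    *buf = scratch_;
    *len = sizeof(scratch_);
  }
  void readDataAvailable(std::size_t len) override {
    data_.append(scratch_, len);
  }
  const std::string& data() const { return data_; }

 private:
  char scratch_[8];
  std::string data_;
};

// Stand-in for the socket's READ handler.  `pending` plays the kernel
// buffer; at most `maxReads` reads are done per event so one busy
// connection cannot monopolize the loop.  Returns bytes delivered.
std::size_t handleRead(std::string& pending, ReadCallback& cb, int maxReads) {
  assert(maxReads > 0);
  std::size_t total = 0;
  for (int i = 0; i < maxReads && !pending.empty(); ++i) {
    void* buf = nullptr;
    std::size_t len = 0;
    cb.getReadBuffer(&buf, &len);
    if (buf == nullptr || len == 0) {
      break;
    }
    const std::size_t n = std::min(len, pending.size());
    std::memcpy(buf, pending.data(), n);  // a real socket would recv() here
    pending.erase(0, n);
    cb.readDataAvailable(n);
    total += n;
  }
  return total;
}
```
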
AsyncSocket provides send timeouts, but not read timeouts - generally
read timeouts are application specific, and should use one of the
AsyncTimeout implementations described below.

Various notes:

* Using a chain of IOBuf objects, and calling writeChain(), is a very
  syscall-efficient way to add/modify data to be sent, without
  unnecessary copies.
* setMaxReadsPerEvent() - this prevents an AsyncSocket from blocking
  the event loop for too long.
* Don't use the fd for syscalls yourself while it is being used in
  AsyncSocket, instead use the provided wrappers, like
  AsyncSocket::close(), shutdown(), etc.

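The syscall efficiency of writing a buffer chain comes from gather
I/O: several separate buffers are handed to the kernel in one call,
with no copy into a contiguous staging buffer.  The sketch below shows
that underlying idea with plain POSIX writev() on a pipe.  It is not
IOBuf or writeChain(), and that AsyncSocket writes chains this way
internally is an assumption here; `gatherWriteDemo()` is a
hypothetical helper.

```cpp
#include <sys/uio.h>
#include <unistd.h>

#include <cassert>
#include <string>

// Sends `header` and `body` through a pipe with a single writev() call
// and returns what arrives at the read end ("" on any failure).
std::string gatherWriteDemo(const std::string& header, const std::string& body) {
  // Nobody reads concurrently, so stay well inside the pipe's capacity.
  assert(header.size() + body.size() <= 4096);
  int fds[2];
  if (pipe(fds) != 0) {
    return "";
  }
  // One iovec entry per buffer; the data itself is never copied together.
  iovec iov[2];
  iov[0].iov_base = const_cast<char*>(header.data());
  iov[0].iov_len = header.size();
  iov[1].iov_base = const_cast<char*>(body.data());
  iov[1].iov_len = body.size();
  const ssize_t written = writev(fds[1], iov, 2);
  close(fds[1]);  // lets the read loop below see EOF

  std::string received;
  char buf[256];
  ssize_t n = 0;
  while ((n = read(fds[0], buf, sizeof(buf))) > 0) {
    received.append(buf, static_cast<std::size_t>(n));
  }
  close(fds[0]);
  const ssize_t expected = static_cast<ssize_t>(header.size() + body.size());
  return written == expected ? received : "";
}
```
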
#### AsyncSSLSocket

Similar to AsyncSocket, but uses openssl.  Provides an additional
HandshakeCallback to check the server's certificates.

#### TAsyncUDPSocket

TODO: Currently in fbthrift.

A socket that reads/writes UDP packets.  Since there is little state
to maintain, this is much simpler than AsyncSocket.

### AsyncServerSocket

A listen()ing socket that accept()s fds, and passes them to other
event bases.

The general pattern is:

```
EventBase base;
auto socket = AsyncServerSocket::newSocket(&base);
socket->bind(port); // 0 to choose any free port
// object implements the accept callback, and base is the object's EventBase.
// base.runInEventBaseThread() will be called to send it a message.
socket->addAcceptCallback(object, &base);
socket->listen(backlog);
socket->startAccepting();
```

Generally there is a single accept() thread, and multiple
AcceptCallback objects.  The AcceptCallback objects then manage the
individual AsyncSockets.  While AsyncSockets *can* be moved between
event bases, most users just tie them to a single event base to get
better cache locality, and to avoid locking.

Multiple AsyncServerSockets can be made, but currently the Linux kernel has
a lock on accept()ing from a port, preventing more than ~20k accepts /
sec.  There are various workarounds (SO_REUSEPORT), but generally
clients should be using connection pooling instead when possible.

#### AsyncSSLServerSocket

Similar to AsyncServerSocket, but provides callbacks for SSL
handshaking.

#### TAsyncUDPServerSocket

Similar to AsyncServerSocket, but for UDP messages - messages are
read() on a single thread, and then fanned out to multiple worker
threads.

### NotificationQueue (EventFD or pipe notifications)

NotificationQueue is used to send messages between threads in the
*same process*.  It is what backs EventBase::runInEventBaseThread(),
so it is unlikely you'd want to use it directly instead of using
runInEventBaseThread().

An eventfd (for kernels > 2.6.30) or pipe (older kernels) is added to
the EventBase loop to wake up threads receiving messages.  The queue
itself is a spinlock-guarded list.  Since we are almost always
talking about a single sender thread and a single receiver (although
the code works just fine for multiple producers and multiple
consumers), the spinlock is almost always uncontended, and we haven't
seen any perf issues with it in practice.

The eventfd or pipe is only notified if the thread isn't already
awake, to avoid syscalls.  A naive implementation that does one write
per message in the queue, or worse, writes the whole message to the
queue, would be significantly slower.

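The "only notify if the consumer isn't already awake" idea can be
sketched as follows.  `NotifyQueue` is a hypothetical class, not
folly's NotificationQueue: it uses a std::mutex instead of a spinlock,
and it counts wakeups where a real implementation would write to the
eventfd or pipe.  Three messages queued back to back cost a single
notification.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch of a wakeup-coalescing message queue.
class NotifyQueue {
 public:
  // Producer side: queue a message and notify only if no notification
  // is already outstanding.
  void put(std::string message) {
    std::lock_guard<std::mutex> guard(mutex_);
    messages_.push_back(std::move(message));
    if (!signaled_) {
      signaled_ = true;
      ++wakeups_;  // a real queue would write() to the eventfd or pipe here
    }
  }

  // Consumer side, called after being woken: takes everything queued and
  // clears the outstanding notification.
  std::vector<std::string> drain() {
    std::lock_guard<std::mutex> guard(mutex_);
    // Invariant: messages can only be pending while a wakeup is outstanding.
    assert(signaled_ || messages_.empty());
    signaled_ = false;
    std::vector<std::string> out;
    out.swap(messages_);
    return out;
  }

  // Number of notifications that would have been sent so far.
  std::size_t wakeups() const {
    std::lock_guard<std::mutex> guard(mutex_);
    return wakeups_;
  }

 private:
  mutable std::mutex mutex_;
  std::vector<std::string> messages_;
  bool signaled_ = false;
  std::size_t wakeups_ = 0;
};
```
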
If you need to send messages *between processes*, you would have to
write the whole message to the pipe, and manage the pipe size.  See
AsyncPipe.

### AsyncTimeout

An individual timeout callback that can be installed in the event
loop.  For code cleanliness and clarity, timeouts are separated from
sockets.  There is one fd used per AsyncTimeout.  This is a pretty
serious restriction, so the two below subclasses were made to support
multiple timeouts using a single fd.

#### HHWheelTimer

Implementation of a [hashed hierarchical wheel
timer](http://www.cs.columbia.edu/~nahum/w6998/papers/sosp87-timing-wheels.pdf).
Any timeout time can be used, with O(1) insertion, deletion, and
callback time.  The wheel itself takes up some amount of space, and
wheel timers have to have a constant tick, consuming a constant amount
of CPU.

An alternative to a wheel timer would be a heap of callbacks sorted by
timeout time, but that would change the big-O to O(log n).  In our
experience, the average server has thousands to hundreds of thousands
of open sockets, and the common case is to add and remove timeouts
without them ever firing, assuming the server is able to keep up with
the load.  Therefore O(log n) insertion time overshadows the extra CPU
consumed by a wheel timer tick.

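To show where the O(1) insertion and deletion come from, here is a
minimal single-level hashed timing wheel.  It is a simplified sketch,
not HHWheelTimer: there is no hierarchy, time is counted in abstract
ticks, and `TimingWheel`, `Entry` and `Handle` are hypothetical names.
Each slot is a linked list; a timeout longer than one rotation stores
how many more rotations it has to wait.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <list>
#include <utility>
#include <vector>

// Hypothetical single-level hashed timing wheel.
class TimingWheel {
 public:
  struct Entry {
    std::size_t rounds;  // full wheel rotations left before firing
    std::function<void()> callback;
  };
  struct Handle {
    std::size_t slot;
    std::list<Entry>::iterator position;
  };

  explicit TimingWheel(std::size_t slotCount) : slots_(slotCount) {
    assert(slotCount > 0);
  }

  // O(1): fire `callback` on the `ticks`-th call to tick() from now.
  Handle schedule(std::size_t ticks, std::function<void()> callback) {
    assert(ticks > 0);
    const std::size_t slot = (cursor_ + ticks) % slots_.size();
    const std::size_t rounds = (ticks - 1) / slots_.size();
    slots_[slot].push_back(Entry{rounds, std::move(callback)});
    return Handle{slot, std::prev(slots_[slot].end())};
  }

  // O(1): the handle must refer to a timeout that has not fired yet.
  void cancel(const Handle& handle) {
    slots_[handle.slot].erase(handle.position);
  }

  // Advances one tick and fires whatever is due in the slot under the
  // cursor.  Due entries are moved out first, so callbacks may schedule
  // new timeouts safely.
  void tick() {
    cursor_ = (cursor_ + 1) % slots_.size();
    std::list<Entry>& bucket = slots_[cursor_];
    std::list<Entry> due;
    for (auto it = bucket.begin(); it != bucket.end();) {
      auto next = std::next(it);
      if (it->rounds == 0) {
        due.splice(due.end(), bucket, it);
      } else {
        --it->rounds;
      }
      it = next;
    }
    for (Entry& entry : due) {
      entry.callback();
    }
  }

 private:
  std::vector<std::list<Entry>> slots_;
  std::size_t cursor_ = 0;
};
```
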
#### TAsyncTimeoutSet

NOTE: currently in proxygen codebase.

If we assume that all timeouts scheduled use the same timeout time, we
can keep O(1) insertion time: just schedule the new timeout at the
tail of the list, along with the time it was actually added.  When the
current timeout fires, we look at the new head of the list, and
schedule AsyncTimeout to fire at the difference between the current
time and the scheduled time (which probably isn't the same as the
timeout time.)

This requires all timeouts in an AsyncTimeoutSet to have the same timeout
time though, which in practice means many AsyncTimeoutSets are needed
per application.  Using HHWheelTimer instead can clean up the code quite
a bit, because only a single HHWheelTimer is needed per thread, as
opposed to one AsyncTimeoutSet per timeout time per thread.

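The same-interval trick can be sketched in a few lines.  `TimeoutSet`
is a hypothetical class, not the proxygen implementation, and it takes
the current time as a plain millisecond count so the behavior is easy
to follow.  Because every entry has the same interval, appending at the
tail keeps the list sorted by expiry, and the head always tells how
long the single underlying timer should sleep next.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical sketch of a set of timeouts sharing one interval.
class TimeoutSet {
 public:
  explicit TimeoutSet(long intervalMs) : intervalMs_(intervalMs) {
    assert(intervalMs > 0);
  }

  // O(1): later calls always expire later, so the tail is the right place.
  void schedule(long nowMs, std::function<void()> callback) {
    pending_.push_back(Pending{nowMs + intervalMs_, std::move(callback)});
  }

  // Fires everything that has expired by `nowMs`.  Returns how long the
  // one underlying timer should wait for the new head, or -1 if empty.
  long runExpired(long nowMs) {
    while (!pending_.empty() && pending_.front().expiryMs <= nowMs) {
      std::function<void()> callback = std::move(pending_.front().callback);
      pending_.pop_front();
      callback();
    }
    return pending_.empty() ? -1 : pending_.front().expiryMs - nowMs;
  }

 private:
  struct Pending {
    long expiryMs;
    std::function<void()> callback;
  };
  long intervalMs_;
  std::deque<Pending> pending_;
};
```
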
### TAsyncSignalHandler

TODO: still in fbthrift

Used to handle AsyncSignals.  Similar to AsyncTimeout, for code
clarity, we don't reuse the same fd as a socket to receive signals.

### AsyncPipe

TODO: not currently open source

Async reads/writes to a unix pipe, to send data between processes.
Why don't you just use AsyncSocket for now?

## Helper Classes

### RequestContext (in Request.h)

Since messages are frequently passed between threads with
runInEventBaseThread(), ThreadLocals don't work for messages.
Instead, RequestContext can be used, which is saved/restored between
threads.  Major uses for this include:

* NUMA: saving the numa node the code was running on, and explicitly
  running it on the same node in other threadpools / eventbases
* Tracing: tracing requests dapper-style intra machine, as well as
  between threads themselves.

In this library only runInEventBaseThread() saves/restores the request
context, although other Facebook libraries that pass requests between
threads do also: folly::wangle::future, and fbthrift::ThreadManager, etc.

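The save/restore mechanism can be illustrated with a thread-local
pointer that is captured when work is handed off and reinstated on the
thread that runs it.  This is a std-only sketch of the idea, not
folly's RequestContext; `Context`, `currentContext()` and
`withCallerContext()` are hypothetical names.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <thread>

// Hypothetical per-request data that should follow the request around.
struct Context {
  std::string requestId;
};

// Each thread has its own "current" context.
std::shared_ptr<Context>& currentContext() {
  thread_local std::shared_ptr<Context> context;
  return context;
}

// Captures the caller's context now and returns a wrapper that installs
// it around fn on whatever thread eventually runs the wrapper, then puts
// back that thread's previous context.
std::function<void()> withCallerContext(std::function<void()> fn) {
  assert(fn);
  std::shared_ptr<Context> captured = currentContext();
  return [captured, fn] {
    std::shared_ptr<Context> previous = currentContext();
    currentContext() = captured;
    fn();
    currentContext() = previous;
  };
}
```
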
### DelayedDestruction

Since EventBase callbacks already have the EventHandler and EventBase
on the stack, calling `delete` on either of these objects would most
likely result in a segfault.  Instead, these objects inherit from
DelayedDestruction, which provides reference counting in the
callbacks.  Instead of delete, `destroy()` is called, which signals
that the object is ready to be destroyed.  In each of the callbacks there is a
DestructorGuard, which prevents destruction until all the Guards are
gone from the stack, at which point the actual delete is performed.

DelayedDestruction can be painful to use, since shared_ptrs and
unique_ptrs need to have a special DelayedDestruction destructor
type.  It's also pretty easy to forget to add a DestructorGuard in
code that calls callbacks.  But it is well worth it to avoid queuing
callbacks, and the improved P99 times as a result.

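The guard mechanism can be sketched as a counter plus a pending flag.
This is a simplified, single-threaded illustration, not folly's
DelayedDestruction; `DelayedDestroy`, `Guard` and `Connection` are
hypothetical names.  `destroy()` deletes immediately only when no guard
is on the stack; otherwise the last guard to go out of scope performs
the delete.

```cpp
#include <cassert>

// Hypothetical base class: deletion is deferred while guards exist.
class DelayedDestroy {
 public:
  // RAII guard placed on the stack by code that is about to run callbacks.
  class Guard {
   public:
    explicit Guard(DelayedDestroy* object) : object_(object) {
      assert(object_ != nullptr);
      ++object_->guards_;
    }
    ~Guard() {
      if (--object_->guards_ == 0 && object_->destroyPending_) {
        delete object_;
      }
    }
    Guard(const Guard&) = delete;
    Guard& operator=(const Guard&) = delete;

   private:
    DelayedDestroy* object_;
  };

  // Used instead of `delete`: marks the object as ready to be destroyed.
  void destroy() {
    destroyPending_ = true;
    if (guards_ == 0) {
      delete this;
    }
  }

 protected:
  // Protected so that callers cannot `delete` the object directly.
  virtual ~DelayedDestroy() = default;

 private:
  int guards_ = 0;
  bool destroyPending_ = false;
};

// Example object whose callback decides to tear itself down.
class Connection : public DelayedDestroy {
 public:
  explicit Connection(bool* destroyed) : destroyed_(destroyed) {}

  // Returns true if the object was still alive after destroy() was
  // called from inside the callback.
  bool onEvent() {
    Guard guard(this);
    destroy();            // deferred: `guard` is still on the stack
    return !*destroyed_;  // members are still safe to touch here
  }

 protected:
  ~Connection() override { *destroyed_ = true; }

 private:
  bool* destroyed_;
};
```
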
### EventBaseManager

DANGEROUS.

Since there is usually only a single EventBase per thread, why not
make EventBase managed by a threadlocal?  Sounds easy!  But there are
several catches:

* The EventBase returned by `EventBaseManager::get()->getEventBase()`
  may not actually be running.
* There may be more than one event base in the thread (unusual), or
  the EventBase in the code may not be registered in EventBaseManager.
* The event bases in EventBaseManager may be used for different
  purposes, i.e. some are AsyncSocket threads, and some are
  AsyncServerSocket threads:  So you can't just grab the list of
  EventBases and call runInEventBaseThread() on all of them and expect
  it to do the right thing.

A much safer option is to explicitly pass around an EventBase, or use
an explicit pool of EventBases.

### SSLContext

SSL helper routines to load / verify certs.  Used with
AsyncSSL[Server]Socket.

## Generic Multithreading Advice

Facebook has a lot of experience running services.  For background
reading, see [The C10k problem](http://www.kegel.com/c10k.html) and
[Fast UNIX
servers](http://nick-black.com/dankwiki/index.php/Fast_UNIX_Servers)

Some best practices we've found:

1. It's much easier to maintain latency expectations when each
   EventBase thread is used for only a single purpose:
   AsyncServerSocket, or inbound AsyncSocket, or in proxies, outbound
   AsyncSocket calls.  In a perfect world, one EventBase per thread
   per core would be enough, but the implementor needs to be extremely
   diligent to make sure all CPU work is moved off of the IO threads to
   prevent slow read/write/closes of fds.
2. **ANY** work that is CPU intensive should be offloaded to a pool of
   CPU-bound threads, instead of being done in the EventBase threads.
   runInEventBaseThread() is fast:  It can be called millions of times
   per second before the spinlock becomes an issue - so passing the
   request off to a different thread is probably fine perf-wise.
3. In contrast to the first two recommendations, if there are more
   total threads than cores, context switching overhead can become an
   issue.  In particular, using the Linux `perf sched` tool, we have
   seen this be an issue when a CPU-intensive thread blocks the
   scheduling of an IO thread.
4. For async programming, in contrast to synchronous systems, managing
   load is extremely hard - it is better to use out-of-band methods to
   notify of overload, such as timeouts, or CPU usage.  For sync
   systems, you are almost always limited by the number of threads.
   For more details see [No Time for
   Asynchrony](https://www.usenix.org/legacy/event/hotos09/tech/full_papers/aguilera/aguilera.pdf)
 359    For more details see [No Time for
 360    Asynchrony](https://www.usenix.org/legacy/event/hotos09/tech/full_papers/aguilera/aguilera.pdf)