From: Nathan Bronson Date: Fri, 17 Jan 2014 18:25:41 +0000 (-0800) Subject: Baton - minimalist inter-thread notification X-Git-Tag: v0.22.0~730 X-Git-Url: http://plrg.eecs.uci.edu/git/?p=folly.git;a=commitdiff_plain;h=9f040aa0d4023a64e08cda5b76b37a15e48ff732 Baton - minimalist inter-thread notification Summary: A Baton allows a thread to block once and be awoken: it captures a single handoff. During its lifecycle (from construction/reset to destruction/reset) a baton must either be post()ed and wait()ed exactly once each, or not at all. Batons may be reused after a call to recycle(). Baton includes no internal padding, and is only 4 bytes in size. Any alignment or padding to avoid false sharing is up to the user. This is basically a stripped-down semaphore that supports only a single call to sem_post. The current posix semaphore sem_t isn't too bad, but this provides more a bit more speed, checking, inlining, smaller size, a guarantee that the implementation won't change, and compatibility with DeterministicSchedule. Baton is directly implemented on top of futex, and takes care to avoid system calls. Test Plan: 1. new unit tests 2. this code has already been in production use in TAO for many months Reviewed By: davejwatson@fb.com FB internal diff: D1130407 --- diff --git a/folly/Baton.h b/folly/Baton.h new file mode 100644 index 00000000..4d987a99 --- /dev/null +++ b/folly/Baton.h @@ -0,0 +1,199 @@ +/* + * Copyright 2014 Facebook, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#ifndef FOLLY_BATON_H +#define FOLLY_BATON_H + +#include +#include +#include +#include +#include + +#include + +namespace folly { + +/// A Baton allows a thread to block once and be awoken: it captures +/// a single handoff. During its lifecycle (from construction/reset to +/// destruction/reset) a baton must either be post()ed and wait()ed exactly +/// once each, or not at all. +/// +/// Baton includes no internal padding, and is only 4 bytes in size. +/// Any alignment or padding to avoid false sharing is up to the user. +/// +/// This is basically a stripped-down semaphore that supports only a +/// single call to sem_post and a single call to sem_wait. The current +/// posix semaphore sem_t isn't too bad, but this provides more a bit more +/// speed, inlining, smaller size, a guarantee that the implementation +/// won't change, and compatibility with DeterministicSchedule. By having +/// a much more restrictive lifecycle we can also add a bunch of assertions +/// that can help to catch race conditions ahead of time. +template class Atom = std::atomic> +struct Baton : boost::noncopyable { + Baton() : state_(INIT) {} + + /// It is an error to destroy a Baton on which a thread is currently + /// wait()ing. In practice this means that the waiter usually takes + /// responsibility for destroying the Baton. + ~Baton() { + // The docblock for this function says that is can't be called when + // there is a concurrent waiter. We assume a strong version of this + // requirement in which the caller must _know_ that this is true, they + // are not allowed to be merely lucky. If two threads are involved, + // the destroying thread must actually have synchronized with the + // waiting thread after wait() returned. To convey causality the the + // waiting thread must have used release semantics and the destroying + // thread must have used acquire semantics for that communication, + // so we are guaranteed to see the post-wait() value of state_, + // which cannot be WAITING. + // + // Note that since we only care about a single memory location, + // the only two plausible memory orders here are relaxed and seq_cst. + assert(state_.load(std::memory_order_relaxed) != WAITING); + } + + /// Equivalent to destroying the Baton and creating a new one. It is + /// a bug to call this while there is a waiting thread, so in practice + /// the waiter will be the one that resets the baton. + void reset() { + // See ~Baton for a discussion about why relaxed is okay here + assert(state_.load(std::memory_order_relaxed) != WAITING); + + // We use a similar argument to justify the use of a relaxed store + // here. Since both wait() and post() are required to be called + // only once per lifetime, no thread can actually call those methods + // correctly after a reset() unless it synchronizes with the thread + // that performed the reset(). If a post() or wait() on another thread + // didn't synchronize, then regardless of what operation we performed + // here there would be a race on proper use of the Baton's spec + // (although not on any particular load and store). Put another way, + // we don't need to synchronize here because anybody that might rely + // on such synchronization is required by the baton rules to perform + // an additional synchronization that has the desired effect anyway. + // + // There is actually a similar argument to be made about the + // constructor, in which the fenceless constructor initialization + // of state_ is piggybacked on whatever synchronization mechanism + // distributes knowledge of the Baton's existence + state_.store(INIT, std::memory_order_relaxed); + } + + /// Causes wait() to wake up. For each lifetime of a Baton (where a + /// lifetime starts at construction or reset() and ends at destruction + /// or reset()) there can be at most one call to post(). Any thread + /// may call post(). + /// + /// Although we could implement a more generic semaphore semantics + /// without any extra size or CPU overhead, the single-call limitation + /// allows us to have better assert-ions during debug builds. + void post() { + uint32_t before = state_.load(std::memory_order_acquire); + assert(before == INIT || before == WAITING); + if (before != INIT || + !state_.compare_exchange_strong(before, EARLY_DELIVERY)) { + // we didn't get to state_ before wait(), so we need to call futex() + assert(before == WAITING); + + state_.store(LATE_DELIVERY, std::memory_order_release); + state_.futexWake(1); + } + } + + /// Waits until post() has been called in the current Baton lifetime. + /// May be called at most once during a Baton lifetime (construction + /// |reset until destruction|reset). If post is called before wait in + /// the current lifetime then this method returns immediately. + /// + /// The restriction that there can be at most one wait() per lifetime + /// could be relaxed somewhat without any perf or size regressions, + /// but by making this condition very restrictive we can provide better + /// checking in debug builds. + void wait() { + uint32_t before; + + static_assert(PreBlockAttempts > 0, + "isn't this assert clearer than an uninitialized variable warning?"); + for (int i = 0; i < PreBlockAttempts; ++i) { + before = state_.load(std::memory_order_acquire); + if (before == EARLY_DELIVERY) { + // hooray! + return; + } + assert(before == INIT); +#ifdef __x86_64__ + // The pause instruction is the polite way to spin, but it doesn't + // actually affect correctness to omit it if we don't have it. + // Pausing donates the full capabilities of the current core to + // its other hyperthreads for a dozen cycles or so + asm volatile ("pause"); +#endif + } + + // guess we have to block :( + if (!state_.compare_exchange_strong(before, WAITING)) { + // CAS failed, last minute reprieve + assert(before == EARLY_DELIVERY); + return; + } + + while (true) { + state_.futexWait(WAITING); + + // state_ is the truth even if FUTEX_WAIT reported a matching + // FUTEX_WAKE, since we aren't using type-stable storage and + // we don't guarantee reuse + uint32_t s = state_.load(std::memory_order_acquire); + assert(s == WAITING || s == LATE_DELIVERY); + + if (s == LATE_DELIVERY) { + return; + } + // retry + } + } + + private: + enum State : uint32_t { + INIT = 0, + EARLY_DELIVERY = 1, + WAITING = 2, + LATE_DELIVERY = 3, + }; + + enum { + // Must be positive. If multiple threads are actively using a + // higher-level data structure that uses batons internally, it is + // likely that the post() and wait() calls happen almost at the same + // time. In this state, we lose big 50% of the time if the wait goes + // to sleep immediately. On circa-2013 devbox hardware it costs about + // 7 usec to FUTEX_WAIT and then be awoken (half the t/iter as the + // posix_sem_pingpong test in BatonTests). We can improve our chances + // of EARLY_DELIVERY by spinning for a bit, although we have to balance + // this against the loss if we end up sleeping any way. Spins on this + // hw take about 7 nanos (all but 0.5 nanos is the pause instruction). + // We give ourself 300 spins, which is about 2 usec of waiting. As a + // partial consolation, since we are using the pause instruction we + // are giving a speed boost to the colocated hyperthread. + PreBlockAttempts = 300, + }; + + detail::Futex state_; +}; + +} // namespace folly + +#endif diff --git a/folly/Makefile.am b/folly/Makefile.am index ead0045f..4aed816a 100644 --- a/folly/Makefile.am +++ b/folly/Makefile.am @@ -26,6 +26,7 @@ nobase_follyinclude_HEADERS = \ AtomicHashMap.h \ AtomicHashMap-inl.h \ AtomicStruct.h \ + Baton.h \ Benchmark.h \ Bits.h \ Chrono.h \ diff --git a/folly/test/BatonTest.cpp b/folly/test/BatonTest.cpp new file mode 100644 index 00000000..e2c5cf13 --- /dev/null +++ b/folly/test/BatonTest.cpp @@ -0,0 +1,100 @@ +/* + * Copyright 2014 Facebook, Inc. + * + * Licensed under the Apache License, Version 2.0 (the "License"); + * you may not use this file except in compliance with the License. + * You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +#include +#include +#include +#include +#include +#include +#include + +using namespace folly; +using namespace folly::test; + +typedef DeterministicSchedule DSched; + +TEST(Baton, basic) { + Baton<> b; + b.post(); + b.wait(); +} + +template class Atom> +void run_pingpong_test(int numRounds) { + Baton batons[17]; + Baton& a = batons[0]; + Baton& b = batons[16]; // to get it on a different cache line + auto thr = DSched::thread([&]{ + for (int i = 0; i < numRounds; ++i) { + a.wait(); + a.reset(); + b.post(); + } + }); + for (int i = 0; i < numRounds; ++i) { + a.post(); + b.wait(); + b.reset(); + } + DSched::join(thr); +} + +TEST(Baton, pingpong) { + DSched sched(DSched::uniform(0)); + + run_pingpong_test(1000); +} + +BENCHMARK(baton_pingpong, iters) { + run_pingpong_test(iters); +} + +BENCHMARK(posix_sem_pingpong, iters) { + sem_t sems[3]; + sem_t* a = sems + 0; + sem_t* b = sems + 2; // to get it on a different cache line + + sem_init(a, 0, 0); + sem_init(b, 0, 0); + auto thr = std::thread([=]{ + for (int i = 0; i < iters; ++i) { + sem_wait(a); + sem_post(b); + } + }); + for (int i = 0; i < iters; ++i) { + sem_post(a); + sem_wait(b); + } + thr.join(); +} + +// I am omitting a benchmark result snapshot because these microbenchmarks +// mainly illustrate that PreBlockAttempts is very effective for rapid +// handoffs. The performance of Baton and sem_t is essentially identical +// to the required futex calls for the blocking case + +int main(int argc, char** argv) { + testing::InitGoogleTest(&argc, argv); + google::ParseCommandLineFlags(&argc, &argv, true); + + auto rv = RUN_ALL_TESTS(); + if (!rv && FLAGS_benchmark) { + folly::runBenchmarks(); + } + return rv; +}