Programming Languages Research Group: Git

author	Alan Frindell <afrind@fb.com>
	Fri, 15 May 2015 17:42:48 +0000 (10:42 -0700)
committer	Viswanath Sivakumar <viswanath@fb.com>
	Wed, 20 May 2015 17:57:10 +0000 (10:57 -0700)
commit	1ae1c1ff3f344c6c1cec8f4b78c725e8e4b9a134
tree	1d3253c8eb9fc0a4c288ae1db94fb94cd4ce0503	tree \| snapshot
parent	d599ae3bef120f64b8042ac4407af96b85b8a4f1	commit \| diff

Fix race issue in EventBase::runInEventBaseThreadAndWait (t6779663)

Summary: Huge thanks to @afrind for debugging this issue with me and found the root cause. As per the comment from @afrind for the diff D1823407, there is a tricky race issue. The main thread could have left and reset the condition_variable from its stack but the EventBase thread tries to access it afterwards due to race and could be blocked indefinitely. This caused the server-side IO threads not able to pick up the incoming connections for the proxygen case. The fix is to use a simpler struct barrier and get a hold of the shared_ptr instead of the same object in a safer way.

Test Plan:
The original issue reproes very easily in HDFS XDC encryption case. Servers easily enter into bad state and we got high volume of timeouts from the client. With the fix, this does not happen anymore after the fix being deployed at 11:15PM. Here is the Scuba log before and after the fix:
https://fburl.com/109969805
And here is the correspond Scuba diagram for successful calls in the same test:
https://fburl.com/109971729
The throughput improved a lot after the fix.

Reviewed By: davejwatson@fb.com

Subscribers: doug, folly-diffs@, yfeldblum, chalfant, afrind, brettp, dougw, fma

FB internal diff: D2071646

Signature: t1:2071646:1431709609:10fb033536f9e4fb428dea8ba68f6a9a051616c0