urcu: add cacheline padding to URCU thread data
The thread-local data was previously smaller than a cacheline, so
multiple threads' data could be allocated on the same cacheline and
cause false sharing.
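As a rough illustration of the padding idea (not the actual patch), the
sketch below aligns a per-thread record to a cacheline so that adjacent
records cannot share one. The struct names, the fields, and the 64-byte
CACHELINE_SIZE are assumptions made for this example; the real URCU
thread data and the target's cacheline size may differ.

```c
#include <stdalign.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed cacheline size; 64 bytes is typical on x86-64 but not universal. */
#define CACHELINE_SIZE 64

/* Hypothetical per-thread record, smaller than one cacheline. Several of
 * these can land on the same line, so a write by one thread invalidates
 * the line that other threads are reading. */
struct thread_data_unpadded {
    unsigned int ctl;
    void *next;
};

/* Same fields, but aligning the first member raises the alignment of the
 * whole struct to CACHELINE_SIZE, which also rounds sizeof up to a
 * multiple of it. Each record then occupies its own line(s). */
struct thread_data_padded {
    alignas(CACHELINE_SIZE) unsigned int ctl;
    void *next;
};

_Static_assert(sizeof(struct thread_data_padded) % CACHELINE_SIZE == 0,
               "padded record must fill whole cachelines");
_Static_assert(alignof(struct thread_data_padded) == CACHELINE_SIZE,
               "padded record must start on a cacheline boundary");

/* Returns true if both addresses fall on the same cacheline. */
static inline bool same_cacheline(const void *a, const void *b)
{
    return ((uintptr_t)a / CACHELINE_SIZE) == ((uintptr_t)b / CACHELINE_SIZE);
}
```

The cost is memory: each padded record takes at least CACHELINE_SIZE
bytes, which is usually acceptable for one record per thread.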
I measured the offcore_response.all_data_rd.l3_hit.hitm_other_core
perf counter with 16 threads executing a read-only workload against a
URCU-protected split-list set. Before this change, I saw about 5.6M
events per second. After it, I saw essentially none.
Additionally, 'perf c2c' showed a lot of cacheline bouncing in the URCU
read-side critical section before this change, but now shows none.
I benchmarked the above read-only workload with 88 threads on a machine
with 88 logical cores (Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz), and
performance improved by more than 2x.
Fixes issue #75