Commit 142bc91
automata: reduce regex contention somewhat
> **Context:** A `Regex` uses internal mutable space (called a `Cache`)
> while executing a search. Since a `Regex` really wants to be easily
> shared across multiple threads simultaneously, it follows that a
> `Regex` either needs to provide search functions that accept a
> `&mut Cache` (thereby making synchronization a problem for the caller
> to solve) or it needs to do synchronization itself. While there are
> lower level APIs in `regex-automata` that do the former, they are
> less convenient. The higher level APIs, especially in the `regex`
> crate proper, need to do some kind of synchronization to give a
> search the mutable `Cache` that it needs.
>
> The current approach to that synchronization essentially uses a
> `Mutex<Vec<Cache>>` with an optimization for the "owning" thread
> that lets it bypass the `Mutex`. The owning thread optimization
> makes it so the single threaded use case essentially doesn't pay for
> any synchronization overhead, and that all works fine. But once the
> `Regex` is shared across multiple threads, that `Mutex<Vec<Cache>>`
> gets hit. And if you're doing a lot of regex searches on short
> haystacks in parallel, that `Mutex` comes under extremely heavy
> contention. To the point that a program can slow down by enormous
> amounts.
>
> This PR attempts to address that problem.
>
> It's worth pointing out that this issue can be worked around.
>
> The simplest work-around is to clone a `Regex` and send it to other
> threads instead of sharing a single `Regex` (see the sketch at the
> end of this message). This won't use any additional memory (a `Regex`
> is reference counted internally), but it will force each thread to
> use the "owner" optimization described above. This does mean, for
> example, that you can't share a `Regex` across multiple threads
> conveniently with a `lazy_static`/`OnceCell`/`OnceLock`/whatever.
>
> The other work-around is to use the lower level search APIs on a
> `meta::Regex` in the `regex-automata` crate. Those APIs accept a
> `&mut Cache` explicitly. In that case, you can use the `thread_local`
> crate, an actual `thread_local!`, or something else entirely.

I wish I could say this PR was a home run that fixed the contention issues with `Regex` once and for all, but it's not. It just makes things a fair bit better by switching from one stack to eight stacks for the pool, plus a couple of other heuristics. The stack is chosen by doing `self.stacks[thread_id % 8]`. It's a pretty dumb strategy, but it limits extra memory usage while at least reducing contention. Obviously, it works a lot better for the 8-16 thread case, and while it helps with the 64-128 thread case too, things are still pretty slow there.

A benchmark for this problem is described in #934. We compare 8 and 16 threads, and for each thread count, we compare a `cloned` and a `shared` approach. The `cloned` approach clones the regex before sending it to each thread, whereas the `shared` approach shares a single regex across multiple threads. The `cloned` approach is expected to be fast (and it is) because it forces each thread into the owner optimization. The `shared` approach, however, hits the shared stack behind a mutex and suffers badly from contention.

Here's what that benchmark looks like before this PR for 64 threads (on a 24-core CPU):
```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 128.3 ms, System: 5.7 ms]
  Range (min … max):     7.7 ms …  11.1 ms    278 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master
  Time (mean ± σ):      1.938 s ±  0.036 s    [User: 4.827 s, System: 41.401 s]
  Range (min … max):    1.885 s …  1.992 s    10 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  215.02 ± 15.45 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./tmp/repro-master'
```

And here's what it looks like after this PR:

```
$ hyperfine "REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro" "REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro"
Benchmark 1: REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):       9.0 ms ±   0.6 ms    [User: 127.6 ms, System: 6.2 ms]
  Range (min … max):     7.9 ms …  11.7 ms    287 runs

Benchmark 2: REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro
  Time (mean ± σ):      55.0 ms ±   5.1 ms    [User: 1050.4 ms, System: 12.0 ms]
  Range (min … max):    46.1 ms …  67.3 ms    57 runs

Summary
  'REGEX_BENCH_WHICH=cloned REGEX_BENCH_THREADS=64 ./target/release/repro' ran
  6.09 ± 0.71 times faster than 'REGEX_BENCH_WHICH=shared REGEX_BENCH_THREADS=64 ./target/release/repro'
```

So instead of things getting over 215x slower in the 64 thread case, it "only" gets 6x slower.

Closes #934
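As a concrete illustration of the first work-around described in the message above, here is a minimal sketch (not part of the commit; the pattern, thread count and haystacks are arbitrary) of cloning a `Regex` per thread so that every search takes the owner fast path:

```rust
use std::thread;

use regex::Regex;

fn main() {
    let re = Regex::new(r"[0-9]{4}-[0-9]{2}-[0-9]{2}").unwrap();
    let mut handles = Vec::new();
    for i in 0..8 {
        // Cloning is cheap: a `Regex` is reference counted internally,
        // so the compiled program is shared. Each thread owns its
        // clone, so every search hits the "owner" optimization instead
        // of contending on the shared pool.
        let re = re.clone();
        handles.push(thread::spawn(move || {
            let haystack = format!("thread {i}: 2023-07-05");
            re.is_match(&haystack)
        }));
    }
    for handle in handles {
        assert!(handle.join().unwrap());
    }
}
```

The only per-thread cost here is the mutable scratch `Cache` each clone creates lazily; the compiled regex itself is not duplicated.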
1 parent 9a505a1 commit 142bc91

File tree

1 file changed (+173 −22):

regex-automata/src/util/pool.rs

```diff
@@ -87,6 +87,8 @@ needing to re-create the scratch space for every search, which could wind up
 being quite expensive.
 */
 
+use alloc::boxed::Box;
+
 /// A thread safe pool that works in an `alloc`-only context.
 ///
 /// Getting a value out comes with a guard. When that guard is dropped, the
@@ -151,13 +153,13 @@ being quite expensive.
 /// let expected = Some(Match::must(0, 3..14));
 /// assert_eq!(expected, RE.find(&mut CACHE.get(), b"zzzfoo12345barzzz"));
 /// ```
-pub struct Pool<T, F = fn() -> T>(alloc::boxed::Box<inner::Pool<T, F>>);
+pub struct Pool<T, F = fn() -> T>(Box<inner::Pool<T, F>>);
 
 impl<T, F> Pool<T, F> {
     /// Create a new pool. The given closure is used to create values in
     /// the pool when necessary.
     pub fn new(create: F) -> Pool<T, F> {
-        Pool(alloc::boxed::Box::new(inner::Pool::new(create)))
+        Pool(Box::new(inner::Pool::new(create)))
     }
 }
 
@@ -235,7 +237,7 @@ mod inner {
         sync::atomic::{AtomicUsize, Ordering},
     };
 
-    use alloc::{boxed::Box, vec, vec::Vec};
+    use alloc::{boxed::Box, vec::Vec};
 
     use std::{sync::Mutex, thread_local};
 
@@ -268,6 +270,64 @@ mod inner {
     /// do.
     static THREAD_ID_DROPPED: usize = 2;
 
+    /// The number of stacks we use inside of the pool. These are only used
+    /// for non-owners. That is, these represent the "slow" path.
+    ///
+    /// In the original implementation of this pool, we only used a single
+    /// stack. While this might be okay for a couple threads, the prevalence
+    /// of 32, 64 and even 128 core CPUs has made it untenable. The contention
+    /// such an environment introduces when threads are doing a lot of
+    /// searches on short haystacks (a not uncommon use case) is palpable and
+    /// leads to huge slowdowns.
+    ///
+    /// This constant reflects a change from using one stack to the number of
+    /// stacks that this constant is set to. The stack for a particular thread
+    /// is simply chosen by `thread_id % MAX_POOL_STACKS`. The idea behind
+    /// this setup is that there should be a good chance that accesses to the
+    /// pool will be distributed over several stacks instead of all of them
+    /// converging to one.
+    ///
+    /// This is not a particularly smart or dynamic strategy. Fixing this to a
+    /// specific number has at least two downsides. First is that it will
+    /// help, say, an 8 core CPU more than it will a 128 core CPU. (But,
+    /// crucially, it will still help the 128 core case.) Second is that this
+    /// may wind up being a little wasteful with respect to memory usage.
+    /// Namely, if a regex is used on one thread and then moved to another
+    /// thread, then it could result in creating a new copy of the data in the
+    /// pool even though only one is actually needed.
+    ///
+    /// And that memory usage bit is why this is set to 8 and not, say, 64.
+    /// Keeping it at 8 limits, to an extent, how much unnecessary memory can
+    /// be allocated.
+    ///
+    /// In an ideal world, we'd be able to have something like this:
+    ///
+    /// * Grow the number of stacks as the number of concurrent callers
+    /// increases. I spent a little time trying this, but even just adding an
+    /// atomic addition/subtraction for each pop/push for tracking concurrent
+    /// callers led to a big perf hit. Since even more work would seemingly be
+    /// required than just an addition/subtraction, I abandoned this approach.
+    /// * The maximum amount of memory used should scale with respect to the
+    /// number of concurrent callers and *not* the total number of existing
+    /// threads. This is primarily why the `thread_local` crate isn't used,
+    /// as some environments spin up a lot of threads. This led to multiple
+    /// reports of extremely high memory usage (often described as memory
+    /// leaks).
+    /// * Even more ideally, the pool should contract in size. That is, it
+    /// should grow with bursts and then shrink. But this is a pretty thorny
+    /// issue to tackle and it might be better to just not.
+    /// * It would be nice to explore the use of, say, a lock-free stack
+    /// instead of using a mutex to guard a `Vec` that is ultimately just
+    /// treated as a stack. The main thing preventing me from exploring this
+    /// is the ABA problem. The `crossbeam` crate has tools for dealing with
+    /// this sort of problem (via its epoch based memory reclamation
+    /// strategy), but I can't justify bringing in all of `crossbeam` as a
+    /// dependency of `regex` for this.
+    ///
+    /// See this issue for more context and discussion:
+    /// https://github.com/rust-lang/regex/issues/934
+    const MAX_POOL_STACKS: usize = 8;
+
     thread_local!(
         /// A thread local used to assign an ID to a thread.
         static THREAD_ID: usize = {
```
```diff
@@ -291,6 +351,17 @@ mod inner {
         };
     );
 
+    /// This puts each stack in the pool below into its own cache line. This
+    /// is an absolutely critical optimization that tends to have the most
+    /// impact in high contention workloads. Without forcing each
+    /// mutex-protected stack into its own cache line, high contention
+    /// exacerbates the performance problem by causing "false sharing." By
+    /// putting each mutex in its own cache line, we avoid the false sharing
+    /// problem and the effects of contention are greatly reduced.
+    #[derive(Debug)]
+    #[repr(C, align(64))]
+    struct CacheLine<T>(T);
+
     /// A thread safe pool utilizing std-only features.
     ///
     /// The main difference between this and the simplistic alloc-only pool is
```
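As an aside, the effect of that `align(64)` wrapper is easy to check with the standard library. Here is a minimal sketch (not part of the commit; 64 bytes is the typical x86-64 cache line size, which the diff hard-codes):

```rust
use std::mem::{align_of, size_of};
use std::sync::Mutex;

// Same shape as the wrapper in the diff: aligning each element so that
// two adjacent elements of a `Vec<CacheLine<T>>` can never share a
// 64-byte cache line.
#[repr(C, align(64))]
struct CacheLine<T>(T);

fn main() {
    // The alignment is 64, and in Rust a type's size is always rounded
    // up to a multiple of its alignment, so each pool shard occupies
    // its own cache-line-sized slot.
    assert_eq!(align_of::<CacheLine<Mutex<Vec<u8>>>>(), 64);
    assert_eq!(size_of::<CacheLine<Mutex<Vec<u8>>>>() % 64, 0);
    println!("each shard gets its own 64-byte-aligned slot");
}
```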
```diff
@@ -299,12 +370,16 @@ mod inner {
     /// This makes the common case of running a regex within a single thread
     /// faster by avoiding mutex unlocking.
     pub(super) struct Pool<T, F> {
-        /// A stack of T values to hand out. These are used when a Pool is
-        /// accessed by a thread that didn't create it.
-        stack: Mutex<Vec<Box<T>>>,
         /// A function to create more T values when stack is empty and a caller
         /// has requested a T.
         create: F,
+        /// Multiple stacks of T values to hand out. These are used when a Pool
+        /// is accessed by a thread that didn't create it.
+        ///
+        /// Conceptually this is `Mutex<Vec<Box<T>>>`, but sharded out to make
+        /// it scale better under high contention work-loads. We index into
+        /// this sequence via `thread_id % stacks.len()`.
+        stacks: Vec<CacheLine<Mutex<Vec<Box<T>>>>>,
         /// The ID of the thread that owns this pool. The owner is the thread
         /// that makes the first call to 'get'. When the owner calls 'get', it
         /// gets 'owner_val' directly instead of returning a T from 'stack'.
@@ -354,9 +429,17 @@ mod inner {
     unsafe impl<T: Send, F: Send + Sync> Sync for Pool<T, F> {}
 
     // If T is UnwindSafe, then since we provide exclusive access to any
-    // particular value in the pool, it should therefore also be considered
-    // RefUnwindSafe. Also, since we use std::sync::Mutex, we get poisoning
-    // from it if another thread panics while the lock is held.
+    // particular value in the pool, the pool should therefore also be
+    // considered UnwindSafe.
+    //
+    // We require `F: UnwindSafe + RefUnwindSafe` because we call `F` at any
+    // point on demand, so it needs to be unwind safe on both dimensions for
+    // the entire Pool to be unwind safe.
+    impl<T: UnwindSafe, F: UnwindSafe + RefUnwindSafe> UnwindSafe for Pool<T, F> {}
+
+    // If T is UnwindSafe, then since we provide exclusive access to any
+    // particular value in the pool, the pool should therefore also be
+    // considered RefUnwindSafe.
     //
     // We require `F: UnwindSafe + RefUnwindSafe` because we call `F` at any
     // point on demand, so it needs to be unwind safe on both dimensions for
@@ -375,9 +458,13 @@ mod inner {
             // 'Pool::new' method as 'const' too. (The alloc-only Pool::new
             // is already 'const', so that should "just work" too.) The only
             // thing we're waiting for is Mutex::new to be const.
+            let mut stacks = Vec::with_capacity(MAX_POOL_STACKS);
+            for _ in 0..stacks.capacity() {
+                stacks.push(CacheLine(Mutex::new(alloc::vec![])));
+            }
             let owner = AtomicUsize::new(THREAD_ID_UNOWNED);
             let owner_val = UnsafeCell::new(None); // init'd on first access
-            Pool { stack: Mutex::new(vec![]), create, owner, owner_val }
+            Pool { create, stacks, owner, owner_val }
         }
     }
 
@@ -401,6 +488,9 @@ mod inner {
             let caller = THREAD_ID.with(|id| *id);
             let owner = self.owner.load(Ordering::Acquire);
             if caller == owner {
+                // N.B. We could also do a CAS here instead of a load/store,
+                // but ad hoc benchmarking suggests it is slower. And a lot
+                // slower in the case where `get_slow` is common.
                 self.owner.store(THREAD_ID_INUSE, Ordering::Release);
                 return self.guard_owned(caller);
             }
```
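The owner fast path in that last hunk is just an atomic load/store handoff. Here is a stripped-down sketch of that protocol (the sentinel values and the `OwnerSlot` type are illustrative, not the crate's internals), showing why the owning thread never touches a mutex:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

// Illustrative sentinels; the real pool reserves a few low IDs for
// "unowned", "in use" and "dropped" states, with real thread IDs above.
const UNOWNED: usize = 0;
const INUSE: usize = 1;

struct OwnerSlot {
    owner: AtomicUsize,
}

impl OwnerSlot {
    /// Fast path: if `caller` is the recorded owner, flag the slot as
    /// in-use with a plain load/store pair. (Per the commit's note, a
    /// CAS would also work here but benchmarked slower.)
    fn try_acquire(&self, caller: usize) -> bool {
        if self.owner.load(Ordering::Acquire) == caller {
            self.owner.store(INUSE, Ordering::Release);
            return true;
        }
        false
    }

    /// Dropping the guard writes the owner's ID back, re-arming the
    /// fast path for that thread's next search.
    fn release(&self, caller: usize) {
        self.owner.store(caller, Ordering::Release);
    }
}

fn main() {
    let slot = OwnerSlot { owner: AtomicUsize::new(UNOWNED) };
    assert!(!slot.try_acquire(7)); // unowned: fast path misses
    slot.owner.store(7, Ordering::Release); // thread 7 claims ownership
    assert!(slot.try_acquire(7)); // now the fast path hits
    slot.release(7);
}
```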
```diff
@@ -444,37 +534,82 @@ mod inner {
                     return self.guard_owned(caller);
                 }
             }
-            let mut stack = self.stack.lock().unwrap();
-            let value = match stack.pop() {
-                None => Box::new((self.create)()),
-                Some(value) => value,
-            };
-            self.guard_stack(value)
+            let stack_id = caller % self.stacks.len();
+            // We try to acquire exclusive access to this thread's stack, and
+            // if so, grab a value from it if we can. We put this in a loop so
+            // that it's easy to tweak and experiment with a different number
+            // of tries. In the end, I couldn't see anything obviously better
+            // than one attempt in ad hoc testing.
+            for _ in 0..1 {
+                let mut stack = match self.stacks[stack_id].0.try_lock() {
+                    Err(_) => continue,
+                    Ok(stack) => stack,
+                };
+                if let Some(value) = stack.pop() {
+                    return self.guard_stack(value);
+                }
+                // Unlock the mutex guarding the stack before creating a fresh
+                // value since we no longer need the stack.
+                drop(stack);
+                let value = Box::new((self.create)());
+                return self.guard_stack(value);
+            }
+            // We're only here if we couldn't get access to our stack, so just
+            // create a new value. This seems like it could be wasteful, but
+            // waiting for exclusive access to a stack when there's high
+            // contention is brutal for perf.
+            self.guard_stack_transient(Box::new((self.create)()))
         }
 
         /// Puts a value back into the pool. Callers don't need to call this.
         /// Once the guard that's returned by 'get' is dropped, it is put back
         /// into the pool automatically.
         fn put_value(&self, value: Box<T>) {
-            let mut stack = self.stack.lock().unwrap();
-            stack.push(value);
+            let caller = THREAD_ID.with(|id| *id);
+            let stack_id = caller % self.stacks.len();
+            // As with trying to pop a value from this thread's stack, we
+            // merely attempt to get access to push this value back on the
+            // stack. If there's too much contention, we just give up and
+            // throw the value away.
+            //
+            // Interestingly, in ad hoc benchmarking, it is beneficial to
+            // attempt to push the value back more than once, unlike when
+            // popping the value. I don't have a good theory for why this is.
+            // I guess if we drop too many values then that winds up forcing
+            // the pop operation to create new fresh values and thus leads to
+            // less reuse. There's definitely a balancing act here.
+            for _ in 0..10 {
+                let mut stack = match self.stacks[stack_id].0.try_lock() {
+                    Err(_) => continue,
+                    Ok(stack) => stack,
+                };
+                stack.push(value);
+                return;
+            }
         }
 
         /// Create a guard that represents the special owned T.
         fn guard_owned(&self, caller: usize) -> PoolGuard<'_, T, F> {
-            PoolGuard { pool: self, value: Err(caller) }
+            PoolGuard { pool: self, value: Err(caller), discard: false }
         }
 
         /// Create a guard that contains a value from the pool's stack.
         fn guard_stack(&self, value: Box<T>) -> PoolGuard<'_, T, F> {
-            PoolGuard { pool: self, value: Ok(value) }
+            PoolGuard { pool: self, value: Ok(value), discard: false }
+        }
+
+        /// Create a guard that contains a value from the pool's stack with an
+        /// instruction to throw away the value instead of putting it back
+        /// into the pool.
+        fn guard_stack_transient(&self, value: Box<T>) -> PoolGuard<'_, T, F> {
+            PoolGuard { pool: self, value: Ok(value), discard: true }
         }
     }
 
     impl<T: core::fmt::Debug, F> core::fmt::Debug for Pool<T, F> {
         fn fmt(&self, f: &mut core::fmt::Formatter<'_>) -> core::fmt::Result {
             f.debug_struct("Pool")
-                .field("stack", &self.stack)
+                .field("stacks", &self.stacks)
                 .field("owner", &self.owner)
                 .field("owner_val", &self.owner_val)
                 .finish()
@@ -490,6 +625,12 @@ mod inner {
         /// in the special case of `Err(THREAD_ID_DROPPED)`, it means the
         /// guard has been put back into the pool and should no longer be used.
         value: Result<Box<T>, usize>,
+        /// When true, the value should be discarded instead of being pushed
+        /// back into the pool. We tend to use this under high contention, and
+        /// this allows us to avoid inflating the size of the pool. (Because
+        /// under contention, we tend to create more values instead of waiting
+        /// for access to a stack of existing values.)
+        discard: bool,
     }
 
     impl<'a, T: Send, F: Fn() -> T> PoolGuard<'a, T, F> {
@@ -557,7 +698,17 @@ mod inner {
         #[inline(always)]
         fn put_imp(&mut self) {
             match core::mem::replace(&mut self.value, Err(THREAD_ID_DROPPED)) {
-                Ok(value) => self.pool.put_value(value),
+                Ok(value) => {
+                    // If we were told to discard this value then don't bother
+                    // trying to put it back into the pool. This occurs when
+                    // the pop operation failed to acquire a lock and we
+                    // decided to create a new value in lieu of contending for
+                    // the lock.
+                    if self.discard {
+                        return;
+                    }
+                    self.pool.put_value(value);
+                }
                 // If this guard has a value "owned" by the thread, then
                 // the Pool guarantees that this is the ONLY such guard.
                 // Therefore, in order to place it back into the pool and make
```