mirror of
http://git.haproxy.org/git/haproxy.git/
synced 2025-04-20 22:15:40 +00:00
BUG/MAJOR: pools: fix possible race with free() in the lockless variant
In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed report showing reproducible crashes under musl. Musl is one of the libs coming with a simple allocator for which we prefer to keep the shared cache. On x86 we have a DWCAS so the lockless implementation is enabled for such libraries. And this implementation has had a small race since day one: the allocator will need to read the first object's <next> pointer to place it into the free list's head. If another thread picks the same element and immediately releases it, while both the local and the shared pools are too crowded, it will be freed to the OS. If the libc's allocator immediately releases it, the memory area is unmapped and we can have a crash while trying to read that pointer. However there is no problem as long as the item remains mapped in memory because whatever value is found there will not be placed into the head since the counter will have changed. The probability for this to happen is extremely low, but as analyzed by Fabiano, it increases with the buffer size. On 16 threads it's relatively easy to reproduce with 2MB buffers above 200k req/s, where it should happen within the first 20 seconds of traffic usually. This is a structural issue for which there are two non-trivial solutions: - place a read lock in the alloc call and a barrier made of lock/unlock in the free() call to force to serialize operations; this will have a big performance impact since free() is already one of the contention points; - change the allocator to use a self-locked head, similar to what is done in the MT_LISTS. This requires two memory writes to the head instead of a single one, thus the overhead is exactly one memory write during alloc and one during free; This patch implements the second option. A new POOL_BUSY pointer was defined for the locked pointer value, allowing to both read and lock it with a single xchg call. 
The code was carefully optimized so that the locked period remains the shortest possible and that bus writes are avoided as much as possible whenever the lock is held. Tests show that while a bit slower than the original lockless implementation on large buffers (2MB), it's 2.6 times faster than both the no-cache and the locked implementation on such large buffers, and remains as fast as or faster than all the implementations when buffers are 48k or higher. Tests were also run on arm64 with similar results. Note that this code is not used on modern libcs featuring a fast allocator. A nice benefit of this change is that since it removes a dependency on the DWCAS, it will be possible to remove the locked implementation and replace it with this one, which is then usable on all systems, thus significantly increasing their performance with large buffers. Given that lockless pools were introduced in 1.9 (not supported anymore), this patch will have to be backported as far as 2.0. The code changed several times in this area and is subject to many ifdefs which will complicate the backport. What is important is to remove all the DWCAS code from the shared cache alloc/free lockless code and replace it with this one. The pool_flush() code is basically the same code as the allocator, retrieving the whole list at once. If in doubt regarding what barriers to use in older versions, it's safe to use the generic ones. This patch depends on the following previous commits: - MINOR: pools: do not maintain the lock during pool_flush() - MINOR: pools: call malloc_trim() under thread isolation - MEDIUM: pools: use a single pool_gc() function for locked and lockless The last one also removes one occurrence of an unneeded DWCAS in the code that was incompatible with this fix. The removal of the now unused seq field will happen in a future patch. Many thanks to Fabiano for his detailed report, and to Olivier for his help on this issue.
This commit is contained in:
parent
9b3ed51371
commit
2a4523f6f4
@ -68,6 +68,11 @@
|
|||||||
#define POOL_LINK(pool, item) ((void **)(item))
|
#define POOL_LINK(pool, item) ((void **)(item))
|
||||||
#endif
|
#endif
|
||||||
|
|
||||||
|
/* A special pointer for the pool's free_list that indicates someone is
|
||||||
|
* currently manipulating it. Serves as a short-lived lock.
|
||||||
|
*/
|
||||||
|
#define POOL_BUSY ((void *)1)
|
||||||
|
|
||||||
#define POOL_AVG_SAMPLES 1024
|
#define POOL_AVG_SAMPLES 1024
|
||||||
|
|
||||||
/* possible flags for __pool_alloc() */
|
/* possible flags for __pool_alloc() */
|
||||||
@ -91,9 +96,6 @@ struct pool_free_list {
|
|||||||
uintptr_t seq;
|
uintptr_t seq;
|
||||||
};
|
};
|
||||||
|
|
||||||
/* Note below, in case of lockless pools, we still need the lock only for
|
|
||||||
* the flush() operation.
|
|
||||||
*/
|
|
||||||
struct pool_head {
|
struct pool_head {
|
||||||
void **free_list;
|
void **free_list;
|
||||||
#ifdef CONFIG_HAP_LOCKLESS_POOLS
|
#ifdef CONFIG_HAP_LOCKLESS_POOLS
|
||||||
|
@ -112,27 +112,39 @@ static inline void pool_put_to_shared_cache(struct pool_head *pool, void *ptr)
|
|||||||
*/
|
*/
|
||||||
static inline void *pool_get_from_shared_cache(struct pool_head *pool)
|
static inline void *pool_get_from_shared_cache(struct pool_head *pool)
|
||||||
{
|
{
|
||||||
struct pool_free_list cmp, new;
|
void *ret;
|
||||||
|
|
||||||
cmp.seq = pool->seq;
|
/* we'll need to reference the first element to figure the next one. We
|
||||||
__ha_barrier_load();
|
* must temporarily lock it so that nobody allocates then releases it,
|
||||||
|
* or the dereference could fail.
|
||||||
cmp.free_list = pool->free_list;
|
*/
|
||||||
|
ret = pool->free_list;
|
||||||
do {
|
do {
|
||||||
if (cmp.free_list == NULL)
|
while (unlikely(ret == POOL_BUSY)) {
|
||||||
return NULL;
|
__ha_cpu_relax();
|
||||||
new.seq = cmp.seq + 1;
|
ret = _HA_ATOMIC_LOAD(&pool->free_list);
|
||||||
__ha_barrier_load();
|
}
|
||||||
new.free_list = *POOL_LINK(pool, cmp.free_list);
|
if (ret == NULL)
|
||||||
} while (HA_ATOMIC_DWCAS((void *)&pool->free_list, (void *)&cmp, (void *)&new) == 0);
|
return ret;
|
||||||
__ha_barrier_atomic_store();
|
} while (unlikely((ret = _HA_ATOMIC_XCHG(&pool->free_list, POOL_BUSY)) == POOL_BUSY));
|
||||||
|
|
||||||
|
if (unlikely(ret == NULL)) {
|
||||||
|
_HA_ATOMIC_STORE(&pool->free_list, NULL);
|
||||||
|
goto out;
|
||||||
|
}
|
||||||
|
|
||||||
|
/* this releases the lock */
|
||||||
|
_HA_ATOMIC_STORE(&pool->free_list, *POOL_LINK(pool, ret));
|
||||||
_HA_ATOMIC_INC(&pool->used);
|
_HA_ATOMIC_INC(&pool->used);
|
||||||
|
|
||||||
#ifdef DEBUG_MEMORY_POOLS
|
#ifdef DEBUG_MEMORY_POOLS
|
||||||
/* keep track of where the element was allocated from */
|
/* keep track of where the element was allocated from */
|
||||||
*POOL_LINK(pool, cmp.free_list) = (void *)pool;
|
*POOL_LINK(pool, ret) = (void *)pool;
|
||||||
#endif
|
#endif
|
||||||
return cmp.free_list;
|
|
||||||
|
out:
|
||||||
|
__ha_barrier_atomic_store();
|
||||||
|
return ret;
|
||||||
}
|
}
|
||||||
|
|
||||||
/* Locklessly add item <ptr> to pool <pool>, then update the pool used count.
|
/* Locklessly add item <ptr> to pool <pool>, then update the pool used count.
|
||||||
@ -141,16 +153,21 @@ static inline void *pool_get_from_shared_cache(struct pool_head *pool)
|
|||||||
*/
|
*/
|
||||||
static inline void pool_put_to_shared_cache(struct pool_head *pool, void *ptr)
|
static inline void pool_put_to_shared_cache(struct pool_head *pool, void *ptr)
|
||||||
{
|
{
|
||||||
void **free_list = pool->free_list;
|
void **free_list;
|
||||||
|
|
||||||
_HA_ATOMIC_DEC(&pool->used);
|
_HA_ATOMIC_DEC(&pool->used);
|
||||||
|
|
||||||
if (unlikely(pool_is_crowded(pool))) {
|
if (unlikely(pool_is_crowded(pool))) {
|
||||||
pool_put_to_os(pool, ptr);
|
pool_put_to_os(pool, ptr);
|
||||||
} else {
|
} else {
|
||||||
|
free_list = _HA_ATOMIC_LOAD(&pool->free_list);
|
||||||
do {
|
do {
|
||||||
*POOL_LINK(pool, ptr) = (void *)free_list;
|
while (unlikely(free_list == POOL_BUSY)) {
|
||||||
__ha_barrier_store();
|
__ha_cpu_relax();
|
||||||
|
free_list = _HA_ATOMIC_LOAD(&pool->free_list);
|
||||||
|
}
|
||||||
|
_HA_ATOMIC_STORE(POOL_LINK(pool, ptr), (void *)free_list);
|
||||||
|
__ha_barrier_atomic_store();
|
||||||
} while (!_HA_ATOMIC_CAS(&pool->free_list, &free_list, ptr));
|
} while (!_HA_ATOMIC_CAS(&pool->free_list, &free_list, ptr));
|
||||||
__ha_barrier_atomic_store();
|
__ha_barrier_atomic_store();
|
||||||
}
|
}
|
||||||
|
25
src/pool.c
25
src/pool.c
@ -293,21 +293,26 @@ void pool_gc(struct pool_head *pool_ctx)
|
|||||||
*/
|
*/
|
||||||
void pool_flush(struct pool_head *pool)
|
void pool_flush(struct pool_head *pool)
|
||||||
{
|
{
|
||||||
struct pool_free_list cmp, new;
|
void *next, *temp;
|
||||||
void **next, *temp;
|
|
||||||
|
|
||||||
if (!pool)
|
if (!pool)
|
||||||
return;
|
return;
|
||||||
HA_SPIN_LOCK(POOL_LOCK, &pool->lock);
|
|
||||||
|
/* The loop below atomically detaches the head of the free list and
|
||||||
|
* replaces it with a NULL. Then the list can be released.
|
||||||
|
*/
|
||||||
|
next = pool->free_list;
|
||||||
do {
|
do {
|
||||||
cmp.free_list = pool->free_list;
|
while (unlikely(next == POOL_BUSY)) {
|
||||||
cmp.seq = pool->seq;
|
__ha_cpu_relax();
|
||||||
new.free_list = NULL;
|
next = _HA_ATOMIC_LOAD(&pool->free_list);
|
||||||
new.seq = cmp.seq + 1;
|
}
|
||||||
} while (!_HA_ATOMIC_DWCAS(&pool->free_list, &cmp, &new));
|
if (next == NULL)
|
||||||
|
return;
|
||||||
|
} while (unlikely((next = _HA_ATOMIC_XCHG(&pool->free_list, POOL_BUSY)) == POOL_BUSY));
|
||||||
|
_HA_ATOMIC_STORE(&pool->free_list, NULL);
|
||||||
__ha_barrier_atomic_store();
|
__ha_barrier_atomic_store();
|
||||||
HA_SPIN_UNLOCK(POOL_LOCK, &pool->lock);
|
|
||||||
next = cmp.free_list;
|
|
||||||
while (next) {
|
while (next) {
|
||||||
temp = next;
|
temp = next;
|
||||||
next = *POOL_LINK(pool, temp);
|
next = *POOL_LINK(pool, temp);
|
||||||
|
Loading…
Reference in New Issue
Block a user