2013/11/20 - How hashing works internally in haproxy - maddalab@gmail.com This document describes how Haproxy implements hashing both map-based and consistent hashing, both prior to versions 1.5 and the motivation and tests that were done when providing additional options starting in version 1.5. A note on hashing in general, hash functions strive to have little correlation between input and output. The heart of a hash function is its mixing step. The behavior of the mixing step largely determines whether the hash function is collision-resistant. Hash functions that are collision resistant are more likely to have an even distribution of load. The purpose of the mixing function is to spread the effect of each message bit throughout all the bits of the internal state. Ideally every bit in the hash state is affected by every bit in the message. And we want to do that as quickly as possible simply for the sake of program performance. A function is said to satisfy the strict avalanche criterion if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits should change with a probability of one half for an arbitrary selection of the remaining input bits. To guard against a combination of hash function and input that results in high rate of collisions, haproxy implements an avalanche algorithm on the result of the hashing function. In all versions 1.4 and prior avalanche is always applied when using the consistent hashing directive. It is intended to provide quite a good distribution for little input variations. The result is quite suited to fit over a 32-bit space with enough variations so that a randomly picked number falls equally before any server position, which is ideal for consistently hashed backends, a common use case for caches. In all versions 1.4 and prior Haproxy implements the SDBM hashing function. However tests show that alternatives to SDBM have a better cache distribution on different hashing criteria. Additional tests involving alternatives for hash input and an option to trigger avalanche, we found different algorithms perform better on different criteria. DJB2 performs well when hashing ascii text and is a good choice when hashing on host header. Other alternatives perform better on numbers and are a good choice when using source ip. The results also vary by use of the avalanche flag. The results of the testing can be found under the tests folder. Here is a summary of the discussion on the results on 1 input criteria and the methodology used to generate the results. A note of the setup when validating the results independently, one would want to avoid backend server counts that may skew the results. As an example with DJB2 avoid 33 servers. Please see the implementations of the hashing function, which can be found in the links under references. The following was the set up used (a) hash-type consistent/map-based (b) avalanche on/off (c) balanche host(hdr) (d) 3 criteria for inputs - ~ 10K requests, including duplicates - ~ 46K requests, unique requests from 1 MM requests were obtained - ~ 250K requests, including duplicates (e) 17 servers in backend, all servers were assigned the same weight Result of the hashing were obtained across the server via monitoring log files for haproxy. Population Standard deviation was used to evaluate the efficacy of the hashing algorithm. Lower standard deviation, indicates a better distribution of load across the backends. On 10K requests, when using consistent hashing with avalanche on host headers, DJB2 significantly out performs SDBM. Std dev on SDBM was 48.95 and DJB2 was 26.29. This relationship is inverted with avalanche disabled, however DJB2 with avalanche enabled out performs SDBM with avalanche disabled. On map-based hashing SDBM out performs DJB2 irrespective of the avalanche option. SDBM without avalanche is marginally better than with avalanche. DJB2 performs significantly worse with avalanche enabled. Summary: The results of the testing indicate that there isn't a hashing algorithm that can be applied across all input criteria. It is necessary to support alternatives to SDBM, which is generally the best option, with algorithms that are better for different inputs. Avalanche is not always applicable and may result in less smooth distribution. References: Mixing Functions/Avalanche: http://home.comcast.net/~bretm/hash/3.html Hash Functions: http://www.cse.yorku.ca/~oz/hash.html