CPUs have to serialise write access to a single cache line,
which effectively reduces the achievable level of parallelism when
several mutexes share one line (false sharing). Placing mutexes on
different cache lines avoids this problem.
The largest gains are expected on NUMA servers, where CPU
interconnect traffic is especially expensive.
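
For illustration, a minimal sketch of the padding technique, not
necessarily the exact code in this change. The names paddedMutex,
locker and newLocker are hypothetical, and the 64-byte cacheLineSize
is an assumption that holds on most x86-64 CPUs but not on every
architecture.

```go
package main

import (
	"fmt"
	"sync"
	"unsafe"
)

// cacheLineSize is an assumption: 64 bytes is typical for x86-64,
// but other architectures may use a different line size.
const cacheLineSize = 64

// paddedMutex (hypothetical name) pads a sync.Mutex up to cacheLineSize
// bytes, so adjacent slice elements sit one cache-line stride apart.
type paddedMutex struct {
	sync.Mutex
	_ [cacheLineSize - unsafe.Sizeof(sync.Mutex{})]byte
}

// paddedSize is the size in bytes of one padded slot.
const paddedSize = unsafe.Sizeof(paddedMutex{})

// locker (hypothetical) shards locks by fingerprint across padded mutexes.
type locker struct {
	mtxs []paddedMutex
}

// newLocker allocates n padded mutexes in one contiguous slice.
func newLocker(n int) *locker {
	return &locker{mtxs: make([]paddedMutex, n)}
}

// Lock and Unlock pick the shard by fingerprint modulo the shard count.
func (l *locker) Lock(fp uint64)   { l.mtxs[fp%uint64(len(l.mtxs))].Lock() }
func (l *locker) Unlock(fp uint64) { l.mtxs[fp%uint64(len(l.mtxs))].Unlock() }

func main() {
	l := newLocker(1024)
	l.Lock(42)
	l.Unlock(42)
	fmt.Println("bytes per mutex slot:", paddedSize)
}
```

The trade-off is memory: each slot grows from the size of a bare
sync.Mutex to a full cache line, which is usually acceptable for a
fixed, modest number of shards.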

Before:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4     2000000    932 ns/op
BenchmarkFingerprintLockerSerial-4      30000000   49.6 ns/op

After:
go test . -run none -bench BenchmarkFingerprintLocker
BenchmarkFingerprintLockerParallel-4     3000000    569 ns/op
BenchmarkFingerprintLockerSerial-4      30000000   51.0 ns/op