This splits the asm function into exact and non-exact version. The exact
version is as fast or faster on newer CPUs (which EXTERNAL_AVX_FAST describes
well) whilst the non-exact version is faster than the exact on older CPUs.
Also fixes yasm compilation which doesn't accept !cpuflags(avx) syntax.
Signed-off-by: Rostislav Pehlivanov <atomnuker@gmail.com>