modelled after aarch64 code

on Cortex-A8, s16 and s32 code is about 2x faster, float code about 7x faster

Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Martin Storsjö <martin@martin.st>