Restore alphabetical order in lists, break overly long lines, do some
prettyprinting, add some explanatory section comments, group parts
together that belong together logically.
modelled after aarch64 code
on Cortex-A8, s16 and s32 code is about 2x faster,
float code about 7x faster
Signed-off-by: Peter Meerwald <pmeerw@pmeerw.net>
Signed-off-by: Martin Storsjö <martin@martin.st>