This gets rid of the variable-length scratch buffer by filtering 16
pixels at a time and writing directly to the destination. The extra
loads this requires to load the source values are compensated by not
doing a round-trip to memory before shifting.
Signed-off-by: Mans Rullgard <mans@mansr.com>
Use uintptr_t instead of plain int. Without this change, the
comparisons will come out wrong for pointers in certain ranges.
Fixes random failures on ppc64. Also fixes some compiler warnings.
Signed-off-by: Mans Rullgard <mans@mansr.com>
This allows using more specific implementations for chroma/luma, e.g.
we can make assumptions on filterSize being constant, thus avoiding
that test at runtime.
It just does that part in scalar form, I doubt using a vector store
over 2 array would speed it up particularly.
The function should be written to not use a scratch buffer.
Remove unused variables "flags" and "dstFormat" in yuv2packed1,
merge source rows per plane for yuv2packed[12], and make every
source argument int16_t (some where invalidly set to uint16_t).
This prevents stack pollution and is part of the Great Evil Plan
to simplify swscale.
This will likely lead to a considerable performance boost,
since it removes a branch from the inner loop. Part of the
Great Evil Plan to simplify swscale.