Overall almost 4% faster, idct_add down from 350 to 85 cycles, idct_dc_add down from 83 to 30 cycles. squash: rv34 idct rearrange partial register loads