Still not faster than SSE2. Improving upon SSE2 performance will
probably require restructuring the algorithm to combine the various
multiply/add operations, but I'm not sure how to do that without
introducing further roundoff error. Left as an exercise for the reader.
The IFAST FDCT algorithm is sort of a legacy feature anyhow. Even with
SSE2 instructions, the ISLOW FDCT was almost as fast as the IFAST FDCT.
Since the ISLOW FDCT has been accelerated with AVX2 instructions, it is
now about the same speed as the IFAST FDCT on AVX2-equipped CPUs.