extensions: add AVX2 linear-float -> gamma-u8 conversions
Add AVX2 conversions from Y float, YA float, RGB float, and RGBA
float, to Y' u8, Y'A u8, R'G'B' u8, and R'G'B'A u8, respectively.
The conversions use a lookup table, similarly to the two-table
conversions, indexed using AVX2's 256-bit gather instruction, which
allows us to process 8 floats at once. On my machine, this conversion
is ~5x faster than the SSE conversions.
Note, however, that unlike the two-table conversions, we don't use
a second "feedback" table to correct the result, leading to an off-by-one
conversion error with a probability of ~0.1%, using a 2^16-element
table. This error rate is low enough for babl to use the
conversion, but might still be a bit too high regardless; it can be
further reduced with a bigger table.