From: Marko Cebokli
Date:
To: Minimalistic plugin API for video effects
Subject: Re: [Frei0r] Add fast char-float-char conversion functions with
gamma correction
On Sunday, November 04, 2012 12:20:56 PM Steinar H. Gunderson wrote:
> The technique is creative, but this isn't correct:
>
> if (a>0.9999) a=0.9999;
> ft[i]=(uint8_t)(256.0*a);
>
> This introduces round-off errors into your tables (cast-to-int truncates,
> it does not round), and also a bias since you're multiplying by the wrong
> value. You want
>
> ft[i]=lrintf(255.0*a);
>
> No need for the 0.9999 hack. Similarly, this is wrong:
>
> a=((float)i+0.5)/256.0;
>
> and should be replaced with
>
> a=i/255.0;
>
Well, I have tested this so that any uint8 value, when converted to float and
back, remains the same uint8 value, for all gamma variants.
I did not do an exact error measurement (with integration) for the float to
uint8 conversion; I might do it if I find some time. But even then, what is
the quantity to minimize: average, RMS, or maximum error?
Note that here we do not have the simple +-0.5 triangular quantization error,
as the "step" of the truncated float numbers changes with value, etc.
> Second, you seem to have forgotten the return type on float_2_uint8:
>
> static inline float_2_uint8(const float *in, uint8_t *tab)
>
Mea culpa, will correct this.
> You probably want to add an “int” here. Also, RGB8_2_float() etc. should
> either be declared static inline, or moved to a .c file, or you'll end up
> linking them over and over again into your binary.
>
Agree, will do that too.
> Third, you say this is “fast”, but have you actually measured it?
> The backwards transform is a lookup into a 64 kB table, but your L1 data
> cache is typically only 32 kB, so you'll be reading from L2 all the time.
> (That's notwithstanding any penalties for going through memory for the
> type-punning hacks.) There's a good reason why I chose a 14-bit table for
> colgate instead of a 16-bit table :-)
>
You were converting from int; with float you don't get enough of the mantissa
with 14 bits, unless you do some bit shuffling first (which you need in any
case to get a 14-bit number). Using 16 bits one can cover ALL the float
values, including NaNs & co., with extremely simple code. Getting rid of bit
shuffling and range testing/clamping should more than compensate for the L2
accesses.
And L2 isn't that much slower than L1, at least judging by the Linux memory
tester.
I am not sure the punning must go through memory; that depends on the
compiler. My experience is that GCC's -O2 almost always beat me when I was
trying to be smart with assembler.
> Anyway, for gamma conversions, using a table is probably OK, but for linear
> conversions, you'll be blown out of the sky by the cvtss2si (or even
> cvtps2pi) instruction, aka lrintf() (on an -ffast-math system).
>
I did study these SSE conversion instructions a bit, but wasn't really
impressed; they are among the slower ones. And you still need to go from 32
to 8 bits and multiply (shift).
Regardless, the main reason for using tables even with linear, is again the
ability to safely convert ANY float value, no range checking/clamping and
multiplication needed.
And of course, the KISS principle, one code fits all.
> > out[i].r=f1*(float)*cin++;
> > and
> > *cout++=(uint8_t)(in[i].r*255.0);
>
> This is a straw man, though, since the second of these is not the actual
> fastest way to convert a float to an int (again, see lrintf()).
>
Straw man or not, it is currently being used in actual plugins.
I didn't know that lrintf is faster than a simple cast; I will try it out and
see.
Actually, I thought the compiler might be smart enough to use the fastest
available instructions, including SSE.