
Wow, a big C/C++ code speedup!!!


macklin01

Computational Oncologist / Biomathematician / Mode
Joined
Apr 3, 2002
Location
Bloomington, IN
I was messing with some C++ code today using the standard gcc/g++ compiler.

I'm not sure about you all, but I use the function "pow( double, double)" all the time. Particularly useful for scientific computing, I suppose, but also for graphics programming. In particular, I use things like

Radius = pow( x, 2.0) + pow( y , 2.0);

quite a bit.

I found that code I compiled ran very slowly. In particular, one algorithm I have for reinitializing a 150x150 data set in a particular way would take 5-10 minutes on a P4 running with fully optimized code. (-O3, -march=pentium4, etc.)

Well, the other day, I was writing some scratch code, and for the heck of it, I tried things like x*x instead of pow(x,2.0), and the resulting code was ten times faster. I couldn't believe it. Well, today I went back and made a new function:

Code:
double intpow( double base, int exponent )
{
 int i;
 double out = 1.0; /* start at 1 so intpow( base, 0 ) correctly returns 1 */
 for( i=0 ; i < exponent ; i++ )
 {
  out *= base;
 }
 return out;
}

and used that anytime I had integer powers. As expected, it sped the costliest routine of my computational tumor growth code up from taking 5-10 minutes to 4-5 seconds. It was just amazing.

I'd assume this is because pow(double,double) needs to be a lot more complicated to support things like pow( 3.5, 1.5 ). It leaves me very tempted to compare the performance of sqrt(x) and pow(x,0.5). Fascinating stuff.

Anyway, since exponents are used so frequently in coding, I thought I'd share that with all of you. It's amazing what a difference such a small change can make!

-- Paul
 
Good work. You might see a similar advantage if you make that an inline function. Function calls are very expensive in terms of time. Loops are also somewhat expensive because of how processors are designed. The best possible solution (C++) is to use template recursion, which lets the compiler inline the whole chain and produce the best possible code: just a few inline multiplications and no loop. Here's a quick example:

Code:
#include <cmath>
#include <iostream>
using namespace std;

template <unsigned int exponent>
inline double intpow(double base)
{
  return intpow<exponent-1>(base) * base;
}

template <>
inline double intpow<0>(double base)
{
  return 1;
}

int main()
{
  cout << intpow<12>(1.2345) << " == " << pow(1.2345, 12) << endl;
}
This only works with constant exponents though.
 
I may be wrong, I am pretty new to C++, but isn't inline, like garbage collection in Java, only a suggestion to the compiler rather than something that's forced? The actual compiler decides (based on some heuristic) whether it will actually add the code inline.

Not saying it is a bad suggestion or unwarranted, but I wanted to clarify for myself.
 
Yes, I think you are correct. Most compilers honor it, I believe. The academic and express versions of MSVS are the only exceptions I know of. Another problem is that it often isn't honored if optimizations are turned off, so the code will run very slowly during debugging. I guess the best answer requires some empirical testing.

Another problem is that compilers might not allow too many layers of template recursion. I think the standard only requires a fairly small minimum instantiation depth, so my code may not work with something like intpow<33>(x) everywhere.

Another trick I use is macros. Even less flexible than templates, but they work in C and they are fast.
Code:
#define SQUARE(x) ((x)*(x))
#define CUBE(x) ((x)*(x)*(x))
 
Hmm, thanks for the tips. I may have to give this a try later on. Defining square in this way is certainly the optimal solution for the most commonly used power; I guess that there's always a tradeoff between ultimate flexibility ( pow( double, double ) ) and ultimate speed ( #define square ... )

Would your macro be faster than a function double square( double )? Thanks -- Paul
 
The macro should always be at least as fast as the function when it has a simple argument. It's entirely possible that the compiler can make the function as fast as the macro (by inlining it and such), but the function should never be faster.

One gotcha is argument expansion. pow(0.1 / 0.4, 2) and intpow(0.1 / 0.4, 2) will each evaluate 0.1 / 0.4 exactly once. The macro SQUARE(0.1 / 0.4) expands into (0.1 / 0.4) * (0.1 / 0.4), which will be slower than intpow because it evaluates the divide twice. Any attempt to work around this is hacky at best. Here's how I might do it.

Code:
double fast_square_x;
#define FAST_SQUARE(x) ((fast_square_x = (x)), fast_square_x * fast_square_x)
I used a global variable and an embedded assignment, sequenced with the comma operator so the argument is only evaluated once. I'm still praying that this macro isn't used in multiple threads simultaneously.

Inline functions are a much cleaner way to do this sort of thing.
 
Another tiny optimization you could throw in for the original for loop intpow(x,y) would be to use a decrementing loop like so:

for( i=exponent-1 ; i!=0 ; i-- )

I don't know for sure what gcc would do with it, but the assembly for decrementing loops can be quicker than incrementing (comparing against zero is cheap), so it might shave off a tiny fraction of overhead. EDIT: Heck, if you're adventurous and can get ASM compiling to work, just write an assembly language version of intpow(x,y) :D

JigPu
 
This would be faster in many cases:

Code:
double intpow( double base, int exponent )
{
  int i;
  double out = 1, curpwr = base;
  for( ; exponent > 0; exponent = exponent >> 1)
  {
    if ((exponent & 1) > 0)
      out *= curpwr;
    curpwr *= curpwr;
  }
  return out;
}

As a template:
Code:
template <unsigned int exponent>
inline double intpowt(double base)
{
  return intpowt<(exponent >> 1)>(base*base) * (((exponent & 1) > 0) ? base : 1);
}

template <>
inline double intpowt<0>(double base)
{
  return 1;
}

Or for constant powers if you dislike templates:
Code:
#define intpow5(b,e) \
((((e) & 1) > 0) ? (b) : 1)
#define intpow4(b,e) \
(((((e) & 1) > 0) ? (b) : 1) * intpow5(((b)*(b)), ((e) >> 1)))
#define intpow3(b,e) \
(((((e) & 1) > 0) ? (b) : 1) * intpow4(((b)*(b)), ((e) >> 1)))
#define intpow2(b,e) \
(((((e) & 1) > 0) ? (b) : 1) * intpow3(((b)*(b)), ((e) >> 1)))
#define intpow1(b,e) \
(((((e) & 1) > 0) ? (b) : 1) * intpow2(((b)*(b)), ((e) >> 1)))
#define intpow(b,e) \
(((((e) & 1) > 0) ? (b) : 1) * intpow1(((b)*(b)), ((e) >> 1)))
Add additional layers for exponents of 64 or more (six layers cover six bits, i.e. exponents up to 63).

If you know M4 you could write a recursive macro version as well :)

edit: Fixed a few typos ...
 
A macro preprocessor; sorta the C preprocessor on steroids :) One neat feature is that you're allowed to write recursive macros (with some limitations), so instead of the multiple intpow functions you could just do
Code:
#define intpow(b,e) \
((((e & 1) > 0) ? b : 1) * intpow((b*b), (e >> 1)))
with some syntactic changes to show that you're allowed recursion (I don't know M4 so can't put the right alphabet soup in).
 
There is a Boost preprocessor library that allows you to do iteration and recursion using the usual C++ preprocessor. I suspect it will work for C just as well, since it uses the same preprocessor.

I've never used it and I have no idea how they did it.
 
Interesting stuff, I have not done any C++ lately, except a few little progs, but I am sure I could use some of these tips before too long.
 
I read this thread and decided to test the performance of these functions, so I wrote a simple program in C using clock() from time.h to measure the time taken to perform 100 million power operations with a random base and different exponents. Since pow() is probably optimized at the compiler level, I turned on optimization and put the test functions in a separate .c file, to prevent the compiler from optimizing them under the assumption that the exponent is always the same. I tested pow(), the first intpow() from this post, and tried to write an implementation which combined the performance benefits of both. I plotted the results, and this is what I got:

pow.PNG

It looks like pow() uses special-case handling for exponents 0, 1 and 2, and otherwise uses 2^6-ary exponentiation by squaring. intpow() outperforms it for lower exponents, since pow() takes fractional exponents into account. However, the time complexity of intpow() is linear, so it is absolutely unusable for larger exponents. At exponent 1000, it takes about 257,000 clock cycles (off the graph).

So I wrote a routine which uses 2^5-ary exponentiation, but does not support fractional exponents. It outperforms the other functions for nearly all the values tested. However, pow() would probably be faster for ridiculously large exponents (which you are probably not going to use anyway ;)), due to the 2^6-ary algorithm's time growing a little more slowly.

Here is the code. I only cared about optimizing it as much as I could, so I am sorry if it is hard to figure out how it works... :shrug:

Code:
double fipow(double register b, unsigned int e) {
     double r = 1.0;
     if(!e) return r;
     int c;
     asm volatile ( //Index of the highest set bit of e goes into c
          "bsr  %1,  %0\n\t"
     : "=r" (c)
     : "r" (e));
     switch(c) {
          case 0: case 1: case 2: case 3: case 4: c = 0; break;
          case 5: case 6: case 7: case 8: case 9: c = 5; break;
          case 10: case 11: case 12: case 13: case 14: c = 10; break;
          case 15: case 16: case 17: case 18: case 19: c = 15; break;
          case 20: case 21: case 22: case 23: case 24: c = 20; break;
          case 25: case 26: case 27: case 28: case 29: c = 25; break;
          case 30:
               r = b; //(e >> 30) & 3 == 1
               r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r;
               c = 25; //hand off at bit 25 so bits 0-29 are processed below
               break;
          case 31:
               if(e & 0x40000000) r = b*b*b; //(e >> 30) & 3 == 3
               else r = b*b; //(e >> 30) & 3 == 2
               r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r;
               c = 25; //hand off at bit 25 so bits 0-29 are processed below
               break;
     }
     do {
          double register t;
          switch((e >> c) & 0x1F) {
               case 0: goto cont;
               case 1: t = b; break;
               case 2: t = b*b; break;
               case 3: t = b*b*b; break;
               case 4: t = b*b; t*=t; break;
               case 5: t = b*b; t*=t; t*=b; break;
               case 6: t = b*b*b; t*=t; break;
               case 7: t = b*b*b; t*=t; t*=b; break;
               case 8: t = b*b; t*=t; t*=t; break;
               case 9: t = b*b; t*=t; t*=t; t*=b; break;
               case 10: t = b*b; t*=t; t*=b; t*=t; break;
               case 11: t = b*b; t*=t; t*=b; t*=t; t*=b; break;
               case 12: t = b*b*b; t*=t; t*=t; break;
               case 13: t = b*b*b; t*=t; t*=t; t*=b; break;
               case 14: t = b*b*b; t*=t; t*=b; t*=t; break;
               case 15: t = b*b*b; t*=t; t*=b; t*=t; t*=b; break;
               case 16: t = b*b; t*=t; t*=t; t*=t; break;
               case 17: t = b*b; t*=t; t*=t; t*=t; t*=b; break;
               case 18: t = b*b; t*=t; t*=t; t*=b; t*=t; break;
               case 19: t = b*b; t*=t; t*=t; t*=b; t*=t; t*=b; break;
               case 20: t = b*b; t*=t; t*=b; t*=t; t*=t; break;
               case 21: t = b*b; t*=t; t*=b; t*=t; t*=t; t*=b; break;
               case 22: t = b*b; t*=t; t*=b; t*=t; t*=b; t*=t; break;
               case 23: t = b*b; t*=t; t*=b; t*=t; t*=b; t*=t; t*=b; break;
               case 24: t = b*b*b; t*=t; t*=t; t*=t; break;
               case 25: t = b*b*b; t*=t; t*=t; t*=t; t*=b; break;
               case 26: t = b*b*b; t*=t; t*=t; t*=b; t*=t; break;
               case 27: t = b*b*b; t*=t; t*=t; t*=b; t*=t; t*=b; break;
               case 28: t = b*b*b; t*=t; t*=b; t*=t; t*=t; break;
               case 29: t = b*b*b; t*=t; t*=b; t*=t; t*=t; t*=b; break;
               case 30: t = b*b*b; t*=t; t*=b; t*=t; t*=b; t*=t; break;
               case 31: t = b*b*b; t*=t; t*=b; t*=t; t*=b; t*=t; t*=b; break;
          }
          int c2;
          for(c2 = c; c2; c2 -= 5) {
               t *= t;
               t *= t;
               t *= t;
               t *= t;
               t *= t;
          }
          r *= t;
     cont: /* renamed from __cont: double-underscore names are reserved */
          c -= 5;
     } while(c >= 0);
     return r;
}
 
Nice 11-year thread necromancy :).

Is that really any faster than this with optimizations turned on?
Code:
double intpowlog(double x, unsigned int e)
{
        if (e == 0) return 1.0;
        if (e % 2 == 0)
        {
                double h = intpowlog(x, e / 2);
                return h * h;
        }
        else
        {
                double h = intpowlog(x, e / 2);
                return h * h * x;
        }
}

Also, modern compilers have been ignoring the register keyword for decades now. It has absolutely no effect except what's required by the standard (that you can't take the address of it).
 
Thank you! When I started using C, I should have read good documentation for my compiler, but I didn't. So I'm always learning new details that I'd missed :-/.
I have tested the code, and it is really fast! :clap:

pow2.PNG

I really should have tried to improve the algorithm before starting to think about how to scrape out little pieces of performance... :bang head However, I actually enjoy optimizing code as much as I can. I can take it as a challenge to beef the power function up using all those extremely nonportable, dirty tricks, which absurdly waste my time in favour of my processor's :).
 
Yeah I love doing low level optimizations, too, but it's very difficult, especially for multi-threaded programs with complicated memory access patterns (memory accesses are way slower than computations on modern CPUs). And yeah, always a good idea to optimize algorithmically first.

It also really helps to be familiar with what modern compilers can do. They can do a lot of amazing stuff.

Often the fastest code is the simplest code that the compiler can easily understand and optimize. For simple things like loop-unrolling, compilers can usually do a better job than we can, because they have extensive knowledge of the hardware architecture. There's really no need to do that manually and make the code long and complicated (and not any faster - sometimes slower if you start spilling from L1 instruction cache).

GCC documentation is a good starting point: https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
 
Thank you for your attention! I have been underestimating my compiler for a long time. It clearly knows more about my (slow) hardware than I do. I won't write those huge things anymore.
 