  1. #1
    Computational Oncologist / Biomathematician / Moderator on Vacation, Ph.D.
    macklin01's Avatar
    10 Year Badge
    Join Date
    Apr 2002
    Location
    Pasadena, CA
    Folding Profile Heatware Profile

    Talking Wow, a big C/C++ code speedup!!!

    I was messing with some C++ code today using the standard gcc/g++ compiler.

    I'm not sure about you all, but I use the function "pow( double, double)" all the time. Particularly useful for scientific computing, I suppose, but also for graphics programming. In particular, I use things like

    Radius = pow( x, 2.0) + pow( y , 2.0);

    quite a bit.

    I found that code I compiled ran very slowly. In particular, one algorithm I have for reinitializing a 150x150 data set in a particular way would take 5-10 minutes on a P4 running with fully optimized code. (-O3, -march=pentium4, etc.)

    Well, the other day, I was writing some scratch code, and for the heck of it, I tried things like x*x instead of pow(x,2.0), and the resulting code was ten times faster. I couldn't believe it. Well, today I went back and made a new function:

    Code:
    double intpow( double base, int exponent )
    {
     int i;
     double out = 1.0; /* start at 1 so exponent == 0 correctly gives 1 */
     for( i=0 ; i < exponent ; i++ )
     {
      out *= base;
     }
     return out;
    }
    and used that anytime I had integer powers. As expected, it sped the costliest function of my computational tumor growth code up from taking 5-10 minutes to 4-5 seconds. It was just amazing.

    I'd assume this is because pow(double,double) needs to be a lot more complicated to support things like pow( 3.5, 1.5 ). It leaves me very tempted to compare the performance of sqrt(x) and pow(x,0.5). Fascinating stuff.

    Anyway, since exponents are used so frequently in coding, I thought I'd share that with all of you. It's amazing what a difference such a small change can make!
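
    For anyone who wants to reproduce this at home, here's a rough standalone harness along the lines of the experiment above (the names and loop counts are mine; absolute timings will vary wildly with CPU, compiler, and flags):

    ```cpp
    #include <chrono>
    #include <cmath>
    #include <cstdio>

    // Naive integer-power loop; exponent == 0 correctly yields 1.
    static double intpow(double base, int exponent)
    {
        double out = 1.0;
        for (int i = 0; i < exponent; i++)
            out *= base;
        return out;
    }

    int main()
    {
        const int N = 10000000;
        volatile double sink = 0.0; // stops the optimizer from deleting the loops

        auto t0 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; i++)
            sink = sink + std::pow(1.0001, 2.0);
        auto t1 = std::chrono::steady_clock::now();
        for (int i = 0; i < N; i++)
            sink = sink + intpow(1.0001, 2);
        auto t2 = std::chrono::steady_clock::now();

        auto us = [](auto a, auto b) {
            return std::chrono::duration_cast<std::chrono::microseconds>(b - a).count();
        };
        std::printf("pow(x, 2.0): %lld us\n", (long long)us(t0, t1));
        std::printf("intpow(x,2): %lld us\n", (long long)us(t1, t2));
        return 0;
    }
    ```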

    -- Paul
    My heatware (macklin01)

    Need image I/O for your science apps? Try EasyBMP

    My biomedical research: Mathematical Cancer Modeling & Simulation

    I'm on vacation as a moderator as I devote more time to my faculty position.
    Thank you for your understanding if I don't respond to your PM. -- Paul

  2. #2
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Good work. You might see a similar advantage if you make that an inline function. Function calls are relatively expensive in terms of time, and loops are also somewhat expensive because of how processors are designed. The best possible solution (in C++) is to use template recursion, which lets the compiler inline the whole chain and produce the best possible code: just a few inlined multiplications and no loop. Here's a quick example:

    Code:
    #include <cmath>
    #include <iostream>

    template <unsigned int exponent>
    inline double intpow(double base)
    {
      return intpow<exponent-1>(base) * base;
    }

    template <>
    inline double intpow<0>(double base)
    {
      return 1;
    }

    int main()
    {
      std::cout << intpow<12>(1.2345) << " == " << std::pow(1.2345, 12) << std::endl;
    }
    This only works with constant exponents though.

  3. #3
    Member Jarlax's Avatar
    Join Date
    Aug 2003
    Location
    Asylum
    I may be wrong, I am pretty new to C++, but isn't inline like garbage collection in Java, in that it is only a suggestion to the compiler and is not forced? The compiler decides (based on some algorithm) whether it will actually add the code inline.

    Not saying it is a bad suggestion or unwarranted, but I wanted to clarify for myself.

    Q6600 2.4 -> 3.0
    Asus P5B Deluxe, 2x1G G.Skill PC6400
    PC Power & Cooling 680 silencer
    eVGA 8800 GTS (640 meg)

    2.4c -> 3.3 (275x12) @ 1.55v, Zalman 7000-AlCu
    Asus P4P800, 2x512 PC3700 Buffalo (BH-5) [2-2-2-5] @ 3.2v
    Thermaltake 480 watt
    eVGA 7800GS


  4. #4
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Yes, I think you are correct. Most compilers honor it, I believe; the academic and express versions of MSVS are the only exceptions I know of. Another problem is that it often isn't honored if optimizations are turned off, so the code will run very slowly during debugging. I guess the best answer requires some empirical testing to decide.

    Another problem is that compilers might not allow too many layers of recursive template instantiation. I think the standard says this should be configurable, but default to at least 32. So my code may not work with something like intpow<33>(x).

    Another trick I use is macros. Even less flexible than templates, but they work in C and they are fast.
    Code:
    #define SQUARE(x) ((x)*(x))
    #define CUBE(x) ((x)*(x)*(x))
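
    One caution worth attaching to any such macro (the names below are mine, for illustration): the expansion is purely textual, so an unparenthesized body silently breaks on compound arguments:

    ```cpp
    #include <cassert>

    #define BAD_SQUARE(x) x * x          // textual expansion, no parentheses
    #define GOOD_SQUARE(x) ((x) * (x))   // parenthesized body and argument

    int main()
    {
        // BAD_SQUARE(1 + 2) expands to 1 + 2 * 1 + 2, which is 5, not 9.
        assert(BAD_SQUARE(1 + 2) == 5);
        assert(GOOD_SQUARE(1 + 2) == 9);
        return 0;
    }
    ```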

  5. #5
    Computational Oncologist / Biomathematician / Moderator on Vacation, Ph.D.
    macklin01's Avatar
    10 Year Badge
    Join Date
    Apr 2002
    Location
    Pasadena, CA
    Folding Profile Heatware Profile
    Hmm, thanks for the tips. I may have to give this a try later on. Defining square in this way is certainly the optimal solution for the most commonly used power; I guess that there's always a tradeoff between ultimate flexibility ( pow( double, double ) ) and ultimate speed ( #define square ... )

    Would your #define macro be faster than a function double square( double )? Thanks -- Paul
    My heatware (macklin01)

    Need image I/O for your science apps? Try EasyBMP

    My biomedical research: Mathematical Cancer Modeling & Simulation

    I'm on vacation as a moderator as I devote more time to my faculty position.
    Thank you for your understanding if I don't respond to your PM. -- Paul

  6. #6
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    The macro should always be at least as fast as the function when it has a simple argument. It's entirely possible that the compiler can make the function as fast as the macro (by inlining it and such), but it should never get faster.

    One gotcha is argument expansion. pow(0.1 / 0.4, 2) and intpow(0.1 / 0.4, 2) will evaluate 0.1 / 0.4 exactly one time. The macro SQUARE(0.1 / 0.4) will be expanded into (0.1 / 0.4) * (0.1 / 0.4), which will be slower than intpow because it evaluates the divide twice. Any attempt to work around this is hacky at best. Here's how I might do that.

    Code:
    double fast_square_x;
    #define FAST_SQUARE(x) ((fast_square_x = (x)), fast_square_x * fast_square_x)
    I used a global variable and an embedded assignment, sequenced with the comma operator so the argument is evaluated only once. I'm still praying that this macro isn't used in multiple threads simultaneously.

    Inline functions are a much cleaner way to do this sort of thing.
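
    A minimal sketch of that cleaner inline alternative (the naming is mine): the argument is evaluated exactly once, and an optimizing compiler can still expand the call in place:

    ```cpp
    #include <cassert>

    // The divide in the argument happens once; x arrives by value.
    inline double square(double x) { return x * x; }

    int main()
    {
        double v = 0.1 / 0.4;
        assert(square(v) == v * v);
        assert(square(3.0) == 9.0);
        return 0;
    }
    ```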

  7. #7
    Inactive Pokémon Moderator JigPu's Avatar
    10 Year Badge
    Join Date
    Jun 2001
    Location
    Vancouver, WA
    Another tiny optimization you could throw in for the original for loop intpow(x,y) would be to use a decrementing loop like so:

    for( i=exponent-1 ; i!=0 ; i-- )

    I don't know for sure what gcc would do with it, but the assembly for a decrementing loop is quicker than an incrementing one, so it might shave off a tiny fraction of the overhead. EDIT: Heck, if you're adventurous and can get ASM compiling to work, just write an assembly language version of intpow(x,y)
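
    Here's a sketch of what that count-down version might look like dropped into the original function (the exponent == 0 guard is mine; whether it actually wins depends on the compiler, which may transform the loop either way):

    ```cpp
    #include <cassert>

    // Counting down to zero lets the compiler use a "decrement and
    // branch if non-zero" idiom instead of a separate compare.
    // Negative exponents are not handled, as in the originals.
    double intpow_down(double base, int exponent)
    {
        double out = 1.0;
        for (int i = exponent; i != 0; i--)
            out *= base;
        return out;
    }

    int main()
    {
        assert(intpow_down(2.0, 10) == 1024.0);
        assert(intpow_down(7.0, 0) == 1.0);
        return 0;
    }
    ```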

    JigPu
    .... ASRock Z68 Extreme3 Gen3
    .... Intel Core i5 2500 ........................ 4 thread ...... 3300 MHz ......... -0.125 V
    2x ASUS GTX 560 Ti ............................... 1 GiB ....... 830 MHz ...... 2004 MHz
    .... G.SKILL Sniper Low Voltage ............. 8 GiB ..... 1600 MHz ............ 1.25 V
    .... OCZ Vertex 3 ................................. 120 GB ............. nilfs2 ..... Arch Linux
    .... Kingwin LZP-550 .............................. 550 W ........ 94% Eff. ....... 80+ Plat
    .... Noctua NH-D14 ................................ 20 dB ..... 0.35 C°/W ................ 7 V


    "In order to combat power supply concerns, Nvidia has declared that G80 will be the first graphics card in the world to run entirely off of the souls of dead babies. This will make running the G80 much cheaper for the average end user."
    "GeForce 8 Series." Wikipedia, The Free Encyclopedia. 7 Aug 2006, 20:59 UTC. Wikimedia Foundation, Inc. 8 Aug 2006.

  8. #8
    Member
    Join Date
    Nov 2003
    Location
    Canberra, Down Under
    This would be faster in many cases:

    Code:
    double intpow( double base, int exponent )
    {
      double out = 1, curpwr = base;
      for( ; exponent > 0; exponent = exponent >> 1)
      {
        if ((exponent & 1) > 0)
          out *= curpwr;
        curpwr *= curpwr;
      }
      return out;
    }
    As a template:
    Code:
    template <unsigned int exponent>
    inline double intpowt(double base)
    {
      return intpowt<(exponent >> 1)>(base*base) * (((exponent & 1) > 0) ? base : 1);
    }

    template <>
    inline double intpowt<0>(double base)
    {
      return 1;
    }
    Or for constant powers if you dislike templates:
    Code:
    #define intpow5(b,e) \
    (((e & 1) > 0) ? b : 1)
    #define intpow4(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow5((b*b), (e >> 1)))
    #define intpow3(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow4((b*b), (e >> 1)))
    #define intpow2(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow3((b*b), (e >> 1)))
    #define intpow1(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow2((b*b), (e >> 1)))
    #define intpow(b,e) \
    (((((e) & 1) > 0) ? (b) : 1) * intpow1(((b)*(b)), ((e) >> 1)))
    Add additional layers for exponents > 32.
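
    To convince yourself the squaring trick is right, here's a quick standalone check against the standard library (the function name is mine):

    ```cpp
    #include <cassert>
    #include <cmath>

    // Exponentiation by squaring: O(log e) multiplies instead of O(e).
    double intpow_sq(double base, unsigned int exponent)
    {
        double out = 1.0, curpwr = base;
        for (; exponent > 0; exponent >>= 1) {
            if (exponent & 1)
                out *= curpwr;
            curpwr *= curpwr;
        }
        return out;
    }

    int main()
    {
        for (unsigned int e = 0; e < 20; e++)
            assert(std::fabs(intpow_sq(2.0, e) - std::pow(2.0, (double)e)) < 1e-9);
        assert(intpow_sq(3.0, 4) == 81.0);
        return 0;
    }
    ```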

    If you know M4, you could write a recursive macro version as well.

    edit: Fixed a few typos ...

  9. #9
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Quote Originally Posted by emboss
    If you know M4 you could write a recursive macro version as well
    What's M4?

  10. #10
    Member
    Join Date
    Nov 2003
    Location
    Canberra, Down Under
    A macro preprocessor; sorta the C preprocessor on steroids. One neat feature is that you're allowed to write recursive macros (with some limitations), so instead of the multiple intpow macros you could just do
    Code:
    #define intpow(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow((b*b), (e >> 1)))
    with some syntactic changes to show that you're allowed recursion (I don't know M4, so I can't put the right alphabet soup in).

  11. #11
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    There is a Boost preprocessor library that lets you do iteration and recursion using the usual C++ preprocessor. I suspect it will work for C just as well, since it uses the same preprocessor.

    I've never used it and I have no idea how they did it.

  12. #12
    Registered
    Join Date
    May 2005
    Hey thanks for the tip :-D I'll test it out later.
    My Pride and Joy:

    P4 2.4C GHz @ 3.2 GHz (self-built watercooled)
    ABIT IS7 i865PE
    512MB PC3200 GeIL Value
    300GB Maxtor Diamondmax 9 ATA/133 16MB Cache
    Albatron NVIDIA GeForce FX 5600 TURBO 128-bit 128MB DDR
    LG CD-RW (w00t)
    430W Award PSU
    400W Something PSU
    OS: GNU
    Distro: Gentoo
    Kernel: 2.6.13-rc6-ck1

  13. #13
    Open Source Senior khiloa's Avatar
    Join Date
    Jul 2004
    Location
    /usa/sc/florence
    Interesting stuff. I have not done any C++ lately, except a few little progs, but I am sure I could use some of these tips before too long.


    Say NO to ads!

    Did someone just help you? Thank them.

    Along with F@H, please help support some of our other teams.


  14. #14
    New Member
    Join Date
    Sep 2016
    I read this thread and decided to test the performance of these functions, so I wrote a simple program in C using clock() from time.h to measure the time taken to perform 100 million power operations with random bases and different exponents. Since pow() is probably optimized at the compiler level, I turned on optimization and put the test functions in a separate .c file, to prevent the compiler from optimizing them under the assumption that the exponent is always the same. I tested pow(), the first intpow() from this thread, and tried to write an implementation that combined the performance benefits of both. I plotted the results, and this is what I got:

    [Attachment: pow.PNG — plot of clock cycles vs. exponent for the tested power functions]

    It looks like pow() uses special-case handling for exponents 0, 1, and 2, and otherwise uses 2^6-ary exponentiation by squaring. intpow() outperforms it for lower exponents, since pow() takes fractional exponents into account. However, the time complexity of intpow() is linear in the exponent, so it is absolutely unusable for larger exponents. At exponent 1000, it takes about 257,000 clock cycles (off the graph).

    So I wrote a routine which uses 2^5-ary exponentiation, but does not support fractional exponents. It outperforms the other functions for nearly all the values tested. However, pow() would probably be faster for ridiculously large exponents (which you are probably not going to use, anyway), because the 2^6-ary algorithm's time grows a little more slowly.

    Here is the code. I only cared about optimizing it as much as I could, so I am sorry if it is hard to figure out how it works...

    Code:
    double fipow(register double b, unsigned int e) {
         double r = 1.0;
         if(!e) return r;
         int c;
         asm volatile ( //Highest bit index of e in n, cleared in e
              "bsr  %1,  %0\n\t"
         : "=r" (c)
         : "r" (e));
         switch(c) {
              case 0: case 1: case 2: case 3: case 4: c = 0; break;
              case 5: case 6: case 7: case 8: case 9: c = 5; break;
              case 10: case 11: case 12: case 13: case 14: c = 10; break;
              case 15: case 16: case 17: case 18: case 19: c = 15; break;
              case 20: case 21: case 22: case 23: case 24: c = 20; break;
              case 25: case 26: case 27: case 28: case 29: c = 25; break;
              case 30:
               r = b; //top two bits of e are 01
                   r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r;
                   c = 24;
                   break;
              case 31:
               if(e & 0x40000000) r = b*b*b; //top two bits of e are 11
               else r = b*b; //top two bits of e are 10
                   r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r; r*=r;
                   c = 24;
                   break;
         }
         do {
          register double t;
         	switch((e >> c) & 0x1F) {
         		case 0: goto __cont;
         		case 1: t = b; break;
         		case 2: t = b*b; break;
         		case 3: t = b*b*b; break;
                   case 4: t = b*b; t*=t; break;
                   case 5: t = b*b; t*=t; t*=b; break;
                   case 6: t = b*b*b; t*=t; break;
                   case 7: t = b*b*b; t*=t; t*=b; break;
                   case 8: t = b*b; t*=t; t*=t; break;
                   case 9: t = b*b; t*=t; t*=t; t*=b; break;
                   case 10: t = b*b; t*=t; t*=b; t*=t; break;
                   case 11: t = b*b; t*=t; t*=b; t*=t; t*=b; break;
                   case 12: t = b*b*b; t*=t; t*=t; break;
                   case 13: t = b*b*b; t*=t; t*=t; t*=b; break;
                   case 14: t = b*b*b; t*=t; t*=b; t*=t; break;
                   case 15: t = b*b*b; t*=t; t*=b; t*=t; t*=b; break;
                   case 16: t = b*b; t*=t; t*=t; t*=t; break;
                   case 17: t = b*b; t*=t; t*=t; t*=t; t*=b; break;
                   case 18: t = b*b; t*=t; t*=t; t*=b; t*=t; break;
                   case 19: t = b*b; t*=t; t*=t; t*=b; t*=t; t*=b; break;
                   case 20: t = b*b; t*=t; t*=b; t*=t; t*=t; break;
                   case 21: t = b*b; t*=t; t*=b; t*=t; t*=t; t*=b; break;
                   case 22: t = b*b; t*=t; t*=b; t*=t; t*=b; t*=t; break;
                   case 23: t = b*b; t*=t; t*=b; t*=t; t*=b; t*=t; t*=b; break;
                   case 24: t = b*b*b; t*=t; t*=t; t*=t; break;
                   case 25: t = b*b*b; t*=t; t*=t; t*=t; t*=b; break;
                   case 26: t = b*b*b; t*=t; t*=t; t*=b; t*=t; break;
                   case 27: t = b*b*b; t*=t; t*=t; t*=b; t*=t; t*=b; break;
                   case 28: t = b*b*b; t*=t; t*=b; t*=t; t*=t; break;
                   case 29: t = b*b*b; t*=t; t*=b; t*=t; t*=t; t*=b; break;
                   case 30: t = b*b*b; t*=t; t*=b; t*=t; t*=b; t*=t; break;
                   case 31: t = b*b*b; t*=t; t*=b; t*=t; t*=b; t*=t; t*=b; break;
         	}
         	int c2;
         	for(c2 = c; c2; c2 -= 5) {
                   t *= t;
                   t *= t;
                   t *= t;
                   t *= t;
                   t *= t;
              }
         	r *= t;
         	__cont:
         	c -= 5;
         } while(c >= 0);
         return r;
    }

  15. #15
    Member
    Join Date
    May 2008
    Location
    London, England
    Heatware Profile
    Nice 11-year thread necromancy.

    Is that really any faster than this with optimizations turned on?
    Code:
    double intpowlog(double x, unsigned int e)
    {
            if (e == 0) return 1.0;
            if (e % 2 == 0)
            {
                    double h = intpowlog(x, e / 2);
                    return h * h;
            }
            else
            {
                    double h = intpowlog(x, e / 2);
                    return h * h * x;
            }
    }
    Also, modern compilers have been ignoring the register keyword for decades now. It has absolutely no effect except what's required by the standard (that you can't take the address of it).
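
    A quick standalone sanity check that the recursive version does what it should (intpowlog copied from the post above, assertions mine):

    ```cpp
    #include <cassert>
    #include <cmath>

    // Recursive exponentiation by squaring: O(log e) multiplies.
    double intpowlog(double x, unsigned int e)
    {
        if (e == 0) return 1.0;
        double h = intpowlog(x, e / 2);
        return (e % 2 == 0) ? h * h : h * h * x;
    }

    int main()
    {
        assert(intpowlog(2.0, 10) == 1024.0);
        assert(intpowlog(5.0, 0) == 1.0);
        // Matches pow() for integer exponents to within rounding.
        assert(std::fabs(intpowlog(1.5, 13) - std::pow(1.5, 13.0)) < 1e-6);
        return 0;
    }
    ```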
    Heatware

    System:
    CPU - 2x Xeon E5-2670 (16 cores, 32 threads total)
    Mobo - SuperMicro X9DRW-3F
    RAM - 64GB, DDR3-1600 ECC Registered
    Video cards - GTX 970


  17. #16
    New Member
    Join Date
    Sep 2016
    Thank you! When I started using C, I should have read good documentation about my compiler, but I didn't. So I'm always learning new details that I had ignored.
    I have tested the code, and it is really fast!

    [Attachment: pow2.PNG — updated plot including the recursive intpowlog()]

    I really should have tried to improve the algorithm before starting to think about how to scrape together little pieces of performance... However, I actually enjoy optimizing code as much as I can. I take it as a challenge to beef the power function up using all those extremely nonportable, dirty tricks, which absurdly waste my time in favour of my processor's.

  18. #17
    Member
    Join Date
    May 2008
    Location
    London, England
    Heatware Profile
    Yeah, I love doing low-level optimizations too, but it's very difficult, especially for multi-threaded programs with complicated memory access patterns (memory accesses are way slower than computations on modern CPUs). And yeah, it's always a good idea to optimize algorithmically first.

    It also really helps to be familiar with what modern compilers can do. They can do a lot of amazing stuff.

    Often the fastest code is the simplest code that the compiler can easily understand and optimize. For simple things like loop-unrolling, compilers can usually do a better job than we can, because they have extensive knowledge of the hardware architecture. There's really no need to do that manually and make the code long and complicated (and not any faster - sometimes slower if you start spilling from L1 instruction cache).

    GCC documentation is a good starting point: https://gcc.gnu.org/onlinedocs/gcc/O...e-Options.html
    Heatware

    System:
    CPU - 2x Xeon E5-2670 (16 cores, 32 threads total)
    Mobo - SuperMicro X9DRW-3F
    RAM - 64GB, DDR3-1600 ECC Registered
    Video cards - GTX 970

  19. #18
    New Member
    Join Date
    Sep 2016
    Thank you for your attention! I have been underestimating my compiler for a long time. It clearly knows more about my (slow) hardware than I do. I won't write those huge things anymore.
