  1. #1
    Computational Oncologist / Biomathematician / Moderator on Vacation, Ph.D.
    macklin01's Avatar
    Join Date
    Apr 2002
    Location
    Pasadena, CA
    Posts
    5,643

    Wow, a big C/C++ code speedup!!!

    I was messing with some C++ code today using the standard gcc/g++ compiler.

    I'm not sure about you all, but I use the function "pow( double, double)" all the time. Particularly useful for scientific computing, I suppose, but also for graphics programming. In particular, I use things like

    Radius = pow( x, 2.0) + pow( y , 2.0);

    quite a bit.

    I found that code I compiled ran very slowly. In particular, one algorithm I have for reinitializing a 150x150 data set in a particular way would take 5-10 minutes on a P4 running with fully optimized code. (-O3, -march=pentium4, etc.)

    Well, the other day, I was writing some scratch code, and for the heck of it, I tried things like x*x instead of pow(x,2.0), and the resulting code was ten times faster. I couldn't believe it. Well, today I went back and made a new function:

    Code:
    double intpow( double base, int exponent )
    {
     /* repeated multiplication; assumes exponent >= 1 */
     int i;
     double out = base;
     for( i=1 ; i < exponent ; i++ )
     {
      out *= base;
     }
     return out;
    }
    and used that anytime I had integer powers. As expected, it sped up the costliest function in my computational tumor growth code from 5-10 minutes to 4-5 seconds. It was just amazing.

    I'd assume this is because pow(double,double) needs to be a lot more complicated to support things like pow( 3.5, 1.5 ). It leaves me very tempted to compare the performance of sqrt(x) and pow(x,0.5). Fascinating stuff.
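    If anyone wants to try that comparison, a quick-and-dirty timing loop like the one below should do it. This is just a sketch I haven't run; the iteration count is arbitrary, and the sums are printed only so the compiler can't throw the loops away.

    Code:
    #include <cstdio>
    #include <cmath>
    #include <ctime>
    using namespace std;

    int main()
    {
     const int N = 10000000;
     double sum_sqrt = 0.0, sum_pow = 0.0;

     // time sqrt(x) over a range of values
     clock_t start = clock();
     for( int i = 1; i <= N; i++ )
      sum_sqrt += sqrt( (double) i );
     double t_sqrt = (double)( clock() - start ) / CLOCKS_PER_SEC;

     // time pow(x, 0.5) over the same values
     start = clock();
     for( int i = 1; i <= N; i++ )
      sum_pow += pow( (double) i, 0.5 );
     double t_pow = (double)( clock() - start ) / CLOCKS_PER_SEC;

     // print the sums too, so the loops can't be optimized out
     printf( "sqrt: %g s (sum %g)\n", t_sqrt, sum_sqrt );
     printf( "pow : %g s (sum %g)\n", t_pow, sum_pow );
     return 0;
    }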

    Anyway, since exponents are used so frequently in coding, I thought I'd share that with all of you. It's amazing what a difference such a small change can make!

    -- Paul
    My heatware (macklin01)

    Need image I/O for your science apps? Try EasyBMP

    My biomedical research: Mathematical Cancer Modeling & Simulation

    I'm on vacation as a moderator as I devote more time to my faculty position.
    Thank you for your understanding if I don't respond to your PM. -- Paul

  2. #2
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Posts
    775
    Good work. You might see a similar advantage if you make that an inline function. Function calls are fairly expensive in terms of time, and loops carry some overhead because of how processors are designed. The best possible solution (C++) is to use template recursion, which the compiler can fully inline to produce the best possible code: just a few inline multiplications and no loop. Here's a quick example:

    Code:
    #include <iostream>
    #include <cmath>
    using namespace std;

    template <unsigned int exponent>
    inline double intpow(double base)
    {
      return intpow<exponent-1>(base) * base;
    }

    template <>
    inline double intpow<0>(double base)
    {
      return 1;
    }

    int main()
    {
      cout << intpow<12>(1.2345) << " == " << pow(1.2345, 12) << endl;
    }
    This only works with constant exponents though.

  3. #3
    Member Jarlax's Avatar
    Join Date
    Aug 2003
    Location
    Asylum
    Posts
    295
    I may be wrong, since I'm pretty new to C++, but isn't inline like garbage collection in Java in that it's only a suggestion to the compiler and isn't forced? The compiler decides (based on some heuristic) whether it will actually place the code inline.

    Not saying it's a bad suggestion or unwarranted, but I wanted to clarify for myself.

    Q6600 2.4 -> 3.0
    Asus P5B Deluxe, 2x1G G.Skill PC6400
    PC Power & Cooling 680 silencer
    eVGA 8800 GTS (640 meg)

    2.4c -> 3.3 (275x12) @ 1.55v, Zalman 7000-AlCu
    Asus P4P800, 2x512 PC3700 Buffalo (BH-5) [2-2-2-5] @ 3.2v
    Thermaltake 480 watt
    eVGA 7800GS


  4. #4
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Posts
    775
    Yes, I think you are correct. Most compilers honor it, I believe; the academic and express versions of MSVS are the only exceptions I know of. Another problem is that it often isn't honored if optimizations are turned off, so the code will run very slowly during debugging. I guess the best answer requires some empirical testing to decide.
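    If it really matters, I believe gcc also lets you force the issue with a function attribute. Something like this is just a sketch (the square function is only an example), but it should get inlined even without optimizations turned on, and gcc will report an error if it can't inline it:
    Code:
    /* gcc-specific: request inlining even when optimizations are off */
    static inline double square( double x ) __attribute__((always_inline));
    static inline double square( double x )
    {
     return x * x;
    }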

    Another problem is that compilers might not allow too many layers of template recursion. I think the standard says this should be configurable, but default to at least 32, so my code may not work with something like intpow<33>(x).

    Another trick I use is macros. Even less flexible than templates, but they work in C and they are fast.
    Code:
    #define SQUARE(x) ((x)*(x))
    #define CUBE(x) ((x)*(x)*(x))

  5. #5
    Computational Oncologist / Biomathematician / Moderator on Vacation, Ph.D.
    macklin01's Avatar
    Join Date
    Apr 2002
    Location
    Pasadena, CA
    Posts
    5,643
    Hmm, thanks for the tips. I may have to give this a try later on. Defining square in this way is certainly the optimal solution for the most commonly used power; I guess that there's always a tradeoff between ultimate flexibility ( pow( double, double ) ) and ultimate speed ( #define square ... )

    Would your define macro be faster than a function double square( double )? Thanks -- Paul
    My heatware (macklin01)

    Need image I/O for your science apps? Try EasyBMP

    My biomedical research: Mathematical Cancer Modeling & Simulation

    I'm on vacation as a moderator as I devote more time to my faculty position.
    Thank you for your understanding if I don't respond to your PM. -- Paul

  6. #6
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Posts
    775
    The macro should always be at least as fast as the function when it has a simple argument. It's entirely possible that the compiler can make the function as fast as the macro (by inlining it and such), but the function should never get faster.

    One gotcha is argument expansion. pow(0.1 / 0.4, 2) and intpow(0.1 / 0.4, 2) will evaluate 0.1 / 0.4 exactly once. The macro SQUARE(0.1 / 0.4) will be expanded into (0.1 / 0.4) * (0.1 / 0.4), which will be slower than intpow because it evaluates the divide twice. Any attempt to work around this is hacky at best. Here's how I might do it.

    Code:
    double fast_square_x;
    #define FAST_SQUARE(x) ( (fast_square_x = (x)), fast_square_x * fast_square_x )
    I used a global variable and an embedded assignment; the comma operator guarantees the assignment happens before the multiply, and the argument gets evaluated only once. I'm also praying that this macro isn't used in multiple threads simultaneously.

    Inline functions are a much cleaner way to do this sort of thing.
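    For example, something like this evaluates its argument exactly once and should end up just as fast as the macro once the compiler inlines it (the name is only illustrative):

    Code:
    inline double square( double x )
    {
     return x * x;
    }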

  7. #7
    Inactive Pokémon Moderator JigPu's Avatar
    Join Date
    Jun 2001
    Location
    Vancouver, WA
    Posts
    6,576
    Another tiny optimization you could throw in for the original for loop intpow(x,y) would be to use a decrementing loop like so:

    for( i=exponent-1 ; i!=0 ; i-- )

    I don't know for sure what gcc would do with it, but the assembly for a decrementing loop (comparing against zero) is usually quicker than an incrementing one, so it might shave off a tiny fraction of the overhead. EDIT: Heck, if you're adventurous and can get ASM compiling to work, just write an assembly language version of intpow(x,y).
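    For reference, the plain C version with the decrementing loop would look something like this (untested, and like the original it assumes exponent >= 1):

    Code:
    double intpow( double base, int exponent )
    {
     int i;
     double out = base;
     /* count down and test against zero, which tends to produce tighter assembly */
     for( i = exponent - 1 ; i != 0 ; i-- )
     {
      out *= base;
     }
     return out;
    }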

    JigPu
    .... ASRock Z68 Extreme3 Gen3
    .... Intel Core i5 2500 ........................ 4 thread ...... 3300 MHz ......... -0.125 V
    2x ASUS GTX 560 Ti ............................... 1 GiB ....... 830 MHz ...... 2004 MHz
    .... G.SKILL Sniper Low Voltage ............. 8 GiB ..... 1600 MHz ............ 1.25 V
    .... OCZ Vertex 3 ................................. 120 GB ............. nilfs2 ..... Arch Linux
    .... Kingwin LZP-550 .............................. 550 W ........ 94% Eff. ....... 80+ Plat
    .... Noctua NH-D14 ................................ 20 dB ..... 0.35 C°/W ................ 7 V


    "In order to combat power supply concerns, Nvidia has declared that G80 will be the first graphics card in the world to run entirely off of the souls of dead babies. This will make running the G80 much cheaper for the average end user."
    "GeForce 8 Series." Wikipedia, The Free Encyclopedia. 7 Aug 2006, 20:59 UTC. Wikimedia Foundation, Inc. 8 Aug 2006.

  8. #8
    Member
    Join Date
    Nov 2003
    Location
    Canberra, Down Under
    Posts
    830
    This would be faster in many cases:

    Code:
    double intpow( double base, int exponent )
    {
      /* exponentiation by squaring: O(log n) multiplies instead of O(n) */
      double out = 1, curpwr = base;
      for( ; exponent > 0; exponent = exponent >> 1)
      {
        if ((exponent & 1) > 0)
          out *= curpwr;
        curpwr *= curpwr;
      }
      return out;
    }
    As a template:
    Code:
    template <unsigned int exponent>
    inline double intpowt(double base)
    {
      return intpowt<(exponent >> 1)>(base*base) * (((exponent & 1) > 0) ? base : 1);
    }
    
    template <>
    inline double intpowt<0>(double base)
    {
      return 1;
    }
    Or for constant powers if you dislike templates:
    Code:
    #define intpow5(b,e) \
    (((e & 1) > 0) ? b : 1)
    #define intpow4(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow5((b*b), (e >> 1)))
    #define intpow3(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow4((b*b), (e >> 1)))
    #define intpow2(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow3((b*b), (e >> 1)))
    #define intpow1(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow2((b*b), (e >> 1)))
    #define intpow(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow1((b*b), (e >> 1)))
    Add additional layers for exponents above 63 (each macro layer handles one more bit of the exponent).

    If you know M4 you could write a recursive macro version as well

    edit: Fixed a few typos ...

  9. #9
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Posts
    775
    Quote Originally Posted by emboss
    If you know M4 you could write a recursive macro version as well
    What's M4?

  10. #10
    Member
    Join Date
    Nov 2003
    Location
    Canberra, Down Under
    Posts
    830
    A macro preprocessor; sorta the C preprocessor on steroids. One neat feature is that you're allowed to write recursive macros (with some limitations), so instead of the multiple intpow macros you could just do
    Code:
    #define intpow(b,e) \
    ((((e & 1) > 0) ? b : 1) * intpow((b*b), (e >> 1)))
    with some syntactic changes to mark that recursion is allowed (I don't know M4, so I can't put the right alphabet soup in).

  11. #11
    Senior Member mccoyn's Avatar
    Join Date
    Nov 2003
    Location
    Michigan, USA
    Posts
    775
    There is a Boost preprocessor library that allows you to do iteration and recursion using the usual C++ preprocessor. I suspect it will work for C just as well, since C uses the same preprocessor.

    I've never used it and I have no idea how they did it.

  12. #12
    Registered
    Join Date
    May 2005
    Posts
    79
    Hey thanks for the tip :-D I'll test it out later.
    My Pride and Joy:

    P4 2.4C GHz @ 3.2 GHz (self-built watercooled)
    ABIT IS7 i865PE
    512MB PC3200 GeIL Value
    300GB Maxtor Diamondmax 9 ATA/133 16MB Cache
    Albatron NVIDIA GeForce FX 5600 TURBO 128-bit 128MB DDR
    LG CD-RW (w00t)
    430W Award PSU
    400W Something PSU
    OS: GNU
    Distro: Gentoo
    Kernel: 2.6.13-rc6-ck1

  13. #13
    Open Source Senior khiloa's Avatar
    Join Date
    Jul 2004
    Location
    /usa/sc/florence
    Posts
    2,131
    Interesting stuff. I have not done any C++ lately, except a few little progs, but I am sure I could use some of these tips before too long.


    Say NO to ads!

    Did someone just help you? Thank them.

    Along with F@H, please help support some of our other teams.

