  1. #1
    Member OC Noob's Avatar
    Join Date
    Jun 2002
    Location
    Phoenix, AZ USA

    Cell, P4, gflops and mhz... (PS3/Xbox2/PC)

    I posted this in a gaming thread and thought someone here could probably answer my question. This is, after all, the CPU section

    "This is one thing Sony does extremely well. Hype. I'm still trying to understand if this thing is the greatest thing since sliced bread or just an overhyped piece of silicon.

    They say it does 256 Gflops @ 4.6 ghz. Can it even reach that speed, or is that just theoretical?

    A P4 3.2 can do a theoretical 59 gflops and a Radeon X800 can do 200 gflops according to its documentation.

    According to that presentation on Cell, PS3 bandwidth limits Cell to 1/10 of that power (256 gflops), so about 25 gflops. If that's its theoretical limit, then a P4 is theoretically faster than Cell will be in the PS3.


    What the heck, I need some computing genius to make some sense out of this crapola."
    Hail to the King:
    Opteron 165 w/ DFI Ultra-D w/ BBA 1900XT 512 mb GSkill PC4200 1 GB x 2
    74 gb WD Raptor x 2 Raid 0 MSI (ATI550) Tuner
    w/ Windows XP Media Center Edition & OCZ Powerstream 520
    DD TDX H2O block w/ Maze4 GPA block DD D4 pump
    single 120mm Fan Heater core w/ shroud In Lian-Li fish tank window case

    RIP (Rest In Pieces):
    P4 3.0 ghz @ 3.75 ghz Aerocool HT-101 IC7-G
    Radeon 9800 Pro 430 W Antec True Power

  2. #2
    Member
    Join Date
    Jun 2003
    Location
    Amsterdam, NL
    From what I understand, flops are what it is pretty good at, as long as it's not bandwidth limited (which it is, due to the lack of a decent-sized cache). I see 4.6 GHz as a theoretical speed; it just doesn't seem realistic to me.

  3. #3
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    I believe their sample ran at 4GHz, but I may be mistaken.
    Lost access to the classifieds? Look here.
    Forum Policies
    Sig Rules

    "Men occasionally stumble over the truth, but most of them pick themselves up and hurry off as if nothing ever happened."
    -Sir Winston Churchill

  4. #4
    Member OC Noob's Avatar
    Join Date
    Jun 2002
    Location
    Phoenix, AZ USA
    So does it mean anything that Cell can do 256 gflops at 4.6 ghz, P4 can do 59 @ 3.2 ghz and an X800 can do 200 gflops?

    Probably all theoretical numbers, but can we use them to make any kind of a comparison between the three?

    And why use CPU's when GPU's are so good at gflops?


    Sorry for the stupid questions, I really don't know much about flops.

  5. #5
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    Quote Originally Posted by OC Noob
    And why use CPU's when GPU's are so good at gflops?
    Well, off the top of my head, flexibility and integer operations. GPUs are very good at what they do, and they are moving to be more programmable, but they cannot do the sort of general-purpose things CPUs can do. Additionally, many programs use integer code, something at which desktop CPUs are much better than GPUs. Also, the 256 GFLOPS figure is for single-precision FP operations, whereas CPUs generally do double precision. The figure for the Cell is much lower when double-precision FP ops are used instead of single precision.
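
    To make the single vs. double precision distinction concrete, here is a minimal Python sketch. The helper name `to_single` is made up for illustration; it only shows the accuracy difference between the two formats and says nothing about how fast Cell or any other chip runs either one.

```python
import struct

def to_single(x: float) -> float:
    """Round a Python float (double precision) to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]

value = 0.1                # stored as a 64-bit double by Python
single = to_single(value)  # same value squeezed through a 32-bit float

print(f"double: {value:.17f}")
print(f"single: {single:.17f}")
print(f"error introduced by single precision: {abs(single - value):.3e}")
```

    Single precision keeps roughly 7 significant decimal digits and double roughly 16, which is why the two kinds of FLOPS figures are not directly comparable.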

  6. #6
    Member germanjulian's Avatar
    Join Date
    Apr 2002
    Location
    Frankfurt/London
    it all depends!

    That's the correct thing to say. Speed depends on how the CPUs are built, their bandwidth, cache, instruction sets and especially how programs are written.

    The Cell processor will be amazing... no doubt about it... it will kick some a**... Will it take over the PC world? Maybe... but I doubt it.
    /|\ Asus P5W DH Deluxe, Intel C2D E6600, 4GB Corsair XMS2-6400C4 DDR2, E-VGA GeForce 7800 GT, Asus Xonar D2, 160GB Intel X25-m SSD, 1TB HD, Coolermaster Cosmos, etc. see my website google my name/|\

  7. #7
    Member OC Noob's Avatar
    Join Date
    Jun 2002
    Location
    Phoenix, AZ USA
    hmmm, thanks for the info.


    I'm still not sold on this thing. Sounds like a lot of hype and numbers that are meaningless. I guess we'll see what it does when "the rubber meets the road."


    I should say I'm not sold on this thing when it comes to PCs or PC chip applications (servers). For consoles I have no doubt it will be great. Console games are going to be coded specifically for Cell, whereas PC software won't be, or will be limited to whatever gets written for it.

    By the time it gets even a decent software base, Itanium or some other advanced chip could be out with an MS OS to support it.

  8. #8
    Inactive Pokémon Moderator JigPu's Avatar
    10 Year Badge
    Join Date
    Jun 2001
    Location
    Vancouver, WA
    Quote Originally Posted by OC Noob
    A P4 3.2 can do a theoretical 59 gflops
    Where'd you get that bit of info? I can't for the life of me get the math (or even Google) to give me that number...

    59 GFLOPs @ 3.2GHz = 18 floating point ops/cycle

    Somebody wanna enlighten me as to how 18 floating point values can appear to be modified in only one clock tick? I mean, even if you use SSE, that only messes with 4 pieces of data, leaving you with 14 yet to be touched. If we go for dual parallel SSE pipes, we're still 10 ops too short. Assume that hyperthreading will magically allow you to perform two operations in one cycle as well, and we're STILL 2 operations too short...
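
    For anyone who wants to redo the tally, here is a minimal Python sketch of the arithmetic above. The function name `ops_per_cycle` and the scenario labels are just illustrative, and the per-cycle figures are the generous assumptions made in this post, not measured or documented hardware numbers.

```python
def ops_per_cycle(gflops: float, clock_ghz: float) -> float:
    """FP operations per clock cycle implied by a GFLOPS claim."""
    return gflops / clock_ghz

# The quoted P4 claim: 59 GFLOPS at 3.2 GHz.
claimed = ops_per_cycle(59.0, 3.2)
print(f"claimed: {claimed:.2f} ops/cycle")

# Generous assumptions from the post (not verified hardware facts).
scenarios = {
    "one 4-wide SSE op": 4,
    "two parallel SSE pipes": 8,
    "two pipes, doubled by HT": 16,
}

for name, ops in scenarios.items():
    shortfall = round(claimed) - ops
    print(f"{name}: {ops} ops/cycle, short by {shortfall}")
```

    Even the most optimistic scenario falls short of the roughly 18 ops/cycle the 59 GFLOPS figure would require.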

    JigPu
    .... ASRock Z68 Extreme3 Gen3
    .... Intel Core i5 2500 ........................ 4 thread ...... 3300 MHz ......... -0.125 V
    2x ASUS GTX 560 Ti ............................... 1 GiB ....... 830 MHz ...... 2004 MHz
    .... G.SKILL Sniper Low Voltage ............. 8 GiB ..... 1600 MHz ............ 1.25 V
    .... OCZ Vertex 3 ................................. 120 GB ............. nilfs2 ..... Arch Linux
    .... Kingwin LZP-550 .............................. 550 W ........ 94% Eff. ....... 80+ Plat
    .... Nocuta NH-D14 ................................ 20 dB ..... 0.35 C°/W ................ 7 V


    "In order to combat power supply concerns, Nvidia has declared that G80 will be the first graphics card in the world to run entirely off of the souls of dead babies. This will make running the G80 much cheaper for the average end user."
    "GeForce 8 Series." Wikipedia, The Free Encyclopedia. 7 Aug 2006, 20:59 UTC. Wikimedia Foundation, Inc. 8 Aug 2006.

  9. #9
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...

    OC Noob: MS already has an OS out for the Itanium. Or were you talking about the Cell?

  10. #10
    Member
    Join Date
    May 2002
    Location
    Purdue University, IN
    I read that Cell really only does about 25 GFLOPS when it's actually accurate in floating point. Also, the code needs to be specialized to work with the Cell, and in the PC market specialized code is hard to come by.

  11. #11
    Inactive Pokémon Moderator JigPu's Avatar
    10 Year Badge
    Join Date
    Jun 2001
    Location
    Vancouver, WA
    Quote Originally Posted by Gnufsh
    Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...
    I thought those were integer, not FP... Oh well, goes to show how much attention I pay to chips these days

    Regardless, fully loading the 2 adders (which I assume aren't SIMD) is only 4 ops/cycle. Again, assuming that HT somehow magically lets you execute two instructions in the same clock cycle gives only 8 ops/cycle. Still quite a bit less than 18....

    JigPu

  12. #12
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    I assume that, as long as we're doing theoretical calculations, we can use both SSE and 387 code. Which means 4/clock from the FP adders, 1/clock from the full FPU, 4-8 more from SSE... Oh, wait, still short.

    This post points to 24GFLOPS from the p4:
    http://forums.anandtech.com/messagev...&enterthread=y

  13. #13
    Member
    Join Date
    Jun 2002
    Location
    Isla Vista, CA
    That's not possible: the x87 FPU and the SSE unit share the same execution hardware. They even share an issue port. And you cannot issue multiply and add instructions in parallel, you can't even issue them alone. In order to get the full throughput of 1 SSE instruction per cycle, you must alternate multiply and add instructions. The previous poster is correct: at most you can get 4 SP FP ops per cycle on a P4, which leads to 12.8 GFLOPS on a 3.2 GHz P4.

    The post on AnandTech is incorrect, as the person is getting 24 by assuming there's a multiply-accumulate (FMAC) instruction. SSE/2/3 does not have FMAC instructions the way Cell or AltiVec do.
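
    The peak figure is just clock times work per cycle, so it can be sketched in a few lines of Python. The function `peak_gflops` and its parameter names are made up for illustration. The second call is purely hypothetical: it shows how counting a fused multiply-add as two operations per lane would double the number, and it is not a claim about what clock speed or method the linked post actually used.

```python
def peak_gflops(clock_ghz: float, lanes: int,
                instr_per_cycle: float = 1.0,
                flops_per_lane: int = 1) -> float:
    """Theoretical peak: clock * SIMD lanes * instructions/cycle * flops per lane."""
    return clock_ghz * lanes * instr_per_cycle * flops_per_lane

# One 4-wide single-precision SSE instruction per cycle at 3.2 GHz.
p4_sse = peak_gflops(3.2, lanes=4)

# Hypothetical: the same chip if it had an FMAC instruction (2 flops per lane).
p4_if_fmac = peak_gflops(3.2, lanes=4, flops_per_lane=2)

print(f"P4 3.2 GHz, SSE without FMAC: {p4_sse:.1f} GFLOPS")
print(f"Same clock if FMAC existed:   {p4_if_fmac:.1f} GFLOPS")
```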

  14. #14
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    Ah, I assumed the SSE/x87 execution hardware was different (I'm fairly certain it is in the P3, and I'm 100% certain it is in some x86 cpus).

  15. #15
    Member OC Noob's Avatar
    Join Date
    Jun 2002
    Location
    Phoenix, AZ USA
    Quote Originally Posted by Gnufsh
    Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...

    OC Noob: MS already has an OS out for the Itanium. Or were you talking about the Cell?
    sorry have to be quick, but yeah. I was talking about Cell. Just stuck in that Itanium could have an MS OS by then... but anything could happen

    EDIT: Didn't know its software was by MS. Thanks for the info. I should have said an MS OS for home users.


    Quote Originally Posted by JigPu
    Where'd you get that bit of info? I can't for the life of me get the math (or even Google) to give me that number...

    59 GFLOPs @ 3.2GHz = 18 floating point ops/cycle

    Somebody wanna enlighten me as to how 18 floating point values can appear to be modified in only one clock tick? I mean, even if you use SSE, that only messes with 4 pieces of data, leaving you with 14 yet to be touched. If we go for dual parallel SSE pipes, we're still 10 ops too short. Assume that hyperthreading will magically allow you to perform two operations in one cycle as well, and we're STILL 2 operations too short...

    JigPu

    I did an internet search for gflops and pentium and it was said on a few boards where people had done the math.

    I'll link it when I have more time


    EDIT:


    Found it referenced at a handful of places, but am having a hard time finding the boards that did the math to get that. Not saying it's right either, but I'd like to find it so you guys can take a look and tell me if it's right or wrong.

    http://www.abovetopsecret.com/forum/thread75492/pg2

    http://www.geek.com/news/geeknews/20...0728026217.htm

    http://www.nvnews.net/vbulletin/showthread.php?p=434532


    Oh well, I give up for tonight. Sounds like that number isn't right anyway.
    Last edited by OC Noob; 02-19-05 at 02:12 AM.

  16. #16
    Member
    Join Date
    Jun 2003
    Location
    Amsterdam, NL
    Gigaflops are not the reason you have a CPU for gaming; the GPU will always obliterate it. A CPU for gaming needs higher integer performance, which Cell is sounding like it sucks at. GJ IBM, build a chip that looks great on paper, but still lacks a key component for gaming.

  17. #17
    Member
    Join Date
    Jun 2002
    Location
    Isla Vista, CA
    Quote Originally Posted by Gnufsh
    Ah, I assumed the SSE/x87 execution hardware was different (I'm fairly certain it is in the P3, and I'm 100% certain it is in some x86 cpus).
    As far as I know, no x86 MPU uses separate execution hardware. The only chips I'm aware of that have separate vector units are the PPC G4 and 970. Rarely, if ever, will you have code that uses both scalar and vector instructions within any reasonable window, so such a feature would be pretty useless as well as costly.

  18. #18
    Member
    Join Date
    Jun 2002
    Location
    Isla Vista, CA
    Quote Originally Posted by man_utd
    Gigaflops are not the reason you have a CPU for gaming; the GPU will always obliterate it. A CPU for gaming needs higher integer performance, which Cell is sounding like it sucks at. GJ IBM, build a chip that looks great on paper, but still lacks a key component for gaming.
    IMHO, it sounds like Cell is supposed to replace GPUs rather than CPUs. Having fully programmable vector processors is much better from a programming point of view than shader programming on the GPU. There will still be a graphics processor, but I doubt it'll be that powerful. Looks like we're moving back to the days of software rendering, with the graphics subsystem being there just for drawing.

  19. #19
    Senior Member Gnufsh's Avatar
    Join Date
    Dec 2001
    Location
    June Lake, California
    Quote Originally Posted by imgod2u
    As far as I know, no x86 MPU uses separate execution hardware. The only chips I'm aware of that have separate vector units are the PPC G4 and 970. Rarely, if ever, will you have code that uses both scalar and vector instructions within any reasonable window, so such a feature would be pretty useless as well as costly.
    really? Good to know. I wonder why gcc has this option then:
    -mfpmath=unit
    Generate floating point arithmetics for selected unit unit. The choices for unit are:

    387
    Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.

    This is the default choice for i386 compiler.
    sse
    Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.

    For i387 you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For x86-64 compiler, these extensions are enabled by default.

    The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.

    This is the default choice for the x86-64 compiler.
    sse,387
    Attempt to utilize both instruction sets at once. This effectively double the amount of available registers and on chips with separate execution units for 387 and SSE the execution resources too. Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well resulting in instable performance.
    And it's an i386 option. Perhaps some CPUs, while the registers are shared (I think they have to be), have separate execution hardware? Or, I could be wrong. It's happened before and it'll happen again.

  20. #20
    Member
    Join Date
    Jun 2002
    Location
    Isla Vista, CA
    Quote Originally Posted by Gnufsh
    really? Good to know. I wonder why gcc has this option then:

    And it's an i386 option. Perhaps some CPUs, while the registers are shared (I think they have to be), have separate execution hardware? Or, I could be wrong. It's happened before and it'll happen again.
    The reason is that some code runs faster in x87 (even on the P4) and some code runs faster in SSE. ICC does the same thing, it spits out a mixture of x87 and SSE code. However, they're rarely close enough to be executed in parallel by the instruction window, even an instruction window as large as the P4's.

    As far as I know, in hardware, the 2 instruction extensions use the same registers. They're treated separately by the ISA, but in reality, they're written to the same ones. You can, however, issue an x87 instruction and an SSE one at once. SSE does have scalar instructions, and that may be what they're referring to when they say both can be executed in parallel. Although at that point, I'm not sure why you wouldn't just use 2 parallel SSE scalar instructions.
