
Cell, P4, gflops and mhz... (PS3/Xbox2/PC)


OC Noob
I posted this in a gaming thread and thought someone here could probably answer my question. This is, after all, the CPU section:)

"This is one thing Sony does extremely well. Hype. I'm still trying to understand if this thing is the greatest thing since sliced bread or just an overhyped piece of silicon.

They say it does 256 GFLOPS @ 4.6 GHz. Can it even reach those speeds, or is that just theoretical?

A P4 3.2 can do a theoretical 59 GFLOPS, and a Radeon X800 can do 200 GFLOPS according to its documentation.

According to that presentation on Cell, the PS3's bandwidth limits Cell to 1/10 of its peak (256 GFLOPS), so about 25 GFLOPS. If that's its theoretical limit, then a P4 is theoretically faster than what Cell will be in the PS3.


What the heck, I need some computing genius to make some sense out of this crapola."
 
From what I understand, FLOPS are what it's pretty good at, as long as it's not bandwidth limited (which it is, due to the lack of a decent-sized cache). The 4.6 GHz I see as a theoretical speed; it just doesn't seem realistic to me.
 
So does it mean anything that Cell can do 256 GFLOPS at 4.6 GHz, a P4 can do 59 @ 3.2 GHz, and an X800 can do 200 GFLOPS?

Probably all theoretical numbers, but can we use them to make any kind of a comparison between the three?

And why use CPUs when GPUs are so good at GFLOPS?


Sorry for the stupid questions; I really don't know much about FLOPS.
 
OC Noob said:
And why use CPUs when GPUs are so good at GFLOPS?
Well, off the top of my head: flexibility and integer operations. GPUs are very good at what they do, and they're becoming more programmable, but they can't handle the sort of general-purpose work CPUs can. Many programs also rely on integer code, which desktop CPUs are much better at than GPUs. And note that the 256 GFLOPS figure is for single-precision FP operations; CPUs generally work in double precision, and Cell's figure is much lower when double-precision FP ops are used instead of single precision.
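For anyone curious, here's a minimal C sketch of that single- vs double-precision point, using SSE2 intrinsics (the function names are just my own illustration). A 128-bit SSE register holds 4 floats but only 2 doubles, so the peak FLOPS figure halves as soon as you go double precision:

#include <emmintrin.h> /* SSE2 intrinsics */

/* 4 single-precision adds from one instruction */
void add4_single(const float *a, const float *b, float *out) {
    __m128 va = _mm_loadu_ps(a);
    __m128 vb = _mm_loadu_ps(b);
    _mm_storeu_ps(out, _mm_add_ps(va, vb));
}

/* only 2 double-precision adds per instruction */
void add2_double(const double *a, const double *b, double *out) {
    __m128d va = _mm_loadu_pd(a);
    __m128d vb = _mm_loadu_pd(b);
    _mm_storeu_pd(out, _mm_add_pd(va, vb));
}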
 
It all depends!

That's the correct thing to say. Speed depends on how the CPUs are built, their bandwidth, cache, and instruction sets, and especially on how the programs are written.

The Cell processor will be amazing... no doubt about it... it will kick some a**... Whether it will take over the PC world... maybe... but I doubt that.
 
hmmm, thanks for the info.


I'm still not sold on this thing. Sounds like a lot of hype and numbers that are meaningless. I guess we'll see what it does when "the rubber meets the road."


I should say I'm not sold on this thing when it comes to PCs or PC chip applications (servers). For consoles I have no doubts it will be great. Console games will be coded specifically for Cell, whereas PC software won't be, or will be limited to whatever does get coded for it.

By the time Cell gets even a decent software base, Itanium (or some other advanced chip) could be out with an MS OS to support it.
 
OC Noob said:
A P4 3.2 can do a theoretical 59 GFLOPS
Where'd you get that bit of info? I can't for the life of me get the math (or even Google :D) to give me that number...

59 GFLOPS @ 3.2 GHz ≈ 18 floating point ops/cycle

Somebody wanna enlighten me as to how 18 floating point values can appear to be modified in only one clock tick? I mean, even if you use SSE, that only messes with 4 pieces of data, leaving you with 14 yet to be touched. If we go for dual parallel SSE pipes, we're still 10 ops too short. Assume that hyperthreading will magically allow you to perform two operations in one cycle as well ( :rolleyes: ), and we're STILL 2 operations too short...
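For reference, here's the back-of-the-envelope check as a trivial C program (the 4 ops/cycle figure assumes one 4-wide SSE instruction retiring per cycle, which is my assumption, not a vendor spec):

#include <stdio.h>

int main(void) {
    double ghz = 3.2;
    /* one 4-wide SSE instruction per cycle */
    printf("4 ops/cycle x %.1f GHz = %.1f GFLOPS\n", ghz, 4.0 * ghz);
    /* what the 59 GFLOPS claim would require */
    printf("59 GFLOPS / %.1f GHz = %.1f ops/cycle\n", ghz, 59.0 / ghz);
    return 0;
}

That prints 12.8 GFLOPS and 18.4 ops/cycle, which is where my 18 comes from.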

JigPu
 
Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...

OC Noob: MS already has an OS out for the Itanium. Or were you talking about the Cell?
 
I read that Cell is really only about 25 GFLOPS when the floating point actually has to be accurate. Also, the code needs to be specialized to work with Cell, and in the PC market specialized code is hard to come by.
 
Gnufsh said:
Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...
I thought those were integer, not FP... Oh well, goes to show how much attention I pay to chips these days :D

Regardless, fully loading the 2 adders (which I assume aren't SIMD) is only 4 ops/cycle. Again, assuming that HT somehow magically lets you execute two instructions in the same clock cycle gives only 8 ops/cycle. Still quite a bit less than 18....

JigPu
 
That's not possible; the x87 FPU and the SSE unit share the same execution hardware. They even share an issue port. And you cannot issue multiply and add instructions in parallel, nor two of the same kind back to back at full rate. To get the full throughput of one SSE instruction per cycle, you must alternate multiply and add instructions. The previous poster is correct: at most you can get 4 SP FP ops per cycle on a P4, which leads to 12.8 GFLOPS on a 3.2 GHz P4.
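A rough sketch of that alternating pattern in C intrinsics (a hypothetical loop, just to show the mul/add interleave, not tuned P4 code):

#include <xmmintrin.h> /* SSE intrinsics */

/* Interleaving a 4-wide multiply with a 4-wide add keeps the FP port fed
   with alternating instruction types, sustaining about 4 SP ops/cycle. */
void mul_then_add(const float *a, const float *b, float *out, int n) {
    int i;
    for (i = 0; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 prod = _mm_mul_ps(va, vb);  /* 4 multiplies */
        __m128 sum = _mm_add_ps(prod, vb); /* 4 adds */
        _mm_storeu_ps(out + i, sum);
    }
}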

The post on Anandtech is incorrect, as the person gets 24 by assuming there's a multiply-accumulate (FMAC) instruction. SSE/2/3 does not have FMAC instructions like Cell or AltiVec do.
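For comparison, this is what a fused multiply-accumulate looks like with AltiVec intrinsics (a minimal sketch; vec_madd computes a*b + c across 4 single-precision lanes, so one instruction counts as 8 FP ops):

#include <altivec.h>

vector float fmac4(vector float a, vector float b, vector float c) {
    return vec_madd(a, b, c); /* 4 muls + 4 adds in a single instruction */
}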
 
Gnufsh said:
Well, there are also the 2 FP ADDers that operate at twice the frequency of the chip...

OC Noob: MS already has an OS out for the Itanium. Or were you talking about the Cell?

Sorry, have to be quick, but yeah, I was talking about Cell. I just stuck in that Itanium could have an MS OS by then... but anything could happen :D

EDIT: Didn't know its software was by MS. Thanks for the info. I should have said an MS OS for home users.


JigPu said:
Where'd you get that bit of info? I can't for the life of me get the math (or even Google :D) to give me that number...

59 GFLOPS @ 3.2 GHz ≈ 18 floating point ops/cycle

Somebody wanna enlighten me as to how 18 floating point values can appear to be modified in only one clock tick? I mean, even if you use SSE, that only messes with 4 pieces of data, leaving you with 14 yet to be touched. If we go for dual parallel SSE pipes, we're still 10 ops too short. Assume that hyperthreading will magically allow you to perform two operations in one cycle as well ( :rolleyes: ), and we're STILL 2 operations too short...

JigPu


I did an internet search for gflops and Pentium, and the number was quoted on a few boards where people had done the math.

I'll link it when I have more time:)


EDIT:


Found it referenced at a handful of places, but I'm having a hard time finding the boards that did the math to get that. Not saying it's right either, but I'd like to find it so you guys can take a look and tell me if it's right or wrong.

http://www.abovetopsecret.com/forum/thread75492/pg2

http://www.geek.com/news/geeknews/2004Jul/wbc20040728026217.htm

http://www.nvnews.net/vbulletin/showthread.php?p=434532


Oh well, I give up for tonight. Sounds like that number isn't right anyway.
 
Gigaflops are not the reason you have a CPU for gaming; the GPU will always obliterate it there. A CPU for gaming needs strong integer performance, which it sounds like Cell sucks at. GJ IBM, build a chip that looks great on paper but still lacks a key component for gaming.
 
Gnufsh said:
Ah, I assumed the SSE/x87 execution hardware was different (I'm fairly certain it is in the P3, and I'm 100% certain it is in some x86 CPUs).

As far as I know, no x86 MPU uses separate execution hardware. The only chips I'm aware of that have separate vector units are the PPC G4 and 970. Rarely, if ever, will you have code that uses both scalar and vector instructions within any reasonable window, so such a feature would be pretty useless as well as costly.
 
man_utd said:
Gigaflops are not the reason you have a CPU for gaming; the GPU will always obliterate it there. A CPU for gaming needs strong integer performance, which it sounds like Cell sucks at. GJ IBM, build a chip that looks great on paper but still lacks a key component for gaming.

IMHO, it sounds like Cell is supposed to replace GPUs rather than CPUs. Fully programmable vector processors are much better from a programming point of view than shader programming on the GPU. There will still be a graphics processor, but I doubt it'll be that powerful. Looks like we're moving back to the days of software rendering, with the graphics subsystem there just for drawing.
 
imgod2u said:
As far as I know, no x86 MPU uses separate execution hardware. The only chips I'm aware of that have separate vector units are the PPC G4 and 970. Rarely, if ever, will you have code that uses both scalar and vector instructions within any reasonable window, so such a feature would be pretty useless as well as costly.
really? Good to know. I wonder why gcc has this option then:
-mfpmath=unit
Generate floating point arithmetics for selected unit unit. The choices for unit are:

387
Use the standard 387 floating point coprocessor present majority of chips and emulated otherwise. Code compiled with this option will run almost everywhere. The temporary results are computed in 80bit precision instead of precision specified by the type resulting in slightly different results compared to most of other chips. See -ffloat-store for more detailed description.

This is the default choice for i386 compiler.
sse
Use scalar floating point instructions present in the SSE instruction set. This instruction set is supported by Pentium3 and newer chips, in the AMD line by Athlon-4, Athlon-xp and Athlon-mp chips. The earlier version of SSE instruction set supports only single precision arithmetics, thus the double and extended precision arithmetics is still done using 387. Later version, present only in Pentium4 and the future AMD x86-64 chips supports double precision arithmetics too.

For i387 you need to use -march=cpu-type, -msse or -msse2 switches to enable SSE extensions and make this option effective. For x86-64 compiler, these extensions are enabled by default.

The resulting code should be considerably faster in the majority of cases and avoid the numerical instability problems of 387 code, but may break some existing code that expects temporaries to be 80bit.

This is the default choice for the x86-64 compiler.
sse,387
Attempt to utilize both instruction sets at once. This effectively double the amount of available registers and on chips with separate execution units for 387 and SSE the execution resources too. Use this option with care, as it is still experimental, because the GCC register allocator does not model separate functional units well resulting in instable performance.
And it's an i386 option. Perhaps some CPUs, while the registers are shared (I think they have to be), have separate execution hardware? Or I could be wrong. It's happened before and it'll happen again.
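For what it's worth, the flags from that page get used something like this (hypothetical file name; as the docs say, -mfpmath=sse needs -msse or -msse2 on 32-bit targets):

gcc -O2 -mfpmath=387 foo.c
gcc -O2 -msse2 -mfpmath=sse foo.c
gcc -O2 -msse2 -mfpmath=sse,387 foo.c

The first is the classic x87 stack FPU, the second routes scalar FP through SSE2, and the third is the experimental mixed mode the docs warn about.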
 
Gnufsh said:
really? Good to know. I wonder why gcc has this option then:

And it's an i386 option. Perhaps some CPUs, while the registers are shared (I think they have to be), have separate execution hardware? Or I could be wrong. It's happened before and it'll happen again.

The reason is that some code runs faster in x87 (even on the P4) and some code runs faster in SSE. ICC does the same thing, it spits out a mixture of x87 and SSE code. However, they're rarely close enough to be executed in parallel by the instruction window, even an instruction window as large as the P4's.

As far as I know, in hardware the two instruction extensions use the same registers. They're treated separately by the ISA, but in reality they're written to the same ones. You can, however, issue an x87 instruction and an SSE one at once; SSE does have scalar instructions, and that may be what they're referring to when they say both can be executed in parallel. Although at that point, I'm not sure why you wouldn't just use two parallel SSE scalar instructions.
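In case "scalar SSE" is unfamiliar, a quick C illustration (my own example): the ss-suffixed instructions touch only the lowest lane of the register, one float at a time:

#include <xmmintrin.h>

float scalar_add(float a, float b) {
    /* _mm_add_ss adds only the low single-precision lane */
    __m128 r = _mm_add_ss(_mm_set_ss(a), _mm_set_ss(b));
    return _mm_cvtss_f32(r);
}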
 