PDA

View Full Version : Why increase voltage? What is the proverbial OC wall? Read for some answers!

icesaber
12-30-04, 09:49 PM
A lot of people have the abstract idea that a CPU is like an electric motor: to make it go faster, you just feed it more voltage. I want to clear up that myth for those who don't already know. I'd also like to touch on why CPUs generally have a maximum point they will OC to before they just won't do any more. Anybody who can put it more eloquently should obviously feel free to correct me.

The truth of the matter is that the amount of voltage needed for a CPU to operate is based on tolerances for binary values. In a CPU, or any transistor for that matter, there isn't really a 0 or 1 value; there are only low and high voltages. The tolerances are what I call the minimum point at which the transistor recognizes the voltage as a 1 (I will refer to this as the "low tolerance") and the maximum point at which a 0 is recognized (the "high tolerance"). It can be helpful to look at this pair of tolerances as a band, since any value that falls between the two may produce an unknown output (varies by chip). As a side note, some chips use "low assertion," which simply means that a low voltage is read as a 1 instead of vice versa.
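A tiny sketch of the tolerance-band idea (the threshold voltages here are made up for illustration, not from any real chip):

```python
# Hypothetical tolerance band: below V_IL reads as 0, above V_IH reads as 1,
# and anything in between is undefined -- the "unknown output" region.
V_IL = 0.8   # "high tolerance": max voltage still recognized as a 0 (assumed)
V_IH = 2.0   # "low tolerance": min voltage recognized as a 1 (assumed)

def read_bit(voltage):
    """Classify a voltage as 0, 1, or None (inside the tolerance band)."""
    if voltage <= V_IL:
        return 0
    if voltage >= V_IH:
        return 1
    return None  # gray area: the actual output varies by chip

print(read_bit(0.3))  # 0
print(read_bit(2.8))  # 1
print(read_bit(1.4))  # None
```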
When you overclock a CPU, less time is available per cycle for the transistors in the operational units. There may no longer be enough time for the voltage to propagate across those microscopic wires between transistors for the necessary voltage to accumulate, and we could get erratic responses from the CPU: some of the 1 bits drop below the low tolerance and become unknown values within the band, or even 0's if the band is narrow enough to allow it.
If we increase the voltage, we increase the current, thereby decreasing the amount of time required for that voltage to build up at the transistor and produce a 1.
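To make that concrete, here's a rough sketch treating the wire as a simple RC charging circuit (a big simplification; all values are arbitrary). The time for the receiving end to cross a fixed logic threshold shrinks as the supply voltage rises:

```python
import math

# RC charging model: v(t) = v_supply * (1 - exp(-t/RC)).
# We ask how long until the node crosses a fixed threshold V_TH.
RC = 1.0      # time constant, arbitrary units (assumed)
V_TH = 1.0    # threshold the receiving transistor must see (assumed)

def time_to_threshold(v_supply):
    """Solve v_supply * (1 - exp(-t/RC)) = V_TH for t."""
    return RC * math.log(v_supply / (v_supply - V_TH))

# A higher supply voltage reaches the same threshold sooner:
print(time_to_threshold(1.5))   # slower
print(time_to_threshold(1.8))   # faster
```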
Realistically, the tolerances are the difference between a high-end and a low-end chip. Say you have a 3.0GHz P4 and a 3.4GHz one. If they have the same core, it is very possible that the only differences are the tolerances. The transistors in the 3.4 may have a reduced low tolerance, so the correct output is produced sooner without needing to increase the voltage. So in reality, a faster chip does not necessarily have a "higher quality" core, just a downward-shifted tolerance band.

The reason why we can only get so much performance out of the chip is because of the system of overhead used in the pipelining process. Just like any assembly line, the pipeline in a CPU can only perform each step as fast as the slowest stage in the pipeline. For example, if the pipeline stages have the times 8-2-2-3-2-10, we would have to operate every step at 10 or above so all following instructions in the pipeline have time to complete the last stage. Most chips are actually set up to have a safety region, a kind of overhead. In the case of the above example, we may actually make the time 15 or even 20 just to make sure everything completes correctly (also varies by speed of CPU, see 3.0-3.4 comparison, same core).
If we overclock it too far, we may actually make the time allowed less than that minimum, like 9. If the amount of time gets too low, instructions may be unable to complete one or more stages of the pipeline, producing erratic and almost always unbootable results. This is one of the proverbial walls to overclocking a CPU. The other primary walls are heat and electron migration.
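A sketch of that timing rule, using the hypothetical stage times from above:

```python
# The clock period must cover the slowest pipeline stage, scaled up by a
# safety margin (the "overhead" described above). Numbers are illustrative.
stage_times = [8, 2, 2, 3, 2, 10]

def min_clock_period(stages, margin=1.5):
    """Clock period = slowest stage time times a safety margin."""
    return max(stages) * margin

print(min_clock_period(stage_times))  # 15.0 with a 1.5x margin

# Overclocking effectively shrinks the period; once it drops below the
# slowest stage's time, that stage can't finish and results go erratic.
print(min_clock_period(stage_times, margin=0.9) < max(stage_times))  # True
```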
Heat is an obvious one, because the more heat is produced the more voltage is required to perform the same task (reduced electrical efficiency). Realistically, we may get better performance out of a chip at a slightly slower speed. For example, my Athlon XP mobile Barton 2600+ may put out better synthetic benchmark results at 2.6GHz, but it boots faster, generally plays games smoother and moves around the OS better at 2.5GHz, solely based on the 5 degrees Celsius difference.
Electron migration is the point at which a chip will most likely die. It's when so much voltage is being put through a wire that some of it leaks off to a neighboring wire, producing erratic results. You can compare it to a case of a river during a heavy rainstorm; Once the water level overruns the edges, it may actually erode fresh streams and offshoots from the original river, which will keep flowing after the storm has passed. If this happens to the CPU, the chip is almost guaranteed finished. You may as well make it a new hood ornament. Since heat is directly related to molecular motion, it's plain to see that higher operating temperatures can easily increase the risk of electron migration. The simple solution to this is better cooling. Bear in mind that electron migration really occurs no matter what because of temperatures and the nature of electrons, but that's why the tolerances exist. That way, a certain amount can occur without producing unexpected results. There is no sure way to tell when this has killed your chip, it is just one possible way that a CPU can burn out (although it's quite common). So if your CPU suddenly burns out, this may be the culprit.
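For the curious, the standard engineering model for electromigration lifetime is Black's equation (not mentioned in the post itself; every constant below is purely illustrative, not real chip data):

```python
import math

# Black's equation: MTTF = A * J^-n * exp(Ea / (kB * T)).
# Higher current density J or higher temperature T shortens the lifetime.
K_B = 8.617e-5   # Boltzmann constant in eV/K

def mttf(current_density, temp_k, a=1.0, n=2.0, ea=0.7):
    """Mean time to failure from electromigration (illustrative constants)."""
    return a * current_density ** -n * math.exp(ea / (K_B * temp_k))

# Cooler and lower-current chips last longer, as the post argues:
print(mttf(1.0, 320) > mttf(1.0, 350))   # True
print(mttf(1.0, 320) > mttf(1.5, 320))   # True
```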

There's obviously a lot more to it than this, but questions and comments are still welcome as always.

*edited 2:09PM Friday 31 Dec 2004*
- inconsistencies: added "tolerance band" abstraction, tx to jbloudg20
*edited 6:23PM Saturday 01 Jan 2005*
- inconsistencies: details to electron migration and low assertion, tx to Captain Newbie

musawi
12-30-04, 09:55 PM
Great post, I vote sticky!

stratcatprowlin
12-31-04, 12:53 AM
I suppose this would pertain to GPUs as well? It would explain the performance drop from raising the core too high sometimes. Although a lack of voltage could be a reason there, too.

Flip-Mode
12-31-04, 08:49 AM
Learned something new, thanks.
Vote Sticky!

jnev_89
12-31-04, 11:18 AM
my head is still spinning, but a good read nonetheless. another vote for sticky

jbloudg20
12-31-04, 11:26 AM
As far as the threshold goes, there isn't a clear-cut point where voltage becomes high or low. Usually there is a middle region with no defined value. For a typical CMOS chip, 0 to 1.5 volts is a low and 3.5 to 5 volts is a high. In this middle "gray area" the state is uncertain; a given manufacturer cannot guarantee how their chip will react in this condition.

DohDoh
12-31-04, 01:11 PM
Sticky! :thup:

icesaber
12-31-04, 01:17 PM
As far as the threshold goes, there isn't a clear-cut point where voltage becomes high or low. Usually there is a middle region with no defined value. For a typical CMOS chip, 0 to 1.5 volts is a low and 3.5 to 5 volts is a high. In this middle "gray area" the state is uncertain; a given manufacturer cannot guarantee how their chip will react in this condition.

Tx for the input, jbloudg20, much appreciated. I've updated the post accordingly; let me know if that seems more accurate. Accuracy is my goal here.
And tx to everyone else for the support. If this does get stickied, it'll be my first.
I realize I don't post much, although I've been a member for some time... I just thought it was time to at least try to make a contribution.

Captain Newbie
12-31-04, 01:27 PM
You have essentially boiled down Hitechjb1's hundreds if not thousands of posts into a nice summary.

Electron migration is one of several killers. Carelessness in overclocking probably contributes to more processor deaths, IMO, since people take processors up in BIG steps with improper cooling (= dead silicon). You may say it's the same thing; really, it is, since the electron migration is a result of those huge steps.

I would be inclined to add that electron migration rates and temperatures are directly related, since temperature is a measure of the root-mean-square velocity of a substance's particles. In essence, they're really the same thing with the same results. This (as you have explained) is why most air coolers won't go as far as liquid or phase-change coolers: stuff bounces around at a slightly slower velocity and as such is less inclined to hop off the wire it's supposed to be on.

It should be noted that electromigration happens at all voltages and at all temperatures above zero kelvin (absolute zero) because of quantum uncertainty. However, 90%-99% of the time everything is where it's supposed to be, so we can throw the quantum argument out.

Some architectures actually use a high voltage as a zero, and a low voltage as a one (low-assertion), but the end result is the same. Increasing voltage gives more "potential" as it were to charge the transistors--essentially as you put it. Just being picky.

From this we can conclude that there are actually two maximum frequency/voltage points: One that is variable with temperature/electromigration (V sub rms), and another that is purely architectural. Regrettably, I don't think any of us have the resources (ergo, processor blueprints and an electrical engineering degree) to determine what either or both points are except through trial and error.

Coolness. Edit, reformat, and sticky it in a prominent place.

JimboZ88
12-31-04, 01:37 PM
This clears up SO much for me. This thread NEEDS to be a sticky, since I know that most people on the forums don't know why they have to increase the voltage on an o/c. Thanks for a very informative post, icesaber, and props to Captain Newbie for his response as well.

Captain Newbie
12-31-04, 02:07 PM
In general, I believe that the scientist-overclocker is a dying breed, being replaced ever so slowly but surely by the enthusiast overclocker, who doesn't really care about such things as long as it is OMFG! my PC r0x0rs teh big one!!1!11!11!!!!1one. Overclocking is really all science...

stratcatprowlin
12-31-04, 04:12 PM
No doubt it needs to be stuck!

icesaber
01-01-05, 02:03 AM
It's late, will edit again in the morning. Tx again to everyone for the input.

beau_zo_brehm
01-01-05, 02:11 AM
If all of this information is factual, which I assume it is, then I also vote for sticky! :)

jenko
01-01-05, 08:09 AM
Great post!
So overclocking and running a CPU hot can hurt real-world performance? That's something I didn't know.

icesaber
01-02-05, 08:55 PM
I don't have all the technical info behind it, but in my experience, overclocking too far while the heat is too high can actually reduce performance. It definitely has something to do with reduced efficiency of the logic circuits at higher temperatures, most likely in the cache memory, since the cache is the biggest source of latency in the pipeline architecture (to my knowledge) and would therefore produce the most noticeable difference... hence better performance from extreme cooling. I'm running air-cooled, so I'm quite sure this chip could handle 2.6GHz and beyond beautifully if it were watercooled. No cash for that for now.

Actually, my OC is stable as long as the vcore doesn't go below 1.7v, but my NF7 seems to droop... 1.75 is the bios setting I use, when the monitor actually reads 1.72. Not sure if I should trust that reading, seems dubious.

stratcatprowlin
01-02-05, 09:53 PM
How cold can you get your room? Maybe run a night of benchies on a cold night to see if the cold negates the performance drop?

icesaber
01-02-05, 09:59 PM
With the ambient case temp at 19 degrees Celsius, I still get the performance hit. That's all the windows open on a snowy sub-freezing day.

geko
01-03-05, 07:08 PM
Hey, this might explain why a couple of times I've gotten better bench results (over 50 points or so) from a lowered voltage at the same speed.

Elif Tymes
01-03-05, 08:57 PM
It's an excellent, informative article.

I ran into a lot of the same issues when I was overclocking; now I'm stable, and fast, and running fancy free :D

icesaber
01-04-05, 10:18 AM
Hey, this might explain why a couple of times I've gotten better bench results (over 50 points or so) from a lowered voltage at the same speed.

Indeed it would. More voltage => more heat; more heat => less efficient energy transfer.
Unfortunately, that trick won't work on my CPU :) I already have the vcore as low as it can go before it becomes unstable...

Captain Newbie
01-08-05, 09:55 AM
I am not content to let this sink to the bottom of the forum, to be forgotten.

*bump*

Seriously stickyworthy.

stratcatprowlin
01-10-05, 04:15 PM
Bump for sticky!

Mr_Obvious
01-15-05, 04:15 PM
Two thumbs up, and if I had more than two, I'd put them up too. Excellent material!!

JigPu
01-21-05, 01:47 PM
/me will probably stick it, but before I do, I have some comments/questions.

The reason why we can only get so much performance out of the chip is because of the system of overhead used in the pipelining process. Just like any assembly line, the pipeline in a CPU can only perform each step as fast as the slowest stage in the pipeline. For example, if the pipeline stages have the times 8-2-2-3-2-10, we would have to operate every step at 10 or above so all following instructions in the pipeline have time to complete the last stage. Most chips are actually set up to have a safety region, a kind of overhead. In the case of the above example, we may actually make the time 15 or even 20 just to make sure everything completes correctly (also varies by speed of CPU, see 3.0-3.4 comparison, same core).

If we overclock it too far, we may actually make the time allowed less than that minimum, like 9. If the amount of time gets too low, instructions may be unable to complete one or more stages of the pipeline, producing erratic and almost always unbootable results. This is one of the proverbial walls to overclocking a CPU. The other primary walls are heat and electron migration.
The bolded sentence is confusing and making it hard for me to understand exactly what you are trying to say. If I am reading it right, you seem to be saying that the chip designers slow down the speed of each pipeline stage (by increasing the number of cycles an instruction will stay in the stage) to some "lowest common denominator". This makes sense, but there are some stages of the pipeline (for example, Fetch) where it doesn't seem possible to predict the amount of time they'll take. If the data for a Fetch is in L1, you've got a cycle or two of wait. If it's in L2, you've got 3 or 4 cycles. If it's in main memory, you've potentially got dozens of cycles. Paged to disk, and we're talking thousands or more. While they probably do "size" each pipeline stage to some specific number of cycles, I don't think they necessarily base it off the worst-case scenario (since I doubt they're waiting 1000 cycles for each pipeline stage just in case they need something that has been paged). Should an instruction take longer than expected, the pipeline just stalls.

Also, if I am reading it correctly, how can overclocking decrease the time that they've designed each instruction to take? The pipelines will wait 10 cycles regardless of whether they're running at 3GHz or 4GHz. Unless they're designing so that each pipeline stage takes X seconds (which wouldn't really make sense), I don't see how this could actually happen...

Very good read except for that part :)
JigPu

icesaber
01-21-05, 02:42 PM
Memory fetches, writes, etc. are a special case. You are indeed correct: a fetch will stall an instruction for an amount of time that depends on where the fetch is from. My intent was to show that generally speaking, for the operative stages of the pipeline, the speeds of the other stages are based on the slowest one. If you want to get really technical, some of the stages are actually longer stages split over a period of time, but the circuitry needs to be designed to handle that. You can't have a stage that is wired to run in one clock cycle run in two just because it will take that long; it will instead give unexpected outputs.
Some engineers would actively split the 10-unit time I gave into two separate 5-unit intervals, so that the new longest cycle time was actually 8. However, introducing a new stage introduces another interlock penalty, so in effect it could potentially worsen performance. Such is the nature of electrical engineering :) it's all about hunting for that sweet spot. I don't know, I guess I was trying not to be too technical about it...

All stages in the pipeline (other than a fetch/write) are given the same amount of time to complete because we're working with synchronous machines. The way the timing is handled is generally the worst case scenario, depending on headroom; the headroom is extra time left during the cycle after the stage has actually completed its work. Overclocking effectively reduces the headroom available, and eventually you will find the limit of the particular chip when you not only run out of headroom but start reducing the amount of time available for the stage's work itself to complete. Then, the stage is unable to complete and 'interesting' errors will likely (almost guaranteed, actually) occur.
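A rough sketch of that headroom idea (the stage's settling time is a made-up number):

```python
# Headroom = time left in one clock cycle after a stage's real work is done.
# Overclocking shortens the period while the stage's physical settling time
# stays fixed, so headroom shrinks and eventually goes negative.
def headroom_ns(stage_time_ns, freq_ghz):
    """Leftover time per cycle for a stage needing stage_time_ns to settle."""
    period_ns = 1.0 / freq_ghz
    return period_ns - stage_time_ns

# A hypothetical stage needing 0.28 ns of settling time:
print(headroom_ns(0.28, 3.0))   # positive: stable at 3 GHz
print(headroom_ns(0.28, 4.0))   # negative: the stage can't finish at 4 GHz
```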

I'll edit later tonight, I have to leave for work shortly.

peace.

Captain Newbie
02-02-05, 07:37 PM
*kicks this thread which really should be stuck back up to the top of the forum*

02-02-05, 07:49 PM
There may no longer be enough time for the voltage to propagate across those microscopic wires between transistors for the necessary voltage to accumulate

Voltage does not propagate, electrons do.

Heat is an obvious one, because the more heat is produced the more voltage is required to perform the same task (reduced electrical efficiency).
He means that as heat increases, the resistivity of the interconnects increases, and hence electrical efficiency decreases. Also, more resistance = higher heat output, so an overheated component is converting more of its electrical energy to heat.

A pretty good effort. Stick!
:thup:

Gnufsh
02-02-05, 09:34 PM
Electron migration is the point at which a chip will most likely die. It's when so much voltage is being put through a wire that some of it leaks off to a neighboring wire, producing erratic results. You can compare it to a case of a river during a heavy rainstorm; Once the water level overruns the edges, it may actually erode fresh streams and offshoots from the original river, which will keep flowing after the storm has passed. If this happens to the CPU, the chip is almost guaranteed finished. You may as well make it a new hood ornament. Since heat is directly related to molecular motion, it's plain to see that higher operating temperatures can easily increase the risk of electron migration. The simple solution to this is better cooling. Bear in mind that electron migration really occurs no matter what because of temperatures and the nature of electrons, but that's why the tolerances exist. That way, a certain amount can occur without producing unexpected results. There is no sure way to tell when this has killed your chip, it is just one possible way that a CPU can burn out (although it's quite common). So if your CPU suddenly burns out, this may be the culprit.
If what you're talking about is electromigration, you are a little off. "Voltage" isn't leaking to neighboring wires, and neither is current. The "wires" (interconnects, in a CPU) actually end up moving due to momentum transfer from electrons. Here's a great link:
http://www.csl.mete.metu.edu.tr/Electromigration/emig.htm
They have some awesome pictures of actual circuits being opened or shorted by electromigration. The rest is right on, though: increased heat and increased current increase electromigration. On the other hand, it became a much smaller issue when companies began to use Cu interconnects instead of Al ones.

02-02-05, 09:58 PM
In the spirit of the excellent point raised by Gnufsh allow me to add a bit of current research to it.

What has been ignored thus far is the Casimir force. Hendrik Casimir, a Dutch physicist, wrote a paper which said that there is an attractive force between two closely spaced surfaces, even in vacuum. This attraction is not electrostatic but arises from the exchange of virtual photons between the two surfaces.

Vacuum as we know it is not empty, but a bath of fluctuating energy. Particle pairs are spontaneously created and destroyed. The particles carry momentum and hence energy. They cannot be detected, because that would violate Heisenberg's uncertainty principle between energy and time; in brief, the particles are so short-lived that they cannot be detected.

How does this apply to semiconductors? As you pack things closer together, there can be motion both from electromigration, as said by Gnufsh, and from the Casimir force. The Casimir force exists regardless of whether there is a current or not. This has been ignored by engineers till now.

An excellent write-up can be found here:-
http://physicsweb.org/articles/world/15/9/6

Pay particular attention to this paper in the reference list:
S. K. Lamoreaux 1997, "Demonstration of the Casimir force in the 0.6 to 6 micrometer range", Phys. Rev. Lett. 78, 5.
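For scale, the textbook parallel-plate formula P = pi^2 * hbar * c / (240 * d^4) gives roughly a millipascal at one micrometer, the low end of the Lamoreaux measurement range:

```python
import math

# Casimir pressure between ideal parallel plates (standard formula, not
# anything from the post itself).
HBAR = 1.0546e-34   # reduced Planck constant, J*s
C = 2.998e8         # speed of light, m/s

def casimir_pressure(d_m):
    """Attractive pressure (Pa) between plates separated by d_m meters."""
    return math.pi ** 2 * HBAR * C / (240 * d_m ** 4)

print(casimir_pressure(1e-6))   # ~1.3e-3 Pa at 1 micrometer
# The 1/d^4 scaling is why it matters at chip feature sizes:
print(casimir_pressure(1e-8) / casimir_pressure(1e-6))  # factor of 1e8 at 10 nm
```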

Gnufsh
02-03-05, 10:30 AM
Memory fetches, writes, etc. are a special case. You are indeed correct: a fetch will stall an instruction for an amount of time that depends on where the fetch is from. My intent was to show that generally speaking, for the operative stages of the pipeline, the speeds of the other stages are based on the slowest one. If you want to get really technical, some of the stages are actually longer stages split over a period of time, but the circuitry needs to be designed to handle that. You can't have a stage that is wired to run in one clock cycle run in two just because it will take that long; it will instead give unexpected outputs.
Some engineers would actively split the 10-unit time I gave into two separate 5-unit intervals, so that the new longest cycle time was actually 8. However, introducing a new stage introduces another interlock penalty, so in effect it could potentially worsen performance. Such is the nature of electrical engineering :) it's all about hunting for that sweet spot. I don't know, I guess I was trying not to be too technical about it...

All stages in the pipeline (other than a fetch/write) are given the same amount of time to complete because we're working with synchronous machines. The way the timing is handled is generally the worst case scenario, depending on headroom; the headroom is extra time left during the cycle after the stage has actually completed its work. Overclocking effectively reduces the headroom available, and eventually you will find the limit of the particular chip when you not only run out of headroom but start reducing the amount of time available for the stage's work itself to complete. Then, the stage is unable to complete and 'interesting' errors will likely (almost guaranteed, actually) occur.

I'll edit later tonight, I have to leave for work shortly.

peace.
There is some work being done on clockless processors. I know Intel did some a while back, and I think Sun is working on it now. You have to add additional circuitry to synchronise some things, but you end up getting a faster processor by letting some tasks complete faster than they otherwise would.

It looks like ARM is actually going to use some similar technology in chips:
http://www.handshakesolutions.com/ARM_processor.html

For those who don't know, I believe ARM is a company that designs CPUs but doesn't actually manufacture any. This one is just going to be a design people can license to make, and was supposed to be available Q1 2005.

prankstar008
02-16-05, 10:11 PM
as they say in pig latonia
tickysay

(or would it be ickystay)
either way....could this get some glue?

LoneWolf121188
02-21-05, 07:47 PM
Why hasn't this been stickied yet?!?! This answered a lot of questions for me!

stratcatprowlin
02-22-05, 12:22 AM
Bump.