A lot of people have the abstract idea that a CPU is like an electric motor, and that to make it go faster you have to raise the voltage. I want to clear up that myth for those who don't already know. I'd also like to touch on why CPUs generally have a maximum point they will overclock (OC) to before they just won't go any further. Anybody who can put it more eloquently should feel free to correct me.
The truth of the matter is that the amount of voltage a CPU needs to operate is based on tolerances for binary values. In a CPU, or any transistor for that matter, there isn't really a 0 or 1 value; there are only low and high voltages. The tolerances are what I call the minimum voltage at which the transistor recognizes the input as a 1 (I will refer to this as the "low tolerance"; datasheets usually call it V_IH) and the maximum voltage at which a 0 is recognized (the "high tolerance", usually V_IL). It can be helpful to look at this pair of tolerances as a band, since any value that falls between the two may produce an unknown output (varies by chip). As a side note, some chips use "low assertion" (also called active low), which simply means that a low voltage is read as a 1 and a high voltage as a 0.
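To make the band idea concrete, here is a minimal sketch in Python. The two threshold voltages are made-up numbers for illustration, not figures for any real chip, and the function name read_bit is my own.

```python
# Hypothetical thresholds for illustration only; real values vary by chip.
LOW_TOLERANCE = 0.9   # volts: minimum voltage read as a 1
HIGH_TOLERANCE = 0.5  # volts: maximum voltage read as a 0


def read_bit(voltage, active_low=False):
    """Return 1, 0, or None (unknown) for a voltage at a transistor input."""
    if voltage >= LOW_TOLERANCE:
        level = 1
    elif voltage <= HIGH_TOLERANCE:
        level = 0
    else:
        return None  # inside the band: the output is not guaranteed
    # With low assertion, a low voltage means 1 and a high voltage means 0.
    return 1 - level if active_low else level


for v in (0.2, 0.7, 1.2):
    print(v, read_bit(v))
```

Anything that lands between the two thresholds comes back as None here, which stands in for "the chip may do anything."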
When you overclock a CPU, each cycle is shorter, so the transistors in the operational units get less time per cycle for their input voltage to build up. There may no longer be enough time for the signal to propagate across the microscopic wires between transistors and accumulate to the necessary voltage. When that happens, we can get erratic responses from the CPU: some of the 1 bits become unknown values within the band, or even 0s if the band is narrow enough to allow it, since they are now below the high tolerance.
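If we increase the voltage, we increase the current, which decreases the time required for that voltage to build up at the transistor and produce a 1. So the extra voltage doesn't make the chip faster by itself; it lets the signals settle within the shorter cycle.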
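A rough way to see this is to treat the wire and the transistor input as a simple resistor-capacitor (RC) circuit charging from 0 V. That is a first-order simplification, not how a real CPU is modeled, and the resistance, capacitance and threshold below are hypothetical values chosen only to show the trend.

```python
import math


def time_to_reach(v_target, v_supply, r_ohms, c_farads):
    """Time for an RC node charging from 0 V toward v_supply to reach v_target."""
    if v_target >= v_supply:
        return math.inf  # the node never gets that high
    # From V(t) = v_supply * (1 - exp(-t / (R * C))), solved for t.
    return -r_ohms * c_farads * math.log(1.0 - v_target / v_supply)


# Hypothetical numbers: 1 kilo-ohm, 100 femtofarads, 0.9 V minimum for a 1.
R, C, V_ONE = 1_000.0, 100e-15, 0.9

for v_supply in (1.5, 1.65, 1.8):
    t_ps = time_to_reach(V_ONE, v_supply, R, C) * 1e12
    print(f"{v_supply:.2f} V supply: {t_ps:.1f} ps to reach a 1")
```

In this model the time to reach a 1 shrinks as the supply voltage goes up, which is the effect described above.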
Realistically, the tolerances are often the difference between a high-end and a low-end chip. Let's say you have a 3.0GHz P4 and a 3.4GHz one. If they have the same core, it is very possible that the only differences are the tolerances. The transistors in the 3.4 may have a reduced low tolerance, so they produce the correct output at the higher clock without needing more voltage. So in reality, a faster chip does not necessarily have a "higher quality" core, just a downward-shifted tolerance band.
The reason we can only get so much performance out of a chip is the timing overhead built into the pipeline. Just like any assembly line, the pipeline in a CPU can only advance as fast as its slowest stage. For example, if the pipeline stages take the times 8-2-2-3-2-10, we would have to give every step a time of 10 or more so that each instruction in the pipeline has time to complete the slowest stage (here, the last one). Most chips are actually set up with a safety region, a kind of overhead. In the above example, we may actually make the time 15 or even 20 just to make sure everything completes correctly (this also varies by the speed of the CPU; see the 3.0-3.4 comparison, same core).
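The arithmetic in that example can be sketched in a few lines of Python. The stage times and the stock times of 15 and 20 come from the example above; the function names are just mine, and the units are arbitrary.

```python
STAGE_TIMES = [8, 2, 2, 3, 2, 10]  # stage times from the example, arbitrary units


def min_cycle_time(stage_times):
    """The shortest cycle that still lets the slowest stage finish."""
    return max(stage_times)


def headroom(stage_times, stock_cycle_time):
    """How many times faster than stock the clock could run before hitting the slowest stage."""
    return stock_cycle_time / min_cycle_time(stage_times)


print("minimum cycle time:", min_cycle_time(STAGE_TIMES))
for stock in (15, 20):
    print(f"stock time {stock}: headroom x{headroom(STAGE_TIMES, stock):.2f}")
```

The headroom figure is the safety region expressed as a multiplier: the bigger the margin the manufacturer left, the further the chip can be pushed before the slowest stage stops finishing.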
If we overclock it too far, we may actually make the time allowed less than that minimum, like 9. If the amount of time gets too low, instructions may be unable to complete one or more stages of the pipeline, producing erratic results and, almost always, a system that won't boot. This is one of the proverbial walls to overclocking a CPU. The other primary walls are heat and electromigration (often called electron migration).
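As a small illustration, this Python sketch checks which stages of the 8-2-2-3-2-10 pipeline from earlier would no longer finish at a given cycle time. The function name is mine and the model ignores everything except raw stage time.

```python
def stages_that_fail(stage_times, cycle_time):
    """Return the 1-based positions of stages that need more time than the cycle allows."""
    return [
        position
        for position, needed in enumerate(stage_times, start=1)
        if needed > cycle_time
    ]


stage_times = [8, 2, 2, 3, 2, 10]

for cycle_time in (15, 10, 9, 7):
    print(cycle_time, stages_that_fail(stage_times, cycle_time))
```

At a time of 9 only the sixth stage fails; push it down to 7 and the first stage fails as well.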
Heat is an obvious one: the hotter the chip runs, the higher the resistance in its wiring and the slower its transistors switch, so more voltage is required to perform the same task (reduced electrical efficiency). Realistically, we may get better performance out of a chip at a slightly slower speed. For example, my Athlon XP mobile Barton 2600+ may put out better synthetic benchmark results at 2.6GHz, but it boots faster, generally plays games smoother and moves around the OS better at 2.5GHz, which I put down to the 5 degrees Celsius difference.
Electromigration (often called electron migration) is the point at which a chip will most likely die. It's when so much current is pushed through a wire that the moving electrons gradually drag metal atoms along with them, thinning the wire in some places and piling metal up in others until the wire breaks or shorts to a neighboring wire, producing erratic results. You can compare it to a river during a heavy rainstorm: once the water level overruns the edges, it may actually erode fresh streams and offshoots from the original river, which will keep flowing after the storm has passed. If this happens to the CPU, the chip is almost guaranteed finished. You may as well make it a new hood ornament. Since heat is directly related to molecular motion, it's plain to see that higher operating temperatures can easily increase the risk of electromigration. The simple solution to this is better cooling. Bear in mind that some electromigration occurs no matter what, because of temperature and the nature of electrons, but that's part of why the tolerances exist: a certain amount can occur without producing unexpected results. There is no sure way to tell when this has killed your chip; it is just one possible way that a CPU can burn out (although it's quite common). So if your CPU suddenly burns out, this may be the culprit.
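For a feel of how strongly temperature and current matter, the usual rule of thumb is Black's equation, which puts the expected lifetime of a wire at roughly A / J^n * exp(Ea / (k*T)), where J is the current density and T is the absolute temperature. The Python sketch below only compares one operating point against another. The activation energy of 0.7 eV and the exponent of 2 are assumed, textbook-style values, not measurements for any particular CPU.

```python
import math

BOLTZMANN_EV_PER_K = 8.617e-5  # Boltzmann constant in eV/K


def relative_lifetime(temp_c, ref_temp_c, current_ratio=1.0,
                      activation_energy_ev=0.7, current_exponent=2.0):
    """Lifetime at temp_c divided by lifetime at ref_temp_c, per Black's equation.

    current_ratio is the new current density divided by the reference one.
    activation_energy_ev and current_exponent are assumed values.
    """
    t = temp_c + 273.15
    t_ref = ref_temp_c + 273.15
    thermal = math.exp(
        activation_energy_ev / BOLTZMANN_EV_PER_K * (1.0 / t - 1.0 / t_ref)
    )
    return thermal / current_ratio ** current_exponent


for temp_c in (50, 60, 70):
    ratio = relative_lifetime(temp_c, ref_temp_c=50)
    print(f"{temp_c} C: {ratio:.2f}x the lifetime at 50 C")
```

Under these assumptions, running hotter or pushing more current both shorten the expected lifetime, which fits the point above that better cooling is the simple remedy.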
There's obviously a lot more to it than this, but questions and comments are still welcome as always.
*edited 2:09PM Friday 31 Dec 2004*
- inconsistencies: added "tolerance band" abstraction, tx to jbloudg20
*edited 6:23PM Saturday 01 Jan 2005*
- inconsistencies: details to electron migration and low assertion, tx to Captain Newbie