Stability – What Is It? A Perspective

Those of us who overclock our CPUs talk about the stability of an overclocked CPU without defining the term. The usual measure of stability in overclocking reports is related to BSODs or “hang-ups”. As an old-time hardware engineer who started designing digital hardware before the first microprocessor was invented, I have always held near and dear the question of how fast we could push hardware while keeping it stable.

In the old days we did all kinds of worst-case timing analysis by hand to figure out how the timing in a circuit would change with various parameters, including temperature. Finally, we would take several examples of the circuit and run them in a temperature chamber with a scope on the critical signals. The usual practice was to find the critical combination of clock rate and temperature, then back off the clock rate by 20% or more as a safety factor after we had tweaked the circuit as much as we could.
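That kind of hand analysis can be sketched as a back-of-the-envelope calculation. All of the delay and derating numbers below are invented for illustration, not taken from any real datasheet:

```python
# Back-of-the-envelope worst-case timing check (illustrative numbers only).

GATE_DELAYS_NS = [2.0, 1.5, 3.0, 2.5]   # assumed worst-case delays on the critical path
TEMP_DERATE_PER_C = 0.003               # assumed +0.3% delay per degree C above 25 C
SAFETY_FACTOR = 0.20                    # back the clock off by 20%, as described above

def max_clock_mhz(temp_c: float) -> float:
    """Highest 'safe' clock rate after temperature derating and the safety factor."""
    path_ns = sum(GATE_DELAYS_NS)
    path_ns *= 1.0 + TEMP_DERATE_PER_C * max(0.0, temp_c - 25.0)
    period_ns = path_ns / (1.0 - SAFETY_FACTOR)   # leave 20% of the period as margin
    return 1000.0 / period_ns                     # period in ns -> clock in MHz

print(f"25 C: {max_clock_mhz(25):.1f} MHz")
print(f"70 C: {max_clock_mhz(70):.1f} MHz")
```

Running this shows the hotter chamber corner allowing a lower clock, which is exactly why the critical clock/temperature combination had to be found before the safety factor was applied.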

With VLSI design, the process of timing analysis is automated. The step of putting a circuit in a temperature chamber is still used, but it is too expensive to tweak the circuit, since that requires a new silicon run. The concept of the safety factor is still in use, with the “back off” value most likely even more than 20%. This safety factor is why we can often take processors to 50% over their rated clock rate.

What are the sources of instability? The basic physics of semiconductor devices says that each transistor can switch a given load in a certain amount of time at a given voltage and temperature. As the voltage rises, the switching rate rises slowly. As the temperature rises, the switching rate drops slowly. The exact rates are determined by the fine details of the actual transistor design, and these parameters are usually closely held design secrets.

How this affects us in overclocking is that when two signals are supposed to arrive at the same register at the same time, plus or minus a fraction of a nanosecond, and one of them is late because the transistor driving it has slowed down, then the register (or whatever) will hold the wrong information. If the register is supposed to contain a branch address, then we will go to the wrong place in the code. If the register contains data, then the computation will produce a wrong answer.
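That failure mode can be illustrated with a toy model of a register that samples its input at the clock edge. This is not a real circuit model; the arrival times and values are invented:

```python
# Toy model of a register latching a signal at the clock edge (invented numbers).

def register_sample(arrival_ns: float, clock_edge_ns: float,
                    new_value: int, old_value: int) -> int:
    """Latch new_value only if it arrived before the clock edge; else keep old_value."""
    return new_value if arrival_ns <= clock_edge_ns else old_value

# In spec: the signal arrives in time and the intended branch address is captured.
in_spec = register_sample(arrival_ns=9.8, clock_edge_ns=10.0,
                          new_value=0xA000, old_value=0x0000)

# Overclocked/hot: the driving transistor has slowed, the signal arrives late,
# and the register holds stale data -- a branch to the wrong place in the code.
overclocked = register_sample(arrival_ns=10.3, clock_edge_ns=10.0,
                              new_value=0xA000, old_value=0x0000)

print(hex(in_spec), hex(overclocked))
```

The same few tenths of a nanosecond that were harmless in spec become the difference between a correct and a corrupted register once the clock period shrinks.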

The rank-and-file overclocker does not have access to the detailed design data and almost certainly never will. You can be fairly certain that only a handful of people at Intel or AMD have this data, and they are not going to talk. What we do instead is turn up the clock/voltage until things stop working and then back off to some comfort point. Let us consider a stability scale based on the level of testing passed, with definitions as follows:

In Spec – This is the case where the processor is running at the design voltage and clock rate and within the rated temperature range. The manufacturer “guarantees” that the processor will work without a glitch, since they run some form of acceptance test using rather fancy test equipment, $100K–$500K per test station, and ship only the chips that pass.

Tested – This is the case where the processor is running at a voltage or clock rate higher than the original design and has passed the acceptance test. The rank-and-file overclocker does not have access to the test station required to perform these tests. It is very likely that the “high end” processors from Intel and AMD fall in this category.

Formally Tested – This is the case where the processor has passed some form of rigorous test: one that verifies the processor produces a known answer for a given set of input data. These tests are reproducible. The self-tests that are part of Prime95 fall in this category, but they exercise only the particular FPU and related operations that Prime95 uses.

Informally Tested – This is the case where the processor has passed a semi-informal test, in which some of the results are checked only for being in an allowable range. The test is not usually reproducible. The checks Prime95 makes on every iteration fall in this category.

Very Informally Tested – This is the case where the processor/software runs without obvious errors. The test is not reproducible. Most of the tests using CPU-intensive games fall in this category.

Smoke Tested – This is the case where the processor is able to boot Windows without crashing. The test is not really reproducible, since the Windows environment keeps changing. The name refers to the old-time test method for power amplifiers of turning up the power until smoke was produced.

POST – This is the case where the processor is able to pass the Power On Self Test in the BIOS. If you have ever looked at this test in a BIOS, you will realize that it is extremely primitive, comparable to testing a car by getting it to start.
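The difference between the “Formally Tested” and “Informally Tested” levels can be sketched in code. This is a hypothetical self-test pair for illustration, not Prime95’s actual algorithm:

```python
import math

# Formal test: recompute a value whose exact answer is known in advance.
# Reproducible pass/fail -- any single wrong bit fails the test.
def formal_test() -> bool:
    result = sum(i * i for i in range(1, 101))   # sum of squares 1..100
    return result == 100 * 101 * 201 // 6        # known closed form n(n+1)(2n+1)/6

# Informal test: only check that a result lands in an allowable range.
# A subtly wrong computation could still slip through this check.
def informal_test() -> bool:
    result = math.exp(1.0)
    return 2.0 < result < 3.0

print(formal_test(), informal_test())
```

A healthy processor passes both, but only the formal, known-answer style of check gives a reproducible verdict when something is marginal.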

Clearly there is a wide range of confidence available from this range of testing levels. I propose that we rate stability on the basis of the level of test that has been passed. Since I am a hard-core Prime95 freak, I use the Prime95 tests as my initial measure of stability. In addition, I do not consider a processor/clock-rate combination stable until it has been able to run Prime95 24/7 for several weeks without error.

For those who might think this is a bit extreme: I have had the experience where an air-cooled 600E appeared to run fine at 854, but generated a Prime95 error about every 20-30 hours until I backed the clock off to 843, where it runs without error. Prime95 runs in the background, and you can use the machine for whatever you want without affecting Prime95 other than slowing it down a bit.

Email Terry

