
CPU unstable no matter what I do, and it's getting worse

nstgc

Registered
Joined
Dec 21, 2013
I purchased parts for a new PC back in November and started building in early December. Despite nearly three months of effort, including two RMAs and a second RAM kit, my CPU seems even less stable than before. In fact, every time I think it's stable, it seems to suddenly become highly unstable. That is, I'll find BIOS settings which will let it run MemTest86 and MPrime (Prime95 for Linux) for hours on end, but then the next day, I'll try again and MPrime will repeatedly fail within minutes. I started with an undervolt via IA AC and IA DC and a static VCache offset. I was able to run MPrime with `IA AC = 0.19`, `IA DC = 1.02`, `VCache = -0.05`, RAM at DDR5-7000, and everything else at "ASUS stock" settings for about 12 hours before I patted myself on the back and called it stable. The next day it was anything but stable. Since then, over the past two months, I've slowly increased the voltage from there and reduced the power until it was running at Intel stock with no undervolt. Then yesterday it started failing MemTest86, too, something it had been passing consistently for over a month.

I've replaced the CPU and the motherboard (ASUS themselves said it was either the CPU or the motherboard after a lot of back and forth), and I've tried an entirely different RAM kit and a different BIOS. With current settings and my overkill cooler, temps stay below 80°C under even the heaviest load, which lately isn't very heavy since the CPU can't clock up to its advertised turbo speed (5.2GHz on P-core/3.7GHz on E-cores). In case it matters, both CPUs had an SP rating of 72.

The only thing I can think of is it's the PSU, but before I submit yet another RMA request, I wanted to ask all of you what you think. I've been building computers for over 20 years, but I've never encountered a run of problems like this.
 
Is the RAM on the motherboard's QVL list?

Does it work without enabling XMP?

Does it 'work' when you're actually using it (outside of stress tests)?
 
Is the RAM on the motherboard's QVL list?

Does it work without enabling XMP?

Does it 'work' when you're actually using it (outside of stress tests)?
Yes, both kits were. After a lot of back and forth, ASUS themselves said it was either the CPU or motherboard.

I have not tried it without XMP, but I should. After the latest MemTest failure I kind of threw up my hands. MPrime doesn't fail on the RAM test, only the CPU cache tests. Lately, rather than segfaults, I'm getting rounding errors.

No, it's failed outside of stress tests. I think it was due to some improperly compiled software. Recompiling it fixed it, but that still means there was a hardware failure, just at compile time rather than run time. Also, `s-tui`, a TUI monitoring suite, has a tendency to lock up on the new machine while MPrime is running. I haven't run much else in a while. Back in January, I had run a few games (FF14 and Soulstone Survivor), and they worked. I haven't tried that in a while, though.
 
Let's try the basics. I don't see anywhere saying you tested the CPU stock and ram with XMP off. If that doesn't work stably, something is really wrong. The only thing you can tinker with running CPU stock is power limit. Voltages are a no go area. Decline all performance "optimisations" the mobo may offer; if necessary, explicitly disable them rather than leaving them on Auto.

Without personal experience, it is my understanding DDR5-7000 is pushing it and it might just be out of reach for your combination of CPU, ram and mobo.

Undervolting won't help matters either. You're trading stability risk for potential performance. Max turbo clocks are not guaranteed to be hit. They're opportunistic.
 
In just the opening paragraph, my thought went directly to the PSU. RMA of the CPU, MB and RAM? PSU.

1) Try what @mackerel posted above.
2) How old is the PSU? Do you have a spare PSU to test with?
 
Let's try the basics. I don't see anywhere saying you tested the CPU stock and ram with XMP off. If that doesn't work stably, something is really wrong. The only thing you can tinker with running CPU stock is power limit. Voltages are a no go area. Decline all performance "optimisations" the mobo may offer; if necessary, explicitly disable them rather than leaving them on Auto.

Without personal experience, it is my understanding DDR5-7000 is pushing it and it might just be out of reach for your combination of CPU, ram and mobo.

Undervolting won't help matters either. You're trading stability risk for potential performance. Max turbo clocks are not guaranteed to be hit. They're opportunistic.
Actually, I've done all of that except disable XMP, and all tests relating to the RAM were fine until yesterday. I will test it later today or tomorrow for sure.

Part of what's so frustrating is that it was perfectly happy with DDR5-7000 just a week ago, and had been through all the tests up until now. At least since I got the new CPU. The previous CPU couldn't do above 6400.

As for whether I have a spare PSU... not really. I have the one in my old PC, but it's in use. I do have an old PSU from 10 years ago. I swapped it out because I thought it might be dead. Now I think it was probably the UPS instead. Either way it's 10 years old and I don't know for certain that I have all the cables still.
 
Without personal experience, it is my understanding DDR5-7000 is pushing it and it might just be out of reach for your combination of CPU, ram and mobo.
Preliminary testing shows that dropping XMP helps. While that's definitely good information, I feel it doesn't really answer the question of what's going on. The computer had been stable, but has been growing increasingly less stable over a period of a few months. That's not normal. Even if things work now with DDR5-5600, I don't have any reason to assume it will last.
 
Intel is generally good to ~7k without flinching. That doesn't mean there aren't IMC duds in the batch. It's possible your CPUs were on the edge at that speed.

If you don't need such fast ram, get something slower... 6600-6800 with the lowest CL. You're still overclocking (max for the platform is 5600) and you wouldn't notice the difference performance-wise. While you don't have faith it will last, you also have no idea it won't (the reasons are you're picking really fast RAM and not all IMCs can handle it). I'd rather swap ram out again than a power supply. Nothing you've said yet screams power supply to me.

If it's the psu, it should still crap out during a full (cpu + ram + gpu) stress test with the ram at jedec speeds. If it doesn't it's the ram/board/cpu ecosystem. Since you're picking such fast ram each time, that's where I'd look first and drop it down a bit.
 
... read the first post (first, of several, mentions of stress testing). ;)
OK, so he already did. Just a thought: if he had another motherboard, the same one he has, he could test all the parts on that board and see if the same thing happens. I would think it can be done through the process of elimination.
 
Intel is generally good to ~7k without flinching. That doesn't mean there aren't IMC duds in the batch. It's possible your CPUs were on the edge at that speed.

I know the first CPU I had couldn't go above 6400. As for "on the edge"... that's an understatement. Dropping the speed from 7000 to 6933 resulted in being able to run MPrime for at least 3 hours for both the small and extra-small sizes. I haven't run the larger size yet, but that, ironically, hasn't been an issue. I'll test more tomorrow and for longer. I tend to stress when leaving my machines on overnight and I have enough trouble sleeping.

If you don't need such fast ram, get something slower... 6600-6800 with the lowest CL. You're still overclocking (max for the platform is 5600) and you wouldn't notice the difference performance-wise. While you don't have faith it will last, you also have no idea it won't (the reasons are you're picking really fast RAM and not all IMCs can handle it). I'd rather swap ram out again than a power supply. Nothing you've said yet screams power supply to me.

For me, it isn't a matter of speed so much as... I have expectations. In particular, I hope for my new RAM to have a first-word latency no worse than my current RAM, and I would like it to be readable in the same amount of time. So if a 4GB DIMM is being read at 1866 in X seconds, then for a 24GB module to be read in X seconds it would need to be run at DDR5-11196, which is horribly unreasonable. Still, anything less than DDR5-11196 feels wrong to me. I came to terms with DDR5-7200 as that's pretty close had my current CPU's IMC been dual- instead of quad-channel.
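
For anyone checking the arithmetic behind that DDR5-11196 figure, here is a minimal sketch. It assumes read time scales as capacity divided by transfer rate and ignores channel count; the function name `equivalent_rate` is made up for illustration.

```python
def equivalent_rate(old_rate_mts, old_capacity_gb, new_capacity_gb):
    """Transfer rate at which the larger module is read end to end
    in the same time as the smaller one (time ~ capacity / rate)."""
    return old_rate_mts * new_capacity_gb / old_capacity_gb


# 4GB module at 1866 MT/s versus a 24GB module, as in the post above.
print(equivalent_rate(1866, 4, 24))  # 11196.0
```

The module is six times larger, so the rate has to be six times higher, which is where 11196 comes from.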

And I do have every intention of dropping the speed down further for "long-term stable". I like to test with faster settings and lower voltages, then bring them back to something more stable.

Which is actually a bit of a problem this time around. When at Intel's stock power settings, the CPU can't hit 5.5/4.4 GHz (P/E) during MPrime. It was always my intention to downclock the E-cores to 3.9GHz, but I really do expect the P-cores to run at 5.5 GHz.

If it's the psu, it should still crap out during a full (cpu + ram + gpu) stress test with the ram at jedec speeds. If it doesn't it's the ram/board/cpu ecosystem. Since you're picking such fast ram each time, that's where I'd look first and drop it down a bit.

I don't understand how my experience can be explained by the IMC, but it definitely seems that's the case. Which is good since I'm both tired of RMA'ing stuff and also routing all those cables again is the stuff of nightmares. I wish this case was just a quarter inch wider behind the motherboard tray. (Even better is that it seems like I just needed to drop the RAM speed down by less than one percent!)

I do worry that it has something to do with the motherboard defaulting to unsafe voltages, though. The vast majority of my tests on this CPU were done with a manually set IA AC, so hopefully that minimized any damage.

OK, so he already did. Just a thought: if he had another motherboard, the same one he has, he could test all the parts on that board and see if the same thing happens. I would think it can be done through the process of elimination.
I'm not sure exactly what you're suggesting. I don't have a spare motherboard of the same type, but this is my second motherboard as I had RMA'ed the previous one.
 
the CPU can't hit 5.5/4.4 GHz (P/E) during MPrime.
When running these stress tests, are you getting any power limit or other throttling (I see it's not temps)? I'm also not terribly surprised... during stress testing, I don't see max turbo clocks either... but that's Windows and heavier stress tests too, and it's planned behavior for the chip.

Sorry if I missed something you posted, but why are you changing any voltage and assuming the board is overvolting (to a problematic point)? If you want to run efficiently, make sure you're stable at stock voltage (everything on Auto, not manually set) THEN tweak down.
I like to test with faster settings and lower voltages, then bring them back to something more stable.
Is it possible you brought instability with this methodology? :)

Also, why would you downclock the ecores? I don't know your use scenario but that seems odd.

I hope for my new RAM to have a first-word latency no worse than my current RAM,
I vaguely recall first word latency calcs but don't recall capacity having anything to do with it (?). It was like BL + trcd + cl / effective transfer rate/2000... or something. This is a bit above my head though so I'll stay in my lane, lol.
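
For reference, the usual definition of first-word latency uses only CAS latency and the data rate, not capacity: CL clock cycles, with each clock lasting 2000 / (data rate in MT/s) nanoseconds because DDR transfers twice per clock. A small sketch, with the function name `first_word_latency_ns` chosen for illustration and the two kits taken from the speeds and timings named elsewhere in this thread:

```python
def first_word_latency_ns(cas_latency, data_rate_mts):
    """CAS latency in nanoseconds: CL cycles at a clock of data_rate / 2 MHz."""
    return cas_latency * 2000 / data_rate_mts


kits = {
    "DDR5-7200 CL34": (34, 7200),
    "DDR5-6400 CL40": (40, 6400),
}
for name, (cl, rate) in kits.items():
    print(f"{name}: {first_word_latency_ns(cl, rate):.2f} ns")
```

By this measure the 7200 CL34 kit comes out around 9.4 ns and the 6400 CL40 kit at 12.5 ns, regardless of how many gigabytes are on the module.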
 
That ram capacity - speed relationship requirement is a new one on me. If I understand the intent correctly, the wish is for higher capacity ram to be entirely read in the same time as smaller modules. To me this seems very random. You end up needing insanely speedy modules at higher capacity. For what real world benefit? I mean, for me, I'd rather have fast ram regardless of the capacity. How much ram there is, is a different axis of requirement.

And on the turbo speeds, again it is opportunistic and depends on the workload. It is not a given that ALL workloads will allow the CPU to boost to max. Lighter intensity loads would have a higher chance, perhaps Cinebench R15 for example. But anything AVX era is going to be tougher. I'm not familiar with mprime but I saw a description of it as "Prime95 for Linux". If it is at all like Prime95, then on recent Intel CPUs it will be hitting AVX2 hard. Power usage per clock will be pretty high and thus lower opportunity for boosting.
 
I went to the ASUS website. The download center: https://www.asus.com/support/download-center/

There I entered your motherboard (Motherboard: ASUS ROG Strix Z790-A Gaming WiFi II) and that opened the CPU and Memory compatibility lists.
I entered Team Group and 7200 for ram speed and there are NO compatible memory kits listed. I also searched Team Group and 48GB, nothing compatible.

You list RAM: Team Group T-Create DDR5-7200 CL34 (CTCED548G7200HC34ADC01) as the ram you have.

In that search the fastest & largest capacity for Team Group DDR5 is:

Team Group FF4D532G6400HC40BDC01 2x 16GB XMP 6400 6400 SS SK Hynix 40-40-40-84 1.35 1,2
 
Sorry if I missed something you posted, but why are you changing any voltage and assuming the board is overvolting (to a problematic point)? If you want to run efficiently, make sure you're stable at stock voltage (everything on Auto, not manually set) THEN tweak down.
The reason I worry about overvolting (despite my own tinkering, which is lowering the voltages) is due to reports of ASUS's BIOS settings pushing crazy voltages into chips by default, in addition to amps. I never intentionally increase voltages above stock, even for short-term testing.

Is it possible you brought instability with this methodology? :)

Also, why would you downclock the ecores? I don't know your use scenario but that seems odd.

In this case "faster" means "stock". I have not attempted any kind of overclock. As for the voltages, as mentioned, I worry about ASUS deciding electroshock therapy is good for my CPU. I did try stock voltages, by the way. I would not cry "no matter what I do" when I'm knowingly running at lower voltages. I think I mention that in my opening post.

As for the E-cores, my logic has a few points to it, which I believe to be valid, but please, do correct me if I'm wrong.
  • They eat into the CPU's total power budget and I care more about the P-cores.
  • Their energy efficiency takes a nose dive after 3.5GHz, which only gets worse.
  • I can't imagine needing that much multi-threaded performance now or in the next several years, even though I compile a lot of software.
Relevant to that first point is my intent to set PL1 to the Intel stock limit of 125W for actual long-term use.
I vaguely recall first word latency calcs but don't recall capacity having anything to do with it (?). It was like BL + trcd + cl / effective transfer rate/2000... or something. This is a bit above my head though so I'll stay in my lane, lol.
Yeah, you can ignore that. That's just me being weird and nonsensical. It's a mental thing relating to my notion of wanting the new thing to be better than the old by every metric. My brain invented a nonsense metric that has no real world effect.

That ram capacity - speed relationship requirement is a new one on me. If I understand the intent correctly, the wish is for higher capacity ram to be entirely read in the same time as smaller modules. To me this seems very random. You end up needing insanely speedy modules at higher capacity. For what real world benefit? I mean, for me, I'd rather have fast ram regardless of the capacity. How much ram there is, is a different axis of requirement.
Yeah, the capacity speed relationship is just something my brain cooked up.

And on the turbo speeds, again it is opportunistic and depends on the workload. It is not a given that ALL workloads will allow the CPU to boost to max. Lighter intensity loads would have a higher chance, perhaps Cinebench R15 for example. But anything AVX era is going to be tougher. I'm not familiar with mprime but I saw a description of it as "Prime95 for Linux". If it is at all like Prime95, then on recent Intel CPUs it will be hitting AVX2 hard. Power usage per clock will be pretty high and thus lower opportunity for boosting.
I also ran MPrime with AVX and SSE. And yeah, my head is kind of still stuck 10 years in the past where my current Haswell CPU is all too happy to clock up to its max speed. When running integer workloads or games, it can clock all the way up.

I went to the ASUS website. The download center: https://www.asus.com/support/download-center/

There I entered your motherboard (Motherboard: ASUS ROG Strix Z790-A Gaming WiFi II) and that opened the CPU and Memory compatibility lists.
I entered Team Group and 7200 for ram speed and there are NO compatible memory kits listed. I also searched Team Group and 48GB, nothing compatible.

You list RAM: Team Group T-Create DDR5-7200 CL34 (CTCED548G7200HC34ADC01) as the ram you have.

In that search the fastest & largest capacity for Team Group DDR5 is:

Team Group FF4D532G6400HC40BDC01 2x 16GB XMP 6400 6400 SS SK Hynix 40-40-40-84 1.35 1,2
That is part of another discussion earlier in this saga of mine: https://www.overclockers.com/forums/threads/intel-13th-vs-14th-gen-and-qvl.804572

I tried two different RAM kits, by the way, the other was G.Skill's F5-7600J3848F24GX2-TZ5RW.

That said, it's interesting that you didn't find that kit, as it was on the list last I checked. And as mentioned, I spoke to ASUS support several times over a six week period (the time was not mentioned).
 
  • They eat into the CPU's total power budget and I care more about the P-cores.
  • Their energy efficiency takes a nose dive after 3.5GHz, which only gets worse.
Aren't you lowering voltage and therefore power use anyway? So, maybe that (lowering ecores) raises the ceiling, but to what end? You aren't using the headroom unless you wanted to overclock the pcores. It won't help the pcores reach that boost clock with the heavy load. You'll have to adjust the boost clocks to all pcores the same multiplier.

Also, consider testing whatever work you're actually doing. The heavy load you're putting on the cores is why it's running the clock speeds it is. Try your compile and see how many cores/threads it uses and what speeds it hits.
 
Aren't you lowering voltage and therefore power use anyway? So, maybe that (lowering ecores) raises the ceiling, but to what end? You aren't using the headroom unless you wanted to overclock the pcores. It won't help the pcores reach that boost clock with the heavy load. You'll have to adjust the boost clocks to all pcores the same multiplier.
I am lowering the voltages, but this is no golden sample. Also, I intend on running it mostly at the 125W limit, not the more typical 253W limit.

That said, you're right in your implication that I'm merely guessing at what needs to be done to ensure the P-cores can run at 5.5GHz. Hopefully I'm wrong, but I won't know that for a while yet as I'm still working to convince myself that it's stable, and that means running it as hard as I can. After three months of trying to get the thing stable, it's going to take a lot for me to be convinced that it is in fact okay on an emotional level. At this point, that's what this is about: letting me stop worrying. I still have a few more tests to run now that I'm done with MPrime.

However, even if that's not the case, I care about energy efficiency. I don't see downclocking the E-cores as something that will adversely affect my experience with the computer, but it seems as though it should reduce energy usage. The two most computationally intensive things I do on my computer are gaming and compiling. The former doesn't make good use of E-cores in the first place, and the latter is something I can just wait on as I do something else. I might not be one of those people screaming "death to the E-core" or whatever, but I also don't see them as a huge boon. Which might make you wonder why I didn't just go with AMD. Well, I really do like the notion of the big.LITTLE architecture scheme with one set of cores for heavy lifting and another for odds and ends and saving power. Also, the oneAPI, which I care about for some reason (I am a mathematician, but I don't actually do any computational work), is optimized for Intel only. Also, also, AMD's 7000 series chips have a reputation of catching fire. Even if that's unlikely, and resolved with recent BIOS updates, I'd still rather not worry about it.

> You'll have to adjust the boost clocks to all pcores the same multiplier.
I'm not sure what you're saying here, probably because it's late. 😅

I hope that doesn't sound dismissive; it isn't meant to be. Hopefully it's more just rambling (I am quite tired right now).

Until I move on to games benchmarking (which is the only usage where CPU speed might make a significant difference), I really am guessing at this point. Currently my efforts have been in ensuring the E-cores were tested at 3.9GHz. I feel comfortable with them at that frequency, but would need to retest if I were to raise that up. If the processor didn't have so many threads, or if the ring bus speed was still tied to E-core speed, I wouldn't consider cutting them back, but it does have a lot of threads and from Raptor Lake onward, E-cores don't adversely affect the Ring.

Also, consider testing whatever work you're actually doing. The heavy load you're putting on the cores is why it's running the clock speeds it is. Try your compile and see how many cores/threads it uses and what speeds it hits.
I only finished a battery of MPrime runs yesterday. In total it was 36h with AVX2, AVX, and SSE each tested at all three size levels plus blended. Well, I didn't test SSE at the largest size since at that point it was pretty clear how it'd go. That's the test that seems least likely to fail (or take the longest to fail). Today I gave it a rest.

I have a script written that `make`s whatever is in the working directory, picks out those files which are reproducible (which requires manual discovery on my part), and hashes those files' data as a single chunk (so one hash per run). These hashes are stored in a table (or dictionary or map depending on language) with a count number, along with an error tally. If all is working properly, there should only be one hash and no errors.
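
A rough sketch of what such a script could look like, not the actual one. SHA-256, the `make clean` step, and the names `hash_outputs` and `run_builds` are all assumptions made for illustration; the list of reproducible output files still has to be supplied by hand.

```python
import hashlib
import subprocess
from collections import Counter
from pathlib import Path


def hash_outputs(paths):
    """Hash the contents of all files as one stream, giving one digest per run."""
    digest = hashlib.sha256()
    for path in paths:
        digest.update(Path(path).read_bytes())
    return digest.hexdigest()


def run_builds(workdir, outputs, runs, build_cmd=("make",), clean_cmd=("make", "clean")):
    """Rebuild `runs` times; return (Counter of digests, number of failed runs)."""
    hashes = Counter()
    errors = 0
    for _ in range(runs):
        if clean_cmd:
            # Assumed clean step so that every run really rebuilds from scratch.
            subprocess.run(clean_cmd, cwd=workdir, capture_output=True)
        result = subprocess.run(build_cmd, cwd=workdir, capture_output=True)
        if result.returncode != 0:
            errors += 1  # the build itself failed
            continue
        try:
            digest = hash_outputs(Path(workdir) / name for name in outputs)
        except OSError:
            errors += 1  # an expected output file is missing or unreadable
            continue
        hashes[digest] += 1
    return hashes, errors
```

Called as, say, `run_builds(".", ["program"], runs=20)` with a hypothetical output file `program`, a healthy machine should give back a counter with exactly one key and an error count of 0. More than one distinct hash means the same source produced different binaries.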

One thing I didn't think about when picking NixOS for my next distro (I'm kind of tired of having to work to maintain Arch) is how much software needs to be locally compiled. Early on with this PC I actually had a run of serious instability because some important system software was compiled incorrectly, but the building process didn't go so poorly as to trip any alerts for me. So properly compiling software is what I consider most critical.

Now that the new PC seems reasonably stable, I probably won't be doing as much with it. I'll boot it up every now and then to run this or that stability test (I still have that script, StressAppTest, and Stress-NG), but I don't feel rushed any longer. When I couldn't determine the fault, or how to make it stable, I felt like I needed to hurry up since RMA's are time sensitive. But everything seems to be working reasonably well, so now it's just a matter of chasing away anxiety. Assuming my current PC doesn't die soon, I probably won't actually start using the new one as my daily driver until Summer.
 
I intend on running it mostly at the 125W limit, not the more typical 253W limit.
It's curious, this methodology. I've never seen someone so intent on lowering the performance of their CPU when there isn't a need (thermally, for example). Why wouldn't you run it at full power to get whatever workflows done in a more timely manner?

Well, I really do like the notion of the big.LITTLE architecture scheme with one set of cores for heavy lifting and another for odds and ends and saving power.
That's AMD's next gen, not this gen... well, next-gen desktop. Current gen AMD desktop does not utilize big.LITTLE.

> You'll have to adjust the boost clocks to all pcores the same multiplier.
I'm not sure what you're saying here, probably because it's late. 😅

I hope that doesn't sound dismissive; it isn't meant to be. Hopefully it's more just rambling (I am quite tired right now).
Not dismissive at all. What I'm saying is you have this goal to work with P-cores. Intel's turbo boost uses higher clocks for fewer cores. For example, if you use 1-2 threads, it will boost to 5.5. Four threads, 5.3, and all threads, 5.2 (example values, note). So the fewer threads used, the faster the clock speeds are, up to the max boost. If you're running stress tests and pounding the cores, it will run at a lower speed by default (that's the reason it's not hitting max boost; they don't work that way). In order to get all the cores running at max boost clocks, you will need to adjust the CPU multiplier to have all of the P-cores match each other.
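
As a toy illustration of that per-core-count behavior, using only the example values above and not any real chip's turbo table. The bin boundaries, the count of 8 P-cores, and the names `TURBO_BINS`, `SYNCED_BINS` and `turbo_ceiling_ghz` are assumptions for the sketch.

```python
# Hypothetical turbo table: (max active P-cores, ceiling in GHz).
TURBO_BINS = [(2, 5.5), (4, 5.3), (8, 5.2)]

# "Syncing" all P-cores means every bin gets the same multiplier.
SYNCED_BINS = [(8, 5.5)]


def turbo_ceiling_ghz(active_cores, bins=TURBO_BINS):
    """Return the clock ceiling for the given number of loaded P-cores."""
    for max_cores, ghz in bins:
        if active_cores <= max_cores:
            return ghz
    return bins[-1][1]  # more cores than the table covers: use the last bin


for cores in (1, 4, 8):
    print(cores, turbo_ceiling_ghz(cores), turbo_ceiling_ghz(cores, SYNCED_BINS))
```

With the default table an all-core load tops out at the lowest bin; with the synced table every load is allowed the same ceiling, though power and thermal limits can still hold the real clock below it.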

Worth noting on the E-cores, if your work spills over to them (which, why would you not let them, I'm assuming your work is heavily threaded by the testing you're doing...), wouldn't you want them running optimally and not a few hundred MHz slower? Is your goal really to be more efficient and lower performance/slow workflows? If so, that's OK, I just never heard of that for the meager power savings it is. Most users want to get things done in a more timely manner and don't worry about a few W saved.

To be honest, with what I think you're doing, you may need to RAISE Vcore to get all the P-cores running at the boost clock (or accept the fact that when running that way it runs a few hundred MHz slower). Just lowering the E-core MHz without a voltage adjustment yields negligible power differences. Since Vcore is tied into P and E cores, you're likely not going to save as much power as you think when you have to adjust the multiplier and voltage up a bit for the P-cores to meet your goal (apologies if I misunderstood).


Hopefully I'm wrong, but I won't know that for a while yet as I'm still working to convince myself that it's stable, and that means running it as hard as I can. After three months of trying to get the thing stable, it's going to take a lot for me to be convinced that it is in fact okay on an emotional level.
Now that the new PC seems reasonably stable, I probably won't be doing as much with it. I'll boot it up every now and then to run this or that stability test (I still have that script, StressAppTest, and Stress-NG),
You built this new machine to........... test stability :p ?!!

I'm not the right doctor for this problem, lol. My default response is, respectfully, to get over it and use that PC. Stop making up metrics, get reasonable RAM (speed, like under DDR5-7000) on the QVL, set XMP and enjoy that machine you spent good $ on. :)
 
Now that the new PC seems reasonably stable, I probably won't be doing as much with it. I'll boot it up every now and then to run this or that stability test (I still have that script, StressAppTest, and Stress-NG), but I don't feel rushed any longer. When I couldn't determine the fault, or how to make it stable, I felt like I needed to hurry up since RMA's are time sensitive. But everything seems to be working reasonably well, so now it's just a matter of chasing away anxiety. Assuming my current PC doesn't die soon, I probably won't actually start using the new one as my daily driver until Summer.

Why... did you buy this machine? :)

And I'm being DEAD SERIOUS. Not joking around at all.

Was it just for the purpose of benchmarking and overclocking?

I mean your current main rig (in your sig) is older than the build BEFORE my previous build. (I used to have a 4690...kf I think.)

So a 14700 would be a great upgrade. In fact I contemplated selling my 12900k and going for that chip instead. (Too much hassle in the end for only a moderate boost in performance when I was just about to get a GIGANTIC boost in performance.)

If you do anything at all with your current rig... then EVERYTHING would be better with your new rig.

...except overclocking apparently. :)

If you're really not gonna use it though... Sell all the parts and buy a motorcycle!

Motorcycles are AWESOME! You're gonna look so cool...

Get yourself some sunglasses, too.

Hang out at the beach... If there's no beach near you... keep rolling until you get to one. You've got all day, buddy... :beer:
 