
Need processor with lots of cores!


nulik (Registered, Joined Jan 28, 2012)
I am currently coding a genetic algorithm, and I need as many cores as possible in a single chip, because they are all going to use the same RAM (that's an application-specific requirement).

But the Wikipedia page on the Zen architecture says the largest processor has 32 cores. I would love to see an AMD processor with 128 cores (256 threads), that would be ideal, but they don't even have 64 cores available!
I heard there is going to be a product release with the Zen 3 microarchitecture at the end of the year. Is AMD going to put 128-core processors on the market?

Thanks in advance!
 
What options there are depends on the software. What exactly is meant by "single chip"? A single socket? No NUMA nodes? A uniform shared L3 cache?

The Threadripper 3990X, as already mentioned, is the most affordable way to get 64 cores in a consumer-ish single-socket system. Price per core, it isn't bad. It's just a lot of cores!

If the software isn't held back by NUMA, then you could go multi-socket Epyc and have multiple 64-core CPUs. But that isn't an area I've looked into particularly. It isn't gonna be cheap!

As for going beyond 64 cores in a single socket, look at the current 64-core models. There are eight 8-core chiplets in there with a fat IO die. There isn't physical space to put much more in. Adding more cores would need one or more of the following:
Shrink the IOD
Put more cores per chiplet without increasing area, which implicitly means smaller cores (each core gets a lower transistor budget, so it has to be more transistor-efficient, do less, or carry less cache). There is a bit of slack space, so the chiplets could get a little bigger, but probably not enough to fit 2x the cores by itself.
Use a smaller process (probably not next gen, but the one after)
Move to a bigger package
 
I have read that EPYC processors can achieve 25 GB/s of RAM transfer speed per core, and that's enough for me. If I can replicate that performance in parallel, with 128 cores running at that speed, it will be wonderful. I am not sure about NUMA; you are raising a good question. But typically each core is going to have its own memory region where it works for a few microseconds, then it picks another memory region and works there for another few microseconds. I am not sure if I need NUMA nodes though. There is little shared memory in this application, as it is kind of an embarrassingly parallel problem; there is some sharing that must be done for the genes, but that can happen at a few-minute intervals, so it is not going to be a bottleneck. Having everything running in a single system is going to be much faster than using a network; that's why I want a single-system design with lots of cores.
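To make that access pattern concrete, here is a minimal sketch in Python using multiprocessing, not my actual code: each worker mutates a private genome and only touches the shared buffer every SYNC_EVERY seconds. The worker function, the toy fitness, and the interval are all made-up placeholders.

```python
import multiprocessing as mp
import random
import time

SYNC_EVERY = 5.0   # placeholder; in the real app gene sharing is minutes apart
GENE_LEN = 64

def worker(shared_best, seconds=10.0):
    random.seed()                                        # independent RNG per process
    genome = [random.random() for _ in range(GENE_LEN)]  # private memory region
    last_sync = time.monotonic()
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        genome[random.randrange(GENE_LEN)] += random.gauss(0, 0.1)  # local-only work
        score = -sum(g * g for g in genome)              # toy fitness function
        if time.monotonic() - last_sync > SYNC_EVERY:
            with shared_best.get_lock():                 # the rare shared-RAM traffic
                if score > shared_best[0]:
                    shared_best[0] = score
                    shared_best[1:] = genome
            last_sync = time.monotonic()

if __name__ == "__main__":
    # slot 0 holds the best score, the rest holds the best genome
    shared_best = mp.Array("d", [float("-inf")] + [0.0] * GENE_LEN)
    procs = [mp.Process(target=worker, args=(shared_best,)) for _ in range(4)]
    for p in procs: p.start()
    for p in procs: p.join()
    print("best score:", shared_best[0])
```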
 
What kind of budget? Once you move into Epyc territory, things get very expensive.
For example, the 7742 is a $7,000 CPU at 64 cores. You can use a dual-socket board to take that to 128 cores, but by the time you factor in RAM you'll have almost $20K invested.
 
I have read that EPYC processors can achieve 25 GB/s of RAM transfer speed per core, and that's enough for me. If I can replicate that performance in parallel, with 128 cores running at that speed, it will be wonderful.

Not all cores will get 25 GB/s to RAM at the same time. The total RAM bandwidth for one CPU is in the ballpark of 200 GB/s, from 8 channels at DDR4-3200, and that has to be divided between all the cores. This is in part why they had to go with a large L3 cache: it is difficult to get data to all these cores...
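A quick back-of-envelope in Python, using only the numbers above (8 channels of DDR4-3200, 8 bytes per transfer), shows how thin it gets once every core streams at once:

```python
channels = 8
mega_transfers_per_s = 3200      # DDR4-3200
bytes_per_transfer = 8           # 64-bit channel
total_gbs = channels * mega_transfers_per_s * bytes_per_transfer / 1000  # ~204.8

for cores in (64, 128):
    print(f"{cores} cores sharing {total_gbs:.1f} GB/s "
          f"-> {total_gbs / cores:.1f} GB/s per core")
# 64 cores -> 3.2 GB/s each, 128 -> 1.6 GB/s: nowhere near 25 GB/s per core.
```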
 
Not all cores will get 25 GB/s to RAM at the same time. The total RAM bandwidth for one CPU is in the ballpark of 200 GB/s, from 8 channels at DDR4-3200, and that has to be divided between all the cores. This is in part why they had to go with a large L3 cache: it is difficult to get data to all these cores...

hmm, hmmm....
that's what I kind of suspected: lots of cores packed on a single chip won't be so effective because of the RAM bottleneck. A $20K investment in hardware is not exactly what I want. Looks like I will have to go for a distributed (networked) approach and buy lots of cheap ARM SoC boards. There is a huge ARM movement going on right now; today, while looking for options, I found an 80-core ARM server: https://www.anandtech.com/show/1557...n1-soc-for-hyperscalers-against-rome-and-xeon
They are targeting EPYC as competition.
 
Just how much bandwidth does each core/thread/whatever take up? Does it fit in the L3 cache, keeping in mind the L3 is broken up into CCX-sized chunks? Can you mitigate it somewhat by having multiple threads work on the same data?

In a quick skim of that ARM chip, it still has the same 8-channel RAM for the same total bandwidth. Is there anything it does differently that makes bandwidth not a problem there?

If multiple systems are a consideration, what about Cascade Lake X? Compared to AMD's mainstream offerings, they're roughly cost-competitive per core, but you get double the RAM channels (4 vs 2). If you can make use of AVX-512, that might provide a further boost.
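For a sense of scale on the L3 question, here are the rough numbers in Python, assuming Zen 2's 16 MB of L3 per 4-core CCX (a core only allocates into its own CCX's slice); the 8 GB figure is the dataset size mentioned below:

```python
l3_per_ccx_mb = 16          # Zen 2: 16 MB L3 slice per 4-core CCX
cores_per_ccx = 4
total_cores = 64            # 3990X / EPYC 7742 class part
dataset_gb = 8              # working set size from the post below

ccxs = total_cores // cores_per_ccx
print(f"{ccxs} CCXs x {l3_per_ccx_mb} MB = {ccxs * l3_per_ccx_mb} MB total L3,")
print(f"but any one thread can cache at most {l3_per_ccx_mb} MB of it")
print(f"{dataset_gb} GB / {l3_per_ccx_mb} MB = {dataset_gb * 1024 // l3_per_ccx_mb} "
      f"chunks to stream through per pass")
```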
 
Just how much bandwidth does each core/thread/whatever take up? Does it fit in the L3 cache, keeping in mind the L3 is broken up into CCX-sized chunks? Can you mitigate it somewhat by having multiple threads work on the same data?

In a quick skim of that ARM chip, it still has the same 8-channel RAM for the same total bandwidth. Is there anything it does differently that makes bandwidth not a problem there?

If multiple systems are a consideration, what about Cascade Lake X? Compared to AMD's mainstream offerings, they're roughly cost-competitive per core, but you get double the RAM channels (4 vs 2). If you can make use of AVX-512, that might provide a further boost.

Well, I haven't measured the RAM bandwidth per core yet, because I am still coding it, but I can say it is going to be heavy. The shared data in memory is about 8 GB (though I can process it sequentially in chunks, of course), and each virtual "creature" has its own memory for processing this 8 GB of data. They are all combining assembly instructions and executing them randomly, trying to find a mathematical solution. So it is sort of like running a virtual machine on each core, but not that advanced, because only mathematical instructions are used (ADD, MUL, MOV, INC, DEC, CMOV, SETB, SETL, and so on). Basically there is a lot of computing going on, 99.99% of which is going to be garbage, but eventually you find the solution after many hours of training.
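A hypothetical toy version of the idea in Python, just to show the shape; the instruction set, register count, and fitness function here are simplified stand-ins, not my real code:

```python
import random

OPS = ["ADD", "MUL", "INC", "DEC", "MOV"]        # tiny subset of the real set

def run(program, x):
    r = [x, 0.0, 0.0, 0.0]                       # 4 registers, r0 = input
    for op, a, b in program:
        if   op == "ADD": r[a] += r[b]
        elif op == "MUL": r[a] *= r[b]
        elif op == "INC": r[a] += 1
        elif op == "DEC": r[a] -= 1
        elif op == "MOV": r[a] = r[b]
    return r[0]                                  # r0 is the output

def random_program(n=8):
    return [(random.choice(OPS), random.randrange(4), random.randrange(4))
            for _ in range(n)]

target = lambda x: x * x + 1                     # toy goal: evolve x^2 + 1
best, best_err = None, float("inf")
for _ in range(20000):                           # the vast majority is garbage,
    prog = random_program()                      # exactly as described above
    err = sum(abs(run(prog, x) - target(x)) for x in range(1, 6))
    if err < best_err:
        best, best_err = prog, err
print("best error:", best_err)
```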
So, for example, if I go for the networking approach, I wouldn't be able to evolve fast, and I would need pauses between epochs to kill the bad individuals in the population; but if it costs something like 10 times less, then it is worth it. Checking on Alibaba, there are ARM octa-core mobile phones selling for $80. If I can find octa-core mobile phone boards at $50 per board (without display and case), running at say 1.5 GHz, then for $20K I could buy 400 of those boards, which would give me 8 x 400 = 3,200 cores in total (excluding the cost of networking gear). ARM cores would of course be slower than AMD EPYC cores, but I wonder by how much. 128 EPYC cores vs 3,200 cores: that's 25 times more computing power, naively. I think I need to find some study of the cheapest computing devices in terms of instructions executed per core (including RAM speed), and that would be the solution to my problem. But there are very few people doing parallel computing on commodity hardware, so it is a bit difficult to discover which thing runs faster than another. I guess I would need to buy one of each and test: test on EPYC, test on an ARM server, test on an ARM mobile phone, etc...
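Writing out that math (using the thread's own rough prices; the 25x figure is a naive core-count ratio that ignores per-core speed, RAM bandwidth, and network overhead):

```python
budget = 20_000
arm_board_cost = 50          # hoped-for price per 8-core phone board
arm_cores_per_board = 8

boards = budget // arm_board_cost            # 400 boards
arm_cores = boards * arm_cores_per_board     # 3,200 cores
epyc_cores = 128                             # dual-socket 7742 for roughly the same $20K

print(f"{boards} boards -> {arm_cores} ARM cores vs {epyc_cores} EPYC cores")
print(f"raw core-count ratio: {arm_cores / epyc_cores:.0f}x")   # 25x, as above
```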
 