Just how much bandwidth does each core/thread/whatever take up? Does your data fit in the L3 cache, keeping in mind that the cache is broken up into CCX-sized chunks? Can you mitigate it somewhat by having multiple threads work on the same data?
From a quick skim of that ARM chip's specs, it still has the same 8-channel RAM for the same total bandwidth. Does it do anything differently to make that not a problem there?
If multiple systems are a consideration, what about Cascade Lake-X? Compared to the AMD mainstream offerings, it is roughly cost-competitive per core but gets you double the RAM channels (4 vs 2). If you can make use of AVX-512, that might provide a further boost.
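To put a rough number on the channel question, here is a back-of-the-envelope sketch of peak memory bandwidth per core. The function name, the core counts, and the DDR4-3200 speed are placeholders I picked for illustration, not specs of any particular chip; only the 4 vs 2 channel comparison comes from the post above. Real sustained bandwidth will be lower than these theoretical peaks.

```python
def bandwidth_per_core(channels, mt_per_s, cores, bytes_per_transfer=8):
    """Peak theoretical GB/s available to each core if all cores stream at once.

    Each 64-bit DDR channel moves 8 bytes per transfer, so
    total GB/s = channels * MT/s * 8 / 1000.
    """
    total_gb_s = channels * mt_per_s * bytes_per_transfer / 1000
    return total_gb_s / cores

# Illustrative configurations: core count and memory speed are assumptions.
quad = bandwidth_per_core(channels=4, mt_per_s=3200, cores=16)
dual = bandwidth_per_core(channels=2, mt_per_s=3200, cores=16)

print(f"4 channels: {quad:.1f} GB/s per core")
print(f"2 channels: {dual:.1f} GB/s per core")
```

Plug in the real channel count, memory speed, and core count for each platform you are comparing; if your measured per-thread demand is above the result, the cores will be waiting on RAM.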
Well, I haven't measured the RAM bandwidth per core yet because I am still coding it, but I can say it is going to be heavy. The shared data in memory is about 8 GB (though I can of course process it sequentially in chunks), and each virtual "creature" has its own memory for processing this 8 GB of data. The creatures all combine assembly instructions and execute them at random, trying to find a mathematical solution. So it is sort of like running a virtual machine on each core, but not that advanced, because only mathematical instructions are used (ADD, MUL, MOV, INC, DEC, CMOV, SETB, SETL, and so on). Basically there is a lot of computing going on, 99.99% of which ends up as garbage, but eventually you will find the solution after many hours of training.
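To make the idea concrete, here is a minimal sketch of that kind of search, not my actual code. It assumes a toy machine with four 64-bit registers and only five of the instructions (ADD, MUL, MOV, INC, DEC); the names `random_program`, `run`, `fitness`, and `search` are made up for this example, and there is no shared 8 GB data set or evolution step, just random programs scored against sample input/output pairs.

```python
import random

MASK = 0xFFFFFFFFFFFFFFFF  # emulate 64-bit wraparound
NUM_REGS = 4               # assumed register count for this toy machine

# Each op takes (dst_value, src_value) and returns the new dst_value.
OPS = {
    "ADD": lambda d, s: (d + s) & MASK,
    "MUL": lambda d, s: (d * s) & MASK,
    "MOV": lambda d, s: s,
    "INC": lambda d, s: (d + 1) & MASK,
    "DEC": lambda d, s: (d - 1) & MASK,
}

def random_program(length, rng):
    """Build a random list of (opcode, dst_register, src_register) tuples."""
    names = sorted(OPS)
    return [
        (rng.choice(names), rng.randrange(NUM_REGS), rng.randrange(NUM_REGS))
        for _ in range(length)
    ]

def run(program, x):
    """Execute a program with x in register 0; the result is register 0."""
    regs = [x] + [0] * (NUM_REGS - 1)
    for op, dst, src in program:
        regs[dst] = OPS[op](regs[dst], regs[src])
    return regs[0]

def fitness(program, samples):
    """Count how many (input, expected_output) pairs the program gets right."""
    return sum(run(program, x) == y for x, y in samples)

def search(samples, tries, length=4, seed=0):
    """Pure random search: keep the best-scoring program seen so far."""
    rng = random.Random(seed)
    best, best_score = None, -1
    for _ in range(tries):
        prog = random_program(length, rng)
        score = fitness(prog, samples)
        if score > best_score:
            best, best_score = prog, score
    return best, best_score

# Example target: y = 2x + 1, solvable by "ADD r0, r0" then "INC r0".
samples = [(x, 2 * x + 1) for x in range(10)]
best, score = search(samples, tries=2000)
print(score, "of", len(samples), "samples matched by", best)
```

Nearly every random program scores zero, which is where the 99.99% garbage comes from. The inner loop is tiny, so whether it is compute-bound or RAM-bound depends almost entirely on how often each creature touches the shared data.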
So, for example, if I go for the networking approach, I wouldn't be able to evolve as fast, and I would need to pause between epochs to kill bad individuals from the population, but if it is going to cost something like 10 times less, then it is worth it. Looking on Alibaba, there are ARM octa-core mobile phones selling for $80. If I can find octa-core phone boards at $50 per board (without display and case) that run at, say, 1.5 GHz, then for $20k I could buy 400 of those boards, which would give me 8 x 400 = 3,200 cores in total (excluding the cost of networking devices). ARM cores would of course be slower than AMD EPYC cores, but I wonder by how much? 128 EPYC cores vs 3,200 ARM cores: that's 25 times as many cores, although how much computing power that translates to depends on the per-core speed. I think I need to find a study of the cheapest computing devices in terms of instructions executed per core (including RAM speed), and that would be the solution to my problem. But very few people do parallel computing on commodity hardware, so it is a bit difficult to find out which option runs faster. I guess I would need to buy one of each and test: test on EPYC, test on an ARM server, test on an ARM mobile phone, etc.
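Here is the arithmetic from above as a small script, so the numbers can be swapped as prices change. The budget, board price, and core counts are the figures from my post; the per-core slowdown passed to `cluster_advantage` (a name made up for this sketch) is a pure placeholder, since that is exactly the number I don't know yet.

```python
budget = 20_000        # dollars
board_cost = 50        # dollars per octa-core board, hoped-for price
cores_per_board = 8
epyc_cores = 128

boards = budget // board_cost          # how many boards the budget buys
arm_cores = boards * cores_per_board   # total ARM cores
core_ratio = arm_cores / epyc_cores    # ARM cores per EPYC core

def cluster_advantage(per_core_slowdown):
    """Throughput of the ARM cluster relative to the EPYC box.

    per_core_slowdown is how many times slower one ARM core is than one
    EPYC core on this workload. Networking overhead is ignored.
    """
    return core_ratio / per_core_slowdown

print(boards, "boards,", arm_cores, "cores,", core_ratio, "x the core count")
# Placeholder: if an ARM core turned out to be 5x slower per core.
print("advantage at 5x slowdown:", cluster_advantage(5))
```

So the boards only come out ahead on raw throughput if one EPYC core is less than 25 times faster than one phone core on this workload, before counting networking cost and the pauses between epochs.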