Alright, so I don't have time to go through a complete teardown (I'm typing this up in meetings).
Here is a quick glimpse at what can cause a common bottleneck in desktop mobos: an Intel CPU has a bunch of available lanes (a lot more than what gets utilized). These lanes are grouped into channels (since PCIe is point to point), and each channel has a window to receive and transmit data. If the channels are distributed correctly, and the signals on the channels are laid out correctly (think physical link), then the n devices connected to those channels have enough time to move their data to and from the CPU. When these basics aren't designed correctly (or get gimped to cut costs), you will see bottlenecks show up in different tests.
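To put rough numbers on it, here's a quick Python sketch of what over-subscription looks like (the lane budget and device list are made-up example numbers, not any specific board):

CPU_LANES = 16  # assumed lane budget for the CPU's direct PCIe slots

devices = {
    "GPU (x16 slot)": 16,
    "M.2 NVMe SSD": 4,
    "Add-in NIC": 4,
}

requested = sum(devices.values())
print(f"Lanes requested: {requested}, CPU lanes available: {CPU_LANES}")

if requested > CPU_LANES:
    # Whatever doesn't fit has to be bifurcated off the GPU slot or hung
    # off shared lanes, i.e. devices take turns in the same window.
    print(f"Over-subscribed by {requested - CPU_LANES} lanes -> somebody ends up sharing")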
A great example is multi-GPU with M.2. The GPUs typically get the best-designed channels so they never have issues, yet an M.2 slot that is only just entering the market will not. Why gimp those channels? Simple answer: cost. More lanes = more board layers, and PCIe 2.x, 3.0 and future generations will be harder and harder to implement. PCIe has a great recovery system, but that creates high overhead (who cares on a PC). Faster cores help mask the problem, and running the PCIe bus at higher speeds helps somewhat, but it hurts signal integrity on the channel.
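For a feel of where the overhead sits, here's a back-of-envelope calculation of per-lane throughput once you subtract just the line-encoding overhead (packet headers, ACKs and the replays from that recovery system eat more on top of this):

# Per-lane, one-direction throughput after line-encoding overhead only.
GENS = {
    "PCIe 2.0": (5e9,  8 / 10),     # 5 GT/s, 8b/10b encoding (20% overhead)
    "PCIe 3.0": (8e9,  128 / 130),  # 8 GT/s, 128b/130b encoding (~1.5% overhead)
    "PCIe 4.0": (16e9, 128 / 130),  # 16 GT/s, 128b/130b encoding
}

def lane_MBps(transfer_rate, encoding_efficiency):
    """Usable bytes per second on a single lane."""
    return transfer_rate * encoding_efficiency / 8 / 1e6

for gen, (rate, eff) in GENS.items():
    per_lane = lane_MBps(rate, eff)
    print(f"{gen}: ~{per_lane:.0f} MB/s per lane, ~{4 * per_lane:.0f} MB/s for an x4 M.2")

So a 2.0 lane tops out around 500 MB/s before protocol overhead, while a 3.0 lane is closer to 985 MB/s, which is why which generation and how many lanes a slot actually gets matters so much.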
To add: think of channels as highways. With X cars on the road, the more lanes you have, the smaller your window; fewer lanes means a bigger window. PCIe uses windows for data transfer, and only one window is communicating at a time.
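Since only one window talks at a time, the effective share per device is just the channel bandwidth divided by however many devices are taking turns on it. Toy illustration (the channel figure is the rough x4 PCIe 3.0 number from the sketch above):

CHANNEL_MBPS = 3940  # ~x4 PCIe 3.0 after encoding overhead, from the calculation above

for sharers in (1, 2, 4):
    print(f"{sharers} device(s) taking turns: ~{CHANNEL_MBPS / sharers:.0f} MB/s each")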