RISC vs CISC. Most processors run on Reduced Instruction Sets today. Our beloved x86 is one of those that uses a Complex-to-Reduced ISA (instruction set architecture) pre-converter. I believe from history that Cyrix started this back with its 486 processors, and NexGen, AMD and Intel later followed with the same process. It involves taking an instruction that performs multiple tasks (read data, add an immediate value, store to memory or a register, then bump the address pointer) and converting it into several simpler, smaller codes that perform each operation separately.
But wouldn't that take longer, you might ask? Actually no. Thanks to the pipelining now used, and as SuperNade pointed out and the articles explain, multiple operations are in flight at the same time. In effect, long before the add takes place, the data has already been read in while the previous instruction(s) executed. Once the add takes place, the next instruction in the pipeline can start while the data gets stored back and the address bumped.
Exceptions can occur when the same address or register is hit by different operations. The pipeline then has to adjust its ordering and run those operations in sequence to prevent corruption. Compilers that perform code optimization help here. Optimized code will see the same memory being hit and convert the operations to use registers instead. Once the value is loaded into a register, the two instructions go into the pipeline for back-to-back execution, resulting in one read, two ops (operations), and one write and bump.
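To make the conversion step a bit more concrete, here is a rough Python sketch of one complex instruction being split into four simple ones. Everything in it is made up for illustration: the micro-op names (`load`, `addi`, `store`), the `crack` and `run` helpers, and the 5-stage pipeline are assumptions, not how any real x86 decoder works. The cycle counts are the idealized textbook figures and ignore the same-address/same-register stalls mentioned above.

```python
def crack(ptr_reg, imm, step):
    """Split a hypothetical 'add imm to [ptr], then bump ptr' instruction
    into four simple micro-ops (names are illustrative only)."""
    return [
        ("load",  "tmp", ptr_reg),   # tmp <- mem[ptr]
        ("addi",  "tmp", imm),       # tmp <- tmp + imm
        ("store", ptr_reg, "tmp"),   # mem[ptr] <- tmp
        ("addi",  ptr_reg, step),    # ptr <- ptr + step (the "bump")
    ]


def run(uops, regs, mem):
    """Execute the micro-ops one after another on a toy register file and memory."""
    for op, a, b in uops:
        if op == "load":
            regs[a] = mem[regs[b]]
        elif op == "addi":
            regs[a] = regs[a] + b
        elif op == "store":
            mem[regs[a]] = regs[b]
        else:
            raise ValueError(f"unknown micro-op: {op}")


def cycles_unpipelined(n_ops, stages):
    # Each op must pass through every stage before the next one starts.
    return n_ops * stages


def cycles_pipelined(n_ops, stages):
    # Ideal pipeline: fill it once, then one op completes per cycle.
    return stages + (n_ops - 1)


regs = {"ptr": 2, "tmp": 0}
mem = [10, 20, 30, 40]
uops = crack("ptr", imm=5, step=1)
run(uops, regs, mem)

print(mem, regs["ptr"])                       # [10, 20, 35, 40] 3
print(cycles_unpipelined(len(uops), 5),
      cycles_pipelined(len(uops), 5))         # 20 8
```

The four simple ops give the same result as the one complex instruction, and in the ideal case the pipelined version finishes in 8 cycles instead of 20, which is the "actually no" part of the answer.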
Also note the two L1 caches on today's processors: one L1 is for code, the other for data. This allows data to be streamed in and code to be handled separately. With the exception of branches/jumps and calls, code executes linearly, whereas data is accessed mostly as blocks or single items. The bottleneck occurs in the L2 when the cache controller has to pull data from different parts of memory or follow branches. Instruction prefetch helps by signalling the request before the actual instruction hits the execute stage. By design the processor "can" work on floating point ops, preprocess code and execute other code while the data is brought in.
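The benefit of signalling the request early can be shown with a toy timing model. This is only a sketch under made-up assumptions: the function names, the fixed `fetch` and `work` cycle counts, and the "request one block ahead" policy are all hypothetical, and real cache controllers are far more involved.

```python
def time_without_prefetch(n_blocks, fetch, work):
    """Each block is requested only when it is needed, so the
    processor sits idle for the full fetch time every time."""
    return n_blocks * (fetch + work)


def time_with_prefetch(n_blocks, fetch, work):
    """The request for the next block goes out as soon as work on the
    current block begins, so fetching overlaps with execution."""
    now = 0
    ready = fetch                 # cycle at which block 0 arrives
    for _ in range(n_blocks):
        now = max(now, ready)     # wait only if the block has not arrived yet
        ready = now + fetch       # request the next block right away
        now += work               # do the work while that request is in flight
    return now


print(time_without_prefetch(4, fetch=10, work=6))   # 64
print(time_with_prefetch(4, fetch=10, work=6))      # 46
```

With these numbers the memory latency is only paid in full once, and after that most of it hides behind the work being done on the previous block.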