L1 is per-core. It cannot be shared (or at least, if it were shared, it would be vastly slower).
L2 is faster than L3 (and costs more). When you're talking about 8-16 MB of SRAM cache, that can put a dent in the price of the final unit. At least, that was the argument a few years ago. I don't know whether SRAM manufacturing has improved since then, or whether there is any reason L2 is harder to share than L3, given that Intel's Q9xxx series used a giant L2...
Think of it this way: the higher the number, the farther the cache is from the cores, so the slower it is; and the more cores a cache is shared between, the greater its latency. So: L1 per core, L2 per "module", L3 per socket. For example, the AMD FX-8xxx has 8 L1 caches, 4 L2 caches, and 1 L3 cache.
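
To make the counting concrete, here is a minimal Python sketch of the "L1 per core, L2 per module, L3 per socket" rule. It is only an illustration: the function name `count_caches` and the `cores_per_cache` parameter are made up for this example, and the layout (8 cores, 2 cores per module, 1 socket) is assumed from the FX-8xxx figures above rather than read from real hardware.

```python
def count_caches(num_cores, cores_per_cache):
    """Return {level: number of distinct cache instances}.

    cores_per_cache maps a cache level name to how many cores share one
    instance of that level (a hypothetical parameter for this sketch).
    """
    counts = {}
    for level, sharers in cores_per_cache.items():
        # Cores with the same (core_id // sharers) share one cache instance.
        instances = {core // sharers for core in range(num_cores)}
        counts[level] = len(instances)
    return counts


# Assumed layout for the FX-8xxx example: 8 cores, 2 cores per module,
# and all 8 cores on one socket.
fx8 = count_caches(8, {"L1": 1, "L2": 2, "L3": 8})
print(fx8)  # {'L1': 8, 'L2': 4, 'L3': 1}
```

Changing the sharing factors (say, an L2 shared by 4 cores) shows how the number of cache instances shrinks as each one serves more cores.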