I must admit that the L1 cache on the P4 seemed meager to me. I suppose it has to do with the long pipeline making a large L1 ineffective...
On the other hand, the L2 cache is a beefy 512 KB and it runs at CPU speed, so accessing data there doesn't cost much. Older CPUs were often paired with slower L2 cache, which meant that a hit in L1 was a lot better than one in L2.
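To put rough numbers on why L2 speed matters, here is a minimal sketch of the usual average-access-time estimate for a two-level cache. The function name and every latency and miss rate below are made up for illustration; they are not measured figures for the P4 or any other chip.

```python
def average_access_time(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, memory):
    """Estimated average memory access time, in CPU cycles.

    An access always pays the L1 hit time. On an L1 miss it also pays
    the L2 hit time, and on an L2 miss it pays the main memory latency too.
    """
    l1_miss_penalty = l2_hit + l2_miss_rate * memory
    return l1_hit + l1_miss_rate * l1_miss_penalty


# Hypothetical numbers: same L1 and memory, only the L2 latency differs.
fast_l2 = average_access_time(l1_hit=2, l1_miss_rate=0.10,
                              l2_hit=7, l2_miss_rate=0.20, memory=200)
slow_l2 = average_access_time(l1_hit=2, l1_miss_rate=0.10,
                              l2_hit=20, l2_miss_rate=0.20, memory=200)

print(f"full-speed L2: {fast_l2:.1f} cycles on average")
print(f"slower L2:     {slow_l2:.1f} cycles on average")
```

With these assumed values, the slower L2 raises the cost of every L1 miss, which is why a small L1 hurts less when the L2 behind it runs at CPU speed.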
Another point to note (correct me if I'm wrong) is data replication from L1 to L2. With the P3 and earlier chips, all data in L1 was also replicated in L2, so the total effective cache size was just that of L2 (unlike AMD's chips, which don't replicate that data). If I recall correctly, data is not replicated in the P4's caches.
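The capacity argument can be sketched in a few lines. This is only an illustration of the reasoning above: the function name is hypothetical, the 16 KB L1 size is a placeholder rather than a figure for any particular chip, and it ignores associativity and line-size effects.

```python
def effective_capacity_kb(l1_kb, l2_kb, replicated):
    """Amount of unique data the L1 + L2 pair can hold, in KB."""
    if replicated:
        # Every line in L1 also has a copy in L2, so L1 adds no extra capacity.
        return l2_kb
    # A line lives in L1 or in L2, never both, so the sizes add up.
    return l1_kb + l2_kb


# Placeholder sizes for illustration only.
print(effective_capacity_kb(16, 512, replicated=True))   # 512
print(effective_capacity_kb(16, 512, replicated=False))  # 528
```

The smaller the L1 is relative to the L2, the less the replication costs in usable capacity.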
I can't find any references online that confirm whether the P4's cache is replicated or not. As far as I know it IS still replicated.
That's not a bad thing, however. Latency is usually lower on a replicated cache structure. And if you run into the situation where data is in L1 but missing from L2 on one access, followed by an L1 miss and an L2 hit on another, you take a bigger performance hit with a non-replicated design.