This is mainly about the recently released AI products, and we may be some way off from information on the gaming parts. Still, it may offer a clue of what's happening.
Blackwell's die size seems to be pushing TSMC's reticle limit. Nvidia said it is 2x Hopper, which was already close to that limit, the 2x coming from the two chips glued together. Plugging this into a yield calculator, we'd expect around 50% yield (defect-free dies) out of roughly 60 candidate dies per wafer, depending on exact dimensions and placement.
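As a rough sketch of where those numbers come from: using a Poisson defect model and the standard dies-per-wafer approximation. The ~814 mm² die area and 0.09 defects/cm² density are my assumptions (chosen to sit near the reticle limit and a mature N5-class defect rate), not published figures:

```python
import math

def dies_per_wafer(die_area_mm2, wafer_diameter_mm=300):
    # Common approximation: gross dies by area, minus edge losses
    return int(math.pi * (wafer_diameter_mm / 2) ** 2 / die_area_mm2
               - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))

def poisson_yield(die_area_mm2, defect_density_per_cm2):
    # Poisson model: probability a given die has zero defects
    return math.exp(-defect_density_per_cm2 * die_area_mm2 / 100)

die_area = 814   # assumed mm^2, near TSMC's ~858 mm^2 reticle limit
d0 = 0.09        # assumed defects/cm^2 for a mature N5-class node

print(dies_per_wafer(die_area))               # 63 candidate dies per 300 mm wafer
print(round(poisson_yield(die_area, d0), 2))  # 0.48, i.e. roughly 50% defect-free
```

Both values land in the "60-ish dies, ~50% yield" ballpark; the exact figures shift with the assumed defect density and die placement.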
Wafer pricing is the next question. Stepping back, it is on the 4NP process, an evolution of Nvidia's custom 4N process at TSMC. Note that TSMC's general nodes are named Nx, not xN. Previously I used the first link below to estimate pricing; I just found the second link, which has different numbers. The 4N class is based on N5, so I'm assuming pricing will be similar, or possibly a bit more given it is custom. If we take the higher-end value of $17k per wafer, that puts the cost at just under $300 per die, whether good or bad. Again, around 50% are expected to be defect-free. There may be some further loss due to binning, but cut-down offerings will soak those up. GB200 seems to get the best dies, with B200 and B100 getting lower-quality ones. This doesn't account for packaging costs or HBM.
https://www.techpowerup.com/272267/alleged-prices-of-tsmc-silicon-wafers-appear#g272267-2
https://www.tomshardware.com/news/tsmc-expected-to-charge-25000usd-per-2nm-wafer
Why not some variation of N3? Cost? Capacity?
Nvidia claims 10 TB/s of bandwidth over the die-to-die connection. I think this is the highest claimed for any product we have visibility of in the computing space. Apple's M2 Ultra is the other example I can think of, with a claimed 2.5 TB/s, which is notable for a consumer-tier product. Intel's Sapphire Rapids could be interesting to compare, but I've been unable to dig up numbers for its internal bandwidth. RDNA3 GPUs split the MCDs from the GCD and claim a peak bandwidth of 5.3 TB/s, but that isn't connecting multiple execution dies together, so it will never scale as much.
The goal on the GPU side must be to have multiple chips working as one, enabling better performance scaling without the pain SLI/Crossfire had.