
Estimated costs for GPU dies


mackerel

Member
Joined
Mar 7, 2008
I wrote this on another forum, but thought it might be interesting here too.

Below are my vague estimates of GPU die costs based on questionable data and assumptions. I'm really selling this, aren't I?

Ada
AD102 $378
AD103 $179
AD104 $122
AD106 $70

Ampere
GA102 $114
GA103 $79
GA104 $53
GA106 $31

RDNA3
NAVI31 + 6x MCD $166
NAVI33 $41

RDNA2
NAVI21 $158
NAVI22 $79
NAVI23 $49
NAVI24 $19

Arc
DG2-512 $100
DG2-128 $29

How did I get these numbers?

I found wafer costs in a general web search. Many of the sources I'd never heard of before, and I have no way to verify where their numbers came from. The only ones I'd have some mild confidence in are the TSMC 7/5/3 figures, as those are more frequently discussed, but it could also be everyone copying everyone else. I've assumed N6 costs the same as N7, and N4 the same as N5. Note however that nvidia don't use N4, but 4N. I wondered if I'd made a typo when I saw that, but apparently it's a custom variation of the process for nvidia.

Right, so that's wafer cost taken as far as I can. Next, how many dies can we get from each wafer? This is a function of two main variables: die area and defect rate. TSMC have openly stated a defect rate for N7 of below 0.1/cm², and N5 was tracking similarly. I have found zero information about Samsung's defect rate. Pick a number, any number! I decided to keep it at 0.1 for everything to keep the calculation simple. With this we can get the expected number of "good" dies per wafer, where good means zero defects. In practice, some defective dies may still be usable if the bad part is mapped out. I thought I could estimate this by looking at how much effective area is reduced in cut-down products, but that would be very time consuming so I'm not doing it at this point. The other unknown is binning: just because a die is free from defects doesn't mean it'll necessarily meet the performance specification, so there may be some loss down that path too. Again, such a die may be recycled into a lower tier bin. Given the complexity I've not taken either of these into consideration, and assumed a good die is defect free and will meet performance specifications.
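For anyone who wants to play along, here's a minimal sketch of the dies-per-wafer estimate in Python. It uses a common dies-per-wafer approximation plus a simple Poisson yield model; the exact formula, edge exclusion and yield model aren't necessarily the ones behind my table, so treat it as napkin math rather than a reproduction of my numbers.

```python
import math

def good_dies_per_wafer(die_area_mm2, defect_density_per_cm2=0.1,
                        wafer_diameter_mm=300):
    """Rough estimate of gross and defect-free dies per wafer.

    Napkin math: ignores die aspect ratio, scribe lines and edge
    exclusion details, and assumes a simple Poisson yield model.
    """
    radius = wafer_diameter_mm / 2
    # Common dies-per-wafer approximation (area term minus an edge-loss term).
    gross = (math.pi * radius**2 / die_area_mm2
             - math.pi * wafer_diameter_mm / math.sqrt(2 * die_area_mm2))
    # Poisson yield: probability that a die picks up zero defects.
    yield_fraction = math.exp(-defect_density_per_cm2 * die_area_mm2 / 100)
    return int(gross), int(gross * yield_fraction)
```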

If you're wondering how much of an impact the defect rate has here, not a massive amount. Looking at NAVI21, going from a 0.1 to a 0.09 defect rate means we go from 59 to 62 expected good dies, reducing the cost from $158 to $150. For a small die like NAVI24, we go from 498 to 503 expected good dies, $18.7 to $18.5.
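As a usage example, back-solving a wafer price from the NAVI21 figures above (roughly $158 × 59 ≈ $9,300, a placeholder rather than a confirmed N7 price) and feeding in a ~520 mm² die, the sensitivity looks like this. The simpler yield model in the sketch gives slightly higher die counts than my table, but the trend is the same:

```python
# Placeholder wafer price back-solved from the NAVI21 numbers above (~$158 x 59).
wafer_cost = 9300
navi21_area_mm2 = 520  # approximate die size

for d0 in (0.10, 0.09):
    gross, good = good_dies_per_wafer(navi21_area_mm2, defect_density_per_cm2=d0)
    print(f"D0={d0}: {good} good dies, ~${wafer_cost / good:.0f} per good die")
```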
 
Interesting napkin math. Thanks for sharing!!

EDIT: I wonder about the binning process... I'd think that has to be a bit fluid during ramp-up. NV may have a performance threshold for X, but if only 60% (random value) of the dies are reaching that proposed threshold for X, they'd have to lower it to balance things out, no? I'd assume this was done pre-production (read: before the ramp-up to build stock), but still. So if only 60% of dies make full AD102 grade, but the rest can be cut back successfully to something AD103-like, that significantly increases the effective cost of a full AD102. We're not talking about differences of 1% like the defect rate.
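To put some toy numbers on that thought (everything here is invented for illustration, nothing is an actual nvidia figure), here's how a bin split shifts cost onto the top SKU, with and without the cut-down parts recouping anything:

```python
# Toy bin-split example; every number here is made up for illustration.
good_dies = 100          # defect-free dies per wafer
wafer_cost = 10000       # invented wafer price in USD
full_bin_fraction = 0.6  # only 60% of good dies hit the full-spec bin

cost_per_good_die = wafer_cost / good_dies                          # $100
top_sku_carries_all = wafer_cost / (good_dies * full_bin_fraction)  # ~$167

# If the other 40% sell as a cut-down SKU, their revenue offsets the wafer
# cost and pulls the top SKU's share back down.
salvage_per_cut_die = 60   # invented recovery per cut-down die
cut_dies = good_dies * (1 - full_bin_fraction)
net_top_sku_cost = ((wafer_cost - cut_dies * salvage_per_cut_die)
                    / (good_dies * full_bin_fraction))

print(cost_per_good_die, round(top_sku_carries_all), round(net_top_sku_cost))
# 100.0 167 127
```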

I really do appreciate you taking the time. Most couldn't get the numbers to fit in a large ballpark, but I think you've managed to do it. :)
 
I looked at GA102 from the perspective of cut-down offerings. GA102 was used in models from the 3070 Ti up to the 3090 Ti, the latter using the full die. I'll only look down as far as the 3080 10GB, which has about 81% of the core configuration enabled, with the cores themselves making up approximately 58% of the die area. That means roughly 11% of the total silicon (the ~19% of cores that are disabled, times the 58% of the die they occupy) is available to contain a defect compared to a full-die product.

Let me rephrase that. Assuming a single defect, it has a 58% chance of landing in the core area, where parts can be turned off. As long as that defect doesn't cover more than 11% of the divisible units in that area, you can make a 3080 10GB out of it. I have no idea how big a defect is, and there is a chance of multiple defects in close proximity; I can't model either of those. If I assume that a defect landing in the right place is always contained, I can work out the effect this has on overall yield. Looking back at my numbers, GA102 has a 55% yield, for 44 good dies out of a total of 80. That leaves 36 defective dies, and with a 58% chance for the defect to land in the core area where it can be mapped out, about 20 of them could have sufficient functionality to be 3080 10GB grade. End result: we still have 55% yield for a perfect die, but a further 25% could make 3080 10GB tier, for 80% usable dies overall.
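A quick sketch of that salvage arithmetic, using the same numbers as above (80 gross dies, 55% perfect yield, cores covering 58% of the die). The assumption that every single core-area defect can be fenced off is the optimistic part:

```python
# GA102 salvage sketch using the numbers from the post above.
total_dies = 80            # gross GA102 candidates per wafer
perfect_dies = 44          # defect-free dies (55% yield)
core_area_fraction = 0.58  # share of the die taken up by the core array

defective = total_dies - perfect_dies
# Optimistic assumption: any single defect landing in the core area can be
# fenced off and the die still makes 3080 10GB spec.
salvageable = int(defective * core_area_fraction)

print(f"perfect {perfect_dies/total_dies:.0%}, "
      f"salvage {salvageable/total_dies:.0%}, "
      f"usable {(perfect_dies + salvageable)/total_dies:.0%}")
# perfect 55%, salvage 25%, usable 80%
```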

Edit: I just spotted a mistake. I didn't consider ROP scaling in the above, so that could introduce some small error. I think it's too much of a detail to try to integrate given the massive unknowns I have elsewhere. Similarly, I assumed all the other parts of the die still had to be perfect; maybe some of those could also tolerate a defect outside my consideration.
 