GPU | VRAM | Price (€) | Bandwidth (TB/s) | TFLOP16 | €/GB | €/TB/s | €/TFLOP16 |
---|---|---|---|---|---|---|---|
NVIDIA H200 NVL | 141GB | 36284 | 4.89 | 1671 | 257 | 7423 | 21 |
NVIDIA RTX PRO 6000 Blackwell | 96GB | 8450 | 1.79 | 126.0 | 88 | 4720 | 67 |
NVIDIA RTX 5090 | 32GB | 2299 | 1.79 | 104.8 | 71 | 1284 | 22 |
AMD RADEON 9070XT | 16GB | 665 | 0.6446 | 97.32 | 41 | 1031 | 7 |
AMD RADEON 9070 | 16GB | 619 | 0.6446 | 72.25 | 38 | 960 | 8.5 |
AMD RADEON 9060XT | 16GB | 382 | 0.3223 | 51.28 | 23 | 1186 | 7.45 |
This post is part “hear me out” and part asking for advice.
Looking at the table above, AI GPUs are a pure scam, and it would make much more sense (at least going by this) to use gaming GPUs instead, either through a Frankenstein of PCIe switches or a high-bandwidth network.
So my question is whether somebody has built a similar setup and what their experience has been, what the expected overhead/performance hit is, and whether it can be made up for by having just way more raw performance for the same price.
The table you’re referencing leaves out CUDA/tensor cores (count + generation), which are a big part of these GPUs, and it also doesn’t factor in the type of memory. From the comments it looks like you want to use a large MoE model. You aren’t going to be able to just stack raw power and expect to run this without major deterioration of performance, if it runs at all.
Don’t forget your MoE model needs all-to-all communication for expert routing
Why do core counts and memory type matter when the table already includes memory bandwidth and TFLOP16?
The H200 has HBM and a lot of tensor cores, which is reflected in its high stats in the table, and the AMD GPUs don’t have CUDA cores at all.
I know a major deterioration is to be expected, but how major? Even in an extreme case with only 10% efficiency of the total power, it’s still competitive against the H200, since you can get way more for the price even if you can only use 10% of it.
TFLOPS is a generic measurement, not actual utilization, and not specific to a given type of workload. Not all workloads saturate GPU utilization equally, and AI models depend on CUDA/tensor cores: the generation and count of your cores determine how well they’re optimized for AI workloads and how much of those TFLOPS you can actually use for your task. And yes, AMD uses ROCm, which I didn’t feel I needed to specify since it’s a given (and years behind CUDA’s capabilities). The point is that these things are not equal and there are major differences here alone.
I mentioned memory type since the cards you listed use different kinds (HBM vs GDDR), so you can’t just compare capacity alone and expect equal performance.
And again, for your specific use case of a large MoE model, you’d need to solve the GPU-to-GPU communication issue (ensuring both the connections and sufficient speed without getting bottlenecked).
I think you’re going to need to do an actual analysis of the specific setup you’re proposing. Good luck.
Well, a few issues:
- For hosting or training large models you want high bandwidth between GPUs. PCIe is too slow; NVLink has literally an order of magnitude more bandwidth. See what Nvidia is doing with NVLink and AMD with Infinity Fabric. Only available if you pay the premium, and if you need the bandwidth, you are most likely happy to pay.
- Same thing as above, but with memory bandwidth. The HBM chips in an H200 will run circles around the GDDR garbage they hand out to the poor people with filthy consumer cards. By the way, your inference and training are most likely bottlenecked by memory bandwidth, not available compute.
- Commercially supported cooling of gaming GPUs in rack servers? Lol. Good luck getting any reputable hardware vendor to sell you that, and definitely not at the power densities you want in a data center.
- TFLOP16 isn’t enough. Look at the 4-bit and 8-bit tensor numbers; that’s where the expensive silicon is used.
- Nvidia’s licensing agreements basically prohibit gaming cards in servers. No one will sell it to you at any scale.
For fun, home use, research or small time hacking? Sure, buy all the gaming cards you can. If you actually need support and have a commercial use case? Pony up. Either way, benchmark your workload, don’t look at marketing numbers.
Is it a scam? Of course, but you can’t avoid it.
- I know the more bandwidth the better, but I wonder how it scales. I can only test my own setup, which is less than optimal for this purpose with PCIe 4.0 x16 and no P2P, but it goes as follows: a single 4090 gets 40.9 t/s while two get 58.5 t/s using tensor parallelism, tested on Qwen/Qwen3-8B-FP8 with vLLM (launch sketch at the end of this comment). I am really curious how this scales over more than 2 PCIe 5.0 cards with P2P, which all the cards listed here except the 5090 support.
- The theory goes that yes, the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s; whether this actually works in practice I don’t know.
- I don’t need to build a datacenter; I’m fine with building a rack myself in my garage. And I don’t think that requires higher volumes than just purchasing at different retailers.
- I intend to run at FP8, so I wanted to show that instead of FP16, but it’s surprisingly difficult to find those numbers. Only the H200 datasheet clearly lists “FP16 Tensor Core”; the RTX PRO 6000 datasheet keeps it vague, only mentioning “AI TOPS”, which they define as “Effective FP4 TOPS with sparsity”; and they didn’t even bother writing a datasheet for the 5090, only stating “3352 AI TOPS”, which I suppose is FP4 then. The AMD datasheets only list FP16 and INT8 matrix, and whether INT8 matrix is equivalent to FP8 I don’t know. So FP16 was the common denominator I could find for all the cards without comparing apples with oranges.
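For reference, a minimal sketch of how the two-card tensor-parallel test above is launched; the flags are standard vLLM options, but verify them against your installed version:

```bash
# Minimal sketch of the 2x 4090 tensor-parallel test described above.
# --tensor-parallel-size 2 shards the model across both GPUs;
# context length and memory headroom here are illustrative, not tuned values.
vllm serve Qwen/Qwen3-8B-FP8 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90
```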
the H200 has a very impressive bandwidth of 4.89 TB/s, but for the same price you can get 37 TB/s spread across 58 RX 9070s; whether this actually works in practice I don’t know.
Your math checks out, but only for some workloads. Other workloads scale out like shit, and then you want all your bandwidth concentrated. At some point you’ll also want to consider power draw:
- One H200 is like 1500W when including support infrastructure like networking, motherboard, CPUs, storage, etc.
- 58 consumer cards will be like 8 servers loaded with GPUs, at like 5kW each, so say 40kW in total.
Now include power and cooling over a few years and do the same calculations.
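As a rough sketch, assuming €0.30/kWh (pick your own rate) and 24/7 operation: 40 kW is about 350,000 kWh a year, i.e. roughly €105,000, while the single ~1.5 kW H200 box comes to about 13,000 kWh and €4,000 a year, before you even add cooling.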
As for apples and oranges, this is why you can’t look at the marketing numbers, you need to benchmark your workload yourself.
I don’t need to build a datacenter; I’m fine with building a rack myself in my garage.
During the last GPU mining craze, I helped build a 3-rack mining operation. GPUs are unregulated, power-sucking pieces of shit from a power-management perspective. You do not have the power requirements to do this on residential power, even with 300-amp service.
Think of a microwave’s behaviour: yes, a 1000 W microwave pulls between 700 and 900 W while cooking, but the startup load is massive, sometimes almost 1800 W, depending on how cheap the thing is.
GPUs also behave like this, but not at startup. They spin up load predictively, which means the hardware demands more power to get the job done; it doesn’t scale down the job to save power. Multiply that by 58 RX 9070s. Now add cooling.
You cannot do this.
Thanks. While I would still like to know the performance scaling of a cheap cluster, this does answer the question: pay way more for high-end cards like the H200 for greater efficiency, or pay less and have to deal with these issues.
Be specific!
- What model size (or model) are you looking to host?
- At what context length?
- What kind of speed (tokens/s) do you need?
- Is it just for you, or many people? How many? In other words, should the serving be parallel?
In other words, it depends, but the sweet-spot option for a self-hosted rig, OP, is probably:
- One 5090 or A6000 Ada GPU. Or maybe 2x 3090s/4090s, underclocked.
- A cost-effective EPYC CPU/mobo
- At least 256 GB DDR5
Now run ik_llama.cpp, and you can serve Deepseek 671B faster than you can read without burning your house down with H200s: https://github.com/ikawrakow/ik_llama.cpp
It will also do for dots.llm, kimi, pretty much any of the mega MoEs du jour.
But there’s all sorts of niches. In a nutshell, don’t think “How much do I need for AI?” But “What is my target use case, what model is good for that, and what’s the best runtime for it?” Then build your rig around that.
Is Nvidia still a de facto requirement? I’ve heard of AMD support being added to Ollama and the like, but I haven’t found robust comparisons on value.
It depends!
Exllamav2 was pretty fast on AMD, and exllamav3 is getting support soon. vLLM is also fast on AMD, but it’s not easy to set up; you basically have to be a Python dev on Linux and wrestle with pip, or get lucky with Docker.
Base llama.cpp is fine, as are forks like kobold.cpp-rocm. This is more doable without so much hassle.
The AMD Framework Desktop is a pretty good machine for large MoE models. The 7900 XTX is the next best hardware, but unfortunately AMD is not really interested in competing with Nvidia on high-VRAM offerings :'/. They don’t want money, I guess.
And there are… quirks, depending on the model.
I dunno about Intel Arc these days, but AFAIK you are stuck with their docker container or llama.cpp. And again, they don’t offer a lot of VRAM for the $ either.
NPUs are mostly a nothingburger so far, only good for tiny models.
Llama.cpp Vulkan (for use on anything) is improving but still behind in terms of support.
A lot of people do offload MoE models to Threadripper or EPYC CPUs, via ik_llama.cpp, transformers or some Chinese frameworks. That’s the homelab way to run big models like Qwen 235B or deepseek these days. An Nvidia GPU is still standard, but you can use a 3090 or 4090 and put more of the money in the CPU platform.
You won’t find a good comparison because it literally changes by the minute. AMD updates ROCm? Better! Oh, but something broke in llama.cpp! Now it’s fixed and optimized 4 days later! Oh, architecture change, now it doesn’t work again. And look, exl3 support!
You can literally bench it in a day and have the results be obsolete the next, pretty often.
Thanks!
That helps when I eventually get around to standing up my own AI server. Right now I can’t really justify the cost for my low volume of use, when I can get Cloudflare free-tier access to mid-sized models. But it’s something I want to bring into my homelab instead for better control and privacy.
My target model is Qwen/Qwen3-235B-A22B-FP8. Ideally its maximum context length of 131K, but I’m willing to compromise. I find it hard to give a concrete t/s answer; let’s put it around 50. At max load probably around 8 concurrent users, but these situations will be rare enough that optimizing for a single user is probably more worthwhile.
My current setup is already: Xeon w7-3465X, 128 GB DDR5, 2x 4090
It gets nice enough performance loading 32B models completely in VRAM, but I am skeptical that a similar system can run a 671B at anything faster than a snail’s pace. I currently run vLLM because it has higher performance with tensor parallelism than llama.cpp, but I shall check out ik_llama.cpp.
Ah, here we go:
https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF
Ubergarm is great. See this part in particular: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF#quick-start
You will need to modify the syntax for 2x GPUs. I’d recommend starting with an f16/f16 K/V cache at 32K (to see if that’s acceptable, as then there’s no dequantization compute overhead), and try not to go lower than q8_0/q5_1 (as the V is more amenable to quantization).
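Purely as an illustration (the GGUF filename is a placeholder, and the flag set is a hedged guess based on common llama.cpp/ik_llama.cpp usage rather than ubergarm’s exact recipe), a 2x 4090 launch along those lines might look like:

```bash
# Hypothetical ik_llama.cpp launch for 2x 4090 + plenty of system RAM.
# -ngl 99 keeps attention/shared weights on the GPUs, the -ot regex pushes the
# routed MoE expert tensors to CPU RAM, and the cache starts at f16/f16 with 32K
# context as suggested above. Adjust paths, threads, and split to your box.
./build/bin/llama-server \
  -m /models/Qwen3-235B-A22B-<quant>.gguf \
  -c 32768 -ctk f16 -ctv f16 -fa \
  -ngl 99 -ts 1,1 \
  -ot "blk\..*\.ffn_.*_exps.*=CPU" \
  --threads 32 --host 0.0.0.0 --port 8080
```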
Thanks! I’ll go check it out.
One last thing: I’ve heard mixed things about 235B, hence there might be a smaller, more optimal LLM for whatever you do.
For instance, Kimi 72B is quite a good coding model: https://huggingface.co/moonshotai/Kimi-Dev-72B
It might fit in vLLM (as an AWQ) with 2x 4090s, and it would easily fit in TabbyAPI as an exl3: https://huggingface.co/ArtusDev/moonshotai_Kimi-Dev-72B-EXL3/tree/4.25bpw_H6
As another example, I personally use Nvidia Nemotron models for STEM stuff (other than coding). They rock at that, specifically, and are weaker elsewhere.
What do I need to run Kimi? Does it have Apple Silicon-compatible releases? It seems promising.
Depends. You’re in luck, as someone made a DWQ (which is the most optimal way to run it on Macs, and should work in LM Studio): https://huggingface.co/mlx-community/Kimi-Dev-72B-4bit-DWQ/tree/main
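If it helps, a hedged sketch of running that DWQ with the mlx-lm CLI (assuming mlx-lm is installed; LM Studio should also pick the repo up directly):

```bash
# Hypothetical one-off generation with mlx-lm on Apple Silicon.
# The prompt and token budget are just examples.
pip install mlx-lm
mlx_lm.generate \
  --model mlx-community/Kimi-Dev-72B-4bit-DWQ \
  --prompt "Write a Python function that reverses a linked list." \
  --max-tokens 256
```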
It’s chonky though. The weights alone are like 40GB, so assume 50GB of memory allocation for some context. I’m not sure what Macs that equates to… 96GB? Can the 64GB one allocate enough?
Otherwise, the requirement is basically a 5090. You can stuff it into 32GB as an exl3.
Note that it is going to be slow on Macs, being a dense 72B model.
Qwen3-235B-A22B-FP8
Good! An MoE.
Ideally its maximum context length of 131K, but I’m willing to compromise.
I can tell you from experience that all Qwen models are terrible past 32K. What’s more, going over 32K you have to run them in a special “mode” (YaRN) that degrades performance under 32K. This is particularly bad in vLLM, as it does not support dynamic YaRN scaling.
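If you do end up needing the full 131K in vLLM anyway, static YaRN is switched on with something like the following; the JSON mirrors Qwen’s published recipe, but treat the exact flags as an assumption to check against your vLLM version (and remember it degrades sub-32K quality):

```bash
# Hypothetical static-YaRN launch; the scaling is applied even for short prompts,
# which is exactly the downside described above.
vllm serve Qwen/Qwen3-235B-A22B-FP8 \
  --rope-scaling '{"rope_type":"yarn","factor":4.0,"original_max_position_embeddings":32768}' \
  --max-model-len 131072
```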
Also, you lose a lot of quality with FP8/AWQ quantization unless it’s native FP8 (like DeepSeek). Exllama and ik_llama.cpp quants are much higher quality, and their low-batch performance is still quite good. Also, vLLM has no good K/V cache quantization (its FP8 destroys quality), while llama.cpp’s is good and exllama’s is excellent, making vLLM less than ideal for >16K. Its niche is more highly parallel, low-context serving.
My current setup is already: Xeon w7-3465X, 128 GB DDR5, 2x 4090
Honestly, you should be set now. I can get 16+ t/s with high context Hunyuan 70B (which is 13B active) on a 7800 CPU/3090 GPU system with ik_llama.cpp. That rig (8 channel DDR5, and plenty of it, vs my 2 channels) should at least double that with 235B, with the right quantization, and you could speed it up by throwing in 2 more 4090s. The project is explicitly optimized for your exact rig, basically :)
It is poorly documented, though. The general strategy is to keep the “core” of the LLM on the GPUs while offloading the less compute-intensive experts to RAM, and it takes some tinkering. There’s even a project to try and calculate it automatically:
https://github.com/k-koehler/gguf-tensor-overrider
IK_llama.cpp can also use special GGUFs that regular llama.cpp can’t take, for faster inference in less space. I’m not sure if one for 235B is floating around Hugging Face; I will check.
Side note: I hope you can see why I asked. The web of engine strengths/quirks is extremely complicated, heh, and the answer could be totally different for different models.
“AI” in its current form is a scam, and Nvidia is making the most of this grift. They are now worth more than any other company in the world.
You really need to elaborate on the nature of the scam.
LLMs are experimental, alpha-level technologies. Nvidia showed investors how fast their cards could compute this information. Now investors can just tell the LLM what they want, and it will spit out something that probably looks similar to what they want. But Nvidia is going to sell as many cards as possible before the bubble bursts.
So… GPUs were for crypto before, now it’s LLMs. Weird world we live in.
Any time you need a CPU that can do a shit load of basic math, a GPU will win every time.
You can design algorithms specifically to mess up parallelism by branching a lot. For example, if you want your password hashes to be GPU-resistant.
ML has been sold as AI and honestly that’s enough of a scam for me to call it one.
But I also don’t really see end users getting scammed, just venture capital, and I’m OK with this.
Correct. Pattern recognition + prompts to desire a positive result, even if the answer isn’t entirely true. If it’s close enough to the desired pattern, it gets pushed.
The AI cards prioritize compute density instead of frame rates, etc., so you can’t directly compare price points between them like that without including that data. You could cluster gaming cards, though, using NVLink or the AMD Infinity Fabric thing. You aren’t going to get anywhere near the same performance, and you are really going to rely on quantization to make it work, but depending on your self-hosting use case you probably don’t need a $30,000 card.
It’s not a scam, but it’s also something you probably don’t need.
For your personal use, you probably shouldn’t get an “AI” GPU. If you start needing a terabyte of VRAM, and heat, space, and energy start becoming real problems, reconsider.
Looking at the table above, AI GPUs are a pure scam
How much more power are your gaming GPUs going to use? How much more space will they use? How much more heat will you need to dissipate?
Well, a scam for self-hosters; for datacenters it’s different, of course.
I’m looking to upgrade to my first dedicated, self-built server, coming from only SBCs, so I’m not sure how much of a concern heat will be, but space and power shouldn’t be an issue (within reason, of course).
Efficiency still matters very much when self hosting. You need to consider power usage (do you have enough amps in your service to power a single GPU? probably. what about 10? probably not) and heat (it’s going to make you need to run more A/C in the summer, do you have enough in your service to power an A/C and your massive amount of GPUs? not likely).
Homes are not designed for huge amounts of hardware. I think a lot of self hosters (including my past self) can forget that in their excitement of their hobby. Personally, I’m just fine not running huge models at home. I can get by with models that can run on a single GPU, and even if I had more GPUs in my server, I don’t think the results (which would still contain many hallucinations) would be worth the power cost, strain on my A/C, and possible electrical overload.
😑
While I would still say it’s excessive to respond with “😑”, I was too quick in waving these issues away.
Another commenter explained that residential power physically can’t supply enough for a pile of high-end GPUs, which is why even for self-hosters the expensive cards could be worth it.
?
Initially a lot of AI was getting trained on lower-class GPUs, and none of these special AI cards/blades existed. The problem is that the problems are quite large and hence require a lot of VRAM to work on, or you split them up and pay enormous latency penalties going across the network. Putting it all into one giant package costs a lot more, but it also performs a lot better, because AI is not an embarrassingly parallel problem that can be easily split across many GPUs without penalty. So the goal is often to reduce the number of GPUs you need to get a result quickly enough, and that brings its own set of problems of power density in server racks.