Guide to Self Hosting LLMs Faster/Better than Ollama

brucethemoose@lemmy.world · edit-2 3 hours ago

It depends!

Exllamav2 was pretty fast on AMD, exllamav3 is getting support soon. Vllm is also fast AMD. But its not easy to setup; you basically have to be a Python dev on linux and wrestle with pip. Or get lucky with docker.

Base llama.cpp is fine, as are forks like kobold.cpp rocm. This is more doable without so much hastle.

The AMD framework desktop is a pretty good machine for large MoE models. The 7900 XTX is the next best hardware, but unfortunately AMD is not really interested in competing with Nvidia in terms of high VRAM offerings :'/. They don’t want money I guess.

And there are… quirks, depending on the model.

I dunno about Intel Arc these days, but AFAIK you are stuck with their docker container or llama.cpp. And again, they don’t offer a lot of VRAM for the $ either.

NPUs are mostly a nothingburger so far, only good for tiny models.

Llama.cpp Vulkan (for use on anything) is improving but still behind in terms of support.

A lot of people do offload MoE models to Threadripper or EPYC CPUs, via ik_llama.cpp, transformers or some Chinese frameworks. That’s the homelab way to run big models like Qwen 235B or deepseek these days. An Nvidia GPU is still standard, but you can use a 3090 or 4090 and put more of the money in the CPU platform.

You wont find a good comparison because it literally changes by the minute. AMD updates ROCM? Better! Oh, but something broke in llama.cpp! Now its fixed an optimized 4 days later! Oh, architecture change, not it doesn’t work again. And look, exl3 support!

You can literally bench it in a day and have the results be obsolete the next, pretty often.

brucethemoose@lemmy.world · edit-2 3 hours ago

Depends. You’re in luck, as someone made a DWQ (which is the most optimal way to run it on Macs, and should work in LM Studio): https://huggingface.co/mlx-community/Kimi-Dev-72B-4bit-DWQ/tree/main

It’s chonky though. The weights alone are like 40GB, so assume 50GB of VRAM allocation for some context. I’m not sure what Macs that equates to… 96GB? Can the 64GB can allocate enough?

Otherwise, the requirement is basically a 5090. You can stuff it into 32GB as an exl3.

Note that it is going to be slow on Macs, being a dense 72B model.

brucethemoose@lemmy.world · edit-2 5 hours ago

One last thing: I’ve heard mixed things about 235B, hence there might be a smaller, more optimal LLM for whatever you do.

For instance, Kimi 72B is quite a good coding model: https://huggingface.co/moonshotai/Kimi-Dev-72B

It might fit in vllm (as an AWQ) with 2x 4090s. It and would easily fit in TabbyAPI as an exl3: https://huggingface.co/ArtusDev/moonshotai_Kimi-Dev-72B-EXL3/tree/4.25bpw_H6

As another example, I personally use Nvidia Nemotron models for STEM stuff (other than coding). They rock at that, specifically, and are weaker elsewhere.

brucethemoose@lemmy.world · edit-2 5 hours ago

Ah, here we go:

https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF

Ubergarm is great. See this part in particular: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF#quick-start

You will need to modify the syntax for 2x GPUs. I’d recommend starting f16/f16 K/V cache with 32K (to see if that’s acceptable, as then theres no dequantization compute overhead), and try not go lower than q8_0/q5_1 (as the V is more amenable to quantization).

brucethemoose@lemmy.world · edit-2 5 hours ago

Qwen3-235B-A22B-FP8

Good! An MoE.

Ideally its maxium context lenght of 131K but i’m willing to compromise.

I can tell you from experience all Qwen models are terrible past 32K. What’s more, going over 32K, you have to run them in a special “mode” (YaRN) that degrades performance under 32K. This is particularly bad in vllm, as it does not support dynamic YaRN scaling.

Also, you lose a lot of quality with FP8/AWQ quantization unless it’s native FP8 (like deepseek). Exllama and ik_llama.cpp quants are much higher quality, and their low batch performance is still quite good. Also, VLLM has no good K/V cache quantization (its FP8 destroys quality), while llama.cpp’s is good, and exllama’s is excellent, making it less than ideal for >16K. Its niche is more highly parallel, low context size serving.

My current setup is already: Xeon w7-3465X 128gb DDR5 2x 4090

Honestly, you should be set now. I can get 16+ t/s with high context Hunyuan 70B (which is 13B active) on a 7800 CPU/3090 GPU system with ik_llama.cpp. That rig (8 channel DDR5, and plenty of it, vs my 2 channels) should at least double that with 235B, with the right quantization, and you could speed it up by throwing in 2 more 4090s. The project is explicitly optimized for your exact rig, basically :)

It is poorly documented through. The general strategy is to keep the “core” of the LLM on the GPUs while offloading the less compute intense experts to RAM, and it takes some tinkering. There’s even a project to try and calculate it automatically:

https://github.com/k-koehler/gguf-tensor-overrider

IK_llama.cpp can also use special GGUFs regular llama.cpp can’t take, for faster inference in less space. I’m not sure if one for 235B is floating around huggingface, I will check.

Side note: I hope you can see why I asked. The web of engine strengths/quirks is extremely complicated, heh, and the answer could be totally different for different models.

brucethemoose@lemmy.world · edit-2 6 hours ago

Be specific!

What models size (or model) are you looking to host?
At what context length?
What kind of speed (token/s) do you need?
Is it just for you, or many people? How many? In other words should the serving be parallel?

In other words, it depends, but the sweetpsot option for a self hosted rig, OP, is probably:

One 5090 or A6000 ADA GPU. Or maybe 2x 3090s/4090s, underclocked.
A cost-effective EPYC CPU/Mobo
At least 256 GB DDR5

Now run ik_llama.cpp, and you can serve Deepseek 671B faster than you can read without burning your house down with H200s: https://github.com/ikawrakow/ik_llama.cpp

It will also do for dots.llm, kimi, pretty much any of the mega MoEs de joure.

But there’s all sorts of niches. In a nutshell, don’t think “How much do I need for AI?” But “What is my target use case, what model is good for that, and what’s the best runtime for it?” Then build your rig around that.

brucethemoose@lemmy.world · edit-2 12 days ago

Pro is 120hz.

But they are expensive as heck. I only got the 16 Plus because its a carrier loss leader, heh.

And wouldn’t fix some of my other quibbles with iOS’s inflexibility. My ancient jailbroken iPhone 4 was more customizable than now, and Apple is still slowly, poorly implementing features I had a decade ago. It’s mind boggling, and jailbreaking isn’t a good option anymore.

brucethemoose@lemmy.world · edit-2 12 days ago

My last Android phone was a Razer Phone 2, SD845 circa 2018. Basically stock Android 9.

And it was smooth as butter. It had a 120hz screen while my iPhone 16 is stuck at 60, and I can feel it. And it flew through some heavy web apps I use while the iPhone chugs and jumps around, even though the new SoC should objectively blow away even modern Android devices.

It wasn’t always this way; iOS used to be (subjectively) so much faster that it’s not even funny, at least back when I had an iPhone 6S(?). Maybe there was an inflection point? Or maybe it’s only the case with “close to stock” Android stuff that isn’t loaded with bloat.

brucethemoose@lemmy.world · edit-2 12 days ago

Random aside, I switched from Android to iOS a year ago. I miss Android already.

The UI is more convoluted an clunky than iOS from years ago, just as uncustomizable, and performs shockly bad on heavy webpages on a brand new 16+. It’s got no freaking RAM, no sd card slot. Some free FOSS apps are nonexistant or paid only.

Security and OOTB privacy is better and app support is generally better, but that’s about it? I’d probably keep an iPhone around to bank on when I eventually switch…

brucethemoose@lemmy.world · edit-2 15 days ago

What @mierdabird@lemmy.dbzer0.com said, but the adapters arent cheap. You’re going to end up spending more than the 1060 is worth.

A used desktop to slap it in, that you turn on as needed, might make sense? Doubly so if you can find one with an RTX 3060, which would open up 32B models with TabbyAPI instead of ollama. Some configure them to wake on LAN and boot an LLM server.

brucethemoose@lemmy.world · 17 days ago

You can still use the IGP, which might be faster in some cases.

brucethemoose@lemmy.world · edit-2 17 days ago

Oh actually that’s a great card for LLM serving!

Use the llama.cpp server from source, it has better support for Pascal cards than anything else:

https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md

Gemma 3 is a hair too big (like 17-18GB), so I’d start with InternVL 14B Q5K XL: https://huggingface.co/unsloth/InternVL3-14B-Instruct-GGUF

Or Mixtral 24B IQ4_XS for more ‘text’ intelligence than vision: https://huggingface.co/unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF

I’m a bit ‘behind’ on the vision model scene, so I can look around more if they don’t feel sufficient, or walk you through setting up the llama.cpp server, but basically it provides an endpoint which you can hit with the same API as ChatGPT.

brucethemoose@lemmy.world · edit-2 19 days ago

1650

You mean GPU? Yeah, it’s good, I was strictly talking about purchasing a laptop for LLM usage, as most are less than ideal for the money. Laptop vram pools are relatively small and SO-DIMMS are usually very slow.

Things will get much better once the “Max” AMD SKUs proliferate.

brucethemoose@lemmy.world · edit-2 19 days ago

Yeah, just paying for LLM APIs is dirt cheap, and they (supposedly) don’t scrape data. Again I’d recommend Openrouter and Cerebras! And you get your pick of models to try from them.

Even a framework 16 is not good for LLMs TBH. The Framework desktop is (as it uses a special AMD chip), but it’s very expensive. Honestly the whole hardware market is so screwed up, hence most ‘local LLM enthusiasts’ buy a used RTX 3090 and stick them in desktops or servers, as no one wants to produce something affordable apparently :/

brucethemoose@lemmy.world · edit-2 20 days ago

I was a bit mistaken, these are the models you should consider:

https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ

https://huggingface.co/AnteriorAI/gemma-3-4b-it-qat-q4_0-gguf

https://huggingface.co/unsloth/Jan-nano-GGUF (specifically the UD-Q4 or UD-Q5 file)

they are state-of-the-art at this size, as far as I know.

brucethemoose@lemmy.world · edit-2 20 days ago

8GB?

You might be able to run Qwen3 4B: https://huggingface.co/mlx-community/Qwen3-4B-4bit-DWQ/tree/main

But honestly you don’t have enough RAM to spare, and even a small model might bog things down. I’d run Open Web UI or LM Studio with a free LLM API, like Gemini Flash, or pay a few bucks for something off openrouter. Or maybe Cerebras API.

…Unfortunely, LLMs are very RAM intensive, and >4GB (more realistically like 2GB) is not going to be a good experience :(

brucethemoose@lemmy.world · edit-2 20 days ago

Actually, to go ahead and answer, the “fastest” path would be LM Studio (which supports MLX quants natively and is not time intensive to install), and a DWQ quantization (which is a newer, higher quality variant of MLX models).

Hopefully one of these models, depending on how much RAM you have:

https://huggingface.co/mlx-community/Qwen3-14B-4bit-DWQ-053125

https://huggingface.co/mlx-community/Magistral-Small-2506-4bit-DWQ

https://huggingface.co/mlx-community/Qwen3-30B-A3B-4bit-DWQ-0508

https://huggingface.co/mlx-community/GLM-4-32B-0414-4bit-DWQ

With a bit more time invested, you could try to set up Open Web UI as an alterantive interface (which has its own built in web search like Gemini): https://openwebui.com/

And then use LM Studio (or some other MLX backend, or even free online API models) as the ‘engine’

Alternatively, especially if you have a small RAM pool, Gemma 12B QAT Q4_0 is quite good, and you can run it with LM Studio or anything else that supports a GGUF. Not sure about 12B-ish thinking models off the top of my head, I’d have to look around.

brucethemoose@lemmy.world · edit-2 20 days ago

Honestly perplexity, the online service, is pretty good.

But first question is: how much RAM does your Mac have? This is basically the factor for what model you can and should run.

brucethemoose@lemmy.world · edit-2 20 days ago

I don’t understand.

Ollama is not actually docker, right? It’s running the same llama.cpp engine, it’s just embedded inside the wrapper app, not containerized. It has a docker preset you can use, yeah.

And basically every LLM project ships a docker container. I know for a fact llama.cpp, TabbyAPI, Aphrodite, Lemonade, vllm and sglang do. It’s basically standard. There’s all sorts of wrappers around them too.

You are 100% right about security though, in fact there’s a huge concern with compromised Python packages. This one almost got me: https://pytorch.org/blog/compromised-nightly-dependency/

This is actually a huge advantage for llama.cpp, as it’s free of python and external dependencies by design. This is very unlike ComfyUI which pulls in a gazillian external repos. Theoretically the main llama.cpp git could be compromised, but it’s a single, very well monitored point of failure there, and literally every “outside” architecture and feature is implemented from scratch, making it harder to sneak stuff in.

brucethemoose@lemmy.world · edit-2 20 days ago

OK.

Then LM Studio. With Qwen3 30B IQ4_XS, low temperature MinP sampling.

That’s what I’m trying to say though, there is no one click solution, that’s kind of a lie. LLMs work a bajillion times better with just a little personal configuration. They are not magic boxes, they are specialized tools.

Random example: on a Mac? Grab an MLX distillation, it’ll be way faster and better.

Nvidia gaming PC? TabbyAPI with an exl3. Small GPU laptop? ik_llama.cpp APU? Lemonade. Raspberry Pi? That’s important to know!

What do you ask it to do? Set timers? Look at pictures? Cooking recipes? Search the web? Look at documents? Do you need stuff faster or accurate?

This is one reason why ollama is so suboptimal, with the other being just bad defaults (Q4_0 quants, 2048 context, no imatrix or anything outside GGUF, bad sampling last I checked, chat template errors, bugs with certain models, I can go on). A lot of people just try “ollama run” I guess, then assume local LLMs are bad when it doesn’t work right.

brucethemoose@lemmy.world · edit-2 9 months ago

Guide to Self Hosting LLMs Faster/Better than Ollama