• 1 Post
  • 12 Comments
Joined 11 months ago
Cake day: March 22nd, 2024


  • It’s semantics, and a subject of ongoing debate.

    Per Wikipedia, I really like this proposal:

    Astronomer Jean-Luc Margot proposed a mathematical criterion that determines whether an object can clear its orbit during the lifetime of its host star, based on the mass of the planet, its semimajor axis, and the mass of its host star.[210] The formula produces a value called π that is greater than 1 for planets.[c] The eight known planets and all known exoplanets have π values above 100, while Ceres, Pluto, and Eris have π values of 0.1, or less. Objects with π values of 1 or more are expected to be approximately spherical, so that objects that fulfill the orbital-zone clearance requirement around Sun-like stars will also fulfill the roundness requirement[211] – though this may not be the case around very low-mass stars.

    It basically means a planet should be big enough to consolidate all the stuff in its orbital zone, rather than just being part of an asteroid belt. That makes sense to me (there are some rough numbers at the end of this comment, if you want to see how the formula shakes out).


    https://en.wikipedia.org/wiki/Dwarf_star

    “Dwarf” stars are even more confusing, as it’s basically a synonym for “normal,” as opposed to “giant” stars (which are relatively puffy and big for their mass/temperature), or more exotic stars. But the term is also used for special cases, like the relatively exotic white dwarfs (leftover cores of dead stars with very strange properties, extreme density, and no “burning” like a star traditionally does), or “barely a star” brown dwarfs.

    TL;DR: If an astronomer asks you to name something, you should say ‘absolutely not.’
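
    For the curious, here is a quick Python sketch of Margot’s criterion. If I’m reading the Wikipedia article right, the formula is roughly Π = k·m / (M^(5/2)·a^(9/8)), with m in Earth masses, M (the star) in solar masses, a in AU, and k ≈ 807 for clearing within a Sun-like star’s lifetime; the body masses and distances below are rounded and purely illustrative.

    ```python
    # Back-of-the-envelope version of Margot's planet criterion.
    # Pi = k * m / (M**2.5 * a**1.125), with m in Earth masses, M (the star)
    # in solar masses, a in AU, and k ~ 807 for Sun-like stars (assumed values).

    def margot_pi(m_earth, a_au, m_star_solar=1.0, k=807.0):
        """Pi > 1 means the body can clear its orbital zone within its star's lifetime."""
        return k * m_earth / (m_star_solar ** 2.5 * a_au ** 1.125)

    # (approximate mass in Earth masses, semimajor axis in AU)
    bodies = {"Earth": (1.0, 1.0), "Ceres": (1.6e-4, 2.77), "Pluto": (2.2e-3, 39.5)}

    for name, (m, a) in bodies.items():
        print(f"{name:6s} Pi ~ {margot_pi(m, a):.3g}")

    # Prints roughly: Earth ~ 807, Ceres ~ 0.04, Pluto ~ 0.03.
    # Planets land orders of magnitude above 1, dwarf planets well below it.
    ```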




  • You know what I mean: the rest of the world other than Russia, China, or the US.

    Even just going by that metric, that’s mostly India+neighbors (who do not like China, and are mixed on Russia I think?), plus populations that aren’t very sympathetic to any of the three veto powers anymore.

    That’s going to get more pronounced, as the US/China are not very climate conscious anymore, and Russia seems to low-key want climate change, while the rest of the world’s population tends to be in very vulnerable areas.



  • I mean, there’s a real issue.

    Say you were China, or the EU, or any other country/bloc, and basically your entire youth was addicted to Twitter, Facebook, or whatever, and officially manipulable by the US government… and you got into a real conflict. Maybe even a hot war.

    Wouldn’t you be worried about the US propagandizing your population?

    I would.

    The US government’s solution is completely dysfunctional and doesn’t get at the root of the issue, because they are afraid of reducing the power projection of big tech, among other things. But that doesn’t mean the core issue should be trivialized.


  • To go into more detail:

    • Exllama is faster than llama.cpp, all other things being equal.

    • exllama’s quantized KV cache implementation is also far superior: nearly lossless at Q4, while llama.cpp’s is nearly unusable at Q4 (and needs to be turned up to Q5_1/Q4_0 or Q8_0/Q4_1 for good quality). There are some rough cache-size numbers at the end of this comment.

    • With ollama specifically, you get locked out of a lot of knobs, like llama.cpp’s enhanced KV cache quantization, more advanced quantization (like iMatrix IQ quants or the ARM/AVX-optimized Q4_0_4_4/Q4_0_8_8 quants), advanced sampling like DRY, batched inference, and such.

    It’s not evidence or options… it’s missing features, and that’s my big issue with ollama. I simply get far worse, and far slower, LLM responses out of ollama than out of tabbyAPI/EXUI on the same hardware, and there’s no way around it.

    Also, I’ve been frustrated with implementation bugs in llama.cpp specifically, like how Llama 3.1 (for instance) was bugged past 8K context at launch because llama.cpp didn’t properly support its RoPE scaling. Ollama inherits all these quirks.

    I don’t want to go into the issues I have with the ollama devs’ behavior though, as that’s way more subjective.
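
    To put rough numbers on the KV cache point (back-of-the-envelope only; the model shape below is an assumption roughly in the ballpark of a 30B-class GQA model, not any specific one):

    ```python
    # Rough KV-cache footprint: 2 tensors (K and V) per layer, each holding
    # n_kv_heads * head_dim elements per token. The shape here is assumed.
    n_layers, n_kv_heads, head_dim = 64, 8, 128
    context = 65_536  # 64K tokens

    # Approximate bytes per element, including block-scale overhead for the
    # llama.cpp-style quantized cache formats.
    bytes_per_elem = {"fp16": 2.0, "q8_0": 1.0625, "q4_0": 0.5625}

    for fmt, b in bytes_per_elem.items():
        total = 2 * n_layers * n_kv_heads * head_dim * context * b
        print(f"{fmt}: ~{total / 2**30:.1f} GiB of KV cache at {context} tokens")

    # Roughly 16 GiB (fp16), 8.5 GiB (q8_0), 4.5 GiB (q4_0) for this shape,
    # which is why how lossy the cache quantization is matters so much at long context.
    ```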


  • It’s less optimal.

    On a 3090, I simply can’t run Command-R or Qwen 2.5 32B well at 64K-80K context with ollama (there’s some rough VRAM math at the end of this comment). It’s slow even at lower context, and the lack of DRY sampling and some other things majorly hit quality.

    Ollama is meant to be turnkey, and that’s fine, but LLMs are extremely resource-intensive. Sometimes the manual setup/configuration is worth it to squeeze out every ounce of extra performance and quantization quality.

    Even on CPU-only setups, you are missing out on (for instance) the CPU-optimized quantizations llama.cpp offers now, or the more advanced sampling kobold.cpp offers, or more fine-grained tuning of flash attention configs, or batched inference, just to start.

    And as I hinted at, I don’t like some other aspects of ollama, like how they “leech” off llama.cpp and kinda hide the association without contributing upstream, some hype and controversies in the past, and hints that they may be cooking up something commercial.
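
    For a sense of why 24GB gets so tight at long context, here is a rough budget sketch; every number in it is an illustrative assumption (quant level, per-token cache cost), not a measurement of any particular backend.

    ```python
    # Rough VRAM budget for a 24 GB card running a ~32B model at long context.
    # All numbers are assumptions for illustration; real usage depends on the
    # backend, quant format, flash attention, batch size, and overhead.
    params = 32e9                 # ~32B parameters
    bits_per_weight = 4.5         # a typical "fits on one card" quant level
    kv_gib_per_1k_tokens = 0.07   # assumed cost of a ~4-bit KV cache

    weights_gib = params * bits_per_weight / 8 / 2**30
    for ctx in (8_192, 32_768, 65_536):
        kv_gib = kv_gib_per_1k_tokens * ctx / 1024
        print(f"{ctx:>6} tokens: weights ~{weights_gib:.1f} GiB + cache ~{kv_gib:.1f} GiB"
              f" = ~{weights_gib + kv_gib:.1f} GiB, plus activations/overhead")

    # Around 64K context this already brushes up against 24 GB, which is why
    # squeezing quantization quality and cache settings matters so much.
    ```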


  • Your post is suggesting that the same models with the same parameters generate different results when run on different backends.

    Yes… sort of. Different backends support different quantization schemes, for both the weights and the KV cache (the context). There are all sorts of tradeoffs.

    There are even more exotic weight quantization schemes (AQLM, VPTQ) that are much more VRAM-efficient than llama.cpp’s or exllama’s, but I skipped mentioning them (unless someone asks) because they’re so clunky to set up.

    Different backends also support different samplers. exllama and kobold.cpp tend to be at the cutting edge of this, with things like DRY for better long-form generation, or grammar-constrained output. A rough sketch of what a DRY request can look like is below.
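
    As a concrete illustration, here is roughly what passing DRY settings to a local OpenAI-compatible server (tabbyAPI, kobold.cpp, etc.) can look like. The endpoint, model name, and dry_* parameter names here are assumptions following the common DRY sampler convention; check your backend’s docs for the exact fields it accepts.

    ```python
    import requests

    # Hypothetical local OpenAI-compatible completions endpoint; the URL,
    # model name, and dry_* field names are assumptions, not a specific API.
    resp = requests.post(
        "http://localhost:5000/v1/completions",
        json={
            "model": "my-local-model",  # placeholder
            "prompt": "Write a long story about a lighthouse keeper.",
            "max_tokens": 512,
            "temperature": 0.8,
            # DRY penalizes verbatim repetition of earlier token sequences,
            # which helps long-form generation avoid looping.
            "dry_multiplier": 0.8,
            "dry_base": 1.75,
            "dry_allowed_length": 2,
        },
        timeout=120,
    )
    print(resp.json()["choices"][0]["text"])
    ```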


  • So there are multiple ways to split models across GPUs (layer splitting, which uses one GPU and then another; expert parallelism, which puts different experts on different GPUs), but the one you’re interested in is “tensor parallelism.”

    This requires a lot of communication between the GPUs, and NVLink speeds that up dramatically. There’s a toy sketch of the split at the end of this comment.

    It comes down to this: If you’re more interested in raw generation speed, especially with parallel calls of smaller models, and/or you don’t care about long context (with 4K being plenty), use Aphrodite. It will ultimately be faster.

    But if you simply want to stuff the best/highest-quality model you can into VRAM, especially at longer context (>4K), use TabbyAPI. Its tensor parallelism only works over PCIe, so it will be a bit slower, but it will still stream text much faster than you can read. It can simply hold bigger, better models at higher quality in the same 48GB VRAM pool.
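
    If it helps to picture the difference, here is a toy numpy sketch of the tensor-parallel idea, with two “GPUs” simulated as array slices. This is only the concept, not how Aphrodite or TabbyAPI actually implement it.

    ```python
    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal((1, 4096))      # activations for one token
    W = rng.standard_normal((4096, 4096))   # one layer's weight matrix

    # Tensor parallelism (toy version): split W's columns across two "GPUs".
    # Each device holds half the weights and computes half the output.
    W_gpu0, W_gpu1 = W[:, :2048], W[:, 2048:]
    y0 = x @ W_gpu0
    y1 = x @ W_gpu1

    # The halves then have to be exchanged/concatenated before the next layer.
    # That per-layer exchange is the traffic NVLink accelerates; over plain
    # PCIe it becomes the main overhead.
    y = np.concatenate([y0, y1], axis=1)
    assert np.allclose(y, x @ W)  # same result as the unsplit multiply

    # Layer splitting, by contrast, puts whole layers on each GPU and only
    # hands activations across once per boundary, so it needs far less bandwidth.
    ```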