As the title says, i started by selfhosting OpenWebUI including Ollama on my RIG. I have been pretty happy but the more i dig into this stuff, more i understand that i am doing it wrong and i definitely need to switch to llama.cpp / ik_llama.cpp.
But i have a few questions…
-
I want a web based LLM chat GUI, because that’s my 80% usage for AI. If i go with llama.cpp, do i need to ditch OpenWebUI as well? Is there a better UI? Do i need an UI?
-
i am currently hosting it all with a docker compose file. Is this still doable if i switch? I can go bare-metal (Gentoo server, good skills on my side) but it’s the maintenance part, a “podman compose pull” is just easier… or i am lazy.
-
the server is headless and always accessed remotely via web or ssh, just to be clear.
My hardware is a NVIDIA RTX A4000 16GB VRAM on a I7-8700@3200Ghz with 64GB system RAM (shared with far too many services).
I’d not use ollama, it’s basically just a fancy wrapper around lama.cpp.
There’s also modules/docker containers to hot swap models with lama.cpp
My model hosting setup is: Lama.cpp -> Open web UI
Lama.cpp is running in a local shell on my Mac Mini, since setting up GPU support with metal is (or was?) a pain. And open web UI sits in a docker with a local storage mounted so it have persistence when updating or moving the docker.
16gigs vram however ain’t too much, you’ll be fairly limited to fairly low quants. It will be reasonably fast tho. If you can use most of your system ram you could go and host f.e. qwen 3.6 bf8(~56gb) or bf4 (~30gb). It would be slower but you also gain a lot of usability from that.
Or you host two models a smaller one on the GPU and bigger one with system ram so you can switch between “knowledge” and speed.
Using lama.cpp you’ll have to take a look at huggingface & use gguf models.
Llama.cpp has its own built-in web UI that is fairly decent. Not as full featured as open web UI, but depends what you’re after.



