As the title says, i started by selfhosting OpenWebUI including Ollama on my RIG. I have been pretty happy but the more i dig into this stuff, more i understand that i am doing it wrong and i definitely need to switch to llama.cpp / ik_llama.cpp.
But i have a few questions…
-
I want a web based LLM chat GUI, because that’s my 80% usage for AI. If i go with llama.cpp, do i need to ditch OpenWebUI as well? Is there a better UI? Do i need an UI?
-
i am currently hosting it all with a docker compose file. Is this still doable if i switch? I can go bare-metal (Gentoo server, good skills on my side) but it’s the maintenance part, a “podman compose pull” is just easier… or i am lazy.
-
the server is headless and always accessed remotely via web or ssh, just to be clear.
My hardware is a NVIDIA RTX A4000 16GB VRAM on a I7-8700@3200Ghz with 64GB system RAM (shared with far too many services).


OpenWebUI works with plain llama.cpp
16 is a bit small so try a MoE (e.g. QWEN 3.6 35BA3B) model and put experts on the CPU (although DDR4 may be underwhelming) which you can do with llama ( with offloading and drafting for T/s) but not ollama (spitting noise). Here’s a good starting point. You’ll likely get 60+T/s on say a 6 bit quant.
You can use a container approach, but llama.cpp is a bit of a moving target, with new cool features coming along regularly to support new models. I build it in a distrobox and running it is a simple call. When it doesn’t want to build anymore because dependencies have changed too much, I just spin up a new distrobox and leave the old one there for older models. I find it a good balance between flexibility and ease of maintenance, and technically it’s also a container approach. Take notes so you know how to set up the new one.