Shimitar

Shimitar

So, first of all i want to thank all the great people in this community that responded to my previous post (https://downonthestreet.eu/post/713603).

Here is a follow up with what i understood and decided to do, both to see if i got it right from you people and maybe might be of interest for others.

First of all, i will not be installing a second RTX A4000 on my server because i would need to upgrade the PSU but that being a proprietary HP it obviously has also a proprietary power connector on the mobo, thus i would need to figure out compatibility, plus i don’t really need them if not for the fun of maximizing out my server capabilities (for no good use).

I had to clear up in my mind what uses i needed out of the LLM capabilties, and that can be summarized down to: chat (general and coding support), AI support for my SubWave radio station, and the occasional usage of ComfyUI to mess around.

For all that, the RTX A4000 i have is plentiful considering i can leverage CPU and RAM as well. The iGPU on board my I7-8700 is basically old thrash and pointless to even think about using it.

So first a correction to what i wrote in the other post comments: no i found no usable way to boot my server using the iGPU as primary. Setting up the iGPU as primary video output in BIOS just caused the server to refuse boot (stuck at “hp safe boot” logo). I had to manually remove the NVIDIA, boot without, then plug it back in to be able to boot again. No way, and which video output i connected made no difference at all.

Ok, long story to an end. This is i think i will proceed:

Install llama.cpp by self compilation (it’s a Gentoo box after all. that sound just right)
Switch back to OpenWebUI for chat (i quite liked it, and it’s just a matter of respinning up the container without ollama) using the llama.cpp above
Keep ComfyUI the way it is
Point my radio to the llama.cpp

Now, on the llama.cpp i plan to run:

A general chat model (currently qwen3.5-9b-glm5.1-distill-v1 with 4k context window, but probably something bigger) tuned to offload to CPU so that it uses up at most 6GB of VRAM, still retaining good overall performance (tests to be carried out, any suggestion appreciated)
smaller models (like nomic-embed-text-v1.5, needed for the radio) or small TTS model (for the radio, probably not needed since already embedded) still on CPU only

This should leave 10GB of VRAM free for ComfyUI, that should be plentyfil for my needs without resorting to shutdown llama.cpp.

Is this a good plan?

Now, i finally understand why ollama is really just a starting point, and how it kept me from understanding stuff. At the same time, LocalAI is nice and still a bit bloated for what i need at the end of the day, which is running a few models and that’s it. I don’t need model discovery or such, i can always experiment replacing llama.cpp main chat model if i want to. I don’t really feel like experimenting in that direction tough, i want a tool to do stuff, not (yet another) opportunity to play with even more new techy nerdy stuff to fill up my free time :)

One last thing: i came across to this page here which helped me clear up some doubts and understand a few things too, just wanted to share in case.

Followup on "Some general questions on AI setup"

Followup on "Some general questions on AI setup"