Hi all!
I have written a couple of posts in the past. I am an illiterate having fun with LLMs and AI in general, who is being pulled into a deeper hole by the day…
I have extensive experience with Linux (Gentoo lover for 20 years here). I am a sw dev now “promoted” to management, and an avid tech user, so not really illiterate, but I know very little about this whole LLM game.
I started with OpenWebUI + Ollama and played like an idiot with random models. Then I came across an NVIDIA RTX A4000 (16 GB GDDR6) and plugged it into my i7-8700 server with 64 GB RAM. The server also has an Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630], unused at this time (the server is 100% headless anyway).
I am currently installing LocalAI to run llama.cpp and improve my models' capability and speed, planning to ditch OpenWebUI and Ollama if LocalAI + llama.cpp works fine.
My first usage was chatting with random local models. Then I discovered Fooocus and quickly upgraded to ComfyUI. Last, I have set up my SubWave radio station and I am having so much fun…
I have a few questions:
- Can I leverage both my NVIDIA and the iGPU at the same time?
- If I use the iGPU, do I need to allocate a fixed amount of RAM to it in the BIOS? Or will it use system RAM as needed?
- Using llama.cpp I also want to leverage the CPU, since I have 64 GB RAM (also shared by many more self-hosted services, though). Is there anything special I need to do to achieve that?
- What set of models do you guys recommend for my setup? I am currently using qwen2.5-coder:14b-instruct-q5_K_M with Ollama, and I am pretty satisfied with its coding capabilities, but I want something more general purpose for my SubWave (AI-assisted web radio channel)
- I might have the opportunity to install a second RTX A4000, identical to the first, on my server (need to check PCIe slot availability and power supply specs). Would that make any sense at all?
- Power consumption wise, do the NVIDIA cards also suck power when not in active use?


Depends. Two GPUs can make sense if you want the chat/radio stack and ComfyUI to be able to run independently without fighting over VRAM. But it depends on how big the underlying models are and whether they run at the same time.
Using 1 card tho, logically, there are 3 things that can happen -
In other words, on a single card, you’re either manually managing load/unload cycles, eating CPU offload penalties on ComfyUI, or playing VRAM Tetris.
None of those are fun if you want both services available simultaneously…but that depends a lot on how big / greedy the models are. Do you want / need everything at the same time?
Eg: there are small, medium and large versions of this, https://github.com/ace-step/ACE-Step-1.5 and there are small, medium and larger versions of image generators.
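To make the "VRAM Tetris" point concrete, here is a rough sketch of the arithmetic: given a card's VRAM and a guess at each workload's footprint, list which combinations could stay loaded together. The model names, the footprints and the 1 GB headroom are all placeholders I made up for illustration, not measurements of any real model, so plug in numbers you actually observe on your card.

```python
from itertools import combinations


def fits(vram_gb, sizes_gb, headroom_gb=1.0):
    """True if the given footprints plus some headroom fit in VRAM."""
    return sum(sizes_gb) + headroom_gb <= vram_gb


def coresident_sets(vram_gb, models, headroom_gb=1.0):
    """List every combination of 2+ workloads that fits on one card together."""
    names = sorted(models)
    result = []
    for r in range(2, len(names) + 1):
        for combo in combinations(names, r):
            if fits(vram_gb, [models[n] for n in combo], headroom_gb):
                result.append(combo)
    return result


# Placeholder footprints in GB (weights + context/activations). NOT measurements.
example_models = {"chat_llm": 10.5, "image_gen": 9.0, "tts": 1.0}

if __name__ == "__main__":
    # 16.0 matches the RTX A4000 mentioned in the thread.
    for combo in coresident_sets(16.0, example_models):
        print(" + ".join(combo))
```

If the chat model and the image model never show up in the same combination, you are in load/unload territory on one card, which is exactly the situation where a second GPU starts to pay off.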
So I would say: yes, two cards are worth considering if you want both workloads available at the same time. The practical way to do this is to pin each app to its own GPU via CUDA_VISIBLE_DEVICES so they never see each other: LocalAI on CUDA_VISIBLE_DEVICES=0, ComfyUI on CUDA_VISIBLE_DEVICES=1, and HDMI output via your iGPU / CPU for desktop etc.
BTW, the cards don’t need to match either - a cheaper smaller card can handle the chat/TTS stack (or even CPU like we said above) while the bigger one handles image gen. If you are happy to manually switch between AI chat/TTS and ComfyUI, then two cards may not change much besides convenience.
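A minimal sketch of the CUDA_VISIBLE_DEVICES pinning mentioned above, assuming you start both apps from a small Python launcher. The helper name `env_for_gpu` and the two launch commands in the comments are hypothetical; substitute whatever actually starts LocalAI and ComfyUI on your box. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID is an optional extra that makes the CUDA numbering follow PCI bus order, which is also how nvidia-smi lists the cards.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Return a copy of the environment in which only one GPU is visible."""
    env = dict(os.environ if base_env is None else base_env)
    # Number GPUs by PCI bus position, so index 0/1 match nvidia-smi's listing.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    # The process only sees this card, and sees it as its own device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(command, gpu_index):
    """Start a command pinned to one GPU (command is a list of strings)."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))


# Hypothetical commands, adjust to your install:
# launch(["local-ai"], gpu_index=0)
# launch(["python", "main.py"], gpu_index=1)  # run from the ComfyUI directory
```

Note that inside each process the pinned card shows up as device 0, so app-level settings like "use GPU 1" should be left at their default. If the apps live in containers, the same idea applies but the GPU selection is usually done in the container runtime config rather than in a launcher like this.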
PS: Worth considering a three-way split too, if you want everything all at once / separated.
Small TTS models like Piper or Kokoro run fine on CPU, and for a radio context where you have even a few seconds of buffer, the latency is hidden. That frees up VRAM on your chat GPU.
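To put rough numbers on "the latency is hidden", here is a toy check. It assumes clips are synthesized one after another on the CPU and played back to back, and that you have measured a real-time factor (synthesis time divided by audio duration) for your TTS model. The function name and the example numbers are made up for illustration, and a real radio pipeline will be messier than this.

```python
def buffer_underruns(clip_seconds, rtf, buffer_seconds):
    """Return True if playback would ever have to wait for synthesis.

    clip_seconds:   durations of the spoken clips, in playback order
    rtf:            real-time factor (synthesis time / audio duration), assumed constant
    buffer_seconds: audio already queued (e.g. a song) before the first clip must play
    """
    synth_done = 0.0          # wall-clock time when the current clip is ready
    play_at = buffer_seconds  # wall-clock time when the current clip must start
    for duration in clip_seconds:
        synth_done += duration * rtf
        if synth_done > play_at:
            return True       # clip not ready when its turn comes: dead air
        play_at += duration
    return False


if __name__ == "__main__":
    # Made-up numbers: two 10 s clips, synthesis at half real time, 5 s of buffer.
    print(buffer_underruns([10, 10], rtf=0.5, buffer_seconds=5))
```

As long as the real-time factor is below 1 and the buffer covers the first clip's synthesis time, CPU TTS never stalls the stream, which is why it doesn't need to take up VRAM.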
Actually, that’s how I would leverage 1 GPU for everything / mix and match your CPU / GPU, but ICBW
PS: Might be worth chatting to your LLM (lol) about this too (or using a cloud one). These are not general AI questions and I might be wrong :)
Yes. Everything you say makes sense. TTS can and shall go on CPU, who cares.
Handling both general chat and ComfyUI is the biggest point, but to be honest I don’t need them both at the same time. The point is how easy it is to switch: since this is an unattended server, shutting down containers might not be the easiest approach. Especially if some other family member wants to use it.
I will switch screen output to the iGPU, it’s unused anyway, and see if I can run one card only, but I am struggling to run almost any model on LocalAI with llama.cpp. I need to spend more time digging into the issue.
Adding the second NVIDIA will require a new PSU. The mobo has the slot; even though it’s only a 16x slot downgraded to 4x, according to Claude it will work just fine for normal inference. But I will keep it as a last resort, I don’t particularly enjoy fiddling with that BIOS.
In any case I want both the LLM and Comfy to run on GPU and not offload to CPU.
On the container stuff - can’t help much. I’m a bare metal sorta guy. Ask BruceTheMoose from above, who also posts on !selfhosted. One thing I will tell you - shunting a GPU around between containers in Linux sounds like a pain in the ass. I would be tempted to keep ComfyUI and llama.cpp in one container so you don’t have to do the pass thru / rebinding bullshit.
Ask Claude for advice on LXC and CUDA pass thru here to reduce pain.
Re: second NVIDIA card - probably would, yes, unless you get something like a Quadro P1000…which wouldn’t do much for you.
Re: LocalAI. I’ve never used it so can’t comment. I either use OWUI (heavy but feature rich) or the webui that’s built into llama.cpp (light, fast, but somewhat cut down).
If you’re willing to use ComfyUI, then chat (while TTS stays elsewhere), then 1 GPU could probably do it. Try the Qwen 3.6 35B model I suggested - it should get you 25+ tok/s on that GPU (show Claude the YouTube video and tell it to pull the settings from the video description for you).