Some general questions on AI setup

Shimitar · edit-2 16 hours ago

SuspiciousCarrot78@aussie.zone · edit-2 4 hours ago

Depends. Two GPUs can make sense if you want the chat/radio stack and ComfyUI to be able to run independently without fighting over VRAM. But it depends on how big the underlying models are and whether they run at the same time.

Using 1 card tho, logically, there are 3 things that can happen -

If there’s enough headroom: everything coexists (if models are small enough or well orchestrated)
ComfyUI throws an OOM and either crashes or falls back to CPU offload (slow, but it won’t usually fail silently)
LocalAI typically won’t auto-evict a loaded model just because another process wants VRAM (it’ll just sit there blocking)

In other words, on a single card, you’re either manually managing load/unload cycles, eating CPU offload penalties on ComfyUI, or playing VRAM Tetris.
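A rough way to sanity-check the "enough headroom" case is to add up what each model needs and compare that with the card. This is only a sketch: the 24 GB card, the per-model sizes and the 1.5 GB reserve are made-up assumptions, not measurements of any real model.

```python
def fits_on_card(total_vram_gb, model_needs_gb, reserve_gb=1.5):
    """Return (fits, free_gb) for a set of models sharing one GPU.

    reserve_gb is a hypothetical allowance for CUDA context overhead
    and fragmentation; tune it to what you actually observe.
    """
    needed = sum(model_needs_gb.values()) + reserve_gb
    return needed <= total_vram_gb, total_vram_gb - needed


# Hypothetical sizes, purely for illustration
loads = {"chat_llm": 9.0, "music_model": 5.0, "image_gen": 12.0}
fits, free_gb = fits_on_card(24.0, loads)
print(f"fits={fits}, free={free_gb:.1f} GB")
```

A negative free value is the "VRAM Tetris" case: something has to be unloaded, shrunk, or moved to CPU or a second card.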

None of those are fun if you want both services available simultaneously…but that depends a lot on how big / greedy the models are. Do you want / need everything at the same time?

E.g. there are small, medium and large versions of ACE-Step (https://github.com/ace-step/ACE-Step-1.5), and there are small, medium and larger versions of image generators too.

So I would say: yes, two cards are worth considering if you want both workloads available at the same time. The practical way to do this is pin each app to its own GPU via CUDA_VISIBLE_DEVICES so they never see each other: LocalAI on CUDA_VISIBLE_DEVICES=0, ComfyUI on CUDA_VISIBLE_DEVICES=1 and HDMI output via your iGPU / CPU for desktop etc.
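Here is a minimal sketch of that pinning done from a small launcher script. The start commands are placeholders (hypothetical paths, not the real LocalAI or ComfyUI invocations), so swap in whatever you actually use to start each service.

```python
import os
import subprocess


def env_for_gpu(gpu_index, base_env=None):
    """Copy the environment and expose only one GPU to the child process."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(cmd, gpu_index):
    """Start cmd so that it can only see the given physical GPU."""
    return subprocess.Popen(cmd, env=env_for_gpu(gpu_index))


# Placeholder commands (assumptions): replace with your real start scripts.
# chat = launch(["/path/to/start-localai.sh"], gpu_index=0)
# comfy = launch(["/path/to/start-comfyui.sh"], gpu_index=1)
```

Note that inside each process the one visible card shows up as device 0, whichever physical index it was pinned to.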

BTW, the cards don’t need to match either - a cheaper smaller card can handle the chat/TTS stack (or even CPU like we said above) while the bigger one handles image gen. If you are happy to manually switch between AI chat/TTS and ComfyUI, then two cards may not change much besides convenience.

PS: Worth considering a three-way split too, if you want everything all at once / separated.

TTS offloaded to CPU,
chat/music stack on one GPU,
image gen on the other GPU.

Small TTS models like Piper or Kokoro run fine on CPU, and for a radio context where you have even a few seconds of buffer, the latency is hidden. That frees up VRAM on your chat GPU.
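The "latency is hidden" bit is just arithmetic: if the CPU can synthesize a clip in less time than your buffer covers, the listener never notices. A back-of-envelope sketch, assuming the whole clip is synthesized before playback starts. The real-time factor here is a made-up number, not a Piper or Kokoro benchmark.

```python
def tts_latency_hidden(audio_seconds, real_time_factor, buffer_seconds):
    """Return (hidden, synth_seconds) for one TTS clip.

    real_time_factor = synthesis time / audio duration (lower is faster).
    """
    synth_seconds = audio_seconds * real_time_factor
    return synth_seconds <= buffer_seconds, synth_seconds


# Hypothetical: a 12 s announcement, CPU TTS at 0.25x real time, 5 s buffer
hidden, synth = tts_latency_hidden(12.0, 0.25, 5.0)
print(f"hidden={hidden}, synthesis takes {synth:.1f} s")
```

Measure the real-time factor on your own CPU before relying on it.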

Actually, that’s how I would leverage 1 GPU for everything / mix and match your CPU / GPU but ICBW

PS: Might be worth chatting to your LLM (lol) about this too (or using a cloud one). These are not general AI questions and I might be wrong :)

Shimitar · 3 hours ago

Yes. Everything you say makes sense. TTS can and shall go on CPU, who cares.

Handling both general chat and ComfyUI is the biggest point, but to be honest I don’t need them both at the same time. The point is how easy it is to switch: since this is an unattended server, shutting down containers might not be the easiest approach, especially if some other family member wants to use it.

I will switch the screen output to the iGPU, it’s unused anyway, and see if I can run on one card only. But I am struggling to run almost any model on LocalAI with llama.cpp, so I need to spend more time digging into the issue.

Adding the second Nvidia card will require a new PSU. The mobo has the slot, even if it’s only a 16x slot downgraded to 4x; according to Claude it will work just fine for normal inference. But I will keep it as a last resort, I don’t particularly enjoy fiddling with that BIOS.

In any case I want both the LLM and ComfyUI to run on GPU and not offload to CPU.

SuspiciousCarrot78@aussie.zone · 3 hours ago

On the container stuff - can’t help much. I’m a bare metal sorta guy. Ask BruceTheMoose from above, who also posts on !selfhosted. One thing I will tell you - shunting around a GPU between containers in Linux sounds like a pain in the ass. I would be tempted to keep ComfyUI and llama.cpp in one container so you don’t have to do pass thru / rebinding bullshit.

Ask Claude for advice on LXC and CUDA pass thru here to reduce pain.

Re: second NVIDIA card - yes, it probably would need a new PSU, unless you get something like a Quadro P1000…which wouldn’t do much for you.

Re: LocalAI. I’ve never used it so can’t comment. I either use OWUI (heavy but feature rich) or the webui that’s inbuilt with llama.cpp (light, fast, but somewhat cut down).

If you’re willing to run ComfyUI, then switch to chat (while TTS stays elsewhere), then 1 GPU could probably do it. Try the Qwen 3.6 35B model I suggested - it should get you 25+ tok/s on that GPU (show Claude the YouTube video and tell it to pull the settings from the video description for you).