Some general questions on AI setup

Shimitar · edited 16 hours ago

SuspiciousCarrot78@aussie.zone · edited 6 hours ago

Understood, so my iGPU is too old, and I am better off with my NVIDIA only.

Sadly, yes

Also, putting two might be just a waste

It…depends. What are you planning on doing with em?

I doubt my PSU can handle them even if I had the proper PCIe slot.

Technically, PSU and PCIe slots can be shared or upgraded…which goes back to “it depends”. But if your gut is already telling you nah…

So, using llama.cpp, if I use a model bigger than my VRAM, it will offload to CPU?

Yep. Technically, the other option is better: tell it to auto-adjust, i.e. put as much as it can on the GPU and then spill over onto CPU / RAM. The setting for that should be something like -ngl auto (in the latest versions of llama.cpp).
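To make the spill-over idea concrete, here is a rough sketch of the arithmetic behind picking a manual -ngl value. It assumes every layer is the same size and reserves a guessed 2 GiB for KV cache and compute buffers. The function name and all the numbers are made up for illustration, and llama.cpp's own fitting logic will be smarter than this.

```python
def layers_that_fit(model_gib: float, n_layers: int, vram_gib: float,
                    reserve_gib: float = 2.0) -> int:
    """Estimate how many layers fit in VRAM.

    Assumptions (illustrative only): layers are equal-sized, and
    reserve_gib covers KV cache, compute buffers and the desktop.
    """
    per_layer_gib = model_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gib // per_layer_gib))


# Hypothetical 20 GiB model with 40 layers on a 16 GiB card.
print(layers_that_fit(model_gib=20.0, n_layers=40, vram_gib=16.0))
```

Whatever number comes out is only a starting point for -ngl N; the layers that do not fit stay on CPU / RAM.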

The NVIDIA RTX A4000 is nothing to sneeze at, 16GB or not - it’s just…old, so support for it is middling. You should definitely try one of the MoE models on that. See if this helps:

https://www.youtube.com/watch?v=8F_5pdcD3HY

The settings he uses are in his first comment.

If I had to pull a number out of my butt, you should be able to triple his throughput (given that the A4000 has almost 3x the bandwidth, actual tensor cores etc).
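The back-of-envelope reasoning there is that token generation is mostly limited by memory bandwidth, so throughput scales roughly with it. A minimal sketch of that scaling, where the 448 GB/s figure is the A4000's spec-sheet bandwidth and the baseline numbers are placeholders (not taken from the video):

```python
A4000_BANDWIDTH_GBS = 448.0  # RTX A4000 spec-sheet memory bandwidth


def scaled_tokens_per_second(baseline_tps: float, baseline_bw_gbs: float,
                             target_bw_gbs: float = A4000_BANDWIDTH_GBS) -> float:
    """Scale a measured tok/s figure by the ratio of memory bandwidths.

    Assumes generation is purely bandwidth-bound, which ignores compute,
    CPU offload and software overhead, so treat it as an optimistic guess.
    """
    return baseline_tps * target_bw_gbs / baseline_bw_gbs


# Placeholder baseline: 10 tok/s on a hypothetical 160 GB/s card.
print(scaled_tokens_per_second(baseline_tps=10.0, baseline_bw_gbs=160.0))
```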

Shimitar · 5 hours ago

I am trying to understand if it makes sense to run two cards.

I am using AI for chat and to support my subwave radio station (AI DJ + music tagging and text to speech; small models are fine), so I am going to test a qwen3.5 9b derived model to keep RAM free for running a text2speech model as well.

The second use for AI is image generation. I am using ComfyUI, so it’s not clear to me how that plays with the other models… Will they be unloaded when I use Comfy? Or will Comfy just fail or offload to CPU?

So, if having two cards, one pinned to LocalAI and one to ComfyUI, is a good choice, I can look into the hardware doubts. If nothing would practically change, no …

SuspiciousCarrot78@aussie.zone · edited 4 hours ago

Depends. Two GPUs can make sense if you want the chat/radio stack and ComfyUI to be able to run independently without fighting over VRAM. But it depends on how big the underlying models are and whether they run at the same time.

Using 1 card tho, logically, there are 3 things that can happen -

If there’s enough headroom: everything coexists (if models are small enough or well orchestrated)
ComfyUI throws an OOM and either crashes or falls back to CPU offload (slow, but it won’t usually fail silently)
LocalAI typically won’t auto-evict a loaded model just because another process wants VRAM (it’ll just sit there blocking)

In other words, on a single card, you’re either manually managing load/unload cycles, eating CPU offload penalties on ComfyUI, or playing VRAM Tetris.
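If you do end up playing VRAM Tetris on one card, it helps to check headroom before loading the next thing. A minimal sketch that reads usage from nvidia-smi (the query flags are standard ones); the function names and the 512 MiB safety margin are my own assumptions:

```python
import subprocess

QUERY = ["nvidia-smi", "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"]


def parse_vram(csv_text: str) -> dict[int, tuple[int, int]]:
    """Map GPU index -> (used MiB, total MiB) from nvidia-smi CSV output."""
    gpus = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field) for field in line.split(","))
        gpus[index] = (used, total)
    return gpus


def fits(needed_mib: int, used_mib: int, total_mib: int,
         margin_mib: int = 512) -> bool:
    """True if needed_mib fits in free VRAM with a safety margin left over."""
    return total_mib - used_mib - margin_mib >= needed_mib


def read_vram() -> dict[int, tuple[int, int]]:
    """Run nvidia-smi and return current usage per GPU (needs the NVIDIA driver)."""
    out = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_vram(out)
```

For example, `fits(8000, *read_vram()[0])` would say whether a model needing roughly 8000 MiB can still be loaded on GPU 0. The model size has to be your own estimate.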

None of those are fun if you want both services available simultaneously…but that depends a lot on how big / greedy the models are. Do you want / need everything at the same time?

E.g. there are small, medium and large versions of https://github.com/ace-step/ACE-Step-1.5 and likewise small, medium and large versions of image generators.

So I would say: yes, two cards are worth considering if you want both workloads available at the same time. The practical way to do this is to pin each app to its own GPU via CUDA_VISIBLE_DEVICES so they never see each other: LocalAI on CUDA_VISIBLE_DEVICES=0, ComfyUI on CUDA_VISIBLE_DEVICES=1 and HDMI output via your iGPU / CPU for desktop etc.
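A minimal sketch of that pinning for bare-metal processes, assuming you start each service from a small launcher script. The helper names are mine and the commands in the usage comment are placeholders for however you actually start LocalAI and ComfyUI. Setting CUDA_DEVICE_ORDER=PCI_BUS_ID makes the indices line up with what nvidia-smi shows, which matters when the cards don't match.

```python
import os
import subprocess


def env_for_gpu(gpu_index: int) -> dict[str, str]:
    """Copy the current environment and expose only one GPU to the child."""
    env = os.environ.copy()
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"  # same numbering as nvidia-smi
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def launch(command: list[str], gpu_index: int) -> subprocess.Popen:
    """Start a service that can only see the given GPU."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))


# Usage (placeholder commands, adjust to your install):
# chat = launch(["local-ai", "run"], gpu_index=0)
# comfy = launch(["python", "main.py"], gpu_index=1)
```

Inside each process the visible card shows up as device 0, so neither app needs any extra GPU-selection config.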

BTW, the cards don’t need to match either - a cheaper smaller card can handle the chat/TTS stack (or even CPU like we said above) while the bigger one handles image gen. If you are happy to manually switch between AI chat/TTS and ComfyUI, then two cards may not change much besides convenience.

PS: Worth considering a three-way split too, if you want everything all at once / separated.

TTS offloaded to CPU,
chat/music stack on one GPU,
image gen on the other GPU.

Small TTS models like Piper or Kokoro run fine on CPU, and for a radio context where you have even a few seconds of buffer, the latency is hidden. That frees up VRAM on your chat GPU.
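The "latency is hidden" claim is just arithmetic on the real-time factor (synthesis time divided by audio length). A small sketch with made-up numbers; measure the real-time factor of your own TTS model on your own CPU before trusting any of it:

```python
def synthesis_seconds(clip_seconds: float, real_time_factor: float) -> float:
    """Time needed to synthesize a clip, given RTF = synth time / audio time."""
    return clip_seconds * real_time_factor


def hidden_by_buffer(clip_seconds: float, real_time_factor: float,
                     buffer_seconds: float) -> bool:
    """True if the clip is ready before the buffer (lead time) runs out."""
    return synthesis_seconds(clip_seconds, real_time_factor) <= buffer_seconds


# Hypothetical: a 20 s DJ link, CPU TTS at RTF 0.25, 6 s of lead time.
print(hidden_by_buffer(clip_seconds=20.0, real_time_factor=0.25, buffer_seconds=6.0))
```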

Actually, that’s how I would leverage 1 GPU for everything: mix and match your CPU / GPU. But ICBW.

PS: Might be worth chatting to your LLM (lol) about this too (or using a cloud one). These are not general AI questions and I might be wrong :)

Shimitar · 3 hours ago

Yes. All you say makes sense. TTS can and shall go on CPU, who cares.

Handling both general chat and ComfyUI is the biggest point, but to be honest I don’t need them both at the same time. The point is how easy it is to switch: since this is an unattended server, shutting down containers might not be the easiest approach, especially if some other family member wants to use it.

I will switch the screen output to the iGPU, it’s unused anyway, and see if I can run on one card only. But I am struggling to run almost any model on LocalAI with llama.cpp; I need to spend more time digging into the issue.

Adding the second NVIDIA card will require a new PSU. The mobo has the slot, even if it’s only a 16x slot downgraded to 4x; according to Claude it will work just fine for normal inference. But I will keep it as a last resort, I don’t particularly enjoy fiddling with that BIOS.
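On the x16-to-x4 point, a rough way to see why it usually matters little: when the whole model sits in VRAM, the link is mainly used to load the weights once. A sketch of the theoretical load time, using approximate usable per-lane rates; the PCIe generation of the slot and the 16 GB model size are assumptions here:

```python
# Approximate usable bandwidth per PCIe lane, in GB/s, by generation.
GBPS_PER_LANE = {3: 0.985, 4: 1.969, 5: 3.938}


def load_seconds(model_gb: float, gen: int, lanes: int) -> float:
    """Best-case time to push model_gb over the link (ignores disk speed)."""
    return model_gb / (GBPS_PER_LANE[gen] * lanes)


# Hypothetical 16 GB of weights over a Gen3 slot at x4 versus x16.
print(round(load_seconds(16.0, gen=3, lanes=4), 1), "s at x4")
print(round(load_seconds(16.0, gen=3, lanes=16), 1), "s at x16")
```

These are best-case figures and the drive is often the slower part anyway. The picture changes if layers are split between GPU and CPU, since more traffic then crosses the bus on every token.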

In any case I want both the LLM and Comfy to run on GPU and not offload to CPU.

SuspiciousCarrot78@aussie.zone · 3 hours ago

On the container stuff - can’t help much. I’m a bare metal sorta guy. Ask BruceTheMoose from above, who also posts on !selfhosted. One thing I will tell you - shunting a GPU around between containers in Linux sounds like a pain in the ass. I would be tempted to keep ComfyUI and llama.cpp in one container so you don’t have to do pass thru / rebinding bullshit.

Ask Claude for advice on LXC and CUDA pass thru here to reduce pain.

Re: second NVIDIA card - it probably would need a new PSU, yes, unless you get something like a Quadro P1000…which wouldn’t do much for you.

Re: LocalAI. I’ve never used it so can’t comment. I either use OWUI (heavy but feature rich) or the webui that’s inbuilt with llama.cpp (light, fast, but somewhat cut down).

If you’re willing to use ComfyUI first and then chat, one after the other (while TTS stays elsewhere), then 1 GPU could probably do it. Try the Qwen 3.6 35B model I suggested - it should get you 25+ tok/s on that GPU (show Claude the YouTube video and tell it to pull the settings from the video description for you).
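For the "one after the other" approach on an unattended box, one option that avoids stopping containers: as far as I know, recent ComfyUI builds expose a POST /free route that asks it to unload models and release memory. A minimal sketch assuming that route exists in your version and that ComfyUI listens on its default 127.0.0.1:8188; the function names are mine.

```python
import json
import urllib.request


def build_free_request(base_url: str = "http://127.0.0.1:8188") -> urllib.request.Request:
    """Build the POST /free request (route assumed present in your ComfyUI build)."""
    payload = json.dumps({"unload_models": True, "free_memory": True}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/free",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def free_comfyui_vram(base_url: str = "http://127.0.0.1:8188",
                      timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(build_free_request(base_url), timeout=timeout) as resp:
        return resp.status
```

The unload may not be instant, so check with nvidia-smi that VRAM actually dropped before loading the LLM. The LLM side would need its own unload step, which depends on the server you use.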