Hi all!

I have written a couple of posts in the past. I am an illiterate having fun with LLMs and AI in general, and I am being pulled deeper into the hole by the day…

I have extensive experience with Linux (Gentoo lover for 20 years here). I am a software dev now “promoted” to management and an avid tech user, so not really illiterate, but I know very little about this whole LLM game.

I started with OpenWebUI + Ollama and played like an idiot with random models. Then I came across an NVIDIA RTX A4000 (16 GB GDDR6) and plugged it into my i7-8700 server with 64 GB of RAM. The server also has an Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630], unused at this time (the server is 100% headless anyway).

I am currently installing LocalAI to run llama.cpp and improve my models’ capability and speed. I plan to ditch OpenWebUI and Ollama if LocalAI + llama.cpp works fine.

My first usage was chatting with random local models. Then I discovered Fooocus and quickly upgraded to ComfyUI. Lastly, I set up my SubWave radio station and I am having so much fun…

I have a few questions:

  1. Can I leverage both my NVIDIA card and the iGPU at the same time?
  2. If I use the iGPU, do I need to allocate a fixed amount of RAM to it in the BIOS? Or will it use system RAM as needed?
  3. Using llama.cpp, I also want to leverage the CPU, since I have 64 GB of RAM (also shared by many other self-hosted services, though). Is there anything special I need to do to achieve that?
  4. What set of models do you guys recommend for my setup? I am currently using qwen2.5-coder:14b-instruct-q5_K_M with Ollama, and I am pretty satisfied with its coding capabilities, but I want something more general purpose for my SubWave (AI-assisted web radio channel).
  5. I might have the opportunity to install a second RTX A4000, identical to the first, in my server (I need to check PCIe slot availability and power supply specs). Would that make any sense at all?
  6. Power consumption wise, do NVIDIA cards also suck power when not in active use?
  • SuspiciousCarrot78@aussie.zone
    6 hours ago

    Understood, so my iGPU is too old, and i am better off with my NVIDIA only.

    Sadly, yes

    Also, putting two might be just a waste

    It…depends. What are you planning on doing with em?

    i doubt my PSU can handle them even if i had the proper PCIe slot.

    Technically, PSU and PCIe slots can be shared or upgraded…which goes back to “it depends”. But if your gut is already telling you nah…

    So, using llama.cpp if i use a model bigger than my VRAM will offload to CPU?

    Yep. Technically, the better option is to let it auto-adjust (put as much as it can on the GPU, then spill over onto CPU / RAM). The setting for that should be something like -ngl auto (in the latest versions of llama.cpp).
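
    To make the "fill the GPU, spill the rest to CPU" idea concrete, here is a rough sketch of the arithmetic. It is not how llama.cpp decides internally: it assumes all layers are the same size and reserves a fixed chunk of VRAM for context and buffers. The function name, the 2 GB reserve and the example numbers are all made up for illustration.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    vram_gb: float, reserve_gb: float = 2.0) -> int:
    """Estimate how many model layers fit in VRAM.

    Assumptions (hypothetical, for illustration only):
    - every layer takes the same share of the model file size
    - reserve_gb is set aside for KV cache, context and driver overhead
    """
    if n_layers <= 0 or model_size_gb <= 0:
        raise ValueError("model_size_gb and n_layers must be positive")
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(vram_gb - reserve_gb, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gb // per_layer_gb))


# Example: a 20 GB model with 40 layers on a 16 GB card.
gpu_layers = layers_that_fit(20.0, 40, 16.0)
print(f"layers on GPU: {gpu_layers}, layers left on CPU: {40 - gpu_layers}")
```

    The resulting number is the kind of value you would pass by hand with -ngl (--n-gpu-layers) if the automatic setting is not available in your build; treat it as a starting point and adjust until the model loads without running out of memory.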

    The NVIDIA RTX A4000 is nothing to sneeze at, 16GB or not - it’s just…old, so support for it is middling. You should definitely try one of the MoE models on that. See if this helps:

    https://www.youtube.com/watch?v=8F_5pdcD3HY

    The settings he uses are in his first comment.

    If I had to pull a number out of my butt, you should be able to triple his throughput (given that the A4000 has almost 3x the bandwidth, has actual tensor cores etc)

    • ShimitarOPA
      5 hours ago

      I am trying to understand if it makes sense to run two cards.

      I am using AI for chat and to support my SubWave radio station (AI DJ + music tagging and text to speech, small models are fine), so I am going to test a qwen3.5 9b derived model to keep some RAM free to run a text-to-speech model as well.

      The second usage for ai is image generation, I am using ComfyUI so it’s not clear to me how that plays with the other models… Will they be unloaded when I use comfy? Or will comfy just fail or offload to CPU?

      So, if having two cards pinned one to LocalAI and one to ComfyUI is a good choice, I can look into the hardware doubts. If nothing would practically change, no …

      • SuspiciousCarrot78@aussie.zone
        4 hours ago

        Depends. Two GPUs can make sense if you want the chat/radio stack and ComfyUI to be able to run independently without fighting over VRAM. But it depends on how big the underlying models are and whether they run at the same time.

        Using 1 card tho, logically, there are 3 things that can happen -

        • If there’s enough headroom: everything coexists (if models are small enough or well orchestrated)
        • ComfyUI throws an OOM and either crashes or falls back to CPU offload (slow, but it won’t usually fail silently)
        • LocalAI typically won’t auto-evict a loaded model just because another process wants VRAM (it’ll just sit there blocking)

        In other words, on a single card, you’re either manually managing load/unload cycles, eating CPU offload penalties on ComfyUI, or playing VRAM Tetris.
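
        If you want to see how much headroom is actually left before loading something else, a small sketch like this can read it from nvidia-smi. The query flags are standard nvidia-smi options; the function names and the 6000 MiB figure in the usage note are just examples I made up.

```python
import subprocess


def parse_gpu_memory(csv_text: str) -> dict[int, tuple[int, int]]:
    """Parse 'index, used, total' lines (MiB) into {index: (used, total)}."""
    result = {}
    for line in csv_text.strip().splitlines():
        index, used, total = (int(field.strip()) for field in line.split(","))
        result[index] = (used, total)
    return result


def query_gpu_memory() -> dict[int, tuple[int, int]]:
    """Ask nvidia-smi for used/total VRAM per GPU (needs the NVIDIA driver)."""
    output = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=index,memory.used,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_gpu_memory(output)


def has_headroom(used_mib: int, total_mib: int, needed_mib: int) -> bool:
    """True if the requested amount still fits in the free VRAM."""
    return total_mib - used_mib >= needed_mib
```

        On the server you would call query_gpu_memory() and then something like has_headroom(used, total, 6000) before starting ComfyUI. It only tells you what is free right now, so it does not protect you from another process grabbing the memory a second later.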

        None of those are fun if you want both services available simultaneously…but that depends a lot on how big / greedy the models are. Do you want / need everything at the same time?

        Eg: there are small, medium and large versions of this, https://github.com/ace-step/ACE-Step-1.5 and there are small, medium and larger versions of image generators.

        So I would say: yes, two cards are worth considering if you want both workloads available at the same time. The practical way to do this is pin each app to its own GPU via CUDA_VISIBLE_DEVICES so they never see each other: LocalAI on CUDA_VISIBLE_DEVICES=0, ComfyUI on CUDA_VISIBLE_DEVICES=1 and HDMI output via your iGPU / CPU for desktop etc.
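
        A minimal sketch of the pinning, assuming bare-metal processes started from Python (containers have their own GPU selection mechanisms, which this does not cover). The helper names and the two launch commands in the usage note are placeholders, not the real start commands for LocalAI or ComfyUI.

```python
import os
import subprocess
from typing import Optional


def env_for_gpu(gpu_index: Optional[int],
                base_env: Optional[dict] = None) -> dict:
    """Return a copy of the environment with CUDA_VISIBLE_DEVICES set.

    gpu_index=None sets it to an empty string, which hides all CUDA
    devices from the process (useful for CPU-only services like TTS).
    """
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = "" if gpu_index is None else str(gpu_index)
    return env


def launch_on_gpu(command: list[str],
                  gpu_index: Optional[int]) -> subprocess.Popen:
    """Start a process that can only see the given GPU."""
    return subprocess.Popen(command, env=env_for_gpu(gpu_index))
```

        Usage would look like launch_on_gpu(["your-localai-start-command"], 0) and launch_on_gpu(["your-comfyui-start-command"], 1). Inside each process the one visible card shows up as device 0, so neither app needs to know the other card exists.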

        BTW, the cards don’t need to match either - a cheaper smaller card can handle the chat/TTS stack (or even CPU like we said above) while the bigger one handles image gen. If you are happy to manually switch between AI chat/TTS and ComfyUI, then two cards may not change much besides convenience.

        PS: Worth considering a three-way split too, if you want everything all at once / separated.

        • TTS offloaded to CPU,
        • chat/music stack on one GPU,
        • image gen on the other GPU.

        Small TTS models like Piper or Kokoro run fine on CPU, and for a radio context where you have even a few seconds of buffer, the latency is hidden. That frees up VRAM on your chat GPU.

        Actually, that’s how I would leverage 1 GPU for everything / mix and match your CPU / GPU but ICBW

        PS: Might be worth chatting to your LLM (lol) about this too (or using a cloud one). These are not general AI questions and I might be wrong :)

        • ShimitarOPA
          3 hours ago

          Yes. All you say makes sense. TTS can and shall go on CPU, who cares.

          The biggest point is handling both general chat and ComfyUI, but to be honest I don’t need them both at the same time. The point is how easy it is to switch: since this is an unattended server, shutting down containers might not be the easiest approach, especially if some other family member wants to use it.

          I will switch the screen output to the iGPU, it’s unused anyway, and see if I can run with one card only. But I am struggling to run almost any model on LocalAI with llama.cpp, so I need to spend more time digging into the issue.

          Adding the second NVIDIA card will require a new PSU. The mobo has the slot, even if it’s only a 16x downgraded to 4x; according to Claude it will work just fine for normal inference. But I will keep it as a last resort, I don’t particularly enjoy fiddling with that BIOS.

          In any case I want both the LLM and Comfy to run on GPU and not offload to CPU.

          • SuspiciousCarrot78@aussie.zone
            3 hours ago

            On the container stuff - can’t help much. I’m a bare metal sorta guy. Ask BruceTheMoose from above, who also posts on !selfhosted. One thing I will tell you - shunting a GPU around between containers in Linux sounds like a pain in the ass. I would be tempted to keep ComfyUI and llama.cpp in one container so you don’t have to do pass thru / rebinding bullshit.

            Ask Claude for advice on LXC and CUDA pass thru here to reduce pain.

            Re: second NVIDIA card - probably would yes, unless you get something like a Quadro P1000…which wouldn’t do much for you.

            Re: LocalAI. I’ve never used it so can’t comment. I either use OWUI (heavy but feature rich) or the webui that’s inbuilt with llama.cpp (light, fast, but somewhat cut down).

            If you’re willing to switch between ComfyUI and chat (while TTS stays elsewhere) then 1 GPU could probably do it. Try the Qwen 3.6 35B model I suggested - it should get you 25+ tok/s on that GPU (show Claude the YouTube video and tell it to pull the settings from the video description for you).