Hi all!
I have written a couple of posts in the past. I am an illiterate having fun with LLMs and AI in general, and I am being pulled deeper into the hole by the day…
I have extensive experience with Linux (Gentoo lover for 20 years here). I am a software dev now “promoted” to management and an avid tech user, so not really illiterate, but I know very little about this whole LLM game.
I started with OpenWebUI + Ollama and played around like an idiot with random models. Then I came across an NVIDIA RTX A4000 (16 GB GDDR6) and plugged it into my i7-8700 server with 64 GB RAM. The server also has an Intel Corporation CoffeeLake-S GT2 [UHD Graphics 630], unused at this time (the server is 100% headless anyway).
I am currently installing LocalAI to run llama.cpp and improve my models' capability and speed. I plan to ditch OpenWebUI and Ollama if LocalAI + llama.cpp works fine.
My first use was chatting with random local models. Then I discovered Fooocus and quickly upgraded to ComfyUI. Lastly, I set up my SubWave radio station, and I am having so much fun…
I have a few questions:
- Can I leverage both my NVIDIA card and the iGPU at the same time?
- If I use the iGPU, do I need to allocate a fixed amount of RAM to it in the BIOS? Or will it use system RAM as needed?
- With llama.cpp I also want to use the CPU, since I have 64 GB of RAM (shared with many other self-hosted services, though). Is there anything special I need to do to achieve that?
- What set of models do you guys recommend for my setup? I am currently using qwen2.5-coder:14b-instruct-q5_K_M with Ollama, and I am pretty satisfied with its coding capabilities, but I want something more general purpose for my SubWave (AI-assisted web radio channel).
- I might have the opportunity to install a second RTX A4000, identical to the first, in my server (I need to check PCIe slot availability and power supply specs). Would that make any sense at all?
- Power consumption wise, do the NVIDIA cards also suck power when not in active use?
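
For the CPU + GPU question, llama.cpp can keep part of a model on the GPU and run the remaining layers on the CPU from system RAM, controlled by its `--n-gpu-layers` option. A rough way to pick a starting value is sketched below. The function name, the equal-size-layers assumption, and the 2 GiB reserve for KV cache and buffers are all illustrative guesses, not anything llama.cpp computes for you, and the example numbers are made up.

```python
def layers_that_fit(model_size_gib, n_layers, vram_gib, reserve_gib=2.0):
    """Estimate how many model layers fit in VRAM.

    Assumptions (hypothetical, for illustration only):
    - every layer takes roughly model_size_gib / n_layers of memory;
    - reserve_gib is kept free for the KV cache and runtime buffers.
    """
    if model_size_gib <= 0 or n_layers <= 0:
        raise ValueError("model_size_gib and n_layers must be positive")
    per_layer_gib = model_size_gib / n_layers
    usable_gib = max(vram_gib - reserve_gib, 0.0)
    # Never report more layers than the model actually has.
    return min(n_layers, int(usable_gib / per_layer_gib))


# Made-up example: a 40 GiB model with 80 layers on a 16 GiB card.
print(layers_that_fit(40.0, 80, 16.0))
```

The result is only a first guess for `--n-gpu-layers`; the real limit depends on context size and quantization, so it is worth lowering the value if the model fails to load.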


Actually one addendum/correction:
You can use your integrated graphics to render the desktop. Plug your monitor into the UHD graphics instead of your Nvidia card.
This saves a notable amount of VRAM, which you can use to fit more context, and it speeds up inference a bit too. It also lets you configure models closer to your VRAM limit, as you no longer have the variability of apps randomly taking it.
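
To see how much VRAM is actually taken (and what the card draws at idle), `nvidia-smi` can print the numbers as CSV. Below is a minimal sketch that parses that output. The helper names are made up for illustration, and it assumes the driver reports `memory.used`, `memory.total`, and `power.draw` for your card; some setups return a non-numeric placeholder for power, which is mapped to `None` here.

```python
import subprocess

# Fields requested from nvidia-smi, in this order.
QUERY = "index,memory.used,memory.total,power.draw"


def parse_gpu_stats(csv_text):
    """Parse `nvidia-smi --format=csv,noheader,nounits` output into dicts."""
    stats = []
    for line in csv_text.strip().splitlines():
        fields = [f.strip() for f in line.split(",")]
        if len(fields) != 4:
            continue  # skip anything that is not a data row
        index, used, total, power = fields
        try:
            power_w = float(power)
        except ValueError:
            power_w = None  # placeholder such as "[N/A]"
        stats.append({
            "index": int(index),
            "used_mib": int(used),
            "total_mib": int(total),
            "free_mib": int(total) - int(used),
            "power_w": power_w,
        })
    return stats


def read_gpu_stats():
    """Run nvidia-smi and return one dict per GPU (needs the NVIDIA driver)."""
    result = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_stats(result.stdout)
```

Calling `read_gpu_stats()` on the server with no model loaded should show the baseline VRAM use and idle power draw, which can then be compared with the numbers after moving the display output to the iGPU.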
Yes, but that is a headless server anyway. So I will indeed switch to using the iGPU as primary video sooner or later (I need to physically go to the server; I have a network KVM, but that would be pretty useless after I get into the BIOS and change the video output).
Ooh… hang on. Doesn’t a headless Linux server require a dummy HDMI plug if you have an iGPU + GPU? You might need to confirm that.
The server is plugged into a network KVM, so there is an actual output. And I can confirm it was working just fine even without anything plugged in. The network KVM is just practical.