Do you host your own AI?

SuspiciousCarrot78@aussie.zone · 8 hours ago

Do you host your own AI?

atzanteol@sh.itjust.works · 7 hours ago

I’ve tried a few times, but with only 8 GB of VRAM it’s simply not worth it.

brucethemoose@lemmy.world · 4 hours ago

How much CPU RAM do you have?

atzanteol@sh.itjust.works · 3 hours ago

64 GB. But CPU inference is painfully slow.

brucethemoose@lemmy.world · edit-2 3 hours ago

Not anymore. Not with hybrid offloading, where the GPU handles the dense tensors and the CPU only runs the sparse MoE experts. I’m running a 300B model on a single 3090, and it’s faster than I can read.

You just need to use the right framework, and the right model.

I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

Also look at speculative decoding like DFlash or MTP (which you can also get specific models for).

EDIT: Wrong link.

atzanteol@sh.itjust.works · 1 hour ago

I’ll check that out - speed isn’t my biggest issue so much as coding performance… The qwen 3.5 model I was using can write code, but it’s… Meh? Like sometimes it doesn’t even compile.

I did try tweaking llama.cpp to do some CPU offloading, and it does seem to allow much larger contexts at a modest performance loss. I’ll check out larger models.

brucethemoose@lemmy.world · edit-2 37 minutes ago

CPU offloading is too slow unless you run a MoE model and offload it with the --n-cpu-moe parameter specifically.

This only offloads the “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-light to run. In practice, that’s most of the size of modern MoE LLMs.
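
For a rough idea of what number to pass, here is a back-of-the-envelope sketch in Python. It assumes --n-cpu-moe N keeps the expert weights of the first N layers in CPU RAM, as in llama.cpp. The function name and every size in the example call are made-up placeholders, not measurements of any real model, so plug in the numbers for your own GGUF file.

```python
import math


def layers_to_offload(n_layers, expert_gb_per_layer, dense_gb_total,
                      vram_gb, reserve_gb):
    """Estimate how many layers' expert weights must stay in CPU RAM.

    Hypothetical helper: all sizes are in GB and are rough guesses.
    reserve_gb is headroom for the KV cache and compute buffers.
    """
    # VRAM left over for experts once the dense tensors and headroom are placed.
    budget = vram_gb - reserve_gb - dense_gb_total
    if budget < 0:
        raise ValueError("dense weights alone do not fit in the VRAM budget")
    # Number of layers whose experts still fit on the GPU.
    fit_on_gpu = min(n_layers, math.floor(budget / expert_gb_per_layer))
    return n_layers - fit_on_gpu


# Placeholder numbers for an 8 GB card; replace with your model's real sizes.
n = layers_to_offload(n_layers=40, expert_gb_per_layer=0.5,
                      dense_gb_total=2.0, vram_gb=8.0, reserve_gb=2.0)
print(f"--n-cpu-moe {n}")
```

Treat the result as a starting point: if you run out of VRAM, raise the number, and if you have VRAM to spare, lower it until it stops fitting.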

Franconian_Nomad@feddit.org · 7 hours ago

Have you tried qwen3.5-9b? It’s pretty solid for its size.

atzanteol@sh.itjust.works · 5 hours ago

Yeah, it’s “good for its size” but it’s just too flaky for me to use for any significant coding.

Franconian_Nomad@feddit.org · 3 hours ago

Yeah, I wouldn’t use it for coding. It’s a bit dumb unfortunately.