I’ll check that out - speed isn’t my biggest issue so much as coding performance… The qwen 3.5 model I was using can write code, but it’s… Meh? Like sometimes it doesn’t even compile.
I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I’ll check out larger models.
CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.
This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.
I’ll check that out - speed isn’t my biggest issue so much as coding performance… The qwen 3.5 model I was using can write code, but it’s… Meh? Like sometimes it doesn’t even compile.
I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I’ll check out larger models.
CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.
This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.