Well, I run my own OpenWebUI with Ollama, installed with docker compose and running locally on my home server with an NVIDIA GPU, and I am pretty happy with the overall result.
I have only installed local open-source models like gpt-oss, deepseek-r1, llama (3.2 and 4), qwen3…
My use case is mostly asking questions about documentation while developing (details on programming language syntax and such).
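For context, my compose setup looks roughly like this (a simplified sketch; the exact image tags, ports, and volume paths in my real file differ):

```yaml
services:
  ollama:
    image: ollama/ollama
    volumes:
      - ollama:/root/.ollama          # model storage
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia          # expose the NVIDIA GPU to the container
              count: all
              capabilities: [gpu]
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    ports:
      - "3000:8080"                   # web interface on port 3000
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    volumes:
      - open-webui:/app/backend/data  # chats, settings, users
    depends_on:
      - ollama

volumes:
  ollama:
  open-webui:
```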
I have been running it for months now, and it occurred to me that it would be useful for the following tasks as well:
- audio transcribing (voice messages to text)
- image generation (logos, small art for my games and such)
I fiddled around a bit, but got nowhere.
How do you do that from the OpenWebUI web interface?
(I have never used Ollama directly, only through the OpenWebUI GUI.)


Thank you for the detailed post!
OK, I need you to ELI5 what you wrote, because I am not an LLM expert and… got lost.
I have OWUI, which provides the web interface. Then I have Ollama, which runs the models, and I have added models there.
I searched for llama.cpp, but I am unclear on what makes it different from Ollama, and whether I can install models there.
Can you help me shed some light on this?
Also, about models: I have an NVIDIA GPU with 16 GB of VRAM that works fine with the models I have installed. What is the correlation here?