cross-posted from: https://sh.itjust.works/post/61139432

I seriously can’t believe how much progress he’s made for the FOSS community. He actually might take a bite out of the big 3’s profits with this.

  • onlinepersona@programming.dev
    16 hours ago

    Thank you for that writeup.

    Do you know how important the parameter size is? 12b, 24b, 128b, etc. Does it really improve performance or is it like megapixels in a camera: more megapixels don’t necessarily mean a better picture?

    And what’s “quantisation”? Context compression or something?

    I’ve been considering buying a better card to test models (I also want to be personally sovereign), but NVIDIA on Linux gives me the jeebies and, last I checked, AMD hasn’t released anything with more than 20GB in a while. In fact, figuring out hardware requirements has been tough, and I’m considering just riding this whole thing out. Maybe the bubble will collapse and bring prices down to something reasonable.

    • cecilkorik@piefed.ca
      6 hours ago

      I’m not an expert by any means, just a dabbler, but my understanding is this: in theory, more parameters make richer, wider, and deeper model knowledge possible, and with extensive enough training all of those parameters could matter. That said, there is a lot of megapixel-like inflation, and there is no guarantee that any given parameter is actually useful. In practice, the more advanced small models tend to squeeze more out of the limited parameters they have, which is what lets them run on smaller devices. In general, I go for the largest parameter count of a particular model that I can reasonably run. My typical target range is 8GB up to maybe 20GB, which depending on the model might be in the 9B to 30B parameter range. I might even be erring on the wrong side of this and be better off with smaller models.
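
      For anyone trying to map parameter counts onto gigabytes, here is a rough back-of-the-envelope sketch. It assumes the size of the weights is simply parameters times bits per weight, which ignores the extra memory needed for context and the small per-block overhead that real quantized files carry, so treat the numbers as lower bounds. The function name `model_size_gb` is made up for illustration.

```python
def model_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10^9 bytes)."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Compare a few parameter counts at common precisions.
for params in (9, 14, 30):
    for label, bits in (("F16", 16), ("Q8", 8), ("Q4", 4)):
        print(f"{params}B at {label}: ~{model_size_gb(params, bits):.1f} GB")
```

      By this estimate a 30B model needs about 60 GB at F16 but about 15 GB at 4 bits per weight, which is why the quantization level matters as much as the parameter count when sizing a card.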

      There are also a lot of models nowadays that use “active” parameters (mixture-of-experts, or MoE). The model as a whole has X parameters, but for each token it picks the subset most relevant to what it is processing and only runs those, so a 30B model might do roughly the work of a 4B model per token. As far as I understand it, that choice is remade for every token rather than fixed once, so the model doesn’t permanently shrink: you still typically need to load the whole thing into some kind of RAM, but it runs much faster than a dense model of the same total size. This is another way to run larger models on more limited hardware. Older-style “dense” models, where all parameters are always active, are still typically preferred for some tasks like coding, but YMMV.
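
      To make the “active parameters” idea concrete, here is a toy sketch of top-k expert routing as I understand it. It is not any real model’s code: `NUM_EXPERTS`, `TOP_K`, and `route` are hypothetical names, random numbers stand in for the router scores a real model would compute, and real implementations differ in details such as how the weights are normalized. The point is only that a small subset of experts runs for each token, and that the subset changes from token to token, which is why the whole model still has to be loaded.

```python
import math
import random

NUM_EXPERTS = 8   # hypothetical: experts in one MoE layer
TOP_K = 2         # hypothetical: experts actually run per token


def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, k=TOP_K):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    top = sorted(range(len(router_scores)),
                 key=lambda i: router_scores[i], reverse=True)[:k]
    weights = softmax([router_scores[i] for i in top])
    return list(zip(top, weights))


rng = random.Random(0)
used = set()
for token in ("The", "quick", "brown", "fox"):
    # A real router derives these scores from the token's hidden state;
    # random numbers stand in for that here.
    scores = [rng.gauss(0, 1) for _ in range(NUM_EXPERTS)]
    chosen = route(scores)
    used.update(i for i, _ in chosen)
    print(token, [(i, round(w, 2)) for i, w in chosen])

print("experts touched across tokens:", sorted(used))
```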

      Either way, it’s still sort of a crapshoot, with a lot of randomness and subjectivity. Very small models often seem able to outperform much bigger ones when they are good, well-trained, advanced models, and they are typically much faster, so if you don’t like a response it’s much easier to just ask again or retry. I tend to trust the community wisdom on this. I also think there’s a lot of cargo-culting and herd-following going on, but I don’t know enough to do much differently from the herd myself, other than being willing to experiment a little. Latest is not always greatest, but in a field moving as quickly as this one it often is. Don’t be afraid to try older or less popular models. You’ll often be disappointed, but not always.

      Quantization is a form of lossy compression of the model’s weights, not of the context. Instead of storing each parameter’s weight at full floating-point precision (the default is typically F16, or 16 bits per weight), the weights get rounded to a smaller number of bits. Q4 means roughly 4 bits per weight: within each small block of weights, every value is stored as one of 16 integer levels (0 to 15) plus a shared scale factor, rather than as its own 16-bit float. In practice this is usually almost as good. Q8 is even closer to the original full-size model, while smaller quants like Q2 and Q3 start losing quality. Related techniques help at the low end: an importance matrix (imatrix) uses sample text to work out which weights matter most so they can be preserved more carefully, and the IQ-series quants (such as IQ3) map values non-linearly, which is particularly helpful at Q3 and smaller. The community has pretty much adopted Q4 as the go-to level, the best available compromise between squeezing more parameters into less memory and not destroying the accuracy of those parameters.
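
      Here is a minimal sketch of the rounding idea, assuming the simplest possible scheme: one block of weights, a shared scale and offset, and plain rounding to the nearest level. Real quantization formats are more elaborate, and the function names and the sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to integers in [0, 2**bits - 1] plus scale and offset."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 if the block is constant
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Recover approximate floats from the integer codes."""
    return [lo + c * scale for c in codes]


# Hypothetical block of eight weights.
block = [-0.42, -0.10, 0.00, 0.07, 0.13, 0.31, 0.55, 0.90]
for bits in (8, 4, 2):
    codes, scale, lo = quantize_block(block, bits)
    restored = dequantize_block(codes, scale, lo)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"Q{bits}: codes={codes} worst error={worst:.4f}")
```

      Running it shows the worst-case rounding error growing as the bit count drops, which is the quality loss people describe at Q2 and Q3.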