Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

        • brucethemoose@lemmy.world
          link
          fedilink
          English
          arrow-up
          5
          arrow-down
          1
          ·
          edit-2
          3 hours ago

          Not anymore. Not with hybrid offloading, where the GPU handles dense tensors and the CPU only runs the sparse MoEs. I’m running a 300B model on a single 3090, and its faster than I can read.

          You just need to use the right framework, and the right model.

          I’d suggest trying ik_llama.cpp and a MoE like one of these: https://huggingface.co/models?other=ik_llama.cpp&sort=modified&search=35B

          And speculative decoding like DFlash or MTP (which you can also get specific models for).

          EDIT: Wrong link.

          • atzanteol@sh.itjust.works
            link
            fedilink
            English
            arrow-up
            1
            arrow-down
            1
            ·
            1 hour ago

            I’ll check that out - speed isn’t my biggest issue so much as coding performance… The qwen 3.5 model I was using can write code, but it’s… Meh? Like sometimes it doesn’t even compile.

            I did try tweaking llama.cpp to do some cpu offloading and it does seem to allow for much larger contexts at a modest performance loss. I’ll check out larger models.

            • brucethemoose@lemmy.world
              link
              fedilink
              English
              arrow-up
              1
              arrow-down
              1
              ·
              edit-2
              37 minutes ago

              CPU offloading is too slow unless you use a hybrid MoE model, with the --n-cpu-moe parameter, specifically.

              This only offloads “sparse” parts of the model to the CPU, which take up a lot of RAM but are very compute-lite to run. In practice, thats most of the size of modern MoE LLMs.