Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

  • brucethemoose@lemmy.world

    CPU offloading is too slow unless you run a MoE model in hybrid mode, specifically with the --n-cpu-moe parameter.

    This offloads only the “sparse” parts of the model to the CPU. Those take up a lot of RAM but are very compute-light to run, since only a few of them are used per token. In practice, that's most of the size of modern MoE LLMs.
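
To put rough numbers on that, here is a back-of-the-envelope sketch in Python. Every figure in it (layer count, expert count, parameter sizes) is a made-up assumption for illustration, not taken from any real model, and `MoEConfig` / `weight_split` are hypothetical names. It only shows why the sparse expert weights can dominate the file size while each token touches a small slice of them.

```python
from dataclasses import dataclass


@dataclass
class MoEConfig:
    # All values are hypothetical, chosen only to illustrate the ratio.
    n_layers: int                # transformer layers
    n_experts: int               # experts per layer
    experts_per_token: int       # experts the router activates per token
    expert_params: int           # parameters in a single expert
    dense_params_per_layer: int  # attention and other always-used weights


def weight_split(cfg: MoEConfig) -> dict:
    """Compare sparse (expert) weights with dense weights."""
    expert_total = cfg.n_layers * cfg.n_experts * cfg.expert_params
    dense_total = cfg.n_layers * cfg.dense_params_per_layer
    expert_active = cfg.n_layers * cfg.experts_per_token * cfg.expert_params
    return {
        # Share of all weights that live in experts (candidates for CPU RAM).
        "expert_share_of_weights": expert_total / (expert_total + dense_total),
        # Share of the expert weights actually read for one token.
        "expert_fraction_touched_per_token": expert_active / expert_total,
    }


cfg = MoEConfig(
    n_layers=32,
    n_experts=64,
    experts_per_token=4,
    expert_params=20_000_000,
    dense_params_per_layer=80_000_000,
)
split = weight_split(cfg)
print(f"experts hold {split['expert_share_of_weights']:.1%} of the weights")
print(f"each token reads {split['expert_fraction_touched_per_token']:.1%} of them")
```

With these invented numbers the experts are the bulk of the model, yet only a small fraction of them is read per token, which is the reason keeping them in system RAM costs far less speed than offloading dense layers would.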