Conducting deep web searches and gathering sources is one of the main things I’ve been using LLMs for. How far away are we from being able to self-host something like Claude’s web search capabilities? Or even just a service where I’d pay with my money instead of my data?

  • vapeloki@lemmy.world
    link
    fedilink
    arrow-up
    4
    ·
    2 days ago

    For those who want to know more, rough setup:

    • llama-cpp rocmfp4 fork
    • currently custom quantized qwen3.6 35B A3B model, working on publishing
    • be3 embedding and reranker, also GPU
    • gemma4-e4b via FastFlowLM on NPU!
    • OpenWebUI and searxng as docker containers on a Pi currently

    We get 70-100tok/s generation. Four slots with 256k context length each.

    We use a smaller Board with “only” 64GB of shared LPDDR5X. Bottleneck is memory speed, rocmfp4 quants help a lot.

    As soon as I get my imatrix calibration right, I will publish the quantized versions.

    Most existing quantized models are broken. The authors did some not supported stuff (like using a already quantized model and requantize it) that you may get issues with coherence or sudden Chinese words in the output.

    That is not an issue with rocmfp4 but with vibe coders and agent psychosis.

    • ejs@piefed.social
      link
      fedilink
      English
      arrow-up
      2
      ·
      1 day ago

      Thank you so so much for pointing out ROCmFP4. I have been tinkering with my RDNA 3 framework on llama. I was struggling with ROCm llama.cpp and have been using vulcan in the meantime. I know there’s some issues on the llama.cpp github to try and fix my issue (UMA stuff), but haven’t come across this specific project. Gonna try it out

    • TropicalDingdong@lemmy.world
      link
      fedilink
      arrow-up
      5
      ·
      2 days ago

      Do you have a walk through for setup?

      I’m on the strix halo 128 gb variant and while I got ollama working fine, i haven’t gotten any of these multi headed setups working

      • vapeloki@lemmy.world
        link
        fedilink
        arrow-up
        5
        ·
        1 day ago

        I am on Gentoo for it, but everything with a decent rocm should work.

        Have a look for llama-swap, that handles multi head endpoints.

        Also, as you are on a big board, you can quantize yourself, as the BF16 version of qwen has only 72gb.

        I will try and post a full writeup next days. But feel free to dm me, if you need some guidance on quantize or more.

        I am using this fork currently: https://github.com/charlie12345/ROCmFPX

        Stuff happens fast currently, so may be worth to wait a week or two ig you need something super stable, but if you are up for experimenting, that’s the way to go

        • TropicalDingdong@lemmy.world
          link
          fedilink
          arrow-up
          3
          ·
          1 day ago

          THis is great, thanks. I’m on the z-13 and needed to use it for a work project, which is wrapping up soon. I’m planning on re-building it as a locally hosted agent support machine.

        • ShimitarA
          link
          fedilink
          English
          arrow-up
          2
          ·
          1 day ago

          Great man! Gentoo lover and long time addicted here… Keep it the good work!