Do you host your own ML / AI / LLM? What do you use, and what do you use it for?

  • queerlilhayseed@piefed.blahaj.zone
    5 hours ago

    Yup, ollama, various models. I initially downloaded it because I, along with thousands of other people, wanted to see what would happen if I made models debate with each other after RAGging them with various books (The Prince, The Art of War, The complete works of Shakespeare, etc.).

    The results were uninteresting and I abandoned the project pretty quickly. I’ll sometimes use them for code analysis but they’re too slow on my rig to be really useful.

    • SuspiciousCarrot78@aussie.zoneOP
      5 hours ago

      Did you use OWUI's native “call simultaneous models to answer” feature for that, or one of the AI debate harnesses?

    • irmadlad@lemmy.world
      4 hours ago

      > wanted to see what would happen if I made models debate

      LOL I kind of do that…sort of. I’ll ask several AIs the very same question to see what they spit out.

        • irmadlad@lemmy.world
          2 hours ago

          Well I’ll be damned. Of course the law of large numbers dictates that someone, somewhere has had the same thought.

      • queerlilhayseed@piefed.blahaj.zone
        3 hours ago

        One of the projects I started and never got to a satisfactory end state was basically that, plus a judging round. Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness. Then the results would get logged to a spreadsheet.

        It’s simple enough, but for N models it requires N + N(N-1) = N^2 model calls per prompt (N answers, plus each model judging the other N-1 answers), so it takes forever to run any decent dataset on consumer hardware. If I had the resources and a way to run it that didn’t fry the planet, I think it would be a cool running set of comparative benchmarks. IDK if it’d be useful at all but I’m still interested to see the data.
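A minimal sketch of the respond-then-judge round described above, mainly to make the call count concrete. Everything named here is an assumption for illustration: `ask` is a hypothetical callable that sends a prompt to a named model (it could wrap a local Ollama server, for instance), and the judging prompt wording is made up. Under the "every other model" reading, a round costs N answers plus N(N-1) evaluations, so N^2 calls; if models also judged their own answers it would be N + N^2.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

# Hypothetical signature: ask(model_name, prompt) -> response text.
Ask = Callable[[str, str], str]


def run_round(
    models: List[str], prompt: str, ask: Ask
) -> Tuple[Dict[str, str], Dict[Tuple[str, str], str]]:
    """Every model answers the prompt, then judges every *other* model's answer."""
    # Step 1: N calls, one answer per model.
    answers = {m: ask(m, prompt) for m in models}

    # Step 2: N * (N - 1) calls, one per ordered (judge, author) pair.
    verdicts: Dict[Tuple[str, str], str] = {}
    for judge, author in permutations(models, 2):
        judge_prompt = (
            "Rate the following answer for accuracy and completeness.\n"
            f"Question: {prompt}\n"
            f"Answer: {answers[author]}"
        )
        verdicts[(judge, author)] = ask(judge, judge_prompt)
    return answers, verdicts


def calls_needed(n: int, judge_self: bool = False) -> int:
    """Model calls per prompt: n answers plus the evaluation calls."""
    evaluations = n * n if judge_self else n * (n - 1)
    return n + evaluations
```

With five models that is `calls_needed(5)`, or 25 calls per prompt, which is why a dataset of any real size gets slow on a single consumer GPU. Each `verdicts` entry maps naturally onto one spreadsheet row (judge, author, verdict).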

        • irmadlad@lemmy.world
          3 hours ago

          > Every model would respond to the same prompt, then every model would evaluate every other model’s response for accuracy and completeness

          If I understand correctly I sorta kinda do that. I’ll copy one AI’s response, prompt another with something like 'Validate AI response:', and paste it in. HAHA I thought I was being tricky but you’re already on it.

          • queerlilhayseed@piefed.blahaj.zone
            3 hours ago

            I think it’s tricky. It’s kind of like adding LLMs like vectors, and hopefully the effect can soften or at least reveal the shortcomings of individual models. Is it a good idea? I don’t know, I think there are good reasons to think it’s a waste of time and resources. I certainly think I’d need a better explanation of what use it would be before I spent more time building it. But I still think about what use it would be from time to time; I haven’t decided that it’s a bad idea yet.

            • irmadlad@lemmy.world
              2 hours ago

              > at least reveal the shortcomings of individual models. Is it a good idea? I don’t know,

              I mean I do it, in my rudimentary way, to check for some semblance of consistency. I’m unclear why you think that’s not a good idea?

              • queerlilhayseed@piefed.blahaj.zone
                2 hours ago

                P.S. This is a hypothesis; I haven’t even designed the test for it, much less run it. What follows are my suppositions.

                I think whether or not it’s a good idea depends on how similar all the models are. I don’t have a rigorous definition of “similar” but things like similar training data, similar design methodologies, similar QA processes would all contribute. Theoretically (I think), if they’re all dissimilar, they should each catch errors the others miss. However, the more similar they are, the more likely they have the same biases and weak spots, and your error rate from a response + verification may be the same or even higher than the error rate for just the original prompt, and you’d be unlikely to detect those errors using just two similar models. It can instill false confidence in the results because you’re doing something that should in theory increase the validity of the data, but in practice might make no difference or even make the quality of responses worse.
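A toy model of the supposition above, not a measurement of anything. It assumes (all of this is invented for illustration) that the answering model errs with probability `p_error`, that the verifier shares the relevant blind spot with probability `similarity` and then always misses the error, and that otherwise it catches the error with probability `catch_rate`. It does not cover the "even worse" case, where a verifier wrongly flags a correct answer or introduces a new error.

```python
def undetected_error_rate(p_error: float, similarity: float, catch_rate: float) -> float:
    """Probability that an answer is wrong AND the verifier fails to flag it.

    Toy assumptions (hypothetical, for illustration only):
      - p_error: chance the answering model is wrong
      - similarity: chance the verifier shares the same blind spot (always misses)
      - catch_rate: chance an independent verifier catches a real error
    """
    for name, value in (("p_error", p_error), ("similarity", similarity), ("catch_rate", catch_rate)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}")

    # An error slips through if the blind spot is shared, or if it is not
    # shared but the verifier still fails to catch it.
    miss_given_error = similarity + (1.0 - similarity) * (1.0 - catch_rate)
    return p_error * miss_given_error
```

In this model a fully similar verifier (`similarity=1.0`) leaves the undetected error rate equal to `p_error`, so the verification step adds cost and confidence but no accuracy, while a fully dissimilar one cuts it down to `p_error * (1 - catch_rate)`.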