I do self-host my own, and even tried my hand at building something like this myself. It runs pretty well; I'm able to have it integrate with HomeAssistant and kubectl. It can be done with consumer GPUs, I have a 4000 and it runs fine. You don't get as much context, but the point is minimizing what the LLM needs to know while calling agents. You have one LLM context that's running a todo list; it starts a new one that's in charge of step 1, which spins off more contexts for each subtask, and so on. It's not that each agent needs its own GPU, it's that each agent needs its own context.
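Roughly, the pattern looks like the sketch below. This is illustrative, not my actual code: `llm_call`, the `SUBTASK:` line protocol, and the localhost endpoint are all assumptions standing in for whatever local server and delegation convention you use. The point it demonstrates is that each sub-agent starts from a fresh message list containing only its own task, and the parent keeps just a short summary of each child's result.

```python
import requests

def llm_call(messages):
    # Assumes a local OpenAI-compatible server (llama.cpp, vLLM, Ollama,
    # etc.) at this URL; adjust model name and port to your setup.
    resp = requests.post(
        "http://localhost:8080/v1/chat/completions",
        json={"model": "local", "messages": messages},
        timeout=120,
    )
    return resp.json()["choices"][0]["message"]["content"]

def run_agent(task, depth=0, max_depth=2):
    # Each agent gets a FRESH context: only its own task, never the
    # parent's full history. This is what keeps per-call context small.
    messages = [
        {"role": "system", "content": (
            "You are a focused sub-agent. Either answer the task "
            "directly, or emit one line per subtask prefixed 'SUBTASK:'."
        )},
        {"role": "user", "content": task},
    ]
    reply = llm_call(messages)

    subtasks = [line[len("SUBTASK:"):].strip()
                for line in reply.splitlines()
                if line.startswith("SUBTASK:")]

    if not subtasks or depth >= max_depth:
        return reply  # leaf agent: its answer flows back up

    # The parent only ever sees a one-line summary per child, not the
    # children's full transcripts.
    results = [run_agent(sub, depth + 1, max_depth) for sub in subtasks]
    summary = "\n".join(f"- {s}: {r}" for s, r in zip(subtasks, results))
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": (
        f"Subtask results:\n{summary}\nCombine these into a final answer."
    )})
    return llm_call(messages)
```

On a single consumer GPU the recursion runs serially, so you only ever pay for one context's worth of KV cache at a time.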