PewDiePie releases Codex/ClaudeCode/Cursor killer, Odysseous (FOSS)

appauled@sh.itjust.works · 2 months ago


onlinepersona@programming.dev · 2 months ago

How many GPUs do you even need to have a usable, self-hosted AI? It looks like he has 6 on his rig. Probably each costs 2k or something. That’s not peanuts. I have a 12GB VRAM card. It probably can’t generate anything in any meaningful amount of time. Which brings me to the question: who is this for?

Regardless, impressive what he vibe-coded there.

Rhaedas@fedia.io · 2 months ago

16GB is plenty for even older model setups. Now they’ve got a few models designed so you load just parts of the model onto the GPU (Mixture of Experts) and use the CPU for less referenced sections, so you get both reasonable speed and a much more complex model.

onlinepersona@programming.dev · 2 months ago

Oh nice. Does that depend on just the model or are there other requirements like CUDA or something?

Rhaedas@fedia.io · 2 months ago

Most models are going to require CUDA. There are some AMD ones out there, but it’s a totally different math and setup. As for the one I mentioned, it’s a pretty new idea so there are only a few out there, maybe just one (Qwen based). But I did get a 31B model to work on my 12GB, I just had to move from Ollama to llama.cpp to gain the control needed to set the parameters, and fine tune what it put on the CUDA to the max it would take. I had Claude help me along the way.

It’s new enough that there aren’t any good abliterated/uncensored models yet.

Jayjader@jlai.lu · 2 months ago

I’m surprised that you’re talking about models being CUDA-specific or AMD-specific. I’ve had a bunch of models running on my amd-only pc, using ollama, lemonade, and lm-studio, through either rocm or vulkan. None of these models were billed as AMD-specific. I had to do some config tweaking for ollama to use my graphics card but that’s more because I have a weird in-between-generations card that also predates the LLM hype (6700XT).

However, I did generally need to look for the GGUF format versions of things - usually accounts like unsloth have them uploaded on huggingface barely a day or two after the original version gets posted.

cecilkorik@piefed.ca · 2 months ago

For chat usage (which is strictly a more efficient way to generate code on the LLM’s part, although you have to keep it carefully guided and compartmentalized, otherwise it typically requires a lot more testing and sometimes back-and-forth iteration on your part), 12GB is plenty to run many decent LLMs. You’ll typically want to use a Q4 quantization to make models with larger parameter counts fit into smaller memory, sometimes an IQ2 or IQ3 if you really want a particular model.

For agentic usage (where the LLM is trained and optimized to use a harness like this to start requesting tool calls and getting their results and using the results of the tool calls to inform what it’s trying to do) it’s quite a bit more challenging to do on consumer hardware at a tolerable speed. The tools often generate large amounts of output which then take a long time to process, and the models and harnesses are both typically quite a bit stupider about using your limited resources efficiently. If you’re used to commercial “frontier” agentic models like Claude Code you’re going to have a bad time.

That said, it is absolutely possible to do agentic AI on consumer hardware (just the GPU you have, not 6 of them), as long as you’re reasonably patient and using a harness properly tuned for efficiency. Out-of-the-box, many if not most are designed for remote API usage. Even the “open source, local” ones realistically rely on free tier APIs and are inherently wasteful: they don’t really care how many tokens you burn in these remote datacenters, and they expect to just be able to iterate over and over again until they get it right. You don’t have that luxury when you’re getting slow tokens.

Is PewDiePie’s any better or more efficient? I don’t know, I haven’t tried it yet. I prefer more minimal harnesses personally; OpenCode is about the most usable I’ve found, although I’m starting to experiment with Pi-mono (called Pi, but that’s unsearchable) which seems very promising, and I know quite a few people who have had good success with Hermes Agent.

I’m not going to pretend it’s going to be easy or that you’ll necessarily have very good results. I am pretty lukewarm on AI as a whole, but I am personally deeply invested in making sure I have fully local access to it in as much capacity as is currently technologically possible, as a personal digital sovereignty issue.

As for hardware, I have a 12GB card myself and you don’t really need to fit everything into VRAM these days. I have an AMD X3D CPU which allows me to offload some of the model to system RAM with pretty decent performance. Maybe it’s prohibitive on different architectures or configurations, I don’t know, but it’s worth a try.

glm-4.7-flash:Q4_K_M from ollama is the model I’ve had the most consistent success with. With ollama running it with the context window set to 50,000 (context should also be set to be quantized to Q4_K_M), I end up with almost half of it offloaded to system RAM and it still runs quite fast thanks to the flash attention feature.

I’ve worked with gemma4 quite a lot too and it’s definitely really fast, but it’s also a bit unstable/weird at times, at least the heretic version I’m running (hf.co/Stabhappy/gemma-4-26B-A4B-it-heretic-GGUF:Q4_K_M) is. Still, if you really do need to fit everything into a smaller amount of RAM you might try the gemma4 E4B models, which clock in around 9GB when quantized. Qwen3.6 is I guess supposed to be really good too and should fit nicely on your 12GB card, but I haven’t had much opportunity to play with it yet. Qwen3 and 3.5 felt rather disappointing to me for agentic use but YMMV.
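For anyone wanting to try the Ollama setup described above, here is a minimal sketch of asking a local Ollama server for a larger context window through its HTTP API. It assumes the server is running on the default local port and that the model tag from this comment has already been pulled; `build_payload` and `generate` are just illustrative helper names. As far as I know, flash attention and context (KV cache) quantization are server-side settings controlled with the `OLLAMA_FLASH_ATTENTION` and `OLLAMA_KV_CACHE_TYPE` environment variables rather than per request, so check the Ollama docs for the values your version accepts.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_payload(model, prompt, num_ctx):
    """Build the JSON body for a single non-streaming generation request."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_ctx": num_ctx},  # context window in tokens
    }


def generate(model, prompt, num_ctx=50_000):
    """Send the request to a running Ollama server and return the reply text."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]


# Only show the request body here; call generate(...) once a server is running.
print(json.dumps(build_payload("glm-4.7-flash:Q4_K_M", "Hello", 50_000), indent=2))
```

A bigger `num_ctx` costs memory on top of the weights, so it is worth lowering it first if the model spills too far into system RAM.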

You’re not completely going to outsource all software and all code you write to AI using a local model, the way companies are doing with those commercial models. But I consider that an advantage, not a flaw. I find it’s much more useful to have it help, suggest and advise, not to completely replace everything I’m doing. Yes, sometimes it’s slow and sometimes it’s wrong, but so are other people when I ask them sometimes. I’m prepared for it, and you should be too. Don’t get complacent.

onlinepersona@programming.dev · 2 months ago

Thank you for that writeup.

Do you know how important the parameter size is? 12b, 24b, 128b, etc. Does it really improve performance or is it like megapixels in a camera: more megapixels don’t necessarily mean a better picture?

And what’s “quantisation”? Context compression or something?

I’ve been considering buying a better card to test models (also want to be personally sovereign), but NVIDIA on Linux gives me the heebie-jeebies and, last I checked, AMD hasn’t released anything with more than 20GB in a while. In fact, figuring out hardware requirements has been tough and I’m considering just riding this whole thing out. Maybe the bubble will collapse and bring prices down to something reasonable.

cecilkorik@piefed.ca · 2 months ago

I’m not an expert by any means I’m just a dabbler, but my understanding is: In theory, more parameters make richer, wider, and deeper model knowledge possible, and with extensive enough training, those parameters could all be important. That said, there is a lot of megapixel-like inflation and there is no guarantee that any of those parameters are actually useful so in practice, really “advanced” models tend to do a better job of maximizing the usefulness of the limited parameters they do have to run on smaller devices. In general, I tend towards the highest parameter size of a particular model that I can reasonably run. My typical target range is between 8GB up to maybe 20GB, which depending on model might be in the 9b to 30b parameters range, and I might even be erring on the wrong side of this and maybe I’d even be better off with smaller parameter models.
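A rough back-of-the-envelope sketch of how parameter count and bits per weight turn into file size. The bits-per-weight figures below are illustrative assumptions (quantized formats carry some extra overhead for scales, so Q4 is treated as about 4.5 bits), and the estimate covers the weights only, not the context or runtime overhead, so real memory use will be higher.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (10**9 bytes)."""
    total_bits = params_billion * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# bits_per_weight values are illustrative assumptions, not measured figures.
for name, bits in [("F16", 16.0), ("Q8", 8.5), ("Q4", 4.5)]:
    for size in (9, 30):
        print(f"{size}B at {name}: ~{estimate_weights_gb(size, bits):.1f} GB")
```

The useful takeaway is the ratio: going from 16 bits to roughly 4 bits per weight cuts the footprint to around a quarter, which is why a 30B-class model becomes plausible on consumer hardware at all.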

There are also a lot of models nowadays that use “active” parameters, so the model itself will have X parameters, but then it will determine which of those parameters are most relevant to the task or query at hand, and prune off all but the most relevant ones, so you might have a 30B model, but as soon as you run it, it turns itself into a specialized 4B model. You still need to load the whole model into some kind of RAM typically so it can decide which parameters are relevant, but once it does, it will run much faster. This is another way you can try to run larger models on more limited hardware. Older “dense” models, which don’t use this technique and keep all parameters always active, are still typically preferred for some tasks like coding, but YMMV.
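A toy sketch of the “active parameters” idea, usually called Mixture of Experts routing. This is a simplified illustration, not how any particular model implements it: the experts here are plain scalar functions standing in for sub-networks, and the router scores are random instead of being computed from the input. One detail worth knowing is that the selection is redone for every token, which is why the whole model still has to sit in memory.

```python
import math
import random


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(router_scores, top_k):
    """Return [(expert_index, weight)] for the top_k highest-scoring experts."""
    ranked = sorted(range(len(router_scores)), key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_scores[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(x, experts, router_scores, top_k=2):
    """Run only the selected experts and mix their outputs by router weight."""
    return sum(w * experts[i](x) for i, w in route(router_scores, top_k))


# Eight toy "experts"; each is a scalar function standing in for a sub-network.
experts = [lambda x, k=k: x * (k + 1) for k in range(8)]

random.seed(0)
scores = [random.uniform(-1, 1) for _ in experts]  # a real router computes these from x
picked = route(scores, top_k=2)
print("experts used for this token:", [i for i, _ in picked])
print("output:", moe_layer(3.0, experts, scores))
```

Only two of the eight experts do any work per call, which is the source of the speedup; the other six still take up memory.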

Either way, it’s still sort of a crapshoot, there’s a lot of randomness and subjectiveness, and very small parameter models often seem to realistically be able to outperform much bigger models when they are “good”, “well-trained” advanced models, and they will typically be much faster, so if you don’t like the response, it’s much easier to just ask again or retry. I tend to trust the community wisdom when it comes to this, although I also think there’s a lot of cargo-culting and herd-following going on, I don’t know enough to do anything too much different from the herd myself, other than be willing to experiment a little. Latest is not always greatest, but in a field as quickly moving as this it often is. Don’t be afraid to try older models, or less popular models. You’ll often be disappointed, but not always.

Quantization is a form of compression, basically instead of using floating point precision to weigh the “strengths” of the various parameters (default is typically F16 or 16 bits per parameter weight), they get quantized down to smaller groups of bits. Q4 means you’re using 4 bits (essentially ranking each parameter on an integer scale from 0 to 15 instead of a floating point from 0 to 1) and in practice this is usually almost as good. Q8 would be even closer to the original full-size model, but smaller quants like Q2 and Q3 start losing quality. Other quantization-related techniques like i-Matrix (imat) map these values non-linearly and situationally, which is particularly helpful on quantizations Q3 and smaller, which are then called IQ3. The community has adopted Q4 as pretty much the go-to quantization level as the best available compromise between having more parameters being squeezed into less memory without destroying the inherent accuracy of those parameters.
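To make the “integer scale from 0 to 15” idea concrete, here is a minimal sketch of 4-bit quantization on one small block of weights. It is a deliberately simplified scheme (one scale and one offset per block, made-up weight values); real formats like Q4_K_M are more elaborate, so treat this as an illustration of the principle only.

```python
def quantize_block(weights, bits=4):
    """Map floats to integers in [0, 2**bits - 1] using one scale and offset per block."""
    levels = 2 ** bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # avoid dividing by zero for a constant block
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantize_block(codes, scale, lo):
    """Turn the integer codes back into approximate float weights."""
    return [lo + c * scale for c in codes]


weights = [-0.42, -0.10, 0.03, 0.27, 0.55, -0.31, 0.12, 0.48]  # made-up example values
codes, scale, lo = quantize_block(weights, bits=4)
restored = dequantize_block(codes, scale, lo)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print("4-bit codes:", codes)
print(f"largest round-trip error: {worst:.4f} (at most half a step, {scale / 2:.4f})")
```

Dropping `bits` to 2 or 3 leaves only 4 or 8 levels per block, so the step size and the rounding error grow quickly, which matches the quality loss people report at Q2 and Q3.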

realitaetsverlust@piefed.zip · 2 months ago

I use a 6700 XTX and it’s working perfectly fine, depending on the model. Gemma4 takes a long time to generate answers, but the Qwen series is quick and starts generating answers in ~10 seconds.

onlinepersona@programming.dev · 2 months ago

What’s the quality of the answers though? And how much context can it hold? I imagine it’s only good for small, short questions, but have no concept of what is needed for that.

I’m assuming you’re using a 12b or 24b qwen model. The ones from deepseek go up to hundreds of billions of params and I can’t tell if bigger number is better or just meaningless posturing.

realitaetsverlust@piefed.zip · 2 months ago

I’m using the 35b models.

Quality for qwen is mostly fine - sometimes it does hallucinate some shit while thinking, but it does correct itself almost every time. But the answers themselves are, for the most part, precise and useful. Not what you know from the cloud models, obviously, but it’s absolutely fine for everyday use. What is actually annoying is the web search - not sure if that’s a qwen problem or a problem with Open WebUI, but it actually takes a long time to finish the search.

I once had a situation where a model was running into an “infinite loop” while thinking, repeating the same line over and over again. And once, qwen just started outputting Chinese halfway through the answer lol.

When it comes to context, I’m gonna be very honest - I don’t know. I have never hit any kind of problems or limits because of that since I’m not using AI over a long term project. I use it for small, concise cases and that’s it.

onlinepersona@programming.dev · 2 months ago

Thanks for the response. It’s interesting to read about the experience of others.

irmadlad@lemmy.world · 2 months ago

Didn’t downvote. I use AI, and not ashamed of it. I don’t write huge programs and I damn sure don’t release anything to the public mainly because, in the back of my mind, I can just see some poor chap using my code and now smoke is coming out of his server. It works for me. Usually it’s ‘write a script that does _________’ or Docker compose files. It seems pretty accurate for those uses and if I need a bash command sequence explained, it’s good for that too.

I also use AI when I master my audio tracks before I upload them. I am clinically deaf and there are some frequencies that I just can’t hear well enough to make a judgement call. It’s pretty good at that too.

Encrypt-Keeper@lemmy.world · 2 months ago

My MacBook Air with 24GB of unified RAM is enough to run something simple and useful.

KyuubiNoKitsune@lemmy.blahaj.zone · 2 months ago

That’s like what, 5 or 6k?

Encrypt-Keeper@lemmy.world · edit-2 2 months ago

Like 1k

ffhein@lemmy.world · edit-2 2 months ago

Price is comparable to a used RTX 3090 with 24GB VRAM, which is probably more attractive to someone who is also interested in Linux/Windows gaming (and already owns a PC I mean). I would also guess that the RTX would be faster than the MacBook. IMO unified RAM is more interesting when you can get a lot of it.

Encrypt-Keeper@lemmy.world · edit-2 2 months ago

The problem with that is you still have to buy the rest of the computer to put that 3090 in.

KyuubiNoKitsune@lemmy.blahaj.zone · 2 months ago

Reasonable price!

Dultas@lemmy.world · 2 months ago

I think in one video it looked like 16 cards. I think he did multiple bifurcations of the pcie lanes. I think he is / was using it for protein folding as well.

onlinepersona@programming.dev · 2 months ago

That’s definitely not my level of disposable wealth/income. I can barely afford one card.

Korhaka@sopuli.xyz · 2 months ago

Depends on what you want it to do and how well it should do it. Zero is potentially enough. A second hand card from half a decade ago can also do quite a lot.

artyom@piefed.social · 2 months ago

My buddy has an older 16GB card and I installed LM Studio for fun. It’s not quite as fast as some of the web-based ones, but perfectly usable.

new_world_odor@lemmy.world · 2 months ago

I have an RX 5600 XT (6GB), 32GB RAM, Ryzen 3600. System hasn’t been updated since I built it during covid. QwenV3-vl35B is the heftiest thing I can run; it gets around 2 tokens/sec in LM Studio. It’s easier than most people seem to think.

onlinepersona@programming.dev · 2 months ago

How do you not run out of VRAM? Does it offload to system RAM?

new_world_odor@lemmy.world · 2 months ago

Yes, it offloads into system RAM. Oh and I forgot to mention that’s with the context set around 25k. That can vary greatly per model though, it’s taken some experimentation to figure that out.

onlinepersona@programming.dev · 2 months ago

Thank you. That’s good to know.

apftwb@lemmy.world · edit-2 2 months ago

I can tell you from personal experience, 8GB is not enough for a snappy experience. Maybe if you had it set up to churn through data overnight. My RTX 3060 Ti was not happy.