Quick post about a change I made that’s worked out well.
I was using the OpenAI API for automations in n8n — email summaries, content drafts, that kind of thing — and was spending ~$40/month.
Switched everything to Ollama running locally. The migration was pretty straightforward since n8n just hits an HTTP endpoint. Changed the URL from api.openai.com to localhost:11434 and updated the request format.
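For anyone curious what "updated the request format" means in practice, here's a rough sketch of the two request shapes. This is illustrative, not the poster's actual n8n config: the model names are placeholders, and you'd plug these bodies into n8n's HTTP Request node.

```python
# Sketch of the two request body shapes (model names are illustrative;
# adjust the URL and model to match your own setup).

def openai_chat_request(prompt):
    # OpenAI-style chat completion request
    return {
        "url": "https://api.openai.com/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [{"role": "user", "content": prompt}],
        },
    }

def ollama_chat_request(prompt):
    # Ollama's chat endpoint; "stream": False returns a single JSON reply
    return {
        "url": "http://localhost:11434/api/chat",
        "body": {
            "model": "llama3:8b",
            "messages": [{"role": "user", "content": prompt}],
            "stream": False,
        },
    }
```

The messages array is nearly identical in both, which is why the migration is mostly a URL swap plus small field tweaks.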
For most tasks (summarization, classification, drafting) the local models are good enough. Complex reasoning is worse but I don’t need that for automation workflows.
Hardware: i7 with 16GB RAM, running Llama 3 8B. Plenty fast for async tasks.
Keep that n8n updated. There have been several high- and critical-severity CVEs recently, and I'm betting more are to come.
Free bullshit generator
No, not free. OP's power bill just climbed behind the scenes to match. Probably a discount, but definitely not free.
Unless OP is running a data center, there's not really much of a power increase from running a local Ollama.
Running a thousand watts versus not running a thousand watts can be quite a difference depending on where you live. And then consider buying all of the hardware. In many cases it's probably cheaper to just pay $40 a month.
That would be true worst case, but you’re never running inference 24/7. It’s no crazier than gaming in that regard.
I hate that LLMs are called “AI”, but they do have some uses if trained on the right data set (rather than pirating all the data on the internet and making the LLM think it’s valid data). I have been wanting to set one up for my Home Assistant voice control so that it can better understand my speech, and also for better image component recognition for tagging in Immich.
I wish they would force the companies to release their training data sets, considering they are getting a lot of it illegally. (Not that I’m a big copyright fan, but it’s crappy that copyright applies to individuals and small businesses but not to big rich people and corporate-backed companies. Attribution, and a copyleft policy if the creator wants it, is something I agree with strongly.) If we could get the data sets, pick and choose which portions we want to include, and then train our own LLMs, it would be better. It’s why scientific LLMs actually are useful: they are trained primarily on peer-reviewed scientific data, not 4chan and Reddit craziness, or SciFi and parody works treated as fact. No wonder it hallucinates.
Bullshit in, bullshit out, to paraphrase. If you teach a toddler propaganda from 4chan, or SciFi, parodies, and hate speech as fact rather than giving it all context, they turn out to be the people who post that nonsense. But the people funding it want quick results with no effort, and that’s what they get: a poorly educated child randomly spouting nonsense. LOL
As much as I rail against regulation, or more so over-regulation, AI needs some heavy regulation. We stand at the crossroads of a very useful tool that is unfortunately hung up in the novelty stage of pretty pictures and AI rice cookers. It could be so much more. I use AI for a few things. For one, I use AI to master the music I create. I am clinically deaf, so there are frequencies that I just can’t hear well enough to make a call, so I lean on AI to do that, and it does it quite well actually. I use AI to solve small programming issues I’m working on, but I wouldn’t dare release anything I’ve done, AI or not, because I can always picture some poor chap who used my ‘code’ and now has smoke billowing out of his computer. It’s also pretty damn good at compose files. I’ve read about medical uses that sound very efficient at ingesting tons of patient records and reports and pinpointing where services could do better in aiding the patient, so that people don’t fall through the cracks and miss the medical treatment they need. So it has some great potential, if we could just get some regulation and move past this novelty stage.
I’m not a huge fan of AI, but I consider myself pretty open-minded and have been considering doing a demo of Claude to at least gain an understanding of the tech I’m constantly talking shit about.
Is there anything self-hostable that compares in quality to what vibe coders claim Claude Opus is capable of?
The trash talking on AI is half people with legitimate concerns on the societal and ecological impact and the other half just want to be in on the party and aren’t interested in understanding it. It’s useful like googling things is useful, the items you search for are not always correct, but if you have a basic level of knowledge it’ll help you get where you want to be much faster.
Nothing quite compares to Claude Opus in a cohesive package that I’d recommend for an average self-hoster, but I personally really like running Nemotron from Nvidia. It’s not the best model, but in my experience it’s consistently good enough, along with being fast and stable. If you’re focused more on coding, I hear the Qwen series has some good models.
I actually did an experiment on doing just that. For context, I’m an experienced software engineer whose company buys him a ton of Claude usage, so I had time to test out what it can actually do, and I feel like I’m capable of judging where it’s good and where it falls short.
How Claude Code works is that there are actually multiple models involved: one for doing the coding, one “reasoning” model to keep the chain of thought and the context going, and a bunch of small specialized ones for odd jobs around the thing.
The thing that doesn’t work yet is that the big reasoning model still has to be big; otherwise it will hallucinate frequently enough to break the workflow. If you could get one of the big models to run locally, you’d be there. However, with recent advances in quantization and MoE models, it’s actually getting close fast enough that I would expect it to be generally available in a year or two.
Today the best I could do was a setup with 150 GB of RAM, 24 GB of VRAM, and AMD’s top-of-the-line card, which took 30 minutes to do what takes Claude Code 1-2. But surprisingly, the output of the model was not bad at all.
What’s the model name to pull?
Probably use Gemma4 if your machine has the chops for it.
You could probably get away with using gemma3:4b or phi3.5.
Any quality difference?
Depends on what OP was using before, but going from something like GPT5.2 to Llama 3 8B will be a massive difference (although OP says they use it only for basic tasks, so that does offset it).
Llama 3 already being a very old model doesn’t help either.
I run Qwen3.5-35B-A3B-AWQ-4bit, which, while leagues ahead of Llama 3 8B, is still a very noticeable step down.
This is not to say open source is bad; if one had the resources to run something like Qwen3.5-397B-A17B, it would also be up there.
What kind of hardware do you need to run those models?
Depends on how much quantization, but still fairly beefy; I couldn’t run it on my homelab with a 3080 Ti, for example.
I generally use smaller 8-12b models and they’re alright depending on the task.
I’m running 2x 4090s, and the 35B fits very comfortably in that.
For large models like the 397B, there are several options that don’t take a ton of money; I’ve seen posts of people using arrays of used 3090s with good results.
The other option is CPU inference although with current RAM prices that is less cost effective.
I was looking at maybe an array of Milk-V JUPITER2 boards, since vLLM added RISC-V support, which could be very cost effective.
In general, you take the model size in billions of parameters (eg: 397B), divide it by 2 and add a bit for overhead, and that’s how much RAM/VRAM it takes to run it at a “normal” quantization level. For Qwen3.5-397B, that’s about 220 GB. Ideally that would be all VRAM for speed, but you can offload some or all of that to normal RAM on the CPU, you’ll just take a speed hit.
So for something like Qwen3.5-397B, it takes a pretty serious system, especially if you’re trying to do it all in VRAM.
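The rule of thumb above is easy to put into a few lines. This is a rough sketch, not an exact formula; the 0.5 bytes per parameter corresponds to roughly 4-bit quantization, and the 10% overhead factor for KV cache and activations is my own loose assumption:

```python
def estimate_memory_gb(params_billion, bytes_per_param=0.5, overhead=1.1):
    """Rough RAM/VRAM estimate for running a model at ~4-bit quantization.

    bytes_per_param=0.5 means 4-bit weights; overhead is a loose ~10%
    allowance for KV cache and activations (an assumption, not a spec).
    """
    return params_billion * bytes_per_param * overhead

# A 397B model comes out around 218 GB, consistent with "about 220 GB"
print(round(estimate_memory_gb(397)))

# A 35B model at 4-bit lands around 19 GB, which fits in 2x 4090s (48 GB)
print(round(estimate_memory_gb(35)))
```

Longer context windows inflate the KV cache well past that 10%, so treat the result as a floor rather than a guarantee.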
I only ever use my local AI for the Home Assistant voice assistant on my phone, but it’s more of a gimmick/party trick since I only have temperature sensors currently (only got into HA recently), and it can’t access WiFi, so it just sits quietly unloaded on my TrueNAS server.
Running any LLM on TrueNAS is not awesome. I’ve tried it with GPU passthrough and it’s just too much overhead. I may just burn all my stuff down and restart with Proxmox, running TrueNAS inside just for NAS duties. The idea of a converged NAS + virtualization box is wonderful, but it’s just not there.
The host networking model alone is such a pain, and then you get into the performance stuff. I still like TrueNAS a lot, but I think Proxmox is probably still the better platform.