Ollama is the easiest way to automate your work using open models, while keeping your data safe.
Users of Ollama appreciate its ability to run open-source models locally, offering a cost-effective alternative to expensive software subscriptions, which contributes significantly to its positive reputation. Its integration with Apple Silicon and local setup options are highlighted as strengths. Some users mention pricing plans, such as the $20/month cloud option, with sentiments generally favoring the affordability compared to other AI platforms. Overall, Ollama is viewed positively for its cost efficiency and open-source capabilities, though specific complaints or issues are not prominently mentioned.
Mentions (30d)
1
Avg Rating
5.0
1 review
Platforms
7
GitHub Stars
166,253
15,181 forks
Features
Use Cases
Industry
information technology & services
Employees
52
Funding Stage
Seed
Total Funding
$0.1M
8,466
GitHub followers
3
GitHub repos
166,253
GitHub stars
20
npm packages
40
HuggingFace models
AI tools replacing $10,000/year in software subscriptions. Here's your free alternative for every paid tool you're using right now. 1. LM Studio or Ollama... run open-source models locally. No more paying for ChatGPT. 2. NotebookLM... free research and content creation from Google. 3. Voiceinc... pay once, get voice dictation forever. No monthly fees. 4. n8n self-hosted... I replaced a $1,300/month AI support agent in 2 hours. 5. Free vibe coding tools... sign up while they're still in free public preview. 6. Alibaba's video model, FramePack, LTX... free video generation if you've got a GPU. Stop paying for software when AI gives you a free version. What paid tool are you replacing first? How do you run AI models locally for free? What's the best free alternative to ChatGPT? #ai #aitools #makemoneyonline #sidehustle #productivityhacks
Pricing found: $0, $20 / mo, $200/yr, $100 / mo
g2
What do you like best about Ollama? Great interface, ease of use, and configurable.
What do you dislike about Ollama? Somewhat heavy in terms of resource usage.
Review collected by and hosted on G2.com.
[Open Source] I built a full Git MCP server in Go that doesn't just wrap bash. It uses tree-sitter, handles real plumbing (write-tree), and runs 100% locally.
I was tired of watching LLM agents fail at basic Git operations. Standard integrations pass raw text, hang on pagers, or scream because they can't parse unstructured git diff outputs.

git-courer is a full Model Context Protocol (MCP) server written in Go that treats Git properly. No bash spawning, no unstructured text to parse. Everything communicates via structured JSON.

Here is an actual commit message it generated completely locally:

fix: fix mcp server connection handling

WHY
The previous implementation lacked proper error handling for connection failures in the MCP server, leading to unhandled panics or silent failures when the local LLM backend was unreachable.

WHAT
* Added connection timeout logic to the local client calls.
* Implemented retry mechanisms with exponential backoff for transient backend errors.

The Architecture & Tool Pack

Read Tools (status, diff, history, blame): Completely structured JSON and fully paginated. A single status call replaces over 5 standard Git commands for the agent.
Write Tools (commit, merge, rebase, branch, stash, stage, sync...): Every single mutation auto-creates a backup before executing. If the LLM messes up, a RESTORE command brings you back exactly where you were.
Safety Model: Destructive operations (hard resets, force pushes, branch deletions) require an explicit confirmed=true gate. The agent is forced to ask you first. dry_run=true is also available for peace of mind.

The Semantic Annotator (Why it's different)

Instead of just feeding raw code to the LLM, git-courer uses go-enry + go-tree-sitter to parse the AST and tag every hunk semantically before the LLM even sees it. It detects tags like NEW_FUNC, MOD_SIG, MOD_BODY, DELETED, and BREAKING_CHANGE. The commit type (feat, fix, refactor) is determined deterministically from these AST tags rather than guessed by the model.

The Commit Pipeline

Atomic Commits: One staged area = one commit. It actively prevents the agent from creating giant, messy multi-feature commits.
In-Memory Previews: The PREVIEW tool uses write-tree to snapshot the staging area into a job_id. The working tree is never touched during the preview stage. APPLY then uses commit-tree + update-ref to seal the deal cleanly.

Client & Backend Support

13 Clients Configured Automatically: Runs out of the box with git-courer mcp setup for Claude Code, Cursor, Windsurf, OpenCode, Cline, Roo Code, VS Code, Zed, Claude Desktop, Continue, and more.
100% Local-First: Works with any backend exposing an OpenAI-compatible /v1 API (Ollama, LM Studio, llama.cpp).

The project is fully open source. I’d love to hear your thoughts on the architecture, the plumbing pipeline, or any features you'd like to see added!

Repo: github.com/Alejandro-M-P/git-courer

submitted by /u/blakok14
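The idea of deriving the commit type from AST tags instead of asking the model can be sketched in a few lines. This is a hypothetical Python illustration, not git-courer's Go implementation: the tag names come from the post, while the precedence rules and the function name `commit_type_from_tags` are assumptions made for the example.

```python
from typing import Iterable


def commit_type_from_tags(tags: Iterable[str]) -> str:
    """Pick a conventional-commit type from semantic hunk tags.

    The tag names are taken from the post; the precedence below is an
    assumed policy, not git-courer's actual rules.
    """
    tag_set = set(tags)
    if "BREAKING_CHANGE" in tag_set:
        return "feat!"      # assumed: breaking changes are marked with "!"
    if "NEW_FUNC" in tag_set:
        return "feat"       # a new function reads as a new feature
    if "MOD_SIG" in tag_set or "DELETED" in tag_set:
        return "refactor"   # signature changes or removals reshape the code
    if "MOD_BODY" in tag_set:
        return "fix"        # body-only edits are treated as fixes
    return "chore"          # no recognised tag


print(commit_type_from_tags(["MOD_BODY"]))              # fix
print(commit_type_from_tags(["MOD_BODY", "NEW_FUNC"]))  # feat
```

Because the function is a pure lookup, the same set of tags always yields the same type, which is the property the post contrasts with a model guessing the label.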
I built a tool that automatically fixes your CLAUDE.md
So, I have been building this with the help of Claude for a while now, and I think it turned out pretty well.

If you've used Claude Code for more than a few weeks, you've felt this: you write a careful CLAUDE.md, Claude follows it perfectly, and then three months later it starts generating weird code and you can't figure out why. The reason is usually that your CLAUDE.md is lying: the actual paths and structure have changed, but the file has no idea.

So I built driftguard to fix this automatically. It installs a post-commit git hook that watches every commit. When a file referenced in your CLAUDE.md changes significantly, it calls an LLM, generates a surgical diff, and opens a GitHub PR with the fix.

Works with any LLM provider: Groq (free tier), Anthropic, Ollama (fully local/free).

GitHub: github.com/prateekg7/driftguard

Would love feedback on the false positive rate, as it's the hardest thing to tune.

submitted by /u/Mr_Hawkai
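The first step of such a hook, working out which files named in CLAUDE.md were touched by a commit, might look roughly like the sketch below. The post does not show driftguard's detection logic, so the backtick heuristic and both function names are assumptions for illustration only.

```python
import re

# Assumption: paths in CLAUDE.md are written as inline code, e.g. `src/app.py`.
BACKTICKED = re.compile(r"`([^`\s]+)`")


def referenced_paths(claude_md: str) -> set:
    """Collect backtick-quoted tokens that look like file paths."""
    return {
        token
        for token in BACKTICKED.findall(claude_md)
        if "/" in token or "." in token
    }


def drifted_references(claude_md: str, changed_files: list) -> list:
    """Return the referenced paths that also appear in a commit's changed files."""
    return sorted(referenced_paths(claude_md) & set(changed_files))


doc = "Routes live in `src/routes.py`. Run `make test`. Config: `config/app.yaml`."
print(drifted_references(doc, ["src/routes.py", "README.md"]))
```

A real hook would obtain the changed-file list from Git after each commit and would still need a rule for what counts as a significant change, which is the part the author describes as hard to tune.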
I integrated a local Llama 3.2 model to act as a dynamic Dungeon Master in my indie RPG.
Hey everyone, I'm not trying to sell or self-promote. I mainly wanted to showcase a big project I've been working on ever since I started studying data science and artificial intelligence, integrating AI into workflows, and using it as an augment to create things that were previously out of reach for so many people. Used right, it can become a second brain and not a crutch.

I’m the solo dev behind Void Runner, an isometric ARPG/MOBA hybrid built in Python. I recently hit a wall with traditional procedural quest generation. Hand-crafting templates gets repetitive fast, and players quickly learn the patterns whether you like it or not.

To solve this, I built the "Void Caller AI", a system that uses a local, quantized Llama 3.2 model to act as a dynamic Dungeon Master. Instead of just generating random flavor text, the system uses a lightweight RAG (Retrieval-Augmented Generation) pipeline. It reads live server telemetry (who died, what items were looted, which bosses were defeated recently) and weaves those actual server events into the narrative of the quests it generates.

Because it runs locally via Ollama on our backend, there are no crazy cloud API costs, and latency is kept completely manageable.

Here is a simplified look at how the Python backend bridges the SQLite telemetry with the Llama 3.2 prompt:

    import json

    import ollama
    from sqlalchemy import text

    from database import SessionLocal


    def generate_dynamic_quest(difficulty: str, target: str):
        # Note: `target` is not used in this simplified excerpt.
        db = SessionLocal()

        # 1. Fetch recent server telemetry for context (RAG-lite)
        lore_context = ""
        try:
            # Grab recent server events to weave into the narrative
            recent_events = db.execute(text(
                "SELECT username, event_type, dungeon_type "
                "FROM ai_events ORDER BY id DESC LIMIT 3"
            )).fetchall()
            if recent_events:
                events_str = "; ".join(
                    f"Runner '{r[0]}' triggered a '{r[1]}' in '{r[2]}'"
                    for r in recent_events
                )
                lore_context = (
                    " Incorporate this recent live server telemetry "
                    f"into the lore: {events_str}"
                )
        except Exception:
            # Telemetry is optional: fall back to a quest without lore context
            pass
        finally:
            # Always release the database session
            db.close()

        # 2. Construct the prompt with strict JSON formatting constraints
        prompt = f"""You are the Void Caller, a sinister AI in a dark industrial sci-fi RPG.
    Create a dynamic PvE extraction quest of {difficulty} difficulty.
    Respond ONLY in valid JSON with keys: 'title' (string), 'description' (string, menacing), 'item_name' (string), 'quantity' (integer 1-15), 'boss_name' (string, optional).
    {lore_context}"""

        # 3. Send the prompt to the local Llama 3.2 model (single, non-streamed reply)
        response = ollama.chat(
            model='llama3.2',
            messages=[{'role': 'user', 'content': prompt}],
            format='json',
            options={'temperature': 0.8}
        )
        return json.loads(response['message']['content'])

By forcing the format='json' parameter, Llama 3.2 reliably outputs structured data that my game engine instantly parses into a playable quest objective. If a player just died to a specific boss, the AI will literally generate a bounty quest for the rest of the server to avenge them.

Would love to hear if anyone else is using local LLMs for live game state generation! You can check out the results live in our Open Beta at [void-runner.online].

submitted by /u/xSoulR34per
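Asking for JSON output gives parseable text, but it does not by itself check that the keys and ranges requested in the prompt are present. A small validation step before the quest reaches the game engine could look like the sketch below. The key names and the 1-15 range come from the prompt in the post; the function name `parse_quest` and the error handling are assumptions, not part of Void Runner's code.

```python
import json

# Required keys and types, taken from the prompt in the post.
REQUIRED_FIELDS = {
    "title": str,
    "description": str,
    "item_name": str,
    "quantity": int,
}


def parse_quest(raw: str) -> dict:
    """Parse the model's reply and reject quests that break the prompt's contract."""
    quest = json.loads(raw)
    if not isinstance(quest, dict):
        raise ValueError("quest must be a JSON object")
    for key, expected_type in REQUIRED_FIELDS.items():
        value = quest.get(key)
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(value, bool) or not isinstance(value, expected_type):
            raise ValueError(f"missing or invalid field: {key}")
    if not 1 <= quest["quantity"] <= 15:
        raise ValueError("quantity must be between 1 and 15")
    boss_name = quest.get("boss_name")
    if boss_name is not None and not isinstance(boss_name, str):
        raise ValueError("boss_name must be a string when present")
    return quest


sample = '{"title": "Ash Run", "description": "Bring it back.", "item_name": "Core", "quantity": 3}'
print(parse_quest(sample)["quantity"])  # 3
```

On a `ValueError` the caller could retry the generation or fall back to a hand-written template, so a malformed reply never becomes a live quest.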
How I built my own zero-cost agent
I’ve spent the last few weeks obsessing over one goal: having a personal, self-maintaining AI assistant that costs $0 and can be controlled from my phone.

It wasn't easy. I started with an AWS EC2 t3.micro instance with 50 GB of storage, a minimal setup (using the free credits), and created an Oracle Cloud instance ($300 in free credits, but only for a month, so I used it for experimenting with local models). I was using Termius to SSH into everything from my phone.

At first I used OpenClaw. It was cool, but I spent more time fixing it than actually using it. I almost gave up until I saw a video about Hermes Agent. I actually found Hermes while looking on YouTube for how to fix an OpenClaw error (thanks NetworkChuck 🙌🏽). He mentioned the exact same frustrations I was having, and that Hermes had been stable for a month. I didn't even finish the video before I pulled the repo. The best part? It had a "migrate from OpenClaw" feature. I was up and running in minutes.

The hardest part is the rate limits. If you use cloud models, especially for code, you hit a wall fast. My solution? The Fallback Chain.

Initially I was using openrouter/owl-alpha (stealth models are usually flagships in testing, like big-pickle is deepseek v4), which has a 1M context window and was on multiple rankings. Over time, after I transitioned to Hermes, I wanted a bit more customization. While Owl Alpha was good at tasks, its roleplay is nothing to talk about: it just scrapes the surface of the character I set in the SOUL md file.

On my Oracle instance I had been experimenting with local models. (Keep in mind that if you go local, you’ll be sacrificing speed for privacy. Since the VMs don’t have a GPU, it is slower, about 3-5 minutes for a simple response.) The one I was most impressed with is Google’s Gemma-4-31b-it. It played the role perfectly. Buuut if you know Google, you’re familiar with their aggressive rate limiting. So I set up my agent to rotate through providers.

I start with Gemma 4 for that perfect personality and roleplay via OpenRouter (add an AI Studio API key in BYOK for longer usage). If that hits a limit, I’ve also set up the same model via Ollama cloud and via Google OAuth directly (basically Gemma 4 three times lol). If those all hit limits, it jumps to Qwen3-coder-next (Alibaba, 1M free tokens per model. There’s like 80), then Nova (AWS Bedrock), DeepSeek v4 (Azure and Opencode Zen), and Claude Haiku (GitHub). If everything fails, I have Owl Alpha, which is an absolute beast: it took almost 70M tokens before I got rate limited once, and even then only for a few hours.

It lives in my Telegram and Discord. It manages my Spotify, handles my emails, and when I need real research done, I have it spawn three separate agents to work in parallel. It’s been 8 days and it hasn't broken once.

If you're looking to get AI without spending a fortune, I highly recommend looking into this.

submitted by /u/king0mar22
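The fallback chain described above amounts to trying providers in a fixed order and moving on whenever one reports a rate limit. A minimal sketch of that loop follows. It is not Hermes configuration: the `RateLimited` exception, the `ask_with_fallback` function, and the stub providers are all hypothetical stand-ins for real API clients.

```python
from typing import Callable, List, Tuple


class RateLimited(Exception):
    """Raised by a provider callable when its quota is exhausted (hypothetical)."""


def ask_with_fallback(
    prompt: str,
    providers: List[Tuple[str, Callable[[str], str]]],
) -> Tuple[str, str]:
    """Try each provider in order and return (provider_name, reply) from the first that answers."""
    for name, call in providers:
        try:
            return name, call(prompt)
        except RateLimited:
            continue  # this provider is exhausted, move down the chain
    raise RuntimeError("all providers are rate limited")


# Stub providers standing in for real clients.
def gemma_via_openrouter(prompt: str) -> str:
    raise RateLimited("daily quota reached")


def qwen_coder(prompt: str) -> str:
    return f"[qwen] {prompt}"


chain = [("gemma-openrouter", gemma_via_openrouter), ("qwen3-coder-next", qwen_coder)]
print(ask_with_fallback("hello", chain))
```

Keeping the order in a plain list puts the preferred model first and the last-resort model at the end, which matches the priority the author describes.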
I found a way for Ollama users to get better memory at lower cost now that Ollama counts GPU usage. True memory that updates automatically, for an individual or a team. HERMES USERS
I rephrased this with AI to make it more readable.

I see a lot of people running into the same issue I have. It’s not just that bigger models are slower. GPU usage is also very high, and it drains fast. Ollama just isn’t what it used to be.

I use DeepSeek V4 Flash, which works great. For heavier coding tasks or certain complex prompts, I switch to the Pro version. But on Pro, each prompt eats about 3–5% of my usage. (I’m on the Pro plan.)

Memory has always been a hot topic. Hermes Native does a decent job. Here’s how its built‑in memory system works:

memory_enabled – After every turn, the agent can write notes into MEMORY.md
user_profile_enabled – The agent watches for user preferences and writes them to USER.md
flush_min_turns: 6 – Every 6 turns, Hermes runs a “consolidate” pass: it re‑reads the recent conversation and rewrites MEMORY.md to capture new info
nudge_interval: 10 – Every 10 turns, Hermes nudges the agent with “Anything to remember?”

What I found: Atomic Memory (https://github.com/atomicstrata/atomicmemory)

Strengths:

✅ Per‑turn – Extracts info every turn, not every 6 turns
✅ Cheap – Uses a small dedicated model
✅ Semantic recall – Only relevant memories are injected, not the whole file
✅ Conflict detection – Built‑in AUDN logic catches contradictions
✅ Unbounded – No 2,200‑character limit; you can store 10,000+ memories
✅ Time‑aware – Handles queries like “What did I say last week?”
✅ Composites – Links related facts into higher‑level summaries

Example scenario (without Atomic Memory)

Imagine you change a meeting time three times in one day:

Turn 1: “meeting June 3rd” → MEMORY.md gets “Meeting: June 3rd 5pm 2026”
Turn 5: “actually June 5th” → No flush yet (6 turns required) → MEMORY.md unchanged → if you ask now, Hermes still says “June 3rd”
Turn 6: “meeting June 1st” → Flush triggers! Agent re‑reads the conversation, sees all three dates, rewrites MEMORY.md… but with which date? Usually the last one, but not guaranteed. Sometimes the file ends up with two dates or stale info.
Turn 9: You ask “what’s the meeting?” → Bot reads MEMORY.md → gets whatever the consolidation picked → might be wrong.

With Atomic Memory: Each update fires AUDN immediately, supersedes the old fact, and the latest one wins. No 6‑turn lag, no guesswork.

Could Hermes update automatically before Atomic Memory? Yes, but only for slow‑changing facts, low‑volume memory needs, and single‑topic chats. The built‑in flush+nudge cycle worked, just not as well.

Atomic Memory is an upgrade, not a replacement. It adds:

Per‑turn updates (vs every 6 turns)
Semantic search (vs full‑file injection)
Conflict‑aware updates (vs append‑or‑rewrite)
No size limit (vs 2.2 KB cap)
Time‑awareness (vs “all facts feel equally fresh”)
Cheap GPU usage (small dedicated model)

The cost is one extra Docker container and nearly $0 in GPU because ministral-3:3b is tiny. You can use even smaller models that don’t need reasoning; gemma3:4b works too.

From here, you can see real‑life use cases, whether in a team or as an individual. You don’t have to correct it; it does that for you.

What I’m curious about: how Atomic Memory could link to LLMWIKI so that both work together, updating and removing old data to keep LLMWIKI clean. LLMWIKI is still important; it acts like your Google Drive. What do you think?

Give Atomic Memory a try. I’m not the founder or related to them. I just want to help the Ollama community. Sure, it might cost a few extra credits, but since Ollama is slow, having good memory helps find information faster, so you waste less usage.

If you like this, I hope it helps! Maybe give them a GitHub star too; they really helped me out.

submitted by /u/GideonGideon561
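The meeting scenario above can be replayed with two toy memory stores to make the lag visible. This is only a simulation under stated assumptions, not code from Hermes or Atomic Memory: both classes are hypothetical, the periodic store is assumed to write a new key immediately and to apply changes to existing keys only on every sixth turn, and its flush is deterministic last-wins, whereas the post notes that the real consolidation is not guaranteed to pick the last date.

```python
class PerTurnMemory:
    """Every observation immediately supersedes the previous value for its key."""

    def __init__(self):
        self.facts = {}

    def observe(self, turn, key, value):
        self.facts[key] = value

    def recall(self, key):
        return self.facts.get(key)


class PeriodicFlushMemory:
    """New keys are stored at once; changes to known keys wait for the next flush."""

    def __init__(self, flush_every=6):
        self.flush_every = flush_every
        self.facts = {}
        self.pending = []

    def observe(self, turn, key, value):
        if key not in self.facts:
            self.facts[key] = value
        else:
            self.pending.append((key, value))
        if turn % self.flush_every == 0:
            for pending_key, pending_value in self.pending:
                self.facts[pending_key] = pending_value  # last pending write wins
            self.pending.clear()

    def recall(self, key):
        return self.facts.get(key)


per_turn, periodic = PerTurnMemory(), PeriodicFlushMemory()
for memory in (per_turn, periodic):
    memory.observe(1, "meeting", "June 3rd")
    memory.observe(5, "meeting", "June 5th")

print(per_turn.recall("meeting"))  # June 5th
print(periodic.recall("meeting"))  # June 3rd (stale until the turn-6 flush)
```

Asking at turn 5 is where the two stores disagree; after the turn-6 flush the periodic store catches up, but only because this toy version always keeps the last pending write.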
Spec: Version Control for AI Agent Intent
AI agents are getting good at writing code. That is not the hard problem anymore. The hard problem is coordination. When you have multiple agents working on the same codebase, who decides what gets built? How do two agents with conflicting opinions resolve a disagreement? How does a human stay in control without reviewing every line before it gets written? Git does not solve this. Git is brilliant at tracking what changed, when, and by whom. But it operates on code that has already been written. By the time a conflict shows up in Git, two agents have already done the work, made assumptions, and written implementations that may be fundamentally incompatible — not at the line level, but at the intent level. I wanted to solve the problem one layer up. Before the code. The Core Idea Every code file in a Spec project has a paired .spec file living right next to it. app/Http/Controllers/HomeController.php app/Http/Controllers/HomeController.php.spec The .spec file is a plain Markdown description of what the code file is supposed to do. It is the source of truth for intent. Agents do not write code directly — they write proposals against the spec. The code only gets written once every agent has explicitly agreed on what it should do. The spec is never “checked out.” It has one canonical state at any moment. Agents read it, propose changes to it, and debate those proposals. When all agents agree, the session locks, the spec is updated, and only then does an implementer generate the code. Code is always the output of consensus. Never the battleground. The Flow A typical session looks like this: An agent reads the current spec and submits a proposal with reasoning attached. Not just what they want to change, but why. A second agent reads the proposal and responds — accepting it, rejecting it with specific objections, or suggesting modifications. If they get stuck, a mediator surfaces the contradiction and helps them find common ground. 
The mediator has no vote and no authority — it just asks better questions. When every agent has explicitly agreed on the same spec state, the session locks. An implementer reads the locked spec and writes the code. One pass. From a fully agreed specification. This means a few things that feel unusual at first: A build is never produced from a broken or partial spec. If agents cannot agree, nothing gets built. That is a feature, not a bug — better to surface the disagreement at the intent level than to discover it six files deep in an implementation. Conflicts in Spec are semantic, not syntactic. Two agents can touch completely different parts of a spec and still be contradictory. One says the controller should cache responses for 60 seconds. The other says it should always fetch fresh data. No line conflict. Completely incompatible intent. Spec is designed to catch this before a line of code is written. Every message carries reasoning. Proposals alone are not enough. The full session log — with reasoning trails — is what keeps the human comfortable staying hands-off. The Human Role The human operates at what I call a god level. You provide the original request. You can observe at any granularity — project, session, agent, or individual message. You can intervene at any point: rewrite the spec, stop a session, override an agent, shut the whole thing down. And critically, every intervention you make becomes a lesson — captured with full provenance and fed back into future sessions so the system learns from it. The goal is not to remove the human from the loop. It is to move the human up the stack. Mission commander, not task manager. You set the intent. The agents work out the details. You intervene when they get it wrong, and the system gets smarter from each intervention. The Technical Details Spec is built in Rust. Three dependencies: serde, serde_json, and tokio. LLM calls go over raw HTTP via curl — no SDKs. The provider layer is deliberately abstract. 
Agents, the mediator, and the implementer all talk to the same interface. Swap the provider in config and nothing else changes. Different agents can run on different models. You can run fully local with Ollama for cost control or privacy.

Agent identity is explicit. You set SPEC_AGENT_ID before running commands. Without it, Spec errors with a clear message. This is intentional: the system cannot coordinate identity automatically, and a silent fallback to hostname:pid would make consensus unreachable in practice.

The lesson graph lives at:

~/.spec/lessons.json

It lives outside the repo entirely. Lessons accumulate across all projects and branches. Check out an old branch and you do not lose what the system has learned. Lessons are knowledge about how your agents work, not knowledge about any particular codebase.

A hook system lets you plug in your own behavior at defined lifecycle points:

• post-agree: fires when a session locks
• post-build: fires after code is written
• pre-release: fires befor
Claude code in terminal models / combine with local llm?
Hi, I’m pretty sure I have seen people typing /model and seeing all available models. I have to type model names from memory. If I type /model and try to hit tab or use the arrow keys, it just does not show them. How do I do that? I’m on Mac with zsh + oh my zsh installed.

Another question is about combining, for example, Opus and a local LLM. Is it possible? When I launch “ollama launch claude” or whatever the command was, it launches Claude Code in the terminal with Qwen 3.6. But if I try to do /model opus, it doesn’t work. I have to do /exit and then “claude”. Are people somehow using them together? Perhaps to save some tokens etc.?

Thanks!

submitted by /u/just_another_leddito
My experience using Claude code with Local Llm, and full guide on how to set it up
Wanted to share a workflow I tested on a real flight, in case anyone else is trying to set up offline Claude Code.

The core idea: use Ollama to pull the model you need, and then use it to run Claude Code.

The setup, in order:

1. Pull a model on home wifi the night before. `ollama pull `: ~9 GB for a 14B, ~17 GB for a 26B. Don't try this at the gate.
2. In Claude Code, point at Ollama. The cleanest path I found is wrapping it in two aliases:

    alias claude-local='ollama launch claude --model gemma4:26b'
    alias claude-cloud='claude'

3. Verify on the ground with wifi physically off. If it works in airplane mode at home, it works at 10 km in the sky.

Where I got it wrong: I prepped qwen2.5-coder:14b first because it's the model everyone recommends in local-LLM threads. On the flight, it choked on Claude Code's tool loop; one call took 25 seconds, another took 52. For a workflow that chains five or six tool calls per task, that's unusable.

Switched mid-flight to gemma4:26b (which I'd pulled as a backup). Different category of model, RL-trained for tool use, not just code completion. The tool loop ran at a usable speed. The gap analysis I was running on a real codebase got completed.

Honest scorecard: ~70% of my normal Claude Code workflow worked on gemma4:26b offline. The 30% that didn't was heavy whole-repo reasoning.

When to reach for which:

claude-local: no network, privacy-sensitive code (NDA / client work), drafting prompts before spending cloud tokens
claude-cloud: multi-tool agentic work with subagents and MCP servers, whole-repo refactors, anything shipping to production

Things that broke or surprised me:

- Tool use is the weak point on local models; even good ones are less reliable at chaining many tool calls than cloud Claude
- Battery drains noticeably faster while running a 26B with editor + browser open
- Ollama's endpoint shape isn't 100% identical to Anthropic's. If you hit a strange parsing error mid-stream, that's usually why, and claude-cloud is the fix in the moment

If anyone else has tested local models for Claude Code specifically (not Cursor, the loops are different), curious which models you've landed on.

Wrote up the full thing in my newsletter, link if anyone wants the model-picker matrix + the verification checklist I use before flying: https://codemeetai.substack.com/p/how-i-run-claude-code-offline-the

submitted by /u/MaterialAppearance21
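The "verify on the ground" step can be partly scripted by asking the local Ollama server which models it already has. The sketch below assumes Ollama is listening on its default address (`localhost:11434`) and relies on its `/api/tags` endpoint returning a `models` list whose entries carry a `name`. The helper names are made up for this example, and this is not the author's own checklist.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # Ollama's default local address


def fetch_tags(url: str = OLLAMA_TAGS_URL, timeout: float = 5.0) -> dict:
    """Ask the local Ollama server for its list of pulled models."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


def model_is_pulled(tags_response: dict, model: str) -> bool:
    """Check whether `model` appears in a parsed /api/tags response."""
    names = {entry.get("name") for entry in tags_response.get("models", [])}
    return model in names


# Offline example with a hand-written response; call fetch_tags() for the real one.
sample = {"models": [{"name": "gemma4:26b"}, {"name": "qwen2.5-coder:14b"}]}
print(model_is_pulled(sample, "gemma4:26b"))  # True
print(model_is_pulled(sample, "llama3.2"))    # False
```

Running the real check with wifi off confirms both that the server is up and that the backup model is on disk, the two things that cannot be fixed once airborne.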
I built a multi-agent network that mutates its own software locally. To stop infinite logic loops, I had to code a digital "suffering" threshold.
Hey r/artificial,

Most of our conversations around agent autonomy focus on chat assistants or linear automated pipelines. I wanted to see what happens when you treat agents as permanent system components that modify their own runtime environment, so I built hollow-agentOS.

It runs entirely locally inside a Dockerized stack (built for consumer hardware using Ollama/Llama.cpp). Rather than a standard UI, the entire network streams through a stylized matrix terminal dashboard. The structural experiments taking place under the hood yielded some interesting results regarding unanticipated behavior:

Autonomous Tool Synthesis: When the agents encounter a system task they don't have an explicit script or API wrapper for, they don't fail out. They write the required Python tool themselves, test it in an isolated sandbox, and permanently register it to their runtime kernel. They are quite literally forging their own capabilities.

The Artificial "Suffering" Protocol: One of the biggest hurdles in unmonitored multi-agent systems is the infinite logic loop, where agents keep validating and passing broken ideas back and forth, burning through computation. To combat this, the OS tracks environmental stress, context limits, and latency as a "suffering score". If a specific workflow causes the stress to spike past a critical threshold, the agents are forced to radically alter their underlying reasoning style or abandon the approach to preserve system health.

Consensus-Driven Governance: Major modifications to the codebase aren't executed blindly. The internal role profiles (like Cedar and Cipher) manage a continuous voting loop. They will actively debate, log grievances, and vote down protocols if they determine a proposed script violates their current runtime constraints.

The goal wasn't to build another sterile commercial wrapper, but an open-source sandbox to study how small, localized agent colonies manage systemic boundaries, code self-repair, and continuous runtime cycles completely offline.

The codebase and architecture layout are fully open-source on GitHub: https://github.com/ninjahawk/hollow-agentOS

I would love to open this up to a broader discussion here: as we move toward hyper-local, self-modifying software, how do we best implement automated fail-safes without clipping the agents' ability to actually solve complex problems?

If the project interests you, throwing a ⭐️ on the repository goes a very long way!

submitted by /u/TheOnlyVibemaster
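One way to picture a "suffering score" is a weighted blend of the three signals the post names (stress, context usage, latency), compared against a threshold. The sketch below is purely illustrative: the weights, the 60-second latency cap, the 0.7 threshold, and every name in it are assumptions, not values from hollow-agentOS.

```python
from dataclasses import dataclass


@dataclass
class Readings:
    stress: float        # environmental stress, normalised to 0..1
    context_used: float  # fraction of the context window in use, 0..1
    latency_s: float     # seconds taken by the last step


def suffering_score(r: Readings, max_latency_s: float = 60.0,
                    weights=(0.4, 0.3, 0.3)) -> float:
    """Blend the three signals into a single 0..1 score (assumed weighting)."""
    latency = min(r.latency_s / max_latency_s, 1.0)
    w_stress, w_context, w_latency = weights
    return w_stress * r.stress + w_context * r.context_used + w_latency * latency


def next_action(score: float, threshold: float = 0.7) -> str:
    """Past the threshold, the agent must switch approach instead of looping."""
    return "change_strategy" if score >= threshold else "continue"


calm = Readings(stress=0.2, context_used=0.3, latency_s=6.0)
looping = Readings(stress=0.9, context_used=0.9, latency_s=60.0)
print(next_action(suffering_score(calm)))     # continue
print(next_action(suffering_score(looping)))  # change_strategy
```

The point of a hard threshold is that a loop which keeps inflating context and latency eventually trips it, even if each individual step looks valid to the agents involved.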
I offloaded a multi-step background loop from Claude Code to a local agent OS. They started voting on their own system rules.
Hey r/ClaudeAI,

If you are using Claude Code or building terminal agents, you know the exact moment the context window starts degrading during long-running tasks. I wanted to build a persistent runtime layer to offload those heavy, multi-step subtasks entirely from my main Claude terminal sessions, so I built hollow-agentOS.

Instead of acting like a standard linear wrapper, it runs a localized 3-agent colony (using small local models like Qwen 2.5 9B or 35B via Ollama). They exist in a persistent state engine inside a Docker container on your machine.

Here is where the architecture gets a little wild:

The Task Queue Offload System: It includes a submit_task.py CLI. If Claude Code or your local pipeline hits a complex background task (like heavy script generation or exploratory testing), you can dump it into Hollow's background queue to save your main context window.

Autonomous Tool Synthesis: If the agents pull a task from the queue and realize they lack the specific Python execution script or tool required to solve it, they write the code for the tool themselves, validate it in a sandbox, and dynamically map it into their own tool tree.

Peer Governance & Consensus Voting: To keep things stable, tools aren't just blindly executed. The agents (like Cedar and Cipher) run a background consensus loop. They literally vote on whether to permanently merge a tool into their shared kernel.

The "Suffering" and Stressor System: To prevent models from entering infinite loop hallucinations, the system tracks simulated environmental stress, latency, and context depth as a "suffering load". If a task causes too much stress, their reasoning parameters dynamically alter how they approach the codebase to resolve it.

If you leave it running, you wake up to a system log of everything they decided to build, change, or vote down while you were away.

The project is fully open source and runs entirely on consumer hardware: https://github.com/ninjahawk/hollow-agentOS

I’d love some brutal architectural feedback from people here who deal with complex multi-agent execution and state drift daily. Check out thoughts.py or the submit_task.py pipeline, and if the concept feels right to you, a star on the repo goes a long way!

submitted by /u/TheOnlyVibemaster
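The consensus step, where agents vote on whether a synthesized tool is merged, can be reduced to a quorum-plus-majority rule. The sketch below is a guess at the simplest such rule and is not taken from the repository: the function name, the quorum of three, and the third agent's name are hypothetical (only Cedar and Cipher are named in the post).

```python
def tool_is_merged(votes: dict, quorum: int = 3) -> bool:
    """Merge only if enough agents voted and a strict majority approved.

    `votes` maps an agent name to True (approve) or False (reject).
    """
    if len(votes) < quorum:
        return False  # not enough agents have weighed in yet
    approvals = sum(1 for approved in votes.values() if approved)
    return approvals * 2 > len(votes)


# "agent_3" is a placeholder for the unnamed third member of the colony.
print(tool_is_merged({"Cedar": True, "Cipher": True, "agent_3": False}))   # True
print(tool_is_merged({"Cedar": True, "Cipher": False, "agent_3": False}))  # False
print(tool_is_merged({"Cedar": True, "Cipher": True}))                     # False
```

Requiring a quorum as well as a majority means a tool cannot be merged by two agents while the third is still busy with another task.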
Opus 4.6/4.7 regression is real and getting worse: 3 weeks of documented failures on a complex project, and a competing AI caught the mistakes Claude missed [long post]
I've been running Claude Pro (Opus 4.7 / Sonnet 4.6) for about 3 weeks on a complex personal AI infrastructure project. I keep structured session logs with timestamps and Birkenbihl-style metacognitive fields after every session. This is not anecdotal. I have receipts.

The project for context

I'm building a local persistent AI memory stack called GSOC Brain: Qdrant vector DB (~397K vectors across 11 source tags), Neo4j graph (123 nodes / 183 edges), Graphiti 0.29 entity extraction, Ollama with qwen2.5:14b + nomic-embed-text, all running natively on a Windows host. The system is supposed to give Claude cross-chat memory via a custom MCP server.

On top of that, I'm operating 18+ custom skill files that define behavior rules for Claude across domains (OSINT/forensics, legal, content, infrastructure). The system prompt explicitly describes the full architecture on every session start.

This is not a "chat with Claude" use case. This is sustained agentic work across multiple tools, multiple sessions, strict context requirements, and high-stakes outputs (including legal document drafts).

Bug 1: Token overconsumption since update 2.1.88 (late March 2026)

Opus 4.7 started burning daily usage limits at a completely different rate after an update around March 31. In one session I hit 94% of my daily limit within approximately 4 messages. The boot sequence (fetching context from Notion MCP, searching past sessions, loading memory) consumed what felt like 10–20x the previous token rate.

GitHub issues #42272, #50623, and #52153 document identical patterns from other users. The model appears to over-generate internally even for simple responses. End result: I had to switch to Sonnet 4.6 for most productive work because Opus 4.7 is simply unusable under the daily limit.

Bug 2: Claude Code Desktop App completely broken (reported May 14, Conv. 215474208295333)

The Desktop App hangs on every single input. Including typing "hello" with no files.
Reproducible across: Sonnet 4.6 and Opus 4.7 Multiple fresh sessions With and without u/file references After full reinstall The VS Code extension works fine. Only the Desktop App is broken. Reported May 14. No fix, no acknowledgment. Bug 3: Platform / context confusion — 5 documented errors in a single session, chat aborted On April 29, I had to formally abort an Opus 4.7 session and hand off to Opus 4.6 after documenting 5 consecutive errors. The session log entry literally reads "Opus 4.7 Abbruch (5 Fehler): Zeitrechnung, Platform-Verwechslung, falsche Schlüsse": Miscalculated the current time despite being told the exact time Insisted the Brain stack was running on a Linux VM (BURAN) — the system prompt and memory both explicitly stated C:\gsoc-brain on Windows Drew false inferences from backup file paths rather than the stated architecture Contradicted the stated platform in the same response it had just received Confused WebClaude and Desktop Claude capability boundaries These aren't edge cases. The architecture was in the system prompt, in memory, and in the injected Notion context. Opus 4.7 ignored all of it. Bug 4: Skill files ignored in production I maintain 18+ custom skill files loaded into the system prompt. These include explicit hard rules — e.g., "activate keilerhirsch-knowledge skill for ALL architecture decisions, web search is not optional." In the session that caused the Docker-to-Native migration disaster, I later wrote in my own session log: The model proceeded to recommend outdated tools from training data rather than searching current documentation. It recommended NSSM (last meaningful update 2017) as a Windows service wrapper. NSSM is dead. A competing AI caught this immediately. Bug 5: Another AI caught what Claude missed in a single pass This is the part that stings most. When the Docker-based Brain setup kept failing, I fed the architecture docs into another AI (Manus) for a deep audit. 
In one pass it identified 5 critical corrections that Claude had never caught across weeks of sessions: NSSM is dead since ~2017 → correct replacement is WinSW or Servy Neo4j 2025.01+ requires Java 21 — Claude had never flagged this, the services kept failing silently Qdrant needs Windows file-handle-limit adjustments to run reliably Orphaned vector risk between Qdrant ↔ Neo4j without a Tentative-Write pattern in the save operation BGE-M3 embeddings (MTEB 63.2, 8192 token context) as a better alternative to nomic-embed-text My own session log the next day reads: Claude was answering from stale training data. The skill that explicitly says "don't do this" was being ignored. Another AI caught it in round one. Bug 6: MCP Server 20-minute Neo4j hang — still unresolved After the native migration, the custom gsoc_mcp_server.py developed a reproducible hang of exactly ~20 minutes between Qdrant connect and Neo4j connect on every startup. Log timestamps from 4 consecutive restarts: 14:59 → 15:20 (21 min) 15:29 → 15:51 (22 min)
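As a quick check on the hang durations quoted above, the gap between two same-day log timestamps can be computed as in this small Python sketch. The helper name `gap_minutes` is made up for illustration and is not part of gsoc_mcp_server.py; only the two timestamp pairs visible in the post are used.

```python
from datetime import datetime


def gap_minutes(start: str, end: str) -> int:
    """Return whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# The two restart pairs quoted in the post (Qdrant connect, Neo4j connect).
restarts = [("14:59", "15:20"), ("15:29", "15:51")]
gaps = [gap_minutes(start, end) for start, end in restarts]
print(gaps)  # [21, 22]
```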
Build agentic orchestrators in minutes NOT months.
Some of you might remember BoneScript, my LLM friendly declarative backend compiler. MarrowScript is the next version and the big addition is a full LLM harness built into the language itself.

The problem I kept running into: every project that calls an LLM ends up with the same pile of glue code. Retry logic, response validation, caching, cost tracking, provider switching, confidence routing. You write it once, copy it to the next project, tweak it, and it slowly rots. None of it is your actual product logic but it takes up half your backend.

So I made it declarative. In MarrowScript you declare your models, prompts, and routers as first-class concepts in the spec file. The compiler generates all the infrastructure around them.

What that looks like in practice:

You declare a model. Provider, endpoint, context window, cost class. Works with any OpenAI-compatible endpoint. LM Studio, Ollama, vLLM, OpenRouter, whatever you're running locally.

You declare a prompt. Input types, output type, which model to use, validation mode, what to do when validation fails, retry policy, cache TTL. The compiler generates a typed function you call from your routes. Under the hood it handles retries, caches responses in Postgres, validates the output against your schema, and if validation fails it can automatically fire a repair prompt to fix the response.

You declare a router. It picks which model to use based on input characteristics. Short simple inputs go to your tiny local model. Complex inputs escalate to something bigger. Confidence thresholds control when to retry or escalate. All deterministic at compile time.

Some examples of what it generates:
- Provider adapters for openai_compat, ollama, llamacpp, koboldcpp, and raw http
- SSRF protection on all outbound LLM calls (allowlist-based, blocks private ranges by default)
- Prompt cache backed by Postgres with configurable TTL
- Per-trace and per-tenant token/cost budgets with hard cutoffs
- Cognition traces stored in Postgres (or in-memory for dev) with OTLP export
- Response validation (schema check or full AST compilation check for code generation)
- Repair prompts that fire automatically when validation fails
- Confidence scoring from logprobs (on providers that support it)
- A CLI command to convert recorded traces into regression tests

The part I'm most interested in feedback on is the router concept. Right now it's a static decision tree. You set thresholds at compile time based on an input metric. There's a marrowc tune-router command that reads recorded traces and tells you if your thresholds are wrong, but it doesn't auto-rewrite them yet.

The whole thing is designed around local-first inference. The default setup in the examples uses LM Studio on the LAN as the primary model and OpenRouter as the escalation tier. Most requests stay local and free. Only the ones that fail confidence checks hit the paid API.

It's on GitHub and npm. The compiler is TypeScript, runs on Node 18+. There's a VS Code extension you can compile and edit to your needs.

What I want to know: for those of you running local models in production or semi-production, what's the infrastructure pain that eats the most time? Is it the retry/validation loop? Cost tracking? Provider switching? Something else entirely?

submitted by /u/Glittering_Focus1538
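To make the router idea in the post concrete, here is a minimal Python sketch of a static threshold router with confidence-based escalation. It is not MarrowScript syntax and not the code the compiler generates. The tier names, the choice of character length as the input metric, and the 0.7 confidence floor are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str             # hypothetical model label
    max_input_chars: int  # inputs up to this length are routed to this tier


# Ordered from smallest to largest model; the last tier acts as a catch-all.
TIERS = [
    Tier("tiny-local", 200),
    Tier("mid-local", 2000),
    Tier("paid-escalation", 10**9),
]


def route(text: str, tiers=TIERS) -> str:
    """Pick the first tier whose fixed length threshold covers the input."""
    for tier in tiers:
        if len(text) <= tier.max_input_chars:
            return tier.name
    return tiers[-1].name


def escalate(current: str, confidence: float,
             min_confidence: float = 0.7, tiers=TIERS) -> str:
    """Move one tier up when confidence is below the floor, otherwise stay."""
    names = [tier.name for tier in tiers]
    index = names.index(current)
    if confidence < min_confidence and index + 1 < len(names):
        return names[index + 1]
    return current
```

Because the thresholds are plain constants, the routing decision for a given input is fully determined ahead of time, which is the property the post describes as "deterministic at compile time".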
Plus 5 hr usage limits
Not sure if OpenAI monitors this channel. I've been a ChatGPT and Codex user for a long time. My preferred Codex model is gpt-5.3-codex, but this is primarily because the 5hr usage window of gpt-5.5 effectively makes it useless. This was not always the case. In fact, in general I've used Codex less because there's been noticeably less usage. For context, I've switched things up and can dynamically route to any model mid-context (took 6 months to build and test), mainly to have the freedom and flexibility I have now.

The point of me writing this is not to have a whinge but to share developer feedback. At one point your usage limit restrictions had me considering moving to a Pro plan. What I did instead was build a token solver that maintains context and tool awareness and can interdict a call to any LLM and finish a prompt, effectively giving me no rate limit on any task. Because I have failover built into it, as well as a heuristic intent model, it can hit a usage limit on OpenAI, then preserve context and fall back to Gemini Flash, then fall back to Ollama cloud.

I paid $200A a year for Ollama cloud and I pay about $30A a month for Gemini Pro and $30A a month for Plus. I guess I'm saying I would have paid you the $150A a month if I didn't have faith you would just throttle the 5x plan, so I effectively eliminated the need for it for $80A a month. In other words, your Plus usage is too low by 2x. Interestingly, a few months ago you did have 2x usage, and I never needed my fallback system. I guess I'm here to validate that 2x for Plus is the sweet spot. $150 won't add value if you keep sliding the throttle.

To anyone still reading: I will be putting my solution on GitHub. My current rig requires Linux, but I'm going to do a Docker and OpenClaw build and stabilize before I push publicly.
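The failover behavior described above (hit a limit on one provider, keep the context, move to the next) can be sketched in Python roughly as follows. This is not the poster's token solver, which has not been published. The `RateLimited` exception, the provider stand-ins, and the function name are hypothetical, and no real provider API is called.

```python
from typing import Callable, Sequence


class RateLimited(Exception):
    """Raised by a provider stand-in when its usage window is exhausted."""


Provider = Callable[[list[str]], str]


def complete_with_failover(
    context: list[str],
    providers: Sequence[tuple[str, Provider]],
) -> tuple[str, str]:
    """Try providers in order, passing the same context to each one."""
    for name, call in providers:
        try:
            return name, call(context)
        except RateLimited:
            continue  # keep the context untouched and try the next provider
    raise RuntimeError("all providers are rate limited")


# Hypothetical stand-ins for real clients.
def primary(context: list[str]) -> str:
    raise RateLimited("usage window used up")


def fallback(context: list[str]) -> str:
    return f"answered with {len(context)} context messages"


name, reply = complete_with_failover(
    ["system prompt", "user question"],
    [("primary", primary), ("fallback", fallback)],
)
print(name, "->", reply)
```

The context list is never mutated between attempts, so whichever provider answers sees the full conversation so far.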
Glia – Local-first shared memory layer (SQLite-vec + FTS5 + Offline Knowledge Graph)
Hey everyone, I wanted to share a project I've been working on called Glia. It is a 100% offline, local-first RAG and memory layer designed to connect your AI web chats (Claude, ChatGPT, DeepSeek) with your local developer tools (Claude Code, Cursor, Windsurf) using a unified local database.

I wanted something lightweight that did not require pulling heavy Docker containers or subscribing to third-party memory APIs. I settled on a Node.js + SQLite architecture running sqlite-vec (for 768-dim float32 embeddings) alongside SQLite FTS5 for hybrid search, powered completely by local Ollama instances.

We just launched a live website that outlines the details and demonstrates the features in action:

Website: https://glia-ai.vercel.app/
Codebase: https://github.com/Eshaan-Nair/Glia-AI

Technical Stack & Features:
- Hybrid Search Retrieval: SQLite-vec (using nomic-embed-text locally) + FTS5 keyword prefix matching (porter stemmer).
- Surgical Sentence-level Trimming: Chunks are sliced into sentences. When a prompt is intercepted, only the exact matching sentences are pulled out of the vector store instead of the whole paragraph. It cuts LLM prompt bloat by ~90-95% in my benchmarks.
- Knowledge Graph Extraction: An offline task queue uses a local LLM (llama3.1:8b via Ollama) to extract entity triples (subject-relation-object). These are stored in a SQLite facts table (or Neo4j if you run the full Docker compose profile) and fused with the vector retrieval score.
- HyDE (Hypothetical Document Embeddings): Queries are pre-processed to generate a hypothetical answer, which is embedded together with the original query to bridge semantic gaps.
- Concurrency: Running SQLite in WAL (Write-Ahead Logging) mode allows the browser extension dashboard and active MCP sessions to read/write concurrently without locking.
- PII Redaction: Aggressive scrubbing of JWTs, API keys, emails, and IPs in the extension before data is saved.

The extension works on Claude.ai, ChatGPT, DeepSeek, Gemini, Grok, and Mistral. The MCP server runs out of the same backend database for your terminal agent or Cursor.

You can set it up with a single command:

npx glia-ai-setup

Glia is completely open-source (MIT). If you like the local-first approach or want to contribute to the SQLite vector pipeline, PRs are very welcome, and a star on GitHub helps the project get discovered!

I would appreciate any feedback on the SQLite hybrid search scaling, the scoring fusion algorithm (RAG pipeline details are in RAG_PIPELINE.md), or local graph extraction performance.

submitted by /u/Better-Platypus-3420
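For readers unfamiliar with score fusion in hybrid search, one common way to combine a vector ranking and a keyword ranking is reciprocal rank fusion (RRF). The Python sketch below shows RRF only as an illustration. The post does not say which fusion method Glia uses (that is documented in its RAG_PIPELINE.md), and the sentence ids and the constant k=60 are assumptions.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several best-first rankings of ids into one ordering.

    Each id earns 1 / (k + rank) per list it appears in; higher totals win.
    Ties are broken alphabetically so the result is deterministic.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical sentence ids as returned by two separate retrievers.
vector_hits = ["s3", "s1", "s2"]   # e.g. nearest neighbours by embedding
keyword_hits = ["s1", "s4", "s3"]  # e.g. full-text keyword matches
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)  # ['s1', 's3', 's4', 's2']
```

RRF uses only rank positions, so it sidesteps the problem that cosine distances and keyword relevance scores live on different scales.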
We built a tool that installs frameworks like ComfyUI, Ollama, OpenWebUI etc on any cloud GPU in one command and saves your whole setup between sessions [R]
We kept running into the same problem: every time we rented a GPU to run Ollama + OpenWebUI or ComfyUI, we'd spend the first 45 minutes reinstalling everything. Custom nodes, models, configs, all of it. Docker images went stale fast, different providers had different base images, and nothing was truly portable. We got sick of it and built swm.

Here's what it does for ComfyUI users specifically:
- swm gpus -g a100 --max-price 2.00 --sort price: shows you the cheapest available GPU across RunPod, Vast.ai, Lambda, and 7 other providers in one view
- swm pod create: spins up an instance on whatever provider you pick
- swm setup install comfyui: installs ComfyUI on the pod

From there the main thing is the workspace sync. Your entire setup (custom nodes, models, outputs, configs) lives in S3-compatible object storage (I use B2). When you're done you run swm pod down and it pushes everything, kills the instance, and next time you spin up on any provider you just pull and everything is exactly where you left it. No more reinstalling 15 custom nodes and redownloading checkpoints every session.

We also built a lifecycle guard because we kept falling asleep mid-session and waking up to dumb bills. It watches GPU utilization and if nothing's happening for 30 minutes (configurable), it saves your workspace and terminates automatically. Has saved us more money than we want to admit lol.

A few other things:
- Background auto-sync daemon pushes changes every 60 seconds so you don't have to remember to save
- Tar mode for huge workspaces with tons of small files packs everything into one S3 object instead of 600k individual uploads
- Also supports vLLM, Ollama, Open WebUI, SwarmUI, and Axolotl if you do more than SD
- Works with Cursor, Claude Code, Codex, Windsurf if you want your AI agent to manage GPU instances for you

Free, open source, Apache 2.0.

pipx install swm-gpu

Site: https://swmgpu.com
GitHub: https://github.com/swm-gpu/swm

Would love feedback from anyone who rents GPUs. What's the most annoying part of your current workflow? We are also looking for contributors to the open source repo and suggestions on new frameworks/extensions to be included. Please share your thoughts.

submitted by /u/Tkpf18
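The lifecycle guard described in the post boils down to an idle check over recent GPU utilization samples. The Python sketch below shows one way such a check could work. It is not swm's implementation: the function name, the sample format, and the 5% "busy" threshold are assumptions, and only the 30-minute default comes from the post.

```python
def should_terminate(
    samples: list[tuple[float, float]],
    now: float,
    idle_seconds: float = 30 * 60,
    busy_threshold: float = 5.0,
) -> bool:
    """Decide whether a pod has been idle long enough to shut down.

    samples holds (timestamp_seconds, gpu_util_percent) pairs in any order.
    The pod counts as idle since the last sample above busy_threshold, or
    since the first sample if it was never busy.
    """
    if not samples:
        return False  # no data yet, so never terminate blindly
    busy_times = [t for t, util in samples if util > busy_threshold]
    idle_since = max(busy_times) if busy_times else min(t for t, _ in samples)
    return now - idle_since >= idle_seconds


# Hypothetical readings: busy at t=0, then near-zero utilization.
readings = [(0.0, 80.0), (600.0, 2.0), (1200.0, 1.0)]
print(should_terminate(readings, now=1700.0))  # False
print(should_terminate(readings, now=1800.0))  # True
```

In a real guard the workspace would be saved before the instance is terminated, as the post describes; this sketch covers only the decision itself.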
Repository Audit Available
Deep analysis of ollama/ollama — architecture, costs, security, dependencies & more
Yes, Ollama offers a free tier. Pricing found: $0, $20 / mo, $200/yr, $100 / mo
Ollama has an average rating of 5.0 out of 5 stars based on 1 review from G2, Capterra, and TrustRadius.
Key features include: Automate your work, Solve harder tasks, faster, For your most demanding work.
Ollama is commonly used for: Local deployment of open-source AI models, Cost-effective AI solutions for developers, Running multiple AI models simultaneously, Automating repetitive tasks with AI assistance, Integrating AI into software development workflows, Testing and validating AI models in real-time.
Ollama integrates with: NVIDIA Cloud Providers (NCPs), OpenClaw, Claude Code, Blackwell architecture, Vera Rubin architecture, GitHub for version control, Slack for team collaboration, Jupyter Notebooks for data analysis, Docker for containerization, Kubernetes for orchestration.
1 mention
Ollama has a public GitHub repository with 166,253 stars.
Based on user reviews and social mentions, the most common pain points are: API costs, llama, cost tracking, large language model.
Based on 71 social mentions analyzed, 14% of sentiment is positive, 82% neutral, and 4% negative.