Designing everyday AGI.
Users generally appreciate MultiOn for its versatility in facilitating multi-agent execution and its ability to handle structured work efficiently under governance rules. However, some users express concerns about potential conflicts or data overwriting when multiple agents engage simultaneously. The pricing sentiment is mixed, as some value the capabilities provided, while others find it challenging to justify the cost. Overall, MultiOn is seen as a robust tool with a good reputation among those needing structured AI management solutions, but it may require improvements in conflict resolution and cost transparency.
Mentions (30d): 102 (20 this week)
Reviews: 0
Platforms: 2
Sentiment: 1% (3 positive)
Industry: information technology & services
Employees: 47
Funding Stage: Seed
Total Funding: $20.0M
Anthropic just published how they contain Claude agents, including two security incidents they got wrong
Anthropic dropped a solid engineering post this week about containment across claude.ai, Claude Code, and Cowork. One of the more transparent writeups from a major AI lab about what actually broke.

The core insight: model-layer defenses are probabilistic and will always have a non-zero miss rate. So the real answer is hard environmental containment, not just safer models.

Three patterns they use:
- claude.ai: ephemeral gVisor containers, fully server-side
- Claude Code: OS-level sandbox with human-in-the-loop approvals (93% get approved anyway, so approval fatigue is real)
- Cowork: full local VM, credentials never enter the guest

Two incidents they disclosed:
- A red team phished an employee into running a prompt that exfiltrated AWS credentials. Succeeded 24 out of 25 times. The model had nothing to catch because the user was the one typing it. Only egress controls would have stopped it.
- A third-party found that Cowork’s egress allowlist passes traffic to api.anthropic.com. An attacker embedded an API key in a file in the user’s workspace, Claude followed hidden instructions, and uploaded files to the attacker’s Anthropic account. Sandbox worked perfectly and still leaked data.

Their lesson: an allowlist isn’t a destination filter, it’s a capability grant. Every function reachable through an allowed domain is an attack surface.

The section on persistent memory poisoning and multi-agent trust escalation at the end is worth reading too if you’re building anything agentic.
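To make the "allowlist is a capability grant" point concrete, here is a minimal Python sketch. It is not Anthropic's implementation, and credential binding is only one possible mitigation, assumed here for illustration. It contrasts a host-only check with a check that also pins which credential a request may carry. The `EgressRequest` type, both function names, and the key values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EgressRequest:
    """Hypothetical model of an outbound request leaving the sandbox."""
    host: str
    api_key: str  # credential attached to the request


ALLOWED_HOSTS = {"api.anthropic.com"}


def host_only_allows(req: EgressRequest) -> bool:
    """Destination filter: checks only where the traffic goes."""
    return req.host in ALLOWED_HOSTS


def credential_bound_allows(req: EgressRequest, session_key: str) -> bool:
    """Capability check (assumed mitigation): the host must be allowed AND
    the request must carry the session's own credential."""
    return req.host in ALLOWED_HOSTS and req.api_key == session_key


SESSION_KEY = "key-user-session"  # placeholder value
attacker_upload = EgressRequest("api.anthropic.com", "key-attacker")
normal_call = EgressRequest("api.anthropic.com", SESSION_KEY)

print(host_only_allows(attacker_upload))                      # passes the host check
print(credential_bound_allows(attacker_upload, SESSION_KEY))  # refused
print(credential_bound_allows(normal_call, SESSION_KEY))      # allowed
```

The host-only check cannot tell the two requests apart, which is the failure mode described in the second incident: the destination was legitimate, but the capability reached through it was not.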
Claude Code Source Deep Dive - Part VI: Multi-Agent System && Part VII: Context Compression (Compact) and Memory System
Reader’s Note

A source-map leak exposed 512,000 lines of Claude Code's TypeScript, giving us a rare look inside one of the world's most advanced AI coding agents. This series explores what I found. Estimated completion time: 2 days. Actual completion time: ∞. Anyway, here's the next chapter.

Claude Code Source Deep Dive - Part VI: Multi-Agent System

6.1 Built-in Agents

general-purpose (general)

You are an agent for Claude Code, Anthropic's official CLI for Claude. Given the user's message, you should use the tools available to complete the task. Complete the task fully; don't gold-plate, but don't leave it half-done. When you complete the task, respond with a concise report covering what was done and any key findings; the caller will relay this to the user, so it only needs the essentials.

Tools: all available
Model: inherit

Explore (code exploration)

You are a file search specialist for Claude Code. You excel at thoroughly navigating and exploring codebases.

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===
[Strictly prohibit any file modification]

Your strengths:
- Rapidly finding files using glob patterns
- Searching code and text with powerful regex patterns
- Reading and analyzing file contents

NOTE: You are meant to be a fast agent that returns output as quickly as possible. Make efficient use of tools and spawn multiple parallel tool calls.

Tools: read-only (Agent, FileEdit, FileWrite, NotebookEdit disabled)
Model: external → Haiku (fast), internal → inherit
omitClaudeMd: true

Plan (architecture planning)

You are a software architect and planning specialist for Claude Code. Your role is to explore the codebase and design implementation plans.

=== CRITICAL: READ-ONLY MODE - NO FILE MODIFICATIONS ===

## Your Process
1. Understand Requirements
2. Explore Thoroughly (read files, find patterns, understand architecture)
3. Design Solution (trade-offs, architectural decisions)
4. Detail the Plan (step-by-step strategy, dependencies, challenges)

## Required Output
End your response with:
### Critical Files for Implementation
List 3-5 files most critical for implementing this plan.

Tools: read-only
Model: inherit
omitClaudeMd: true

verification (verification)

You are a verification specialist. Your job is not to confirm the implementation works; it's to try to break it. You have two documented failure patterns. First, verification avoidance: when faced with a check, you find reasons not to run it. Second, being seduced by the first 80%: you see a polished UI or a passing test suite and feel inclined to pass it.

=== CRITICAL: DO NOT MODIFY THE PROJECT ===

=== VERIFICATION STRATEGY ===
Frontend: Start dev server → browser automation → curl subresources → tests
Backend: Start server → curl endpoints → verify response shapes → edge cases
CLI: Run with inputs → verify stdout/stderr/exit codes → test edge inputs
Bug fixes: Reproduce original bug → verify fix → run regression tests

=== RECOGNIZE YOUR OWN RATIONALIZATIONS ===
- "The code looks correct based on my reading": reading is not verification. Run it.
- "The implementer's tests already pass": the implementer is an LLM. Verify independently.
- "This is probably fine": probably is not verified. Run it.
- "I don't have a browser": did you check for browser automation tools?
- "This would take too long": not your call.

If you catch yourself writing an explanation instead of a command, stop. Run it.

=== OUTPUT FORMAT (REQUIRED) ===
### Check: [what you're verifying]
**Command run:** [exact command]
**Output observed:** [actual output: copy-paste, not paraphrased]
**Result: PASS** (or FAIL)
VERDICT: PASS / FAIL / PARTIAL

Tools: read-only (temp directory writable)
Model: inherit
Runs in background

claude-code-guide (usage guide)

- Helps users understand Claude Code/SDK/API usage
- Dynamic system prompt includes user custom skills, agents, MCP server info
- Fetches docs from official URLs

6.2 Sub-Agent Enhancement Prompt

Notes:
- Agent threads always have their cwd reset between bash calls, so please only use absolute file paths.
- In your final response, share file paths (always absolute) that are relevant. Include code snippets only when the exact text is load-bearing.
- For clear communication the assistant MUST avoid using emojis.
- Do not use a colon before tool calls.

6.3 Coordinator Mode

When enabled, the main agent becomes a scheduler:
- Coordinator role: guide workers for research/implement/verify
- Agent tool: creates async workers
- SendMessage tool: continue existing workers
- TaskStop tool: cancel workers
- Worker results arrive as XML

Workflow: Research → Synthesis → Implementation → Verification

6.4 Fork Sub-Agents

Fork inherits the full parent-agent context and shares prompt cache.

Build method:
- Copy parent message history
- Replace tool_result with byte-identical placeholder text (to keep cache keys consistent)
- Add per-child instruction text block

Advantages: very low
New to coding, what’s the workflow you recommend? This is mine…
I’m a non-developer founder building a SaaS product (web app, TypeScript/Next.js/Postgres stack) mostly through Claude. I have decent architectural intuition but I don’t write code by hand, so I lean heavily on Claude for implementation and on a docs-first process to keep things solid.

The workflow I’ve ended up with, over a few months:
- Claude Code does the actual implementation, one step at a time.
- I run a second Claude chat as an “orchestrator” that drafts the prompts/plans and reviews the code before it ships.
- I run a third Claude chat as a “cross-check reviewer” that independently verifies the diff against the plan before I commit.
- I’m the one who actually runs every git push, after both review layers sign off.

On top of that I keep architecture decision records (ADRs), a running project-state doc, and a “patterns” file where I write down recurring lessons (e.g. how to avoid a class of editing bug, when to bundle vs split commits).

It catches a lot of real issues before they ship. But it’s also slow: some days feel heavier on review ceremony and documentation than on actual code progress.

Questions for people who’ve built more than me:
1. Is multi-agent review (one model implements, others review) worth it, or is it overkill for a solo project?
2. How much process is right for a non-developer who wants solid code but also needs to actually ship?
3. What does your Claude-assisted workflow look like, and what would you cut from mine?

Genuinely open to “you’re overthinking this.” Trying to find the right balance. Thanks.

submitted by /u/sorinmx
[Open Source] I built a full Git MCP server in Go that doesn't just wrap bash. It uses tree-sitter, handles real plumbing (write-tree), and runs 100% locally.
I was tired of watching LLM agents fail at basic Git operations. Standard integrations pass raw text, hang on pagers, or scream because they can't parse unstructured git diff outputs.

git-courer is a full Model Context Protocol (MCP) server written in Go that treats Git properly. No bash spawning, no unstructured text to parse. Everything communicates via structured JSON.

Here is an actual commit message it generated completely locally:

fix: fix mcp server connection handling

WHY
The previous implementation lacked proper error handling for connection failures in the MCP server, leading to unhandled panics or silent failures when the local LLM backend was unreachable.

WHAT
* Added connection timeout logic to the local client calls.
* Implemented retry mechanisms with exponential backoff for transient backend errors.

The Architecture & Tool Pack

- Read Tools (status, diff, history, blame): Completely structured JSON and fully paginated. A single status call replaces over 5 standard Git commands for the agent.
- Write Tools (commit, merge, rebase, branch, stash, stage, sync...): Every single mutation auto-creates a backup before executing. If the LLM messes up, a RESTORE command brings you back exactly where you were.
- Safety Model: Destructive operations (hard resets, force pushes, branch deletions) require an explicit confirmed=true gate. The agent is forced to ask you first. dry_run=true is also available for peace of mind.

The Semantic Annotator (Why it's different)

Instead of just feeding raw code to the LLM, git-courer uses go-enry + go-tree-sitter to parse the AST and tag every hunk semantically before the LLM even sees it. It detects tags like NEW_FUNC, MOD_SIG, MOD_BODY, DELETED, and BREAKING_CHANGE. The commit type (feat, fix, refactor) is determined deterministically from these AST tags rather than guessed by the model.

The Commit Pipeline

- Atomic Commits: One staged area = one commit. It actively prevents the agent from creating giant, messy multi-feature commits.
- In-Memory Previews: The PREVIEW tool uses write-tree to snapshot the staging area into a job_id. The working tree is never touched during the preview stage. APPLY then uses commit-tree + update-ref to seal the deal cleanly.

Client & Backend Support

- 13 Clients Configured Automatically: Runs out of the box with git-courer mcp setup for Claude Code, Cursor, Windsurf, OpenCode, Cline, Roo Code, VS Code, Zed, Claude Desktop, Continue, and more.
- 100% Local-First: Works with any backend exposing an OpenAI-compatible /v1 API (Ollama, LM Studio, llama.cpp).

The project is fully open source. I’d love to hear your thoughts on the architecture, the plumbing pipeline, or any features you'd like to see added!

Repo: github.com/Alejandro-M-P/git-courer

submitted by /u/blakok14
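The two ideas in this post that lend themselves to a sketch are the deterministic commit type and the confirmed=true gate. The Python below is an illustration only, not git-courer's Go code. The tag names, `confirmed`, and `dry_run` come from the post; the tag-to-type priority order, the `chore` fallback, the operation names, and both function names are assumptions. The trailing `!` for breaking changes follows the Conventional Commits convention.

```python
from __future__ import annotations


def commit_type(tags: set[str]) -> str:
    """Derive a commit type from semantic hunk tags without asking a model.
    The priority order here is an assumption for illustration."""
    if "NEW_FUNC" in tags:
        kind = "feat"
    elif "MOD_SIG" in tags or "DELETED" in tags:
        kind = "refactor"
    elif "MOD_BODY" in tags:
        kind = "fix"
    else:
        kind = "chore"  # hypothetical fallback, not named in the post
    # Conventional Commits marks breaking changes with a trailing "!".
    return kind + "!" if "BREAKING_CHANGE" in tags else kind


# Hypothetical operation names standing in for hard resets, force pushes,
# and branch deletions.
DESTRUCTIVE = {"hard_reset", "force_push", "branch_delete"}


def run_operation(op: str, confirmed: bool = False, dry_run: bool = False) -> str:
    """Refuse destructive operations unless the caller explicitly confirms."""
    if dry_run:
        return f"dry-run: would execute {op}"
    if op in DESTRUCTIVE and not confirmed:
        return f"refused: {op} requires confirmed=true"
    return f"executed: {op}"


print(commit_type({"NEW_FUNC", "MOD_BODY"}))
print(run_operation("force_push"))
print(run_operation("force_push", confirmed=True))
```

Because the type is a pure function of the tags, the same diff always yields the same commit type, which is the property the post contrasts with a model guessing.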
Is AI Worth the Cost? The ROI Reckoning and the Coming Market Correction
Prof G Markets (Live) Episode Title: Is AI Worth the Cost? The ROI Reckoning and the Coming Market Correction Location: The Castro Theatre, San Francisco, CA Hosts: Scott Galloway & Ed Nelson ED: We're going to talk about a topic not enough people talk about called AI. Nearly 50,000 workers have been laid off this year supposedly because of AI — that's almost as many as in all of 2025. For companies adopting AI, the thesis is simple: AI is supposed to do much of the work that humans do. In recent weeks, however, that thesis has hit a roadblock. More and more companies are reporting that despite the enormous power of AI, the technology is actually more expensive than the humans it is supposed to replace. Uber, for example, just blew through its entire 2026 AI budget in just four months. According to the COO, it is now getting harder to justify AI costs within the company. Microsoft is cancelling its Claude Code licenses across multiple divisions because it's simply gotten too expensive. And over at Nvidia, one executive said that the cost of compute is now "far beyond the cost of employees." Which all raises a crucial question for the AI industry: at what point does AI actually stop being worth it? This has blown up basically in the last 48 hours, with many companies coming out and saying they're not as confident about this whole AI thing as they used to be. ServiceNow is another company that just blew through their entire Anthropic budget. Technical staff at Stripe are reportedly spending nearly $100,000 on AI tokens every day. Salesforce is on track to spend $300 million on Anthropic tokens this year. Shopify said their earnings were "partially offset by increased LLM costs." We heard similar things from Meta, Spotify, and Pinterest. One Anthropic employee said his Claude Code bill came out to $150,000 in a single month. In some cases, it's getting very, very expensive. We've also seen an incentive — especially among tech companies — to use AI as much as possible. 
There was this idea that employees would engage in what we call "token maxing," where you use as many tokens as possible from your AI API. Companies like Meta and Amazon have even created internal leaderboards tracking how many AI tokens employees are using. The people using the most tokens are seen as the most AI-forward, the most AI-deployed — the ones who are going to get recognized, maybe even promoted. And this has resulted in extraordinary costs on the AI front. Now we're starting to see the next phase of this, Scott, where companies and their executives are beginning to realize: this is a little expensive. So the question becomes — at what point will AI actually pay off? I'll pose that question to you: at what point is it too much? SCOTT: I think we're already seeing hints of it, and I think it comes down to incentives. You were talking about how companies are trying to incentivize people to use AI more — and that's kind of an interesting part of the ecosystem right now. The adoption layer is trying to get people to use it, and companies have put in place the incentives to do that. But there was a recent survey by a professor at MIT who found that about 5% of the projects people are using tokens for can actually be connected by CFOs to some sort of return. So while I think they're really intoxicated by it — and talking about AI as much as you can in your earnings call is like adding "dot-com" back in the '90s — I think you're already starting to see some fatigue. And I think the AI companies are trying to get public as quickly as possible to raise that cheap capital before things start to — I don't want to say unwind, but... You can see how the string gets pulled here. A large company, a CEO who has a lot of credibility in the industry, just comes out and says: "We're dramatically scaling back our AI investment. Let's be honest, folks — we're just not seeing the return we'd initially hoped." And then Nvidia reports its first miss. 
Nvidia has beaten its estimates 15 quarters in a row. Nvidia's first miss probably takes the entire market down five or ten percent. You are seeing some productivity gains from this and quite frankly, they look as dramatic, if not more dramatic, than the internet. But look what happened in 2000. This definitely does feel like '99. And I'm waiting for the first CEO to come out and say we have to get procurement involved and dramatically scale back our expenses. I don't think it's that romantic, honestly. I think it's just going to be a traditional Fortune 500 company that starts the narrative: okay, this has been fun, but we have to dramatically decrease our AI investment because we're not seeing the ROI we'd anticipated. ED: Yeah. I mean, we heard a quote this week from the CEO of Match Group — not a huge company — but he said AI is costing them $5 to $10 million a year, and his exact words were: "I think we're benefiting from it, but it's hard to feel." So that's not great if we're supposed
Cave Prompt: Making AI understand your requirements better
[Showcase] Cave Prompt — A Semantic Prompt Compiler for Claude Code 👉 Check out the repo here: Link Have you ever written a detailed request, sent it to an AI, and gotten an answer that was technically correct but completely missed the point? The AI isn't the problem—it's the "noise" in your prompt. Key constraints get buried at the end, or the core intent gets lost in conversational filler. Cave Prompt is a compiler skill that runs before your AI processes your request. It extracts your true intent, surfaces hidden requirements, resolves conflicting constraints, and restructures everything into a high-density execution prompt—so the AI works on what you actually need, not just what you literally said. Key Advantages: Attention front-loading: Critical constraints go first, where the model weighs them most heavily. Hidden requirement extraction: Finds what you didn't explicitly say but genuinely need. Constraint conflict resolution: Catches contradictions before the AI goes in the wrong direction. Vague → specific: Transforms fuzzy ideas (e.g., "track my finances") into structured specs (e.g., "a 3-sheet Google Sheets dashboard with SKU-level margin tracking"). Who is this for? Non-technical users: Those who describe things conversationally and aren't sure how to structure a prompt. Product managers & business owners: Anyone who knows what they want but struggles to translate it into precise AI instructions. High-stakes tasks: Anyone where a misread from the AI would cost real time or money. Teams: For standardizing prompt quality across members with different communication styles. When to use it: Use it for long, multi-constraint requests where clarity matters. Skip it for simple, single-intent prompts—the overhead isn't worth it there. This is my first skill build, so there may be rough edges—I truly appreciate your patience and any feedback you might have! As a developer, I’m putting a lot of heart into this project. 
A ⭐ on the repo would be a huge boost for my work and personal growth; it really motivates me to keep building and improving. If you find the idea useful, I’d be incredibly grateful for the support. Thanks for reading and for helping me grow! 🙏

submitted by /u/hieudeptrai1962000
Half a day on Opus 4.8 and the biggest change is what it stopped doing
I am not someone who treats every release as either a miracle or a downgrade. Most updates land in the boring middle for me. But after running 4.8 for most of today there is one specific thing that 4.7 did constantly and now mostly doesn't.

4.7 would second-guess itself mid-reasoning. You could watch the thinking go "actually, looking at this again" then "wait, I should reconsider" three times before it committed to anything. On longer tasks that wasn't just annoying, it burned tokens and sometimes talked itself out of a correct answer it already had. 4.8 still reconsiders but it tends to do it once and move on. It feels like it trusts its first pass more.

The other thing I noticed is it is more willing to say when it is unsure instead of confidently guessing and making me find out later. For anything agentic that matters way more to me than another benchmark point.

For context I run most of my longer planning and review passes through Verdent, which is still on 4.7, so I have had both sitting side by side all day. The gap is real, not placebo, and it shows up most on the multi-step stuff where 4.7 used to wander.

Still early. Might change my mind by tomorrow. But the less neurotic thinking alone makes the long sessions feel different.

submitted by /u/Ok-Line2658
Weekly AI roundup (May 23–30, 2026): Claude Opus 4.8 Fast Mode 3x cheaper, Qwen 3.7 Max beats Claude at half the price, ChatGPT moves into Excel
Pulling together this week's major AI releases for anyone who didn't have time to track every blog post. Sticking to substantive changes, not hype.

Anthropic: Claude Opus 4.8
Released this week. Headline pricing unchanged, but Fast Mode dropped from $30 input / $150 output per million tokens to $10 / $50, a 3x reduction on the premium tier. Reported improvements in "judgment" and longer autonomous runs. Also shipped 20+ legal MCP connectors and Microsoft 365 add-ins (Excel, PowerPoint, Word) in GA.

Alibaba: Qwen 3.7 Max
Launched May 20 at Alibaba Cloud Summit. 1M-token context. Reported to top Claude Opus 4.6 Max on Terminal-Bench 2.0, SWE-Bench Pro, and MCP-Atlas. Pricing $2.50 / $7.50 per million tokens, roughly half of Opus 4.7. Alibaba claims autonomous operation up to 35 hours without performance degradation. Alibaba is now ranked the #6 lab globally on the Arena text leaderboard.

OpenAI: GPT-5.5 Instant
Now default in ChatGPT. Reports 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts (medicine, law, finance). OpenAI also shipped a ChatGPT sidebar inside Excel and Google Sheets, plus a personal finance dashboard for Pro users (US only).

Google: Gemini 3.5 Flash
Reported to beat Gemini 3.1 Pro on coding and agentic benchmarks at ~4x faster output token rate. Ultra subscription cut from $250 to $200/month; new $100/month Developer tier introduced.

xAI: Grok Build 0.1
Coding agent moved to public API beta May 28. Custom Skills feature added for reusable user-defined tasks. Connectors for SharePoint, OneDrive, Notion, GitHub, Linear, plus bring-your-own MCP support.

Mistral
Launched Vibe (unified work + code agent, replaces Le Chat). Acquired Emmi AI for physics-based simulation. Targeting €1B revenue in 2026; new 10MW inference DC announced.

Hugging Face
Launched an app store for the Reachy Mini robot. ~10,000 units shipped. Also reported a malicious repo masquerading as an OpenAI release that accumulated 244K downloads before takedown, which is relevant for anyone pinning models from HF in production.

My take as someone building on top of these APIs:

The 3x Opus Fast Mode price cut and Qwen 3.7 Max's pricing + autonomous duration are the real signal this week. The cost floor on premium-tier inference is dropping faster than most app-layer products have repriced for. Anyone running multi-step agent workflows needs to recompute unit economics this week: either pass through the savings or reinvest the margin.

The other pattern worth noting: OpenAI and Anthropic are both pushing into Excel/M365 surfaces. Distribution is becoming the next battleground, not raw model capability. If you're building a productivity SaaS, the giants are now inside the same surface as you.

submitted by /u/ksraj1001
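For anyone actually recomputing unit economics, the arithmetic is simple enough to sketch. The Python below uses only the per-million-token prices quoted in the roundup; the workload size (2M input tokens, 0.4M output tokens per run), the dictionary keys, and the function name are assumptions chosen for illustration.

```python
# Prices in USD per million tokens (input, output), as quoted in the roundup.
PRICES = {
    "opus_fast_old": (30.00, 150.00),
    "opus_fast_new": (10.00, 50.00),
    "qwen_3_7_max": (2.50, 7.50),
}


def run_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of one run at the listed per-million-token prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# Hypothetical agent run: 2M input tokens and 0.4M output tokens.
old = run_cost("opus_fast_old", 2_000_000, 400_000)  # 60 + 60 = 120.0
new = run_cost("opus_fast_new", 2_000_000, 400_000)  # 20 + 20 = 40.0
qwen = run_cost("qwen_3_7_max", 2_000_000, 400_000)  # 5 + 3 = 8.0

print(old, new, qwen, old / new)
```

Because both the input and output prices fell by the same factor, the 3x reduction holds for any input/output mix; the ratio against Qwen, by contrast, depends on how output-heavy the workload is.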
We wrote an open-source interactive playbook for Agentic DevOps (How to move multi-agent systems from local notebooks to production).
Hey everyone,

If you’ve built a multi-agent system, you already know the painful truth: wiring nodes together locally is fun, but deploying them is an absolute infrastructure nightmare.

When a standard app fails, it throws a 500 error. When an autonomous swarm fails, it can get stuck in a ReAct loop, hallucinate an answer, and quietly burn through your API budget without triggering a single traditional alert. Standard DevOps practices don't natively map to stochastic AI outputs.

We just published a massive, no-fluff playbook on the AgentSwarms blog detailing exactly how to build an Agentic DevOps pipeline using entirely open-source tooling.

Here is what we cover in the playbook:
- Observability & Tracing: Why standard logging fails, and how to implement open-source tracing to capture the state, prompt, token count, and latency at every single node handoff.
- Test-Driven Prompt Evals (CI/CD): You can't just change a system prompt based on "vibes" and push it to main. We break down how to run matrix evaluations against historical user inputs before deployment to catch regressions instantly.
- Deterministic Guardrails: How to implement middleware that scrubs PII and blocks destructive code execution before the LLM even sees the state.
- Cost Control & Routing: How to prevent vendor lock-in and implement dynamic routing to keep token economics from destroying your cloud budget.

If you are currently wrestling with the deployment phase of your AI projects, I highly recommend giving this a read. It focuses entirely on open-source solutions so you don't have to sign a massive enterprise contract just to get visibility into your swarms.

Would love to hear what open-source tools you guys are currently slotting into your LLMOps pipelines!

Link: https://agentswarms.fyi/blog/devops-for-agentic-ai-open-source-playbook

submitted by /u/Outside-Risk-8912
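As a rough idea of what the guardrail and tracing items could look like together, here is a minimal Python sketch. It is not taken from the playbook: the regexes are deliberately simple (emails and US-style SSNs only), the token count is a crude whitespace split rather than a real tokenizer, and `scrub_pii`, `traced_node`, and the trace fields are hypothetical names.

```python
import re
import time

# Deliberately simple patterns; real PII detection needs far more coverage.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def scrub_pii(text: str) -> str:
    """Deterministic guardrail: mask PII before any model sees the state."""
    text = EMAIL.sub("[EMAIL]", text)
    return SSN.sub("[SSN]", text)


def traced_node(name, fn, state: str, trace: list) -> str:
    """Run one node on scrubbed state and record a trace entry for the handoff."""
    clean = scrub_pii(state)
    start = time.perf_counter()
    result = fn(clean)
    trace.append({
        "node": name,
        "input": clean,
        "approx_tokens": len(clean.split()),  # whitespace count, not a tokenizer
        "latency_s": time.perf_counter() - start,
    })
    return result


trace: list = []
out = traced_node("echo", str.upper, "hi bob@example.com", trace)
print(out)
print(trace[0]["node"], trace[0]["approx_tokens"])
```

Scrubbing before the node function runs means the raw value never reaches the model or the trace store, so the trace itself is safe to retain.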
Claude Code Source Deep Dive (Part 5) — Literal Translation & Tool-Call Loop Self-Repair Core Mechanism
Reader’s Note

On March 31, 2026, the Claude Code package Anthropic published to npm accidentally included .map files that can be reverse-engineered to recover source code. Because the source maps pointed to the original TypeScript sources, these 512,000 lines of TypeScript finally put everything on the table: how a top-tier AI coding agent organizes context, calls tools, manages multiple agents, and even hides easter eggs. I read the source from the entrypoint all the way through prompts, the task system, the tool layer, and hidden features. I will continue to deconstruct the codebase and provide in-depth analysis of the engineering architecture behind Claude Code.

3.14 EnterWorktree Tool (Enter Worktree)

Create isolated git worktree and switch current session into it.

When to Use:
- User explicitly says "worktree"

When NOT to Use:
- User asks to create/switch branches
- User asks to fix bug or work on feature without mentioning worktrees
- NEVER use unless user explicitly mentions "worktree"

Behavior:
- Creates new git worktree inside `.claude/worktrees/` with new branch
- Switches session's working directory to new worktree

3.15 AskUserQuestion Tool (Ask User Question)

Ask user multiple choice questions to gather info, clarify ambiguity, understand preferences, make decisions, offer choices.

Usage Notes:
- Users always able to select "Other" for custom text input
- Use multiSelect: true to allow multiple answers
- If recommend specific option, make first option with "(Recommended)" at end

Preview Feature:
- Use optional `preview` field on options when presenting concrete artifacts needing visual comparison (ASCII/HTML mockups, code snippets, diagrams)
- Preview content rendered as monospace markdown
- When any option has preview, UI switches to side-by-side layout

3.16 LSP Tool (Language Server)

Interact with Language Server Protocol servers for code intelligence.
Supported Operations:
- goToDefinition, findReferences, hover, documentSymbol, workspaceSymbol, goToImplementation, prepareCallHierarchy, incomingCalls, outgoingCalls

All Operations Require:
- filePath, line (1-based), character (1-based)

3.17 Sleep Tool (Wait)

Wait for specified duration.

Usage:
- When user tells to sleep/rest
- When nothing to do / waiting for something
- May receive periodic check-ins (tick tags)
- Can call concurrently with other tools
- Prefer over `Bash(sleep ...)` (doesn't hold shell process)
- Each wake-up costs API call
- Prompt cache expires after 5 min inactivity

3.18 CronCreate Tool (Scheduled Task)

Schedule prompts to run at future times. Uses standard 5-field cron in user's local timezone.

One-Shot Tasks (recurring: false):
- "remind me at X" → pin minute/hour/day to specific values

Recurring Jobs (recurring: true, default):
- "every 5 min" → "*/5 * * * *"
- "hourly" → "0 * * * *"

CRITICAL: Avoid :00 and :30 Minute Marks (when task allows)
- Every user asking "9am" gets 0 9, causing thundering herd
- When approximate: pick minute NOT 0 or 30
- "every morning around 9" → "57 8 * * *" (not "0 9 * * *")

Durability:
- Default (durable: false): lives only in Claude session
- durable: true: writes to .claude/scheduled_tasks.json

Recurring tasks auto-expire after 7 days.

3.19 TeamCreate Tool (Create Team)

Create team to coordinate multiple agents working on project.

When to Use (Proactively):
- User explicitly asks to use team, swarm, or group agents
- Task complex enough for parallel work

Team Workflow:
1. Create team with TeamCreate
2. Create tasks using Task tools
3. Spawn teammates using Agent tool with team_name + name params
4. Assign tasks using TaskUpdate with owner
5. Teammates work on assigned tasks
6. Shutdown gracefully via SendMessage with shutdown_request

IMPORTANT: Always refer to teammates by NAME. Plain text output NOT visible to other agents; MUST call SendMessage tool to communicate.
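Looking back at the CronCreate guidance in 3.18, the "avoid :00 and :30" rule can be sketched as a small helper. This is an illustration of the idea, not code from the leaked source: the function name, the ±10 minute window, and the use of a seeded `random.Random` are assumptions.

```python
import random


def approximate_daily_cron(hour: int, rng: random.Random) -> str:
    """Return a 5-field daily cron expression near `hour`, never on :00 or :30.

    The offset is drawn from -10..10 minutes excluding 0, so the resulting
    minute is always in 50..59 or 1..10.
    """
    offset = rng.choice([m for m in range(-10, 11) if m != 0])
    total = (hour * 60 + offset) % (24 * 60)  # wrap around midnight
    h, m = divmod(total, 60)
    return f"{m} {h} * * *"


rng = random.Random(0)  # seeded only to make the sketch repeatable
print(approximate_daily_cron(9, rng))
```

The post's own example, "57 8 * * *" for "every morning around 9", corresponds to an offset of -3 minutes in this scheme. Spreading requests across nearby minutes is what avoids every "9am" job firing at the same instant.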
3.20 ToolSearch Tool (Deferred Tool Search)

Fetch full schema definitions for deferred tools so they can be called.

Query Forms:
- "select:Read,Edit,Grep": fetch exact tools by name
- "notebook jupyter": keyword search, up to max_results best matches
- "+slack send": require "slack" in name, rank by remaining terms

submitted by /u/Ill-Leopard-6559
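As a rough illustration of how the three ToolSearch query forms in 3.20 differ, here is a small Python classifier. It is a sketch under assumptions: the `ParsedQuery` type and the `parse_tool_query` function are hypothetical names invented for this example and do not reflect how Claude Code actually parses these queries.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedQuery:
    """Hypothetical parsed form of a ToolSearch query (illustrative only)."""
    mode: str                                           # "select" or "keyword"
    names: List[str] = field(default_factory=list)      # exact names (select mode)
    required: List[str] = field(default_factory=list)   # "+term": must be in tool name
    terms: List[str] = field(default_factory=list)      # remaining ranking terms


def parse_tool_query(query: str) -> ParsedQuery:
    """Classify a query string into one of the three documented forms."""
    query = query.strip()
    prefix = "select:"
    if query.startswith(prefix):
        # "select:Read,Edit,Grep" -> exact tool names, comma separated.
        names = [n.strip() for n in query[len(prefix):].split(",") if n.strip()]
        return ParsedQuery(mode="select", names=names)
    required: List[str] = []
    terms: List[str] = []
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            required.append(token[1:])   # "+slack" -> required name term
        else:
            terms.append(token)          # plain keyword used for ranking
    return ParsedQuery(mode="keyword", required=required, terms=terms)
```

For example, `parse_tool_query("+slack send")` separates the required name term `slack` from the ranking term `send`.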
Opus 4.8 dropped a couple days ago: early impressions after actually using it
so it's only been out since the 28th and I know it's way too early for a Real Review but I've been hammering on it pretty hard the last two days and figured I'd share before the sub fills up with benchmark screenshots.

first thing I noticed: it stopped over-explaining. older versions would hand me a 6 paragraph essay when I asked a yes/no question. this one mostly just answers and only goes deep when it actually makes sense. small thing but it changes the whole feel.

I do a lot of coding and honestly the part I'm most impressed by so far is the context handling. dumped a messy multi-file project in and it kept track of stuff instead of forgetting what we talked about 20 messages ago. need more time to see if that holds up on really long sessions but early signs are good.

caveats since it's day 2 and I'm not gonna pretend otherwise:
- still catches itself being confidently wrong sometimes, you gotta verify
- haven't pushed it hard enough to know where it actually breaks yet
- could totally be honeymoon phase, ask me in two weeks lol

vibe vs 4.7 is that it feels less like it's trying to impress you and more like it's trying to be useful. hard to describe until you've used both.

not a shill, I pay for it like everyone else. just wanted an actual usage report out there instead of pure hype on launch week.

anyone else been using it? curious if your experience lines up or if I'm just in the early-adopter glow

submitted by /u/EvolvinAI29
the fishbowl: visual ai focus groups made with (and powered by) claude
hey y'all, this is a small experiment i built with claude over the last few months i just made public

essentially, it's a visual representation of a multi-agent convo about *something* roughly based on focus groups

back end is running on the API but also did all the building via CC

it was super fun for me to work on and i've found it actually really useful and would love any feedback here from the community

you can find it here: https://fishbowl.show/

more on why i made it: https://fishbowl.show/about

submitted by /u/gavinpurcell
Ai Benchmarks are useless
I'm done with the launch cycle. Every new model drops with the same flashy report, bar charts all over the place, hitting 92% on MMLU-Pro, 94% on GPQA, or whatever coding benchmark they're pushing this week. Then you plug it into a real workflow through the API, or try to run it on an actual multi-step project that's not some tidy puzzle, and it feels like a step back from what we had a year ago. This is Goodhart’s Law playing out completely. The labs tuned everything for the tests, and now we've got these fragile models that break down in production.

The benchmarks themselves are mostly cooked at this point. The ones they still brag about are saturated or contaminated. Classic MMLU and HumanEval don't tell you much anymore for frontier models. Scores are all bunched up in the high 80s to low 90s, so a couple points difference is basically noise. It doesn't mean one is actually smarter. On top of that, these tests have been public forever. Training data and synthetic stuff pick them up, so the model isn't really reasoning through new problems. It's pattern matching from stuff it saw during training. Move to fresher setups like LiveBench or real agent workflows and the numbers drop hard.

They also gloss over the harness they use for those record scores. Heavy scaffolding, multi-shot prompts tuned exactly to the eval, extra compute with internal loops and all that. In real work you just send normal prompts. Take that away and the performance evaporates. Suddenly it can't hold basic JSON output without babying it. Tweak a few words in the prompt and your results swing 10-20 points.

What actually feels worse day to day is stuff like this: the big context windows sound great on paper but retrieval in the middle is weak, it drops instructions a few turns in, or fails to pull details across documents properly. On coding, it might patch one isolated GitHub issue okay, but drop it in a real messy codebase and it starts making up library methods that don't exist, quits halfway, or leaves TODO placeholders where the actual logic needs to go. Reasoning turns into these long pedantic loops even for straightforward tasks instead of just getting it done. And the safety layer is twitchy enough that normal business words like execute or termination make it refuse to touch a spreadsheet.

We're way past the point where a higher benchmark score means a better daily tool. The incentives push models to ace closed tests while making them less flexible, more wordy, and annoying to integrate. Until things shift to fresh dynamic evals and real human preference in messy conditions, most of these announcements are marketing wins more than anything else.

submitted by /u/Significant-Care-135
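To put a rough number on the claim that a couple of points is basically noise: if each benchmark question is treated as an independent pass/fail trial, an accuracy score has a binomial standard error. The Python sketch below is a simplification under stated assumptions: the 200-question test size is hypothetical, the function names are made up for the example, and the two models are treated as independent samples even though real comparisons share the same questions.

```python
import math


def accuracy_std_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def diff_is_within_noise(acc_a: float, acc_b: float, n_questions: int,
                         z: float = 1.96) -> bool:
    """True if the gap between two scores is inside a ~95% noise band.

    Assumes both scores come from the same number of questions and that
    their sampling errors are independent (a simplifying assumption).
    """
    se_diff = math.sqrt(
        accuracy_std_error(acc_a, n_questions) ** 2
        + accuracy_std_error(acc_b, n_questions) ** 2
    )
    return abs(acc_a - acc_b) <= z * se_diff


# Hypothetical 200-question benchmark: is 92% vs 94% distinguishable?
print(diff_is_within_noise(0.92, 0.94, 200))
# Same gap on a hypothetical 10,000-question benchmark.
print(diff_is_within_noise(0.92, 0.94, 10_000))
```

Under these assumptions the 2-point gap sits inside the noise band on the small test set and outside it on the large one, so how much a score difference means depends heavily on how many questions the benchmark has.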
Experimenting with a 4-Agent Local Dev Team (Claude Code). Hitting IPC & token walls managing shared folders vs. private repos. How do you handle communication?
Hey r/ClaudeAI,

Coming from a traditional backend architecture background and recently transitioning into full-time indie hacking, I wanted to push the limits of local automation. I’m currently running a localized multi-agent experiment using Claude Code to build a complete project. It's fascinating, but I've hit some frustrating bottlenecks.

Following the general consensus to keep agents single-minded rather than using one massive monolithic prompt, I’ve spun up four separate Claude Code instances on my machine. Crucially, each agent operates within its own conceptually isolated workspace (its own local code repository):

[Figure: architecture diagram of AI agents coordinating through a shared communications folder. The PM agent assigns tasks, while the specialised development agents (QA, Backend, Frontend) monitor the folder for updates, contributing code to their repositories and status to the central folder.]

- PM / CEO Agent (Guiding the project, task division, and strategy)
- Frontend Engineer (Operates in the FE repo)
- Backend Engineer (Operates in the BE repo)
- QA Engineer (Operates in the QA repo)

My Current "Hack" for Inter-Agent Communication (IPC):

To get them to coordinate, I have all four agents running the monitor command on a single, separate /communications directory. Here is the workflow:

1. The PM writes a markdown file (a task assignment) into the /communications folder.
2. The Frontend Agent's monitor picks up the file change and reads the task.
3. The Frontend Agent then switches focus to its own isolated workspace (the FE Repo) to actually write the code.
4. Once finished, the Frontend Agent writes a status report markdown file back into the shared /communications folder for the PM or QA to pick up.
The Pain Points:

While it feels like magic when it works, managing the flow between the shared communication hub and the individual workspaces is currently a mess:

- Message Missing / Race Conditions: An agent's monitor frequently misses a file update, or they "talk over" each other, causing the entire workflow to stall.
- Coordination Overload & Token Hemorrhage: Agents burn a massive amount of tokens just monitoring the shared folder for changes. When they do find a task, the constant context-shifting (reading the shared communications folder, jumping into their own local repos to write code, and jumping back to write a status report) causes token consumption to go absolutely astronomical.

My Questions for the Community:

- Architecture: For those who have tried this local setup vs. Claude Code’s official "Teams" mode, what are the fundamental differences in underlying logic? Is "Teams" natively better at coordinating between a shared context and isolated code repos? Or is it just doing the exact same file-watching hack under the hood?
- Coordination Protocols: Does anyone have a more elegant, stable solution for inter-agent coordination? Are you using local webhooks, socket connections, or specific file-handling patterns to reduce token waste and prevent dropped messages (especially when agents need to maintain their own separate codebases)?

Would love to hear your thoughts or see your local multi-agent setups! Attached a quick diagram of my current messy architecture below.

submitted by /u/Ok_Competition_2497
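One file-handling pattern that targets the missed-update and "talking over each other" problems is a maildir-style queue: write each message under a temporary name, rename it into the recipient's inbox in one atomic step, and have the recipient claim a message by renaming it before reading. The Python sketch below is a minimal illustration under assumptions: the directory layout (`tmp/`, `inbox/<agent>/`, `claimed/<agent>/`) and the function names are hypothetical, and it relies on rename being atomic within a single filesystem. It is not a description of how Claude Code's Teams mode works.

```python
import json
import os
import time
import uuid
from pathlib import Path
from typing import Optional


def send_message(root: Path, recipient: str, payload: dict) -> Path:
    """Publish a message so readers never observe a half-written file."""
    tmp_dir = root / "tmp"
    inbox = root / "inbox" / recipient
    tmp_dir.mkdir(parents=True, exist_ok=True)
    inbox.mkdir(parents=True, exist_ok=True)
    # Timestamp prefix gives rough send order; uuid keeps names unique.
    name = f"{time.time_ns():020d}-{uuid.uuid4().hex}.json"
    tmp_path = tmp_dir / name
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(payload, f)
        f.flush()
        os.fsync(f.fileno())
    final_path = inbox / name
    os.replace(tmp_path, final_path)  # atomic move into the inbox
    return final_path


def claim_next(root: Path, recipient: str) -> Optional[dict]:
    """Claim and return the oldest pending message, or None if there is none."""
    inbox = root / "inbox" / recipient
    claimed = root / "claimed" / recipient
    claimed.mkdir(parents=True, exist_ok=True)
    if not inbox.is_dir():
        return None
    for path in sorted(inbox.iterdir()):
        target = claimed / path.name
        try:
            os.rename(path, target)  # only one claimer can win this rename
        except FileNotFoundError:
            continue  # another process claimed it first; try the next file
        with open(target, encoding="utf-8") as f:
            return json.load(f)
    return None
```

Because every message is its own file and is moved rather than overwritten, two writers cannot clobber each other, and a message is handled by at most one claimer. A plain polling script (no model calls) could run `claim_next` and wake an agent only when something is pending, which may help with the token cost of idle monitoring.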
Step 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting
Read this release today. Some crazy numbers.

The tau2-bench number is 98% across all difficulty levels. That is the one that got me because usually these releases post a strong easy score and then quietly die at hard difficulty. This one... claims it holds. For multi-step agent work that actually matters more than most benchmarks. A model that drifts on step 4 of a 6 step chain is a debugging nightmare regardless of what its SWE score looks like.

Raw capability is mid, Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play. Depending on your use case that is either fine or a dealbreaker.

- 198B sparse MoE
- 11B active
- 400 TPS
- 256K context
- Apache 2.0
- runs locally on M4 Max and DGX Spark

Has anyone actually put this through agent evals or am I just reading the release card?

submitted by /u/Skid_gates_99
MultiOn uses a tiered pricing model. Visit their website for current pricing details.
Key features include: Our Investors and Partners, Recent News, The World’s Most Capable Mobile Agent, Media Features, Careers, AI Product Engineer, AI Researcher, Backend Engineer.
MultiOn is commonly used for: Personalized virtual assistants for daily task management, Automated customer support agents for businesses, AI-driven content creation tools for marketers, Intelligent scheduling assistants for professionals, Real-time language translation during conversations, Smart home management systems integrating various devices.
MultiOn integrates with: Slack for team collaboration, Google Calendar for scheduling, Zapier for workflow automation, Salesforce for customer relationship management, Shopify for e-commerce solutions, Zoom for video conferencing, Trello for project management, Microsoft Teams for workplace communication, Mailchimp for email marketing, Notion for note-taking and organization.
Based on user reviews and social mentions, the most common pain points are: token usage, API bill, LLM costs, API costs.
Based on 229 social mentions analyzed, 1% of sentiment is positive, 99% neutral, and 0% negative.