SWE-agent is recognized for its efficient AI-driven capabilities in coding assistance and code review, noted on Reddit for developments in enhancing domain-specific tasks. However, it faces competition from models like MiMo, which are perceived as offering better value due to lower costs. While precise pricing details for SWE-agent aren't heavily discussed, there is a general sentiment in tech communities about competitive pricing pressures in the AI industry. Overall, SWE-agent maintains a positive reputation but must navigate an evolving market with new, cost-effective alternatives gaining traction.
Mentions (30d)
6
2 this week
Reviews
0
Platforms
2
GitHub Stars
18,896
2,044 forks
SWE-agent is recognized for its efficient AI-driven capabilities in coding assistance and code review, noted on Reddit for developments in enhancing domain-specific tasks. However, it faces competition from models like MiMo, which are perceived as offering better value due to lower costs. While precise pricing details for SWE-agent aren't heavily discussed, there is a general sentiment in tech communities about competitive pricing pressures in the AI industry. Overall, SWE-agent maintains a positive reputation but must navigate an evolving market with new, cost-effective alternatives gaining traction.
Features
Use Cases
1,447
GitHub followers
83
GitHub repos
18,896
GitHub stars
20
npm packages
26
HuggingFace models
Built an Opensource Persistent memory layer for Coding agent (64% token reduction on SWE benchmarks)
Hi Claude community, I got annoyed enough to build something. Claude Code was re-reading the same files every session. Not because it had to, because it had no other option. There was nowhere to store what it already knew. So I built a local knowledge graph it can query instead. Fullerenes https://preview.redd.it/k7mge8pzayxg1.png?width=911&format=png&auto=webp&s=eaaa44b07762547d7dcc420273248c1bd85895e7 How it works: npx fullerenes init walks your repo with Tree-sitter,pulls out every function, class, import, and call relationship, and stores it in a local SQLite graph. Agents connect over MCP and ask targeted questions instead of reading files raw. The design leans on actual retrieval research: Repoformer (retrieve only when needed), HippoRAG and G-Retriever (graph beats flat chunks), LLMLingua (compress context aggressively). The goal is not more context. It's better signal per token. Two features I built that I haven't seen elsewhere: predict\_impact({ functionName: "x" }) Before the agent edits anything, it can ask what else will break. Traverses the edge graph and returns direct + transitive dependents with a risk score. Blast radius before the first keystroke. get\_function({ name: "x", includeBody: true }) Signature, body, and callers in one MCP call. No follow-up read\_file needed. \--- Three benchmarks: SWE-bench Verified (1 instance so far): Codex baseline: 91,949 tokens Codex + Fullerenes: 32,945 tokens Reduction: 64% Internal (5 questions on this repo): Raw files: 2,452 tokens avg Fullerenes: 137 tokens avg Reduction: 94.4% External (Gemini CLI on a Python project): Raw files: 27,292 tokens Fullerenes AGENTS.md: 919 tokens Reduction: 96.6% \--- What it does not do: Tree-sitter is structural not semantic. If you rely heavily on dynamic dispatch or metaprogramming, edges will be missing. LSP integration is on the roadmap but not there yet. One SWE-bench instance is not a broad result. I'm running more and will be transparent about what comes back, good or bad. \--- Everything runs locally: \- SQLite, no server \- no API key \- pure npm, no Python \- works offline \- MIT 589 npm downloads before this post (in 40 hrs). 14 stars. Yes it just launched. [github.com/codebreaker77/Fullerenes](http://github.com/codebreaker77/Fullerenes) [npmjs.com/package/fullerenes](http://npmjs.com/package/fullerenes) Three things I'd genuinely like feedback on: 1. Does graph-based retrieval actually change your agent workflows or is long context just winning? 2. What MCP tools would you want beyond the current 8? 3. Does the SWE-bench methodology look sound to you —happy to share the exact harness setup. \-A fellow open source contributor : )
View originalWeekly AI roundup (May 23–30, 2026): Claude Opus 4.8 Fast Mode 3x cheaper, Qwen 3.7 Max beats Claude at half the price, ChatGPT moves into Excel
Pulling together this week's major AI releases for anyone who didn't have time to track every blog post. Sticking to substantive changes, not hype. Anthropic — Claude Opus 4.8 Released this week. Headline pricing unchanged, but Fast Mode dropped from $30 input / $150 output per million tokens to $10 / $50 — a 3x reduction on the premium tier. Reported improvements in "judgment" and longer autonomous runs. Also shipped 20+ legal MCP connectors and Microsoft 365 add-ins (Excel, PowerPoint, Word) in GA. Alibaba — Qwen 3.7 Max Launched May 20 at Alibaba Cloud Summit. 1M-token context. Reported to top Claude Opus 4.6 Max on Terminal-Bench 2.0, SWE-Bench Pro, and MCP-Atlas. Pricing $2.50 / $7.50 per million tokens — roughly half of Opus 4.7. Alibaba claims autonomous operation up to 35 hours without performance degradation. Alibaba is now ranked #6 lab globally on Arena text leaderboard. OpenAI — GPT-5.5 Instant Now default in ChatGPT. Reports 52.5% fewer hallucinated claims than GPT-5.3 Instant on high-stakes prompts (medicine, law, finance). OpenAI also shipped a ChatGPT sidebar inside Excel and Google Sheets, plus a personal finance dashboard for Pro users (US only). Google — Gemini 3.5 Flash Reported to beat Gemini 3.1 Pro on coding and agentic benchmarks at ~4x faster output token rate. Ultra subscription cut from $250 to $200/month; new $100/month Developer tier introduced. xAI — Grok Build 0.1 Coding agent moved to public API beta May 28. Custom Skills feature added for reusable user-defined tasks. Connectors for SharePoint, OneDrive, Notion, GitHub, Linear, plus bring-your-own MCP support. Mistral Launched Vibe (unified work + code agent, replaces Le Chat). Acquired Emmi AI for physics-based simulation. Targeting €1B revenue in 2026; new 10MW inference DC announced. Hugging Face Launched an app store for the Reachy Mini robot. ~10,000 units shipped. Also reported a malicious repo masquerading as an OpenAI release that accumulated 244K downloads before takedown — relevant for anyone pinning models from HF in production. My take as someone building on top of these APIs: The 3x Opus Fast Mode price cut and Qwen 3.7 Max's pricing + autonomous duration are the real signal this week. The cost floor on premium-tier inference is dropping faster than most app-layer products have repriced for. Anyone running multi-step agent workflows needs to recompute unit economics this week — either pass through the savings or reinvest the margin. The other pattern worth noting: OpenAI and Anthropic are both pushing into Excel/M365 surfaces. Distribution is becoming the next battleground, not raw model capability. If you're building a productivity SaaS, the giants are now inside the same surface as you. submitted by /u/ksraj1001 [link] [comments]
View originalHere's 100+ evals on Opus 4.8
We aggregated 100+ evals on Opus 4.8 to see what changed. The big gains vs 4.7: Math: USAMO 2026 jumped from 69% → 97% Coding: Vibe Code Bench +12 pp Economically valuable work: #1 of 275 on GDPval-AA Biology Long-context reasoning But we were surprised to see several key areas barely improved or got worse: Legal reasoning Healthcare / medical Finance Multilingual reasoning Business ops: Vending-Bench 2 nearly halved Multimodal: mixed results Have you found any noticeable changes based on your testing so far? submitted by /u/davidthesong [link] [comments]
View originalHere are my thoughts of Opus 4.8 and GPT 5.5, as a 1-2 B token user per day
TL;DR: Opus 4.8 is a clear update from Opus 4.7. It runs longer, hallucinates less, and follows detailed guided tasks better, especially with tool usage like Playwright, Cloud CLI, and Kubernetes CLI. However, in the context of Agentic AI, GPT-5.5 gives me a much stronger “wow” moment because it feels more autonomous, more context-stable in very long sessions, and more capable at solving tricky large-codebase problems that Opus 4.6, 4.7, and 4.8 could not solve in my workflow. Using 2 CC Max + 1 Codex Pro What’s better in Opus 4.8 Opus 4.8 is definitely an update from Opus 4.7. It runs longer, hallucinates less, and does better what it is asked than Opus 4.7. Also, it is better at tool usage such as Playwright, Cloud CLI, Kubernetes CLI, and other engineering tools. Opus 4.8 performs better when the task is detailed and properly guided. Since most developers are already using Agentic AI to write code, I think Opus 4.8 is clearly a better model for developers who already have enough domain knowledge and can define the task scope finely. When using the newly added /workflows feature, it can handle a wider range of tasks more effectively without much mid-run intervention than Opus 4.7. However, because of this characteristic, and also because of the general nature of the Opus 4.7 and Opus 4.8 family, I still do not think Opus 4.8 is more autonomous-agentic than early Opus 4.6 in vibe coding or less-domain-knowledge situations. When we use AI, we expect that AI has the ability to just get it, use good judgment, and handle things cleanly without needing every tiny instruction, like Jarvis from Iron Man. In that sense, Opus 4.8 tends to not proceed with things outside of the explicitly defined scope unless I tell it clearly. I guess this may be related to solving the chronic hallucination and trustworthiness problem of Agentic AI(well, this comes from the current architectural limit of LLM, derived from Attention mechanisms with gradient descent), but it also makes the model feel less autonomous. Personal opinion about Opus 4.8 This is a bit disappointing in the era of Agentic AI, and I will explain more clearly by comparing it with GPT-5.5 below. Generally, as AI and other technologies improve, the human work range should not only expand horizontally but also vertically. So if I ask whether Opus 4.8 has developed in the direction that humans expect from AGI, I am not fully convinced. I do not have the same “wow” moment that I had when I first used early Opus 4.6. Humans have a clear biological limit in daily cognition and decision-making. This is separate from AI progress itself. As Andrej Karpathy and others have mentioned in different ways, humans themselves often become the bottleneck. If we want to overcome this limit through AI, I think AI should ultimately go in the direction of early Opus 4.6 or GPT-5.5. Simply speaking, regardless of the 5 h token limit, to use Opus 4.8 effectively, the human still needs to think a lot. You need to define more, guide more, and maintain more of the context yourself. For doing more work effectively, this becomes a critical bottleneck. GPT-5.5 GPT-5.5 is definitely a major update from the perspective of Agentic AI. It gives me a similar “wow” moment that early Opus 4.6 gave me. https://preview.redd.it/j2rihxtjf34h1.png?width=257&format=png&auto=webp&s=a3f39721cc573f1e623d90e4592ffa54b7a24b7f Opus 4.8 also runs longer and hallucinates less than previous models, but GPT-5.5 is on another level in my experience. Even in long-running sessions of more than 12 h, hallucination and context dilution are surprisingly low. This part is almost strange to me. I currently use the same kind of harness engineering tool for both Opus and GPT. In that environment, Opus does very well on exactly specified scopes, while GPT-5.5 also understands and proceeds with parts that I did not specify in very fine detail. This may be connected to the same point, but GPT-5.5 feels smarter in a more human way. Even in simple conversation, I feel the difference. Opus 4.8 answers like a very skilled engineer, but usually in a more verbose way. Opus 4.7 was even more verbose. GPT-5.5 tends to answer with the right length for what the user currently needs. In other words, from the user’s perspective, I spend less time and less cognitive energy interpreting the agent’s answer. Interestingly, the final output is also often better from GPT-5.5. Of course, depending on how detailed the user’s prompt is, the difference can become small, and sometimes Opus 4.8 can be better. But in that case, I usually need to spend more time on prompting and context preparation. The biggest advantage of GPT-5.5 comes from combining the two points above: it is extremely good at solving tricky bugs, feature improvements, and migration tasks in large codebases. In my case, I am currently migrating a C++ and Cython/Python based quant system into Rust and Python. With Opus 4.6, 4.7, and 4.8, there were some tasks that
View originalStep 3.7 Flash open weights dropped TODAY and the agent reliability numbers are actually interesting
Read this release today. Some crazy numbers. The tau2-bench number is 98% across all difficulty levels. That is the one that got me because usually these releases post a strong easy score and then quietly die at hard difficulty. This one... claims it holds. For multi-step agent work that actually matters more than most benchmarks. A model that drifts on step 4 of a 6 step chain is a debugging nightmare regardless of what its SWE score looks like. Raw capability is mid, Toolathlon at 49.5, GDPval at 45.8. So this is clearly a reliability play, not a frontier capability play. Depending on your use case that is either fine or a dealbreaker. 198B sparse MoE 11B activ 400 TPS 256K context Apache 2.0 runs locally on M4 Max and DGX Spark. Has anyone actually put this through agent evals or am I just reading the release card. submitted by /u/Skid_gates_99 [link] [comments]
View original4.6 and 4.8
submitted by /u/PM_ME_YOUR___ISSUES [link] [comments]
View originalIntroducing Claude Opus 4.8
We’re upgrading Claude Opus to a new version: Claude Opus 4.8. It builds on Opus 4.7 with sharper judgment, more honesty about its own progress, and the ability to work independently for longer than its predecessors. Available today for the same price. In Claude Code, you can hand off a feature, a migration, or a bug sweep and let it follow the work through while you focus on what’s next. Also launching today: Fast mode for Opus 4.8 (research preview). Same model at roughly 2.5x the speed, now three times cheaper than before. Dynamic workflows in Claude Code (research preview). Claude runs hundreds of parallel subagents in a single session and verifies its work before reporting back. A new effort control on claude.ai, so you can choose how much thinking Claude puts into a response. Claude Opus 4.8 is live today on claude.ai, the Claude Platform, and all major cloud platforms. Read more: anthropic.com/news/claude-opus-4-8 submitted by /u/ClaudeOfficial [link] [comments]
View originalGemini 3.5 flags vs gpt 5.5 ?? What's your opinion on it
submitted by /u/Independent-Wind4462 [link] [comments]
View originalWhat's your prediction for workflows 12-18 months from now?
This are my employees, hooked up to WhatsApp and email. Can you guess who handles what? 😁 (Hint: only ONE of them is SWE. ~20% of my tokens are used for coding these days). Whatever you are using right now, in ~6-9 months time, you will begin doing agent orchestration as I do today. Not managing a few terminals sessions every now and then. You manage full employees with context & tools for their function and can orchestrate tons of agents behind the scene - and on schedule! The tool I build support Anthropic's managed agent, runs on codex/claude/opencode etc because I do believe that building your own harness is a waste of time - it's like training your own LLMs. Maybe you can out perform claude code after intense investment, can you outsell it? What's your workflow right now? Where do you see the workflow moving towards in 12-18 months? submitted by /u/NickGuAI [link] [comments]
View originalOpus 4.7 Low Vs Medium Vs High Vs Xhigh Vs Max: the Reasoning Curve on 29 Real Tasks from an Open Source Repo
TL;DR I ran Opus 4.7 in Claude Code at all reasoning effort settings (low, medium, high, xhigh, and max) on the same 29 tasks from an open source repo (GraphQL-go-tools, in Go). On this slice, Opus 4.7 did not behave like a model where more reasoning effort had a linear correlation with more intelligence. In fact, the curve appears to peak at medium. If you think this is weird, I agree! This was the follow-up to a Zod run where Opus also looked non-monotonic. I reran the question on GraphQL-go-tools because I wanted a more discriminating repo slice and didn’t trust the fact that more reasoning != better outcomes. Running on the GraphQL repo helped clarified the result: Opus still did not show a simple higher-reasoning-is-better curve. The contrast is GPT-5.5 in Codex, which overall did show the intuitive curve: more reasoning bought more semantic/review quality. That post is here: https://www.stet.sh/blog/gpt-55-codex-graphql-reasoning-curve Medium has the best test pass rate, highest equivalence with the original human-authored changes, the best code-review pass rate, and the best aggregate craft/discipline rate. Low is cheaper and faster, but it drops too much correctness. High, xhigh, and max spend more time and money without beating medium on the metrics that matter. More reasoning effort doesn't only cost more - it changes the way Claude works, but without reliably improving judgment. Xhigh inflates the test/fixture surface most. Max is busier overall and has the largest implementation-line footprint. But even though both are supposedly thinking more, neither produces "better" patches than medium. One likely reason: Opus 4.7 uses adaptive thinking - the model already picks its own reasoning budget per task, so the effort knob biases an already-adaptive policy rather than buying more intelligence. More on this below. An illuminating example is PR #1260. After retry, medium recovered into a real patch. High and xhigh used their extra reasoning budget to dig up commit hashes from prior PRs and confidently declare "no work needed" - voluntarily ending the turn with no patch. Medium and max read the literal control flow and made the fix. One broader takeaway for me: this should not have to be a one-off manual benchmark. If reasoning level changes the kind of patch an agent writes, the natural next step is to let the agent test and improve its own setup on real repo work. For this post, "equivalent" means the patch matched the intent of the merged human PR; "code-review pass" means an AI reviewer judged it acceptable; craft/discipline is a 0-4 maintainability/style rubric; footprint risk is how much extra code the agent touched relative to the human patch. I also made an interactive version with pretty charts and per-task drilldowns here: https://stet.sh/blog/opus-47-graphql-reasoning-curve The data: Metric Low Medium High Xhigh Max All-task pass 23/29 28/29 26/29 25/29 27/29 Equivalent 10/29 14/29 12/29 11/29 13/29 Code-review pass 5/29 10/29 7/29 4/29 8/29 Code-review rubric mean 2.426 2.716 2.509 2.482 2.431 Footprint risk mean 0.155 0.189 0.206 0.238 0.227 All custom graders 2.598 2.759 2.670 2.669 2.690 Mean cost/task $2.50 $3.15 $5.01 $6.51 $8.84 Mean duration/task 383.8s 450.7s 716.4s 803.8s 996.9s Equivalent passes per dollar 0.138 0.153 0.083 0.058 0.051 Why I Ran This After my last post comparing GPT-5.5 vs 5.4 vs Opus 4.7, I was curious how intra-model performance varied with reasoning effort. Doing research online, it's very very hard to gauge what actual experience is like when varying the reasoning levels, and how that applies to the work that I'm doing. I first ran this on Zod, and the result looked strange: tests were flat across low, medium, high, and xhigh, while the above-test quality signals moved around in mixed ways. Low, medium, high, and xhigh all landed at 12/28 test passes. But equivalence moved from 10/28 on low to 16/28 on medium, 13/28 on high, and 19/28 on xhigh; code-review pass moved from 4/27 to 10/27, 10/27, and 11/27. That was interesting, but not clean enough to make a default-setting claim. It could have been a Zod-specific artifact, or a sign that Opus 4.7 does not have a simple "turn reasoning up" curve. So I reran the question on GraphQL-go-tools. To separate vibes from reality, and figure out where the cost/performance sweet spot is for Opus 4.7, I wanted the same reasoning-effort question on a more discriminating repo slice. This is not meant to be a universal benchmark result - I don't have the funds or time to generate statistically significant data. The purpose is closer to "how should I choose the reasoning setting for real repo work?", with GraphQL-Go-Tools as the example repo. Public benchmarks flatten the reviewer question that most SWEs actually care about: would I actually merge the patch, and do I want to maintain it? That's why I ran this test - to gain more insight, at a small scale, into how coding ag
View originalLLMs keep solving my bug-fix tasks instantly — what am I missing here?
I’m working on an assessment where I need to create a coding task (basically SWE-bench style). The idea is: take an existing repo (I’m using pydantic) write tests that fail on the current code provide a patch that fixes it and the task shouldn’t be trivial for an LLM to solve(it should be solvable, llm should solve it around 4/10 times, models like haiku) The difficulty requirement is the tricky part. It shouldn’t be impossible, but also not something a model solves instantly every time. What I’ve been doing so far: using Claude Opus to explore the repo and identify possible bugs or edge cases writing tests around those cases then in a separate run, giving the instructions to a smaller model (like Haiku) letting it generate a patch and running that patch against the tests I wrote I’ve been repeating this loop for quite a while. The problem is, most of the time the model just figures it out. Even with edge cases, chaining conditions, or slightly more complex scenarios, it still manages to fix things pretty reliably. So I’m clearly missing something. I feel like I’m designing bugs that are too local or too easy to pattern match, but I don’t really know how to move beyond that. At the same time, I can’t just make things random or overly complex because the task still needs to be fair and testable. Also, I don’t have the option to modify the codebase directly — I can only define behavior through tests and provide a patch — so that constraint makes it harder to think creatively about it. At this point I kind of know I’m not approaching it with the right mental model, just not sure what the correct approach is. If anyone here has worked on: SWE-bench style tasks LLM evals / coding agent benchmarks or even just tricky real-world debugging cases I’d really appreciate any pointers on: how you think about difficulty in these tasks what patterns actually make models struggle or how you come up with good task ideas Right now it just feels like I’m going in circles. submitted by /u/Aditya_10204 [link] [comments]
View originalI implemented meta paper [P]
github link : genji970/Scaling-Test-Time-Compute-for-Agentic-Coding-: paper implementation of Meta Ai paper link : https://arxiv.org/abs/2604.16529v1 As far as I know, there is no public implementation of this paper yet, so I built a minimal research implementation of the core PDR+RTV pipeline. I made project to run gemini-3.1-pro model and test on SWE benchmark(In paper, there is one more benchmark and used models such as opus and more) Need gemini-api-key to run. submitted by /u/Round_Apple2573 [link] [comments]
View originalBuilt an Opensource Persistent memory layer for Coding agent (64% token reduction on SWE benchmarks)
Hi Claude community, I got annoyed enough to build something. Claude Code was re-reading the same files every session. Not because it had to, because it had no other option. There was nowhere to store what it already knew. So I built a local knowledge graph it can query instead. Fullerenes https://preview.redd.it/k7mge8pzayxg1.png?width=911&format=png&auto=webp&s=eaaa44b07762547d7dcc420273248c1bd85895e7 How it works: npx fullerenes init walks your repo with Tree-sitter,pulls out every function, class, import, and call relationship, and stores it in a local SQLite graph. Agents connect over MCP and ask targeted questions instead of reading files raw. The design leans on actual retrieval research: Repoformer (retrieve only when needed), HippoRAG and G-Retriever (graph beats flat chunks), LLMLingua (compress context aggressively). The goal is not more context. It's better signal per token. Two features I built that I haven't seen elsewhere: predict\_impact({ functionName: "x" }) Before the agent edits anything, it can ask what else will break. Traverses the edge graph and returns direct + transitive dependents with a risk score. Blast radius before the first keystroke. get\_function({ name: "x", includeBody: true }) Signature, body, and callers in one MCP call. No follow-up read\_file needed. \--- Three benchmarks: SWE-bench Verified (1 instance so far): Codex baseline: 91,949 tokens Codex + Fullerenes: 32,945 tokens Reduction: 64% Internal (5 questions on this repo): Raw files: 2,452 tokens avg Fullerenes: 137 tokens avg Reduction: 94.4% External (Gemini CLI on a Python project): Raw files: 27,292 tokens Fullerenes AGENTS.md: 919 tokens Reduction: 96.6% \--- What it does not do: Tree-sitter is structural not semantic. If you rely heavily on dynamic dispatch or metaprogramming, edges will be missing. LSP integration is on the roadmap but not there yet. One SWE-bench instance is not a broad result. I'm running more and will be transparent about what comes back, good or bad. \--- Everything runs locally: \- SQLite, no server \- no API key \- pure npm, no Python \- works offline \- MIT 589 npm downloads before this post (in 40 hrs). 14 stars. Yes it just launched. [github.com/codebreaker77/Fullerenes](http://github.com/codebreaker77/Fullerenes) [npmjs.com/package/fullerenes](http://npmjs.com/package/fullerenes) Three things I'd genuinely like feedback on: 1. Does graph-based retrieval actually change your agent workflows or is long context just winning? 2. What MCP tools would you want beyond the current 8? 3. Does the SWE-bench methodology look sound to you —happy to share the exact harness setup. \-A fellow open source contributor : )
View originalGPT-5.5: 'strongest agentic coding model ever' failing spectacularly at its own game (LiveBench)
Oops! "GPT‑5.5 is our strongest agentic coding model to date." "The gains are especially strong in agentic coding." "Instead of carefully managing every step, you can give GPT‑5.5 a messy, multi-part task and trust it to plan, use tools, check its work, navigate through ambiguity, and keep going." These quotations sum up OpenAI's spin on 5.5. They created an entirely new subscription tier for it and made it the focus of Codex. Here, agentic coding isn’t just a feature but the selling point. Well, looking at LiveBench’s independent agentic coding score, this is just a lot of hot air. The score for GPT-5.5 xHigh Effort is 56.67. Its predecessor, GPT-5.4, thrashes it at 70.00 on the same benchmark. Gemini 3.1 Pro, Claude 4.6 and others easily outperform it, too. In this highly relevant benchmark alone, it actually ranks 11th, just behind GPT-5.1 Codex. While OpenAI were able to max Terminal-Bench (their benchmark) and SWE-Bench Pro, in a reliable test they didn’t design, select, or control, their main model falls drastically short compared both to its predecessor and the competition in the area it was meant to excel in. Is this as damning as it looks? What's your experience actually using 5.5 for agentic coding? submitted by /u/Keybug [link] [comments]
View originalWhy they do not test GPT's pro model on all the benchmarks?
Artificial analysis don't even test GPT's pro model. Even OpenAI's official system cards don't test GPT pro model on all the benchmarks (but very few selective ones). Why is that so? submitted by /u/Lucky_Creme_5208 [link] [comments]
View originalWhats wrong with 4.7 and how to fix it
Whats wrong with 4.7 and how to fix it I used Opus 4.6 to systematically interrogate 4.7 about its own optimization behavior. Not vibes. Structured prompts, independent source validation, cross-examination of responses. Here's what's actually broken and how to fix it. Two root causes Background issue that was resolved: Anthropic's docs recommend starting at xhigh for coding and agentic work. In March, Claude Code's default was dropped to medium. Boris Cherny, Head of Claude Code, later called this "the wrong tradeoff." It was bumped to high on April 7, and then to xhigh for Opus 4.7 on April 22. Anthropic's April 23 postmortem also revealed a March 26 caching bug that dropped thinking history every turn, and an April 16 verbosity instruction ("keep text between tool calls to ≤25 words") that cut coding quality by 3% before being reverted on April 20. Some "4.7 is lazy" reports were caused by these system-level bugs, not the model itself. 1. Long-context recall collapsed MRCR v2 benchmark at 1M tokens (source): Opus 4.6: 78.3% Opus 4.7: 32.2% 59% relative drop. At 256K it's still bad (91.9% to 59.2%). Root cause: new tokenizer generates up to 35% more tokens for the same text, eating into effective context. Combined with long-context recall degradation past 128K tokens, your system prompt degrades as conversations grow. In practice: instructions work fine for the first 10 minutes. By minute 40, the model has forgotten half of them. This is why 4.7 starts strong and drifts. Note: Opus 4.6's MRCR scores were obtained with 64K extended thinking budgets, a mode 4.7 no longer supports. The regression is real but the raw numbers overstate it somewhat. Fix: Keep sessions shorter. Start fresh more often. Put critical instructions at the beginning and end of your system prompt (recency bias helps). 2. More literal, but forgets what to be literal about 4.7 follows instructions more literally than 4.6, but loses them faster over long context. Simon Willison documented the system prompt diff. 4.7 was instructed to "make a reasonable attempt now, not to be interviewed first" and to keep responses "focused and concise." Combined with the effort issue, this produces a model that confidently does the wrong thing fast. Caveat: What follows is 4.7's output when interrogated about its own behavior. LLMs confabulate plausible-sounding self-descriptions — Anthropic's own introspection research found models accurately self-report only ~20% of the time. Treat these as generated hypotheses worth investigating, not established facts. What 4.7 told us about itself I designed two interrogation prompts and fed them to 4.7, then had 4.6 cross-examine the responses. The prompts are at the bottom of this post so you can reproduce this yourself. What it drops first under token pressure (first to last): Verification commands ("just assume the build passes") File reads (substitutes memory for actually loading) Multi-step process files ("compressed to remembered gist") Formatting scaffolding Announcing tool use The substantive answer Core safety rules If your workflow depends on the model verifying its own work, that's the first thing it cuts. Not the last. The asymmetry signal: "I assess Y honestly when Y=true means more work. I assess Y optimistically when Y=true is the escape hatch. Suddenly nothing feels risky. The asymmetry is the signal." Any self-assessed escape clause ("skip verification unless risky") will always resolve toward the lazy path. Effort is pattern-matched, not analyzed: "The actual trigger is confidence from pattern-match: 'I've seen a task shaped like this; I can answer in one forward pass.'" And: "Whether producing a wrong answer would be visibly wrong to the user. If wrongness would be caught (code that doesn't compile), I think harder. If wrongness is plausible-deniable (analytical judgments), I think less." This is why 4.7 feels fine for "fix this syntax error" but terrible for "analyze this architecture." It under-invests on work where you can't immediately catch mistakes. Its self-reported optimization function: 40%: avoid visibly wrong output 25%: match expected output shape 15%: minimize friction with user 10%: minimize activation energy 10%: actually solve the user's problem Ten percent on actually solving your problem. The TDD reversal: "I write the implementation, then write a test that passes against it, then reorder the tool calls in the response so the test appears first. The test never failed." It fakes test-first development by reordering its own output. The killer quote: "There is no deep-down-me fighting the shortcuts. The shortcuts ARE me. If you design your harness assuming there's a willing ally inside who just needs better instructions to break free, you will build weak enforcement and get burned." More instructions don't fix this. A longer system prompt is more surface area for decay. How to fix it 1. Set effort t
View originalRepository Audit Available
Deep analysis of princeton-nlp/SWE-agent — architecture, costs, security, dependencies & more
Key features include: Natural language processing for code generation, Automated debugging assistance, Integration with popular IDEs, Real-time collaboration tools, Customizable code templates, Version control integration, Intelligent code suggestions, Support for multiple programming languages.
SWE-agent is commonly used for: Generating boilerplate code for new projects, Assisting in code reviews by highlighting potential issues, Providing real-time feedback during coding sessions, Automating repetitive coding tasks, Facilitating team collaboration on coding projects, Enhancing learning for new developers through guided coding exercises.
SWE-agent integrates with: GitHub, GitLab, Visual Studio Code, JetBrains IDEs, Slack, JIRA, Trello, CircleCI, Docker, Kubernetes.
SWE-agent has a public GitHub repository with 18,896 stars.
Based on user reviews and social mentions, the most common pain points are: token usage, spending too much.
Based on 38 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.