Based on the limited social mentions provided, user sentiment about xAI appears largely negative or concerning. The most substantive mention indicates internal turmoil, with reports of Elon Musk "pushing out" xAI founders and the AI coding effort "faltering," suggesting significant management and development challenges. The other mentions are mostly generic YouTube titles or unrelated AI discussions that don't provide meaningful insights into xAI specifically. Overall, the available information suggests users and observers are focused on organizational instability rather than product performance or pricing. More comprehensive user reviews would be needed to assess actual user experience with xAI's technology.
Mentions (30d): 2
Reviews: 0
Platforms: 5
Sentiment: 0% (0 positive)
Industry: information technology & services
Employees: 3,500
Funding Stage: Other
Total Funding: $48.1B
Elon Musk pushes out more xAI founders as AI coding effort falters
<a href="https://archive.ph/rP4cb" rel="nofollow">https://archive.ph/rP4cb</a> (text at bottom)<p><a href="https://x.com/elonmusk/status/2032201568335044978" rel="nofollow">https://x.com/elonmusk/status/2032201568335044978</a>, <a href="https://xcancel.com/elonmusk/status/2032201568335044978" rel="nofollow">https://xcancel.com/elonmusk/status/2032201568335044978</a><p><a href="https://economictimes.indiatimes.com/tech/artificial-intelligence/musk-ousts-more-xai-founders-as-ai-coding-effort-falters-ft-reports/articleshow/129560405.cms" rel="nofollow">https://economictimes.indiatimes.com/tech/artificial-intelli...</a><p><a href="https://futurism.com/artificial-intelligence/elon-musk-screwed-up-xai-rebuilding" rel="nofollow">https://futurism.com/artificial-intelligence/elon-musk-screw...</a>
ClaudeGUI: File tree + Monaco + xterm + live preview, all streaming from Claude CLI
Hey all — I've been living inside `claude` in the terminal for months, and kept wishing I could see files, the editor, the terminal, and a live preview of whatever Claude is building, all at once. So I built it. **ClaudeGUI** is an unofficial, open-source web IDE that wraps the official Claude Code CLI (`@anthropic-ai/claude-agent-sdk`). Not affiliated with Anthropic — just a community project for people who already pay for Claude Pro/Max and want a real GUI on top of it.

**What's in the 4 panels**

- 📁 File explorer (react-arborist, virtualized, git status)
- 📝 Monaco editor (100+ languages, multi-tab, AI-diff accept/reject per hunk)
- 💻 xterm.js terminal (WebGL, multi-session, node-pty backend)
- 👁 Multi-format live preview — HTML, PDF, Markdown (GFM + LaTeX), images, and reveal.js presentations

**The part I'm most excited about**

- **Live HTML streaming preview.** The moment Claude opens an ```html``` block or writes a `.html` file, the preview panel starts rendering it *while Claude is still typing*. Partial render → full render on completion. Feels like watching a website materialize.
- **Conversational slide editing.** Ask Claude to "make slide 3 darker" — reveal.js reloads in place via `Reveal.sync()`, no iframe flash. Export to PPTX/PDF when done.
- **Permission GUI.** Claude tool-use requests pop up as an approval modal instead of a y/N prompt in the terminal. Dangerous commands get flagged. Rules sync with `.claude/settings.json`.
- **Runtime project hotswap.** Switch projects from the header — file tree, terminal cwd, and Claude session all follow.
- **Green phosphor CRT theme** 🟢 because why not.
**Stack**: Next.js 14 + custom Node server, TypeScript strict, Zustand, Tailwind + shadcn/ui, `ws` (not socket.io), chokidar, Tauri v2 for native `.dmg`/`.msi` installers.

**Install** (one-liner):

```bash
curl -fsSL https://github.com/neuralfoundry-coder/CLAUDE-GUI/tree/main/scripts/install/install.sh | bash
```

Or grab the `.dmg` / `.msi` from releases. Runs 100% locally, binds to 127.0.0.1 by default. Your Claude auth from `claude login` is auto-detected.

Status: v0.3 — 102/102 unit tests, 14/14 Playwright E2E passing. Still rough around the edges, MIT-ish license TBD, feedback very welcome. Repo: Happy to answer questions about the architecture — the HTML streaming extractor and the Claude SDK event plumbing were the fun parts.

submitted by /u/Motor_Ocelot_1547
<total_tokens> in user prompts, invisible
Anyone else getting this feedback today? I tested from both Claude Desktop and Claude.ai and Claude seems to see it on all of my messages. https://claude.ai/share/06cedfb4-997f-4f9c-8eae-4b9f4f615509

The "X tokens left" line that's been appearing at the end of your messages is not something you're sending me — it looks like a system-level annotation that's being injected into the conversation, possibly by a new feature rolled out recently. I've been reading it and treating it as gospel ("I have ~2K tokens left, I should wrap up"), and then dutifully reporting that to you as if it were a hard fact about my own state.

Yes. It's right there at the bottom of your last message: "40000 tokens left". Same format as every other message this conversation, same value (40000) as several earlier ones — which is itself interesting, because if it were a real running budget you'd expect it to decrease monotonically, and it hasn't. It's been bouncing around (40000, 36000, 39000, 40000...), which is not how a token budget would behave.

The only skill I have enabled is Desktop Commander on my local machine, but the same thing appears on conversations in the web chat also. Couldn't find anything online about this.

submitted by /u/ski107
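The poster's reasoning, that a genuine running budget should decrease monotonically, can be checked mechanically. A minimal sketch (the sample values come from the post):

```python
def is_monotonically_decreasing(values: list[int]) -> bool:
    """True if each reported budget is <= the previous one."""
    return all(b <= a for a, b in zip(values, values[1:]))

# Values reported across the conversation, per the post
reported = [40000, 36000, 39000, 40000]
print(is_monotonically_decreasing(reported))  # → False: not a real running budget
```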
[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
cuBLAS dispatches an inefficient kernel for every batched FP32 workload, from 256×256 to 8192×8192×8. It only uses ~40% of the available compute on RTX GPUs. Tested with RTX 5090, but likely all RTX non-Pro GPUs are affected. I tested with the latest CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03. Previous versions are even worse. I wrote a simple, yet efficient kernel and compared it to cuBLAS across a variety of workloads.

Batched perf vs cuBLAS on 5090 (>100% means my kernel is faster):

Size   B=4    B=8    B=16
256    91%    80%    90%
512    120%   153%   135%
1024   137%   142%   142%
2048   158%   155%   157%
4096   157%   162%   170%
8192   158%   152%   148%

cuBLAS uses a proper kernel on other GPUs. RTX GPUs clearly receive less love from NVIDIA:

- Pro 6000: escalates through three tile sizes, reaches 73% FMA (Fused Multiply-Add pipe)
- H200: best implementation, mixes CUTLASS and xmma families, reaches 82% FMA

An in-depth analysis with full NCU profiling data across all three GPUs, a deep dive into SASS scheduling explaining the remaining 5% single-mode gap between my kernel and a proper cuBLAS SGEMM, and repro scripts are available in the article linked below. Besides the bug, the article covers a simple TMA (tensor memory accelerator) double-buffer kernel that beats cuBLAS by 46-65% in batched mode on the 5090 and achieves 80-120% of the performance of a properly selected kernel, making it a nice technique for writing simple yet very performant kernels.

Vs proper Pro 6000 kernel:

Size   B=4    B=8    B=16
256    87%    95%    77%
512    102%   124%   101%
1024   101%   104%   96%
2048   90%    102%   93%
4096   93%    93%    93%
8192   94%    95%    95%

Vs proper H200 kernel:

Size   B=4    B=8    B=16
256    85%    104%   77%
512    105%   97%    88%
1024   87%    89%    89%
2048   89%    90%    92%
4096   91%    89%    90%
8192   88%    87%    87%

Double-buffer pipeline visualization:

Tile 0: [load buf0] [wait] [compute buf0 + load buf1]
Tile 1: [wait buf1] [compute buf1 + load buf0]
Tile 2: [wait buf0] [compute buf0 + load buf1]
...
Simplified kernel source (elisions from the original preserved):

```cuda
__global__ __launch_bounds__(256) void fused_matmul(
    const __grid_constant__ CUtensorMap A_tma,
    const __grid_constant__ CUtensorMap B_tma,
    float* C)
{
    extern __shared__ __align__(128) char dsmem[];
    float* smem = (float*)dsmem;

    // Two mbarriers for double-buffer synchronization
    uint64_t* mbar = (uint64_t*)(dsmem + 2 * STAGE * 4);

    // Shared memory addresses for TMA targets
    const int as0 = __cvta_generic_to_shared(&smem[0]);
    const int bs0 = __cvta_generic_to_shared(&smem[A_SIZE]);
    const int as1 = __cvta_generic_to_shared(&smem[STAGE]);
    const int bs1 = __cvta_generic_to_shared(&smem[STAGE + A_SIZE]);

    // Thread identity
    int tid = threadIdx.y * 32 + threadIdx.x;
    int tr = threadIdx.y * TM, tc = threadIdx.x * 4;
    int bm = blockIdx.y * BM, bn = blockIdx.x * BN;

    // Initialize mbarriers (thread 0 only)
    if (tid == 0) { mbarrier_init(mbar[0]); mbarrier_init(mbar[1]); }
    __syncthreads();

    float c[TM][4] = {};  // Accumulators

    // Pre-load first tile
    if (tid == 0) {
        mbarrier_expect_tx(mbar[0], BYTES);
        tma_load_2d(as0, &A_tma, /*k=*/0, bm, mbar[0]);
        tma_load_2d(bs0, &B_tma, bn, /*k=*/0, mbar[0]);
    }

    for (int t = 0; t < K/BK; t++) {
        int s = t % 2;  // Current buffer

        // Wait for current tile's TMA to complete
        mbarrier_wait(mbar[s], phase[s]);

        // Start loading NEXT tile (overlaps with compute)
        if (tid == 0 && t + 1 < nt) {
            tma_load_2d(next_buf_a, &A_tma, next_k, bm, next_mbar);
            tma_load_2d(next_buf_b, &B_tma, bn, next_k, next_mbar);
        }

        // Compute: all 256 threads do FMA from shared memory
        float* As = &smem[s * STAGE];
        float* Bs = &smem[s * STAGE + A_SIZE];
        #pragma unroll
        for (int kk = 0; kk < BK; kk++) {
            float b0 = Bs[kk*BN+tc], b1 = Bs[kk*BN+tc+1], ...;
            for (int i = 0; i < TM; i++) {
                float a = As[(tr+i)*BK+kk];
                c[i][0] += a * b0;
                c[i][1] += a * b1;
                // ... 4 FMAs per row
            }
        }
        __syncthreads();
    }

    // Write results to global memory
    for (int i = 0; i < TM; i++)
        store_row(C, bm+tr+i, bn+tc, c[i]);
}
```

The full article is available here. Repo with repro scripts and benchmark data.

submitted by /u/NoVibeCoding
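The payoff of the double-buffer schedule in the pipeline visualization can be illustrated with a toy timing model (my own sketch, not from the article; times are arbitrary units). With two buffers, every load after the first overlaps with compute, so the exposed time is roughly one load plus the longer of load/compute per tile, instead of load plus compute per tile:

```python
def serial_time(n_tiles: int, load: float, compute: float) -> float:
    """No overlap: every tile waits for its load, then computes."""
    return n_tiles * (load + compute)

def double_buffered_time(n_tiles: int, load: float, compute: float) -> float:
    """Tile t's compute overlaps with tile t+1's load (two buffers)."""
    if n_tiles == 0:
        return 0.0
    # First load is exposed; middle tiles overlap; last tile only computes.
    return load + (n_tiles - 1) * max(load, compute) + compute

# 8 tiles, load and compute both 10 units: all but the first load are hidden
print(serial_time(8, 10, 10))           # → 160
print(double_buffered_time(8, 10, 10))  # → 90
```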
Your AI coding agent doesn't know your business rules. How are you dealing with that?
YC's Spring 2026 RFS just named "Cursor for Product Managers" as an official startup category. Andrew Miklas put it bluntly: "Cursor solved code implementation. Nobody has solved product discovery." But there's a harder problem hiding underneath that nobody's really talking about. The code your agent writes looks perfect. It compiles. Tests pass. Then it hits production and violates a business rule nobody told it about.

The data is getting ugly:

- AI-generated code produces 1.7x more issues than human code (CodeRabbit, 470 PRs)
- Production incidents per PR are up 23.5% at high AI-adoption teams (Faros AI)
- Amazon's AI coding tool caused a 6-hour outage — 6.3M lost orders — in March 2026
- 48% of AI-generated code has security vulnerabilities (NYU/Contrast Security)

The root cause isn't model quality. It's missing context. Business rules scattered across Confluence, COBOL comments, Slack threads, and a PM's head. The agent never sees any of it.

How are teams solving this today? From what I'm seeing:

- CLAUDE.md files with manual rules (breaks on anything non-trivial)
- Massive system prompts that bloat context and get compacted away
- PMs writing rule docs that go stale the day after they're written

Curious:

- If you're shipping AI-generated code in production — what's your worst "the agent didn't know about X" story?
- How do you feed business context to your coding agents today? Static files? RAG? Something custom? I do hear about knowledge graphs, MCPs, and CI gates, but are these comprehensively well achieved today?
- Would you trust a system that auto-enforces business rules on AI code, or does that feel like it'd create more false positives than it catches?

Building in this space. Want to make sure the problem is as real as the data suggests before going deep.

submitted by /u/rahulmahibananto
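One of the approaches the post mentions, a CI gate that checks generated code against a machine-readable rules file, can be sketched minimally. The rule format and `check_rules` helper here are hypothetical, purely for illustration:

```python
import re

# Hypothetical rules file: each rule is a forbidden pattern plus a reason.
BUSINESS_RULES = [
    {"pattern": r"\bDELETE\s+FROM\s+orders\b", "reason": "orders are soft-deleted only"},
    {"pattern": r"refund\(.*amount=None", "reason": "refunds require an explicit amount"},
]

def check_rules(source: str) -> list[str]:
    """Return violation messages for a piece of generated code."""
    violations = []
    for rule in BUSINESS_RULES:
        if re.search(rule["pattern"], source, flags=re.IGNORECASE):
            violations.append(f"rule violated: {rule['reason']}")
    return violations

generated = "cur.execute('DELETE FROM orders WHERE id = %s', (oid,))"
print(check_rules(generated))  # flags the soft-delete rule
```

A regex gate like this is brittle, which is exactly the post's point: real business rules usually need richer context than a pattern list.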
What's your "When Language Model AI can do X, I'll be impressed"?
I have two at the top of my mind:

When it can read musical notes. I will be mildly impressed when I can paste in a picture of sheet music and, with programming, it sets up the instruments needed to play the music and then correctly plays the song it reads from the notes.

My jaw will drop when, with a simple prompt, an AI can finally create a classic arcade-style, fully functioning and fun-to-play pinball game. Each new version of models that becomes available, I give that one a go. None have been even remotely close to achieving this goal.

So what are your visions for what will impress you to some extent when an AI can make it for you?

submitted by /u/KroggRage
Does the yellow banner get darker over time? (UI change or escalation?)
I got the "level 3" yellow warning banner several days ago. I noticed that this filter actually only applies to chats on Claude.ai and it didn't affect Claude Code. Therefore, I just kept using Claude Code as usual and ignored the banner for a few days. But today I went to check the web UI and found that the banner's color had changed... It clearly got darker than the normal yellow banner. Is this just a universal UI update, or does it mean the warning escalated?? 😨😨

submitted by /u/Anniric
What actually makes AI useful for writing (most people are doing it wrong)
Been using AI for writing for a while and figured out what actually moves the needle vs what's just hype.

The biggest thing: stop treating AI like a vending machine. One prompt, one result, done. The real power is in chaining prompts — having an actual conversation where each reply builds on the last.

Example: instead of "write me a blog post about X", try asking for 10 angles first, pick the best one, then ask for an outline, then draft section by section. The output is 10x better.

Happy to share more if anyone's interested — what are you all struggling with most when using AI for writing?

submitted by /u/Major_Guarantee_3472
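The chained workflow described above can be sketched as a small driver loop. Here `ask` is a stand-in for whatever LLM client you use; everything in this sketch is illustrative, not a real API:

```python
def ask(prompt: str) -> str:
    """Stand-in for an LLM call; replace with your client of choice."""
    return f"<response to: {prompt[:40]}>"

def chained_draft(topic: str) -> str:
    # Each step feeds the previous step's output back in, per the post's workflow.
    angles = ask(f"Give me 10 angles for a blog post about {topic}.")
    best = ask(f"From these angles, pick the strongest and say why:\n{angles}")
    outline = ask(f"Write an outline for this angle:\n{best}")
    return ask(f"Draft the post section by section from this outline:\n{outline}")

draft = chained_draft("prompt chaining")
print(isinstance(draft, str) and len(draft) > 0)  # → True
```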
AIs do forget, they do hallucinate, and carrying your entire project from one AI to another is a nightmare — here's the missing piece nobody talks about
The master memory for all your projects; relieve your phone of all the extra files.

AIs forget mid-session, hallucinate more as chats grow, and switching platforms means rebuilding your entire project brain from scratch. This workflow fixes it.

You've trained Claude to your exact rules — no bullet-point rants, conversational tone only, "we tried X and it failed." Two hours invested. Then you need ChatGPT's browser or Gemini's Workspace integration. Blank slate. Again.

The real pain: context rot. Long sessions degrade accuracy as early instructions get buried. Hallucinations creep in — invented rules, "as we discussed" about nothing. Short sessions work better... but you lose the living record of your corrections, your preferences in action.

The solution most miss: chat logs are your gold. Not summaries. The full exchanges where you corrected the AI show it how you think. But files pile up. Claude caps at 20 uploads. Loose .txt files parse poorly.

I built a Google Drive script that auto-merges everything into one "Master Brain" Google Doc. Drop exports in a folder. It compiles them hourly into structured volumes with headers. Upload one doc to any AI. Instant context transfer.

Why it works:

- Bypasses 20-file limits
- Headers help attention navigation
- Volumes fit token ceilings
- Auto-archives originals

Full script + exact workflow (rules files, session hygiene, changelog) here: https://www.reddit.com/r/ScamIndex/comments/1shaud2/resource_ais_do_forget_they_do_hallucinate_and/

submitted by /u/Mstep85
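The post's merger is a Google Drive script; a rough local analogue of the same idea, concatenating chat-log exports into one headed "master" document, might look like this (the file layout and header format are my own assumptions, not the post's actual script):

```python
from pathlib import Path

def merge_exports(export_dir: str, out_file: str) -> int:
    """Merge every .txt chat export under export_dir into one headed doc.
    Returns the number of files merged."""
    exports = sorted(Path(export_dir).glob("*.txt"))
    parts = []
    for i, path in enumerate(exports, start=1):
        # One header per source file, so the AI can navigate by section.
        parts.append(f"## Volume {i}: {path.stem}\n\n{path.read_text()}\n")
    Path(out_file).write_text("\n".join(parts))
    return len(exports)
```

Run it on a folder of exports and upload the single output file instead of twenty loose logs.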
Tired of usage caps sneaking up on you? Try this! Split tab usage monitor.
This is one of those things that, after I do it, I'm like: why didn't I think of this sooner? I'll definitely feel less rage from sudden cutoffs now.

Just right-click on your tab and add it to split view. Simple as that. Then navigate to your usage page and you're done.

I hope somebody else finds this as soothing as I do. Resource anxiety is a real thing.

submitted by /u/shamanicalchemist
Hot take: today we witnessed the death of vibe coding
Many Claude users moved to Codex as an alternative to Claude's brutal limits. Since today's change in price plan by OpenAI, my Plus plan limits are now burning away at something like 4-5x the speed they had done before. Aside from the first week I got Codex, I've never come close to maxing my weekly limits, yet I have burned through 30% of my limit since the reset today.

AI in general will only get more expensive from here on out. Non-skilled people are just not going to be able to afford to throw in one prompt after another until they get something that works (or appears to work), and people who have built AI-slop codebases will be forced to either pay a fortune to maintain them with AI (because no human will be able to make sense of it or be willing to put their name to such a mess) or have them entirely rewritten by a skilled human.

submitted by /u/U4-EA
Dream team memory handling — what's new in CC 2.1.98 (+2,045 tokens)
- NEW: System Prompt: Communication style — Added guidelines for giving brief user-facing updates at key moments during tool use, writing concise end-of-turn summaries, matching response format to task complexity, and avoiding comments and planning documents in code.
- NEW: System Prompt: Dream team memory handling — Added instructions for handling shared team memories during dream consolidation, including deduplication, conservative pruning rules, and avoiding accidental promotion of personal memories.
- NEW: System Prompt: Exploratory questions — analyze before implementing — Added instructions for Claude to respond to open-ended questions with analysis, options, and tradeoffs instead of jumping to implementation, waiting for user agreement before writing code.
- NEW: System Prompt: User-facing communication style — Added detailed guidelines for writing clear, concise, and readable user-facing text, including prose style, update cadence, formatting rules, and audience-aware explanations.
- NEW: Tool Description: Background monitor (streaming events) — Added description for a background monitor tool that streams stdout events from long-running scripts as chat notifications, with guidelines on script quality, output volume, and selective filtering.
- Agent Prompt: Dream memory consolidation — Added support for an optional transcript source note displayed after the transcripts directory path.
- Agent Prompt: Dream memory pruning — Added conservative pruning rules for team/ subdirectory memories: only delete when clearly contradicted or superseded by a newer team memory, never delete just because unrecognized or irrelevant to recent sessions, and never move personal memories into team/.
- Skill: /dream nightly schedule — Minor refactor to include memory directory reference in the consolidation configuration.
- System Prompt: Advisor tool instructions — Minor wording updates: clarified tool invocation syntax, broadened 'before writing code' to 'before writing,' and updated several examples and descriptions for generality (e.g., 'reading code' → 'fetching a source,' 'the code does Y' → 'the paper states Y').

Details: https://github.com/Piebald-AI/claude-code-system-prompts/releases/tag/v2.1.98
Regular updates at https://x.com/PiebaldAI

submitted by /u/Dramatic_Squash_3502
I'm letting AI plan every hour of my life for 2 weeks. Starting Monday. Looking for tips from people who've tried this.
Next Monday I hand my calendar, my meals, my workouts, my sleep schedule, and basically every decision in my day over to a multi-agent AI assistant I've been building for the last 5 days. It decides when I get up, what I eat, when I hit the gym, when I work on which project, and when I'm "allowed" to hang out with my partner. I follow its plan. For 2 weeks.

Why: I'm a platform engineer running a consulting biz on the side. Every productivity system I've tried works for 2 weeks then collapses. I wanted a system that maintains itself. So I built one.

What I've built so far (all in Claude Code, 5 days):

- 7 specialized agents (PA orchestrator, calendar, email, tasks, knowledge, brain maintenance, decision-making)
- 50+ commands across daily ops, calendar, email triage, brain management
- A persistent "brain" in Obsidian — 132 knowledge nodes, 1001 wiki-links, 98 logged decisions. Every session reads from it, writes back to it.
- Telegram daemon so it can nudge me on the go
- Observability hooks, bug tracker, bootstrap installer. Fully docs'd.

Full project page with live timeline + architecture + bug tracker: https://rivuletconsulting.nl/projects/daily-ai.html
First blog post (the "why") + Day 5 build log are up there too. The experiment starts Monday. I'll be posting daily updates.

What I'm asking:

- Anyone tried something similar? What broke first?
- Tips for keeping the autonomy/override balance right? Where do you draw the line between "AI leads" and "I override"?
- Prompt patterns that worked for you in multi-agent setups?
- Things you wish you'd known before handing control over?

Honest takes welcome — including "this is a terrible idea because X".

submitted by /u/keebrev-t
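A multi-agent setup like the one described, with a PA orchestrator dispatching to specialist agents, is at its core often just a router. A minimal hypothetical sketch, with stub agents standing in for the real calendar/tasks/etc. agents:

```python
def calendar_agent(request: str) -> str:
    return f"[calendar] scheduled: {request}"

def tasks_agent(request: str) -> str:
    return f"[tasks] queued: {request}"

# Registry of specialist agents, keyed by intent
AGENTS = {"calendar": calendar_agent, "tasks": tasks_agent}

def orchestrate(intent: str, request: str) -> str:
    """PA-orchestrator pattern: route each request to the right specialist."""
    agent = AGENTS.get(intent)
    if agent is None:
        return f"[orchestrator] no agent for intent '{intent}', handling directly"
    return agent(request)

print(orchestrate("calendar", "gym at 7am"))  # → [calendar] scheduled: gym at 7am
```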
I built the first AI memory system that mathematically cannot store lies
Your AI remembers wrong things and nobody checks. Every "AI memory" tool stores whatever your LLM generates. Hallucinations sit right next to real knowledge. Three months later, your AI retrieves that hallucination as if it were fact and builds an entire feature on it. I got tired of this. So I built something different.

EON Memory is an MCP server with one rule: nothing gets stored without passing 15 truth tests first.

WHAT THE 15 TESTS ACTUALLY CHECK:

- Logic layer (4 tests): Self-contradiction detection. Does the new memory conflict with what you already stored? Is it internally coherent? Does it hold up under scrutiny?
- Ethics layer (5 tests): Does the content contain deceptive patterns? Coercive language? Harmful intent? We use a mathematical framework called X-Ethics with four pillars scored multiplicatively: Truth x Freedom x Justice x Service. If any pillar is zero, total score is zero. The system literally cannot store it.
- Quality layer (6 tests): Is there enough technical detail to be useful? Could another AI actually write code from this memory in 6 months? Are sources cited? We score everything Gold, Silver, Bronze, or Review.

THE FORMULA BEHIND X-ETHICS:

L = (W x F x G x D) x X²

- W = Truth score (deception detection, hallucination patterns)
- F = Freedom score (coercion detection)
- G = Justice score (harm detection, dignity)
- D = Service score (source verification)
- X = Truth gradient (convergence toward truth, derived from axiom validation)

X² means truth alignment is rewarded exponentially. A slightly deceptive memory does not get a slightly lower score - it gets crushed. This is not a content filter. This is math. The axioms are from a formal framework (Traktat X) that proves truth-orientation is logically necessary. Denying truth uses truth. The framework is self-sealing.

CONNECTED KNOWLEDGE: Every memory is semantically linked.
Search for "payment bug" and you get the related architecture decisions, the Stripe webhook fix, and the test results - with similarity percentages. Your AI sees the full graph, not isolated documents.

SETUP: `npx eon-memory init`

Works with Claude Code, Cursor, any MCP IDE. Swiss-hosted, DSGVO-compliant. 3,200+ memories validated in production. CHF 29/month. Free trial: https://app.ai-developer.ch

Solo developer, Swiss-made. Happy to answer questions about the math, the validation pipeline, or anything else.

submitted by /u/FortuneOk8153
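The multiplicative scoring rule the post describes, where any zero pillar zeroes the total and the truth gradient enters squared, can be sketched directly from its formula. Function and variable names here are mine; the actual EON implementation is not shown in the post:

```python
def x_ethics_score(w: float, f: float, g: float, d: float, x: float) -> float:
    """L = (W * F * G * D) * X^2, per the post's formula.
    Pillars assumed in [0, 1]; one zero pillar zeroes everything."""
    return (w * f * g * d) * x ** 2

print(x_ethics_score(0.9, 0.9, 0.9, 0.9, 0.9))  # a mostly-truthful memory still scores
print(x_ethics_score(0.0, 1.0, 1.0, 1.0, 1.0))  # → 0.0: one zero pillar kills it
```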
I ran 3 experiments to test whether AI can learn and become "world class" at something
I will write this by hand because I am tired of using AI for everything and because of reddit rules.

TL;DR: Can AI somehow learn like a human to produce "world-class" outputs for specific domains? I spent about $5 and 100s of LLM calls. I tested 3 domains, with the following observations/conclusions:

A) Code debugging: AIs are already world-class at debugging, and trying to guide them results in worse performance. Dead end.
B) Landing page copy: a routing strategy depending on visitor type won over a one-size-fits-all prompting strategy. Promising results.
C) UI design: Producing "world-class" UI design seems to require defining a design system first; it seems like it can't be one-shotted. One-shotting designs defaults to generic "tailwindy" UI, because that is the design system the model knows. Might work but needs more testing with a design system.

I have spent the last days running some experiments, more or less compulsively and curiosity-driven. The question I was asking myself first is: can AI learn to be "world-class" somewhat like a human would? Gathering knowledge, processing, producing, analyzing, removing what is wrong, learning from experience, etc. But compressed into hours (aka "I know Kung Fu"). To be clear, I am talking about context engineering, not finetuning (I don't have the resources or the patience for that).

I will mention "world-class" a handful of times. You can replace it with "expert" or "master" if that seems confusing. Ultimately, I mean the ability to generate "world-class" output. I was asking myself this because I figure AI output out of the box kinda sucks at some tasks, for example, writing landing copy.

I started talking with claude, and I designed and ran experiments in 3 domains, one by one: code debugging, landing copy writing, UI design. I relied on different models available in OpenRouter: Gemini Flash 2.0, DeepSeek R1, Qwen3 Coder, Claude Sonnet 4.5. I am not going to describe the experiments in detail because everyone would go to sleep; I will summarize and then provide my observations.

EXPERIMENT 1: CODE DEBUGGING

I picked debugging because of zero downtime for testing. The result is either wrong or right and can be checked programmatically in seconds, so I can perform many tests and iterations quickly. I started with the assumption that a prewritten knowledge base (KB) could improve debugging. I asked claude (opus 4.6) to design 8 realistic tests of different complexity, then I ran:

- bare model (zero shot, no instructions, "fix the bug"): 92%
- KB only: 85%
- KB + multi-agent pipeline (diagnoser, critic, resolver): 93%

What this shows is kinda surprising to me: context engineering (or, to be more precise, the context engineering in these experiments) is at best a waste of tokens. And at worst it lowers output quality. Current models, not even SOTA like Opus 4.6 but current low-budget best models like gemini flash or qwen3 coder, are already world-class at debugging. And giving them context engineered to "behave as an expert", basically giving them instructions on how to debug, harms the result. This effect is stronger the smarter the model is.

What does this suggest? That if a model is already an expert at something, a human expert trying to nudge the model based on their opinionated experience might hurt more than it helps (plus consuming more tokens). And, funny (or scary) enough, a domain-agnostic person might be getting better results than an expert, because they are letting the model act without biasing it. This might be true as long as the model has the world-class expertise encoded in the weights.

So if this is the case, you are likely better off if you don't tell the model how to do things. If this trend continues, if AI continues getting better at everything, we might reach a point where human expertise might be irrelevant or a liability. I am not saying I want that or don't want that. I just say this is a possibility.

EXPERIMENT 2: LANDING COPY

Here, since I can't and don't have the resources to run actual A/B testing experiments with a real audience, what I did was:

- Scrape documented landing copy conversion cases with real numbers: Moz, Crazy Egg, GoHenry, Smart Insights, Sunshine.co.uk, Course Hero
- Deconstruct the product or target of the page into a raw and plain description (no copy, no sales)
- Ask claude opus 4.6 to build a judge that scores the outputs in different dimensions

Then I ran landing copy generation pipelines with different patterns (raw zero shot, question first, mechanism first...). I'll spare the details; ask if you really need to know. I'll jump into the observations:

Context engineering helps writing landing copy of higher quality, but it is not linear. The domain is not as deterministic as debugging (it fails or it breaks). It is much more dependent on the context. Or one may say that in debugging all the context is self-contained in the problem itself, whereas in landing writing you have to provide it. No single config won across all products. Instead, the
A geopolitical news comparison site made with Claude
I built a website with Claude that collects top geopolitical news from various sources every six hours, groups them by macro-events and individual stories, compares the stories by analyzing the texts, and evaluates the alignment of each article. The idea is to show how war is also fought in the information field.

Claude helped me in so many ways. I don't know how to write code, so he did everything for that, both frontend and backend. He also guided me through the site deployment, something I'd never done before. When it came to content selection, I contributed the most to perfecting the algorithm Claude created, but once he had my guidance, he solved all the problems.

You can see the project here; of course it's free: www.warframes.ai

Furthermore, there's a Twitter account, also automatically managed by Claude, that posts the site's main news for each news collection cycle: https://x.com/warframesai

This is my first experiment with Claude; I'd love to hear your feedback.

submitted by /u/Whole-Tax-6419
View originalBased on user reviews and social mentions, the most common pain points are: raised, large language model, llm, foundation model.
Based on 44 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.
The Rundown AI
Newsletter at The Rundown AI
3 mentions