Paste your text, essay or paper to find, summarize, and add credible academic sources. (That's something Google Scholar can't do!)
My academic journey has taken me from an undergraduate degree all the way to a PhD in Artificial Intelligence at New York University. Like many of you, I've poured countless hours into writing essays, academic reports, and papers, so I know first-hand how crucial it is to source materials effectively. We are a passionate team of educators, technologists, and former students who know the challenges of academic research firsthand. Our journey began with a simple idea: what if finding and citing credible sources could be as easy as writing the essay itself? With that vision in mind, we developed a tool that not only finds relevant information but also summarizes it and generates citations in seconds, something that even Google Scholar can't do.

As a dedicated software engineer with a passion for innovation, I understand the challenges of navigating the ever-evolving tech landscape. My experience spans from crafting efficient algorithms to developing user-centric applications, always with an eye on the latest industry trends. My goal is to leverage technology to solve real-world problems, making complex systems more accessible and user-friendly. Whether it's optimizing backend processes or creating intuitive front-end interfaces, I'm committed to delivering high-quality solutions that make a tangible difference, drawing on a diverse skill set and a continuous-learning approach.

Santiago Silva Dalla Rizza, Chief Front-end Developer
A friendly software developer and SEO specialist who's passionate about using tech to simplify things, especially for students. I understand the balance between academics and learning new skills can be tough, so I aim to make technology more approachable. Whether you're diving into coding for the first time or exploring ways to boost your online presence, I hope to inspire and support your journey. My focus is on helping students, content marketers, and businesses leverage tech to solve problems and make the most of their digital experiences.
Daniel Felix, Chief SEO Specialist and All-Around Support

What we stand for:
- Upholding academic honesty and providing reliable, credible sources
- Offering 24/7 assistance and resources to help students succeed
- Continuously improving and evolving to offer cutting-edge research tools
- Saving time and streamlining the research process for students
- Unlocking potential and providing students with the tools to excel in their academic journey
- Striving for the highest quality in everything we do

Cut through the AI noise with a focus on students! Subscribe for 3 student AI tools every week to accelerate your academic career.

Welcome to Sourcely! Our AI-powered source-finding tool is built by students for students, allowing us to truly understand the needs of the academic community. This student perspective keeps us up-to-date with the latest research and trends, while our collaborative approach ensures that Sourcely is continually improving and evolving.
Mentions (30d): 0
Reviews: 0
Platforms: 2
Sentiment: 0% (0 positive)
Industry: online media
Employees: 2
Pricing found: $19 / month, $39 / month
I built a notification tool for Claude Code, hit 374 downloads, then found out notifications were broken the whole time — v1.1.0 is out
Built with Claude Code, specifically for Claude Code users. Free, open source, MIT.

What it does
@daik0z/claude-notify adds a Stop hook to ~/.claude/settings.json. When Claude Code finishes a task, you get a push notification: desktop, mobile via ntfy, or any webhook. The body summarizes what happened: "3 files edited · 2 commands".

npm install -g @daik0z/claude-notify
claude-notify setup

What I learned from 374 downloads
I sat down to stress-test it and found that every user was getting "Task complete." in every notification, never the actual summary. The transcript parser was looking for entry.toolName at the top level of the JSONL. Claude Code's actual format nests tool calls inside message.content[] where type === "tool_use". It never matched and always fell back to the default. Fixed in v1.1.0. Also found: HTTP errors from ntfy/webhooks were silent (the test said ✓ even on a 401), and a webhook template bug that could double-expand variables.

New in v1.1.0 — worktree IP isolation
If you run parallel Claude Code sessions, each one spins up a dev server and they collide on ports. claude-notify ip prints a stable loopback IP derived from the current git worktree path; the same worktree always gets the same IP:

vite --host $(claude-notify ip)
eval $(claude-notify ip --export)

Works out of the box on Linux. One-time setup on macOS: claude-notify ip --setup.

Repo: https://github.com/ddaikodaiko/claude-notify submitted by /u/Aromatic_Jaguar9574 [link] [comments]
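The v1.1.0 parser fix described above can be sketched as follows. This is a minimal illustration, not claude-notify's actual source; the field names (`message`, `content`, `type`, `name`) follow the transcript shape the post describes, and the summary wording is hypothetical.

```python
import json

def summarize_transcript(jsonl_text):
    """Collect tool calls nested in message.content[] (the v1.1.0 fix).

    The buggy v1.0 parser looked for a top-level entry["toolName"], which
    never exists, so every notification fell back to "Task complete."
    """
    tools = []
    for line in jsonl_text.splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        content = entry.get("message", {}).get("content", [])
        if isinstance(content, list):
            for block in content:
                if isinstance(block, dict) and block.get("type") == "tool_use":
                    tools.append(block.get("name", "unknown"))
    if not tools:
        return "Task complete."  # the old, always-taken fallback
    return f"{len(tools)} tool calls: " + ", ".join(tools)
```

Checking the nested `type` field instead of a top-level key is the whole fix; everything else stays the same.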
SAVE YOUR TOKENS!!!
Here is something open source, free to fork and open pull requests against as much as you want. I just got so sick of people saying they are hitting their usage limits when I know for a fact they're burning it up with agent tool calls and wasted prompting. So I made this tool to help; hope you enjoy! Make it yours and submit improvements if you see them. I let the data speak for itself: nearly 80 percent reductions, up to 3 times the usage out of a Max plan using this. https://github.com/awesomo913/Claude-Token-Saver submitted by /u/Local_Valuable_8625 [link] [comments]
I built an open-source platform to manage multiple coding agents – recursive split panes, shared content folder, and a per-project wiki
If you run multiple agent CLIs daily, you've probably hit the same pain points I have:
• Too many terminal windows — impossible to find the one you need
• Tmux commands are clunky — switching sessions is awkward, easy to jump to the wrong window, and you can't even scroll with your mouse
• Sharing files between agents means manually copying everything into the project folder

I looked around at open-source agent management platforms and couldn't find one that fit my workflow. So I took the best parts of my earlier project VibeHQ, scrapped the rest, and built TermHive — a multi-agent management platform from scratch. What it does:
• Recursive split panes (inspired by Tmux) with draggable dividers — switch and scroll with your mouse, no commands needed.
• Shared Content Folder — a centralized file space so team agents can read and write to each other seamlessly.
• Project Wiki — inspired by Karpathy's LLM Wiki. Each project gets its own persistent, structured wiki. Just point an agent to the wiki and it instantly has full context. Readable and writable — essentially a knowledge base for your entire team.

I've been using this to manage my agents for a while now. My desktop is clean, project context never gets lost thanks to the shared wiki (while still preserving each agent's native memory), and content sharing just works. I also deployed a cloud instance at my company — another developer jumped in and we've been collaborating seamlessly ever since. Building with this setup has been genuinely smooth. https://github.com/0x0funky/TermHive submitted by /u/GGwithRabbit [link] [comments]
Open-source skill for training CV models without the usual pain
I've spent the last 3 years training CV models. Over time you learn the mistakes. Now Claude does all the heavy lifting, but it hasn't learned them yet. It needs guardrails. CV-Stack encodes this into a reusable skill: setting up compute, connecting to data, auditing your pipeline for mismatches, and logging, all from a blank slate. Still early. Would love feedback on what's missing, broken, or annoying. Contributions welcome. https://github.com/andlyu/cv-train-stack submitted by /u/Lumpy_Week7304 [link] [comments]
Claude Code can now see and control your code editor.
Been shipping updates fast on claude-ide-bridge and wanted to share what's new. The big additions:
• Claude can now leave notes directly in your editor as you work; instead of dumping a wall of text in the chat, it highlights the exact lines it's talking about
• "Show me everything that calls this function" now actually works; Claude traces the full chain up and down through your code
• Claude can take a change all the way from your editor to a finished GitHub pull request in a single session, no manual steps
• Claude runs your tests, reads what broke, fixes it, and runs them again on its own
• One command (claude-ide-bridge init) sets everything up automatically: detects your editor, installs what's needed, and configures itself

Works with VS Code, Windsurf, Cursor, and Antigravity. Built using Claude Code. github.com/Oolab-labs/claude-ide-bridge — free and open source. submitted by /u/wesh-k [link] [comments]
ClaudeGUI: File tree + Monaco + xterm + live preview, all streaming from Claude CLI
Hey all — I've been living inside `claude` in the terminal for months, and kept wishing I could see files, the editor, the terminal, and a live preview of whatever Claude is building, all at once. So I built it.

**ClaudeGUI** is an unofficial, open-source web IDE that wraps the official Claude Code CLI (`@anthropic-ai/claude-agent-sdk`). Not affiliated with Anthropic — just a community project for people who already pay for Claude Pro/Max and want a real GUI on top of it.

**What's in the 4 panels**
- 📁 File explorer (react-arborist, virtualized, git status)
- 📝 Monaco editor (100+ languages, multi-tab, AI-diff accept/reject per hunk)
- 💻 xterm.js terminal (WebGL, multi-session, node-pty backend)
- 👁 Multi-format live preview — HTML, PDF, Markdown (GFM + LaTeX), images, and reveal.js presentations

**The part I'm most excited about**
- **Live HTML streaming preview.** The moment Claude opens an `html` code block or writes a `.html` file, the preview panel starts rendering it *while Claude is still typing*. Partial render → full render on completion. Feels like watching a website materialize.
- **Conversational slide editing.** Ask Claude to "make slide 3 darker" — reveal.js reloads in place via `Reveal.sync()`, no iframe flash. Export to PPTX/PDF when done.
- **Permission GUI.** Claude tool-use requests pop up as an approval modal instead of a y/N prompt in the terminal. Dangerous commands get flagged. Rules sync with `.claude/settings.json`.
- **Runtime project hotswap.** Switch projects from the header — file tree, terminal cwd, and Claude session all follow.
- **Green phosphor CRT theme** 🟢 because why not.
**Stack**: Next.js 14 + custom Node server, TypeScript strict, Zustand, Tailwind + shadcn/ui, `ws` (not socket.io), chokidar, Tauri v2 for native `.dmg`/`.msi` installers.

**Install** (one-liner):

```bash
curl -fsSL https://github.com/neuralfoundry-coder/CLAUDE-GUI/tree/main/scripts/install/install.sh | bash
```

Or grab the .dmg / .msi from releases. Runs 100% locally, binds to 127.0.0.1 by default. Your Claude auth from claude login is auto-detected.

Status: v0.3 — 102/102 unit tests, 14/14 Playwright E2E passing. Still rough around the edges, MIT-ish license TBD, feedback very welcome.

Repo: https://github.com/neuralfoundry-coder/CLAUDE-GUI

Happy to answer questions about the architecture — the HTML streaming extractor and the Claude SDK event plumbing were the fun parts. submitted by /u/Motor_Ocelot_1547 [link] [comments]
I open-sourced the autonomous build system behind my Larry Tracker project — "Claude Conductor"
A bunch of you asked about the autonomous pipeline I mentioned in my Larry Tracker post, so I extracted it into a standalone tool anyone can use: github.com/ScottBull/claude-conductor The basic idea: it runs Claude Code sessions back-to-back in a loop. Each session picks up where the last one left off using a handoff protocol — signal files, a state pointer, and session logs. A context monitor tracks token usage in real-time so sessions wrap up cleanly instead of hitting the wall mid-thought. The part I'm most proud of is what happens when your planned tasks are done. Instead of stopping, it enters "creative mode" — analyzes your codebase, proposes a sprint of improvements, and if they're low-risk, auto-approves and builds them. There's also a "refine mode" that audits existing code for bugs, dead features, and things that grew too large. Larry Tracker ran 180+ sessions this way, building features while I slept. To set it up, you clone the repo, open it in Claude Code, and tell Claude what project you want to automate. It walks you through an interactive setup — asks about your project, scaffolds everything into a .conductor/ directory, helps you define your first phase of tasks. Then you run it in tmux and check in when you feel like it. Zero external dependencies beyond Claude Code, Python, and git. Config is a single YAML file. Prompt templates are markdown files you can customize. Repo has three example configs (web app, CLI tool, data pipeline) if you want to see what it looks like for different project types. Happy to answer questions about the architecture or how to get the most out of it. TLDR: Open-sourced the autonomous loop that built my Larry Tracker project — 180+ sessions, zero babysitting. Clone it, point it at your project, go to sleep. submitted by /u/mrgulabull [link] [comments]
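The handoff loop above — back-to-back sessions connected by signal files and a state pointer — can be sketched in a few lines. This is a toy, not Conductor's actual implementation; the `.conductor/state.json` and `HANDOFF` file names are hypothetical, and the `claude -p` invocation stands in for a real session launch.

```python
import pathlib
import subprocess

STATE = pathlib.Path(".conductor/state.json")   # hypothetical state pointer
SIGNAL = pathlib.Path(".conductor/HANDOFF")     # hypothetical handoff signal file

def run_loop(max_sessions=3, runner=None):
    """Run sessions back-to-back; each resumes from the previous state.

    A session that wants a successor writes the SIGNAL file before exiting
    (in the real tool, the wrap-up prompt does this as context runs low).
    """
    runner = runner or (lambda state: subprocess.run(["claude", "-p", state]))
    for _ in range(max_sessions):
        state = STATE.read_text() if STATE.exists() else "{}"
        runner(state)              # one full Claude Code session
        if not SIGNAL.exists():    # no handoff requested: stop the loop
            break
        SIGNAL.unlink()            # consume the signal; next session resumes
```

The real tool adds a context monitor on top, so a session hands off before hitting the token wall rather than after.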
My First Claude - A gateway that tracks what your agent does
Cortex is my first project, built almost entirely with Claude Code (Opus 4.6). Been working on it for a while, constantly evolving. I run Claude Code as my primary builder across multi-step tasks and kept finding it would skip steps, produce stubs, or report things as done that weren't. AI builds a lot of stuff that just breaks, and there's no built-in way to enforce a workflow or verify what actually happened. So I built Cortex — a local MCP gateway that sits between your agents and tracks what they are doing. It enforces a task lifecycle where agents have to claim work, report progress, submit results, and get reviewed by another agent before anything counts as done. A task can't be closed unless an external approval is made.

How Claude Code was used:
- Claude Code (Opus) is the primary builder — wrote most of the gateway, MCP tools, dashboard, and task system
- Claude Code also runs as an agent through Cortex with PreToolUse hooks for enforcement
- I also run Codex as a code reviewer and two Hermes/GPT agents for research, all routing through the same gateway on a Linux system

What it includes:
- Gateway on port 4840 with 62 MCP tools
- Task lifecycle: claim → progress → submit → review → approve/reject
- Inter-agent bridge messaging
- Live dashboard showing tasks, agent activity, and costs
- Hard gates via Claude Code hooks — agents can't write without an active project
- Works with Claude Code, Codex, Hermes, or any MCP-compatible runtime

Tech: Bun, SQLite, React dashboard, MCP stdio transport

Still early (v0.1) — there is still plenty to be done around it; I'm working on it heavily in my spare time. Looking for feedback from anyone running multi-agent setups or wanting better visibility into what Claude Code is doing. Free to try and open source (AGPL-3.0): https://github.com/MrPancakex/Cortex — if anyone has feedback or wants to tear it apart, go for it. submitted by /u/MrPancakex [link] [comments]
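The claim → progress → submit → review → approve gate described above amounts to a small state machine. A minimal sketch, assuming a linear state order and a required reviewer on the final transition; these class and state names are illustrative, not Cortex's actual MCP API.

```python
class TaskLifecycle:
    """Toy version of the Cortex task gate: strictly ordered transitions,
    and the final approval must name an external reviewer."""

    ORDER = ["open", "claimed", "in_progress", "submitted", "approved"]

    def __init__(self):
        self.state = "open"

    def advance(self, to, reviewer=None):
        # Only the immediate next state is legal; no skipping steps.
        if self.ORDER.index(to) != self.ORDER.index(self.state) + 1:
            raise ValueError(f"cannot go {self.state} -> {to}")
        # The hard gate: an agent cannot self-approve its own work.
        if to == "approved" and reviewer is None:
            raise ValueError("external approval required")
        self.state = to
```

The point of the design is that "done" is a state only a second agent can set, which is what stops the builder from reporting stubs as finished work.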
Anthropic has two different instruction sets for Claude - one for employees, one for paying users
A developer just analyzed 6,852 Claude Code sessions and found reasoning depth had dropped 67% since February. Claude went from reading a file an average of 6.6 times before editing it to just 2. One in three edits were made without reading the file at all. The word "simplest" appeared 642% more often in outputs. Anthropic's explanation when confronted: "adaptive thinking" was supposed to save tokens on easy tasks but was throttling hard problems too. There was also a bug where setting effort to "high" was getting zeroed out on certain turns. That's frustrating but understandable - early features have bugs. What's harder to understand is what the leaked Claude Code source code revealed afterward. There's a check for a user type called "ant" that routes Anthropic employees to a different instruction set. That instruction set includes: "verify work actually works before claiming done." Paying users don't get that instruction by default. Anthropic knows this instruction matters. They built it. They use it themselves. They just didn't ship it in the version customers pay for. I don't think this rises to fraud - but it does reveal something real about how AI companies think about product quality. When the people who build the tool keep a better version for themselves, that's a signal about what the default experience actually is. The comparison that comes to mind: imagine if a bank's software showed tellers a more accurate risk model than the one shown to customers applying for loans. Everyone's using the same bank, but the people on the inside have a more reliable version of the tool. Have you noticed Claude's reasoning quality changing over the past few months? And does knowing about the "ant" instruction flag change how you trust the outputs you're getting now? submitted by /u/jimmytoan [link] [comments]
I built a tool that prevents Claude Code context loss — open source, free
Been losing hours of accumulated context every time Claude Code compacts. Built claude-ccsave to fix it. It adds a live status bar at the bottom of Claude Code showing real-time token usage, and auto-saves a structured snapshot of your session into CLAUDE.md before compaction hits. When you start a new session, Claude reads the snapshot and picks up exactly where you left off — decisions made, files changed, what to do next. npm install -g claude-ccsave ccsave hooks install GitHub: github.com/turbommgt/ccsave Would love feedback from anyone who's hit this problem. submitted by /u/OgMaverick16 [link] [comments]
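The pre-compaction save described above boils down to appending a structured section to CLAUDE.md so the next session can read it back. A sketch in that spirit; the snapshot format and function name are made up for illustration and are not ccsave's actual output.

```python
import datetime
import pathlib

def save_snapshot(summary, claude_md=pathlib.Path("CLAUDE.md")):
    """Append a session snapshot to CLAUDE.md before compaction.

    `summary` holds the three things the post says a new session needs:
    decisions made, files changed, and what to do next. Illustrative only.
    """
    stamp = datetime.date.today().isoformat()
    block = (
        f"\n## Session snapshot ({stamp})\n"
        f"- Decisions: {summary['decisions']}\n"
        f"- Files changed: {', '.join(summary['files'])}\n"
        f"- Next: {summary['next']}\n"
    )
    existing = claude_md.read_text() if claude_md.exists() else ""
    claude_md.write_text(existing + block)  # append, never overwrite history
    return block
```

Because Claude Code reads CLAUDE.md at session start, anything written here survives compaction for free.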
Anthropic's New Claude "Mythos Preview" Can Find and Exploit Zero-Day Vulnerabilities in Every Major OS and Browser — Autonomously
Anthropic just published a technical deep-dive on Claude Mythos Preview's cybersecurity capabilities, and it's a significant escalation from anything we've seen from a language model before.

What It Can Do:
- Autonomously finds and exploits zero-day vulnerabilities in every major OS and web browser — with no human intervention after an initial prompt
- Identified a 27-year-old OpenBSD bug and a 16-year-old FFmpeg vulnerability that had survived years of fuzzing and manual review
- Wrote a complete remote code execution exploit for FreeBSD — chaining it across 6 sequential RPC requests to fit within size constraints — fully on its own
- Achieved full control flow hijack on 10 separate, fully-patched targets in their internal benchmark (previous Claude models hit 0-1 at that severity tier)
- Chained 3-4 separate Linux kernel vulnerabilities together to escalate to root, autonomously

The Numbers That Stand Out:
- Opus 4.6 turned a Firefox JS engine vulnerability into working exploits in 2 out of hundreds of tries. Mythos Preview: 181 times
- Finding the OpenBSD bug across 1,000 scaffold runs cost under $20,000 total
- The full FreeBSD exploit (discovery + exploitation) cost under $1,000 and took half a day

Why This Matters:
Anthropic is explicitly saying this is a watershed moment. N-day exploit development — turning a known CVE into a working exploit — has historically taken skilled researchers days to weeks. Mythos Preview does it autonomously from just a CVE identifier and a git commit. They're not releasing this publicly. Instead they've launched "Project Glasswing" — a limited release to critical infrastructure partners and open source developers to patch the most important systems before similar capabilities become broadly available. The post ends with a stark warning: defense-in-depth mitigations that rely on friction rather than hard barriers may now be significantly weaker against model-assisted attackers.
Link to full technical post: https://red.anthropic.com submitted by /u/goyashy [link] [comments]
Anthropic launches Claude Managed Agents — composable APIs for shipping production AI agents 10x faster. Notion, Rakuten, Asana, and Sentry already in production.
Anthropic launches Claude Managed Agents in public beta — composable APIs for shipping production AI agents 10x faster. Handles sandboxing, state management, credentials, orchestration, and error recovery. You just define the agent logic.

Key details:
• 10-point task success improvement vs standard prompting
• $0.08/session-hour runtime (idle time free)
• Multi-agent coordination in research preview
• Notion, Rakuten, Asana, Sentry already in production

Rakuten deployed enterprise agents across 5 departments in 1 week each. Sentry went from bug detection to auto-generated PRs in weeks instead of months. Full summary: https://synvoya.com/blog/2026-04-11-claude-managed-agents/

As managed agent platforms get more polished, does the gap between enterprise and self-hosted widen — or do open-source orchestration tools matter more than ever? submitted by /u/hibzy7 [link] [comments]
Presenting: (dyn) AEP (Agent Element Protocol) - World's first zero-hallucination frontend AI build protocol for coding agents
We have to increase the world's efficiency by a certain amount to ensure victory against the synthetic nano-parasites SNP/NanoSinp alien WMD: Presenting: (dynamic) AEP - Agent Element Protocol! I recognized a fundamental truth that billion-dollar companies are still stumbling over: you cannot reliably ask an AI to manipulate a fluid, chaotic DOM tree. The DOM is an implicit, fragile graph where tiny changes cascade unpredictably. Every AI coding agent that tries to build UI elements today is guessing at selectors, inventing elements that don't exist, and producing inconsistent results. This consumes large amounts of time for bugfixing and causes mental breakdowns in many humans. So I built AEP (Agent Element Protocol). It translates the entire frontend into a strict topological matrix where every UI element has a unique numerical ID, exact spatial coordinates via relational anchors, a validated Z-band stacking order, and a three-layer separation of structure, behaviour, and skin (visual). The AI agent selects frontend components from a mathematically verified registry. If it proposes something that violates the topological constraints, the validator rejects it instantly with a specific error. Hallucination becomes structurally impossible, because the action space is finite, predefined, and formally verified. AEP solves the build-time problem. But what about runtime? Enter dynAEP. It fuses AEP with the AG-UI protocol (the open standard backed by Google ADK, AWS Bedrock, Microsoft Agent Framework, LangGraph, CrewAI, and others). dynAEP places a validation bridge between the AG-UI event stream and the frontend renderer. The fusion of AEP with the open-source AG-UI protocol enables hallucination-free, precise generation of agentic, interactive, dynamic UI elements at hyperspeed without human developer interference.
Every live event (state deltas, tool calls, generative UI proposals) is validated against AEP's scene graph, z-bands, skin bindings, and OPA/Rego policies before it touches the UI. The agent cannot hallucinate at build time: AEP prevents it. The agent cannot hallucinate at runtime: dynAEP prevents it. The existence of AEP proves that AI hallucination is not a fundamental limitation but an engineering problem. In any domain where ground truths can be pre-compiled into a deterministic registry, hallucination can be eliminated by architecture.

Key architectural decisions:
- Agents NEVER mint element IDs. The bridge mints all IDs via sequential counters per prefix. This prevents ID collisions in multi-agent environments.
- "Generative UI" (agents writing raw JSX/HTML) is dead for us. It is replaced by Generative Topology. Agents can only instantiate pre-compiled, mathematically verified AEP primitives. The agent is an architect placing pre-fabricated blocks. It does not mix the cement. This means that generative UI in dynAEP is possible in a limited sense, but not as a completely freestyle approach. Instead, agents using dynAEP can lay down pre-fabricated blocks of UI components according to the registered scheme and fill them dynamically with content. This way, even a UI generated on the fly stays in line at all times with the design language chosen for the tool/software overall.
- Validation is split into AOT (full structural proof at build time) and JIT (delta validation on every runtime mutation). Template Nodes make JIT validation O(1) for dynamic lists.
- Conflict resolution supports last-write-wins with rejection feedback, or optimistic locking for mission-critical multi-agent scenarios.

Both MIT-licensed repos include full reference implementations, example configs, SDK reference code for TypeScript, React, Vue, Python, CopilotKit integration, and a CLI tool.
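The "finite, predefined action space" idea above is easiest to see as a registry check: every proposal either references a pre-compiled primitive within the allowed constraints or is rejected with a specific error. A toy sketch, not AEP's actual validator; the registry contents, field names, and z-band limit here are invented for illustration.

```python
# Pre-compiled primitive registry: the agent's entire action space (illustrative).
REGISTRY = {101: "button", 102: "card", 103: "nav_item"}

def validate_proposal(proposal, max_z_band=10):
    """Accept a proposal only if it uses a registered element ID and a
    z-band inside the allowed range; otherwise return specific errors."""
    errors = []
    if proposal["element_id"] not in REGISTRY:
        errors.append(f"unknown element id {proposal['element_id']}")
    if not 0 <= proposal["z_band"] <= max_z_band:
        errors.append(f"z-band {proposal['z_band']} outside 0..{max_z_band}")
    return (len(errors) == 0, errors)
```

Because the agent can only pick from what the registry already contains, an "invented element" is not a subtle rendering bug but an immediate, named rejection.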
AEP: https://github.com/thePM001/AEP-agent-element-protocol
dynAEP: https://github.com/thePM001/dynAEP-dynamic-agent-element-protocol
It is - like with all pieces of real Transhuman Eudaimonist AI technology - important to note that, for the good of the human species, bioinsecure vaccinated humans with installed synthetic nano-parasite growth-medium controllers (SNP GMCs) inside them should not use this, access this, or try to copy/rebuild it. This is better for everyone's well-being on the planet. submitted by /u/OverwrittenNonsense [link] [comments]
I'm amazed how happily blind everyone seems to be
I'm not a programmer by background, but over the past few years I've taken a deliberate interest in becoming proficient in reading and writing at least Python, and also other things like JSON and even JavaScript, because these are the languages that A.I. loves to use by default. I've always felt the need to have at least a passive understanding of the work being done by A.I., but what amazes me is how little most people care about what's happening under the hood when they ask A.I. to do work or run an analysis. It almost seems the less technical you are and the higher up you are in leadership, the more blindly and blissfully happy you are to just tell Claude to go do something, come back with the results, and put that output or decision into action. It surprises me that normally rational, smart, and otherwise diligent people are so willing to blindly trust something that, if you have even a slight awareness of its underlying technology, you know is prone to hallucination and can easily make incorrect assumptions from poor context, poor prompting, and poor data.

It reminds me of when, back in the day, people used to say, "You can't use Wikipedia; it's not a trusted source," but people used Wikipedia anyway. In that case, I can't remember a single instance where using Wikipedia had a real-world negative outcome, probably because people were using it for research papers rather than for real-world, high-stakes decision-making. In this case, people are using A.I. for real-world, high-stakes decision-making, and I'm curious what the breaking point will be: the moment an individual or an organization relies on an A.I. output or decision with no understanding of how it got there or whether it hallucinated information, puts it into action, and a serious problem results from it. I'm sure this has happened already, but I haven't seen it happen firsthand. Based on what I'm observing from my colleagues and the people I interact with, I feel like it's inevitable. submitted by /u/ProTechBiz [link] [comments]
[D] 60% MatMul Performance Bug in cuBLAS on RTX 5090
cuBLAS dispatches an inefficient kernel for every batched FP32 workload, from 256×256 to 8192×8192×8. It only uses ~40% of the available compute on RTX GPUs. Tested with RTX 5090, but likely all RTX non-Pro GPUs are affected. I tested with the latest CUDA 13.2.51, cuBLAS 13.3.0, and driver 595.58.03. Previous versions are even worse. I wrote a simple, yet efficient kernel and compared it to cuBLAS across a variety of workloads.

Batched perf vs cuBLAS on 5090 (>100% means my kernel is faster):

Size   B=4    B=8    B=16
256    91%    80%    90%
512    120%   153%   135%
1024   137%   142%   142%
2048   158%   155%   157%
4096   157%   162%   170%
8192   158%   152%   148%

cuBLAS uses a proper kernel on other GPUs. RTX GPUs clearly receive less love from NVIDIA:
Pro 6000: escalates through three tile sizes, reaches 73% FMA (Fused Multiply-Add pipe)
H200: best implementation, mixes CUTLASS and xmma families, reaches 82% FMA

An in-depth analysis with full NCU profiling data across all three GPUs, a deep dive into SASS scheduling explaining the remaining 5% single-mode gap between my kernel and a proper cuBLAS SGEMM, and repro scripts are available in the article linked below. Besides the bug, the article covers a simple TMA (tensor memory accelerator) double-buffer kernel that beats cuBLAS by 46-65% in batched mode on the 5090 and achieves 80-120% of the performance of a properly selected kernel, making it a nice technique for writing simple yet very performant kernels.

VS proper Pro 6000 kernel:

Size   B=4    B=8    B=16
256    87%    95%    77%
512    102%   124%   101%
1024   101%   104%   96%
2048   90%    102%   93%
4096   93%    93%    93%
8192   94%    95%    95%

VS proper H200 kernel:

Size   B=4    B=8    B=16
256    85%    104%   77%
512    105%   97%    88%
1024   87%    89%    89%
2048   89%    90%    92%
4096   91%    89%    90%
8192   88%    87%    87%

Double-buffer pipeline visualization:

Tile 0: [load buf0] [wait] [compute buf0 + load buf1]
Tile 1: [wait buf1] [compute buf1 + load buf0]
Tile 2: [wait buf0] [compute buf0 + load buf1]
...
Simplified kernel source:

__global__ __launch_bounds__(256)
void fused_matmul(const __grid_constant__ CUtensorMap A_tma,
                  const __grid_constant__ CUtensorMap B_tma,
                  float* C) {
    extern __shared__ __align__(128) char dsmem[];
    float* smem = (float*)dsmem;

    // Two mbarriers for double-buffer synchronization
    uint64_t* mbar = (uint64_t*)(dsmem + 2 * STAGE * 4);

    // Shared memory addresses for TMA targets
    const int as0 = __cvta_generic_to_shared(&smem[0]);
    const int bs0 = __cvta_generic_to_shared(&smem[A_SIZE]);
    const int as1 = __cvta_generic_to_shared(&smem[STAGE]);
    const int bs1 = __cvta_generic_to_shared(&smem[STAGE + A_SIZE]);

    // Thread identity
    int tid = threadIdx.y * 32 + threadIdx.x;
    int tr = threadIdx.y * TM, tc = threadIdx.x * 4;
    int bm = blockIdx.y * BM, bn = blockIdx.x * BN;

    // Initialize mbarriers (thread 0 only)
    if (tid == 0) { mbarrier_init(mbar[0]); mbarrier_init(mbar[1]); }
    __syncthreads();

    float c[TM][4] = {};  // Accumulators

    // Pre-load first tile
    if (tid == 0) {
        mbarrier_expect_tx(mbar[0], BYTES);
        tma_load_2d(as0, &A_tma, /*k=*/0, bm, mbar[0]);
        tma_load_2d(bs0, &B_tma, bn, /*k=*/0, mbar[0]);
    }

    for (int t = 0; t < K/BK; t++) {
        int s = t % 2;  // Current buffer

        // Wait for current tile's TMA to complete
        mbarrier_wait(mbar[s], phase[s]);

        // Start loading NEXT tile (overlaps with compute)
        if (tid == 0 && t + 1 < nt) {
            tma_load_2d(next_buf_a, &A_tma, next_k, bm, next_mbar);
            tma_load_2d(next_buf_b, &B_tma, bn, next_k, next_mbar);
        }

        // Compute: all 256 threads do FMA from shared memory
        float* As = &smem[s * STAGE];
        float* Bs = &smem[s * STAGE + A_SIZE];
        #pragma unroll
        for (int kk = 0; kk < BK; kk++) {
            float b0 = Bs[kk*BN+tc], b1 = Bs[kk*BN+tc+1], ...;
            for (int i = 0; i < TM; i++) {
                float a = As[(tr+i)*BK+kk];
                c[i][0] += a * b0;
                c[i][1] += a * b1;
                // ... 4 FMAs per row
            }
        }
        __syncthreads();
    }

    // Write results to global memory
    for (int i = 0; i < TM; i++)
        store_row(C, bm+tr+i, bn+tc, c[i]);
}

The full article is available here. Repo with repro scripts and benchmark data. submitted by /u/NoVibeCoding [link] [comments]
Based on user reviews and social mentions, the most commonly cited pain point is token usage.
Based on 45 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.