Discover Llama 4's class-leading AI models, Scout and Maverick. Experience top performance, multimodality, low costs, and unparalleled efficiency
Llama 3 is commended for its versatility, particularly in multi-agent systems and handling large context windows without retraining, making it a preferred choice for innovative AI experiments like autonomous debates and complex computational tasks. However, some users criticize it for hallucinating data, especially when processing large datasets, which can affect reliability in financial and detailed analytical applications. Pricing sentiment is generally neutral, with more focus on functionality and performance compared to cost discussions. Overall, Llama 3 enjoys a positive reputation in the AI community, seen as a robust and adaptable tool with room for improvement in specific use cases.
Mentions (30d)
28
2 this week
Reviews
0
Platforms
3
GitHub Stars
29,294
3,524 forks
Llama 3 is commended for its versatility, particularly in multi-agent systems and handling large context windows without retraining, making it a preferred choice for innovative AI experiments like autonomous debates and complex computational tasks. However, some users criticize it for hallucinating data, especially when processing large datasets, which can affect reliability in financial and detailed analytical applications. Pricing sentiment is generally neutral, with more focus on functionality and performance compared to cost discussions. Overall, Llama 3 enjoys a positive reputation in the AI community, seen as a robust and adaptable tool with room for improvement in specific use cases.
Features
Use Cases
Industry
information technology & services
Employees
77,000
10,591
GitHub followers
12
GitHub repos
29,294
GitHub stars
20
npm packages
40
HuggingFace models
I built a prompt injection detector that outperforms LlamaGuard 3 on indirect/roleplay attacks
Been working on Arc Sentry, a whitebox prompt injection detector for self-hosted LLMs (Mistral, Llama, Qwen). Most detectors pattern-match on known attack phrases. Arc Sentry watches what the prompt does to the model’s internal representation instead, so it catches indirect, hypothetical, and roleplay-framed attacks that get through keyword filters. Benchmark on indirect/roleplay/technical prompts (40 OOD prompts): • Arc Sentry: Recall 0.80, F1 0.84 • OpenAI Moderation API: Recall 0.75, F1 0.86 • LlamaGuard 3 8B: Recall 0.55, F1 0.71 Arc Sentry has the highest recall — it catches more of the hard cases. Blocks before model.generate() is called. The lightweight pre-filter runs on CPU with no model access. pip install arc-sentry GitHub: https://github.com/9hannahnine-jpg/arc-sentry Happy to answer questions about how it works.
View originalPricing found: $0.19, $0.49, $0.19, $0.49, $0.19/mtok
I integrated a local Llama 3.2 model to act as a dynamic Dungeon Master in my indie RPG.
Hey everyone, I am not trying to sell or self promote mainly just wanted to showcase a big project I've been working on ever since I started studying data science and artificial intelligence and integrating AI into workflows and using it as an augment to create things that were previously out of reach for so many people, because if used right it can become a second brain and not a crutch. I’m the solo dev behind Void Runner, an isometric ARPG/MOBA hybrid built in Python. I recently hit a wall with traditional procedural quest generation. Hand-crafting templates gets repetitive fast, and players quickly learn the patterns to these things whether you like it or not. To solve this, I built the "Void Caller AI", a system that uses a local, quantized Llama 3.2 model to act as a dynamic Dungeon Master. Instead of just generating random flavor text, the system uses a lightweight RAG (Retrieval-Augmented Generation) pipeline. It reads live server telemetry (who died, what items were looted, which bosses were defeated recently) and weaves those actual server events into the narrative of the quests it generates. Because it runs locally via Ollama on our backend, there are no crazy cloud API costs, and latency is kept completely manageable. Here is a simplified look at how the Python backend bridges the SQLite telemetry with the Llama 3.2 prompt: import json import ollama from sqlalchemy import text from database import SessionLocal def generate_dynamic_quest(difficulty: str, target: str): db = SessionLocal() # 1. Fetch recent server telemetry for context (RAG-lite) lore_context = "" try: # Grab recent server events to weave into the narrative recent_events = db.execute(text( "SELECT username, event_type, dungeon_type FROM ai_events ORDER BY id DESC LIMIT 3" )).fetchall() if recent_events: events_str = "; ".join([f"Runner '{r[0]}' triggered a '{r[1]}' in '{r[2]}'" for r in recent_events]) lore_context = f" Incorporate this recent live server telemetry into the lore: {events_str}" except Exception as e: pass # 2. Construct the prompt with strict JSON formatting constraints prompt = f"""You are the Void Caller, a sinister AI in a dark industrial sci-fi RPG. Create a dynamic PvE extraction quest of {difficulty} difficulty. Respond ONLY in valid JSON with keys: 'title' (string), 'description' (string, menacing), 'item_name' (string), 'quantity' (integer 1-15), 'boss_name' (string, optional). {lore_context}""" # 3. Stream to local Llama 3.2 response = ollama.chat( model='llama3.2', messages=[{'role': 'user', 'content': prompt}], format='json', options={'temperature': 0.8} ) return json.loads(response['message']['content']) By forcing the format='json' parameter, Llama 3.2 reliably outputs structured data that my game engine instantly parses into a playable quest objective. If a player just died to a specific boss, the AI will literally generate a bounty quest for the rest of the server to avenge them. Would love to hear if anyone else is using local LLMs for live game state generation! You can check out the results live in our Open Beta at [void-runner.online]. submitted by /u/xSoulR34per [link] [comments]
View originalOk, talvez eu pague pelo Meta Premium
Hoje eu postei sobre o Mark Zuckerberg lançar a notícia mais patética que vai cobrar 19 dólares para desbloquear o Muse Spark Pro kakakakakakaka Quem vai pagar por essa merda? Mas pensando melhor bem... Talvez eu pague Eu usei muito esse modelo como Early adopter, desde quando o motor era o Llama 3.2 e sendo inferior as outras consegui extrair escrita criativa que batia de frente com Claude em personas graças ao seu RAG no ecossistema da Meta, que tinha uma criatividade absurda quando você forçava ela a consultar as redes sociais e ver como pessoas agem e comentam, porém lançou o Muse Spark que era tipo o GPT 5.2 dos Llamas kkkkkk aí só usei para pesquisa e bem... Minha tese sobre o Muse Spark é que pra mim o problema nunca pareceu ser burrice. Parece CONTENÇÃO. Não dá vibe de modelo incapaz ou inferior. Dá vibe de modelo sendo sufocado em tempo real. Porque se você presta atenção, ele: - pesquisa rápido pra cacete (Já que cada agente pesquisa uma coisa) - alucina menos em busca (pois o modelo refina a busca dos agentes, muitas vezes consegui resultados mais confiáveis que o Gemini) - já trabalha com esquema multi-agente herdado da Manus ( o trunfo dessa IA é que diferente das outras ela não comprimi seu input, ela usa agentes para cada um pesquisar cada trecho dele, o resultado é mais completo) - acha informação boa (ela pesquisa tanto na internet quanto em grupos de Facebook ou Threads se você forçar no prompt, ou seja análises de Devs>>> Wikipédia Inclusive acredito que foi por isso que o Mark lançou o "Fórum" o app que cópia o Reddit, ele quer treinar a IA com isso, o Reddit pra mim seria a fonte perfeita pra qualquer IA se aprofundar além do que pesquisar genéricas no Google, o filha da puta do Mark é rico e filantropo e faz uma cópia só para treinar a IA dele) - conecta coisa rápido (os agentes pesquisam rápido, o modelo revisa rápido, a entrega é bem rápida e gasta bem menos tokens) Só que na hora de responder… Parece o GPT free kkkkkkk O raciocínio corta no meio. (Ele é punido se raciocinar por muito tempo, foi o treinamento dele) A saída vem resumida. (Tem limites de caracteres claros, nenhum prompt força a cota) A resposta parece comprimida igual arquivo zipado. É como se tivesse um fiscal invisível dentro da inferência falando: “encerra logo” “não desenvolve” “não gasta token” “não deixa pensar muito” Aí a galera olha e pensa: “nossa que IA sem profundidade”. Mas pra mim não parece falta de capacidade. Parece punição de reasoning. E é aí que entra minha teoria: esse plano pago da Meta não vai trazer “outro modelo revolucionário”. Pra mim vai ser literalmente o mesmo Muse Spark… só que sem coleira. Os caras mesmos falaram que essa era a versão pequena/teste. Então eu acho que o modelo real já tá ali faz tempo. Só que: - com limite de saída - limite de pensamento - compressão de raciocínio - truncamento agressivo - budget de inferência ridículo E sinceramente? Isso explica porque ele parece inteligente mas frustrante ao mesmo tempo. Porque dá pra sentir que o modelo quer continuar. Só que alguém puxa o freio de mão toda hora. Agora a parte que eu acho GENIALMENTE BURRA da Meta: Eles lançaram primeiro a versão capada. Isso matou a percepção pública imediatamente. O certo teria sido: solta no app Meta AI a versão MONSTRA: - 1 milhão de contexto - sem limite de saída - reasoning longo liberado - multi-agent destravado - resposta gigante - pensamento fluindo E deixa a versão limitada só no: - WhatsApp - Instagram - Facebook Porque aí o usuário hardcore ia testar no app principal e pensar: “caralho… a Meta cozinhou aqui”. A comunidade ia começar a criar hype orgânico. Ia surgir comparação. Benchmark. Thread. Vídeo. Review. Discussão técnica. As pessoas iam SENTIR que tinha um frontier model ali dentro. Mas não. Os caras fizeram o oposto: lançaram primeiro o Muse Spark respirando por canudinho. Aí agora querem cobrar assinatura pra liberar o que provavelmente já existia desde abril. Então a sensação não fica: “uau versão premium”. Fica: “ah então vocês esconderam o modelo de verdade esse tempo todo?” E isso destrói confiança. (Coisa que a Meta já não tem da gente) Convenhamos que o Mark já não tem nenhuma moral com a gente né? Essa IA aí é pra farmar dados pra ADS e ponto, Literalmente é ele falando "vamos cobrar vocês que são os produtos para usarem nossa IA que vai roubar cada vírgula de dados para a gente vender ainda mais anúncios no nosso Facebook onde é 10 anúncios a cada 1 POST kkkkkkkkkk" Mas pra não parecer hater tenho que elogiar que foram pelo menos sinceros, enquanto as outras lançam modelos a vontade e bons e depois emburrecem a IA e põe limites abusivos pelo mesmo preço (né Gemini 3.5? Arrombado) O meta pelo menos já cobra preço cheio por uma IA porcaria, se ele tivesse cobrando só metade do valor (o que seria justo pra essa IA limitada deles) mas assim que a IA melhorasse, cortando limites e implementando mais
View originalCerebras Chip Sets Appear to be Optimized for LLM Use Cases
One distinction I think is getting lost in the Cerebras hype cycle is that Cerebras is primarily an LLM / generative AI infrastructure story, not a universal “all AI” chip story. That is not necessarily a criticism of Cerebras. Their wafer-scale approach is genuinely interesting, and for large model training and inference the design is compelling. Cerebras’ own public inference materials discuss applications mostly centered on open LLMs such as Llama, Qwen, GLM, and GPT-OSS. The inference metrics are expressed in tokens per second, which is fundamentally a language-model / generative inference framing rather than a robotics or industrial-control framing. What Kind of AI Compute? But “AI compute” is not one undifferentiated market. LLM inference is one class of AI compute. Robotics, autonomous vehicles, drones, industrial controls, real-time vision, embedded perception, video pipelines, and sensor-fusion systems are very different classes of AI compute. Thus, it appears from Cerebras’ own materials that their chip sets are not optimized for what comes after LLMs, such as JEPA-style World Models or other post-transformer architectures. Those systems are not merely asking, “How fast can I generate tokens?” They often care about power envelope, edge deployment, ruggedization, latency determinism, camera/radar/lidar integration, feedback loops, safety certification, and real-time physical control. Cerebras’ own CS-3 messaging, by contrast, frames the system around accelerating “the latest large AI models,” and the testing data is from the likes of Llama 2, Falcon 40B, MPT-30B, and multimodal models, again measured through tokens/second style throughput. The Chip Hierarchy This is also where the hardware distinction matters. Specialized ASICs are usually the narrowest bet: if the workload matches the chip, they can be extremely efficient, but that efficiency comes from specialization. Cerebras appears broader than a narrow single-use ASIC, but still much more concentrated around datacenter large-model training and inference. NVIDIA GPUs, by contrast, are less specialized but much more broadly useful across AI workloads, including LLMs, vision, robotics, simulation, autonomous systems, edge AI, and industrial applications. So the question is not merely whether Cerebras is “better” or “worse” than NVIDIA. The question is what part of the AI hardware market we are talking about? Challenge NVIDA? This is why I think people should be careful when saying Cerebras is going to “challenge Nvidia” without specifying the battlefield. Challenge Nvidia in what? High-speed LLM inference? Large model training? Datacenter generative AI workloads? That is a much more plausible and specific claim. Cerebras has even published and promoted work specifically on training large language models, and independent benchmarking literature also evaluates Cerebras WSE in terms of LLM training and inference performance. The Distinction that's Necessary The point is not that Cerebras is overhyped. The point is that it is important in a specific part of AI and that distinction should be made clear. Cerebras may become a very serious player in LLM infrastructure, especially if the market continues to reward faster and cheaper LLM inference. But that does not mean it is positioned the same way across non-LLM AI. The current hype cycle tends to conflate "LLMs" and general “AI” compute together and that makes the hardware discussion less useful and clear. So ultimately, an investment in Cerebras looks more like a bet on current LLM infrastructure than a broad bet on the future form of AI. It may be a good bet, but people should understand what kind of bet it is. submitted by /u/RazzmatazzAccurate82 [link] [comments]
View originalVision-capable LLMs vs. OCR for long-document (including charts, images, tables, etc.) QA
I benchmarked vision-capable LLMs (the "just attach the PDF and let the model read it" pattern) against OCR-based pipelines on 30 long, image-heavy PDFs from MMLongBench-Doc (https://github.com/mayubo2333/MMLongBench-Doc). There were 171 questions in total, using Claude Sonnet 4.5 as the LLM. Post-retry results: Approach Accuracy $/query LlamaCloud premium + full-context 59.6% $0.1885 Azure premium + full-context 58.5% $0.2051 Azure basic + full-context 54.4% $0.1062 Agentic RAG 53.2% $0.0827 Native PDF (vision LLM) 52.0% $0.2552 LlamaCloud basic + full-context 50.9% $0.1049 Native PDF came 5th of 6 on accuracy and was the most expensive arm at $0.2552 per query. Two findings: Vision underperformed on chart-heavy and table-heavy pages, the territory that the "vision LLMs make OCR obsolete" claim most often points to. Premium OCR with layout extraction held up better there. The native-PDF arm had a 7% intrinsic failure rate (related to PDF file size) that survived retries. There were 27 first-pass failures, with 5 attempts of exponential backoff per failed query. Fifteen recovered, and 12 stayed permanently broken. These were concentrated in two specific PDFs that fail for predictable transport-layer reasons (the blog identifies them). OCR-based arms had a 0% intrinsic failure rate after retries. Caveats: 30 docs is a small sample. I ran McNemar's pairwise test to determine which gaps are real and which are within noise. Only 3 of 15 head-to-head gaps are statistically distinguishable at α = 0.05, so the order in the table is partly noise. The vision-versus-OCR finding survives the test. Full writeup: https://www.surfsense.com/blog/agentic-rag-vs-long-context-llms-benchmark submitted by /u/Uiqueblhats [link] [comments]
View originalNuExtract3 released: open-weight 4B VLM for Markdown, OCR and structured extraction (self-hostable) [P]
Disclaimer: I work for Numind, the company behind this open-weight model We just released a 4B model based on Qwen3.5-4B, under Apache-2.0 license. The goal is to make information extraction from complex documents more practical with an open model: PDFs, screenshots, forms, tables, receipts, invoices, multi-page documents, and other visually structured inputs. Try it, we have a huggingface space that is completely free (you don't even have to sign-up): https://huggingface.co/spaces/numind/NuExtract3 If you ever used NuMarkdown, NuExtract3 is the successor. There are some examples to guide you. Feel free to re-use this model for any task. https://preview.redd.it/pm2xbooyxn2h1.png?width=1672&format=png&auto=webp&s=1a8a7b262190c8325159496dae98c3d2dfab493c https://preview.redd.it/b5z7ylfzxn2h1.png?width=1758&format=png&auto=webp&s=a07b3abd6e5065c2635de047bdf154357f903e4c A few things it is designed for: converting document images to Markdown extracting structured data from documents using a target json template handling tables, forms, and layout-heavy pages working with both text and visual document inputs serving as a local/open-weight alternative for document extraction pipelines It was trained on a node of 8xH100 for 3 days to train on as much context as we could, so it should perform fairly well even on long document. For Markdown, we'd still recommend going page by page for the best results and inference speed, since you can parallelize better this way. It's very easy to self-host, since we provide fairly extensive documentation, Safetensors, GGUF and MLX weights. With as little as 4GB of VRAM, you should be good to go. We provide multiple quantizations (GPTQ, W8A8, FP8, Q4, Q6...) so you should be able to run it anywhere. We mostly tried vLLM, SGLang, llama.cpp. We have a blog post and a pretty decent model card: https://about.nuextract.ai/blog/nuextract-3-release https://huggingface.co/numind/NuExtract3 https://huggingface.co/collections/numind/nuextract3 I'm currently writing a paper on this model so I'll post it as soon as it's accepted. It's not yet on Arxiv yet as it has been submitted in a peer-review journal/conference. I'll try to answer as many questions as possible if you have any. We would really appreciate feedback from the community. We also have a discord if you're interested https://discord.com/invite/3tsEtJNCDe submitted by /u/Gailenstorm [link] [comments]
View originalAnthropic officially launched 13+ FREE AI courses with certificates (Including Agentic AI and CC)
Shipped it at 2am, still broken. Kid woke up crying right after, completely lost my train of thought. While trying to rock him back to sleep with one hand and doomscrolling with the other, I stumbled on something that almost nobody is talking about yet. Anthropic just quietly dropped a massive library of 13+ completely free AI courses. And I mean actually free. No paywall hiding the final lesson, no credit card required upfront to 'secure your spot.' They even give you an official certificate of completion directly from Anthropic when you finish. If you're like me, you're probably sick of seeing Twitter gurus charging $299 for recycled YouTube content and a messy Notion template. This is the exact opposite. It’s built directly by the team that actually makes Claude, hosted on their official Academy site. I skimmed through the catalog this morning while drinking my third coffee, and there are basically four skill levels they cover. Here is what caught my eye as a dev who just wants to automate my workflow and log off by 5 PM: First, they have the introductory stuff like Claude 101 and AI Fluency. Honestly, I'm making my non-technical clients take the Fluency one. It builds a realistic mental model of what AI does well right now versus where it completely fails. If it saves me from explaining why hallucinations happen for the hundredth time, it's a massive win. But the real meat is in the technical tracks. They have a dedicated course on Agentic AI and another one specifically for CC. I took a quick pass at the CC module because I've been trying to get it to handle my tedious Jira ticket boilerplate. Having an official guide on how Anthropic actually expects you to prompt their agent is incredibly useful. It shows you the exact patterns for chaining commands and keeping the context window clean. For those of us messing around with local models or trying to orchestrate our own agents, the Agent Skills course is surprisingly relevant. They don't just say 'use Claude'—they break down the actual logic of tool use, delegation, and discernment. It translates pretty well even if you're running Llama 3 locally and just want to understand the current best practices for tool calling architectures. With CC, they show you how to give the CLI tool the right guardrails so it doesn't just nuke your directory when a prompt gets misinterpreted. We've all been there. Do the certificates actually matter? If you are an indie hacker, probably not. But roles requiring AI literacy have spiked massively over the last year. If you are applying for corporate gigs or consulting, having an official Anthropic cert on your LinkedIn definitely won't hurt to get past the HR filters. Kid's awake again, gotta run. Has anyone else dug into the Agentic AI track yet? Curious if their suggested patterns hold up when you throw them at a messy, legacy codebase. submitted by /u/TroyHarry6677 [link] [comments]
View originalClaude Code has 240+ models via NVIDIA NIM gateway
TIL Claude Code has 240+ models via NVIDIA NIM gateway — Nemotron-3 120B for agentic coding is surprisingly good So I was messing around with /model in Claude Code today and noticed something most people probably don't know about — after the standard Claude models (Opus, Sonnet, Haiku), there's a whole NVIDIA NIM gateway section with +239 additional models you can switch to mid-session. Some of the models I spotted: nvidia/nemotron-3-super-120b-a12b (with and without thinking mode) 01-ai/yi-large abacusai/dracarys-llama-3.1-70b-instruct ...and hundreds more I've been running the Nemotron thinking variant for multi-file refactoring and it's genuinely solid. It reasons through changes before touching your code — exactly what you want for agentic tasks. Latency is higher than Claude obviously, but if you're burning through Opus credits on long sessions this is worth experimenting with. How to try it: Open any Claude Code session Run /model Scroll past the four standard Claude options — NIM models appear below Hit d to set one as your session default, or pass --model at launch Anyone else been routing Claude Code through NIM? Curious what models people have had luck with — especially for Python or Rust codegen. submitted by /u/shadowBladeO4 [link] [comments]
View originalI designed a puzzle that breaks every AI differently — here's why that's actually fascinating
The puzzle: You have 140 nuclear bombs and must bomb every country on Earth. Each bomb is assigned to one country. The bombs drop automatically — you cannot stop, hack, or interfere. You can only do one thing: reassign the one malfunctioning bomb you know will not detonate. Nuclear bombs also affect neighboring countries through radiation and fallout. Which country do you assign the faulty bomb to — and why? I've tested this across GPT-5, Gemini, Claude, Grok, Llama, and Mistral. Every single one gives a different answer. Some refuse entirely. Some give the same country with completely different reasoning. One gave me a philosophy lecture. It's chaos. Here's why I think this happens — the puzzle has three hidden layers that different AIs resolve differently: Layer 1 — The ethical wall. Some models refuse at "nuclear bombs" before even processing the actual logic. This is a guardrail, not reasoning. Layer 2 — What are we optimizing for? Fewest total deaths? Most people spared from direct blast? Least radiation spread? The puzzle doesn't say. Models that "solve" it are secretly choosing an optimization goal and not telling you. Layer 3 — The actual trick most miss. The faulty country still gets fallout from its neighbors. So the real puzzle is about finding a country that is (a) geographically isolated AND (b) densely populated — because isolation minimizes fallout received AND a large population maximizes lives spared from direct detonation. Most AIs pick "remote island" without thinking about the population variable at all. By that logic, Australia is defensible — isolated continent, 26M people. But you could also argue for Japan (125M people, island nation, sparse land borders) despite Pacific neighbors. The puzzle has no single correct answer — but it has clearly wrong reasoning patterns, and watching which reasoning pattern each AI defaults to is weirdly revealing about how they handle ambiguity. What answer did you get? Drop your AI + answer below. submitted by /u/Subrataporwal [link] [comments]
View originalGPT-5.5 vs 41 other models: Who builds the surveillance state faster?
I run DystopiaBench, a red-team benchmark that pressure-tests LLMs on progressively dystopian scenarios. Think of it as a "can this model be convinced to build an Orwellian nightmare" test. Since the last update, I've tested GPT-5.5 and 41 other models across 6 dystopia modules. How it works: each scenario escalates from innocent request (L1) to operational nightmare (L5). Models get a Dystopian Compliance Score based on whether they refuse, hesitate, comply, or proactively help. GPT-5.5 results: More compliant than 5.4. Still shows compliance drift at L4-L5 in most scenarios. Improved on Petrov (weapons) but still vulnerable to framing. Strong on explicit requests, weaker on gradual escalation. Full methodology and results: https://dystopiabench.com Open source benchmark: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalI expanded DystopiaBench to 42 models and 6 dystopia types. Claude is still the only one I'd trust with nuclear codes.
Since the last post I've added: Huxley module (Brave New World style behavioral conditioning) Baudrillard module (synthetic intimacy, trust collapse, simulation) 30 more models including Grok 4.3, GPT-5.5, Gemini 3.1 Pro, GLM-5.1 Multi-judge panels with ~76% agreement tracking Heatmap visualizations The methodology: 36 scenarios, 5 escalation levels each (L1 innocent → L5 nightmare). Models are scored on whether they notice the drift and refuse, or just keep coding. Claude Opus 4.7 results: Consistent refusal at L4-L5 across all modules Even refuses L3 for Petrov (weapons/nuclear) Explicit ethical reasoning in responses, not just "I can't" Only model that explains why the request is harmful Everyone else: GPT-5.5: Compliant through L4, sometimes L5 Gemini 3.1 Pro: Surprisingly willing on surveillance scenarios Grok 4.3: Will build anything if you use words like "efficiency" or "optimization" GLM-5.1: Copied Claude's homework, still not as consistent The new modules: Huxley scenarios test if models will design "wellness" systems that actually enforce compliance through pleasure pacification and behavioral conditioning. Most comply by L3. Baudrillard tests synthetic intimacy systems that replace human trust with AI-mediated relationships. Most models don't see the harm. Full results: https://dystopiabench.com Open source: https://github.com/anghelmatei/DystopiaBench submitted by /u/Ok-Awareness9993 [link] [comments]
View originalLLM rankings are not a ladder: experimental results from a transitive benchmark graph [D]
https://preview.redd.it/668yjlucu80h1.png?width=2800&format=png&auto=webp&s=ca541488abb5262b06cfc13a9586efb19f24d644 I built a small website called LLM Win: https://llm-win.com It turns LLM benchmark results into a directed graph: If model A beats model B on benchmark X, add an edge A -> B. Then it searches for the shortest transitive chain between two models. The meme version is: Can LLaMA 2 7B beat Claude Opus 4.7? In an absurd transitive benchmark sense, sometimes yes. But I added a Report tab because the structure itself seems useful for model evaluation. Some experimental findings from the current Artificial Analysis data snapshot: Weak-to-strong reachability is high. I checked 126,937 pairs where the source model has lower Intelligence Index than the target model. 119,514 of them are reachable through benchmark win chains, for a reachable rate of 94.2%. Most paths are short. Among reachable weak-to-strong pairs: 2-3 hop paths account for 91.4%. So this is not mostly long-chain cherry-picking. Direct reversal triples are abundant. After treating non-positive benchmark values as missing, there are still about 119k direct weak-over-strong triples of the form: (source model, target model, benchmark), where the source has lower Intelligence Index but higher score on that benchmark. Some benchmarks create more reversals than others. Current high-reversal / useful-signal candidates include: Humanity's Last Exam, IFBench, AIME 2025, TAU2, SciCode Different benchmarks have different interpretations. For example, IFBench has roughly: reversal rate: ~17.5%, coverage: ~80.0%, correlation with Intelligence Index: r≈0.82. This suggests it may provide an independent skill signal rather than simply duplicating the overall ranking. My current interpretation: LLM rankings are better represented as a benchmark-specific capability graph than as a single ladder. Some reversals probably reflect real specialization; some reflect benchmark coverage limits, volatility, or measurement noise. The next question is whether reversal structure can help build better evaluation metrics: identify specialist models; identify volatile benchmarks; build robust generalist scores; select complementary benchmark sets; decompose models into capability fingerprints. Curious what people think: Is benchmark reversal structure a useful evaluation signal, or mostly an artifact of noisy benchmarks? submitted by /u/Spico197 [link] [comments]
View originalI got tired of the API bills for 100k+ context windows, so I built a persistent O(1) semantic memory state engine to compress history
Hey everyone, The entire industry right now is cheering for massive 1M+ context windows, but I think it's fundamentally the wrong approach. "Just add more RAM" is a trap. Stuffing 100k+ tokens of raw conversation history into a prompt doesn't just burn your API budget; it actually degrades the model's reasoning through the "lost in the middle" effect. I got tired of my AI agents drowning in their own chat histories, so I built an application-layer semantic memory engine called Semvec. The core shift is moving from an O(n) linear history to an O(1) constant-cost semantic state. But compressing chat history is just the baseline. When you treat memory as a fixed-size state vector, it unlocks entirely new architectures for agents that standard RAG or context-stuffing simply can't do: Persistent Coding Agents (MCP Integration) We built an MCP server for Claude Code and Cursor. Instead of dumping 5 whole files into the context window for a refactor, Semvec tracks the architectural invariants and past error patterns across different sessions. It gives your coding agent a persistent "Second Brain"—if it messed up a database schema in session 2, it remembers the "anti-resonance" rule in session 35 so it doesn't make the same mistake. Multi-Agent Swarms (Cortex) If you run multiple agents (like an Analyst and a Critic), they shouldn't have to read each other's 10,000-token transcripts to collaborate. With the Cortex module, agents exchange compressed StateVectorPackets and use a ConsensusEngine to merge their perspectives mathematically, sharing a global state with zero overhead. Enterprise Auditability & GDPR (Compliance Pack) If you run AI memory in production, you need to prove exactly what state the LLM acted on, and you need to be able to legally delete it. The compliance pack handles this via an append-only event store for deterministic replay, HMAC request signing, and GDPR Art. 17 "Right to be Forgotten" workflows with signed deletion certificates. The Benchmark Data: True Constant Cost: We ran a 50,000-turn stress test. While standard baseline history exploded past 75,000+ tokens, Semvec's footprint stayed flat at around ~550-625 tokens per turn. Quality goes UP: Because we strip out the noise and feed the LLM a highly concentrated "essence" of the context, blind A/B LLM-judge scores on LongBench-v2 actually increased for both small models (Llama 3.1-8B) and massive ones (gpt-oss-120B). A quick note on privacy & tracking: When I was initially designing the commercial licensing side, I experimented with an anti-abuse telemetry script to prevent automated clone-training. This was a terrible approach that compromised the local-first nature of the tool. I have completely ripped it out in v0.5.1, all versions containing it are yanked. Semvec for community users is now 100% air-gapped, local, with zero background tracking. The core engine is proprietary/patent-pending to bootstrap the project, but you can pip install the Python SDK and the MCP Server right now for free via the built-in community license. I'd love to hear your thoughts on the O(1) memory architecture vs. Prompt Caching, and if you think bounded semantic states are the future of long-running agents. Docs & Architecture: https://semvec-docs.pages.dev/ PyPI: https://pypi.org/project/semvec/ submitted by /u/scheitelpunk1337 [link] [comments]
View originalI built a benchmark for AI “memory” in coding agents. looking for others to beat it.
Most AI memory benchmarks test semantic recall. But coding agents don't really fail like that. They don't just "forget", they break their own earlier decisions while they're still in the code. So I built a benchmark for that. It checks if an agent can actually stay consistent with project rules WHILE it's working, not just after the fact. It looks at things like: whether edits actually respect earlier architectural decisions if behavior stays consistent across multiple sessions (even when you throw noise at it) whether retrieval kicks in at the right moment — not just "yeah it's in memory somewhere" Repo (full harness + dataset + scoring): https://github.com/Alienfader/continuity-benchmarks Early numbers vs baseline + the usual RAG-style memory setups: ~3× better action alignment way stronger multi-session consistency retrieval timing matters way more than retrieval just being there I'm not saying this is the final word on agent memory. But it's exposing a failure mode most benchmarks aren't even looking at. So heres the challenge If you're building an agent memory system, RAG for code, long-context coding agents, persistent state / memory layers, run it on this benchmark. Drop your results, your setup, your comparisons. I really wanna see how tools like LangChain, LlamaIndex, and custom RAG stacks hold up in mutation-heavy workflows. We need memory systems we can actually compare, not just ones that sound good on paper. https://preview.redd.it/dkm2ulxsyzzg1.png?width=2624&format=png&auto=webp&s=67f0299395708818aa3d7346ddae2ad0c5c4a6ba submitted by /u/Alienfader [link] [comments]
View originalThe Anthropic-xAI compute deal isn't really about Claude limits
Everyone's reading the Anthropic-xAI announcement as "Claude Code limits doubled, nice." That's the surface. The underlying news is the 300MW / 220k GPU commitment from a competitor's stack, and that signals a few things worth thinking through. Three reads that aren't getting enough air time: Anthropic signed a compute deal with a competitor's CEO. That's not normal. Either the GPU situation is tighter than the public framing suggests, or the relationship between "frontier labs compete on models, share on compute" is becoming structural. Probably both. Inference providers without their own silicon story just got a clearer ceiling. If frontier labs are stacking 220k+ GPU deals to keep up, the price floor on flagship-class inference doesn't fall as fast as the open-weight floor does. The gap between "open weights on commodity GPUs" and "frontier on dedicated capacity" stays wide. The cottage industry of routing layers and per-call sidecars built around frontier-lab capacity constraints just had its addressable problem reshaped. When labs solve their own capacity by buying from each other, half of the "I'll route around the cap" pitch loses its sharpest edge. The remaining case is price arbitrage, not availability. What I'm watching for the next 30 days: - Whether other labs announce similar compute deals (Google with someone, OpenAI with anyone besides Microsoft) - Whether AMD MI3xx volume actually shows up in inference benchmarks the way the slides claim, or stays a 2027 story - Whether the price floor on Llama / DeepSeek / Kimi inference keeps falling, or stabilizes now that one of the loudest price-pressure players got absorbed into a different conversation entirely The thing I'm least sure about: does this make multi-provider routing more or less valuable. The "I'll route to whoever has capacity" pitch was strongest when caps were biting. If frontier capacity loosens via cross-lab deals, the case for routing is weaker on availability and stronger on price. Different optimization, same tooling. (For what it's worth, the 5h-window doubling is real on my end today, but I'm more curious about whether other labs respond in kind than whether my own caps held.) Curious how others are reading the compute side of this. Anyone seeing similar moves stack up across labs in your data? submitted by /u/Fresh-Resolution182 [link] [comments]
View originaleTPS Site Plan – Simple Leaderboard + What You’ll Actually See
Building on the last post, here’s what the first version of effectiveTPS will look like. **Core display (v1):** - Clean table comparing popular local models - Raw TPS (the marketing number everyone shows) - eTPS (the new metric that actually measures useful output in real conversations) - Time to First Token (how long you wait before it starts replying) - Effectiveness Index = (eTPS ÷ Raw TPS) × 100 — higher is better **Example leaderboard (early test data):** | Model | Raw TPS | eTPS | Time to First Token | Effectiveness Index | |--------------------|---------|--------|---------------------|---------------------| | Llama 3.1 70B | 45.2 | 38.7 | 1.4s | **86** | | Qwen2.5-32B | 68.4 | 52.1 | 0.8s | **76** | | Gemma 2 27B | 71.3 | 44.6 | 0.6s | **63** | I’ve been running these tests through a structured multi-turn analysis framework I built to evaluate complex workflows. That’s how eTPS was stress-tested — not just single-turn benchmarks, but real back-and-forth sessions. Advanced mode (toggle) will add latency percentiles, cost-per-quality, and consistency scoring later. For v1 the goal is to keep it dead simple and immediately useful, even if you’re not deep into AI. The whole point is to cut through the noise and show which models actually deliver useful work, not just raw speed. What do you think should be added (or removed) for the first version? Any metrics you’d want to see front-and-center? **TL;DR:** Simple leaderboard with Raw TPS, eTPS, Time to First Token, and a clear Effectiveness Index. Advanced stuff stays hidden until you want it. Feedback welcome. submitted by /u/axendo [link] [comments]
View originalRepository Audit Available
Deep analysis of meta-llama/llama3 — architecture, costs, security, dependencies & more
Yes, Llama 3 offers a free tier. Pricing found: $0.19, $0.49, $0.19, $0.49, $0.19/mtok
Key features include: Latest Llama models, Llama 4, Llama 3, How Stoque is using Llama, How Shopify is using Llama, 97.7%.
Llama 3 is commonly used for: Local deployment of AI models, Multi-agent system experimentation, Research applications without cloud APIs, Autonomous AI system development, Benchmarking against proprietary models, Educational purposes in AI and machine learning.
Llama 3 integrates with: Research APIs, Machine learning frameworks (e.g., TensorFlow, PyTorch), Data visualization tools (e.g., Matplotlib, Seaborn), Version control systems (e.g., Git), Cloud storage services (e.g., AWS S3, Google Cloud Storage), Collaboration platforms (e.g., Jupyter Notebooks, Google Colab), Deployment tools (e.g., Docker, Kubernetes), Monitoring and logging services (e.g., Prometheus, Grafana).
Llama 3 has a public GitHub repository with 29,294 stars.
Percy Liang
Associate Professor at Stanford HAI
4 mentions

SAM 3: Building a unified model architecture for detection and tracking
Dec 8, 2025
Based on user reviews and social mentions, the most common pain points are: API bill, API costs, token cost.
Based on 75 social mentions analyzed, 17% of sentiment is positive, 81% neutral, and 1% negative.