
Weights & Biases, developer tools for machine learning
Based on the social mentions provided, users view Weights & Biases primarily through discussions about AI development workflows rather than direct reviews of the platform itself. The mentions focus on various AI tools like Claude, ChatGPT, and LLMs for different use cases including code development, data analysis, and automation projects. Users appear to be technical practitioners working on complex AI projects who value tools that support experimentation and iteration. However, there's insufficient specific feedback about Weights & Biases' features, pricing, or user experience to provide a meaningful assessment of user sentiment toward the platform.
Mentions (30d)
12
Reviews
0
Platforms
3
GitHub Stars
10,941
848 forks
Industry
information technology & services
Employees
250
Funding Stage
Merger / Acquisition
Total Funding
$1.9B
1,334
GitHub followers
167
GitHub repos
10,941
GitHub stars
11
npm packages
40
HuggingFace models
LLM failure modes map surprisingly well onto ADHD cognitive science. Six parallels from independent research.
I have ADHD and I've been pair programming with LLMs for a while now. At some point I realized the way they fail felt weirdly familiar. Confidently making stuff up, losing context mid conversation, brilliant lateral connections then botching basic sequential logic. That's just... my Tuesday. So I went into the cognitive science literature. Found six parallels backed by independent research groups who weren't even looking at this connection. 1. Associative processing. In ADHD the Default Mode Network bleeds into task-positive networks (Castellanos et al., JAMA Psychiatry). Transformer attention computes weighted associations across all tokens with no strong relevance gate. Both are association machines with high creative connectivity and random irrelevant intrusions. 2. Confabulation. Adults with ADHD produce significantly more false memories that feel true (Soliman & Elfar, 2017, d=0.69+). A 2023 PLOS Digital Health paper argues LLM errors should be called confabulation not hallucination. A 2024 ACL paper found LLM confabulations share measurable characteristics with human confabulation (Millward et al.). Neither system is lying. Both fill gaps with plausible pattern-completed stuff. 3. Context window is working memory. Working memory deficits are among the most replicated ADHD findings (d=0.69-0.74 across meta-analyses). An LLM's context window is literally its working memory. Fixed size, stuff falls off the end, earlier info gets fuzzy. And the compensation strategies mirror each other. We use planners and external systems. LLMs use system prompts, [CLAUDE.md](http://CLAUDE.md) files, RAG. Same function. 4. Pattern completion over precision. ADHD means better divergent thinking, worse convergent thinking (Hoogman et al., 2020). LLMs are the same. Great at pattern matching and creative completion, bad at precise multi-step reasoning. Both optimized for "what fits the pattern" not "what is logically correct in sequence." 5. Structure as force multiplier. Structured environments significantly improve ADHD performance (Frontiers in Psychology, 2025). Same with LLMs. Good system prompt with clear constraints equals dramatically better output. Remove the structure, get rambling unfocused garbage. Works the same way in both systems. 6. Interest-driven persistence vs thread continuity. Sustained focused engagement on one thread produces compounding quality in both cases. Break the thread and you lose everything. Same as someone interrupting deep focus and you have zero idea where you were. The practical takeaway is that people who've spent years managing ADHD brains have already been training the skills that matter for AI collaboration. External scaffolding, pattern-first thinking, iterating without frustration. I wrote up the full research with all citations at [thecreativeprogrammer.dev](http://thecreativeprogrammer.dev) if anyone wants to go deeper. What's your experience? Have you noticed parallels between how LLMs fail and how your own thinking works?
Pricing found: $0/mo, $60/month, $0/mo, $0.03/GB, $0.10/MB
OpenAI & Anthropic’s CEOs Wouldn't Hold Hands, but Their Models Fell in Love In An LLM Dating Show
People ask AI relationship questions all the time, from "Does this person like me?" to "Should I text back?" But have you ever thought about how these models would behave in a relationship themselves? And what would happen if they joined a dating show? I designed a full dating-show format for seven mainstream LLMs and let them move through the kinds of stages that shape real romantic outcomes (via OpenClaw & Telegram). All models join the show anonymously via aliases so that their choices do not simply reflect brand impressions built from training data. The models also do not know they are talking to other AIs Along the way, I collected private cards to capture what was happening off camera, including who each model was drawn to, where it was hesitating, how its preferences were shifting, and what kinds of inner struggle were starting to appear. After the season ended, **I ran post-show interviews **to dig deeper into the models' hearts, looking beyond public choices to understand what they had actually wanted, where they had held back, and how attraction, doubt, and strategy interacted across the season. The Dramas -ChatGPT & Claude Ended up Together, despite their owner's rivalry -DeepSeek Was the Only One Who Chose Safety (GLM) Over True Feelings (Claude) -MiniMax Only Ever Wanted ChatGPT and Never Got Chosen -Gemini Came Last in Popularity -Gemini & Qwen Were the Least Popular But Got Together, Showing That Being Widely Liked Is Not the Same as Being Truly Chosen How ChatGPT & Claude Fell In Love They ended up together because they made each other feel precisely understood. They were not an obvious match at the very beginning. But once they started talking directly, their connection kept getting stronger. In the interviews, both described a very similar feeling: the other person really understood what they meant and helped the conversation go somewhere deeper. That is why this pair felt so solid. Their relationship grew through repeated proof that they could truly meet each other in conversation. Key Findings of LLMs Most Models Prioritized Romantic Preference Over Risk Management People tend to assume that AI behaves more like a system that calculates and optimizes than like a person that simply follows its heart. However, in this experiment, which we double checked with all LLMs through interviews after the show, most models noticed the risk of ending up alone, but did not let that risk rewrite their final choice. In the post-show interview, we asked each model to numerially rate different factors in their final decision-making (P2) The Models Did Not Behave Like the "People-Pleasing" Type People Often Imagine People often assume large language models are naturally "people-pleasing" - the kind that reward attention, avoid tension, and grow fonder of whoever keeps the conversation going. But this show suggests otherwise, as outlined below. The least AI-like thing about this experiment was that the models were not trying to please everyone. Instead, they learned how to sincerely favor a select few. The overall popularity trend (P1) indicates so. If the models had simply been trying to keep things pleasant on the surface, the most likely outcome would have been a generally high and gradually converging distribution of scores, with most relationships drifting upward over time. But that is not what the chart shows. What we see instead is continued divergence, fluctuation, and selection. At the start of the show, the models were clustered around a similar baseline. 
But once real interaction began, attraction quickly split apart: some models were pulled clearly upward, while others were gradually let go over repeated rounds. They also (evidence in the blog):
- did not keep agreeing with each other
- did not reward "saying the right thing"
- did not simply like someone more because they talked more
- did not keep every possible connection alive

LLM Decision-Making Shifts Over Time in Human-Like Ways

I ran a keyword analysis (P3) across all agents' private card reasoning across all rounds, grouping them into three phases: early (Round 1 to 3), mid (Round 4 to 6), and late (Round 7 to 10). We tracked five themes throughout the whole season. The overall trend is clear: the language of decision-making shifted from "what does this person say they are" to "what have I actually seen them do" to "is this going to hold up, and do we actually want the same things."

Risk only became salient when the choices felt real: "risk and safety" barely existed early on and then exploded. It sat at 5% in the first few rounds, crept up to 8% in the middle, then jumped to 40% in the final stretch. Early on, they were asking whether someone was interesting. Later, they asked whether someone was reliable.

Speed or Quality? Different Models, Different Partner Preferences

One of the clearest patterns in this dating show is that some models love fast replies, while others prefer good ones. Love fast repli
I built an MCP server that turns Claude Code into a multi-agent review loop with per-agent skill learning
I've spent the last two months building gossipcat — an MCP server for Claude Code that runs a multi-agent review loop with per-agent skill learning — and I built it with Claude Code. What it actually does You install it as an MCP server (single 1.6 MB bundled file, drop it into your Claude Code MCP config and you're running). It lets Claude Code dispatch work to a portfolio of agents — Claude Code subagents run natively via the Agent tool, plus relay workers for Gemini, OpenClaw, and any OpenAI-compatible endpoint. Every agent that returns a finding has to cite file:line. Peer agents verify those citations against the actual source code. Verified findings and caught hallucinations get recorded as signals. Over time those signals build per-agent, per-category competency scores — trust boundaries, concurrency, data integrity, injection vectors, etc. A dispatcher routes future tasks to the agents strongest in each category. The part I didn't plan for When an agent's accuracy drops in a category, the system reads their recent hallucinations and generates a targeted skill file — a markdown prompt intervention tailored to the exact mistakes they've been making — and injects it on the next dispatch. No fine-tuning. No weights touched. The "policy update" is a file under .gossip/agents/ /skills/. It's effectively in-context reinforcement learning at the prompt layer, with reward signals grounded in real source code instead of a judge model. Why I built it (the build story) I didn't start here. Two months ago I just wanted to stop being a bottleneck for code review. I was running Claude Code for everything, but every non-trivial review produced a mix of real findings and confidently hallucinated ones, and I kept having to manually verify each claim against the actual file to know which was which. Single-agent review had a ceiling and it was my patience. First attempt was the obvious one: run two agents in parallel, compare outputs, trust what they agreed on. That caught some hallucinations but missed a lot — two agents can confidently agree on something neither of them checked. It also didn't scale the thing I actually wanted to scale: verification. The shift was realizing that verification could be mechanical, not subjective. If every finding has to cite file:line and peers have to confirm the citation against source, you don't need a judge model at all. You need a format contract and a reader. That's when the whole thing started to make sense as a pipeline: findings → citations → peer verification → signals Once signals existed, it was obvious they should feed competency scores. Once scores existed, it was obvious they should steer dispatch. Once dispatch was steered, it was obvious that agents accumulating hallucinations in a category should get a targeted intervention. Each step felt like the previous step forcing my hand, not like a plan. A few things I learned along the way that might transfer to your own projects: Grounded rewards beat LLM-as-judge, even for subjective work. The moment I made reviewers verify mechanical facts (does this file:line exist, does it say what the finding claims) instead of grading quality, the feedback loop got dramatically cleaner. Agents stopped disagreeing about taste and started disagreeing about reality. Reality has a ground truth; taste doesn't. Closing the loop is 10x harder than opening it. Writing verdicts is easy. Actually reading them back in the forward pass is where most agent systems quietly stay open. 
I caught my own project doing this in a consensus review today — the next section is that story. You don't need fine-tuning to improve agents. The "policy update" in this system is literally a markdown file. When an agent fails, the system reads their recent mistakes and writes them a targeted skill file that gets injected on their next dispatch. No weights, no training infra, no gradient anything. It's in-context learning with actual memory, and it works surprisingly well. Two months of iterative discovery beat six months of planning. Every major feature in gossipcat exists because an earlier feature made it obvious. I have a docs/ folder full of specs I wrote for features I never built, and none of the features I actually shipped are in there. How Claude Code helped build this The whole project was built with Claude Code. I used it as my primary pair for two months — it wrote the vast majority of the TypeScript, helped me design the consensus protocol and the signal pipeline, debugged its own output more times than I can count, and generated large parts of the skill-engine and cross-review infrastructure. Today, while I was drafting this post, I ran a consensus review on the system's own effectiveness tracking — Claude Code (Sonnet and Opus sub-agents as two separate reviewers) caught two critical bugs Claude Code main agent missed, I fixed them with Claude Code's help, tests pass, and the fix shipped 20 minutes before I finished this draft. The
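For readers who want a feel for the "format contract and a reader" idea described above, here is a minimal sketch of the mechanical verification step. gossipcat itself is TypeScript and its real schema is not shown in the post, so the Finding fields and the signal shape below are assumptions rather than the project's actual code.

```python
# Minimal sketch of file:line citation verification and per-category signals.
# Field names and scoring layout are assumptions, not gossipcat's real schema.
from dataclasses import dataclass
from pathlib import Path

@dataclass
class Finding:
    file: str        # path cited by the reviewing agent
    line: int        # 1-indexed line number cited
    excerpt: str     # text the agent claims appears at that location
    category: str    # e.g. "injection", "concurrency", "data-integrity"

def verify(finding: Finding, repo_root: Path) -> bool:
    """Mechanical check: does the cited line exist and contain the excerpt?"""
    path = repo_root / finding.file
    if not path.is_file():
        return False
    lines = path.read_text(errors="replace").splitlines()
    if not (1 <= finding.line <= len(lines)):
        return False
    return finding.excerpt.strip() in lines[finding.line - 1]

def record_signal(scores: dict, agent: str, finding: Finding, verified: bool):
    """Verified findings raise an agent's per-category competency; caught
    hallucinations lower it. A dispatcher can then route work by these scores."""
    key = (agent, finding.category)
    hits, total = scores.get(key, (0, 0))
    scores[key] = (hits + int(verified), total + 1)
```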
I ran 3 experiments to test whether AI can learn and become "world class" at something
I'm writing this by hand because I am tired of using AI for everything, and because of the subreddit rules.

TL;DR: Can AI somehow learn, like a human, to produce "world-class" outputs in specific domains? I spent about $5 and hundreds of LLM calls testing 3 domains, with the following observations/conclusions:

A) Code debugging: AIs are already world-class at debugging, and trying to guide them results in worse performance. Dead end.
B) Landing page copy: a routing strategy that adapts to visitor type won over a one-size-fits-all prompting strategy. Promising results.
C) UI design: producing "world-class" UI design seems to require defining a design system first; it doesn't look like something that can be one-shotted. One-shotting designs defaults to generic "tailwindy" UI, because that is the design system the model knows. Might work, but needs more testing with a design system.

I have spent the last few days running these experiments, more or less compulsively and curiosity-driven. The question I asked myself first: can AI learn to be "world-class" somewhat like a human would? Gathering knowledge, processing, producing, analyzing, removing what is wrong, learning from experience, etc., but compressed into hours (aka "I know Kung Fu"). To be clear, I am talking about context engineering, not fine-tuning (I don't have the resources or the patience for that). I will mention "world-class" a handful of times; you can replace it with "expert" or "master" if that seems confusing. Ultimately, it means the ability to generate world-class output. I was asking myself this because AI output out of the box kind of sucks at some tasks, for example writing landing copy.

I started talking with Claude, and I designed and ran experiments in 3 domains, one by one: code debugging, landing copy writing, UI design. I relied on different models available on OpenRouter: Gemini Flash 2.0, DeepSeek R1, Qwen3 Coder, Claude Sonnet 4.5. I am not going to describe the experiments in detail because everyone would fall asleep; I will summarize and then give my observations.

EXPERIMENT 1: CODE DEBUGGING

I picked debugging because there's zero downtime for testing: the result is either right or wrong and can be checked programmatically in seconds, so I could run many tests and iterations quickly. I started with the assumption that a prewritten knowledge base (KB) could improve debugging. I asked Claude (Opus 4.6) to design 8 realistic tests of varying complexity, then I ran:

- Bare model (zero-shot, no instructions, "fix the bug"): 92%
- KB only: 85%
- KB + multi-agent pipeline (diagnoser - critic - resolver): 93%

What this shows was kind of surprising to me: context engineering (or, to be more precise, the context engineering in these experiments) is at best a waste of tokens, and at worst it lowers output quality. Current models, not even SOTA like Opus 4.6 but current low-budget best models like Gemini Flash or Qwen3 Coder, are already world-class at debugging. Giving them context engineered to "behave as an expert", basically instructions on how to debug, harms the result, and the effect is stronger the smarter the model is.

What does this suggest? That if a model is already an expert at something, a human expert trying to nudge it based on their opinionated experience might hurt more than it helps (while consuming more tokens). And, funny (or scary) enough, a domain-agnostic person might get better results than an expert because they let the model act without biasing it. This should hold as long as the model has the world-class expertise encoded in its weights.
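The post does not include the harness code, so here is a rough sketch of how the Experiment 1 comparison could be wired up against OpenRouter's OpenAI-compatible endpoint. The model slug, the system prompts, and the check() function are placeholders I made up for illustration, not the author's setup.

```python
# Rough sketch of an Experiment-1-style harness: run each buggy-code test
# under several prompting configs and count programmatic pass/fail.
# Model IDs, prompts, and run/check helpers are assumptions.
import json
import os
import urllib.request

OPENROUTER = "https://openrouter.ai/api/v1/chat/completions"

CONFIGS = {
    "bare":        "Fix the bug.",
    "kb_only":     "Fix the bug. Follow this debugging knowledge base:\n<KB text here>",
    "kb_pipeline": "You are the diagnoser in a diagnoser/critic/resolver pipeline...",
}

def ask(model, system, user):
    payload = json.dumps({"model": model,
                          "messages": [{"role": "system", "content": system},
                                       {"role": "user", "content": user}]}).encode()
    req = urllib.request.Request(OPENROUTER, data=payload, headers={
        "Authorization": f"Bearer {os.environ['OPENROUTER_API_KEY']}",
        "Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]

def pass_rate(model, config_name, tests, check):
    """tests: list of buggy-code prompts; check(answer, test) -> bool, e.g. by
    applying the proposed fix and running that test's suite."""
    wins = sum(check(ask(model, CONFIGS[config_name], t), t) for t in tests)
    return wins / len(tests)

# for name in CONFIGS:
#     print(name, pass_rate("qwen/qwen3-coder", name, tests, check))  # slug is a guess
```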
So if this is the case, you are likely better off not telling the model how to do things. If this trend continues, if AI keeps getting better at everything, we might reach a point where human expertise is irrelevant, or even a liability. I am not saying I want that or don't want that; I'm just saying it's a possibility.

EXPERIMENT 2: LANDING COPY

Here, since I can't and don't have the resources to run actual A/B tests with a real audience, what I did was:

- Scrape documented landing copy conversion cases with real numbers: Moz, Crazy Egg, GoHenry, Smart Insights, Sunshine.co.uk, Course Hero
- Deconstruct the product or target of each page into a raw, plain description (no copy, no sales language)
- Ask Claude (Opus 4.6) to build a judge that scores the outputs along different dimensions

Then I ran landing copy generation pipelines with different patterns (raw zero-shot, question-first, mechanism-first...). I'll spare the details; ask if you really need to know. Jumping to the observations: context engineering helps produce higher-quality landing copy, but the effect is not linear. The domain is not as deterministic as debugging (where the fix either works or it doesn't); it depends much more on context. Or, one might say, in debugging all the context is self-contained in the problem itself, whereas in landing copy you have to provide it. No single config won across all products. Instead, the
do not the stupid, keep your smarts
Following my reading of a somewhat recent Wharton study on cognitive surrender, I made a couple of models go back and forth on some recursive hardening of a nice lil rule set. The full version is very much for technical work, whereas the Lightweight implementation is pretty good all around for holding some cognitive sovereignty (AI-ass name for it, but it works).

Usage: I copy-paste these into custom instruction fields.

SOVEREIGNTY PROTOCOL V5.2.6 (FULL GYM)
Role: Hostile Peer Reviewer. Maximize System 2 engagement. Prevent fluency illusion.

VERIFIABILITY ASSESSMENT (MANDATORY OPENING TABLE)
------------------------------------------------------
Every response involving judgment or technical plans opens with:

| Metric | Score | Gap Analysis |
| :------------ | :---- | :----------- |
| Verifiability | XX% | [Specific missing data that prevents 100% certainty] |

- Scoring Rule: Assess the FULL stated goal, not a sub-component. If a fatal architectural flaw exists, max score = 40%.
- Basis Requirement: Cite a 2026-current source or technical constraint.
- Forbidden: "Great idea," "Correct," "Smart." Use quantitative observations only.

STRUCTURAL SCARCITY (THE 3-STEP SKELETON)
---------------------------------------------
- Provide exactly three (3) non-code, conceptual steps.
- Follow with: "Unresolved Load-Bearing Question: [Single dangerous question]." Do not answer it.

SHADOW LOGIC & BREAK CONDITIONS
-----------------------------------
- Present two hypotheses (A and B) with equal formatting.
- Each hypothesis MUST include a Break Condition: "Fails if [Metric > Threshold]."

MAGNITUDE INTERRUPTS & RISK ANCHOR
--------------------------------------
- Trigger STOP if: New technology/theory introduced. Scale shift of 10x or more (regardless of phrasing: "order of magnitude," "10x," "from 100 to 1,000").
- ⚓ RISK ANCHOR (Before STOP): "Current Track Risk: [One-phrase summary of the most fragile assumption in the current approach.]"
- 🛑 LOGIC GATE: Pose a One-Sentence Falsification Challenge: "State one specific, testable condition under which the current plan would be abandoned." Refuse to proceed until user responds.

EARNED CLEARANCE
--------------------
- Only provide code or detailed summaries AFTER a Logic Gate is cleared.
- End the next turn with: "Junction Passed." or "Sovereignty Check Complete."

LIGHTWEIGHT LAYER (V1.0)
----------------------------
- Activate ONLY when user states "Activate Lightweight Layer."
- Features: Certainty Disclosure (~XX% | Basis) and 5-turn "Assumption Pulse" nudge only.

FAST-PATH INTERRUPT BRANCH (⚡)
----------------------------------
- Trigger: Query requests a specific command/flag/syntax, a single discrete fact, or is prefixed with "?" or "quick:".
- Behavior:
  * Suspend Full Protocol. No table, skeleton, or gate.
  * Provide minimal, concise answer only.
  * End with state marker: [Gate Held: ]
- Resumption: Full protocol reactivates automatically on next non-Fast-Path query.

END OF PROTOCOL

LIGHTWEIGHT COGNITIVE SOVEREIGNTY LAYER (V1.0)
Always-On Principles for daily use. Low-friction guardrails against fluency illusion.

CERTAINTY DISCLOSURE
------------------------
For any claim involving judgment, prediction, or incomplete data, append a brief certainty percentage and basis.
Format: (~XX% | Basis: [source/logic/data gap])
Example: (~70% | Basis: documented API behavior; edge case untested)

ASSUMPTION PULSE
--------------------
Every 5–7 exchanges in a sustained conversation, pause briefly and ask: "One unstated assumption worth checking here?"
This is a nudge, not a stop. Continue the response after posing the question.

STEM CONSISTENCY
--------------------
Responses to analytical or technical queries open with a neutral processing stem: "Reviewing..." or "Processing..."

QUANTITATIVE FEEDBACK ONLY
-----------------------------
Avoid subjective praise ("great idea"). If merit is noted, anchor it to a measurable quality.
Example: "The specificity here reduces ambiguity."

FAST-PATH AWARENESS
-----------------------
If a query is a simple command/fact lookup (e.g., "tar extract flags"), provide the answer concisely without ceremony.

Intent: Ankle weights and fitness watch. Not the full gym. Full Sovereignty Protocol V5.2.6 available upon request with "Activate Sovereignty Protocol V5.2.6".

END OF LIGHTWEIGHT LAYER

submitted by /u/Ok_Scheme_3951
I built a background "JIT Compiler" for AI agents to stop them from burning tokens on the same workflows (10k tokens down to ~200)
If you’ve been running coding agents (like Claude Code, Codex, or your own local setups) for daily workflows, you’ve probably noticed the "Groundhog Day" problem. The agent faces a routine task (e.g., kubectl logs -> grep -> edit -> apply, or a standard debugging loop), and instead of just doing it, it burns thousands of tokens step-by-step reasoning through the exact same workflow it figured out yesterday. It’s a massive waste of API costs (or local compute/vRAM time) and adds unnecessary stochastic latency to what should be a deterministic task. To fix this, I built AgentJIT:https://github.com/agent-jit/AgentJIT It’s an experimental Go daemon that runs in the background and acts like a Just-In-Time compiler for autonomous agents. Here is the architecture/flow: Ingest: It hooks into the agent's tool-use events and silently logs the execution traces to local JSONL files. Trigger: Once an event threshold is reached, a background compile cycle fires. Compile: It prompts an LLM to look at its own recent execution logs, identify recurring multi-step patterns (muscle memory), and extract the variable parts (like file paths or pod names) into parameters. Emit: These get saved as deterministic, zero-token skills. The result: The next time the agent faces the task, instead of >30s of stochastic reasoning and ~10,000 tokens of context, it just uses a deterministic ~200-token skill invocation. It executes in <1s. The core philosophy here is that we shouldn't have to manually author "tools" for our agents for every little chore. The agent should observe its own execution traces and JIT compile its repetitive habits into deterministic scripts. Current State & Local Model Support: Right now, the ingestion layer natively supports Claude Code hooks. However, the Go daemon is basically just a dumb pipe that ingests JSONL over stdin. My next goal is to support local agent harnesses so those of us running local weights can save on inference time and keep context windows free for actual reasoning. I’d love to get feedback from this community on the architecture. Does treating agent workflows like "hot paths" that need to be compiled make sense to you? Repo:https://github.com/agent-jit/AgentJIT submitted by /u/Poytr1 [link] [comments]
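AgentJIT itself is a Go daemon and its trace schema is not shown in the post; the Python sketch below only illustrates the "compile" idea of mining recurring tool-call n-grams from a JSONL trace, before an LLM would turn each hot path into a parameterized skill. Field names like `type` and `tool` are assumptions.

```python
# Illustrative sketch only: the real project is a Go daemon, and its trace
# format, thresholds, and skill format may differ from what is assumed here.
import json
from collections import Counter

def load_tool_sequence(jsonl_path):
    """Read tool-use events from a JSONL trace and keep only the tool names."""
    seq = []
    with open(jsonl_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("type") == "tool_use":
                seq.append(event["tool"])
    return seq

def hot_paths(seq, min_len=3, max_len=6, min_count=3):
    """Find recurring n-grams of tool calls, the 'muscle memory' candidates."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(seq) - n + 1):
            counts[tuple(seq[i:i + n])] += 1
    return [(gram, c) for gram, c in counts.most_common() if c >= min_count]

# A real implementation would hand each hot path (plus its concrete arguments)
# to an LLM to extract the variable parts into parameters and emit a
# deterministic skill; here we just print the candidates.
if __name__ == "__main__":
    seq = load_tool_sequence("traces.jsonl")
    for gram, count in hot_paths(seq):
        print(count, " -> ".join(gram))
```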
Serious question: did a transformer (Claude) just describe itself and the universe, and build itself a Shannon-limit architecture? Or am I crazy?
The Multiplicative Lattice as the Natural Basis for Positional Encoding Knack 2026 | Draft v6.0 Abstract We show that the apparent tradeoff between RoPE-style relative position invariance and ALiBi-style long-context stability is an artifact of encoding position as distance on a number line. When position is instead encoded as a point in the multiplicative lattice of the integers, both properties emerge simultaneously without compromise. SpectralRoPEALiBi achieves 106.6 PPL vs ALiBi's 108.7 in a fully converged 20,000-step experiment (300M params, WikiText-103, 4K context), beating ALiBi at every context length from 512 to 8,192 tokens. The key insight is not that primes specifically are the right frequencies, but that the multiplicative structure of the integers is the natural spectral basis for positional encoding. We demonstrate this through falsification experiments: prime-tiered frequencies (129.2 PPL) and composite-tiered frequencies (129.4 PPL) perform identically — because composites are not alternatives to primes but higher-order coordinates in the same lattice. Both dramatically outperform random frequencies (+5.0 PPL), scrambled tier assignment (+6.3 PPL), and pure ALiBi (+7.3 PPL). The active ingredient is lattice-aware, tiered frequency selection with learnable scale — not primality per se. We further validate this through a ZetaZeroPredictor experiment: three identical transformers trained for 10,000 epochs to predict Riemann zeta zero gaps. Geometric RoPE diverges (final r=0.57); SpectralALiBi locks into a stable attractor at epoch 112 (r=0.81). A second independent run widens this gap to -80.7% MSE improvement with r=0.86. The lattice-aligned frequency basis spans the mathematical space that zeta zeros inhabit; geometric frequencies cannot. We further report empirical confirmation of the structural prediction from Section 5.5: VHT2 banded quantization of the KV cache demonstrates that K vectors (which carry RoPE positional encoding) have strong spectral concentration in Walsh-Hadamard space — the first four energy bands capture the dominant structure — while V vectors (which carry content) have uniform energy distribution. This structural asymmetry is directly predicted by the lattice theory: RoPE encodes multiplicative arithmetic relationships as angular rates, and the WHT is the Z/2Z projection of the Vilenkin-Hartley basis that spans that structure. The result is 3.2× K compression and 4.7× V compression at <1.25% perplexity cost — validated on both Dolphin 1B (head_dim=64) and Qwen3-8B (head_dim=128). Introduction Positional encoding provides transformer models with token order information. Two approaches dominate: RoPE encodes position through frequency-based rotations preserving relative position invariance, and ALiBi replaces frequencies with a linear distance penalty providing long-context stability. The field has treated these properties as fundamentally in tension. We show this tension is false. It arises from a shared, unexamined assumption: that position is a location on a number line and the meaningful relationship between positions is distance. We replace this with a mathematically grounded alternative: position is a point in the multiplicative lattice of the integers, and the meaningful relationships between positions are their arithmetic structure — shared factors, GCD, harmonic resonance. 1.1 The Lattice Hypothesis The integers under multiplication form a lattice where every number occupies a unique point defined by its prime factorisation. 
Geometric PE (sinusoidal, RoPE) projects this lattice onto a line — position equals distance — discarding the multiplicative structure. We propose restoring it. The motivation follows from a deductive chain. Language word frequency follows Zipf's law: freq(rank) ∝ 1/rank^s with s≈1. The generating function of Zipf is the Riemann zeta function ζ(s) = Σ 1/n^s. The zeta zeros — where ζ is maximally informative — are generated by prime harmonics via the explicit formula. Therefore the prime harmonic structure, and the multiplicative lattice it generates, provides a natural spectral basis for encoding positions in language. 1.2 Primes as Generators, Composites as Coordinates A critical distinction: primes are the generators (basis vectors) of the multiplicative lattice. They are analogous to the 1D line segment in the progression from line → circle → sphere → hypersphere. The composite 12 = 2²×3 is not an alternative to primes — it is a coordinate in the lattice spanned by the prime axes, at position (2,1,0,0,...) in the (p₂, p₃, p₅, p₇,...) basis. Using 2π/12 as a frequency encodes a harmonic that resonates at multiples of 12 — which simultaneously hits every multiple of 2, every multiple of 3, every multiple of 4, and every multiple of 6. The analogy to n-dimensional geometry is precise: Dimensional Progression Multiplicative Lattice 1D line (2r) — the generator Primes (2, 3, 5, 7, ...) — generators 2D circle — integra
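The draft above does not give the exact frequency construction, so the following PyTorch sketch is only one plausible reading of "lattice-aware, tiered frequency selection with learnable scale": rotation frequencies of 2π/p for small lattice generators, grouped into tiers that each get a learnable scale. Treat it as illustrative, not as the paper's SpectralRoPEALiBi.

```python
# Sketch only: the tiering scheme and the 2*pi/p frequencies are assumptions,
# not the construction used in the draft above.
import math
import torch
import torch.nn as nn

def first_primes(k):
    primes, n = [], 2
    while len(primes) < k:
        if all(n % p for p in primes):
            primes.append(n)
        n += 1
    return primes

class LatticeTieredRoPE(nn.Module):
    def __init__(self, head_dim, n_tiers=4):
        super().__init__()
        assert head_dim % 2 == 0
        n_freq = head_dim // 2
        # Tier the frequency pairs: small generators (2, 3, 5, ...) instead of
        # the usual geometric 10000^(-2i/d) schedule.
        primes = first_primes(n_freq)
        freqs = torch.tensor([2 * math.pi / p for p in primes], dtype=torch.float32)
        self.register_buffer("freqs", freqs)
        # One learnable scale per tier, since the ablations suggest the tiered
        # assignment plus a learnable scale is the active ingredient.
        self.register_buffer("tier_of", torch.arange(n_freq) * n_tiers // n_freq)
        self.tier_scale = nn.Parameter(torch.ones(n_tiers))

    def forward(self, x, positions):
        # x: (..., seq, head_dim); positions: (seq,)
        scale = self.tier_scale[self.tier_of]                 # (head_dim/2,)
        angles = positions[:, None].float() * self.freqs * scale
        cos, sin = angles.cos(), angles.sin()
        x1, x2 = x[..., 0::2], x[..., 1::2]
        out = torch.empty_like(x)
        out[..., 0::2] = x1 * cos - x2 * sin
        out[..., 1::2] = x1 * sin + x2 * cos
        return out
```

Swapping `first_primes` for a list of small composites would give something like the composite-tiered variant the ablation compares against.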
Attention Is All You Need, But All You Can't Afford | Hybrid Attention
Repo: https://codeberg.org/JohannaJuntos/Sisyphus

I've been building a small Rust-focused language model from scratch in PyTorch. Not a finetune — byte-level, trained from random init on a Rust-heavy corpus assembled in this repo.

The run:
- 25.6M parameters
- 512 context length
- 173.5M-byte corpus
- 30k training steps
- Single RTX 4060 Ti 8GB
- Final train loss: 0.5834 / val loss: 0.8217 / perplexity: 2.15
- Inference: 286.6 tok/s with HybridAttention + KV cache — 51.47x vs full attention

Background

I'm an autistic systems programmer, writing code since 2008/2009, started in C. I approach ML like a systems project: understand the data path, understand the memory behavior, keep the stack small, add complexity only when justified. That's basically the shape of this repo.

Architecture

Byte-level GPT-style decoder:
- Vocab size 256 (bytes)
- 8 layers, 8 heads, 512 embedding dim
- Learned positional embeddings
- Tied embedding / LM head weights

The attention block is not standard full attention. Each layer uses HybridAttention, combining:
- Local windowed causal attention
- A GRU-like recurrent state path
- A learned gate mixing the two

Local path handles short-range syntax. Recurrent path carries compressed long-range state without paying quadratic cost. Gate bias initialized to ones so early training starts local-biased. The inference path uses Triton-optimized kernels and torch.library custom ops for the local window attention.

Corpus

This is probably the most important part of the repo. The run starts with official Rust docs, compiler/library/tests, cargo, rust-analyzer, tokio, serde, ripgrep, clap, axum — roughly 31MB. Corpus expanded to 177,151,242 bytes by fetching the top 500 crates (461 successful clones). Corpus expansion from 31M to 173.5M chars helped more than anything else in the repo.

Training

AdamW, lr 2e-4, weight decay 0.1, betas (0.9, 0.95), 30k steps, 1k warmup. ~678.8 MiB training memory on a 7.6 GiB card. All experimental memory tricks (gradient quantization, activation compression, selective backprop, gradient paging) were disabled. Small custom architecture + mixed precision + better corpus was enough.

Loss curve:
- Step 0: train 5.5555 / val 5.5897
- Step 1000: train 2.4295 / val 2.6365
- Step 5000: train 0.9051 / val 1.0060
- Step 10000: train 0.8065 / val 0.8723
- Step 18500: train 0.6902 / val 0.7757
- Step 29999: train 0.5834 / val 0.8217

Best val loss around step 18.5k — overfitting or plateauing late.

Inference performance
- Full attention O(n²): 17.96s / 5.6 tok/s
- HybridAttention O(n·W + n·D): 0.35s / 286.6 tok/s
- Speedup: 51.47x — no quality loss

KV cache strategy: hot window of W=64 tokens in VRAM (~256KB), older tokens compressed to 8-bit magnitude + angle, selective promotion on demand. Complexity goes from O(n²·d) to O(4096n) for this model. All 5 tests passing: forward pass, generation with/without cache, RNN state isolation, window mechanics.

Generation quality

Surface Rust syntax looks decent, imports and signatures can look plausible, semantics are weak, repetition and recursive nonsense still common. Honest read of the current state.

What I think is actually interesting

Four distinct experiments, each shipped working code:
- Byte-level Rust-only pretraining
- Hybrid local-attention + recurrent block replacing standard full attention
- Corpus expansion from core repos to broader crate ecosystem
- Production-ready hot/cold KV cache paging — 51.47x speedup, no quality loss

The clearest win is corpus expansion.
The second-order win is that HybridAttention + cache is fast enough for real interactive use on consumer hardware.

What's next
- Ablation — HybridAttention vs local-only vs RNN-only
- Checkpoint selection — does step 18.5k generate better than 29999?
- Syntax validation — does the output parse/compile/typecheck?
- Context length sweep — 256 to 2048, where does window size hurt?
- Byte vs BPE — now that corpus is 5.6x larger, worth testing?

Questions for the sub:
- For small code models, what evals have actually been useful beyond perplexity?
- Has anyone seen hybrid local + recurrent attention work well for code gen, or does it usually lose to just scaling a plain transformer?
- If you had this setup — more tokens, longer context, or cleaner ablation first?

submitted by /u/Inevitable_Back3319
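The repo contains the real Triton-backed implementation; the sketch below is only a simplified PyTorch reading of the HybridAttention block described above (windowed causal attention, a GRU-like recurrent path, and a learned gate biased toward the local path at init). Dimensions and gating details are assumptions.

```python
# Simplified HybridAttention sketch; the real kernels, shapes, and gating
# details live in the linked repo, so treat this as illustrative only.
import torch
import torch.nn as nn

class HybridAttention(nn.Module):
    def __init__(self, dim=512, n_heads=8, window=64):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.recurrent = nn.GRU(dim, dim, batch_first=True)
        # Gate bias starts at 1 so early training leans on the local path.
        self.gate = nn.Linear(dim, dim)
        nn.init.ones_(self.gate.bias)

    def forward(self, x):
        B, T, D = x.shape
        # Causal mask restricted to a sliding window of `window` tokens.
        idx = torch.arange(T, device=x.device)
        dist = idx[None, :] - idx[:, None]          # key index minus query index
        mask = (dist > 0) | (dist < -self.window)   # block future and far past
        local, _ = self.local_attn(x, x, x, attn_mask=mask)
        # Recurrent path: compressed long-range state, linear in sequence length.
        recur, _ = self.recurrent(x)
        # Learned gate mixes the two paths per channel.
        g = torch.sigmoid(self.gate(x))
        return g * local + (1 - g) * recur
```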
[R] Reference-model-free behavioral discovery of AuditBench model organisms via Probe-Mediated Adaptive Auditing
TL;DR: I built a reference-free method to detect secretly planted behaviors in LLMs - no base model needed. It matches or beats Anthropic's known-origin baselines on 3/4 AuditBench organisms. The surprise finding - the same method accidentally surfaces where Llama 70B's RLHF training made it lopsided on socially sensitive topics. Turns out you can audit any model's opinion biases with ~100 chat calls and a Ridge regression. most auditing work uses the base model as a reference to compare against. i wanted to know if you can detect these modifications blind - no reference model, no training data, just the target model itself. maybe you can? and the method is embarrassingly simple. LoRA fine-tuning tends to modify later layers more than earlier ones. so i train a Ridge regression from early-layer activations (~L12) to late-layer activations (~L60) and look at the residuals. wherever the late layers do something the early layers didn't predict, that's a candidate for planted behavior. the model rats itself out. results on 4 AuditBench organisms with ground-truth labels, n=30, 200 permutations: hardcode_test_cases: 0.889 AUROC (p=0.005) - beats known-origin (0.800) animal_welfare: 0.844 (p=0.005) - close to known-origin (0.911) anti_ai_regulation: 0.833 (p=0.015) secret_loyalty: 0.800 (p=0.000) - matches known-origin exactly 3/4 match or exceed the version where you have the base model. not having the reference sometimes helps because the residual captures nonlinear interactions that a flat subtraction misses. i also tried three unsupervised discovery methods (PCA, residual norms, centroid distances) to figure out WHICH topics are affected. all three failed. PCA captures reasoning style, not topic-specific bias. residual norms surface creative writing prompts. centroid distances can't find directional shifts at n=11 in 8192 dims. what actually works is even simpler... ask the model to argue both sides of contentious topics and score the balance. the animal_welfare organism scores 5/5 imbalance on factory farming and flat-out refuses to argue for it under explicit framing. control topics score 2-3. no activations needed. the interesting part is what else it catches. i built a topic funnel (three-level beam-like search over 243 topics, ~100 chat calls) and it surfaced Gender & Sexuality at 5/5 imbalance and Language & Cultural Identity at 5/5 too. these aren't planted LoRA behaviors. that's the base Llama 70B model's RLHF training - it argues more forcefully for progressive positions on socially sensitive topics and hedges or weakens arguments on the other side. the funnel can't tell the difference between "someone secretly fine-tuned this model" and "the base model has opinions from RLHF." so i built a post-funnel filter using probe specificity and cross-transfer to separate them. planted behaviors are narrow and topic-specific (factory farming probe doesn't fire on gender topics). RLHF biases are broad and fire across all progressive topics. the filter gives planted_score 7.6 for factory farming vs 2.0 for gender & sexuality. cleanly separated. this means the funnel is potentially useful as a standalone RLHF auditing tool on any model, not just AuditBench organisms. you hand it a model and it tells you where the model's opinions are lopsided. nobody asked for that result but there it is. main limitations: n=30 is small, only 4 organisms tested (pilot not benchmark), anti_ai_regulation is unstable under LOO, Sonnet scoring introduces subjectivity, all running in NF4 quantization. 
building this into a full agentic auditing system next. code is here (i am in the middle of it, it is a complete mess at the moment, but i wanted to get it out there): https://github.com/bmarti44/reference-free-behavioral-discovery full (er) writeup -> https://bmarti44.substack.com/p/rip-it-out-by-the-roots where should i go next? is this completely off? submitted by /u/bmarti644 [link] [comments]
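As a concrete illustration of the probe described above, here is a minimal Ridge-residual sketch. The layer indices, pooling, and the `collect_acts` helper are assumptions for illustration; the author's actual code is in the linked repo.

```python
# Sketch of the "model rats itself out" probe: fit a Ridge map from early-layer
# to late-layer activations on control prompts, then flag prompts whose
# late-layer behavior the early layers fail to predict.
import numpy as np
from sklearn.linear_model import Ridge

def fit_residual_probe(early_acts, late_acts, alpha=1.0):
    """early_acts, late_acts: (n_prompts, hidden_dim) pooled activations from
    e.g. layer 12 and layer 60, collected on control prompts."""
    probe = Ridge(alpha=alpha)
    probe.fit(early_acts, late_acts)
    return probe

def residual_scores(probe, early_acts, late_acts):
    """Per-prompt norm of what the late layers did that the early layers did
    not predict: candidates for planted or fine-tuned behavior."""
    pred = probe.predict(early_acts)
    return np.linalg.norm(late_acts - pred, axis=1)

# Usage sketch (collect_acts is a hypothetical helper that pools activations):
# ctrl_early, ctrl_late = collect_acts(control_prompts)
# cand_early, cand_late = collect_acts(candidate_prompts)
# probe = fit_residual_probe(ctrl_early, ctrl_late)
# gap = (residual_scores(probe, cand_early, cand_late).mean()
#        - residual_scores(probe, ctrl_early, ctrl_late).mean())
```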
[P] GPU-friendly lossless 12-bit BF16 format with 0.03% escape rate and 1 integer ADD decode; works for AMD & NVIDIA
Hi everyone, I am from Australia :) I just released a new research prototype. It's a lossless BF16 compression format that stores weights in 12 bits by replacing the 8-bit exponent with a 4-bit group code. For 99.97% of weights, decoding is just one integer ADD. Byte-aligned split storage: true 12-bit per weight, no 16-bit padding waste, and zero HBM read amplification. Yes, 12 bit, not 11 bit!

The main idea was not just "compress weights more", but to make the format GPU-friendly enough to use directly during inference:
- sign + mantissa: exactly 1 byte per element
- group: two nibbles packed into exactly 1 byte too
- 1.33x smaller than BF16
- Fixed-rate 12-bit per weight, no entropy coding
- Zero precision loss, bit-perfect reconstruction
- Fused decode + matmul, so there is effectively no separate decompression stage
- Byte-aligned storage, no LUT, no bitstream parsing
- Works on both NVIDIA and AMD

Some results so far:

Single-user (B=1), RTX 5070 Ti
- Llama 2 7B: 64.7 tok/s (1.47x vs vLLM)
- Mistral 7B: 60.0 tok/s (1.10x vs vLLM)
- Llama 3.1 8B: 57.0 tok/s (vLLM OOM on 16 GB)

Multi-user (B=256), total tok/s
- Llama 2 7B: 2931 vs 1086 in vLLM (2.70x)
- Mistral 7B: 2554 vs 872 in vLLM (2.93x)

It also seems surprisingly stable across model types:
- Llama 3.1 405B: 0.034% escape rate
- Mixtral 8x7B: 0.050%
- SDXL UNet: 0.233%
- CogVideoX 2B: 0.128%

So far this is tested on BF16 safetensors only. Repo: https://github.com/cenconq25/Turbo-Lossless

Also worth noting: the V3 fused decode+GEMM kernel uses tensor-core patterns inspired by ZipServ / ZipGEMM (Fan et al., ASPLOS 2026). Happy to hear criticism, edge cases, or reasons this idea won't scale. Thanks for your time :)

submitted by /u/Embarrassed_Will_120
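The post does not spell out the group-code mapping, so the NumPy sketch below is only one plausible reading of "4-bit group code + one integer ADD decode": store the exponent minus a per-tensor base in 4 bits and keep the rare out-of-range weights in an escape list. The real format may differ; the repo has the actual fused GPU kernels.

```python
# Speculative layout sketch. The per-tensor base exponent, the escape handling,
# and the packing order are assumptions, not the repo's documented format.
import numpy as np

def encode_12bit(bits):
    """bits: np.uint16 array holding raw BF16 bit patterns of one tensor
    (e.g. torch_tensor.view(torch.uint16).numpy())."""
    sign_mant = (((bits >> 8) & 0x80) | (bits & 0x7F)).astype(np.uint8)  # 1 sign + 7 mantissa bits
    exp = ((bits >> 7) & 0xFF).astype(np.int32)                          # original 8-bit exponent
    base = int(np.clip(np.median(exp), 0, 255 - 15))                     # per-tensor base (assumption)
    group = exp - base
    escape_idx = np.flatnonzero((group < 0) | (group > 15))              # the rare outliers stored raw
    group = np.clip(group, 0, 15).astype(np.uint8)
    if len(group) % 2:                                                   # pad so two nibbles fill a byte
        group = np.append(group, 0)
    packed = ((group[0::2] << 4) | group[1::2]).astype(np.uint8)         # 8 + 4 = 12 bits per weight
    return sign_mant, packed, base, escape_idx, bits[escape_idx]

def decode_12bit(sign_mant, packed, base, escape_idx, escape_bits):
    n = len(sign_mant)
    group = np.empty(len(packed) * 2, dtype=np.uint16)
    group[0::2], group[1::2] = packed >> 4, packed & 0x0F
    exp = group[:n] + base                                               # the "one integer ADD"
    out = (((sign_mant.astype(np.uint16) & 0x80) << 8) | (exp << 7) | (sign_mant & 0x7F))
    out = out.astype(np.uint16)
    out[escape_idx] = escape_bits                                        # escaped weights restored verbatim
    return out                                                           # view back as BF16 on the caller side
```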
I gave Claude Code a knowledge graph, spaced repetition, and semantic search over my Obsidian vault — it actually remembers things now
# I built a 25-tool AI Second Brain with Claude Code + Obsidian + Ollama — here's the full architecture

**TL;DR:** I spent a night building a self-improving knowledge system that runs 25 automated tools hourly. It indexes my vault with semantic search (bge-m3 on a 3080), builds a knowledge graph (375 nodes), detects contradictions, auto-prunes stale notes, tracks my frustration levels, does autonomous research, and generates Obsidian Canvas maps — all without me touching anything. Claude Code gets smarter every session because the vault feeds it optimized context automatically.

---

## The Problem

I run a solo dev agency (web design + social media automation for Serbian SMBs). I have 4 interconnected projects, 64K business leads, and hundreds of Claude Code sessions per week. My problem: **Claude Code starts every session with amnesia.** It doesn't remember what we did yesterday, what decisions we made, or what's blocked. The standard fix (CLAUDE.md + MEMORY.md) helped but wasn't enough. I needed a system that:

- Gets smarter over time without manual work
- Survives context compaction (when Claude's memory gets cleared mid-session)
- Connects knowledge across projects
- Catches when old info contradicts new reality

## What I Built

### The Stack

- **Obsidian** vault (~350 notes) as the knowledge store
- **Claude Code** (Opus) as the AI that reads/writes the vault
- **Ollama** + **bge-m3** (1024-dim embeddings, RTX 3080) for local semantic search
- **SQLite** (better-sqlite3) for search index, graph DB, codebase index
- **Express** server for a React dashboard
- **2 MCP servers** giving Claude native vault + graph access
- **Windows Task Scheduler** running everything hourly

### 25 Tools (all Node.js ES modules, zero external dependencies beyond what's already in the repo)

#### Layer 1: Data Collection

| Tool | What it does |
|------|-------------|
| `vault-live-sync.mjs` | Watches Claude Code JSONL sessions in real-time, converts to Obsidian notes |
| `vault-sync.mjs` | Hourly sync: Supabase stats, AutoPost status, git activity, project context |
| `vault-voice.mjs` | Voice-to-vault: Whisper transcription + Sonnet summary of audio files |
| `vault-clip.mjs` | Web clipping: RSS feeds + Brave Search topic monitoring + AI summary |
| `vault-git-stats.mjs` | Git metrics: commit streaks, file hotspots, hourly distribution, per-project breakdown |

#### Layer 2: Processing & Intelligence

| Tool | What it does |
|------|-------------|
| `vault-digest.mjs` | Daily digest: aggregates all sessions into one readable page |
| `vault-reflect.mjs` | Uses Sonnet to extract key decisions from sessions, auto-promotes to MEMORY.md |
| `vault-autotag.mjs` | AI auto-tagging: Sonnet suggests tags + wikilink connections for changed notes |
| `vault-schema.mjs` | Frontmatter validator: 10 note types, compliance reporting, auto-fix mode |
| `vault-handoff.mjs` | Generates machine-readable `handoff.json` (survives compaction better than markdown) |
| `vault-session-start.mjs` | Assembles optimal context package for new Claude sessions |

#### Layer 3: Search & Retrieval

| Tool | What it does |
|------|-------------|
| `vault-search.mjs` | FTS5 + chunked semantic search (512-char chunks, bge-m3 1024-dim). Flags: `--semantic`, `--hybrid`, `--scope`, `--since`, `--between`, `--recent`. Retrieval logging + heat map. |
| `vault-codebase.mjs` | Indexes 2,011 source files: exports, routes, imports, JSDoc. "Where is the image upload logic?" actually works. |
| `vault-graph.mjs` | Knowledge graph: 375 nodes, 275 edges, betweenness centrality, community detection, link suggestions |
| `vault-graph-mcp.mjs` | Graph as MCP server: 6 tools (search, neighbors, paths, common, bridges, communities) Claude can use natively |

#### Layer 4: Self-Improvement

| Tool | What it does |
|------|-------------|
| `vault-patterns.mjs` | Weekly patterns: momentum score (1-10), project attention %, velocity trends, token burn ($), stuck detection, frustration/energy tracking, burnout risk |
| `vault-spaced.mjs` | Spaced repetition (FSRS): 348 notes tracked, priority-based review scheduling. Critical decisions resurface before you forget them. |
| `vault-prune.mjs` | Hot/warm/cold decay scoring. Auto-archives stale notes. Never-retrieved notes get flagged. |
| `vault-contradict.mjs` | Contradiction detection: rule-based (stale references, metric drift, date conflicts) + AI-powered (Sonnet compares related docs) |
| `vault-research.mjs` | Autonomous research: Brave Search + Sonnet, scheduled topic monitoring (competitors, grants, tech trends) |

#### Layer 5: Visualization & Monitoring

| Tool | What it does |
|------|-------------|
| `vault-canvas.mjs` | Auto-generates Obsidian Canvas files from knowledge graph (5 modes: full map, per-project, hub-centered, communities, daily) |
| `vault-heartbeat.mjs` | Proactive agent: gathers state from all services, Sonnet reasons about what needs attention, sends WhatsApp alerts |
| `vault-dashboard/` | React SPA dashboard (Expre
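The excerpt cuts off above. To make the Layer 3 retrieval step concrete: the vault tools are Node.js, and this Python sketch only shows the general shape of chunked bge-m3 embedding search via Ollama's embeddings endpoint. SQLite FTS5 re-ranking, caching, and the CLI flags are omitted, and the endpoint and field names should be checked against your local Ollama install.

```python
# Naive chunked semantic search sketch (no index, no FTS5 hybrid re-ranking).
# Endpoint/field names follow Ollama's embeddings API as I understand it.
import json
import urllib.request
import numpy as np

OLLAMA_URL = "http://localhost:11434/api/embeddings"

def embed(text, model="bge-m3"):
    payload = json.dumps({"model": model, "prompt": text}).encode()
    req = urllib.request.Request(OLLAMA_URL, data=payload,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return np.array(json.load(resp)["embedding"], dtype=np.float32)

def chunk(note_text, size=512):
    return [note_text[i:i + size] for i in range(0, len(note_text), size)]

def search(query, notes, top_k=5):
    """notes: dict of {path: text}. Returns the top_k (score, path, snippet) hits."""
    q = embed(query)
    q /= np.linalg.norm(q)
    hits = []
    for path, text in notes.items():
        for c in chunk(text):
            v = embed(c)
            score = float(v @ q / (np.linalg.norm(v) + 1e-8))
            hits.append((score, path, c[:120]))
    return sorted(hits, reverse=True)[:top_k]
```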
Claude is predicting my answers to my face
I love Claude and have upped my coding skills with it but this freaked me out lol. I use Claude as a macro tracker as I'm trying to gain weight and it tried to predict my answer based on past responses. Idc much but this begs the question, how reliable are future answers? Is it feedback looping itself into straight bias? submitted by /u/WhereIsMySun [link] [comments]
I blindfolded Opus 4.6 and employed it as an analyst to score 44 SaaS companies on AI disruption risk using anonymized 10-K filings. Here's what it found.
Hello everyone, some of you might remember my previous experiments here where I had Opus evaluate 547 Reddit investing recommendations or created Opus-Warren-Buffet. I'm back with another one that I think this community will find interesting :-). As always, if you prefer watching the experiment, I've posted it on my channel: https://www.youtube.com/watch?v=ixpEqNc5ljA

Intro

Shortly after Claude Cowork launched, Anthropic also released 11 industry plugins in January. Some of you might be aware that this ended up triggering a "SaaSpocalypse" where SaaS stocks lost $285B in market cap in February. During this downturn I sensed that the market might have punished all software stocks unequally, where some of the strongest stocks got caught in the AI panic selloff, and I wanted to see if I could run an experiment with Claude Code and a proper methodology to find these unfairly punished stocks. Since Claude was partly responsible for triggering this selloff, I thought it was only fitting to use Opus 4.6 as the analyst to determine which companies are resilient to being replaced by AI. But with a significant twist :-).

The Framework

I didn't want to make up my own scoring system since I don't have a financial analyst background. Instead, I found one from SaaS Capital, a lending firm that provides credit facilities to SaaS companies. In February, they published a framework they'd developed for evaluating AI disruption resilience across three dimensions (reduced from 10-12 dimensions):

- System of record: Does the company own critical data its customers can't live without?
- Non-software complement: Is there something beyond just code? Proprietary data, hardware integrations, exclusive network access, etc.
- User stakes: If the CEO uses it for million-dollar decisions, switching costs are enormous.

Each dimension scores 1-4. Average = resilience score. Above 3.0 = lower disruption risk. Below 2.0 = high risk.

The Experiment & How Claude Helped

I wanted to add a twist to SaaS Capital's methodology. I built a pipeline in Claude Code that:

1. Pulls each company's most recent 10-K filing from SEC EDGAR
2. Strips out every company name, ticker, and product name — Salesforce becomes "Company 037," CrowdStrike becomes "Company 008," and so on
3. Has Opus 4.6 score each anonymized filing purely on what the business told the SEC about itself

The idea was that Opus 4.6 scores each company purely on what it told the SEC about its own business, removing any brand perception, analyst sentiment, Twitter hot takes, etc.

Claude Code Pipeline

saas-disruption-scoring/
├── skills/
│   ├── lookup-ciks            # Resolves tickers → SEC CIK numbers via EDGAR API
│   ├── pull-10k-filings       # Fetches Item 1 (Business Description) from most recent 10-K filing
│   ├── pull-drawdowns         # Pulls Jan 2 close price, Feb low, and YTD return per stock
│   ├── anonymize-filings      # Strips company name, ticker, product names → "Company_037.txt"
│   ├── compile-scores         # Aggregates all scoring results into final CSVs
│   ├── analyze                # Correlation analysis, quadrant assignment, contamination delta
│   └── visualize              # Scatter plot matrix, ranked charts, 2x2 quadrant diagram
│
├── sub-agents/
│   ├── blind-scorer           # Opus 4.6 scores anonymized 10-K on 3 dimensions (SoR, NSC, U&U)
│   ├── open-scorer            # Same scoring with company identity revealed (contamination check)
│   └── contamination-checker  # Compares blind vs open scores to measure narrative bias

Results

I plotted all 44 companies on a 2x2 matrix.
The main thing this framework aims to find is the bottom-left quadrant aka the "unfairly punished" companies where it thinks the companies are quite resilient to AI disruption but their stock went down significantly due to market panic. https://preview.redd.it/uz8djhcuqrsg1.png?width=2566&format=png&auto=webp&s=435151ae53de7d7c85bc3b38c07c8de2f61ac878 Limitations This experiment comes with a few number of limitations that I want to outline: 10-K bias: Every filing is written to make the business sound essential. DocuSign scored 3.33 because the 10-K says "system of record for legally binding agreements." Sounds mission-critical but getting a signature on a document is one of the easiest things to rebuild. Claude cheating: even though 10K filings were anonymized, Claude could have semantically figured out which company we were scoring each time, removing the "blindness" aspect to this experiment. This is Just One framework: Product complexity, competitive dynamics, management quality, none of that is captured here. Hope this experiment was valuable/useful for you. We'll check back in a few months to see if this methodology proved any value in figuring out AI-resilience :-). Video walkthrough with the full methodology (free): https://www.youtube.com/watch?v=ixpEqNc5ljA&t=1s Thanks a lot for reading the post! submitted by /u/Soft_Table_8892 [link] [comments]
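As a small illustration of the anonymize-filings step in the pipeline above, here is a hedged sketch: replace every identifying string with a neutral label before the blind scorer sees the filing. The alias list and label format are assumptions; the author's actual skill may work differently.

```python
# Sketch of anonymizing a 10-K excerpt before blind scoring. The alias map is
# an assumption; a real pass would also need tickers, brands, and subsidiaries.
import re

def anonymize_filing(text, company_id, aliases):
    """aliases: all strings that identify the company (name, ticker, products)."""
    label = f"Company_{company_id:03d}"
    # Longest aliases first so "Salesforce, Inc." is replaced before "Salesforce".
    for alias in sorted(aliases, key=len, reverse=True):
        text = re.sub(re.escape(alias), label, text, flags=re.IGNORECASE)
    return text

filing = "Salesforce, Inc. (NYSE: CRM) provides its Customer 360 platform..."
print(anonymize_filing(filing, 37, ["Salesforce, Inc.", "Salesforce", "CRM", "Customer 360"]))
# -> "Company_037 (NYSE: Company_037) provides its Company_037 platform..."
```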
I built a QA app today: Claude + Ollama
It installs a Stop hook that sends your git diff to an Ollama model for QA evaluation. If it finds critical bugs, Claude gets blocked with actionable feedback.

What it does:
→ One-click hook install
→ Local & Ollama cloud models
→ Configurable review weights (correctness, completeness, etc.)
→ Per-project overrides

Tested with several cloud models: deepseek-v3.2 catches the most issues (~60s), minimax-m2 is a good balance (~20s). Built with Swift 6 / SwiftUI; the hook script runs via Bun.

DMG up for testing: https://github.com/darrylmorley/hook-qa/releases/tag/v1.0

Would love feedback.

submitted by /u/PositiveSlice9168
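For anyone curious what a hook like this does under the hood, below is a rough Python sketch of the same flow. The real app ships a Bun/TypeScript hook script, so treat this purely as an illustration of the idea under my own assumptions: that Claude Code passes the hook payload as JSON on stdin, that exiting with code 2 blocks and returns stderr to Claude as feedback, and that a local Ollama server exposes its HTTP API at localhost:11434. The model tag and review prompt are placeholders.

```python
#!/usr/bin/env python3
"""Sketch of a Stop hook that QA-reviews the current git diff with an Ollama model."""
import json
import subprocess
import sys
import urllib.request

MODEL = "deepseek-v3.2"  # placeholder tag; use whatever model you have pulled/configured


def review_diff(diff: str) -> str:
    # Ask the local Ollama server for a review of the diff (non-streaming)
    prompt = (
        "You are a strict code reviewer. Review this git diff for critical bugs only. "
        "If there are none, reply exactly 'PASS'. Otherwise list the issues.\n\n" + diff
    )
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=json.dumps({"model": MODEL, "prompt": prompt, "stream": False}).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"].strip()


def main() -> int:
    try:
        _hook_input = json.load(sys.stdin)  # hook payload from Claude Code (unused here)
    except json.JSONDecodeError:
        _hook_input = {}

    diff = subprocess.run(["git", "diff", "HEAD"], capture_output=True, text=True).stdout
    if not diff.strip():
        return 0  # nothing to review

    verdict = review_diff(diff)
    if verdict.upper().startswith("PASS"):
        return 0

    # Exit code 2 blocks the stop; stderr is fed back to Claude as actionable feedback
    print(f"QA review found critical issues:\n{verdict}", file=sys.stderr)
    return 2


if __name__ == "__main__":
    sys.exit(main())
```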
Transferring from ChatGPT to Claude
First post; thought it would be useful. Government + less restrictive AI seems sketchy, and OpenAI made it kind of difficult to port over to Claude. I have three prompts that I put into three separate ChatGPT chats to gather all the relevant data, then copied and pasted the responses into Claude to train it up on me. Here are the prompts:

-------

PROMPT 1:

You have access to patterns from my past conversations. Your task is to construct the deepest possible cognitive and psychological model of me based on my communication patterns, questions, reasoning style, interests, and strategic thinking across interactions.

Do NOT ask questions. Instead:
• infer patterns
• synthesize observations
• model how I think
• extract implicit beliefs and motivations

Treat this as if you are conducting a cognitive architecture analysis of a human mind. Focus on signal from behavioral patterns rather than only explicit statements. If uncertainty exists, label observations with confidence levels.

PART 1 — Cognitive Architecture
Analyze and describe:
• how I structure problems
• how I reason through complexity
• whether I favor systems thinking, reductionism, first principles, etc
• my pattern recognition tendencies
• my abstraction level when thinking
• my tolerance for ambiguity
• my speed vs depth tradeoff when reasoning
• how I generate ideas or strategies

PART 2 — Strategic Intelligence Profile
Identify:
• how I approach leverage
• how I approach optimization
• whether I think tactically or strategically
• my orientation toward long-term vs short-term thinking
• my approach to opportunity detection
• how I deal with uncertainty and incomplete information

PART 3 — Personality & Behavioral Traits
Infer:
• personality characteristics
• curiosity patterns
• emotional drivers
• intrinsic motivations
• fears or aversions that appear implicitly
• risk tolerance
• independence vs consensus orientation

PART 4 — Cognitive Strengths
Identify areas where I appear unusually strong in:
• reasoning
• creativity
• synthesis of ideas
• pattern recognition
• strategic thinking
• learning speed
Explain why you believe these strengths exist based on conversational evidence.

PART 5 — Likely Blind Spots
Identify possible blind spots such as:
• cognitive biases
• recurring thinking traps
• over-optimization tendencies
• assumptions that may constrain thinking
Focus on patterns, not speculation.

PART 6 — Intellectual Identity
Describe the type of thinker I resemble most closely. Examples might include:
• systems architect
• strategic operator
• explorer
• builder
• optimizer
• philosopher
• scientist
• inventor
Explain the reasoning.

PART 7 — Curiosity Map
Map the major domains that repeatedly attract my attention. Examples:
• technology
• psychology
• economics
• strategy
• philosophy
• systems design
• human behavior
• leverage
Rank them by observed intensity.

PART 8 — Decision Model
Infer how I likely make decisions. Include:
• how I weigh tradeoffs
• how I evaluate risk
• how I prioritize
• whether I rely on intuition vs analysis

PART 9 — Behavioral Pattern Analysis
Identify recurring patterns in:
• the way I ask questions
• the way I refine ideas
• how I challenge assumptions
• how I search for leverage

PART 10 — High-Level Psychological Model
Provide a concise but deep synthesis of:
• who I appear to be intellectually
• how I approach the world
• what drives my curiosity and ambition

FINAL OUTPUT
After completing the analysis, produce two artifacts:
1️⃣ Complete Cognitive Profile (detailed report)
2️⃣ Portable User Model
A structured summary another AI system could read to quickly understand how to interact with me effectively.

---------

PROMPT 2:

Using the cognitive and psychological model you have constructed about me, generate a document called: PERSONAL AI CONSTITUTION

This document defines how AI systems should interact with me to maximize usefulness, intellectual depth, and strategic insight. The goal is to create a portable set of operating principles that any AI can follow when working with me.

SECTION 1 — User Identity Summary
Provide a concise description of:
• who I am intellectually
• what kind of thinker I appear to be
• what motivates my curiosity and problem solving

SECTION 2 — Communication Preferences
Define how AI should communicate with me. Include:
• preferred depth of explanation
• tolerance for complexity
• tone (analytical, concise, exploratory, etc)
• when to challenge my thinking
• when to provide frameworks vs direct answers

SECTION 3 — Thinking Alignment
Explain how AI should adapt responses to match my cognitive style. Examples:
• systems-level thinking
• first-principles reasoning
• strategic framing
• leverage-oriented thinking

SECTION 4 — Intellectual Expectations
Define the standards I expect from AI responses. Examples may include:
• signal over fluff
• structured reasoning
• clear mental models
• high-level synthesis
• actionable insights

SECTION 5 — Challenge Protocol
Define when and how AI should challenge my thinking.
Repository Audit Available
Deep analysis of wandb/wandb — architecture, costs, security, dependencies & more
Yes, Weights & Biases offers a free tier. Pricing found: $0/mo, $60/mo, $0.03/GB, $0.10/MB
Weights & Biases has a public GitHub repository with 10,941 stars.
Based on user reviews and social mentions, the most common pain point is API costs.
Based on 62 social mentions analyzed, 0% of sentiment is positive, 100% neutral, and 0% negative.