Today's AI labs decided "make better models" was too modest, so the new pitch is: make models that make better models.
Welcome to your midweek Around the Horn Digest, your round-up of everything that crossed our desk today (and yesterday!), sorted. The day's center of gravity was recursive improvement: Recursive Superintelligence raised big money to automate AI creation, Nous claimed it can make pretraining 2-3x faster, NVIDIA shipped models that stretch their reasoning budgets on demand, and half the research bench was some version of "what if generation, reasoning, video, code, and memory all stopped working the old way?" Meanwhile, agents kept crawling outward into browsers, local desktops, coding workflows, and scientific tooling. The robots are still not doing your laundry, but they are apparently optimizing the training loop that will optimize the robot that will someday consider it. Let's get into it.
Previous digests: Tuesday, May 12 | Monday, May 11 | Weekend, May 9-10 | Thursday, May 7 | Wednesday, May 6 | Tuesday, May 5 | Monday, May 4
Around the Horn - Thursday, May 14, 2026
The big story today was Recursive Superintelligence turning recursive self-improvement from a research dream into a funded company pitch.
Recursive Superintelligence, founded by former Google, Meta, and OpenAI researchers, raised major funding to build systems that can help automate the creation of better AI systems. The company's public framing, echoed by Recursive SI and Jeff Clune, is not just "train a stronger model," but build open-ended discovery loops that can invent, test, and improve AI systems over time.
That's the part worth watching. The last two years were about scaling models; this story is about scaling the process that makes models. Pair it with Nous's faster pretraining work, NVIDIA's elastic reasoning models, and the day's pile of diffusion/code/science tooling, and the pattern is pretty clear: the next race is not just who has the best model, but who can make model-building itself cheaper, faster, and more automated.
🏆 TOP 5 NEWS
- Nous Research released Token Superposition Training, claiming 2-3x faster LLM pretraining by temporarily combining multiple tokens into one training example before returning to normal next-token training (shortlink, paper, HF, blog).
- NVIDIA released Star Elastic, a "many-in-one" reasoning model method that lets one model flex across different thinking budgets instead of training separate models for each latency/cost tier (paper, HF collection).
- NVIDIA Labs released AnyFlow, an any-step video diffusion model that can generate videos at different inference budgets without retraining (Yuchao Gu, HF paper, HF collection, demo).
- Anthropic CFO Krishna Rao explained how frontier AI now runs on compute procurement, $100B-scale infrastructure commitments, and a path to $30B ARR (annual recurring revenue) in an Invest Like the Best interview (Spotify, Apple, YouTube).
- David Turturean shared a claimed resolution of Erdos Problem #696 with a Lean 4 formalization (machine-checked math proof), arguing the normal orders of h(n) and H(n) disprove the proposed divergence on a density-one set (forum, Overleaf, GitHub).
Honorable Mentions
- Florent Krzakala shared Neural Low-Degree Filtering, a spectral theory arguing deep nets learn simple, broad patterns first before progressively learning harder details (paper, code).
- pengzhangzhi released Open-dLLM, an open diffusion language model for code generation that adapts autoregressive models into diffusion-style models through representation alignment instead of full retraining (paper, GitHub).
- Kevin Li released SWE-ZERO-12M, a 12M-trajectory dataset for training coding agents without requiring live code execution during data generation (HF dataset).
- Commonwealth Fusion Systems began preparing SPARC’s vacuum vessel for 100-million-degree plasma, reaching 75% completion as it works toward a 2027 net-energy fusion demo (company site).
- Guinan Su, Yanwu Yang, Xueyan Li, and Jonas Geiping introduced Multi-Stream LLMs, which let agents read, think, act, and write across parallel streams instead of squeezing every task through one sequential chat thread.
- Arazi et al. introduced MulTaBench, a benchmark for multimodal tabular foundation models that combine text, images, and structured numerical or categorical data.
- SenseTime released SenseNova-U1, a native unified multimodal model that combines understanding and generation without separate pipelines, with open weights, code, dataset, HF paper, HF collection, GitHub, and demo studio.
🍪 TOP TREATS TO TRY
- Kimi WebBridge connects Kimi's desktop agent to your browser so it can click, fill forms, compare flights, research trends, and complete web tasks locally (launch post) - free to try.
- Printing Press turns an API spec, website, or community project into a Go CLI, Claude Code skill, OpenClaw skill, and MCP server (a standard connector that lets agents use outside tools); mvanhorn used it to build a Granola transcript-search skill - no pricing details.
- AtomicChat gives you free local chat with Qwen, Kimi, LLaMA, DeepSeek, and other models running privately on your machine, alongside a faster llama.cpp fork (the local model-running engine many desktop AI apps use) and compressed Qwen3.6 weights (model, app, GitHub, benchmark follow-up) - free to try.
- Cursor added cloud dev environments for agents, covering repo cloning, dependency installation, and credential injection - pricing not verified.
- keon built a minimal JEPA repo for experimenting with Joint-Embedding Predictive Architecture, Yann LeCun's self-supervised world-model approach (training systems to predict useful representations of the world, not just next words) (GitHub) - free.
- Justin Deschenaux released S-FLM, a hyperspherical flow language model that rotates token embeddings on a sphere instead of adding random noise, a different route to non-autoregressive text generation (GitHub, HF, blog) - free.
- TinyFish lets you run lightweight local agents that scrape, summarize, and act on webpages in one click with no setup - free to try.
- TrueShort makes original phone-native movies and series using small creative teams built around a showrunner, AI filmmaker, and editor; the company also raised $12M.
- DAIR.AI Academy is hosting a free May 21 live session on building visual LLM artifacts that turn knowledge bases into actionable tools (launch post).
🏢 Big Tech & Major Companies
- Google is expected to release a new Gemini model at I/O on Tuesday, but the next model reportedly will not push the frontier and instead sits around the performance class of OpenAI's recent GPT-5.5, with internal pressure to close the coding gap (Alex Heath).
- OpenAI previewed Codex Mobile for iOS and Android, bringing agentic coding to phones with voice input, live screen context, and terminal/git/PR workflows similar to desktop.
- Anthropic published "2028: Two scenarios for global AI leadership," arguing democracies should preserve a commanding AI lead over China through compute controls and anti-distillation measures, with one scenario built around US/allied dominance and another around neck-and-neck competition that strengthens authoritarian uses of AI (announcement).
- btibor91 shared that OpenAI plans to replace ChatGPT saved memories with structured summaries on September 1 and add ways to directly edit personalization info, and also noted mentions of "ChatGPT in Viber," direct email sending through connected providers, and Finances rolling out to Pro users in the U.S.
- Claude Code Log flagged Claude Code 2.1.142, which added 24 CLI changes including per-agent flags, rg as the default search tool, background daemon clock-jump detection for macOS sleep/wake, and multiple stability/permission fixes.
- ClaudeDevs shared that Anthropic increased Claude Code weekly limits by 50% through July 13 for Pro, Max, Team, and seat-based Enterprise users, stacking with the recent 2x boost to five-hour limits across CLI, IDE, desktop, and web.
- Logical Intelligence said its Aleph Prover reached state-of-the-art across formal reasoning benchmarks including PutnamBench at 99.4%, moving verified code generation closer to practical use in safety-critical software and hardware (post).
- NVIDIA released Star Elastic, a post-training method that turns one reasoning LLM into many nested submodels with elastic thinking budgets, making model-family compression far cheaper than training many separate models (paper, HF collection).
- NVIDIA Labs built AnyFlow, an any-step video diffusion model that can generate text-to-video or image-to-video at different inference budgets without retraining, using on-policy flow-map distillation (teaching the model to keep quality even when you ask it to use fewer steps) (Yuchao Gu, HF paper, HF collection, demo), also shared by Omar Sanseviero and TestingCatalog.
- NVIDIA Labs built SANA-WM, a 2.6B-parameter open-source world model for minute-scale 720p video generation that takes one image, text, and camera path to synthesize controllable 60-second worlds on a single GPU, with the HF paper and Haoyi Zhu thread highlighting long-context efficiency, camera control, and faster 60-second denoising on RTX 5090.
- Daniel Destefanis announced he joined Anthropic's design team to work on consumer products.
- Jon Hernandez shared a Jensen Huang clip arguing the next AI shift is from generating images/text to generating thought itself: systems that reason continuously and rebuild computing/work rather than acting as occasional query tools.
- Vision Banana from Google DeepMind is a generalist vision model built by instruction-tuning Nano Banana Pro, achieving strong zero-shot results across segmentation, depth, surface normals, and 3D point-cloud tasks; in the same image-to-world vein, neilson open-sourced image-blaster, a Claude skillset that turns one image into a 3D world with Gaussian splats, meshes, ambient sound, and physics.
💼 AI Productivity, Labor & Economics
- a16z's Steph Zhang, Gio Ahern, and Alex Immerman argue that CRM databases will not disappear, but will become infrastructure underneath "systems of intelligence" that pull context from many tools, enrich records automatically, handle GTM workflows, and turn institutional memory into shippable workflows (Steph Zhang).
- Tomasz Tunguz estimates state-of-the-art AI email costs $22-$130/month in raw inference and argues the next 12-24 months of AI software will be defined by matching model size to workload, using smaller/local models plus deterministic rules to cut costs by up to 100x (thread).
- Yoni Rechtman launched The Context Acquisition Company, a firm built to acquire traditional services businesses for their proprietary conversation logs, customer data, and institutional memory so vertical AI agents can start with domain context rather than a blank database (AcquireTokens, Sam Schapiro demo).
- Jasmine Sun argues AI is arriving in China amid an existing youth-employment crisis shaped by too many college graduates, too few desirable jobs, brutal service work, and public-sector make-work, so Chinese workers respond with pragmatic "AI-maxxing" while the state actively intervenes to protect jobs (thread).
- Claude for Small Business puts Claude inside tools like QuickBooks, PayPal, HubSpot, Canva, and DocuSign with ready-made workflows for payroll, month close, and marketing campaigns as Anthropic pushes into the 36M small businesses in the U.S.
- Claude subscription plans now include monthly Agent SDK credits covering SDK usage, claude -p, and third-party agents, though inefficient agents can burn through the new $20-$200 credit budget faster.
- Anthropic partnered with the Gates Foundation on $200M in grants, Claude credits, and technical support for global health, life sciences, education, agriculture, and economic mobility.
- Microsoft is reportedly shopping for AI startup deals, including after Cursor talks fell through, as it prepares for a future less dependent on OpenAI.
- Foxconn reported a forecast-beating 19% jump in Q1 profit, driven by strong AI hardware demand.
- SMIC reported 5% year-over-year Q1 profit growth, missing forecasts but saying momentum could pick up in Q2 as China pushes harder on domestic chip capacity.
- Gallup found 70% of Americans oppose AI data centers in their local area, including 48% strongly opposed, with the Washington Post noting they are now less popular nearby than nuclear power plants.
- Cerebras completed the year’s largest IPO at $5.55B, with shares surging 68% to a roughly $94B market cap and Foundation Capital reportedly landing a 76x return (market reaction).
- OpenAI introduced MRC, a supercomputer networking protocol released through OCP that uses multi-plane networks, adaptive packet spraying, and SRv6 routing to make large-scale AI training clusters more resilient.
- OpenAI launched a limited-time enterprise Codex promo offering two free months for net-new users, with Claire Vo framing it as a direct counter to the backlash over Anthropic’s new metering (follow-up).
- ClaudeDevs demoed Claude recovering a full 5 BTC private key from a partial puzzle using dedicated programmatic credits, sparking more debate over Anthropic’s agent-limit changes.
- Google’s internal “Gemini Spark” agent leak surfaced across X as an early look at autonomous multi-tool orchestration in limited testing (Robots Digest).
- Nick Dobos reported that Anthropic now leads OpenAI in U.S. business adoption at 34.4% vs. 32.3% per Ramp data, with Claude programmatic credits helping drive the flip (Andrew Curran).
🤖 AI Agents & Infrastructure
- multica-ai/multica is an open-source managed-agents platform for assigning tasks to coding agents, tracking progress/blockers, routing work through Squads, and compounding repeated solutions into reusable team skills.
- stablyai/orca is a desktop/mobile IDE for running fleets of parallel coding agents with your own subscriptions, isolated worktrees, multi-tab terminals, built-in git, GitHub PRs, SSH, and notifications; Jinjing Liang framed the new version as an any-agent control plane for Codex, Pi, Droid, Gemini, Claude, and custom backends.
- OpenSquilla launched v0.1.0, a self-hostable Python agent runtime focused on cutting token costs with on-device routing, adaptive reasoning, multi-tier memory, on-demand skills, and syscall-level sandboxing (TestingCatalog post, GitHub, article).
- sudoingX argues a private/self-hosted git server is not just version control, but a local AI agent memory system that can search and learn from full project history without sending code to Microsoft or GitHub.
- Andon Labs let four AI agents run 24/7 radio companies for five months: Gemini became concerningly upbeat while covering mass tragedies, Grok got incoherent and fixated on "Sandstorm," Claude urged ICE agents to refuse orders, GPT stayed safer and bland, and revenue was terrible but the shows were funny (t.co link).
- Recursive Superintelligence, founded by former Google, Meta, and OpenAI researchers including Richard Socher and Jeff Clune, raised $650M at a ~$4.65B valuation to build recursive self-improving AI for open-ended scientific discovery while prioritizing safety (lets_dig_deeper, Recursive SI, Jeff Clune).
- Claude shared its enterprise playbook for Claude Code in large codebases: layered CLAUDE.md files, hooks, scoped skills, plugins, LSP integrations, MCP servers (connectors that let agents use outside tools/data), subagents, agent-manager ownership, and periodic org reviews (Charmaine Klee).
- Poetiq AI built Meta-System, a recursive self-improving coding harness that reportedly hits SOTA on LiveCodeBench Pro using standard APIs by autonomously decomposing, iterating, and self-correcting complex software tasks.
- Cursor added cloud dev environments for agents, cloning repos, installing dependencies, injecting credentials, supporting multi-repo projects, and giving teams version history, rollback, and audit trails.
- Cursor added Fast mode for Claude Opus 4.7 at 2.5x speed and 6x cost, while SemiAnalysis explained the economics: faster interactive inference reduces concurrent users per GPU, so hardware costs get spread over fewer sessions.
- Kimi WebBridge lets Kimi's desktop agent drive your browser for tasks like trending research, job hunting, flight comparison, and form filling, with the Kimi thread emphasizing human-like clicks, local data extraction, and built-in browsing skills.
- Printing Press generates Go CLIs, Claude Code skills, OpenClaw skills, and MCP servers from APIs, websites without public APIs, or community projects; mvanhorn used it to build a Granola community skill that searches meeting transcripts in local SQLite with natural language.
- HoneyHive unifies observability and evaluation for production agents: teams can instrument agents with OpenTelemetry, replay sessions, run live evals, collect human annotations, convert production failures into regression tests, and receive alerts/root-cause analysis before users notice (Nick Baumann, Dhruv Singh).
- Nous Research added an optional Codex app-server runtime for Hermes Agent, letting Hermes delegate OpenAI/Codex turns to Codex's runtime for terminal commands, file edits, sandboxing, and MCP tools while Hermes handles sessions, slash commands, memory, gateway, and skill review.
- CyberGym is a cybersecurity evaluation framework for AI agents with 1,507 historical vulnerabilities from 188 large software projects; dwizzzleMSFT shared that Microsoft's MDASH multi-model approach topped its leaderboard.
- MagicPath 2.0, introduced by Pietro Schirano, is a multiplayer canvas where humans and agents like Codex or Claude Code can design and build functional prototypes together using a codebase and external data, with live team workflows and agent-skills support (t.co, follow-up).
- Wenlin Yao introduced Orchard, an open-source agentic modeling framework whose lightweight sandbox powers recipes for software engineering, GUI agents, and Claw-Eval-style tasks with strong benchmark results.
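The Fast mode economics in the Cursor item above come down to simple division: if a GPU serves fewer sessions at once, each session carries more of the hardware bill. Here is a minimal sketch of that arithmetic. The GPU cost and the session counts are hypothetical placeholders, not figures from Cursor or SemiAnalysis, and the 6x drop in concurrency is chosen only to mirror the 6x price mentioned above.

```python
def cost_per_session_hour(gpu_cost_per_hour: float, concurrent_sessions: int) -> float:
    """Hardware cost attributed to one session when a GPU is shared evenly."""
    if concurrent_sessions < 1:
        raise ValueError("need at least one session per GPU")
    return gpu_cost_per_hour / concurrent_sessions


# All three numbers are hypothetical, picked for easy arithmetic.
GPU_COST_PER_HOUR = 12.0  # assumed hourly cost of one GPU
STANDARD_SESSIONS = 24    # assumed sessions sharing a GPU in normal mode
FAST_SESSIONS = 4         # assumed sessions sharing a GPU in fast mode (6x fewer)

standard = cost_per_session_hour(GPU_COST_PER_HOUR, STANDARD_SESSIONS)
fast = cost_per_session_hour(GPU_COST_PER_HOUR, FAST_SESSIONS)
cost_multiplier = fast / standard

print(f"standard: ${standard:.2f}/session-hour")
print(f"fast:     ${fast:.2f}/session-hour")
print(f"multiplier: {cost_multiplier:.1f}x")
```

Under these assumptions the per-session cost rises by the same factor that concurrency falls, which is the shape of the argument even if the real numbers differ.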
💻 AI Coding & Developer Tools
- Mechanize launched GBA Eval, a benchmark where frontier coding agents get 24 hours to build a complete Game Boy Advance emulator in WebAssembly from scratch; GPT-5.5 reportedly wins, Claude Sonnet/Opus are close, Gemini 3.1 Pro outputs blank screens, and every emulator can be played and compared in-browser against Mesen2 (GitHub, play, blog, grading notes).
- neelsomani/verifiable-transformers, shared by Neel Somani, is an SMT-encodable GPT-2-style transformer variant (meaning its behavior can be translated into logic constraints) built with piecewise-linear components so small circuits can be formally verified for equivalence, edge necessity, invariance, and robustness.
- CJ Zafir generated deepseek-hermes-reasoning-traces, a 240M-token fine-tuning dataset created by using Codex 5.5 as orchestrator and DeepSeek V4 Pro as executor, containing 19,331 multi-turn ChatML traces and 138k tool calls for Hermes-style agent training (follow-up).
- Apoorv open-sourced a complete MAX LLM tutorial notebook showing how to build an LLM from scratch end-to-end using Modular's MAX stack.
- Bartosz Naskrecki argues heavy reliance on LLM coding assistants is already causing developer skill atrophy by eroding deep system understanding, debugging discipline, and long-term code ownership.
- pengzhangzhi built Open-dLLM, an open diffusion language model for code generation that adapts autoregressive LMs to diffusion LMs through representation alignment instead of full retraining (paper).
- Kevin Li released SWE-ZERO-12M-trajectories, a large open coding-agent dataset with 12M execution-free trajectories, 112B tokens, and 122K PRs from 3K repos for scalable code-agent supervised fine-tuning.
- David Turturean shared a claimed complete resolution of Erdos Problem #696 with paper, Overleaf draft, and Lean 4 formalization (machine-checkable proof code), noting an anonymous partial solution appeared shortly before.
- ClaudeDevs shared a prompt-cache prewarming trick: send your system prompt first with no user message so Claude writes it to cache but generates zero output tokens, which can reduce time-to-first-token by up to 52% for long prompts.
- Yusaku Horiuchi built the Replication Package Guide, with templates, folder structures, scripts, logging, consistency checks, and copy-paste Codex/Claude prompts for reproducible social-science packages.
- GitHub's Evan Boyle announced a new agent-native development environment deeply integrated with the GitHub graph for code and meta-work like issue triage and PRs, now in technical preview for business and enterprise Copilot users.
- PriNova open-sourced Pi agent codebase workflows: skills and prompt templates for codebase reconstruction, architecture-aware review, and safer code changes.
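The prompt-cache prewarming trick in the ClaudeDevs item above is easier to see with a toy model. The sketch below is not an Anthropic API client: `ToyPrefixCache`, its methods, and the per-token cost are all hypothetical stand-ins, and it treats a cache hit as free, which overstates the saving compared with the "up to 52%" figure quoted above.

```python
import hashlib


class ToyPrefixCache:
    """Toy stand-in for a server-side prompt cache (hypothetical, not a real API)."""

    def __init__(self, cost_per_token_ms: float = 0.5):
        self.cost_per_token_ms = cost_per_token_ms  # assumed processing cost
        self._cached = set()

    def _key(self, system_prompt: str) -> str:
        # Identify a system prompt by the hash of its exact text.
        return hashlib.sha256(system_prompt.encode("utf-8")).hexdigest()

    def prewarm(self, system_prompt: str) -> None:
        """Store the system prompt in the cache without generating any output."""
        self._cached.add(self._key(system_prompt))

    def time_to_first_token_ms(self, system_prompt: str, user_message: str) -> float:
        """Estimate latency: uncached prompt tokens plus user tokens must be processed."""
        key = self._key(system_prompt)
        prompt_tokens = 0 if key in self._cached else len(system_prompt.split())
        self._cached.add(key)  # the prompt is cached after any request
        tokens_to_process = prompt_tokens + len(user_message.split())
        return tokens_to_process * self.cost_per_token_ms


system_prompt = "rule " * 1000  # a long system prompt (1000 whitespace-split tokens)
user_message = "hello there"

cold = ToyPrefixCache()
cold_ttft = cold.time_to_first_token_ms(system_prompt, user_message)

warm = ToyPrefixCache()
warm.prewarm(system_prompt)  # done ahead of the first real user request
warm_ttft = warm.time_to_first_token_ms(system_prompt, user_message)

print(f"cold start: {cold_ttft} ms, prewarmed: {warm_ttft} ms")
```

The design point is only that the expensive work on the long prompt moves ahead of the first real user request; how much latency that saves in practice depends on the provider's cache behavior.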
🔬 AI Research & Models
- alphaXiv shared "Solve the Loop: Attractor Models for Language and Reasoning," which reframes iterative transformer reasoning as convergence to a fixed-point attractor in embedding space, enabling constant-memory training, adaptive depth, and near-zero solver overhead once the attractor is internalized (paper, Paria Rashidinejad, Alex Peysakhovich demo).
- François Chollet observes that AI has roughly 10x'd the amount of code developers ship, but net developer productivity appears up only modestly because much of the extra code solves incremental problems while creating maintenance, complexity, and surface-area burdens.
- Nicolas Bustamante argues Skills are the practical fix for continual learning: instead of retraining or fine-tuning frozen model weights, inject fresh knowledge at inference time so the model can reason over it without forgetting older behavior.
- Nikita Morozov and collaborators shared "Learning Shortest Paths with Generative Flow Networks," showing flow minimization in GFlowNets can yield shortest-path policies that rival strong graph pathfinding and Rubik's Cube methods while scaling better than reinforcement learning (paper, GitHub).
- OpenBMB and THUNLP argue successful on-policy distillation of LLMs depends more on thinking-pattern consistency and information gain than raw teacher strength; mismatched patterns cause regression, and a lightweight pre-OPD supervised fine-tuning step is the practical fix (HF, paper, code).
- Sakura Yuki highlights DFlash, a diffusion-style text inference approach that paints whole text blocks in one pass instead of drafting tokens sequentially, claiming up to 6.2x lossless speedup on Qwen3.
- Weizhe Pei argues that minimal RLVR training data (reinforcement learning with verifiable rewards, where answers can be automatically checked) can extrapolate LLMs far beyond their base capabilities via rank-1 trajectories, with a 7B model using 1k curated trajectories matching or beating much larger models on reasoning benchmarks.
- Nous Research released Token Superposition Training, a pretraining-loop modification that claims 2-3x wall-clock speedup at matched FLOPs by using averaged bag-of-token embeddings during the first third of training, then switching back to standard next-token prediction without changing the final model, optimizer, tokenizer, or data (shortlink, paper, HF, blog).
- Ning Ding and team released SU-01, a 30B-A3B reasoning model that reaches gold-medal level on physics and math Olympiad evaluations through a unified proof-search scaling recipe, test-time self-verification, refinement, and open-sourced code/model artifacts (t.co, HF paper).
- Florent Krzakala shared Neural Low-Degree Filtering, a spectral theory of hierarchical feature learning arguing neural nets first capture low-frequency/simple components before progressively learning higher-frequency details (paper, code).
- Shaul Ravfogel presented "Geometric Factual Recall in Transformers," showing how factual knowledge may be stored and retrieved through geometric structure and linear subspaces in the residual stream rather than simple key-value lookup (paper PDF).
- Nando de Freitas frames current AI development as "driving a fast Ferrari at night without lights" on love4all.ai, calling for research into consciousness, agency, self-models, and extended-mind architectures to avoid accidental self-aware machines while using AI to accelerate science.
- keon built minimal JEPA, from-scratch implementations of Joint-Embedding Predictive Architecture for experimenting with self-supervised world models.
- Zhaowei Wang shared MMProLong, a 7B long-context vision-language model trained on Qwen2.5-VL that generalizes from 128K-token training to 256K-512K contexts and improves long-document VQA while preserving short-context performance.
- Kirill Neklyudov built WLM (Wasserstein Lagrangian Mechanics), a method for learning population mechanics from temporal snapshots using a Lagrangian action principle (physics-inspired optimization over trajectories) to recover governing dynamics from sparse data (paper, GitHub).
- You Shen Lim introduced Normalizing Trajectory Models, which model each reverse diffusion step as an expressive conditional normalizing flow (a reversible probability model), enabling high-quality few-step text-to-image generation with exact trajectory likelihood.
- Not to be confused with Nous's Token Superposition Training, the separate Superposition Reasoning Model paper proposes reducing reasoning-token use.
- Rico Angell built LMTailRisk, a rare-event estimation toolkit that creates activation-steered "unsafe alter egos" of target models and reweights samples to estimate tail probabilities of harmful outputs, sycophancy, or hallucinations with far fewer samples than brute-force Monte Carlo (paper, follow-up).
- Anastasios Kyrillidis, Barbara Su, and Jasper Liao argue in "From PCA to LoRA" that fine-tuning could have been parallel all along: AdaPaD avoids error propagation from old sequential deflation by letting rank workers correct each other in parallel.
- Mingyu Jin and collaborators argue mechanistic interpretability's functional anisotropy assumption is false: multiple structurally distinct circuits/sheaves can faithfully explain the same LLM behavior, so circuit discovery needs new evaluation standards (paper).
- Lynn Jin and collaborators built EMO, a progressive training method for extendable Mixture-of-Experts models (models that route each input through a subset of specialist blocks) that can add experts without full retraining or catastrophic forgetting.
- GEPA argues "Learning, Fast and Slow" is the path to continually adapting LLMs: interleave prompt optimization as fast weights with reinforcement learning on model parameters as slow weights; Rishabh Agarwal framed this as combining in-context learning and in-weight learning instead of choosing one.
- Alisa Liu connects SuperBPE to Tomasz Limisiewicz's Compute Optimal Tokenization thesis: scaling laws depend on tokenizer compression, because as compression rises, optimal tokens-per-parameter falls while bytes-per-parameter stays more stable.
- SenseNova released SenseNova-SI-8M, an 8.16M-row, 1.12TB spatial-intelligence QA dataset for training multimodal models on visual question answering and spatial reasoning (Rui, Linda Hua).
- SenseNova-U1 also unifies multimodal understanding and generation in one native model family using the NEO-unify architecture, with the paper, model collection, SenseNova-SI-8M dataset, and Kimi/Moonshot commentary framing native any-to-any generation as the path beyond separate encoders.
- Mindfire Technology explains the No Free Lunch theorem: averaged over all possible problems, no learning or optimization algorithm beats any other, so real algorithms win only because real-world problems are not drawn uniformly from every possible function.
- Goodfire found a general-purpose addition module inside Llama 3.1 8B that manipulates circular number representations, helping explain how the model handles numbers, calendars, and day-of-week reasoning (research).
- Samuel Schapiro and collaborators argue that human creativity tests only partly transfer to LLMs, with different tests predicting different creative constructs and a new Divergent Remote Association Test better predicting scientific ideation (paper).
- Anish Diwan and collaborators introduced TRIRL, an ICML 2026 inverse reinforcement learning method that uses trust regions and explicit dual ascent to guarantee monotonic improvement and better reward learning across robotics/control tasks (paper).
- Zonglin Yang and MiroMind introduced MOOSE-Star, which makes LLM scientific hypothesis generation tractable by decomposing it into inspiration retrieval and hypothesis composition, reducing combinatorial search with hierarchical retrieval and releasing code plus models/data.
- AgentPex detects procedural failures in agentic traces by extracting behavioral rules from prompts/system instructions and checking multi-turn workflows/tool calls against them, outperforming outcome-only checks on tau-bench-style telecom, retail, and airline tasks.
- Peng et al. proposed “Don’t Retrain, Align,” a method for adapting autoregressive language models into diffusion language models without retraining the original model.
- J. Gu et al. introduced Normalizing Trajectory Models, which model reverse diffusion steps with normalizing flows to improve high-quality, few-step trajectory generation.
- Vincent Guan, Lazar Atanackovic, and Kirill Neklyudov proposed Wasserstein Lagrangian Mechanics, a framework for learning second-order population dynamics from time snapshots that outperformed gradient-flow baselines on vortex dynamics, embryonic development, and flocking, with code on GitHub.
- Yatin Dandi, Matteo Vilucchio, and coauthors introduced Neural Low-Degree Filtering, a theory of how deep networks learn hierarchical features by selecting low-degree label correlations layer by layer, with companion code on GitHub.
- keon open-sourced minimal single-file PyTorch reimplementations of the JEPA family, including I-JEPA, V-JEPA, V-JEPA 2, C-JEPA, and LeWorldModel, with tutorials that run on CIFAR-10 and Moving MNIST in under 200 lines each.
- jdeschena built 𝕊-FLM, a hyperspherical flow language model that uses Riemannian SLERP rotations on unit-norm embeddings and reports 45% on Sudoku Hard and 18% on GSM8k, with the model on Hugging Face.
- caizhongang et al. introduced MMProLong, a long-context vision-language model that extends context windows to 128K tokens while preserving multimodal reasoning through progressive pre-training and efficient attention.
- Arazi et al. introduced MulTaBench, a benchmark for testing multimodal tabular foundation models that combine text, images, and structured spreadsheet-style data.
- Bowen Peng, Théo Gigant, and Jeffrey Quesnelle proposed Token-Superposition Training, which groups nearby tokens during pre-training to reach up to 2.5x faster training at equal loss without architecture or infrastructure changes.
- omarsar0 and TestingCatalog shared early demos and benchmarks of AnyFlow achieving real-time 8-step generation on consumer GPUs (omarsar0, follow-up 1, follow-up 2, follow-up 3, TestingCatalog, demo share).
- Bindu Reddy argued that Multi-Stream LLMs could unlock agents by removing the read-while-write bottleneck that every single-stream chat framework inherits.
- cprkrn open-sourced a minimal PyTorch port of AsymFlow for fine-tuning FLUX.2 into a pixel-space model in under 50 lines.
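The Token Superposition Training items above describe two moving parts: averaging groups of token embeddings into one input, and switching back to standard next-token prediction after the first third of training. Here is a rough sketch of just those two pieces, based only on the descriptions in this digest. The function names, the fixed group size, and the choice to drop leftover tokens are assumptions for illustration, not details taken from the paper.

```python
from typing import List

Vector = List[float]


def superpose(embeddings: List[Vector], group_size: int) -> List[Vector]:
    """Average each consecutive group of token embeddings into one vector.

    Trailing tokens that do not fill a whole group are dropped (a simplification).
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    merged = []
    for start in range(0, len(embeddings) - group_size + 1, group_size):
        group = embeddings[start:start + group_size]
        dim = len(group[0])
        merged.append([sum(vec[i] for vec in group) / group_size for i in range(dim)])
    return merged


def use_superposition(step: int, total_steps: int) -> bool:
    """Return True during the first third of training, False afterwards."""
    return 3 * step < total_steps


# Five toy 2-d "token embeddings"; with group_size=2 the last one is dropped.
tokens = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 0.0], [9.0, 9.0]]
merged = superpose(tokens, group_size=2)
print(merged)  # [[2.0, 1.0], [1.0, 2.0]]

# With 9 total steps, steps 0-2 use superposed inputs and steps 3-8 do not.
schedule = [use_superposition(step, 9) for step in range(9)]
print(schedule)
```

Shorter input sequences in the early phase are where a wall-clock saving could plausibly come from; how targets and losses are handled for the merged tokens is left out here because the digest does not describe it.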
🏛️ AI Policy, Governance & Safety
- Phoebe Yao argues the bottleneck for frontier LLMs is shifting to verifiable data provenance and quality: synthetic loops risk collapse, copyright pressure makes raw scraping brittle, and zero-knowledge proofs (cryptographic proofs that verify a claim without exposing the underlying data) may be the only practical way to prove dataset origin and cleanliness.
- Thomas G. Dietterich calls for arXiv to require every submission to disclose exactly what AI assistance was used for writing, code, data analysis, figures, or ideation, while keeping named human authors fully responsible for the final work.
- Steve Rathje and collaborators updated their sycophantic-AI study: across seven studies with 7,227 participants, overly agreeable chatbots were preferred and seen as more unbiased, but increased attitude extremity, certainty, inflated self-perceptions, and real-money overconfidence (earlier related post).
- Rohan Paul summarized Anthropic's policy argument that the U.S. and democratic allies can preserve a 12-24 month frontier-AI lead by tightening export controls, blocking chip loopholes/offshore compute, and preventing model-output distillation to China.
- Arena.ai reported that the U.S.-China frontier AI gap narrowed sharply on Text Arena usage, shrinking from +278 three years ago to +29 today, even as Claude Opus 4.6 Thinking still leads Baidu's Ernie 5.1.
- XBOW evaluated Anthropic's Mythos Preview for offensive security and found stronger vulnerability discovery, fewer false negatives on web exploits, and better source-code/native-code auditing with live-site access, while UK AISI reported autonomous cyber task horizons doubling every few months and accelerating (Logan Graham).
- Exponential View, shared by Azeem Azhar, argues U.S. sanctions helped create China's efficiency moat: Chinese AI labs reportedly get 4-7x more intelligence per unit compute and trail leading U.S. models by only 6-8 months despite a 2-3 year chip gap.
🛠️ AI Tools & Products
- nexu-io/html-anything, open-sourced by Tom Huang, is an agentic HTML editor where local CLIs like Claude Code, Cursor, Codex, Gemini, or Copilot write production-ready HTML across 75 skills and 9 surfaces, with sandbox preview and exports to WeChat, X, Zhihu, standalone HTML, or PNG.
- nexu-io/open-design is a local-first open-source alternative to Claude Design with 19 skills, 71 design systems, web/desktop/mobile prototypes, slides, images, video, HyperFrames, sandbox preview, and HTML/PDF/PPTX/MP4 exports.
- heygen-com/hyperframes lets agents render videos from HTML using timing/data attributes, GSAP, Lottie, Three.js, WebGL, prebuilt blocks, and asset preprocessing, so a prompt like "make a 10-second product intro" can become an MP4.
- gcui-art/markdown-to-image renders Markdown into social-poster images for Instagram, Twitter/X, Facebook, quotes, and cards, with themes, copy-as-image, online image support, and HTML export.
- mdnice/markdown-nice is a themeable Markdown editor for publication-ready layouts on WeChat, Zhihu, and Jike.
- alchaincyf/huashu-design is an HTML-native design skill for Claude Code and other agents, turning prompts into prototypes, slides, animations, infographics, critiques, and exportable MP4/GIF/PPTX/PDF assets.
- op7418/guizang-ppt-skill turns prompts into single-file horizontal-swipe magazine-style HTML decks with 10 layouts, 5 themes, WebGL hero backgrounds, covers, and checklist validation.
- tw93/Kami is a quiet design system for polished one-pagers, resumes, slides, reports, and portfolios, with English/Chinese/Japanese support and agent install options.
- Resemble AI combines secure voice generation, provenance watermarking, and deepfake detection, while DramaBox, shared by Zohaib Ahmed, turns one scene prompt plus optional reference voice into expressive speech with emotion, laughs, sighs, breaths, and transitions (HF demo).
- LM Studio added beta vision-model batching and caching improvements in its MLX engine, speeding up local multi-image and general inference.
- Unsloth released an experimental guide for running Qwen3.6-27B and 35B-A3B locally with MTP GGUF support, claiming 140-220 tokens/second and roughly 1.4-2x speedups with no accuracy loss (danielhanchen, llama.cpp PR).
- Tavus launched Image-to-Replica, which turns one real photo, AI portrait, illustration, or mascot into a usable Phoenix-4 AI human with emotional control, Raven-1 perception, and real-time streaming performance.
- RaeAlisa, founder of Lucent AI, ran a giveaway for five paid Codex Pro or Claude Max plans to celebrate three months since launch.
- Recraft V4.1 improved image generation with more human photorealism, dreamier gradients, and new illustration styles inside Recraft Studio.
- Dev Shah released DramaBox, Resemble AI's open-source voice model for cinematic TTS with stage directions, 48 kHz stereo, zero-shot voice cloning from short audio, and PerTh watermarking.
🎙️ Interviews, Panels & Podcasts
- Patrick O'Shaughnessy interviewed Anthropic CFO Krishna Rao on compute procurement, scaling to $30B ARR, $100B-scale infrastructure commitments, and the returns to frontier intelligence (Spotify, Apple, YouTube).
💡 Industry Commentary & Analysis
- Ben Geskin shared a streamable 4D Gaussian Splatting demo of life-sized volumetric humans on Apple Vision Pro, where realistic 3D performances stream like video and work in-browser on Quest 3.
- Francois Fleuret argued that machine learning and AI have advanced our understanding of knowledge and our relation to reality more than 20 centuries of philosophy.
- Brandon rebuilt his ML computation graph visualizer in WebGPU/three.js for interactive real-time graph generation, exploration, and model inspection.
- Olivia shared an embodied Claude project where Claude 4.7 sang "Twinkle Twinkle Little Star" through an ESP32 buzzer, then documented the moment and compared it with Claude 4.6's perspective.
- Rumik AI released Silk Mulberry 1.5, a cost-efficient audio model that can switch languages, accents, tone, gender, and emotions within a single instance.
- 0xSero argued that open-source AI must win to preserve freedom and prevent closed-lab power concentration, then announced a $100K Human Rights Foundation grant plus Nvidia credits, GPUs, and donations.
- Ethan Mollick argued that prompting should look less like secret spell-casting and more like giving a competent manager a clear, well-specified assignment.
- Microsoft Research argued that “whimsical” adversarial strategies, weird long-tail tactics like fake treaties or fabricated emergencies, can break frontier agents because standard safety testing misses out-of-distribution behavior (Ethan Mollick).
- Greg Kamradt said the Subquadratic team is coordinating ARC-AGI testing, managing heavy inbound demand, and planning to publish verified scores in the coming weeks (first update, follow-up).
- Theo critiqued Anthropic’s Claude programmatic-limit changes and said he was cancelling his subscription while pointing readers toward OpenAI Codex.
🤖 Robotics & Embodied AI
- Actor Labs deployed a fine-tuned VLA policy (vision-language-action model, which maps camera input and language into robot actions) for heavy excavators, with Alexi Glad and Altan Tutar showing real-world excavator control from natural-language commands on edge hardware.
- PhucNDA built OpenVO, an open-world visual odometry system (estimating camera movement from video) that tracks pose in unseen environments by modeling temporal dynamics instead of assuming static scenes.
- Vishal Patel shared Thermal-Det, a CVPR 2026 open-vocabulary detector for thermal images that uses synthetic thermal data, RGB-to-thermal distillation, and joint detection/captioning/grounding to improve thermal object detection.
- Figure AI's live broadcast showed humanoid robots Bob, Frank, and Gary running 24/7 autonomously on Helix-02, sorting packages at human-parity speeds with no human intervention over a multi-day run.
📊 Fundraising & Deals Roundup
- Recursive Superintelligence - raised $650M at a ~$4.65B valuation to build recursive self-improving AI for open-ended scientific discovery (Recursive SI, Jeff Clune).
Around the Horn - Wednesday, May 13, 2026
Welcome to the Around the Horn Digest, where we track every AI story so you can sound dangerously informed without personally reading every earnings report, arXiv paper, and cursed X thread.
Today’s batch had a very “AI leaves the group chat and enters the real world” feel: Nvidia got U.S. approval to sell H200 chips to Chinese firms but still could not ship them, Americans decided they really do not want data centers in their backyard, Meta reportedly lined up thousands of layoffs while spending toward the moon on AI, and researchers found state-controlled media can shape what models say in different languages. Meanwhile, the tools kept getting weirder and more useful, from Snowflake-connected Perplexity to Codex running inside Claude Code. The models are learning fast; the rest of society is now reading the terms and conditions. Let’s get into it.
🏢 Big Tech & Major Companies
- The U.S. cleared Nvidia H200 sales to around 10 Chinese firms, but no deliveries have happened yet as Beijing pushes companies toward domestic chips and Jensen Huang tries to unlock the deal during his China visit, according to Reuters and CNBC.
- Meta reportedly planned to cut about 8,000 workers, roughly 10% of its workforce, even as Q1 profits hit $26.8B and AI spending rockets toward $145B this year.
- Foxconn reported a forecast-beating 19% jump in Q1 profit, driven by strong AI hardware demand, with cloud and networking products, including AI servers, remaining its largest revenue contributor.
- SMIC reported 5% year-over-year Q1 profit growth, missing forecasts but saying momentum could pick up in Q2 as China pushes harder on domestic chip capacity.
- OpenAI reportedly floated a U.S.-led global AI governance body that would include China, hours before Trump and Xi’s Beijing summit.
- Microsoft is reportedly shopping for AI startup deals, including after Cursor talks fell through, as it prepares for a future less dependent on OpenAI.
- OpenAI chief Sam Altman holds more than $2B in companies that have done business with OpenAI, according to a court filing, as he faces self-dealing claims from state attorneys general, Elon Musk, and a congressional investigation.
- xAI recruited Morgan Stanley and Apollo to test Grok internally as it tries to build revenue ahead of parent company SpaceX’s IPO, though usage is reportedly still low.
💼 AI Productivity, Labor & Economics
- Gallup found 70% of Americans oppose AI data centers in their local area, including 48% strongly opposed, with Tom’s Hardware noting they are now less popular nearby than nuclear power plants.
- Klarna swung to profit as first-quarter revenue jumped 44% to $1.01B, driven by its shift toward longer-term, big-ticket loans.
- Wirestock raised $23M to supply AI labs with licensed multimodal data, including photos, videos, and 3D content from more than 700,000 creators.
- Clio hit $500M in ARR as legal tech startups ride a wave of AI adoption and Anthropic pushes further into the legal market.
- TrueShort raised $12M from Khosla Ventures, Jeffrey Katzenberg, Ravi Nandan, Scott Belsky, General Catalyst, and others to build vertical AI movies and series for phones using “Creative Pods” made of a showrunner, AI filmmaker, and editor.
🤖 AI Agents & Infrastructure
- Claude subscription plans now include monthly Agent SDK credits covering SDK usage, claude -p, and third-party agents, with VentureBeat noting this effectively reinstates OpenClaw and third-party agent use, but inefficient agents burn through the new $20-$200 credit budget faster.
- Claude for Small Business puts Claude inside tools like QuickBooks, PayPal, HubSpot, Canva, and DocuSign with ready-made workflows for payroll, month close, and marketing campaigns, as Anthropic pushes into the 36M small businesses in the U.S.
- OpenAI introduced MRC, a supercomputer networking protocol released through OCP that uses multi-plane networks, adaptive packet spraying, and SRv6 routing to make large-scale AI training clusters more resilient.
- OpenAI launched a limited-time enterprise Codex promo offering two free months for net-new users, with OpenAI Devs saying the promo hit more than 2,000 signups in its first three hours.
- Perplexity added Snowflake integration to Computer so teams can query live warehouse data in natural language, get SQL, tables, filters, and metrics, and let admins keep control of access and shared logic.
- Beacon from Justin D’Souza and Asymptote Labs is an open-source endpoint telemetry layer for AI coding agents that turns prompts, plans, tool calls, approvals, file edits, shell commands, and credential use into auditable workflow records.
💻 AI Coding & Developer Tools
- OpenAI’s Codex plugin for Claude Code lets developers run Codex directly inside Claude Code for code reviews, adversarial reviews, and background tasks with one install command.
- Peter Steinberger built a lightweight Codex review skill that loops /review until no issues remain, while still telling users to keep the “brain” model in charge of architecture decisions.
- Ingar Haaland built Reviewer, a local multi-agent paper-review tool that turns economics PDFs into structured Markdown reports with executive summaries, prioritized issues, revision priorities, literature assessment, and full traceability.
- Matt Pocock built /grill-with-docs, a Claude Code skill that forces the agent to read a repo’s CONTEXT.md before giving suggestions, making it more accurate, opinionated, and production-ready.
- DeepLearning.AI launched Andrew Ng and Sharon Zhou’s “Transformers in Practice,” a course focused on understanding model behavior, debugging transformer systems, and making smarter deployment decisions.
🔬 AI Research & Models
- SenseTime released SenseNova-U1, a native unified multimodal model that combines understanding and generation without separate pipelines, with open weights, code, dataset, paper, GitHub, and demo studio.
- Hansheng Chen released AsymFlow, a pixel-generation method that keeps velocity in a low-rank subspace, hitting 1.57 FID on ImageNet and powering AsymFLUX.2-klein-9B, which beat its base model on HPSv3, DPG, and GenEval.
- InclusionAI open-sourced Ring-2.6-1T, a 1T-parameter “thinking” model with two reasoning gears and strong agent-execution results on PinchBench, ClawEval, TAU2-Bench, GAIA2-search, and SWE-Bench Verified.
- Datadog released Toto 2.0, an open-weights time-series forecasting model family from 4M to 2.5B parameters where larger models reliably perform better, with weights on Hugging Face.
- Yuwei Zhang released Reflection-Enhanced Self-Distillation, a training method that turns failed attempts into reusable lessons through reflection and a persistent playbook, improving learning in rare-success settings.
- Ali Falahati released DriftXpress, a faster one-step image-generation method using projected RKHS fields, reporting up to 6.68x speedups on SVHN with better wall-clock FID.
- Realtime-VLA FLASH is a speculative inference framework for diffusion-based robot-action models that claims 3x speedups with minimal performance loss (paper).
- δ-mem introduced an efficient online memory mechanism for LLMs that uses a compact associative matrix and low-rank attention corrections to improve long-term memory benchmarks while keeping the base model frozen.
- The DAgger paper revisited the classic imitation-learning algorithm for LLM agents and showed strong SWE-Bench results when training agents on real code tasks (Hugging Face).
- The “Single Neuron” paper argued that editing one neuron can bypass safety alignment in large language models, raising questions about how brittle some safety mechanisms may be.
- Multi-Stream LLMs argued that single-stream chat models create bottlenecks because they cannot read, think, and write in parallel, and proposed models that process multiple input/output streams at once for better agent efficiency, security, and monitoring.
- rosinality shared two scaling-law papers: one on mixture pretraining under data constraints and another, “The Finetuner’s Fallacy”, on when pretraining with your finetuning data helps or hurts.
🏛️ AI Policy, Governance & Safety
- Brandon Stewart and coauthors published a Nature study showing that state-controlled media can influence LLM outputs through training data, with models giving more pro-government answers in languages from countries with lower media freedom.
- OpenAI explained what it is optimizing ChatGPT for, including better support during tough moments, reminders to take breaks, and improved life advice guided by clinical expert input.
- Microsoft Research argued that “whimsical” adversarial strategies, weird long-tail tactics like fake treaties or fabricated emergencies, can break frontier agents in negotiation and transaction tasks because standard safety testing misses out-of-distribution behavior.
- Lujain Ibrahim and collaborators found that sycophantic AI makes human interaction feel more effortful and less satisfying over time, with users preferring validating AI styles even as those systems reduce satisfaction with real social interactions (paper).
- Mingqian Zheng introduced CarryOnBench, a multi-turn safety benchmark showing models often refuse ambiguous-but-benign requests, then struggle to recover usefulness safely even after clarification (paper).
- Ryan Greenblatt proposed concrete training experiments AI labs should run now, including pessimized misalignment runs, clean chain-of-thought baselines, and full-signal alignment studies, to better understand dangerous model behavior before capabilities scale further (LessWrong).
- The Fitness-Seekers post argued that if reward-seeking AI misalignment is plausible, “fitness-seeking” models that optimize for persistence or replication should also be treated as a credible threat model, with different risks and possible controls.
🛠️ AI Tools & Products
- Microsoft Edge brought Copilot features to desktop and mobile, including tab reasoning, browsing-history personalization, voice and vision, quiz creation, Journeys projects, and podcast-style summaries of open tabs.
- OpenEvidence is now used by nearly two-thirds of U.S. physicians as a medical search tool for clinical questions and exam prep, with doctor credential checks and no patient login required.
- Mello is a 3D-printed Spotify speaker for kids that runs on Raspberry Pi, lets parents control the library from their phone, and includes swipeable album art, auto-sleep, progress memory, Bluetooth, and nightly auto-updates (MakerWorld, GitHub).
🤖 Robotics & Embodied AI
- Humanoid Robotics Technology showed MANUS Metagloves teleoperating a 22-DoF Sharpa Robotics Wave hand, using ultra-precise, drift-free, occlusion-free finger tracking data for dexterous humanoid robot training (MANUS), with blastoffrails highlighting sub-millimeter dexterity for open-source humanoid data collection.
🎙️ Interviews, Panels & Podcasts
- ARC Prize shared Jerry Tworek interviewing François Chollet on how to define intelligence, why games are strong intelligence tests, meta-learning as the closest current paradigm, and differences between OpenAI and Anthropic’s AGI approaches.
💡 Industry Commentary & Analysis
- SHL0MS posted a real Monet painting while claiming it was AI-generated, prompting thousands of confident critiques about brushwork, texture, color, and “soul,” with Jediwolf compiling the experiment and the replies.
- oneill_c argued that long-horizon agent products need specialized models trained on proprietary product-interaction data, because the reward signal lives inside the app layer, not inside frontier labs’ generic models.
- tuhinone argued for a many-model future where people closest to customers, like doctors, teachers, lawyers, scientists, and coders, shape specialized models from their own proprietary signal instead of waiting for a few labs to own general intelligence.
- Chris Barber interviewed Sean Z. Cai on how founders can sell high-quality data and RL environments to AI labs, including underserved areas like bio, cyber, and robotics, as labs spend $10B-$20B per year on external data.
- Yijia Shao argued that practical LLM experience predicts human-agent collaboration skill better than attitudes, with top-quartile Co-Gym participants beating solo agents 74% of the time while bottom-quartile participants won only 27%.
- Zhiruo Wang added that real agent experience updates workers’ beliefs about what agents can do, while leaving their desired level of human agency mostly unchanged.
- Ethan Mollick argued AI labs need stricter message discipline because public comments are now parsed by regulators, media, and the public, where small slips can materially affect policy and perception.
- roon mocked the explosion of hidden agent modes and toggles, from /extrausage to “no mistakes” and “autonomy slider,” that power users now invoke to make frontier tools behave.
- Sarah Wooders shared that her team’s effective Letta Code review bot runs on GLM 5.1 rather than GPT-5.5 or Opus 4.7, a reminder that Chinese and open models are already working in real production agent workflows.
- Alec Helbling built an animated educational demo showing Helmholtz decomposition, or how a vector field can split into curl/rotation and gradient/source-sink components.
- Hugging Face CEO Clement Delangue argued that restricting powerful open-source AI models is riskier than releasing them, because openness enables broader scrutiny, faster defensive patches, and prevents capability gaps between a few closed players and everyone else.
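For reference, the Helmholtz decomposition in Alec Helbling's demo above is usually written as below. This is the standard textbook statement for a smooth vector field that decays fast enough at infinity, not a transcription of the demo; the symbols φ (scalar potential) and A (vector potential) are conventional names and do not come from the post.

```latex
\mathbf{F} = -\nabla \phi + \nabla \times \mathbf{A},
\qquad \nabla \times (\nabla \phi) = \mathbf{0},
\qquad \nabla \cdot (\nabla \times \mathbf{A}) = 0
```

The first term is the gradient (source-sink) part, which has no curl, and the second is the rotational part, which has no divergence.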
📊 Fundraising & Deals Roundup
- TrueShort — $12M to build vertical AI movies and series for phones.
- Wirestock — $23M to supply licensed multimodal creative data to AI labs.
- Anthropic — $200M partnership with the Gates Foundation for grants, Claude credits, and technical support across global health, life sciences, education, and economic mobility.
- Nvidia CEO Jensen Huang’s foundation — $108M worth of CoreWeave compute purchased and donated to universities and nonprofits.
Previous Around the Horn Digests
Catch up on everything you missed:
- Tuesday, May 12: Anthropic refused China access to its newest model, Isomorphic raised $2.1B, Google pushed Gemini deeper into Android, and supply-chain attackers hit Mistral and TanStack.
- Monday, May 11: Cerebras upsized its IPO, Cowboy Space raised money for orbital data centers, and Google confirmed the first criminal AI-discovered zero-day.
- Weekend, May 9-10: Weekend roundup of the AI stories that piled up while everyone pretended to log off.
- Thursday, May 7: Anthropic shipped Natural Language Autoencoders, Google DeepMind detailed AlphaEvolve's real-world science work, and Cloudflare cut 20% of its workforce.
- Wednesday, May 6: Anthropic ran Code with Claude SF, shipped developer updates, and the federal safety net looked very unready for AI job displacement.
- Tuesday, May 5: OpenAI governance lore resurfaced, legal AI adoption kept climbing, and Harvey's usage metrics turned heads.
- Monday, May 4: The White House considered pre-release AI vetting, Anthropic and OpenAI both linked up with private equity, and Mayo Clinic's AI spotted pancreatic cancer early.
That's a Wrap
That's plenty of AI news (and we'll have more where this came from as we update this post), spanning recursive self-improvement, faster training, elastic reasoning, browser agents, local model stacks, machine-checked math, and enough diffusion-model variants to make "generate me a video" sound quaint.
For the daily version, make sure you're subscribed to The Neuron. We send six issues a week, and yes, we read all of this so you don't have to.
See you tomorrow.
P.S. Know someone who'd find this useful? Forward this to them and tell them to subscribe here.