😸 OpenAI solved an 80-year math problem by... disproving it

Welcome, humans.

Some people just inherently understand their priorities in life, and now that they can code, are unleashing true beauty into the world:

Happy Memorial Day Weekend to everyone who celebrates! Gonna keep it light today.

Here’s what happened in AI today:

🐱 OpenAI’s model solved an 80-year math problem.
📰 OpenAI and Anthropic’s revenue race got weird fast.
📰 Trump delayed an AI order as California prepared workers.
🍪 Qwen 3.7 Max ran an agent for 35 hours.
🎓 Use Codex /goal mode for long tasks.

Hey: Want to reach 700,000+ AI-hungry readers? Advertise with us!

P.S: Love robots? We’re starting a new robotics newsletter! Sign up early here.

🙀 OpenAI says its model solved an 80-year math problem by… disproving it.

Math has one perk for AI watchers: eventually, somebody checks the work.

That makes OpenAI’s new claim worth paying attention to. The company says an internal reasoning model apparently disproved the Erdős unit distance conjecture, a discrete geometry problem from 1946. If that all made you go “Huh?”, keep scrolling.

Here’s the basic explanation of the problem: if you place n points on a flat plane, how many pairs can sit exactly one unit apart? For decades, many mathematicians believed square-grid style patterns were basically the best possible answer.
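To get a feel for the counting problem, here is a minimal Python sketch that brute-forces pair counts on a small square grid. The function names are hypothetical, chosen for illustration. It works with integer points and squared distances to avoid floating-point error: if many pairs sit at squared distance d2, shrinking the whole grid by a factor of sqrt(d2) turns every one of them into a unit-distance pair. This is only a toy view of the grid-style patterns, not the construction from the new proof.

```python
from itertools import combinations


def square_grid(k):
    """Return the k*k integer points (x, y) with 0 <= x, y < k."""
    return [(x, y) for x in range(k) for y in range(k)]


def count_pairs_at_squared_distance(points, d2):
    """Count unordered pairs of points whose squared distance equals d2."""
    return sum(
        1
        for (x1, y1), (x2, y2) in combinations(points, 2)
        if (x1 - x2) ** 2 + (y1 - y2) ** 2 == d2
    )


grid = square_grid(10)  # n = 100 points

# d2 = 1 is the plain grid spacing; 25 = 3^2 + 4^2 = 5^2 + 0^2 has extra
# representations as a sum of two squares, so more pairs land on it.
for d2 in (1, 25):
    pairs = count_pairs_at_squared_distance(grid, d2)
    print(f"squared distance {d2}: {pairs} pairs")
```

On this 10-by-10 grid, the plain spacing gives 180 pairs, while rescaling so that the distance 5 becomes the unit gives 268. Picking distances that can be written as a sum of two squares in many ways is what makes grid patterns a strong baseline.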

And yet, OpenAI’s unreleased reasoning model apparently found a counterexample: a new infinite family of point arrangements that creates more unit-distance pairs than the old grid-based belief allowed. That means the model “solved” the problem by proving the conjecture was false.

Here’s what happened:

OpenAI said the original proof came from a general-purpose reasoning model, rather than a system specially trained, scaffolded, or targeted for this problem.
The proof shows infinitely many point sets with at least n^(1+δ) unit-distance pairs, for some fixed δ > 0.
That beats Erdős’s conjectured n^(1+o(1)) ceiling, which roughly meant “only a tiny bit better than linear.”
External mathematicians published companion remarks verifying and explaining the result.
Princeton mathematician Will Sawin sharpened it, showing more than n^1.014 unit-distance pairs for arbitrarily large point sets.
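For readers who want the bullets above in symbols, here is a short sketch. The notation u(n), for the largest number of unit-distance pairs among n points in the plane, is an assumption introduced here for illustration and is not taken from OpenAI's write-up.

```latex
% u(n): maximum number of unit-distance pairs among n points in the plane
% (notation assumed for illustration).
\begin{align*}
\text{Conjecture (Erd\H{o}s, 1946):}\quad
  & u(n) \le n^{1+o(1)} \\
\text{Reported result:}\quad
  & u(n) \ge n^{1+\delta}
    \ \text{ for infinitely many } n,\ \text{some fixed } \delta > 0 \\
\text{Sawin's sharpening:}\quad
  & u(n) > n^{1.014}
    \ \text{ for arbitrarily large } n
\end{align*}
```

The two statements cannot both hold: the o(1) term shrinks toward zero as n grows, so the conjectured exponent eventually drops below 1 + δ for any fixed δ > 0, while the new point sets keep exceeding n^(1+δ) at arbitrarily large n.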

Why this matters: This is a cleaner test of AI reasoning than a benchmark (a standardized model test). Benchmarks can reward lucky guesses. A proof has to survive expert review, line by line.

The proof used algebraic number theory (math about number systems), including class field towers and Golod-Shafarevich theory, to crack a geometry problem that sounds simple.

TechCrunch noted an earlier OpenAI Erdős claim fell apart after the model surfaced existing results. This time, outside mathematicians signed the companion remarks, including some critics of that previous episode. OpenAI turned their haters into benchmarks, basically.

Elliot Glazer added an interesting POV on this too: AI may surface answers humans could have found, but didn’t have the time (or the will) to go after because they didn’t seem worth chasing. Only so many experts can spend years hunting for a counterexample the field doubts exists in the first place.

Our take: Think about the loop here: the model found the weird route, humans checked the work, Codex helped clean up the write-up, and Princeton’s Will Sawin showed the construction’s edge compounds at huge scale, which is why the result matters beyond “AI found a math trick.”

Math is unusually AI-friendly because proofs can be checked. Biology, medicine, and business strategy have messier feedback loops. Greg Kamradt of ARC Prize recently shared a nice breakdown of the 7 levels of verifiability, a spectrum that ranks tasks by how long it takes to get “feedback” on whether your actions led to the outcome you want. Read our deep dive on the topic here.

FROM OUR PARTNERS

Your software needs to be compliant to win deals. But you also need your engineers focused on building your product—NOT pulling SOC 2 evidence.

The Vanta Agent is the sharpest GRC engineer you’ve never had to hire, working tirelessly across the platform to draft policies, complete questionnaires, and flag issues before they escalate.

Fast-moving companies like Ramp and Cursor use Vanta to get and stay compliant, simplify their audit process, and unblock deals—so teams can get back to building.

Ready to learn more? Watch the on-demand demo to see how Vanta works.

🎓 AI Skill of the Day: Give Codex a Definition of Done

Long agent tasks fail when the AI forgets what “done” means. Codex’s /goal mode fixes that by giving it a persistent objective it can keep checking as it works.

Use this for tasks with many steps: migrations, refactors, audits, bug sweeps, or report generation. The trick is to write the goal like a mini contract: outcome, constraints, and tests. If /goal does not appear, OpenAI says you can enable features.goals in config.toml or run codex features enable goals.

Try this:

/goal
Audit this project for newsletter draft readiness.

Definition of done:
1. Every section has the required header.
2. Every hyperlink is attached to a short, natural anchor.
3. No Treats or Around the Horn bullets use bold text.
4. Every technical term has a plain-English parenthetical on first use.
5. Return a short report with pass/fail status and exact fixes made.

Before editing, make a checklist. After editing, run the checklist again and show me what changed.
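
One way to make a definition of done hard to forget is to turn a few items into checks a script can run. Below is a minimal Python sketch covering items 1 and 3 from the prompt above. It assumes the draft is Markdown-style text with "- " or "* " bullets and "**bold**" markers, and every function name is hypothetical. It is not part of Codex or /goal.

```python
import re


def has_required_headers(text, required_headers):
    """Item 1: every required header string appears somewhere in the draft."""
    return all(header in text for header in required_headers)


def no_bold_in_bullets(text):
    """Item 3: no bullet line contains **bold** text."""
    bullets = [
        line for line in text.splitlines()
        if line.lstrip().startswith(("- ", "* "))
    ]
    return not any(re.search(r"\*\*.+?\*\*", line) for line in bullets)


def run_checklist(text, required_headers):
    """Return a pass/fail report, one entry per check."""
    return {
        "required headers present": has_required_headers(text, required_headers),
        "no bold text in bullets": no_bold_in_bullets(text),
    }


draft = """Treats to Try
- **Qwen 3.7 Max** ran an agent for 35 hours.
Around the Horn
- California signed an AI order.
"""
headers = ["Treats to Try", "Around the Horn"]

report = run_checklist(draft, headers)
for name, passed in report.items():
    print("PASS" if passed else "FAIL", name)

# After a fix, run the same checklist again to confirm nothing regressed.
fixed_report = run_checklist(draft.replace("**", ""), headers)
```

Running the same checklist before and after an edit mirrors the last line of the prompt: the agent (or you) gets an unambiguous pass/fail signal instead of a vague sense that the draft looks finished.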

Total AI beginner? Start here (goes with this video).

Have a specific skill you want to learn? Request it here.

🍪 Treats to Try

*Asterisk = from our partners (only the first one!). Advertise to 700K+ readers here!

Want a serious AI sandbox? Dell Pro Max with GB10 gives builders NVIDIA Grace Blackwell power, 128GB memory, and DGX OS 7. Check it out.
Qwen 3.7 Max gives you an agent model built for long work sessions, including a 35-hour autonomous GPU-kernel optimization run with 1,158 tool calls, 432 tests, and a reported 10x speedup on Alibaba hardware.
Command A+ gives you Cohere’s open enterprise model for reasoning, tool use, images, 48 languages, and private deployments, with 218B total parameters but only 25B active per request (so it runs cheaper than its full size suggests)—free to try if you self-host.
Codex now supports /goal for long-running tasks, Computer Use for clicking around Mac apps, and Appshots for giving Codex richer desktop context, all available in Codex.
Figma added a design agent directly on the canvas, so you can generate designs, edit existing files, and create variations from text prompts—waitlist.
Runway Aleph 2.0 edits one video frame and then carries that change through the rest of the clip, so you can revise video without rebuilding it shot by shot.
Windsor MCP for SEO connects your SEO data from GSC, GA4, Google Ads, WordPress, and more to ChatGPT or Claude so you can spot ranking drops, content gaps, and conversion opportunities without building reports by hand (free trial, then $23/mo).

PON DE REPLAY: Ben Cherry of LiveKit Teaches Us How to Make Voice Agents (It’s SO Easy):

Click the image above to watch on YouTube!

ICYMI: Ben Cherry of LiveKit joined us on The Neuron’s weekly livestream to show us how to build real-time voice agents that can listen, interrupt, call tools, and run in production. And guess what? It’s so easy, even an agent can do it! You can literally grab the transcript from YouTube (click “… more” under the vid description, scroll down to click “Show Transcript”, then copy the transcript and give it to your Codex / Claude to set up for you).

It’s a super fun episode: Ben shows how to launch an agent via LiveKit (and what code repos to use if that kinda thing doesn’t intimidate you), edits his agent live with Claude Code, and even clones his own voice for us live. Click here to watch.

📰 Around the Horn

OpenAI reportedly generated about $5.7B in Q1, nearly $1B ahead of Anthropic, while Anthropic is projected to more than double to $10.9B in Q2.
California signed a first-in-the-nation order to prepare workers and small businesses for AI disruption.
xAI’s Grok reportedly flopped with U.S. government buyers, with Reuters finding only three identified federal use cases.
Intuit planned to lay off 3,000+ workers, about 17% of its workforce, to simplify the company and refocus on AI products. Remember Monday’s story?
Samsung will reportedly distribute about $26.6B in bonuses to chip workers, averaging roughly $340K per employee. That’s how important chips are!
NVIDIA CEO Jensen Huang said CPUs built for AI agents could become a new $200B market for the company.
Taiwan sought to detain three people accused of forging documents to smuggle NVIDIA AI chips to China, Hong Kong, and Macau.

FROM OUR PARTNERS

Your next great hire lives in Slack.

Viktor is an AI coworker that connects to your tools and ships real work. Ask Viktor to pull a report, build a client dashboard, or source 200 leads matching your ICP. Most teams hand over half their ops within a week.

Add Viktor to Slack for free.

💡 Intelligent Insights

Data filtering: this research paper argues larger models can benefit from messy data that smaller models cannot use well (basically scaling laws for data).
AI oversight: the U.K. AI Security Institute warned that today’s oversight methods may degrade as models get more capable.
Arena’s frontier: Arena.ai found GPT-4-level model quality is now roughly 500x cheaper than it was in 2023 (and 4 other insights you might like!).
Accessible for AI: a sharp TechPolicy Press critique of llms.txt, MCP servers, and other machine-readable web infrastructure that may leave disabled users behind.
If you read anything today, read this: After Automation from Dan Shipper argued AI creates more work for humans by flooding the world with generic output and raising the value of taste, context, and judgment. I’ll end the newsletter with this insight: