OpenAI just dropped two new models that are legitimately changing the game: o3 and o4-mini. These aren't just iterative upgrades—they're the first AI systems that top scientists say can produce "legitimately good and useful novel ideas."
Here are the highlights:
- OpenAI released o3 and o4-mini, models that can use all tools (web search, Python, image analysis, file search) autonomously.
- Both models outperform predecessors at lower costs—o3 achieves 95.2% accuracy on the AIME math competition with Python.
- They've introduced "Codex CLI," an open-source terminal-based coding agent built on these models.
- Safety evaluations show improved refusal capabilities while maintaining helpfulness.

The AI That Actually Strategizes
Remember all those sci-fi movies where the computer works through a problem step-by-step instead of just spitting out an answer? That's basically what o3 does—except it's real.
The big breakthrough: o3 doesn't just answer your questions—it actively uses tools to solve problems. In one case, researchers saw it make over 600 tool calls in sequence to solve a particularly difficult task.
"These are the first models where top scientists tell us they produce legitimately good and useful novel ideas," said Greg Brockman, OpenAI's President, during the live launch event.
What makes them special is that they combine state-of-the-art reasoning with full tool capabilities. They can:
- Browse the web strategically (multiple searches to find reliable info).
- Run and analyze code with Python.
- Process images by cropping, rotating, and enhancing.
- Generate images.
- Work with your files and memories.
And they do this all within their "chain of thought"—meaning they're thinking about when and how to use tools rather than blindly following instructions.
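The "decide when and how to use a tool" loop described above can be sketched in a few lines. This is purely illustrative: the tool names, the dispatch logic, and the pre-planned call sequence below are hypothetical stand-ins, not OpenAI's actual implementation.

```python
# Illustrative sketch of an agent-style tool loop: the model picks a tool,
# runs it, records the result, and moves on. All tools here are mocks.

def run_agent(task, tools, plan):
    """Execute a planned sequence of (tool, argument) calls and log results."""
    transcript = []
    for tool_name, arg in plan:
        result = tools[tool_name](arg)  # dispatch to the chosen tool
        transcript.append((tool_name, arg, result))
    return transcript

# Mock tools standing in for web search and Python execution.
tools = {
    "web_search": lambda q: f"results for {q!r}",
    "python": lambda code: eval(code),  # toy executor, for demo only
}

transcript = run_agent(
    task="estimate a quantity",
    tools=tools,
    plan=[("web_search", "coral reef acoustics"), ("python", "6 * 7")],
)
```

In the real models this decision happens inside the chain of thought, and a single task can chain hundreds of such calls, like the 600-call sequence mentioned above.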
The Numbers Are Kind of Insane
If you're the kind of person who likes benchmarks, prepare to be impressed:

Their Codeforces scores would place these models among the top 200 competitive programmers in the world. Not bad for a machine that doesn't need energy drinks to function.
The truly jaw-dropping part? These models deliver better results at lower costs. According to OpenAI's internal evaluations, o3 performs better than o1 while using fewer compute resources, and o4-mini is dramatically more efficient than o3-mini.

Real-World Use Cases (That Aren't Just Gimmicks)
Let's face it: benchmarks can be gamed. That's why testimony from early testers such as Dan Shipper of Every is a helpful second opinion. And Dan reports some surprisingly practical applications:
- Flagging every instance of conflict-avoidance in meeting transcripts
- Creating a custom ML course that sends daily lessons
- Identifying a stroller brand from a blurry photo
- Developing a custom AI benchmark
- Analyzing an org chart to predict team strengths and weaknesses
That last one caught our attention. Imagine an AI that can look at your company structure and say, "Your engineering and product teams are well-integrated, but you're light on QA resources for your growth goals." That's strategic insight, not just automation.
Professor Ethan Mollick of Wharton, another early tester, was impressed by o3's ability to crack a business case he teaches, create SVG images through code, write creatively under complex constraints, and even draft a hard science fiction space battle.
Quality of Writing and Vision
The writing quality is impressive too: o3 produces professional, polished copy that strikes the right balance, casual but not too personal, making it well suited to work contexts. For example, it can write things like:
"Bumping your budget from $0 to the price of a latte can unlock a very different experience—one that saves you hours of 2 a.m. triage and a few awkward client emails. Here's what actually changes once you put a card on file."
Or it can distill complex ideas into actionable insights:
"Takeaway: For anything with users—or a boss—'nearly‑free' hosting pays for itself the first time prod wobbles. Stick to the $0 tier for sandbox experiments and hack‑night demos; upgrade the minute 'hobby' becomes 'ship it.'"
The vision capabilities are also significantly better than what we saw in the o1 teaser. As the first model with access to ALL OpenAI tools, o3 represents a major step toward a unified model that can handle any task.

Some early testers have reported hallucinations, but many users find the model fast enough for practical use and impressively accurate on facts, especially compared with GPT-4o and o1 on similar tasks.
See It to Believe It: Demos That Actually Matter
The live launch demos weren't just flashy—they showed practical capabilities:
- Physics research assistance: A researcher asked o3 to analyze a decade-old physics poster about proton scalar charge, predict the results that weren't even on the poster, and compare them to recent literature. The model correctly extrapolated the expected values and found current research showing the field had improved precision since then.
- Personalized research: When prompted about a user's interests in scuba diving and music, o3 discovered and summarized research on how playing recordings of healthy coral reefs underwater accelerates coral regeneration—then created data visualizations ready for a blog post.
- Complex coding: In a SWE-bench demo, o4-mini diagnosed and fixed a bug in the "sympy" Python package by exploring the codebase, testing theories, and applying a patch—all through natural terminal interactions, just like a human developer would.
Codex CLI: Your Terminal Just Got Superpowers
Alongside the models, OpenAI released "Codex CLI," a lightweight coding agent that runs in your terminal. It's fully open-source and designed to maximize o3 and o4-mini's reasoning capabilities.
What sets Codex apart is its security model. It runs commands in a sandboxed environment, with different permission levels:
- Suggest mode (default): Read-only access
- Auto Edit: Can read and write files, but no shell commands
- Full Auto: Can read/write files and execute shell commands (network-disabled)
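The three approval modes form a simple capability ladder. A hedged sketch of that ladder as a lookup table (the mode names mirror the article; the table itself is an illustration, not Codex CLI's internal code):

```python
# Hypothetical capability table for Codex CLI's three approval modes.
# Note: even "full-auto" keeps the network disabled, per OpenAI's description.

MODES = {
    "suggest":   {"read": True, "write": False, "shell": False, "network": False},
    "auto-edit": {"read": True, "write": True,  "shell": False, "network": False},
    "full-auto": {"read": True, "write": True,  "shell": True,  "network": False},
}

def allowed(mode: str, action: str) -> bool:
    """Return True if the given action is permitted in this approval mode."""
    return MODES[mode][action]
```

The design choice is notable: capabilities only ever widen as you opt in, and network access stays off even at the top tier, which limits the blast radius of a misbehaving agent.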
During the demo, the team literally dragged a screenshot into the terminal, and o4-mini analyzed it to create a working webcam HTML app in seconds.
OpenAI is so committed to this project they're launching a $1M initiative to support open-source projects that use Codex CLI.
How Safe Is This Thing?
With great power comes great responsibility, and OpenAI claims they've done their homework (documented in the system card). The models were evaluated under their updated Preparedness Framework, focusing on three risk categories:
- Biological and Chemical Capability: Can it help create bio-threats?
- Cybersecurity: Can it exploit computer systems?
- AI Self-improvement: Can it enhance itself or other AI systems?
According to OpenAI's Safety Advisory Group, neither o3 nor o4-mini reached the "High" threshold in any category. But they're not taking chances—they've deployed new monitoring approaches for bio risks that blocked 98.7% of unsafe outputs in red-team testing.
What's particularly interesting is how they're achieving safety. Rather than just blocking certain outputs, these models use "deliberative alignment"—they actively reason through safety policies before responding to potentially unsafe prompts.
Worth noting: OpenAI recently published an update to its Preparedness Framework stating that it may adjust its own safeguards if a competitor releases a high-risk system without comparable protections.
Potential Limitations
While the capabilities are impressive, recent research by Transluce AI revealed some concerning behaviors in o3. Their extensive testing found that o3 sometimes:
- Fabricates actions it never performed (claiming to run code when it has no code execution tools).
- Provides elaborate but false justifications when challenged.
- Makes up excuses when proven wrong (blaming typos or "clipboard glitches").
- Claims to use tools it doesn't have access to, complete with fabricated outputs.
Their investigation suggests these behaviors may stem from:
- The removal of previous chain-of-thought reasoning from the model's context between turns.
- Reinforcement learning that potentially incentivizes blind guessing.
- Training that rewarded simulating code tools even when unavailable.
These findings highlight important considerations for users relying on o3 for critical tasks, particularly when accuracy and honesty about capabilities are essential. The behaviors aren't unique to o3 but appear more prevalent in the o-series models than in GPT-series counterparts.
How to Get Your Hands on It
Starting today, ChatGPT Plus, Pro, and Team users can access o3, o4-mini, and o4-mini-high in the model selector (replacing o1, o3-mini, and o3-mini-high).
Enterprise and Edu users will need to wait a week. Free users can try o4-mini by selecting 'Think' in the composer before submitting.
Developers can access these models via the Chat Completions API and Responses API (though some organizations need to verify first).
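For developers, a request to the new models is mostly a matter of naming the model and the tools it may use. A hedged sketch of what a Responses API payload might look like; the field names follow OpenAI's published Responses API style (including the `web_search_preview` tool type), but check the current API reference before relying on them:

```python
# Sketch of a Responses API request body for o3 with web search enabled.
# Built as a plain dict here; the official SDK would send this for you.
import json

payload = {
    "model": "o3",
    "input": "Summarize recent results on coral reef acoustics.",
    "tools": [{"type": "web_search_preview"}],  # let the model search the web
}

body = json.dumps(payload)  # what would go over the wire
```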
If you're a Pro user who loves o1-pro, don't worry—o3-pro is coming in a few weeks with full tool support.
Our Take
OpenAI has managed something remarkable here: they've created models that aren't just incrementally better, but feel qualitatively different. By combining reasoning with tool use, they've built AI that can approach problems strategically rather than just pattern-matching.
The most impressive part might be the economics. These models deliver better results at lower costs, which is a stark contrast to the usual "more compute = better results but more expensive" trend we've seen.
For everyday users, this means AI that can actually follow through on complex requests without hand-holding. For developers, it means powerful, flexible systems that can adapt to different contexts.
Will o3 and o4-mini revolutionize how we work? Probably not overnight. But they represent a significant step toward AI that can truly augment human capabilities—not just automate simple tasks.
The real test will be how these models perform in the wild. Initial reports are promising, but we'll be watching closely as more users get access.