Is o3 Really "Genius Level" AI? Comparing OpenAI's Latest Models (o3, o4-mini) with Gemini 2.5

We compare o3 and o4-mini against Gemini 2.5 and attempt to answer whether or not either model can be called "genius."

So OpenAI has two new AI models out—should you use them, or nah? The answer really depends on whether you want the best of the best, the best for the price, or just good enough.

TL;DR: Best of the best = o4-mini, best for the price = Gemini 2.5, just good enough = whatever brand of model you prefer at the $20-a-month tier, because they’re all competitive.

Now, you might be wondering: are these new models actually “genius?” Hmm, wherever did you get THAT idea?

It all seemed to kick off, as things often do, with a bit of X buzz. A post from OpenAI CEO Sam Altman amplified a quote from immunologist Derya Unutmaz declaring the new o3 model as "genius level AI"...

Now, whether Sam was fully endorsing the 'genius' label or just highlighting the excitement is debatable, but the internet did what it does best: it ran with it.

Suddenly, the conversation wasn't just about whether the new models (o3 and its sibling, o4-mini) were good, but whether they represented a leap into truly superhuman intelligence.

Let's be clear: the initial feedback is strong. Pretty much everyone agrees these models, now powering the Plus tier of ChatGPT, feel noticeably sharper, faster, and more capable than their predecessors like o1.

But genius? That's a weighty term, conjuring images of Einstein scribbling equations or Mozart composing symphonies. Does o3, impressive as it is, meet that bar? The consensus seems to be… probably not yet.

While many agree the models are a solid step forward, perhaps even the best currently available for certain tasks, the "genius" claim feels like a stretch, maybe even bordering on hyperbole, especially when you dig into the details and compare it to formidable competitors like Google's Gemini 2.5 Pro.

Here’s the evidence FOR genius:

First, Derya shared o3’s results on the Mensa Norway IQ test, and it’s neck and neck with Gemini 2.5, both hovering around "genius" level (an IQ of 136), which, to him, proves his point.

Next, Artificial Analysis just released their independent evaluation of o4-mini (which we know isn't o3, but it's pretty close), and yes—it technically claims the highest score on their Intelligence Index to date.

It scored a 70, beating out Gemini 2.5 Pro Preview (68) and Grok 3 mini Reasoning (67).

The model shows particularly impressive gains in coding intelligence, where it achieved the #1 position in their Coding Index with strong improvements in both LiveCodeBench and SciCode benchmarks.

It also showed impressive performance on math benchmarks, matching or exceeding others (thanks to that tool use no doubt).

So, the data suggests o4-mini is a clear upgrade, particularly strong in coding, and holds a slight edge over Gemini 2.5 on these specific intelligence benchmarks. But benchmark scores and "raw intelligence" don't tell the whole story.

The pricing remains exactly the same as o3-mini ($1.10/$4.40 per 1M input/output tokens), though cached inputs are half the price. And while o4-mini maintains the same 200k-token context window as o3-mini, that's now significantly smaller than GPT-4.1's massive 1M-token window.

What does this have to do with "genius" you ask? Nothing, technically, but we thought you'd want to know.

Something that DOES pertain to genius is token efficiency. As "reasoning" models (meaning they often show their work or "think step-by-step"), AI models like ChatGPT's o-series tend to use more tokens than simpler models.

However, o4-mini (high) was slightly more efficient than o3-mini (high), using 72M tokens versus 77M to run the full AA Intelligence Index evaluation suite.
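Since we just gave you the per-token prices, here's a rough back-of-the-envelope sketch of what that efficiency gap works out to in dollars. Big hedge on our part: we're pretending the full 72M/77M token totals all get billed at the $4.40-per-1M output rate, which isn't exactly how the bill breaks down (input and cached tokens are cheaper), so treat this as ballpark math, not an invoice.

```python
# Ballpark cost of running the full AA Intelligence Index suite,
# assuming (our simplification) every token is billed at the output rate.
OUTPUT_PRICE_PER_M_TOKENS = 4.40  # USD per 1M output tokens for o4-mini / o3-mini

def eval_cost(total_tokens_in_millions: float) -> float:
    """Cost in USD if all tokens were billed as output tokens."""
    return total_tokens_in_millions * OUTPUT_PRICE_PER_M_TOKENS

o4_mini = eval_cost(72)  # ~ $316.80
o3_mini = eval_cost(77)  # ~ $338.80
print(f"o4-mini (high): ${o4_mini:.2f}")
print(f"o3-mini (high): ${o3_mini:.2f}")
print(f"difference:     ${o3_mini - o4_mini:.2f}")  # ~ $22 saved across the whole suite
```

So under those assumptions, the newer model shaves maybe ~$22 off a full benchmark run: nice, but not exactly the dividing line between "smart" and "genius."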

Now, the evidence AGAINST genius:

AI Explained (one of our favorite YouTubers) looked at o3 and o4-mini versus Google's Gemini 2.5 in a great head-to-head comparison:

"Good, But Not Genius": While acknowledging the improvements over o1, AI Explained firmly stated that neither o3 nor o4-mini are "genius" or AGI (Artificial General Intelligence).

His definition of AGI involves performing better than the human average at most human tasks, which these models don't yet achieve, despite excelling in specific areas like knowledge recall and coding. (FYI: our AGI definition differs only slightly in that we think AGI is when AI can do every task as well as a human can; anything "better" would fall under superintelligence).

The Hallucination Problem: AI Explained also directly refuted any and all claims of the models being "hallucination-free," calling it "complete BS."

While OpenAI's own release notes mention a reduction in major errors (around 20% fewer according to external experts), this implicitly admits errors still occur. AI Explained emphasizes they do still make mistakes, sometimes quite basic ones.

AI Explained consistently brought the comparison back to Gemini 2.5 Pro:

  • Cost: He highlighted that Gemini 2.5 Pro is roughly 3 to 4 times cheaper than o3, a massive factor for practical deployment.
  • Multimodality: He noted Gemini 2.5 Pro's ability to directly process video (like YouTube links), a capability o3 lacks (it could only analyze metadata when presented with a video file).
  • Performance: While o3/o4-mini won on some benchmarks (like competitive math and coding, especially when allowed high "thinking" effort), Gemini 2.5 often matched or only slightly trailed on others (like PhD-level science reasoning on GPQA) at a lower cost and often on the first attempt. He stressed that even where o3 took the lead (e.g., coding), it might be at a significantly higher computational cost.

Now, was o3 actually "Benchmark Optimized"? A crucial detail OpenAI slipped in was that the version of o3 tested earlier (which generated much of the initial hype back in December) was potentially "benchmark optimized."

AI Explained interprets this to mean it was likely allowed significantly more inference time or compute ("thinking time") to achieve those top scores. The models released to the public, while still powerful, might be faster but slightly less capable, potentially less "genius" versions than the ones that initially crushed the leaderboards.

As far as genius is concerned… here’s what he found:

  1. While o4-mini excels at academic benchmarks (especially math and coding), it still lacks the common sense that we'd expect from a true genius.
  2. He pointed to basic reasoning failures on his SimpleBench tests, like one where o3 couldn't correctly count line intersections in a diagram.
  3. In another classic reasoning problem, a glove falls out of the trunk of a car driving over a bridge. Where does it likely land? o3 couldn't work out that the glove would land on the bridge, NOT in the river below.

This gets at a fundamental limitation of current AI:

These models can predict the most likely answer given a set of tokens, but they lack the human ability to visualize or construct a robust, dynamic mental model of the world before answering.

For example, it seems to us that LLMs don't "visualize" the car-on-the-bridge scene the way a human would, realizing the bridge deck is the immediate surface below the car. A genius, presumably, wouldn't forget the bridge.

[P.S.: we're about to get real nerdy with it, so if you want to skip to the Gemini vs. o3 stuff, scroll past all this genius chit-chat to the head-to-head.]

This whole debate forces us to consider: what constitutes "genius," anyway? Is it simply acing standardized tests and benchmarks? If so, these models are certainly knocking on the door. They can absorb and regurgitate vast amounts of information, perform complex calculations, and write intricate code far beyond the capabilities of most humans.

But human genius typically involves more. It's about creativity, synthesizing disparate ideas into novel concepts, deep understanding of context and nuance, rigorous common-sense reasoning, adaptability to truly new situations, and perhaps that very ability to build mental models and visualize possibilities that today's AI still struggles with.

A genius physicist doesn't just apply known formulas; they develop new theoretical frameworks. A genius artist doesn't just replicate styles; they create entirely new movements.

Contrast that with current language model systems, which excel at interpolation – operating brilliantly within the space defined by their training data. They can combine known concepts in statistically probable ways. But they often struggle with extrapolation – venturing reliably beyond the patterns they've learned, or reasoning soundly about situations significantly different from their training examples.

The Three Pillars of Genius: As non-geniuses ourselves, we wanted to look up WTF makes a genius, anyway (besides IQ of course). And we landed on three key pillars: 

  • Associative Range - Modern research confirms that genius often involves what neuroscientists call "remote associations" or "cognitive flexibility" – the ability to connect seemingly unrelated concepts in novel, meaningful ways. This is where current AI excels, using massive associative databases to draw connections across domains.
  • Depth of Simulation - Cognitive scientists describe this as "mental modeling" or "working memory capacity" – the ability to hold complex scenarios in mind and simulate their evolution through multiple steps. Current AI approaches this through techniques like chain-of-thought reasoning but still struggles with deep causal understanding.
  • Metacognition - This critical component involves what researchers call "epistemic awareness" or "intellectual humility" – recognizing the boundaries of one's knowledge and understanding when additional information is needed. This remains the most significant gap for current AI systems.


Dean Keith Simonton, a Distinguished Professor Emeritus at UC Davis and leading researcher on creativity and genius, has developed a formal mathematical definition of creativity that helps us understand genius. Simonton defines creativity as a product of three essential factors: originality (how novel or improbable an idea is), utility (how useful or appropriate it is), and surprise (the creator's lack of prior knowledge about that utility).

Mathematically, he expresses this as c = (1 − p)u(1 − v), where p is the initial probability, u is the final utility, and v is prior knowledge of utility.
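To make that formula a little more concrete, here's a quick sketch plugging in some made-up numbers (ours, purely for illustration, not Simonton's):

```python
# Simonton's creativity formula: c = (1 - p) * u * (1 - v)
# p = initial probability (low p = highly original idea)
# u = final utility (how well the idea actually works)
# v = prior knowledge of that utility (low v = the payoff was a surprise)

def creativity(p: float, u: float, v: float) -> float:
    return (1 - p) * u * (1 - v)

# A long-shot idea that works brilliantly and surprises its creator:
print(creativity(p=0.05, u=0.90, v=0.10))  # ~ 0.77

# An obvious, fully expected idea that works just as well:
print(creativity(p=0.95, u=0.90, v=0.95))  # ~ 0.002
```

Same utility in both cases, wildly different creativity scores; originality and surprise do the heavy lifting.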

The example of AI failing the "bridge test" perfectly illustrates Simonton's theory of creativity as a Darwinian process. Simonton "convincingly argues that creativity can best be understood as a Darwinian process of variation and selection" where creators generate many ideas and subject them to judgment, selecting only the best. In the copy industry, we call this creative destruction.

The problem is that current AI systems don't truly engage in what Simonton calls "blind variation and selective retention" (BVSR). In this theory, ideas are "subjected to quasi-random combinations until a stable 'configuration' emerges," but AI's combinations aren't truly blind; they're statistically determined by training data patterns.

By contrast, when faced with the glove-on-the-bridge scenario, human genius involves:

  1. Generating multiple mental models (variation).
  2. Evaluating them against known physics (selection).
  3. Crucially, recognizing when information is missing (metacognition).

AI performs the first step algorithmically and approximates the second, but fails at the third: it doesn't know what it doesn't know. Current AI models, including both o3 and Gemini 2.5, lack this third component. They've been optimized for confident responses rather than appropriate uncertainty.
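For the nerds still with us, here's a toy sketch of what a BVSR-style loop might look like in code, using the glove example. To be clear, this is our own illustrative stand-in, not Simonton's actual model and not how o3 or Gemini actually work: it just generates quasi-random concept combinations (variation) and keeps the best-scoring ones (selection).

```python
import random

# Toy sketch of "blind variation and selective retention" (BVSR).
# Our illustration only: generate quasi-random combinations of concepts,
# score them with a hand-written judgment function, keep the best.

CONCEPTS = ["glove", "trunk", "car", "bridge deck", "river", "railing", "wind"]

def utility(idea):
    """Hypothetical judgment function (the 'selection' step).
    A human genius can also question whether this scoring is itself wrong;
    this program can't -- that's the missing metacognition step."""
    score = 0.0
    if "glove" in idea and "bridge deck" in idea:
        score += 2.0  # reward keeping the glove on the surface right below the car
    if "river" in idea:
        score -= 1.0  # penalize the tempting-but-wrong "it falls in the river" answer
    return score + random.random() * 0.1  # tiny noise so ties don't always break the same way

def blind_variation(n=50):
    """Variation: quasi-random three-concept combinations."""
    return [tuple(random.sample(CONCEPTS, 3)) for _ in range(n)]

def selective_retention(ideas, keep=5):
    """Selection: evaluate every candidate and retain the top few."""
    return sorted(ideas, key=utility, reverse=True)[:keep]

if __name__ == "__main__":
    for idea in selective_retention(blind_variation()):
        print(idea)
```

In our toy version, all the "genius" is smuggled in through the hand-written utility function, which is exactly the part current models can't audit for themselves.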

Simonton's research suggests that true genius goes beyond just pattern recognition or high intelligence. One researcher reformulating Simonton's work has proposed that creative outcomes can be defined as "implausible utility" - ideas that seem unlikely to work but end up being remarkably useful.

This helps explain why extraordinary intelligence alone doesn't guarantee genius-level contributions. The highest levels of creative achievement require both cognitive abilities and personality traits like autonomy, openness, and persistence, with autonomy being the trait many describe as being "strong-willed," the one that "allows for self-confident intellectual independence."

The bottom line? These new OpenAI models represent significant progress, but calling them “genius” is a stretch. They still lack some of the common-sense reasoning and creative leaps that define actual human genius.

The takeaway = don't get swept away by the hype cycle. o3 is a fantastic upgrade and a valuable tool. It's pushing the field forward. But it's not magic, and it's not (yet) the dawn of superintelligence. It's another significant step on a very long road, locked in fierce competition where labels like "best" or "genius" can change with the next release cycle. And for all we know, that could be Tuesday.

Head-to-Head: o3 vs. Gemini 2.5 Pro

So, pulling it all together, how do the models perform in practice? Based on Artificial Analysis' data and AI Explained's comparison, here's how o4-mini (used for a direct comparison, since o3 hasn't finished testing yet) and Gemini 2.5 Pro actually stack up:

  • Raw intelligence score: o4-mini leads at 70 vs. Gemini's 68, technically making it the highest-scoring model to date on the Intelligence Index.
  • Pricing: Gemini 2.5 Pro is significantly cheaper (about 3-4x less expensive). As Lisan al Gaib said on X, “if you’re rich, go for o3. If you’re not, go for Gemini 2.5 Pro.”
  • Knowledge cutoff: Gemini has more recent data (January 2025 vs. June 2024).
  • Coding ability: o4-mini shows particular strengths here, setting new benchmarks.
  • Output speed: Gemini 2.5 Pro is faster at 208 tokens/second vs. o4-mini's 135.
  • Context window: Gemini 2.5 Pro and o4-mini both support large contexts, though GPT-4.1 has surpassed both.

Our Own Quick Test Drive

As for o3 vs Gemini 2.5, we wanted to take both models out for the same spin. So first, we had both Gemini 2.5 and o3 write a draft of this very article, using the exact same prompt and outline for both. Neither was a breakout success, as they both had pros and cons, and we actually used pieces from both versions to help construct this piece.

But we didn't stop there. Taking inspiration from something o3 hallucinated, we decided to run a blind, nine‑prompt shoot‑out on tasks o3 imagined Neuron readers actually do (the prompts were basically all dreamed up by o3, with minor tweaks from us):

  1. Prompt 1: Summarize this 20‑page legal brief.
    1. o3’s answer
    2. 2.5’s answer
  2. Prompt 2: Draft a cold‑outreach email to a VP of Engineering. Use search to look up a specific company to tailor it to.
    1. o3’s answer
    2. 2.5’s answer
  3. Prompt 3: Debug a 40‑line ETL Python script.
    1. o3’s answer
    2. 2.5’s answer
  4. Prompt 4: Suggest a go‑to‑market plan for this B2B SaaS pivot. Imagine Nike needs a new plan to avoid tariffs and wants to pivot to B2B SaaS. How could they do it?
    1. o3’s answer
    2. 2.5’s answer
  5. Prompt 5: Explain Basel III capital rules to a new analyst in 300 words.
    1. o3’s answer
    2. 2.5’s answer
  6. Prompt 6: Visualize quarterly revenue (CSV) and write a two‑sentence insight.
    1. o3’s answer
    2. 2.5’s answer
  7. Prompt 7: Translate a marketing deck to Spanish (tone: witty but professional).
    1. o3’s answer
    2. 2.5’s answer
  8. Prompt 8: Interpret this income‑statement screenshot and spot anomalies.
    1. o3’s answer
    2. 2.5’s answer
  9. Prompt 9: Generate five creative analogies for “AI assistant” without clichés.
    1. o3’s answer
    2. 2.5’s answer

We then scored each answer on accuracy, clarity, and actionable depth (0‑10 scale) and used Claude (for neutrality) to tally them up. Based on the evaluation of both models across nine different tasks, here's a comprehensive performance analysis [NOTE: Claude thought we were talking about o3-mini, but all tests were done with o3; we just didn't want to regenerate the whole table to correct Claude].

Not that these numbers mean anything, but putting all the scores together in the aggregate, it shakes out like this:

Gemini 2.5's strengths:

  • Technical depth: More comprehensive technical explanations (legal brief, programming).
  • Complete solutions: Provides executable code and visualizations (data viz, ETL).
  • Multiple options: Offers alternative approaches and comprehensive context (outreach email).
  • Structured learning: Better at conceptual, beginner-friendly explanations (Basel III).

o3's strengths:

  • Conciseness: More direct, scannable content with less verbosity.
  • Business precision: Better with specific metrics and business terminology.
  • Elegant expression: More memorable and quotable phrasing.
  • Visual organization: Superior formatting and structural clarity.

Our takeaways

  1. Task-dependent performance: Neither model dominates across all tasks. Choose based on the specific need.
  2. Style differences: o3 excels in clarity and conciseness, while Gemini 2.5 offers more comprehensive technical depth.
  3. Communication approach: o3 optimizes for executive-friendly communication, Gemini 2.5 for technical completeness.
  4. Creative expression: o3 shows stronger creative language abilities with more elegant and memorable phrasing.

The overall results suggest o3 has a slight edge on average, but the optimal choice depends on specific use case priorities.

However, this is just our subjective grading; look at the answers and decide for yourself.


See you cool cats on X!
