Four Open AI Models Conquered Every Scale This Week

If you've been paying $20 a month for ChatGPT and assuming that's just how AI works now, this week might change your mind. Four separate teams, ranging from Google to a 30-person startup nobody had heard of a year ago, shipped AI models that are free to download, free to modify, and free to build a business on. They cover everything from your phone to your laptop to your company's server rack. And on the tests that matter, they're competitive with the models you're currently paying for.

Oh, and the best part? You can try all four of them right now.

First up, the TL;DR
The Gemma 4 story is bigger than it looks
Bonsai: What happens when you rethink the math
Holo3: The one that actually does things
Trinity: The 30-person lab that bet the company
The real story: the economics just inverted
Get the models
What this means for you

First up, the TL;DR

A year ago, running a competitive AI model meant renting cloud tokens by the million. This week, four separate teams shipped open models that cover every device you own, all under the same license: Apache 2.0. Use it, modify it, sell it.

Here's what happened:

The big one (new as of yesterday): Google released Gemma 4 in four sizes, including an edge model that runs on a Raspberry Pi (a $35 mini-computer) in under 1.5GB of memory (about the size of a single Netflix episode download), and a 31B model (31 billion parameters; parameters are the "brain cells" that determine how smart an AI is, and more usually means smarter) ranked #3 among all open models in the world. This is important: it's the first Gemma release under Apache 2.0 (an open-source license that means anyone can use, modify, or sell the model for free, no permission required).
PrismML emerged from stealth with 1-bit Bonsai, an 8B model (8 billion parameters) compressed to just 1.15GB (14x smaller than a standard version of the same model) that runs at 44 tokens per second on an iPhone (a token is a word or word fragment, so that's a bit under 44 words per second; fast enough for real-time conversation). It's competitive with full-precision models on benchmarks (the standardized tests AI models are graded on). Built on Caltech research.
H Company shipped Holo3, a computer-use agent (the kind that can click around your desktop and do tasks for you, like an invisible assistant controlling your mouse) that outperformed GPT-5.4 and Claude Opus 4.6 on the leading desktop automation benchmark, with just 10B active parameters (10 billion out of 122 billion total; it only activates the parts it needs for each task). A smaller variant is open-source on Hugging Face.
Arcee AI released Trinity-Large-Thinking, a 400B reasoning model (400 billion total parameters, but only 13B active at a time, so it runs at the speed and cost of a model 30x smaller) built by a 30-person US team for $20M. Scores #2 on the top agentic benchmark (a test measuring how well AI can use tools and complete multi-step tasks), behind only Claude Opus 4.6, at 96% less cost.

Why this matters: Every rung of the compute ladder now has a competitive, fully open model. Google's move to Apache 2.0 removes the last legal friction. The open model ecosystem just crossed a threshold: serious intelligence is now free to own at every scale.

Our take: The question used to be "which AI model should I use?" Now it's "which AI model should I run where?" It's a bit more complicated, but you've never had more available intelligence. This is a good thing, people!!

Now, let's dive into all of that with a bit more detail.

The Gemma 4 story is bigger than it looks

Google releasing an open model isn't new. They've been doing it since the original Gemma in 2024. What's new is the license.

Every previous Gemma came with a restrictive custom license. You could download the model, sure, but the terms included guardrails on commercial use, redistribution, and modification that made enterprise legal teams nervous. If you wanted the safety of a fully permissive license, you picked Meta's Llama, Alibaba's Qwen, or Mistral. Google's models were good but came with fine print.

Gemma 4 ships under Apache 2.0. That's the same license used by Qwen, Mistral, Arcee, and most of the open-weight ecosystem. No custom clauses, no "Harmful Use" carve-outs requiring legal interpretation, no restrictions on redistribution. Hugging Face CEO Clément Delangue called it "a huge milestone." He's right. This single licensing decision matters more than the benchmark numbers.

Here's why: Apache 2.0 removes the last barrier between "I downloaded a model" and "I built a product on it." Enterprises that were waiting for Google to play on the same licensing terms as the rest of the field can now use Gemma in production without a legal review. Startups can fine-tune it and sell the result. Researchers can publish derivative work without checking a terms-of-service FAQ. The friction is gone.

And the models themselves are genuinely impressive. The family spans four sizes:

Gemma 4 31B Dense: Ranked #3 on the Arena AI open-model leaderboard. Outcompetes models 20x its size. 256K token context window (meaning it can process roughly 400 pages of text in a single prompt). Supports text, images, and video.
Gemma 4 26B MoE (Mixture of Experts; a design where the model has 128 specialized sub-networks but only activates 4 at a time, giving you the intelligence of a large model at the speed of a small one): Ranked #6 on the same leaderboard with only 3.8B active parameters. This is Gemma's first MoE, and it's a statement of engineering efficiency.
Gemma 4 E4B and E2B (edge models): These are purpose-built for phones, IoT devices, and Raspberry Pi. The E2B runs in under 1.5GB of memory. Both handle images, video, AND audio natively (the larger models don't do audio). They support 128K token context windows, function calling (the ability to use external tools), and structured JSON output. Google built them in collaboration with Qualcomm and MediaTek for near-zero latency on consumer hardware.
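The "roughly 400 pages" figure above is back-of-envelope math, and it's easy to redo for any context window. Here's a minimal sketch, assuming about 0.75 English words per token and 500 words per page. Both are rules of thumb we're supplying, not numbers from Google, and "K" is treated as 1,000.

```python
# Rule-of-thumb conversion factors. Both are assumptions, not official figures.
WORDS_PER_TOKEN = 0.75
WORDS_PER_PAGE = 500


def tokens_to_pages(context_tokens: int) -> float:
    """Estimate how many pages of English text fit in a context window."""
    words = context_tokens * WORDS_PER_TOKEN
    return words / WORDS_PER_PAGE


# Context sizes quoted in the list above.
for name, tokens in [("31B Dense (256K)", 256_000), ("E2B/E4B (128K)", 128_000)]:
    print(f"{name}: about {tokens_to_pages(tokens):.0f} pages")
```

Dense text like code or tables uses more tokens per word, so expect fewer pages in practice.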

Sebastian Raschka's analysis revealed something interesting: the 31B model's architecture is basically unchanged from Gemma 3 27B. Same attention patterns, same hybrid sliding-window design. The massive benchmark gains came from training data and recipe improvements, not architectural innovation. That's significant because it means the same architectural bones that worked at 27B are now producing dramatically better results at 31B, suggesting Google's data curation and training pipeline have reached a new level of maturity.

For context: Gemma 4 31B scored 89.2% on AIME 2026 (a math competition benchmark). Gemma 3 27B scored roughly 20% on the same test. That's a generational leap from training alone.

Bonsai: What happens when you rethink the math

While Google was scaling up, PrismML was doing the opposite. Their bet: what if you could make a model 14x smaller without making it dumber?

Traditional AI models store each parameter (each "brain cell") using 16 bits of precision, like writing every number out with 16 binary digits. PrismML's insight, built on years of mathematical research at Caltech, is that 1 bit is enough. One bit per parameter. On or off. Plus or minus. That's it.
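To make "plus or minus" concrete, here's a minimal sketch of a generic sign-plus-scale scheme: each weight keeps only its sign, and a group of weights shares one scale factor. This is a textbook-style illustration, not PrismML's actual method (which isn't detailed here), and the function names are made up for the example.

```python
from typing import List, Tuple


def quantize_1bit(weights: List[float]) -> Tuple[List[int], float]:
    """Compress a group of weights to one sign bit each plus one shared scale."""
    # The shared scale is the mean absolute value of the group.
    scale = sum(abs(w) for w in weights) / len(weights)
    # Each weight keeps only its sign: 1 for zero or positive, 0 for negative.
    bits = [1 if w >= 0 else 0 for w in weights]
    return bits, scale


def dequantize_1bit(bits: List[int], scale: float) -> List[float]:
    """Rebuild approximate weights: every value becomes +scale or -scale."""
    return [scale if b == 1 else -scale for b in bits]


weights = [0.5, -1.5, 1.0, -1.0]
bits, scale = quantize_1bit(weights)
restored = dequantize_1bit(bits, scale)
print("bits:", bits)
print("scale:", scale)
print("restored:", restored)
```

The restored weights lose their individual magnitudes, which is why training a model to tolerate this is the hard part.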

The result is Bonsai 8B: an 8-billion-parameter model that takes up just 1.15GB instead of the standard 16GB. It runs at 131 tokens per second on a MacBook Pro and 44 tokens per second on an iPhone 17 Pro Max. It uses 4-5x less energy. And on the standardized tests, it's competitive with full-precision models of the same size, including Meta's Llama 3 8B.
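The size numbers are easy to sanity-check. A quick sketch, using the parameter count and file size quoted above and treating 1GB as one billion bytes (our assumption):

```python
def model_size_gb(num_params: float, bits_per_param: float) -> float:
    """Storage needed for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 8e9        # 8 billion parameters
BONSAI_SIZE_GB = 1.15   # file size quoted above

full_precision_gb = model_size_gb(NUM_PARAMS, 16)
compression_ratio = full_precision_gb / BONSAI_SIZE_GB
# Work backwards from the file size to the average bits stored per parameter.
effective_bits = BONSAI_SIZE_GB * 1e9 * 8 / NUM_PARAMS

print(f"16-bit size: {full_precision_gb:.1f} GB")
print(f"Compression: {compression_ratio:.1f}x")
print(f"Effective bits per parameter: {effective_bits:.2f}")
```

The effective figure comes out slightly above 1 bit per parameter, which suggests the file holds a little more than raw sign bits. What exactly fills the gap is not something the numbers above tell us.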

PrismML also shipped a 4B variant (0.5GB) and a 1.7B variant (0.24GB, running at 130 tokens per second on iPhone). The smallest model fits comfortably in the memory of a smartwatch.

Vinod Khosla, one of the investors, put it this way: "AI's future will not be defined by who can build the largest datacenters." PrismML is the existence proof. If intelligence per bit is the metric that matters (and the energy crisis in AI infrastructure suggests it should be), Bonsai might be pointing at the future more clearly than any frontier model.

The caveat: 1-bit models are new territory. PrismML's benchmarks look strong, but real-world testing is just beginning. And the speed gains today come primarily from reduced memory, not from hardware optimized for 1-bit operations. PrismML says purpose-built silicon could unlock "another order of magnitude" in speed. If that's true, today's 8x speedup becomes 80x.
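Why does less memory mean more speed? When a model generates text, each new token generally means reading the model's weights out of memory, so memory bandwidth puts a ceiling on tokens per second. Here's a rough sketch of that ceiling, assuming every weight is read once per token and a made-up bandwidth of 100 GB/s (an illustrative number, not a measurement of any device):

```python
def max_tokens_per_second(model_size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Idealized ceiling: one full pass over the weights per generated token."""
    return bandwidth_gb_per_s / model_size_gb


HYPOTHETICAL_BANDWIDTH = 100.0  # GB/s, illustrative only

ceiling_16bit = max_tokens_per_second(16.0, HYPOTHETICAL_BANDWIDTH)
ceiling_1bit = max_tokens_per_second(1.15, HYPOTHETICAL_BANDWIDTH)

print(f"16-bit ceiling: {ceiling_16bit:.1f} tokens/s")
print(f"1-bit ceiling:  {ceiling_1bit:.1f} tokens/s")
print(f"Ratio: {ceiling_1bit / ceiling_16bit:.1f}x")
```

Real devices land below this ceiling because compute, caching, and unpacking the bits all take time, so treat the ratio as an upper bound, not a prediction.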

Holo3: The one that actually does things

Gemma 4 talks to you. Bonsai talks to you faster. Holo3 doesn't talk at all. It acts.

H Company's Holo3 is a computer-use agent. You point it at a desktop (web, Windows, Mac) and it navigates the interface, clicks buttons, fills forms, coordinates across applications, and completes multi-step business workflows. Think of it as an invisible intern that never needs to ask where the Print button is.

On the OSWorld-Verified benchmark (the leading test for desktop automation), the full Holo3-122B-A10B model scored 78.85%, beating both GPT-5.4 and Claude Opus 4.6. It did this with just 10B active parameters out of 122B total, at what H Company calls "a fraction of the cost" of those proprietary models.

What makes Holo3 architecturally interesting is the training pipeline. H Company built a "Synthetic Environment Factory" that programmatically generates enterprise software environments (think: fake CRM systems, fake e-commerce sites, fake collaboration tools) and trains the model to navigate them. The model practices on thousands of variations of the same business workflow until it can handle real-world messiness.

They also built H Corporate Benchmarks: 486 multi-step tasks spanning e-commerce, business software, collaboration, and multi-app workflows. At the hard end, tasks require coordinating information across multiple applications simultaneously (retrieving prices from a PDF, cross-referencing against employee budgets in a spreadsheet, then sending personalized emails through an email client). That kind of task demands sustained reasoning across applications without losing state. Holo3 does it.

The open-source variant, Holo3-35B-A3B (Apache 2.0, free on Hugging Face), still achieves 77.8% on OSWorld-Verified with only 3B active parameters. That's a startling result: a model small enough to run locally beating most competitors that require massive GPU clusters.

Trinity: The 30-person lab that bet the company

Arcee AI's story reads like a startup pitch that actually worked. A team of 30 people, operating on less than $50M in total capital, decided to pre-train a 400-billion-parameter model from scratch on 2,048 NVIDIA B300 GPUs. The training took 33 days and cost roughly $20M. CEO Mark McQuade has described it as a "back the company" bet.

The result, Trinity-Large-Thinking, scores #2 on PinchBench (a benchmark from Kilo Code measuring model capability on agentic tasks like code editing, tool use, and multi-step reasoning), behind only Claude Opus 4.6. The difference: Opus 4.6 costs roughly $25/M output tokens on the Anthropic API. Trinity costs $0.90/M. That's about 96% cheaper for comparable agentic performance.
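The percentage is simple arithmetic, and it's worth rerunning with your own volumes. A minimal sketch; the function names and the 1-billion-token workload are ours, purely for illustration:

```python
def cost_usd(output_tokens: int, price_per_million: float) -> float:
    """Dollar cost of generating `output_tokens` at a per-million-token price."""
    return output_tokens / 1_000_000 * price_per_million


def savings_percent(baseline_price: float, candidate_price: float) -> float:
    """How much cheaper the candidate is than the baseline, in percent."""
    return (1 - candidate_price / baseline_price) * 100


# Hypothetical workload: 1 billion output tokens a month at Trinity's quoted $0.90/M.
trinity_monthly = cost_usd(1_000_000_000, 0.90)
print(f"Trinity: ${trinity_monthly:,.2f} per month")

# Round illustrative prices (not real quotes): $10/M versus $1/M.
print(f"Savings: {savings_percent(10.0, 1.0):.0f}%")
```

Swap in the two per-million prices quoted above to check the headline percentage yourself.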

Trinity uses a 4-of-256 Mixture-of-Experts architecture (256 specialized sub-networks, only 4 active per token), achieving extreme sparsity. Only 1.56% of the experts (4 of 256) fire for any given token; once you count the parts of the network every token passes through, that works out to about 13B of the 400B total parameters, or roughly 3%. The practical result: a model that "knows" as much as a 400B model but processes each request at the speed and cost of a 13B model.
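Here's what "only 4 active per token" looks like mechanically. This is a minimal sketch of generic top-k expert routing, not Arcee's implementation. The router scores are stand-in values, and the function name is ours.

```python
import math
from typing import List, Tuple


def route_top_k(router_scores: List[float], k: int) -> List[Tuple[int, float]]:
    """Pick the k highest-scoring experts and softmax-normalize their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    top = ranked[:k]
    # Subtract the max score before exponentiating for numerical stability.
    max_score = max(router_scores[i] for i in top)
    exps = [math.exp(router_scores[i] - max_score) for i in top]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(top, exps)]


NUM_EXPERTS = 256
ACTIVE_EXPERTS = 4

# Stand-in scores; in a real model a small learned router computes these per token.
scores = [math.sin(i) for i in range(NUM_EXPERTS)]
chosen = route_top_k(scores, ACTIVE_EXPERTS)
print("Experts used for this token:", [i for i, _ in chosen])

# Two different ratios, both built from the numbers quoted above.
expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS   # share of experts selected
param_fraction = 13e9 / 400e9                    # share of parameters active
print(f"Experts active: {expert_fraction:.2%}")
print(f"Parameters active: {param_fraction:.2%}")
```

The two ratios measure different things: the first counts experts, the second counts parameters, including layers that run for every token regardless of routing.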

The model found product-market fit almost immediately. Trinity-Large-Preview (the earlier instruct version) became the #1 most-used open model on OpenRouter in the US within two months, serving 3.37 trillion tokens. Trinity-Large-Thinking, the new reasoning variant, adds explicit chain-of-thought reasoning and better multi-turn tool calling for long-running agent loops.

Will Brown, research lead at Prime Intellect, called it "the best American open-source model ever" and noted it was built "with no ex-big-lab employees and no Claude distillation."

The real story: the economics just inverted

If you zoom out from any individual model, the pattern is clear. One year ago, the AI landscape had a clean hierarchy: proprietary models at the top (GPT-5, Claude Opus, Gemini), open models somewhere below, and on-device models at the bottom. Each step down the ladder meant a meaningful sacrifice in quality.

That hierarchy collapsed this week.

Google's Gemma 4 31B matches or beats models 20x its size on key benchmarks. Holo3 outperforms GPT-5.4 on desktop automation. Trinity-Large-Thinking trails only Claude Opus 4.6 on agentic tasks, at 96% less cost. Bonsai delivers Llama 3 8B-class intelligence in 1/14th the space.

The practical implication: the cost of running "good enough" AI just dropped by an order of magnitude at every scale. A phone can now run a model that would have required a cloud API six months ago. A small team can deploy an agent fleet that would have cost a Fortune 500 company's AI budget six months ago. A robotics startup can run inference on-device instead of designing around network latency.

And every one of these releases includes open weights under Apache 2.0 (for Holo3, that's the 35B variant; the 122B flagship is API-only). You own it. You can fine-tune it on your data. You can sell products built on it. No vendor lock-in, no per-token pricing, no terms-of-service changes at 3 AM.

The teams that built them tell the second part of the story. Google is the world's largest AI lab. Arcee has 30 employees. PrismML is a Caltech spinoff that just emerged from stealth. H Company is a French startup. The fact that all four independently shipped competitive frontier models in the same week, using radically different architectures (dense, MoE, 1-bit, agentic flywheel), suggests that the moat around proprietary AI is narrower than most investors believe.

Get the models

Every release mentioned in this article has a free download (the one exception, Holo3's 122B flagship, is API-only). Here's where to find them:

Google Gemma 4 (all four sizes: 31B, 26B MoE, E4B, E2B)

Try instantly in browser: Google AI Studio (no download, free)
Desktop app: LM Studio (search "Gemma 4," click download)
One-line install: Ollama (ollama run gemma4)
Fine-tune locally: Unsloth Studio (run + fine-tune all sizes, web UI)
Android phone: Google AI Edge Gallery (E2B and E4B, fully offline)
In browser (no install): Gemma 4 WebGPU
NVIDIA RTX: NVIDIA AI tools for Gemma 4
Weights: Hugging Face | Kaggle
Google's announcement blog
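If you go the Ollama route, the model is also reachable from code once it's running. Here's a minimal sketch using Ollama's local REST endpoint, assuming the default port (11434) and that the model tag matches the gemma4 name in the command above. The helper names are ours.

```python
import json
import urllib.request

# Ollama's default local endpoint; adjust if you changed the port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(prompt: str, model: str = "gemma4") -> urllib.request.Request:
    """Build a non-streaming generate request for a locally running Ollama."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "gemma4") -> str:
    """Send the prompt and return the generated text (requires Ollama running)."""
    with urllib.request.urlopen(build_request(prompt, model)) as resp:
        return json.loads(resp.read())["response"]


# Example, once the model has been pulled and Ollama is running:
#   print(generate("Explain Apache 2.0 in one sentence."))
```

Nothing here leaves your machine: the request goes to localhost, not a cloud API.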

PrismML Bonsai (8B, 4B, 1.7B; all 1-bit)

Weights (Apple Silicon / MLX): Hugging Face MLX
Weights (NVIDIA / GGUF): Hugging Face GGUF
Demo + setup scripts: GitHub (Bonsai-demo)
iPhone: Partnered with Locally AI for native iOS support
Note: Requires PrismML's custom 1-bit kernels (forked llama.cpp and MLX); not yet in upstream Ollama or LM Studio. Use the demo repo for the fastest setup.
PrismML announcement | Whitepaper

H Company Holo3 (122B flagship + 35B open variant)

Open weights (35B-A3B, Apache 2.0): Hugging Face
API access (122B flagship, free tier available): H Company Inference API
Note: The 122B model that set the state-of-the-art score is API-only. The 35B-A3B is the open-weight variant (still scores 77.8% on OSWorld-Verified).
H Company announcement

Arcee Trinity-Large-Thinking (400B MoE, 13B active)

API (cheapest frontier reasoning): OpenRouter ($0.90/M output tokens)
API (Preview, free): OpenRouter free tier
Weights: Hugging Face (Apache 2.0)
Managed hosting: DigitalOcean Agentic Inference Cloud (public preview)
Self-host: vLLM 0.11.1+ or SGLang (see HF model card for setup)
Arcee announcement | Technical report

What this means for you

If you've been putting off experimenting with local AI because "the models aren't good enough yet," that excuse evaporated this week.

If you're non-technical: Download LM Studio (free, works on Mac/Windows/Linux), search for Gemma 4, click download, and start chatting. Everything runs on your machine. Or just go to Google AI Studio in your browser and try it instantly, no download needed.
If you're a developer: The 26B MoE model is the sweet spot for most local development. It activates only 3.8B parameters per token, so it runs fast on a single GPU while delivering 26B-class intelligence. Ollama, llama.cpp, vLLM, and LM Studio all support it on day one.
If you run a company: The licensing question is settled. Apache 2.0 means your legal team can green-light Gemma, Trinity, Bonsai, or Holo3 without a custom review. If you're building agent workflows, Trinity at $0.90/M output tokens changes the unit economics of deploying AI at scale.
If you care about privacy: Bonsai 8B fits in 1.15GB on your phone. No data ever leaves your device. No API key. No subscription. No cloud. Running locally isn't much of a compromise at this point. It's competitive AI running offline.

The question used to be "which AI model should I use?" This week, it became "which AI model should I run where?"

The answer is: all of them. Everywhere. We have abundant intelligence. And it will only get smarter from here. The question is wtf do we do with it all?