Llama 4 is here—but who is it actually for?

Meta dropped its highly anticipated Llama 4 models over the weekend, and let's just say the rollout was… messy.

Catch me up: wtf is Llama? AI models are like giant digital brains trained on tons of internet text, right? Companies like OpenAI (ChatGPT) and Google (Gemini) have their own powerful (but private) versions, and Llama is Meta's version—with Llama 4 being their newest, biggest, and supposedly best generation yet.

What makes Llama special is that Meta often releases the “blueprints” (called open weights) to its models so researchers and other companies can use and build upon them.

Meta just released 2 of its 4 planned models, Scout and Maverick, the smaller members of the family (ahead of “Behemoth”, the largest, which is coming soon).

The weekend launch, however, caught cloud providers off guard and lacked the usual detailed research paper, leaving many in the AI community scratching their heads.

Here’s what happened:

  • Dropping a major release on a Saturday sparked immediate suspicion—was Meta trying to bury bad news, or front-run competitors?
  • Maverick hit #2 on the popular LM Arena leaderboard, but researcher Nathan Lambert pointed out Meta used an unreleased experimental chat version for that ranking, not the public model (similar shenanigans happened with Grok).
  • Real-world tests showed Maverick bombing coding tasks, scoring just 16% on the aider polyglot benchmark—far worse than older, smaller models.
  • Rumors flew (like this unverified claim) that Meta “trained on the test set” to boost scores, but Meta VP Ahmad Al-Dahle categorically denied this on X, blaming the mixed results on implementation bugs needing a week to stabilize.

Beyond the benchmarks, the practical hurdles are real:

  • Forget running these on your gaming PC. Even quantized, they're huge. Running the smallest model (Scout) requires serious hardware: think 96GB of RAM just for a 4-bit version on a Mac, according to tests run by Simon Willison (see the back-of-the-envelope math after this list).
  • Scout supposedly has a 10M token context window (roughly ~7.5M words), but was likely trained on much smaller segments (around 256k tokens, discussed here), raising doubts about its actual long-range capabilities.
  • Want to access that full context window? You can do it with a dedicated deployment, but expect it to cost ~2-4 times more than limited-context ones.
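
For a feel for those numbers, here's a quick back-of-the-envelope sketch (our own math, not Meta's): weight memory is roughly parameter count times bits per parameter, and the word count uses the common ~0.75 words-per-token rule of thumb.

```python
# Back-of-the-envelope memory math for Llama 4 Scout (109B total parameters).
# Mixture-of-experts caveat: ALL experts must sit in memory, even though only
# a fraction of the parameters is active for any given token.

SCOUT_TOTAL_PARAMS = 109e9

def weight_memory_gb(params: float, bits_per_param: float) -> float:
    """Approximate memory needed just to hold the model weights."""
    return params * bits_per_param / 8 / 1e9

print(f"16-bit weights: ~{weight_memory_gb(SCOUT_TOTAL_PARAMS, 16):.0f} GB")  # ~218 GB
print(f" 4-bit weights: ~{weight_memory_gb(SCOUT_TOTAL_PARAMS, 4):.0f} GB")   # ~55 GB

# Quantization metadata, the KV cache, and runtime overhead all stack on top
# of the raw weights, which is how a 4-bit Scout still lands near 96GB of RAM.

# Tokens-to-words rule of thumb (~0.75 English words per token):
print(f"10M tokens ~= {10_000_000 * 0.75 / 1e6:.1f}M words")  # ~7.5M words
```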

Nir Gazit, CEO and cofounder of Traceloop (an ex-Google ML manager now building a company that does observability for agents), had a pretty entertaining take:

"Llama 4 just dropped. The specs are insane. But here's why you shouldn't drop everything and switch to it. So it has the biggest context window and it has 70 billion active parameters, 16 experts and 109 billion total parameters. It's crazy. It's the best specs we've seen for an open source model.

Every new model breaks the records set by the previous one. But that's exactly the problem. These providers are optimizing for benchmark metrics. And when you optimize for a metric—it stops being useful.

Take the context window. 10 million tokens? Sounds amazing. You could fit your life story in there. But in practice? These models rarely use the full context effectively.

They still rely on architectures built around 'Attention Is All You Need.' These models are not really good at using a big context window. So usually when you use these models, you see them just take into account this small part of that context window, usually the last part that you've sent to that model."

Philip Kiely of Baseten said essentially the same thing:

"Large context windows are useful for a certain subset of tasks, like processing entire codebases. However, there are tradeoffs around performance and cost. Supporting the full context window requires allocating much more hardware than would ordinarily be used to run a model of this size. That's why many shared endpoint providers are running the model with a fraction of the total context window, as most requests will only use hundreds or thousands of tokens rather than millions. With a dedicated deployment, you can scale up to the full context window if you allocate enough hardware resources."

So even though Scout advertises a 10M token context window... you probably don't need (or want) to use all of it, realistically speaking.

Despite all the drama, Llama 4 is here, and you can play with it right now. You can access Scout and Maverick via API or chat interfaces through a number of inference providers.

Just remember those providers might initially cap the context window lower than the advertised maximum.
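
If you want to poke at it from code, most inference providers expose an OpenAI-compatible endpoint, so a minimal sketch looks something like this (the base URL and model id are placeholders, not any specific provider's values):

```python
# Minimal sketch of calling a Llama 4 model through an OpenAI-compatible
# inference provider. base_url and the model id are PLACEHOLDERS; check
# your provider's docs for the exact values.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.your-provider.example/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",
)

response = client.chat.completions.create(
    model="meta-llama/llama-4-scout",  # model ids vary by provider
    messages=[{"role": "user", "content": "Summarize the Llama 4 launch in one sentence."}],
)
print(response.choices[0].message.content)
```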

So, who is Llama 4 for right now? It seems Meta is targeting three main groups:

  1. Cloud users seeking more efficient (cheaper) dedicated AI than competitors like DeepSeek (once optimized).
  2. Companies (with under 700M users) wanting to self-host powerful models on their own beefy servers.
  3. Existing Llama 3 users in the above categories looking for an upgrade (once things stabilize).

Who it’s NOT for? The average AI enthusiast running models on their home computer.

Our take: Growing pains, am I right? It’s a tale of two llamas: On the one hand, reports of Llama 4’s death are greatly exaggerated. On the other hand, we can hear the Redditors of r/LocalLLaMA chanting the classic Edwin Starr protest song: “Llama 4! What is it good for? Absolutely NOTHING…”

Both of these takes are extremes. While Llama 4 might become a powerful option for enterprise and dedicated cloud users after optimization, it needs time. Meta has to iron out the kinks, and maybe release smaller, more accessible versions like it did with Llama 3 (we're hoping!).

The question is: does Meta actually have time? DeepSeek R2 is right around the corner, new open Chinese models are dropping all the time, and the stock market is in free fall.

With this rocky launch, Meta's bet on sheer scale might be hitting a wall just as nimbler competitors are finding smarter ways forward—and right when investors are demanding practical results (not just bigger models) to boot.


See you cool cats on X!
