😸 12 mind-blowing o1 demos

PLUS: Why language models will "always" hallucinate

Welcome, humans.

Exciting news for anyone who wants to get their AI product or service (or AI-adjacent product or service) in front of 450,000+ daily Neuron readers: we have sponsorship slots available in Q4!

The world’s leading AI companies advertise with The Neuron (including today’s sponsor, Adobe) because of the trust we’ve built with our audience.

Just fill out this form with a few deets, and we’ll reach out to follow up.

Here’s what you need to know about AI today:

We share 12 wild o1 demos and 4 prompts where o1 beat Claude 3.5 Sonnet.
Microsoft created a new playground to test AI agents on Windows OS.
Alibaba will release its new Qwen 2.5 language model this Thursday.
World Labs raised $230M to build a 3D AI model of the world.

12 wild o1 demos (and some cool prompts to try, too).

Anyone else spend all weekend messing around with o1? Nerds! We’re guilty too…

So many people hit the rate limits by Friday that OpenAI had to reset them to keep up the momentum.

Here are some WILD o1 demos we discovered:

Generated an animated solar system (in 5 prompts).
Wrote a poem with incredibly strict rules.
Solved complex crossword clues and intelligence tests.
Created fractal art using JavaScript and Adobe Firefly.
Completed white-collar tasks, such as estimating how many people in China have an annual disposable income over 100K yuan.
Achieved first place on the Norway Mensa IQ test (out of all the models tested). And if you’re wondering, it earned an IQ of ~120.

It’s also very good at making games, like…

This Pocket Tanks-style game.
This Asteroids game.
This very basic Galaga clone.
This Flappy Bird knockoff.
Whatever this game is.
This “2048” game with a playable demo (be warned, it’s very addictive!!).

Here’s a list of prompts people use on a daily basis (where o1 specifically outperformed Claude 3.5 Sonnet):

Build a reusable template for Mac applications (example in GPT, example in Perplexity).
Identify security flaws in your code (prompt).
Explain a topic with examples (prompt).
Analyze historical stock variances between the S&P 500 and Berkshire Hathaway (prompt).

The expert opinion: While o1 can do amazing stuff, its improvement over 4o has been compared to the difference between a “completely incompetent” grad student and a “mediocre, but not completely incompetent” grad student.

Ethan Mollick says that to really know how useful o1 is versus other chatbots, we’ll need lots of analysis from experts in fields that require deep expertise. For example, here’s an expert analysis comparing o1 to 4o for medical admin work.

Two things you shouldn’t do with o1:

Get help solving NYT Connections (contrary to what was previously reported).
Ask it for the meaning of life.

In all seriousness, o1 is not for most things. o1 is a model you use for complicated stuff. For simpler things, a simpler model will suffice—o1 will just overthink it.

FROM OUR PARTNERS

Adobe Firefly Video Model is your new AI sidekick for video editing awesomeness.

Adobe just offered an awesome sneak peek at its upcoming Adobe Firefly Video Model!

Available later this year, Adobe’s new Firefly model will help editors:

Ideate and explore their creative vision.
Fill gaps in their timeline.
Add new elements to existing footage.

It includes Text to Video AND Image to Video capabilities, so you can use reference images to generate B-Roll.

There are other cool tools too, like Generative Extend, which lets you:

Cover gaps in footage.
Smooth out transitions.
Hold on shots longer (for that perfectly timed edit).

And just like other Firefly generative AI models, Adobe Firefly Video will be commercially safe: it’s only trained on content Adobe has permission to use (a.k.a. NOT user data).

See Adobe Firefly Video Model in action here.

Around the Horn.

Podcast-style recap of why language models will “always” hallucinate, and will never be able to guarantee 100% accuracy (full paper here).

A new version of Alibaba’s Qwen large language models, Qwen 2.5, will be released this Thursday, September 19th.
World Labs, the spatial intelligence company from Fei-Fei Li (the “godmother of AI”), is building a “large world model” to perceive, generate, and interact with the 3D world (and it raised $230M).
Microsoft created a new testing ground, called the Windows Agent Arena, to test AI agents in realistic Windows OS environments—check it out here.
Apple will use a new framework called UI-JEPA to understand user intentions and process AI on-device without requiring a ton of computation.

Treats To Try.

*Dell's Precision AI-ready workstations powered by NVIDIA RTX™ GPUs are the most efficient way to handle demanding AI workloads without costing a fortune. Check them out for yourself here.
Google Illuminate turns any content into AI-generated audio discussions (in a podcast format, waitlist only right now).
DepthCrafter captures video depth for “long open-world videos” (code here).
Mneme AI is a personal assistant to chat with your stored notes, documents, and books on your phone.
AIPhone is a cross-language calling app that translates calls in audio and text across 91 languages and dialects.
Cavela connects brands with global manufacturers to streamline product sourcing and production.
AFFiNE AI helps you write, draw, and create presentations.

See our top 51 AI Tools for Business here!

*This is sponsored content. Advertise in The Neuron here.

Sunday Special

We’re testing out a new section called the Sunday Special. It’ll have a rotating theme, featuring whatever cool stuff we found during the week that didn’t fit in the usual Monday-Friday format.

Today, we’re featuring more intelligent insights: we found a ton of timely ones this week that were worth highlighting but couldn’t all fit in Friday’s letter.

Great insight from Andrej Karpathy on why “large language model” isn’t the right name for LLMs.
Gartner’s chief of research for AI thinks we’re in the “brute-force” era of AI, and once it ends, generative AI will be needed for only 5% of use cases. Instead, he thinks the future is “composite AI” that uses genAI along with machine learning, knowledge graphs, or rule-based systems.
o1’s performance has led many people to argue that general practice doctors are the most at risk of being replaced by AI.
Great thread with theories on why, if language models are so useful, we haven’t seen any spike in productivity.
Good insight into why AI’s best use case is improving the products we already own.
Read this amazing breakdown of how o1 performs on the ARC Prize, a competition to test for AGI.