What it means that new AIs can “reason”

In this photo illustration, the logo of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improvise their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.

This story was first featured in the Future Perfect newsletter.

One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoved a microphone in your face and demanded an immediate answer with how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
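
To make the difference concrete, here is a minimal sketch in Python of the two kinds of prompts. The function names, the prompt wording, and the “Final answer:” convention are illustrative assumptions, not OpenAI’s method or any particular library’s API, and the sample reply is hand-written for illustration rather than real model output.

```python
from typing import Optional

# Hypothetical marker we ask the model to use; any unambiguous phrase would do.
FINAL_MARKER = "Final answer:"


def direct_prompt(question: str) -> str:
    """Ask for an answer immediately, with no room for visible reasoning."""
    return f"Question: {question}\nAnswer:"


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to write out its reasoning before committing to an answer."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, writing out each step.\n"
        "Only after the reasoning is complete, give the result on a last line "
        f"that starts with '{FINAL_MARKER}'."
    )


def extract_final_answer(model_output: str) -> Optional[str]:
    """Return the text after the last marker, or None if the marker is missing."""
    index = model_output.rfind(FINAL_MARKER)
    if index == -1:
        return None
    return model_output[index + len(FINAL_MARKER):].strip()


if __name__ == "__main__":
    question = "How many 'r's are there in 'strawberry'?"
    print(direct_prompt(question))
    print(chain_of_thought_prompt(question))

    # Hand-written example of what a step-by-step reply might look like.
    sample_reply = (
        "Spell it out: s, t, r, a, w, b, e, r, r, y.\n"
        "The letter r appears at positions 3, 8 and 9.\n"
        "Final answer: 3"
    )
    print(extract_final_answer(sample_reply))
```

Either prompt string could be sent to whatever chat model you have access to. The only design choice here is to ask for the answer last, so the reasoning is already on the page before the model commits to a conclusion.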

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for capabilities related to chemical, biological, radiological, and nuclear weapons, the kind of help that would be most sought after by terrorist groups that lack the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) Even so, it’s too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But as ChatGPT demonstrates, incremental technical progress doesn’t mean incremental societal impact: ChatGPT was only a moderate improvement over OpenAI’s previous chatbots, yet it reached hundreds of millions of people overnight. Sometimes the grind of improvements to various parts of how an LLM operates, or improvements to its UI so that more people will try it (like the chatbot itself), pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what its model can do. I’m grateful that the company is making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.

Read full article on: vox.com