
What it means that new AIs can “reason”

In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improv their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.

This story was first featured in the Future Perfect newsletter.

One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.
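
To make the idea concrete, here is a minimal sketch of what chain-of-thought prompting can look like in code. It only builds the prompt text and parses a reply; it does not call any real model. The function names and the "Final answer:" marker are assumptions made up for this illustration, not part of any vendor's API.

```python
from typing import Optional


def direct_prompt(question: str) -> str:
    """Ask for the answer right away, with no visible reasoning."""
    return f"Question: {question}\nReply with only the final answer."


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to lay out its reasoning before it commits to an answer."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, writing out each step.\n"
        "Only after the reasoning is complete, give the result on a new line "
        "that starts with 'Final answer:'."
    )


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' marker, or None if absent.

    The marker is a convention assumed by this sketch, not a standard.
    """
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.lower().startswith("final answer:"):
            return stripped.split(":", 1)[1].strip()
    return None


if __name__ == "__main__":
    print(chain_of_thought_prompt("What is 17 * 6?"))
    # A hand-written reply standing in for real model output.
    reply = "17 * 6 = 17 * 5 + 17 = 85 + 17 = 102\nFinal answer: 102"
    print(extract_final_answer(reply))
```

The only difference between the two prompts is the instruction to reason first; the parsing step simply separates the visible reasoning from the answer a reader actually wants.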

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for their capabilities with chemical, biological, radiological, and nuclear weapons, the abilities that would be most sought-after by terrorist groups that don’t have the expertise to build them with current technology. 

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.
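
One way to resist cherry-picking is to choose a set of tasks in advance and report the score on all of them, not just the memorable hit or miss. The sketch below illustrates that habit under loud assumptions: the three tasks are invented for this example, and `toy_model` is a hypothetical stand-in for a real model call that deliberately miscounts the letters in "strawberry."

```python
from typing import Callable, List, Tuple

# A fixed task list chosen before looking at any model output.
# Expected answers are computed by ordinary code where possible.
TASKS: List[Tuple[str, str]] = [
    ("How many times does the letter r appear in 'strawberry'?",
     str("strawberry".count("r"))),
    ("What is 13 + 83?", str(13 + 83)),
    ("What is 3 * 4?", str(3 * 4)),
]


def toy_model(question: str) -> str:
    """Hypothetical stand-in for a real model; not an actual API."""
    if "strawberry" in question:
        return "2"  # deliberately wrong, mimicking the well-known failure
    if "13 + 83" in question:
        return "96"
    return "12"  # its answer to every other question, including 3 * 4


def accuracy(model: Callable[[str], str], tasks: List[Tuple[str, str]]) -> float:
    """Fraction of tasks where the model's reply exactly matches the expected answer."""
    correct = sum(1 for question, expected in tasks
                  if model(question).strip() == expected)
    return correct / len(tasks)


if __name__ == "__main__":
    for question, expected in TASKS:
        print(question, "| expected:", expected, "| got:", toy_model(question))
    print("overall accuracy:", accuracy(toy_model, TASKS))
```

A single screenshot of the strawberry miss would say "unimpressive," and a single screenshot of the arithmetic would say "impressive"; the overall score over a pre-chosen list is closer to the big picture, though three toy tasks are of course far too few to measure anything real.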

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But as ChatGPT demonstrates, incremental technical progress doesn’t mean incremental societal impact: ChatGPT was only a moderate improvement over OpenAI’s previous chatbots, yet it reached hundreds of millions of people overnight. Sometimes the grind of improvements to various parts of how an LLM operates (or improvements to its UI so that more people will try it, like the chatbot itself) pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what its model can do. I’m grateful that the company is making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.

Read full article on: vox.com