What it means that new AIs can “reason”

In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improvise their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.

This story was first featured in the Future Perfect newsletter.

One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
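To make the difference concrete, here is a minimal sketch of the two prompting styles in Python. The function names, the exact prompt wording, and the “Final answer:” convention are illustrative assumptions, not anything OpenAI or the article prescribes, and the sample response is hand-written rather than real model output.

```python
import re
from typing import Optional


def build_direct_prompt(question: str) -> str:
    """Ask for an immediate answer, leaving no room to reason first."""
    return f"Question: {question}\nReply with only the final answer."


def build_cot_prompt(question: str) -> str:
    """Ask the model to lay out its reasoning before committing to an answer.

    The wording and the 'Final answer:' marker are assumptions made for
    this sketch; any consistent convention would work.
    """
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, showing your work.\n"
        "Only after the reasoning is complete, write the result on its own "
        "line in the form 'Final answer: <answer>'."
    )


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Final answer:' line, or None if absent."""
    matches = re.findall(
        r"^Final answer:\s*(.+?)\s*$",
        response,
        flags=re.MULTILINE | re.IGNORECASE,
    )
    return matches[-1] if matches else None


if __name__ == "__main__":
    question = "How many 'r's are there in 'strawberry'?"
    print(build_direct_prompt(question))
    print(build_cot_prompt(question))

    # Hand-written stand-in for a model response, used only to show parsing.
    sample_response = (
        "'straw' contains one r. 'berry' contains two r's.\n"
        "Final answer: 3"
    )
    print(extract_final_answer(sample_response))
```

Either prompt string would be sent to whatever model you are using; only the second gives the model space to reason before it commits to an answer.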

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for capabilities related to chemical, biological, radiological, and nuclear weapons, the kind of help that would be most sought after by terrorist groups that don’t have the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.
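A rough way to picture saturation, using the description above (a benchmark stops being informative once AI matches human performance): the sketch below measures how much room is left between a model’s score and a human baseline. The function names, the one-point tolerance, and every score are made up for illustration; none of them come from a real benchmark.

```python
def headroom(model_score: float, human_baseline: float) -> float:
    """Percentage points left before the model matches the human baseline."""
    return max(0.0, human_baseline - model_score)


def is_saturated(
    model_score: float, human_baseline: float, tolerance: float = 1.0
) -> bool:
    """Treat a benchmark as saturated once the remaining headroom is negligible.

    The default tolerance of one percentage point is an arbitrary
    assumption made for this sketch.
    """
    return headroom(model_score, human_baseline) <= tolerance


if __name__ == "__main__":
    human_baseline = 90.0  # hypothetical human score on a hypothetical test

    # Hypothetical scores for three successive model releases.
    for name, score in [("older", 60.0), ("newer", 90.0), ("newest", 95.0)]:
        print(name, headroom(score, human_baseline), is_saturated(score, human_baseline))
```

By this measure, the two later hypothetical releases both show zero headroom, so the test can no longer say whether one is a meaningful step beyond the other.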

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But ChatGPT, which was itself only a moderate improvement over OpenAI’s previous chatbots yet reached hundreds of millions of people overnight, demonstrates that incremental technical progress doesn’t mean incremental societal impact. Sometimes the grind of improvements to various parts of how an LLM operates, or improvements to its UI so that more people will try it (like the chatbot itself), pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including by collaborating with external organizations to check what the model can do. I’m grateful that the company is making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.

Read full article on: vox.com