What it means that new AIs can “reason”

In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improv their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.

This story was first featured in the Future Perfect newsletter.

One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
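
To make the contrast concrete, here is a minimal sketch of the difference between asking for an immediate answer and asking for chain-of-thought reasoning. It only builds the prompt text and reads a final answer back out of a reply; it does not call any model. The function names, the prompt wording, and the "Answer:" marker are all illustrative assumptions, not anything OpenAI or another vendor prescribes.

```python
from typing import Optional

# Hypothetical marker we ask the model to put in front of its final answer.
ANSWER_PREFIX = "Answer:"


def direct_prompt(question: str) -> str:
    """Ask for an immediate answer: the 'microphone in your face' case."""
    return f"Question: {question}\nReply with only the final answer."


def chain_of_thought_prompt(question: str) -> str:
    """Ask the model to lay out its reasoning before committing to an answer."""
    return (
        f"Question: {question}\n"
        "Work through the problem step by step, writing out your reasoning.\n"
        "Only then give your final answer on its own line, "
        f"starting with '{ANSWER_PREFIX}'."
    )


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if there is none."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith(ANSWER_PREFIX):
            return stripped[len(ANSWER_PREFIX):].strip()
    return None


if __name__ == "__main__":
    print(chain_of_thought_prompt("How many 'r's are there in 'strawberry'?"))
```

The second prompt is the "compose a draft, then publish" version: the reasoning comes first, and only the line after the marker is treated as the answer.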

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests whether its models can assist with chemical, biological, radiological, and nuclear weapons, the capabilities most sought after by terrorist groups that lack the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But as ChatGPT — which itself was only a moderate improvement over OpenAI’s previous chatbots but which reached hundreds of millions of people overnight — demonstrates, technical progress being incremental doesn’t mean societal impact is incremental. Sometimes the grind of improvements to various parts of how an LLM operates — or improvements to its UI so that more people will try it, like the chatbot itself — push us across the threshold from “party trick” to “essential tool.” 

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what its model can do. I’m grateful that the company is making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.



Read full article on: vox.com