What it means that new AIs can “reason”

In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improv their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.

This story was first featured in the Future Perfect newsletter.


One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
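To make the difference concrete, here is a minimal sketch in Python of the two kinds of prompt. The function names, the prompt wording, and the "Answer:" convention are illustrative assumptions, not anything OpenAI has published, and the code only builds and parses text; it does not call any model.

```python
from typing import Optional


def direct_prompt(question: str) -> str:
    """Hypothetical 'microphone in your face' prompt: answer immediately."""
    return f"Question: {question}\nReply with only the final result."


def chain_of_thought_prompt(question: str) -> str:
    """Hypothetical chain-of-thought prompt: reason first, answer last."""
    return (
        f"Question: {question}\n"
        "Think through the problem step by step, writing out each step.\n"
        "Only after the reasoning is complete, give the result on a final "
        "line that starts with 'Answer:'."
    )


def extract_answer(response: str) -> Optional[str]:
    """Return the text after the last 'Answer:' line, or None if absent."""
    for line in reversed(response.splitlines()):
        stripped = line.strip()
        if stripped.startswith("Answer:"):
            return stripped[len("Answer:"):].strip()
    return None


if __name__ == "__main__":
    print(chain_of_thought_prompt("How many 'r's are there in 'strawberry'?"))
```

The second prompt gives the model room to draft before it commits, and `extract_answer` then discards the visible reasoning and keeps only the final line.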

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for capabilities related to chemical, biological, radiological, and nuclear weapons, the abilities that would be most sought after by terrorist groups that lack the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But ChatGPT, which was itself only a moderate improvement over OpenAI’s previous chatbots yet reached hundreds of millions of people overnight, demonstrates that incremental technical progress doesn’t mean incremental societal impact. Sometimes the grind of improvements to various parts of how an LLM operates, or improvements to its UI so that more people will try it (like the chatbot itself), pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what its model can do. I’m grateful that the company is making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.



Read full article on: vox.com