What it means that new AIs can “reason”
An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improv their answer sentence by sentence.
This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.
One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.
Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
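In practice, chain-of-thought prompting can be as simple as appending an instruction to the question before sending it to the model. Here is a minimal sketch in Python; `ask_model` is a hypothetical placeholder for whatever LLM API you use, and the wording of the instruction is just one illustrative choice:

```python
# Sketch of chain-of-thought prompting: wrap a question with an
# instruction telling the model to reason step by step before answering.

COT_INSTRUCTION = (
    "Think through this step by step, laying out each piece of your "
    "reasoning, and only state the final answer at the end."
)

def with_chain_of_thought(question: str) -> str:
    """Return a prompt that asks the model to 'show its work' first."""
    return f"{question}\n\n{COT_INSTRUCTION}"

def ask_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LLM API call."""
    raise NotImplementedError

prompt = with_chain_of_thought(
    "A bat and a ball cost $1.10 together. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)
print(prompt)
```

The interesting part is not the code but the effect: the same model, given the same question, tends to answer more accurately when the prompt forces it to produce its reasoning before its conclusion.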
The power of think, then answer
OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in.
Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”
This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for their capabilities with chemical, biological, radiological, and nuclear weapons, the abilities that would be most sought-after by terrorist groups that don’t have the expertise to build them with current technology.
As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.”
These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.
If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the potential casualties would be catastrophic. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.
The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.
How to (and how not to) evaluate an AI
Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product.
Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated”: AIs score at or above human level on them, so the tests can no longer distinguish further gains in skill.
I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where they excel or where they embarrass themselves, instead of looking at the big picture.
The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model.
Sometimes, big things can come from small improvements
In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far.
But as ChatGPT — which itself was only a moderate improvement over OpenAI’s previous chatbots but which reached hundreds of millions of people overnight — demonstrates, technical progress being incremental doesn’t mean societal impact is incremental. Sometimes the grind of improvements to various parts of how an LLM operates — or improvements to its UI so that more people will try it, as with ChatGPT itself — pushes us across the threshold from “party trick” to “essential tool.”
And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what the model can do. I’m grateful that they’re making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.
A version of this story originally appeared in the Future Perfect newsletter. Sign up here!