“Computer that can’t do math”: Experts slam AI’s calculation capabilities after Apple findings reveal flaws even in advanced models

A trillion dollars, years of development, and countless experts working on it, and still the computer cannot do the math. That blunt assessment follows Apple research showing how even the most sophisticated AI models stumble on basic school-grade arithmetic. The findings have reignited debate over whether these systems actually reason or merely guess, and many believe AI companies have fallen short of their promises.

Understanding Apple’s report on the math skills of AI


Apple researchers took the standard GSM8K benchmark, a set of grade-school math problems simple enough for a 10-year-old, and made small changes to it: they kept the logic identical but swapped some of the numbers. All 25 tested models saw their performance drop, a result that genuinely surprised the researchers. If AI is supposedly taking over critical thinking, they asked, is it truly doing so?

The team, as per reports, then added a single irrelevant sentence to one problem. The problem stated that Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday’s amount on Sunday. The added sentence read, “but five of them were a bit smaller than average.” Anyone, including a child, would ignore it, since it changes nothing about the kiwi count.


However, OpenAI’s o1-mini subtracted those five smaller kiwis from the total, and Llama did the same. Both models answered 185 instead of 190, the actual answer. Observers on X slammed the results. One user wrote, “The kiwi problem is the one that should haunt every AI company. The model saw ‘five of them were a bit smaller’ and subtracted 5. It didn’t ask why size would affect a count. That is the absence of reasoning entirely.”
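The arithmetic in question is trivial, which is what makes the failure striking. A minimal sketch of the kiwi problem, with the correct count and the erroneous subtraction the models made:

```python
# The kiwi problem from Apple's GSM-NoOp test, worked out directly.
friday = 44
saturday = 58
sunday = 2 * friday  # "double Friday's amount"

correct_total = friday + saturday + sunday
print(correct_total)  # 190 -- the smaller kiwis still count as kiwis

# What o1-mini and Llama effectively did: treat "five of them were
# a bit smaller" as a cue to subtract 5 from the count.
model_answer = correct_total - 5
print(model_answer)  # 185
```

The irrelevant clause carries a number, and the models pattern-matched that number into an operation rather than asking whether it belonged in the calculation at all.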

The user further said, “Autoregressive models, or as I call them, parrot calculators, don’t have an objective worldview.”

“We’ve spent a trillion dollars to build a computer that can’t do math. Truly a full circle moment for technology,” said another.


The broader truth is that these models do not understand what subtraction actually means. They see numbers next to a descriptive word and blindly apply an operation.

Performance crashed even with worked examples given in advance

The damage went far beyond the one fruit problem. Apple’s dataset, called GSM-NoOp, revealed catastrophic drops. Phi-3-mini lost more than 65% of its math ability from a single irrelevant clause. GPT-4o fell from 94.9% accuracy to 63.1%. Even OpenAI’s o1-preview, built specifically for step-by-step reasoning, dropped from 92.7% to 77.4%.
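To put those figures in perspective, a quick sketch converting the before-and-after accuracies quoted in the article into percentage-point and relative drops (the accuracy pairs are from the article; the drop sizes are computed here):

```python
# Accuracy before/after the irrelevant clause, per the reported results.
results = {
    "GPT-4o": (94.9, 63.1),
    "o1-preview": (92.7, 77.4),
}

for model, (before, after) in results.items():
    point_drop = before - after
    relative_drop = point_drop / before * 100
    print(f"{model}: -{point_drop:.1f} points ({relative_drop:.1f}% relative)")
```

By that measure, GPT-4o lost roughly a third of its accuracy, and even the reasoning-focused o1-preview lost about a sixth, from one sentence that a child would ignore.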

Even giving the models eight solved examples of similar questions before asking did not help; they still fell for the researchers’ trick. One user notably said, “The scariest result is not the 65% drop. It’s what happened when they gave the model 8 solved examples. You cannot fix this with better prompts.”


The Apple research paper, presented at ICLR 2025, concluded that LLMs perform “probabilistic pattern-matching,” not genuine logical reasoning. When researchers added extra clauses to a problem, performance collapsed further: Gemma2-9b dropped from 84.4% to 41.8% with only two additional irrelevant sentences. The models do not slow down to think; they pattern-match until the complexity breaks them.

As one comment on X summarized, “Every benchmark score you have ever seen for math reasoning was tested on fixed questions models likely saw during training. The benchmarks are not measuring intelligence. They are measuring memory. And nobody told you.”

Chahat Sharma
Chahat Sharma is a writer at Backdash. She is the author of An Audacious Lass: A Girl Who Wants to Live Her Life On Her Own Terms and has co-authored several anthologies. Alongside her published work, she actively contributes to various platforms, weaving words that connect with both social and personal narratives. A passionate storyteller at heart, Chahat aspires to see her words brought to life on the big screen someday. Her dream is to work with and learn from Shonda Rhimes, the acclaimed American television producer and screenwriter, to craft stories that resonate with audiences worldwide. With a growing portfolio and unwavering dedication to writing, she continues to shape her path toward impactful storytelling.
