“Computer that can’t do math”: Experts slam AI’s calculation capabilities after Apple findings reveal flaws even in advanced models

A trillion dollars, years of development, and countless experts working on it, and still the computer cannot do the math. That blunt assessment follows Apple research showing how even the most sophisticated AI models stumble on basic school-grade arithmetic. The findings have reignited debate over whether these systems actually reason or merely guess, and many believe AI companies have fallen short of their promises.

Understanding Apple’s report on the math skills of AI


Apple researchers took the standard GSM8K benchmark, a set of grade-school math problems simple enough for a 10-year-old, and made small changes to it: they kept the logic identical but swapped some of the numbers. All 25 tested models saw their performance drop, a result that genuinely surprised the researchers. If AI is supposedly taking over critical thinking, they asked, is it truly doing so?

The team, as per reports, then added a single irrelevant sentence to one problem. The problem stated that Oliver picks 44 kiwis on Friday, 58 on Saturday, and double Friday’s amount on Sunday. The added sentence read, “but five of them were a bit smaller than average.” Anyone, including a child, would ignore it, since it changes nothing about the kiwi count.


However, OpenAI’s o1-mini subtracted those five smaller kiwis from the total, and Llama did the same. Both models answered 185 instead of 190, the actual answer. Observers on X slammed the results. One user wrote, “The kiwi problem is the one that should haunt every AI company. The model saw ‘five of them were a bit smaller’ and subtracted 5. It didn’t ask why size would affect a count. That is the absence of reasoning entirely.”
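The arithmetic in question is trivial, which is what makes the failure striking. A minimal sketch of the kiwi problem, with the correct count and the erroneous subtraction the models made:

```python
# The kiwi problem from Apple's GSM-NoOp test, worked out directly.
friday = 44
saturday = 58
sunday = 2 * friday  # "double Friday's amount"

correct_total = friday + saturday + sunday
print(correct_total)  # 190 -- the smaller kiwis still count as kiwis

# What o1-mini and Llama effectively did: treat "five of them were
# a bit smaller" as a cue to subtract 5 from the count.
model_answer = correct_total - 5
print(model_answer)  # 185
```

The irrelevant clause carries a number, and the models pattern-matched that number into an operation rather than asking whether it belonged in the calculation at all.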

The user further said, “Autoregressive models, or as I call them, parrot calculators, don’t have an objective worldview.”

“We’ve spent a trillion dollars to build a computer that can’t do math. Truly a full circle moment for technology,” said another.


The broader truth is that these models do not understand what subtraction actually means. They see numbers next to a descriptive word and blindly apply an operation.

Performance crashed even with worked examples given in advance

The damage went far beyond the one fruit problem. Apple’s dataset, called GSM-NoOp, revealed catastrophic drops. Phi-3-mini lost more than 65% of its math ability from a single irrelevant clause. GPT-4o fell from 94.9% accuracy to 63.1%. Even OpenAI’s o1-preview, built specifically for step-by-step reasoning, dropped from 92.7% to 77.4%.
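To put those figures in perspective, a quick sketch converting the before-and-after accuracies quoted in the article into percentage-point and relative drops (the accuracy pairs are from the article; the drop sizes are computed here):

```python
# Accuracy before/after the irrelevant clause, per the reported results.
results = {
    "GPT-4o": (94.9, 63.1),
    "o1-preview": (92.7, 77.4),
}

for model, (before, after) in results.items():
    point_drop = before - after
    relative_drop = point_drop / before * 100
    print(f"{model}: -{point_drop:.1f} points ({relative_drop:.1f}% relative)")
```

By that measure, GPT-4o lost roughly a third of its accuracy, and even the reasoning-focused o1-preview lost about a sixth, from one sentence that a child would ignore.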

Even giving the models eight solved examples of similar questions before asking did not help; they still fell for the researchers’ trick. One user notably said, “The scariest result is not the 65% drop. It’s what happened when they gave the model 8 solved examples. You cannot fix this with better prompts.”


The Apple research paper, presented at ICLR 2025, concluded that LLMs perform “probabilistic pattern-matching,” not genuine logical reasoning. When researchers added extra clauses to a problem, performance collapsed further: Gemma2-9b dropped from 84.4% to 41.8% with only two additional irrelevant sentences. The models do not slow down to think; they pattern-match until the complexity breaks them.

As one comment on X summarized, “Every benchmark score you have ever seen for math reasoning was tested on fixed questions models likely saw during training. The benchmarks are not measuring intelligence. They are measuring memory. And nobody told you.”

Chahat Sharma
Chahat Sharma is a writer at Backdash. She is the author of An Audacious Lass: A Girl Who Wants to Live Her Life On Her Own Terms and has co-authored several anthologies. Alongside her published work, she actively contributes to various platforms, weaving words that connect with both social and personal narratives. A passionate storyteller at heart, Chahat aspires to see her words brought to life on the big screen someday. Her dream is to work with and learn from Shonda Rhimes, the acclaimed American television producer and screenwriter, to craft stories that resonate with audiences worldwide. With a growing portfolio and unwavering dedication to writing, she continues to shape her path toward impactful storytelling.
