AI Rated 'Substandard' at Research-Level Math, Language Models 'Unable to Innovate' and May Hinder Scientific Progress
AI was rated "substandard" at solving research-level mathematics, and the limitations of large language models (LLMs) left them "unable to come up with new ideas."
One of the co-authors warned that "AI could slow down scientific progress." In their new paper, <First Proof>, co-authored by several world-renowned mathematicians, the authors write: "Although commercial AI systems have already reached a level of utility that makes them useful tools for mathematicians, it is not yet clear where they stand in solving research-level mathematical problems on their own, without expert intervention."
Dr. Hairer, one of the authors, stated, "I believe that mathematics is actually quite 'safe,'" and that large language models (LLMs), the core technology behind chatbots, "are now quite adept at solving contrived problems." However, he added, "I have not seen any plausible examples of LLMs generating truly novel ideas or concepts."
Currently, AI companies use what some mathematicians describe as "artificial" or "limited" problems as benchmarks to show how well their models perform without human assistance and to attract large amounts of funding from investors.
It is known that AI companies sometimes invite mathematicians to participate in such verification, paying them around $5,000 per problem.
None of the authors of the First Proof project is affiliated with an AI company.
The project files were uploaded online on the 7th to the <tgkolda/1stproof> repository, in a batch labeled "2026-02-batch".
The New York Times, which interviewed the authors, wrote, "This paper describes a recently launched experiment in which the authors collected real test problems from unpublished research to meaningfully measure the mathematical abilities of artificial intelligence." The authors said on the 7th that they hope the project will add nuance to the exaggerated narrative that mathematics has been "solved" by AI, and temper AI hype that could scare off the next generation of students or discourage research funders.
The authors include Dr. Hairer, who won the Fields Medal, the most prestigious award in mathematics, in 2014 and the Breakthrough Prize in 2021. He teaches at the École Polytechnique Fédérale de Lausanne in Switzerland and at Imperial College London.
Professor Mohamed Abouzaid of Stanford University, winner of the 2017 New Horizons in Mathematics Prize, is another author.
They co-wrote "First Proof" with several other mathematicians, including Professor Lauren Williams of Harvard University and Dr. Tamara Kolda, who runs MathSci.ai, a consulting firm in the San Francisco Bay Area.
For the experiment, the authors, representing various mathematical fields, each submitted a "test question" from their ongoing but unpublished research.
The New York Times reported that the answers have already been determined, and that the solutions have been posted online in encrypted form and will be released on February 13th.
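The article does not say how the solutions were encrypted, but the general idea of publishing a sealed answer that can be verified after a later reveal can be sketched as a simple hash-based commit-and-reveal scheme. This is purely an illustration of the concept, not the project's actual method; all function names here are hypothetical:

```python
import hashlib
import secrets

def commit(solution: str) -> tuple[str, str]:
    """Publish the digest now; keep the nonce and solution secret."""
    nonce = secrets.token_hex(16)  # random salt prevents guessing the solution
    digest = hashlib.sha256((nonce + solution).encode()).hexdigest()
    return digest, nonce

def reveal_ok(digest: str, nonce: str, solution: str) -> bool:
    """On release day, anyone can check the revealed solution against the digest."""
    return hashlib.sha256((nonce + solution).encode()).hexdigest() == digest

# The authors could post `digest` in advance, then release
# `nonce` and `solution` on February 13th for public verification.
digest, nonce = commit("the answer is 42")
assert reveal_ok(digest, nonce, "the answer is 42")
assert not reveal_ok(digest, nonce, "the answer is 41")
```

Because the digest is published before the models are tested, no one can later claim the answers were changed to match an AI's output.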
Regarding the joint test question, Dr. Kolda, one of the few mathematicians elected to the National Academy of Engineering, told the New York Times, "The goal here is to understand the limits: how far can the AI surpass existing solutions found in the training data and online?"
The joint research team conducted preliminary tests against OpenAI's ChatGPT-5.2 Pro and Google's Gemini 3.0 DeepThink.
The models were given a single, one-shot chance to come up with an answer. Given this, the authors wrote, "Even the best publicly available AI systems struggle to solve many of our problems."
The introduction to the paper, as reported by The New York Times, explains the pun in the title:
"In baking, the first proof, or bulk fermentation, is a crucial step in which the entire dough ferments as a single mass before being divided and shaped into loaves."
The research process and results, as reported by the New York Times, are roughly as follows.
The reporter spoke with the authors via videoconference and email; the exchanges below have been condensed and edited for clarity.
Question: <How is the "First Proof" method novel compared to other benchmarking efforts?>
Professor Mohamed Abouzaid: The biggest novelty is that the test questions are actually taken from our own research.
We start with what interests us. Within that space, we try to formulate testable questions.
Question: <What are testable questions?>
Authors' response: Current AI systems have well-known limitations. First, they are very weak at visual reasoning, so we avoided such questions; if our goal were adversarial, we would have included questions with images. Furthermore, companies limit the length of a model's response because answer quality degrades beyond a certain point, so we avoided questions requiring more than five pages of answers.
Question: <The paper is careful to clarify "what mathematical research is." What does this mean?>
Professor Abouzaid: A key step in modern research is identifying the biggest motivating question—the direction in which to approach the problem. All sorts of preliminary work is necessary, and this is where mathematical creativity comes in.
Once a problem is solved, mathematicians tend to judge the significance of the contribution by the new questions it raises. Sometimes, resolving a conjecture in a particular direction can even be disappointing, because it closes off the possibility of new questions.
Professor Lauren Williams: Let me use a loose analogy. In experimental science, research can be divided into three parts. First, we pose a big question that we hope will lead to insights into our field. Second, we design experiments to answer the question. Third, we conduct the experiments and analyze the results.
I could also divide mathematical research into parallel parts: First, we pose a big question that we hope will lead the field. Second, we break that big question down into smaller, more manageable problems and develop a framework in which they can be solved; our test questions are like these smaller problems. Third, we find answers to these smaller questions and prove them correct.
All three are essential. In the First Proof project, we focused on the third part because it is the most measurable.
You can query an AI model with a small, clear question and then evaluate its answer. If we were to ask an AI model to provide a larger question or framework, evaluating its performance would be much more difficult.
Question: <How did the AI system perform in the "first proof" evaluation?>
Professor Williams: In one test on my problem, there was an interesting sequence of responses. The model would give an answer and say, "Okay, this is the final answer."
Then it would say, "Wait a minute, stop. How about this?" and modify its answer in some way.
And so on: "Okay, this is the final answer." "Wait, there's a catch!" It got stuck in an infinite loop.
Another answer was closely related, but answered a different question.
Dr. Tamara Kolda: The preliminary results were disappointing. The AI was confused by the problem, ignored some key information in its answers, and was inconsistent.
We've since revised the problem description and added more explicit instructions to give the AI a better chance. We'll have to wait and see what the final results are.
Professor Martin Hairer: One thing I've noticed in general is that the model tends to lavish detail on the easy parts.
You find yourself thinking, "Okay, great, move along a little faster; I'm bored listening to you." And it rarely gets into the details of the core argument.
Sometimes it's like reading a bad undergraduate paper. They know where to start and where they want to go, but they don't quite know how to get there.
So they wander around and, at some point, throw in an "and therefore" and pray.
Question: <That sounds like classic hand-waving: a lack of rigor and skipping over the complexity?>
Professor Hairer: Yes, they're pretty good at explaining things in general terms.
Question: <So you weren't impressed?>
Professor Hairer: No, I wouldn't say that. Sometimes I've been quite impressed, for example, with the way they've connected several known arguments with a few calculations. They've been really good at getting that part right.
Question: <In your dream world, what would AI be doing for you?>
Professor Hairer: The output of current LLMs is unreliable.
They seem absolutely confident, but it takes a lot of effort to convince yourself that their answers are correct.
Intellectually, it feels painful. Again, it's like not knowing whether you're dealing with a strong graduate student or merely a good undergraduate.
The ideal is a trustworthy model.
Dr. Kolda: AI is often promoted as a colleague or collaborator, but I don't think that's true.
My human colleagues have unique perspectives, and I especially enjoy hearing views that differ from mine.
The AI has whatever perspective I dictate to it, and that is not interesting at all!
One of my growing concerns is that AI could unintentionally slow down scientific progress.
Theoretical physicist Max Planck is often quoted as saying, "Science advances one funeral at a time."
I recognize that my perspective could be quite wrong.
But if my opinions are imprinted on AI systems and persist indefinitely, will it hinder the development of new scientific ideas?
The New York Times article was titled, "These Mathematicians Are Putting Artificial Intelligence to the Test," with the subtitle, "Large Language Models Struggle at Research-Level Math Problems. It Takes Humans to Assess How Bad They Are."
See <AI Derails: Forgetting What You're Doing, Losing Focus, 'Reinforcing the Investment Bubble,' December 28, 2025>
See <Chatbot User Brains Corrupted by AI: Children with "Zero Memory" Have the Worst Vocabulary on Social Media, November 10, 2025>