How to Reduce Non-Determinism and Hallucinations in Large Language Models (LLMs)
In recent months, two separate pieces of research have shed light on two of the most pressing issues in large language models (LLMs): their non-deterministic nature and their tendency to hallucinate. Both phenomena have a direct impact on the reliability, reproducibility, and practical usefulness of these technologies.
On the one hand, Thinking Machines, led by former OpenAI CTO Mira Murati, has published a paper proposing ways to make LLMs return the exact same answer to the exact same prompt every time, effectively defeating non-determinism. On the other hand, OpenAI has released research identifying the root cause of hallucinations and suggesting how they could be significantly reduced.
Let’s break down both findings and why they matter for the future of AI.
The problem of non-determinism in LLMs
Anyone who has used ChatGPT, Claude, or Gemini will have noticed that when you type in the exact same question multiple times, you don’t always get the same response. This is what’s known as non-determinism: the same input does not consistently lead to the same output.
In some areas, such as creative writing, this variability can actually be a feature; it helps generate fresh ideas. But in domains where consistency, auditability, and reproducibility are critical — such as healthcare, education, or scientific research — it becomes a serious limitation.
Why does non-determinism happen?
The most common explanation so far has been a mix of two technical issues:
- Floating-point numbers: computer systems round decimal numbers, which can introduce tiny variations.
- Concurrent execution on GPUs: calculations are performed in parallel, and the order in which they finish can vary, changing the result.
However, Thinking Machines argues that this doesn’t tell the whole story. According to their research, the real culprit is batch size.
When a model processes multiple prompts at once, it groups them into batches (or “carpools”). If the system is busy, the batch is large; if it’s quiet, the batch is small. These variations in batch size subtly change the order of operations inside the model, and because floating-point addition is not associative, a different order can produce a marginally different result. That tiny numerical shift can be enough to change which word is predicted next, and once one word changes, the rest of the response can diverge completely.
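To see why addition order matters at all, note that floating-point arithmetic is not associative: summing the same numbers in a different order can give a slightly different result. The snippet below is a minimal illustration in plain Python, not the model’s actual GPU kernels:

```python
# Floating-point addition is not associative: grouping the same numbers
# differently can change the result, because intermediate values get rounded.
a, b, c = 0.1, 1e16, -1e16

left_grouping = (a + b) + c   # 0.1 is absorbed into 1e16, then cancelled away
right_grouping = a + (b + c)  # the large values cancel first, so 0.1 survives

print(left_grouping)                    # 0.0
print(right_grouping)                   # 0.1
print(left_grouping == right_grouping)  # False
```

An LLM performs billions of such additions inside its reductions (sums, averages, softmax denominators). If the batch size changes how those reductions are split across the GPU, the rounding differences can occasionally be large enough to flip which token wins.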
Thinking Machines’ solution
The key, they suggest, is to keep internal processes consistent regardless of batch size. Their paper outlines three core fixes:
- Batch-invariant kernels: ensure operations are processed in the same order, even at the cost of some speed (see the sketch after this list).
- Consistent mixing: use one stable method of combining operations, independent of workload.
- Ordered attention: slice input text uniformly so the attention mechanism processes sequences in the same order each time.
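To make the first fix concrete, here is a toy sketch, not Thinking Machines’ actual kernels, contrasting a reduction whose grouping depends on current server load with a batch-invariant one that always splits the work into the same fixed-size chunks, so the addition order never varies:

```python
import math

def load_dependent_sum(values, num_workers):
    """Toy reduction whose grouping depends on how busy the server is:
    the values are split across `num_workers` chunks, each chunk is summed,
    then the partial sums are combined. Different worker counts mean a
    different addition order, hence possibly different rounding."""
    chunk = math.ceil(len(values) / num_workers)
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

def batch_invariant_sum(values, chunk=256):
    """Toy batch-invariant reduction: the chunk size is fixed, so the
    addition order is identical regardless of how many requests share the GPU."""
    partials = [sum(values[i:i + chunk]) for i in range(0, len(values), chunk)]
    return sum(partials)

# The same activations reduced under different simulated server loads:
values = [1e8 + (-1) ** i * 1e-3 * (i % 7) for i in range(10_000)]

print({load_dependent_sum(values, w) for w in (1, 3, 8, 64)})  # typically several distinct values
print({batch_invariant_sum(values) for _ in range(4)})         # always exactly one value
```

The real work is doing this for matrix multiplications, normalisation, and attention on a GPU without giving up too much speed, which is exactly the trade-off the paper accepts.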
The results are striking: in an experiment with the Qwen3-235B model, applying these methods produced 1,000 identical completions to the same prompt, rather than dozens of unique variations.
This matters because determinism makes it possible to audit, debug, and above all, trust model outputs. It also enables stable benchmarks and easier verification, paving the way for reliable applications in mission-critical fields.
The problem of hallucinations in LLMs
The second major limitation of today’s LLMs is hallucination: confidently producing false or misleading answers, such as inventing a historical date or attributing a theory to the wrong scientist.
Why do models hallucinate?
According to OpenAI’s paper, hallucinations aren’t simply bugs; they are baked into the way we train LLMs. There are two key phases where this happens:
- Pre-training: even with a flawless dataset (which is impossible), the objective of predicting the next word naturally produces errors. Generating the right answer is harder than checking whether an answer is right.
- Post-training (reinforcement learning): models are fine-tuned to be more “helpful” and “decisive”. But current metrics reward correct answers while penalising both mistakes and admissions of ignorance. The result? Models learn that it’s better to bluff with a confident but wrong answer than to say “I don’t know”.
This is much like a student taking a multiple-choice exam: leaving a question blank guarantees zero, while guessing gives at least a chance of scoring. LLMs are currently trained with the same incentive structure.
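Putting rough numbers on that incentive (my own back-of-the-envelope illustration, not figures from the paper): under today’s typical grading, a right answer earns 1 point and everything else, wrong or blank, earns 0, so guessing never does worse than abstaining.

```python
def expected_score_today(p_correct, abstain):
    """Expected score under binary grading: 1 point for a correct answer,
    0 for a wrong answer, 0 for leaving the question blank."""
    return 0.0 if abstain else p_correct

# A model that is only 25% sure (say, a blind guess among four options):
print(expected_score_today(0.25, abstain=False))  # 0.25 -> bluffing pays
print(expected_score_today(0.25, abstain=True))   # 0.0  -> honesty scores nothing
```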
OpenAI’s solution: behavioural calibration
The proposed solution is surprisingly simple yet powerful: teach models when not to answer. Instead of forcing a response to every question, set a confidence threshold.
- If the model is, for instance, more than 75% confident, it answers.
- If not, it responds: “I don’t know.”
This technique is known as behavioural calibration. It aligns the model’s stated confidence with its actual accuracy.
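As a rough sketch of the idea, not OpenAI’s implementation, and assuming we can extract some confidence estimate from the model (here a crude proxy built from hypothetical per-token log-probabilities), the decision rule is just a threshold check:

```python
import math

def answer_with_calibration(question, generate, threshold=0.75):
    """Toy behavioural-calibration wrapper.

    `generate` is a hypothetical interface that returns the model's answer
    together with the log-probabilities of the tokens it produced; real APIs
    expose this information in different ways. If the model's estimated
    confidence falls below the threshold, we abstain instead of guessing.
    """
    answer, token_logprobs = generate(question)
    # Crude proxy: geometric mean of the per-token probabilities.
    confidence = math.exp(sum(token_logprobs) / len(token_logprobs))
    if confidence >= threshold:
        return answer
    return "I don't know."
```

The hard part in practice is the confidence estimate itself; per-token probabilities are only a rough proxy, and alternatives such as self-reported confidence or agreement across repeated samples are commonly used.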
Crucially, this requires rethinking benchmarks. Today’s most popular evaluations grade answers as simply right or wrong, with no credit for admitting uncertainty. OpenAI suggests a three-tier scoring system:
- +1 for a correct answer
- 0 for “I don’t know”
- –1 for an incorrect answer
This way, honesty is rewarded and overconfident hallucinations are discouraged.
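A small worked example of how this flips the incentive (my own illustration of the +1 / 0 / –1 scheme above): the expected score for answering is p − (1 − p) = 2p − 1, so answering only pays once the model is more than 50% confident, and a harsher penalty for mistakes pushes that break-even confidence even higher.

```python
def expected_score(p_correct, abstain, wrong_penalty=-1.0):
    """Expected score under the three-tier scheme: +1 for a correct answer,
    0 for saying "I don't know", `wrong_penalty` for an incorrect answer."""
    if abstain:
        return 0.0
    return p_correct * 1.0 + (1.0 - p_correct) * wrong_penalty

for p in (0.25, 0.50, 0.75):
    print(p, expected_score(p, abstain=False), expected_score(p, abstain=True))
# 0.25 -> -0.5 vs 0.0 : abstaining wins
# 0.50 ->  0.0 vs 0.0 : indifferent
# 0.75 -> +0.5 vs 0.0 : answering wins
```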
Signs of progress
Some early users report that GPT-5 already shows signs of this approach: instead of fabricating answers, it sometimes replies, “I don’t know, and I can’t reliably find out.” Even Elon Musk praised this behaviour as an impressive step forward.
The change may seem small, but it has profound implications: a model that admits uncertainty is far more trustworthy than one that invents details.
Two sides of the same coin: reliability and trust
What makes these two breakthroughs especially interesting is how complementary they are:
- Thinking Machines is tackling non-determinism, making outputs consistent and reproducible.
- OpenAI is addressing hallucinations, making outputs more honest and trustworthy.
Together, they target the biggest barrier to wider LLM adoption: confidence. If users — whether researchers, doctors, teachers, or policymakers — can trust that an LLM will both give reproducible answers and know when to admit ignorance, the technology can be deployed with far greater safety.
Conclusion
Large language models have transformed how we work, research, and communicate. But for them to move beyond experimentation and novelty, they need more than just raw power or creativity: they need trustworthiness.
Thinking Machines has shown that non-determinism is not inevitable; with the right adjustments, models can behave consistently. OpenAI has demonstrated that hallucinations are not just random flaws but the direct result of how we train and evaluate models, and that they can be mitigated with behavioural calibration.
Taken together, these advances point towards a future of AI that is more transparent, reproducible, and reliable. If implemented at scale, they could usher in a new era where LLMs become dependable partners in science, education, law, and beyond.