Tehran University of Medical Sciences

Science Communicator Platform

Artificial Intelligence Reliability in Implant Dentistry: A Comparative Analysis of Clinical Accuracy and Hallucination Patterns Across Multiple Language Models

Summary: How often do AI models produce reliable information in implant dentistry? A study found that newer generative models have high fabrication rates, while retrieval-augmented systems fabricated no references. Can we trust AI? #AIinDentistry #EvidenceBasedPractice

Authors: Hooshiar MH

Source: Journal of Prosthetic Dentistry Published: 2026


Abstract

Statement of problem: Artificial intelligence (AI) language models are increasingly used for clinical decision-making in implant dentistry and for scholarly writing. Yet their reliability, fabrication rates, and comparative performance across architectures and generations remain unestablished, potentially compromising evidence-based practice.
Purpose: The purpose of this cross-sectional comparative study was to evaluate the accuracy of different AI models in implant dentistry and to examine how often they produced false information under evidence-based prompting.
Material and methods: The clinical accuracy and hallucination patterns of conventional large language models (represented by ChatGPT) and retrieval-augmented generation (RAG) systems (represented by ScholarQA and OpenEvidence) were examined in their responses to evidence-based clinical questions in implant dentistry. Whether evidence-based prompting strategies reduced fabrication rates across these architectures was also assessed. Five AI models (GPT-4o, Model 4.1, Model 4.5, ScholarQA, and OpenEvidence) were tested with 15 clinical questions in implant dentistry under both unprompted and evidence-based prompting conditions.
Results: ScholarQA and OpenEvidence both achieved high accuracy (80.0%) with 0% fabrication, while GPT-4o showed high rates of reference (82%) and statistical (85.7%) fabrication. More recently released models demonstrated increased fabrication rates compared with earlier versions. Notably, 89.1% of fabricated citations were dated 2023 to 2025. Evidence-based prompting increased GPT-4o's accuracy from 33.3% to 66.7% but did not reduce fabrication rates. Both RAG systems showed minimal response to prompting, with OpenEvidence improving marginally to 86.6% and ScholarQA decreasing to 73.3%.
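The accuracy percentages reported in the Results can be read as counts out of the 15 clinical questions. The Python sketch below is illustrative only and is not part of the study. It assumes that each accuracy figure is the share of the 15 questions answered correctly, which the abstract does not state explicitly. The names `reported`, `implied_correct`, and `implied_counts` are hypothetical.

```python
N_QUESTIONS = 15  # number of clinical questions stated in the abstract

# Accuracy percentages as reported in the abstract.
reported = {
    "ScholarQA (unprompted)": 80.0,
    "OpenEvidence (unprompted)": 80.0,
    "GPT-4o (unprompted)": 33.3,
    "GPT-4o (prompted)": 66.7,
    "OpenEvidence (prompted)": 86.6,
    "ScholarQA (prompted)": 73.3,
}


def implied_correct(percent, n=N_QUESTIONS):
    """Return the whole number of correct answers closest to `percent` of `n`.

    Assumption: accuracy = correct answers / n (not confirmed by the abstract).
    """
    return round(percent * n / 100)


def implied_counts(table, n=N_QUESTIONS):
    """Map each reported percentage to its implied count of correct answers."""
    return {label: implied_correct(p, n) for label, p in table.items()}


counts = implied_counts(reported)
for label, k in counts.items():
    # Print the implied count and the percentage that count reproduces.
    print(f"{label}: {k}/{N_QUESTIONS} = {100 * k / N_QUESTIONS:.2f}%")
```

Under this assumption, 80.0% corresponds to 12 of 15 questions, 33.3% to 5, 66.7% to 10, 73.3% to 11, and 86.6% to 13. Because 13/15 is 86.67%, the reported 86.6% looks truncated rather than rounded. With only 15 questions, each additional correct answer shifts accuracy by about 6.7 percentage points, so the small changes seen in the RAG systems under prompting amount to a single question.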
Conclusions: Conventional generative models frequently produced hallucinations, including fabricated citations and data, whereas retrieval-augmented systems such as ScholarQA and OpenEvidence avoided fabricated references but still generated incorrect interpretations. ChatGPT-4.5, ScholarQA, and OpenEvidence demonstrated comparable accuracy, yet hallucination in these forms remains the principal barrier to reliable clinical and scholarly use. Prompting improved performance mainly in earlier generations of AI models, while more recently released versions and RAG systems showed only limited benefit or were unaffected.
Copyright © 2026 by the Editorial Council of The Journal of Prosthetic Dentistry. All rights are reserved, including those for text and data mining, AI training, and similar technologies.