Can Large Language Models Support Endodontic Decision-Making? Accuracy and Consistency of ChatGPT and Gemini in Deep Caries and Pulp Exposure

Işık, Vasfiye; Şişmanoğlu, Soner; Baybora Kayahan, Mehmet

doi:10.5152/essentdent.2026.0113

Işık V., Şişmanoğlu S., Baybora Kayahan M.

Essentials of Dentistry, vol. 5, pp. 1-7, 2026 (TR Dizin)

Publication Type: Article / Full Article
Volume: 5
Publication Date: 2026
DOI: 10.5152/essentdent.2026.0113
Journal Name: Essentials of Dentistry
Journal Indexes: TR DİZİN (ULAKBİM)
Page Numbers: pp. 1-7
Affiliated with Istanbul University-Cerrahpaşa: Yes

Abstract

Background: This study evaluated the accuracy and consistency of responses generated by 2 large language models (LLMs), ChatGPT and Gemini, regarding the management of deep carious lesions and pulp exposure in endodontics.

Methods: Fifty dichotomous (Yes/No) questions were developed based on the position statement of the European Society of Endodontology and distributed across 5 categories: Diagnosis and Classification, Caries Management, Pulp Exposure Management, Materials and Techniques, and Follow-up and Prognosis. Questions were presented to ChatGPT-4o and Gemini (Flash 2.5) on 3 occasions, 1 week apart. A total of 300 responses were collected and compared with reference answers. Accuracy was measured as the proportion of correct responses, while consistency was assessed using Fleiss’ Kappa across time points. Statistical analyses included Cochran’s Q and McNemar’s test, with significance set at P < .05.

Results: ChatGPT achieved an overall accuracy of 76.7%, while Gemini achieved 86.0%, a statistically significant difference favoring Gemini (P = .034). By categories, Gemini showed superior accuracy in Caries Management and Pulp Exposure Management (96.7%), while ChatGPT performed best in Diagnosis and Classification (93.3%). Substantial consistency was observed for both models (Fleiss’ Kappa = 0.627 for ChatGPT; 0.723 for Gemini). Gemini’s accuracy varied significantly across weeks (P = .015), whereas ChatGPT’s remained stable (P = .670).

Conclusion: Both LLM-based chatbots demonstrated moderate accuracy and high consistency in endodontics. While results highlight their potential as educational and decision-support tools, current performance remains insufficient for reliable clinical application. Domain-specific training and further refinement are necessary before widespread implementation in endodontic practice.

Cite this article as: Işık V, Sismanoglu S, Kayahan MB. Can large language models support endodontic decision-making? Accuracy and consistency of ChatGPT and Gemini in deep caries and pulp exposure. Essent Dent. 2026, 5, 0113, doi:10.5152/EssentDent.2026.25113.
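
As an illustration of the two measures named in the Methods, the following minimal Python sketch computes accuracy as the proportion of correct responses and Fleiss’ Kappa across repeated sessions. It is not the authors' analysis code: the function names `accuracy` and `fleiss_kappa` and the four-question example data are hypothetical and chosen only to make the arithmetic easy to follow. Each repeated session of one model is treated as a "rater" for the purpose of the kappa calculation.

```python
from collections import Counter


def accuracy(answers, reference):
    """Proportion of individual responses that match the reference answer."""
    total = sum(len(row) for row in answers)
    correct = sum(ans == ref
                  for row, ref in zip(answers, reference)
                  for ans in row)
    return correct / total


def fleiss_kappa(answers):
    """Fleiss' kappa for a list of per-question answer lists.

    Every inner list must hold the same number of categorical answers
    (here: one answer per repeated session).
    """
    n_items = len(answers)
    n_raters = len(answers[0])
    if n_raters < 2 or any(len(row) != n_raters for row in answers):
        raise ValueError("each question needs the same number (>= 2) of answers")

    categories = sorted({ans for row in answers for ans in row})
    counts = [Counter(row) for row in answers]

    # Observed agreement per question, then averaged over questions.
    p_i = [(sum(c[cat] ** 2 for cat in categories) - n_raters)
           / (n_raters * (n_raters - 1))
           for c in counts]
    p_bar = sum(p_i) / n_items

    # Agreement expected by chance from the overall category proportions.
    p_j = [sum(c[cat] for c in counts) / (n_items * n_raters)
           for cat in categories]
    p_e = sum(p ** 2 for p in p_j)

    if p_e == 1.0:
        # Only one category was ever used; kappa is undefined, and
        # returning 1.0 (perfect agreement) is a convention chosen here.
        return 1.0
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 4 questions, each asked in 3 sessions.
example_answers = [
    ["Yes", "Yes", "Yes"],
    ["No", "No", "No"],
    ["Yes", "No", "Yes"],
    ["No", "No", "Yes"],
]
example_reference = ["Yes", "No", "Yes", "Yes"]

example_accuracy = accuracy(example_answers, example_reference)
example_kappa = fleiss_kappa(example_answers)
```

With the study's design, each model would contribute 50 rows of 3 answers, giving the 150 responses per model on which the reported accuracy and kappa values are based.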