Accuracy and Temporal Consistency of ChatGPT and Gemini in Responding to Textbook and Patient-Oriented Dental Bleaching Questions: A Multi-Session Comparative Study


ŞİŞMANOĞLU S., Kotan S. S., IŞIK V.

Journal of Esthetic and Restorative Dentistry, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI: 10.1111/jerd.70172
  • Journal Name: Journal of Esthetic and Restorative Dentistry
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE
  • Keywords: artificial intelligence, chatbot accuracy, dental bleaching, patient education, temporal consistency
  • İstanbul Üniversitesi-Cerrahpaşa Affiliated: Yes

Abstract

Objective: This study compared the accuracy and temporal consistency of ChatGPT and Gemini in responding to dental bleaching questions across three weekly sessions.

Materials and Methods: A total of 280 true/false questions were developed, comprising 200 textbook-based and 80 patient-oriented frequently asked questions. Both chatbots were queried weekly under controlled conditions. Accuracy was compared using generalized estimating equations, consistency was assessed using Fleiss' kappa, and weekly stability was evaluated using Cochran's Q test. Open-ended responses were scored for quality and misinformation by two evaluators.

Results: For textbook questions, ChatGPT achieved significantly higher accuracy than Gemini (77.7% versus 70.5%, p = 0.0009). For frequently asked questions, both chatbots performed comparably (92.9% versus 90.8%, p = 0.252). Temporal consistency was only fair for textbook questions but almost perfect for frequently asked questions in both chatbots. Both chatbots showed significant upward trends in textbook accuracy across sessions. Gemini received higher global quality scores for open-ended responses, while misinformation rates were similarly low.

Conclusions: Within the limitations of this study, ChatGPT achieved significantly higher accuracy than Gemini for textbook-based dental bleaching questions, while both chatbots performed comparably for patient-oriented questions. Temporal consistency differed markedly, with almost perfect consistency for patient-oriented questions and only fair consistency for textbook-based questions.

Clinical Significance: Chatbot responses to common patient questions about dental bleaching are generally accurate and consistent, but their reliability drops substantially for specialized academic content, suggesting these tools should complement rather than replace professional clinical judgment.
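The two consistency statistics named in the Materials and Methods (Fleiss' kappa across the three weekly sessions, and Cochran's Q for weekly stability of binary correct/incorrect outcomes) can be sketched in pure Python. This is a minimal illustration with made-up toy inputs, not the study's dataset; a real analysis would typically use a statistics package such as statsmodels.

```python
# Sketch of the consistency statistics named in the Methods section.
# Fleiss' kappa treats the three weekly sessions as three "raters" of each
# question; Cochran's Q tests whether correctness rates differ across
# sessions. Toy data only, for illustration.

def fleiss_kappa(counts):
    """counts: N x k matrix; counts[i][j] = how many of the n sessions
    placed question i in category j (here k = 2: correct / incorrect)."""
    N = len(counts)
    n = sum(counts[0])                       # ratings (sessions) per question
    k = len(counts[0])
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(k)]
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N                     # observed agreement
    P_e = sum(p * p for p in p_j)            # chance agreement
    return (P_bar - P_e) / (1 - P_e)

def cochrans_q(data):
    """data: N x k binary matrix; data[i][j] = 1 if question i was answered
    correctly in session j. Returns the Q statistic (chi-square, k-1 df)."""
    k = len(data[0])
    col = [sum(row[j] for row in data) for j in range(k)]
    row_tot = [sum(row) for row in data]
    T = sum(row_tot)
    num = (k - 1) * (k * sum(c * c for c in col) - T * T)
    den = k * T - sum(r * r for r in row_tot)
    return num / den

# Perfect session-to-session agreement over 3 sessions -> kappa = 1.0
print(fleiss_kappa([[3, 0], [0, 3], [3, 0], [0, 3]]))             # → 1.0

# Four questions over 3 sessions with some answers drifting week to week
print(cochrans_q([[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]))   # → 3.0
```

In this framing, an "almost perfect" kappa (as reported for the patient-oriented questions) means each question received essentially the same verdict in all three sessions, while a "fair" kappa (textbook questions) means verdicts frequently flipped between weeks.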