How Accurate Are the Responses of 4 AI Chatbots to Orthognathic Surgery Questions?

HATİPOĞLU, ŞİRİN; Ozkan, Esra; Tasova, Fatma; ÖZDAL ZİNCİR, ÖZGE

doi:10.1097/scs.0000000000012367

HATİPOĞLU Ş., Ozkan E. C., Tasova F. A. K., ÖZDAL ZİNCİR Ö.

JOURNAL OF CRANIOFACIAL SURGERY, vol.37, no.3/4, pp.905-910, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 37 Issue: 3/4
Publication Date: 2026
DOI Number: 10.1097/scs.0000000000012367
Journal Name: JOURNAL OF CRANIOFACIAL SURGERY
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, MEDLINE
Pages: pp.905-910
Istanbul University-Cerrahpaşa Affiliated: Yes

Abstract

Objective: This study aimed to evaluate and compare the accuracy, reliability, and comprehensibility of information on orthognathic surgery provided by 4 artificial intelligence (AI)-based language models (ChatGPT-4, Google Gemini, Microsoft Copilot, and DeepSeek-v3). Methods: A cross-sectional content analysis was carried out to evaluate the responses generated by ChatGPT-4, Gemini, Copilot, and DeepSeek-v3. A total of 118 questions covering 12 domains related to orthognathic surgery were formulated, and the AI-generated answers were systematically assessed. A 5-point Likert scale was used to independently score the responses. Descriptive statistics were calculated. The Fisher exact test was applied to examine relationships between categorical variables when the expected value was <5. All analyses were performed with IBM SPSS 27. Results: Significant differences were observed among the AI models (P=0.022). DeepSeek-v3 demonstrated the highest proportion of objectively true responses (87.3%), outperforming Gemini, ChatGPT-4, and Copilot. While ChatGPT-4 and DeepSeek-v3 performed significantly better in the "postoperative" domain by providing "objectively true" answers (P=0.038), Gemini and Copilot generated a greater proportion of "selected facts." Domain-specific variations were statistically significant only for Gemini (P<0.001). Conclusions: The results indicate that the reliability of AI-assisted language models in delivering medical information varies with the specific topic addressed. In its first comparative assessment within this study, DeepSeek-v3 outperformed the other evaluated models in terms of informational accuracy.
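For readers unfamiliar with the Fisher exact test named in the Methods, the following is a minimal Python sketch of the two-sided test for a 2x2 table. It is not the authors' procedure: the study ran its analyses in IBM SPSS 27, presumably on larger model-by-category tables, and the abstract does not report the underlying counts. The function name `fisher_exact_2x2` and the toy table are assumptions made purely for illustration. The only figure taken from the abstract is the count implied by 87.3% of 118 questions.

```python
from math import comb


def fisher_exact_2x2(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Illustrative helper (hypothetical name), not the SPSS routine used in the study.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    k_min = max(0, col1 - row2)
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum the probabilities of all tables no more likely than the observed one.
    # The small relative tolerance guards against floating-point ties.
    total = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= p_obs * (1 + 1e-9)
    )
    return min(1.0, total)


# Count implied by the abstract: 87.3% of 118 questions for DeepSeek-v3.
TOTAL_QUESTIONS = 118
deepseek_true = round(0.873 * TOTAL_QUESTIONS)

# Toy table, NOT study data: 3 vs 1 in the first row, 1 vs 3 in the second.
toy_p = fisher_exact_2x2(3, 1, 1, 3)

print(f"DeepSeek-v3 objectively true responses: {deepseek_true}/{TOTAL_QUESTIONS}")
print(f"Toy-table two-sided p-value: {toy_p:.4f}")
```

An exact test of this kind is usually preferred over the chi-square approximation when some cells are sparse, which matches the condition stated in the Methods. Reproducing the reported P values would require the full per-model category counts from the article itself.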