BMC Oral Health, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)
Background: With the increasing reliance on artificial intelligence (AI) in healthcare information delivery, it is essential to evaluate the accuracy and reliability of AI-generated responses. This study aimed to assess the quality of responses provided by three AI-based language models (ChatGPT-4, Gemini, and Copilot) on temporomandibular disorders (TMD), a complex and prevalent group of musculoskeletal conditions.

Methods: A total of 83 questions, categorized into seven key domains of TMD (Anatomy, Signs and Symptoms, Etiology, Evaluation and Diagnosis, Treatment Options, Complications, and Prognosis), were presented independently to each AI model. Each response was evaluated and classified into one of five accuracy levels: False, Nonfactual, Minimal Facts, Selected Facts, and Objectively True. Statistical analysis, including Pearson Chi-Square and Fisher's Exact tests, was conducted to determine the relationship between AI model and response accuracy.

Results: ChatGPT-4 produced the highest proportion of Objectively True answers (78.3%), significantly outperforming Gemini (53.0%) and Copilot (20.5%) (p < 0.05). Gemini's responses predominantly consisted of Selected Facts, while Copilot's outputs were largely incomplete or minimally informative. Statistically significant differences in response accuracy were observed across all thematic domains (p < 0.05).

Conclusion: ChatGPT-4 demonstrated superior reliability in delivering accurate and comprehensive information about TMD, though inconsistencies remain in specific areas such as joint anatomy and prognosis. AI models should undergo rigorous validation before being employed in clinical or patient education settings.
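As a minimal sketch of the contingency-table analysis the Methods describe, the snippet below runs a Pearson Chi-Square test on a 3 x 5 table of AI model versus accuracy level. The counts are hypothetical placeholders (the abstract reports only percentages); only the Objectively True column is anchored to the reported 78.3%, 53.0%, and 20.5% of 83 questions.

```python
# Sketch of the abstract's model-vs-accuracy analysis on HYPOTHETICAL counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: ChatGPT-4, Gemini, Copilot.
# Columns: False, Nonfactual, Minimal Facts, Selected Facts, Objectively True.
# Each row sums to the 83 questions; only the last column matches the
# reported percentages (65/83 = 78.3%, 44/83 = 53.0%, 17/83 = 20.5%);
# the other cells are illustrative, not the study's data.
counts = np.array([
    [1,  2,  5, 10, 65],   # ChatGPT-4
    [2,  4, 10, 23, 44],   # Gemini
    [6, 12, 25, 23, 17],   # Copilot
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"Pearson chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4g}")

# Note: scipy.stats.fisher_exact supports only 2x2 tables; an exact test
# on the full 3x5 table (as in the paper) would need another tool, e.g.
# R's fisher.test with simulate.p.value = TRUE.
```

A small p-value here would indicate, as the Results report, that response accuracy is not independent of which AI model produced the answer.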