BMC Oral Health, vol. 26, no. 1, 2026 (SCI-Expanded, Scopus)
Background: With the increasing reliance on artificial intelligence (AI) in healthcare information delivery, it is essential to evaluate the accuracy and reliability of AI-generated responses. This study aimed to assess the quality of responses provided by three AI-based language models (ChatGPT-4, Gemini, and Copilot) on temporomandibular disorders (TMD), a complex and prevalent group of musculoskeletal conditions.

Methods: A total of 83 questions, categorized into seven key domains of TMD (Anatomy, Signs and Symptoms, Etiology, Evaluation and Diagnosis, Treatment Options, Complications, and Prognosis), were presented independently to each AI model. Each response was evaluated and classified into one of five accuracy levels: False, Nonfactual, Minimal Facts, Selected Facts, and Objectively True. Statistical analysis, including Pearson Chi-Square and Fisher's Exact tests, was conducted to determine the relationship between AI model and response accuracy.

Results: ChatGPT-4 produced the highest proportion of Objectively True answers (78.3%), significantly outperforming Gemini (53.0%) and Copilot (20.5%) (p < 0.05). Gemini's responses predominantly consisted of Selected Facts, while Copilot's outputs were largely incomplete or minimally informative. Statistically significant differences in response accuracy were observed across all thematic domains (p < 0.05).

Conclusion: ChatGPT-4 demonstrated superior reliability in delivering accurate and comprehensive information about TMD, though inconsistencies remain in specific areas such as joint anatomy and prognosis. AI models should undergo rigorous validation before being employed in clinical or patient education settings.
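As a minimal sketch of the contingency-table analysis the Methods describe, the snippet below runs a Pearson Chi-Square test on a 3 x 5 table of AI model versus accuracy level. The counts are hypothetical placeholders (the abstract reports only percentages); only the Objectively True column is anchored to the reported 78.3%, 53.0%, and 20.5% of 83 questions.

```python
# Sketch of the abstract's model-vs-accuracy analysis on HYPOTHETICAL counts.
import numpy as np
from scipy.stats import chi2_contingency

# Rows: ChatGPT-4, Gemini, Copilot.
# Columns: False, Nonfactual, Minimal Facts, Selected Facts, Objectively True.
# Each row sums to the 83 questions; only the last column matches the
# reported percentages (65/83 = 78.3%, 44/83 = 53.0%, 17/83 = 20.5%);
# the other cells are illustrative, not the study's data.
counts = np.array([
    [1,  2,  5, 10, 65],   # ChatGPT-4
    [2,  4, 10, 23, 44],   # Gemini
    [6, 12, 25, 23, 17],   # Copilot
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"Pearson chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4g}")

# Note: scipy.stats.fisher_exact supports only 2x2 tables; an exact test
# on the full 3x5 table (as in the paper) would need another tool, e.g.
# R's fisher.test with simulate.p.value = TRUE.
```

A small p-value here would indicate, as the Results report, that response accuracy is not independent of which AI model produced the answer.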