Scientific Accuracy of Large Language Models in Tilted Implant Dentistry: A Guideline-Based Comparative Evaluation

YILDIZ, MEHMET; ALKAP, MELEK; Özdal, Umut; ÖZDAL ZİNCİR, ÖZGE

doi:10.1097/scs.0000000000012768

YILDIZ M. S., ALKAP M., Özdal U., ÖZDAL ZİNCİR Ö.

Journal of Craniofacial Surgery, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Publication Date: 2026
DOI Number: 10.1097/scs.0000000000012768
Journal Name: Journal of Craniofacial Surgery
Indexes Covering the Journal: Science Citation Index Expanded (SCI-EXPANDED), Scopus, MEDLINE
Keywords: Artificial intelligence, implant dentistry, large language models, tilted dental implants
Affiliated with İstanbul Üniversitesi-Cerrahpaşa: Yes

Abstract

Tilted dental implant systems are widely used in the rehabilitation of anatomically compromised jaws and are supported by international consensus guidelines. Concurrently, large language models (LLMs) are increasingly accessed as informational tools in implant dentistry; however, their scientific accuracy and adherence to guideline-based principles in advanced implant concepts remain insufficiently explored. This study evaluated the scientific accuracy, guideline conformity, and clinical consistency of responses generated by 4 LLMs regarding tilted dental implant systems. A total of 120 guideline-based questions covering 8 predefined domains (definition, indications, contraindications, advantages, surgical procedure content, prosthetic procedure content, complications, and prognosis/survival) were developed in accordance with ITI, EAO, and AAOMS consensus reports. Each question was independently submitted to ChatGPT-5.2, Copilot, DeepSeek, and Gemini, and all responses were anonymized and evaluated by a multidisciplinary expert panel using a structured ordinal scoring system. Overall, scientific accuracy scores were high across all models, with near-ceiling performance observed in domains related to indications, advantages, procedural content, and prognosis. Statistically significant between-model differences were identified in the definition (P=0.003), contraindications (P=0.006), and complications (P<0.001) domains, with DeepSeek and Gemini demonstrating consistently higher scores in complication-related content compared with ChatGPT and Copilot. Within-model analyses further revealed significant domain-dependent variability across all LLMs. Although LLMs demonstrate a strong capacity to reproduce established, guideline-based knowledge regarding tilted implant systems, limitations remain in safety-critical domains requiring nuanced clinical judgment. Accordingly, LLMs should be regarded as adjunctive educational tools rather than substitutes for expert decision-making in craniofacial implantology.
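
The abstract reports between-model comparisons of ordinal scores within each domain but does not name the statistical test used. As a minimal sketch only, the Python below shows how such a comparison could be run with a tie-corrected Kruskal-Wallis test, a common choice for ordinal data from several groups. The test choice, the 1 to 5 scale, the function names, and the toy scores are all assumptions made for illustration; none of them are taken from the study.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Tie-corrected Kruskal-Wallis H statistic for a list of score lists."""
    pooled = list(chain.from_iterable(groups))
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12.0 / (n * (n + 1)) * total - 3 * (n + 1)
    # Tie correction: 1 - sum(t^3 - t) / (n^3 - n) over tied-value counts t.
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    correction = 1 - sum(c ** 3 - c for c in counts.values()) / (n ** 3 - n)
    if correction == 0:
        return 0.0  # every score is identical, so there is no difference to test
    return h / correction


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-square variable with 3 degrees of freedom.

    With 4 models the H statistic is referred to chi-square with 4 - 1 = 3 df,
    for which the survival function has this closed form.
    """
    if x <= 0:
        return 1.0
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


# Hypothetical scores (assumed 1-5 scale) for one domain; NOT data from the study.
toy_scores = {
    "model_a": [3, 4, 3, 4, 3],
    "model_b": [4, 3, 4, 3, 4],
    "model_c": [5, 5, 4, 5, 5],
    "model_d": [5, 4, 5, 5, 5],
}
h_stat = kruskal_wallis_h(list(toy_scores.values()))
p_value = chi2_sf_df3(h_stat)
print(f"H = {h_stat:.3f}, approximate p = {p_value:.4f}")
```

The chi-square approximation is rough for small groups, and a significant result would normally be followed by pairwise post hoc comparisons with a multiplicity correction. The within-model, across-domain analysis mentioned in the abstract would need its own test, which is likewise not specified here.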