
Assessment of various artificial intelligence applications in responding to technical questions in endodontic surgery


BMC Oral Health volume 25, Article number: 763 (2025)

The objective of this study was to evaluate the performance of ScholarGPT, ChatGPT-4o and Google Gemini in responding to queries pertaining to endodontic apical surgery, a subject that demands advanced specialist knowledge in endodontics.

A total of 30 questions, comprising 12 binary and 18 open-ended queries, were formulated based on information on endodontic apical surgery taken from the well-known endodontic textbook Cohen's Pathways of the Pulp (12th edition). The questions were posed by two different researchers using different accounts on the ScholarGPT, ChatGPT-4o and Gemini platforms. The responses were then coded by the researchers and categorised as 'correct', 'incorrect' or 'insufficient'. The Pearson chi-square test was used to assess the relationships between the platforms.

A total of 5,400 responses were evaluated. Chi-square analysis revealed statistically significant differences in the accuracy of the responses provided by the applications (χ² = 22.61; p < 0.05). ScholarGPT demonstrated the highest rate of correct responses (97.7%), followed by ChatGPT-4o with 90.1%. Gemini exhibited the lowest correct response rate (59.5%) among the applications examined.

ScholarGPT performed better overall on questions about endodontic apical surgery than ChatGPT-4o and Gemini. GPT models based on academic databases, such as ScholarGPT, may provide more accurate information about dentistry. However, additional research should be conducted to develop a GPT model that is specifically tailored to the field of endodontics.


Artificial intelligence (AI), which is at the forefront of modern technological progress, is having a significant impact on our lives today, and its areas of application are expanding. Natural language processing (NLP) is a branch of AI that allows computers to understand and interpret human language [1]. Since November 2022, NLP has reached a wide audience through a class of AI known as large language models (LLMs), which have revolutionised the way information is searched for and retrieved, as they are capable of creating, translating and summarising human-like text [1, 2]. ChatGPT, developed by OpenAI, is the most well-known and popular LLM today [3, 4].

Gemini, launched by Google DeepMind in 2023, is another LLM [5]. It carries out tasks such as text generation, translation and summarisation, as well as creative content production [5, 6]. In May 2024, OpenAI unveiled ChatGPT-4o, the latest update to its intelligent chatbot [7]. GPT-4o is a multimodal LLM that combines several models that understand audio, video and text into a single model [7, 8]. ScholarGPT, a custom GPT built on OpenAI's GPT-4, is a language model that has been trained on academic texts and customised for academic use [9]. ScholarGPT offers important advantages, such as the ability to quickly find sources for academic research and to summarise and analyse articles.

Today, the use of AI in healthcare, as in many other fields, is becoming increasingly important. Although scientific publications on AI in medicine have increased significantly in recent years, the literature on AI in dentistry is still limited [3, 4, 10, 11]. In endodontics, AI is used for various purposes, such as detecting the presence of periapical lesions, calculating working lengths, detecting differences in root and canal morphology, detecting root fractures, assessing pain after endodontic treatment and predicting treatment outcomes [11, 12]. Despite all these developments, dentists and researchers should be aware that AI may have ethical limitations, such as the potential for discrimination and bias and concerns about data privacy and security, as well as technical limitations, such as the provision of incorrect, insufficient or outdated information [13]. The reliability of AI in providing scientific information on various topics requires further evaluation.

The aim of root canal treatment is to clean the infected pulp tissue in the root canal system, shape and hermetically fill the root canal with a biocompatible material and attempt to prevent reinfection [14]. Retreatment is the most commonly recommended solution to unsuccessful root canal treatment. However, in cases in which retreatment is also unsuccessful or not possible, treatments such as endodontic apical surgery or intentional replantation are recommended [15]. Endodontic apical surgery, also known as apicectomy or apical resection, is a surgical procedure performed to preserve a tooth that has not healed following conventional root canal treatment or retreatment. Its primary goal is to prevent tooth loss by resolving persistent periapical pathology. The procedure involves making an incision, elevating a flap, and creating an osteotomy to access and curette the periapical lesion. This is followed by root-end resection, preparation of a retrograde cavity, and sealing of the cavity with a biocompatible filling material [16].

As AI is being used more and more in medicine and dentistry, it is crucial to evaluate its accuracy and reliability. An examination of the extant literature reveals that few studies have sought to determine the capacity of AI chatbots to furnish precise responses to queries in the domain of endodontics [11, 17–23]. In addition, Balel [24] has previously evaluated the performance of ScholarGPT in answering technical questions in the field of oral and maxillofacial surgery by comparing it with ChatGPT. The present study is the first to compare the performance of ScholarGPT with that of other AI chatbots in providing endodontic information. The aim was to evaluate and compare the accuracy of the answers given by the ScholarGPT, Gemini and ChatGPT-4o AI chatbots to questions that dentists may ask about endodontic apical surgery. The null hypothesis of this study was that there would be no significant difference in the accuracy and completeness of information related to endodontic apical surgery provided by ScholarGPT, ChatGPT-4o and Gemini.

This study was conducted in accordance with the Declaration of Helsinki. Since the study exclusively evaluated publicly available AI-generated data and did not involve human participants, it was deemed exempt from ethical approval. A total of 30 questions about endodontic apical surgery (Table 1), comprising 12 dichotomous and 18 open-ended queries, were developed. The list of questions is also provided in Supplementary File 1. The questions were designed to guide professionals in endodontic apical surgery. A systematic approach was used to develop the questions so as to minimise ambiguity in the AI applications' answers. All questions were developed with scientific accuracy and clinical relevance, based on Chapter 11 (Periradicular Surgery) of Cohen's Pathways of the Pulp (12th edition) [25]. The questions focused on practical aspects of endodontic apical surgery, such as applicability, healing stages, complications, materials used, application technique and patient selection. An endodontist and a periodontist evaluated the questions and contributed to their applicability and scientific rigour.

The questions were posed by two different researchers using different accounts on the ScholarGPT, ChatGPT-4o and Gemini platforms between 25 November and 4 December 2024. The questions were asked three times a day (morning, afternoon and evening) for 10 days, and a new conversation was started each time to minimise the influence of previous answers. Thus, a total of 60 answers were obtained for each question. The responses were then coded by the researchers, who categorised them as 'correct', 'incorrect' or 'insufficient'. Each answer was compared with the correct answers provided in Chapter 11 (Periradicular Surgery) of the reference book, Cohen's Pathways of the Pulp. The responses were documented in an Excel spreadsheet (Microsoft, Redmond, WA), and their distribution was analysed. The Pearson chi-square test, a statistical method for evaluating the significance of differences between categorical data, was used to assess the relationships between the platforms. Inter-rater agreement was assessed using Cohen's kappa.
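For illustration, the following minimal Python sketch shows how a Pearson chi-square test of this kind can be run with SciPy on a platform-by-category contingency table. The counts are invented placeholders, not the study's data, and the 3×3 layout (platforms by response categories) is an assumption about how the coded responses were tabulated.

```python
# Minimal sketch of a Pearson chi-square test on coded responses.
# The counts are invented placeholders, NOT the study's data:
# rows = platforms, columns = correct / incorrect / insufficient.
import numpy as np
from scipy.stats import chi2_contingency

counts = np.array([
    [55,  3,  2],   # hypothetical platform A
    [50,  5,  5],   # hypothetical platform B
    [35, 12, 13],   # hypothetical platform C
])

chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p:.4g}")
```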

Table 1 Questions


A total of 5,400 responses were evaluated, 1,800 from each of the AI applications. The chi-square analysis revealed statistically significant differences in the accuracy of the responses provided by the applications (χ² = 22.61; p < 0.05) (Table 2). The analysis demonstrated that 90.1% of ChatGPT-4o's responses were correct, 2.9% were incorrect and 7.1% were insufficient. Gemini's responses were 59.5% correct, 19.4% incorrect and 21.1% insufficient. ScholarGPT demonstrated superior performance compared to the other two applications, with 97.7% of its responses correct, 1.2% incorrect and 1.1% insufficient (Fig. 1). The weighted kappa value for inter-rater agreement was 0.85.
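As a companion sketch, a weighted Cohen's kappa like the one reported above could be computed with scikit-learn's cohen_kappa_score; the two raters' codings and the ordinal encoding of the three categories below are invented for illustration, not taken from the study.

```python
# Minimal sketch of a weighted Cohen's kappa between two raters.
# The ratings are invented; the ordinal coding
# (0 = incorrect, 1 = insufficient, 2 = correct) is an assumption.
from sklearn.metrics import cohen_kappa_score

rater_1 = [2, 2, 1, 0, 2, 2, 1, 2, 0, 2]
rater_2 = [2, 2, 1, 0, 2, 1, 1, 2, 0, 2]

# 'linear' weights penalise disagreements by their ordinal distance
kappa = cohen_kappa_score(rater_1, rater_2, weights="linear")
print(f"weighted kappa = {kappa:.2f}")
```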

Table 2 Distribution and comparison of the accuracy of responses of the AI applications


Fig. 1 Distribution of answers produced by the AI applications

Substantial evidence suggests AI has undergone rapid development in recent years and will become a widely used tool in modern dentistry in the near future [19, 26, 27]. It is imperative to acknowledge that the use of AI in dentistry is still in its developmental stage, and the benefits it offers vary according to the particular use case and application. The use of non-expert educational data, the potential risks associated with the use of outdated information, and the ethical and legal concerns surrounding patient confidentiality require careful consideration [28].

The most prevalent and well-regarded AI products in contemporary use are those that fall under the category of language models, which use NLP algorithms [4]. Language models, including prominent examples such as ChatGPT, Gemini and Meta LLaMA, provide users with the benefit of AI access without the need for advanced technological expertise [29]. Nevertheless, while multimodal LLMs have made considerable progress in various domains, further research is necessary due to their current limitations, particularly in medical and dental research [30]. Therefore, the present study evaluated the capacity of the ChatGPT-4o, Gemini and ScholarGPT platforms to address enquiries concerning endodontic apical surgery, which requires advanced specialist knowledge in endodontics and represents a challenging subject to manage clinically and theoretically.

ChatGPT-4o and Gemini were chosen for this study primarily because their multimodal structures facilitate the management of health problems and because they are the most widely used and easily accessible AI chatbots today. ScholarGPT is an AI model developed for academic and scientific use [9]. It performs functions such as analysing and summarising articles, producing texts that comply with conventions of academic language and providing field-specific academic information. In this study, we evaluated the accuracy rate of ScholarGPT’s responses by comparing them with the AI applications Gemini and ChatGPT-4o. To our knowledge, this is the first academic study to evaluate the performance of ScholarGPT in the field of endodontics.

It has been argued that the acceptable limit of accuracy for artificial intelligence applications should be above 90% in order to ensure safety and efficacy [31, 32]. In this study, the answers given by the AI chatbots were coded as 'insufficient' if they were neither completely correct nor completely incorrect according to Cohen's Pathways of the Pulp, one of the most important resources in the field of endodontics. According to our findings, ScholarGPT exceeded the accuracy threshold and achieved the highest correct response rate, with 97.7% correct responses, while ChatGPT-4o followed with a 90.1% correct response rate. Gemini exhibited the lowest correct response rate by far (59.5%) among the chatbots examined. Therefore, the null hypothesis was rejected.

In the present study, ChatGPT-4o demonstrated superior accuracy and performance in comparison to Gemini. This finding is consistent with a study by Doshi et al. [33], who compared Gemini and ChatGPT in the domain of radiology. Quah et al. [34] evaluated the accuracy of the GPT-4, GPT-3.5, Llama 2, Gemini and Copilot chatbots in answering multiple-choice questions in the field of oral and maxillofacial surgery. Their study reported that GPT-4 demonstrated the highest performance with 76.8% accuracy, followed by Copilot with 72.6%, GPT-3.5 with 62.2%, Gemini with 58.7% and Llama 2 with 42.5%. In addition, Ekmekci and Durmazpinar [19] evaluated the accuracy of responses by the Gemini, ChatGPT-4o and ChatGPT-4 (with PDF plugin) chatbots to questions posed by dentists about regenerative endodontic treatment. The findings revealed that ChatGPT-4 with a PDF plugin exhibited the highest accuracy, with a correct response rate of 98.1%, while ChatGPT-4o demonstrated an accuracy rate of 86.2% and Gemini exhibited the lowest accuracy at 48%. These findings support the results of our study.

In their study investigating the accuracy and consistency of responses to dichotomous (yes/no) endodontic questions of three difficulty levels posed to ChatGPT and human experts, Suarez et al. [18] found that ChatGPT's responses exhibited an accuracy rate of 57.33% and that accuracy varied significantly with question difficulty. Ozden et al. [21] asked ChatGPT and Gemini dichotomous (yes/no) questions about dental trauma over 10 days and found that both applications answered 57.5% of the questions correctly. We believe that the higher accuracy rate of ChatGPT in our study compared with these two studies is because the version used here (ChatGPT-4o) has a more advanced database.

In a study [25] evaluating Gemini’s performance in answering questions about diagnosing and treating dental problems in endodontics, Gemini’s responses were reported to be accurate 37.11% of the time. This level of accuracy is evidently inadequate. Furthermore, it is imperative to acknowledge that medical information provided by ChatGPT and Gemini does not constitute academic knowledge. The information provided should not be regarded as a substitute for medical advice.

In the inaugural study [24] conducted to appraise the performance of ScholarGPT in the domain of healthcare, Balel used a modified Global Quality Scale to evaluate the responses of ScholarGPT. The study reported that ScholarGPT exhibited strong performance compared to ChatGPT in addressing technical inquiries related to oral and maxillofacial surgery. These findings are consistent with the results of the present study. Based on these results, we can say that GPT models developed on the basis of academic databases can provide more accurate and reliable information. As no other studies have evaluated the performance of ScholarGPT in medicine and dentistry, a comparison of the present findings with the results of previous studies is not possible.

ScholarGPT is capable of retrieving information from a variety of databases, including Google Scholar, PubMed and arXiv [9]. The insufficient or incorrect results in ScholarGPT's responses may be due to the fact that, for some articles in these databases, only the abstracts, and not the full texts, are used [24]. Another reason may be that ScholarGPT does not have full access to the most important and popular databases in medical science, such as Elsevier, Wiley, Nature, Oxford Academic, Scopus, Web of Science and Cambridge University Press, which are reliable sources of evidence-based information. The development of an academic GPT model tailored to the field of endodontics would be a significant advancement. However, it is imperative to consider the potential legal and ethical implications of such a model [23, 27].

In the present study, the accuracy of the AI applications was evaluated using both open-ended and dichotomous (yes/no) questions. The decision not to employ only a yes/no format was motivated by the need to mirror the multidimensional nature of clinical practice [35]. This is a significant strength of our study. Although temporal variation was addressed by having two different researchers simultaneously ask the same questions three times a day for 10 days, this study does have some limitations. First, the study appraised the performance of three AI applications in the context of endodontic apical surgery only; other endodontic topics were not evaluated. Subsequent studies encompassing a more extensive range of subjects and a greater number of questions are needed to evaluate the performance of AI applications in the domain of endodontics. A further limitation is that the responses provided by the AI applications were not compared with those of general dentists and endodontists. Such a comparison would provide valuable information about the performance of AI applications, and further research in this area is recommended. Finally, the AI applications evaluated in the current study are chatbots intended for a general audience, not specifically trained in the field of endodontics, so their responses may contain certain biases. This is another limitation of our study.

In conclusion, ScholarGPT demonstrated strong performance in the domain of endodontic apical surgery, exhibiting a higher accuracy rate than ChatGPT-4o and Gemini. However, none of the three applications is entirely reliable, necessitating caution during their use. These applications should be regarded as an adjunct to clinical knowledge and experience. GPT models based on academic databases have the potential to provide more accurate and reliable information in the medical field. In addition, the development of a dedicated GPT model for the field of endodontics, with full access to the most important and popular databases in medical science, could provide higher quality and more accurate information.

Availability of data and materials

The datasets used and/or analysed during the current study are available from the corresponding author upon reasonable request.

None declared.

The authors did not receive any funding for this study.

    Authors

    1. Sevda Durust Baris
    2. Kubilay Baris


    Author contributions: design of the work: S.D.B., K.B.; data acquisition: S.D.B., K.B.; interpretation of data: S.D.B.; drafting of the work: S.D.B.; revision: K.B. All authors have read and approved the final version of the manuscript.

    Correspondence to Sevda Durust Baris.

    Not applicable.

    The authors declare no competing interests.

    Abbreviations

    AI Artificial intelligence.

    CBCT Cone-beam computerized tomography.

    ChatGPT Chat Generative Pre-trained Transformer.

    GPT Generative Pre-trained Transformer.

    LLM Large language model.

    MTA Mineral trioxide aggregate.

    NSAIDs Non-steroidal anti-inflammatory drugs.

    NLP Natural language processing.

    Ethics approval and consent to participate

    Not applicable.

    Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


    Baris, S.D., Baris, K. Assessment of various artificial intelligence applications in responding to technical questions in endodontic surgery. BMC Oral Health 25, 763 (2025). https://doi.org/10.1186/s12903-025-06149-1
