The performance of global large language models (LLMs), trained largely on Western data, on disease-related questions from other settings and languages is unknown. Taking myopia as an example, we compared global and Chinese-domain LLMs in addressing Chinese-specific myopia-related questions.
Global LLMs (ChatGPT-3.5, ChatGPT-4.0, Google Bard, and Llama-2 7B Chat) and Chinese-domain LLMs (Huatuo-GPT, MedGPT, Ali Tongyi Qianwen, Baidu ERNIE Bot, and Baidu ERNIE 4.0) were included. All LLMs were prompted to address 39 Chinese-specific myopia queries across 10 domains. Three myopia experts rated the accuracy of responses on a three-point scale. Responses rated “Good” were further evaluated for comprehensiveness and empathy on a five-point scale. Responses rated “Poor” were re-prompted for self-correction and re-analyzed.
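A minimal sketch of how such rating data could be summarized is shown below. This is illustrative only, not the authors' code: the summed nine-point accuracy score and the mapping of expert ratings to “Good”/“Poor” labels are assumptions made for the example.

from statistics import mean, stdev

def summarize_ratings(ratings_per_query):
    """ratings_per_query: list of (r1, r2, r3) expert ratings, each on a 1-3 scale."""
    totals = [sum(r) for r in ratings_per_query]                     # summed score per query (max 9), an assumed convention
    good = [r for r in ratings_per_query if all(x == 3 for x in r)]  # assumed definition of a "Good" response
    poor = [r for r in ratings_per_query if any(x == 1 for x in r)]  # assumed definition of a "Poor" response
    return {
        "accuracy_mean": mean(totals),
        "accuracy_sd": stdev(totals),
        "good_pct": 100 * len(good) / len(ratings_per_query),
        "n_poor_for_reprompt": len(poor),                            # these would be re-prompted for self-correction
    }

# Example: three hypothetical queries, each rated by three experts
print(summarize_ratings([(3, 3, 3), (3, 2, 3), (1, 2, 1)]))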
The top three LLMs in accuracy were ChatGPT-3.5 (8.72 ± 0.75), Baidu ERNIE 4.0 (8.62 ± 0.62), and ChatGPT-4.0 (8.59 ± 0.93), with the highest proportion of “Good” responses (94.8%). The top five LLMs for comprehensiveness were ChatGPT-3.5 (4.58 ± 0.42), ChatGPT-4.0 (4.56 ± 0.50), Baidu ERNIE 4.0 (4.44 ± 0.49), MedGPT (4.34 ± 0.59), and Baidu ERNIE Bot (4.22 ± 0.74) (all p ≥ 0.059 versus ChatGPT-3.5). For empathy, the top five were ChatGPT-3.5 (4.75 ± 0.25), ChatGPT-4.0 (4.68 ± 0.32), MedGPT (4.50 ± 0.47), Baidu ERNIE Bot (4.42 ± 0.46), and Baidu ERNIE 4.0 (4.34 ± 0.64) (all p ≥ 0.052 versus ChatGPT-3.5). Baidu ERNIE 4.0 received no “Poor” ratings, while the other LLMs demonstrated self-correction capabilities, with improvements ranging from 50% to 100%.
Both global and Chinese-domain LLMs performed effectively in addressing Chinese-specific myopia-related queries. Global LLMs achieved top performance in Chinese-language settings despite being trained primarily on non-Chinese data and in English.
© 2025. The Author(s), under exclusive licence to The Royal College of Ophthalmologists.