Unstructured data in electronic health records (EHR) is a rich source of medical information, but its abstraction is labor-intensive. Automated EHR phenotyping (AEP) can reduce the need for manual chart review. We present an AEP model designed to automatically identify patients diagnosed with epilepsy.
The ground truth for model training and evaluation was captured from a combination of structured questionnaires filled out by physicians for a subset of patients and manual chart review using customized software. Modeling features included indicators of the presence of keywords and phrases in unstructured clinical notes, prescriptions of anti-seizure medications (ASMs), International Classification of Diseases (ICD) codes for seizures and epilepsy, number of ASMs and epilepsy-related ICD codes, age, and sex. Data were randomly divided into training (70%) and hold-out testing (30%) sets, with distinct patients in each set. We trained a regularized logistic regression (LR) model and an extreme gradient boosting (XGBoost) model. Model performance was measured using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC), with 95% confidence intervals (CI) estimated via bootstrapping.
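The patient-level split and bootstrapped confidence intervals described above can be illustrated with a minimal sketch. It is not the study's code: the function names (`split_by_patient`, `auroc`, `bootstrap_ci`), the percentile bootstrap, the number of resamples, and the random seeds are all assumptions made for illustration, and only the Python standard library is used. AUROC is computed by pairwise comparison, which is fine for small examples but slow for large cohorts.

```python
import random


def split_by_patient(patient_ids, test_fraction=0.3, seed=0):
    """Assign each distinct patient to exactly one of the train/test sets.

    Hypothetical helper: splitting on patient ID (not on encounter) keeps
    all records of one patient in the same set.
    """
    unique_ids = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_test = round(len(unique_ids) * test_fraction)
    return set(unique_ids[n_test:]), set(unique_ids[:n_test])


def auroc(labels, scores):
    """AUROC as the probability that a random positive outscores a random
    negative, with ties counted as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_ci(labels, scores, metric=auroc, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a metric (assumed variant; the source
    only states that CIs were estimated via bootstrapping)."""
    if len(set(labels)) < 2:
        raise ValueError("both classes must be present")
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [labels[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lost one class; draw again
        stats.append(metric(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

In use, model scores on the hold-out set would be passed as `scores` alongside the chart-review labels; a different metric function with the same `(labels, scores)` signature, such as an AUPRC implementation, could be supplied through the `metric` argument.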
Our study cohort included 3,903 adults drawn from outpatient departments of 9 hospitals between February 2015 and June 2022: mean age 47 ± 18 years, 57% women, 82% White, 84% Non-Hispanic; 70% with epilepsy. The final models used 285 features, of which 246 were keywords and phrases captured from 8,415 encounters. Both models achieved an AUROC and an AUPRC of 1.00 [95% CI 0.99-1.00] in the hold-out testing set.
A machine learning-based AEP approach accurately identifies patients with epilepsy from notes, ICD codes, and ASMs. This model can enable large-scale epilepsy research using EHR databases.