Growing patient access to Electronic Health Records (EHRs) necessitates tools that help demystify medical jargon. This paper extends work on NoteAid, an NLP system designed to improve patient understanding of EHR notes by linking clinical terms to lay language. We address a critical gap in previous systems, whose lay definitions for medical jargon lack comprehensiveness. Our study introduces the README dataset, an expert-annotated collection of medical terms paired with lay definitions tailored to a 4th-7th grade reading level. We detail the development of a novel data-centric pipeline that leverages a Human-AI-in-the-loop paradigm to refine the README dataset, involving iterative rounds of AI-assisted data cleaning, augmentation, and validation to ensure high quality.
By comparing different versions of the README dataset, we show that AI-examined high-quality subsets lead to superior performance in NLP models. Additionally, we explore AI-driven synthetic data augmentation strategies and their efficacy when expert-annotated data is limited. Our findings have significant implications for the future of patient-centric EHR interfaces, demonstrating that a focus on data quality substantially enhances the utility of NLP tools in real-world healthcare communication.
Submitted to ACL Rolling Review. This work was done while I was an undergraduate research assistant at the UMass BioNLP Lab.