Evaluating Machine Translation Systems in Emergency Rooms
Millions of people use language translation tools daily, even if the technology—known as machine translation—is sometimes unreliable and prone to errors. While garbled translations may seem like nothing more than a minor inconvenience at times, in high-stakes settings like a hospital emergency room, an incorrect translation for discharge instructions or medication protocols could have life-threatening consequences.
Researchers from the University of Maryland’s Computational Linguistics and Information Processing (CLIP) Lab looked into this problem, studying data collected from English-to-Chinese machine translation systems used in emergency rooms at the University of California, San Francisco.
For their study, the CLIP team reviewed data from 65 English-speaking physicians who had been split into two groups to evaluate two distinct methods for assessing the quality of machine-generated translations used for Chinese-speaking patients.
The first group of physicians used a quality estimation tool, AI-driven software that automatically predicts the accuracy of a machine translation output. According to the researchers, this tool helped doctors rely on machine translation more appropriately, guiding them to show patients only the translations it rated as “good.” But the tool was not perfect: it failed to flag some critical errors that could harm a patient’s health.
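To make the idea concrete, here is a minimal quality-estimation sketch. It assumes the open-source Unbabel COMET library and its reference-free wmt22-cometkiwi-da checkpoint (which may require accepting a model license on Hugging Face); the model choice, the example sentence, and the 0.80 threshold are illustrative assumptions, not the tool used in the study.

```python
# Sketch: reference-free quality estimation with the Unbabel COMET library.
# Model name and threshold are illustrative, not the study's actual setup.
from comet import download_model, load_from_checkpoint

# Download and load a quality estimation model that scores a translation
# from the source and MT output alone (no reference translation needed).
model_path = download_model("Unbabel/wmt22-cometkiwi-da")
qe_model = load_from_checkpoint(model_path)

samples = [
    {
        "src": "Take one tablet twice daily with food.",
        "mt": "随餐服用一片，每日两次。",
    }
]

# predict() returns segment-level scores (roughly 0-1, higher = better).
output = qe_model.predict(samples, batch_size=8, gpus=0)

# Only surface translations scoring above an illustrative threshold;
# anything below it gets routed to a human interpreter instead.
THRESHOLD = 0.80
for sample, score in zip(samples, output.scores):
    verdict = "show to patient" if score >= THRESHOLD else "flag for human review"
    print(f"QE score {score:.2f} -> {verdict}")
```

The key design point is that the score is computed without a reference translation, which is what makes quality estimation usable at the point of care, where no gold-standard Chinese text exists.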
The second set of doctors used a technique known as backtranslation, in which the user feeds the Chinese output back into Google Translate and compares the resulting English with the original text. For the doctors who used this method, the researchers observed complementary trends: backtranslation did not improve their ability to assess translation quality on average, but it did help them identify clinically critical errors that the quality estimation tool failed to flag.
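Backtranslation is simple to sketch in code. The snippet below uses the open Helsinki-NLP MarianMT models from Hugging Face Transformers as a stand-in for Google Translate (an assumption made so the example is self-contained): translate English to Chinese, then translate the Chinese back to English so the original and the round trip can be compared side by side.

```python
# Sketch: backtranslation for sanity-checking a machine translation.
# The study's physicians used Google Translate; open MarianMT models
# stand in here so the example runs without an external API.
from transformers import pipeline

en_to_zh = pipeline("translation", model="Helsinki-NLP/opus-mt-en-zh")
zh_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-zh-en")

source = "Return to the emergency room if your fever exceeds 39 degrees Celsius."

# Forward translation that would be shown to the patient.
chinese = en_to_zh(source)[0]["translation_text"]

# Backtranslation: feed the Chinese output back through a zh->en system so an
# English-speaking physician can compare it against the original instruction.
roundtrip = zh_to_en(chinese)[0]["translation_text"]

print("Original:       ", source)
print("Chinese output: ", chinese)
print("Backtranslation:", roundtrip)
# A large meaning shift in the round trip (e.g., a changed temperature or
# dosage) signals a critical error, though a fluent backtranslation does not
# guarantee the Chinese output is correct.
```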
The CLIP team believes their study paves the way for future work on combining the strengths of the two approaches they tested, yielding a human-centered evaluation design that can further improve machine translation tools used in clinical settings.
“Our study confirms that lay users often trust AI systems even when they should not, and that the strategies that people develop on their own to decide whether to trust an output—such as backtranslation—can be misleading,” says Marine Carpuat, an associate professor of computer science who helped author the study. “However, we show that AI techniques can also be used to provide feedback that helps people calibrate their trust in systems. We view this as a first step toward developing trustworthy AI.”
Sweta Agrawal, a co-author on the study who graduated with her Ph.D. in computer science in 2023 and is now a postdoctoral fellow at the Instituto de Telecomunicações in Portugal, says that the project is important for many reasons.
“This work provides support for the usefulness of providing actionable feedback to users in high-risk scenarios,” she says. “Moreover, the findings contribute to the ongoing research efforts to design reliable metrics, especially for critical domains like health care.”
The team’s paper, “Physician Detection of Clinical Harm in Machine Translation: Quality Estimation Aids in Reliance and Backtranslation Identifies Critical Errors,” recently won an outstanding paper award at the Conference on Empirical Methods in Natural Language Processing (EMNLP 2023), held from December 6–10 in Singapore.
Agrawal and Carpuat collaborated on the paper with UMD co-authors Ge Gao, an assistant professor of information studies; Yimin Xiao, a third-year information studies doctoral student; and researchers from the University of California (UC) Berkeley and UC San Francisco. It was one of 26 papers accepted to EMNLP 2023 that were co-authored by researchers in CLIP.
Another CLIP team, made up of Associate Professor of Computer Science Jordan Boyd-Graber and undergraduate computer science students Anaum Khan and Sander Schulhof, won a best paper award for their study exposing systemic vulnerabilities of large language models. To collect their data, they coordinated a global “prompt hacking” competition, soliciting more than 600,000 adversarial prompts against three state-of-the-art large language models.
Other accolades from EMNLP 2023 include fifth-year computer science doctoral student Alexander Hoyle winning a best reviewer award, and Carpuat receiving a best area chair award.
Boyd-Graber, Carpuat and Gao all have appointments in the University of Maryland Institute for Advanced Computer Studies and are members of the Institute for Trustworthy AI in Law & Society (TRAILS).
Carpuat and Gao were recently awarded TRAILS seed grant funding for a new project that seeks to understand how people perceive the outputs of language translation systems. Based on their findings, the researchers will develop new techniques to help people use these imperfect systems more effectively.
This article was published by the University of Maryland Institute for Advanced Computer Studies.