Existing unstructured anatomic pathology reports would directly benefit from a novel word disambiguation approach under development at MIT
Unstructured medical laboratory data is widely recognized to be one significant hurdle on the path toward the universal electronic health record (EHR). This is particularly true for anatomic pathology reports. Despite advances in synoptic reporting, to date, few pathology groups and clinical laboratories have developed ways to resolve this problem.
Now there is news of a different approach toward unstructured healthcare data. Researchers at the Massachusetts Institute of Technology (MIT) have developed a system for algorithmically distinguishing words with multiple possible meanings. The new approach could help find useful information buried in electronic health records (EHRs).
New Approach Could Help With Unstructured Healthcare Data
This fundamentally new approach to word disambiguation demonstrated significantly higher accuracy than previous systems, according to an MIT news release. It boasts an average accuracy rate of 75% in disambiguating words with two senses. This represents a marked improvement over the average 63% achieved using previous methods, a level that is too low to be useful, the release stated.
“[W]hat we tried instead was something that’s been tried before in the general domain, but never in the biomedical or clinical domains,” stated Anna Rumshisky, Ph.D., Assistant Professor, Department of Computer Science, MIT, in the release. Rumshisky is also Research Affiliate, Clinical Decision-Making Group, Computer Science and Artificial Intelligence Laboratory at MIT. She helped lead the new research. The novel approach could lead to more accurate systems that require less human intervention, she observed.
Topic Modeling as One Solution to Unstructured Medical Data
Rumshisky and her colleagues adapted algorithms from a research area known as topic modeling. Topic modeling seeks to automatically identify the topics of documents by inferring relationships among prominently featured words.
She explained the different approach. “The twist on it that we’re trying to transpose from the general domain is to treat occurrences of a target word as documents and to treat senses as hidden topics that we’re trying to infer,” she stated in the release.
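To make the idea in that quote concrete, here is a minimal sketch of sense induction with a generic topic model. It is not the MIT team's implementation: it uses a plain collapsed Gibbs sampler for latent Dirichlet allocation, and the function name `induce_senses`, the hyperparameters, and the toy contexts around the word "discharge" are all assumptions made up for illustration. Each context window around the target word plays the role of a "document," and each hidden topic stands in for one candidate sense.

```python
import random
from collections import defaultdict

# Toy, made-up contexts for the ambiguous word "discharge" (target word removed).
TOY_CONTEXTS = [
    ["purulent", "wound", "drainage", "yellow"],
    ["nasal", "clear", "drainage", "wound"],
    ["home", "instructions", "follow", "clinic"],
    ["summary", "home", "medications", "clinic"],
]

def induce_senses(contexts, num_senses=2, iters=200, alpha=0.1, beta=0.1, seed=0):
    """Return, for each context, an estimated distribution over hidden senses.

    Assumption: a standard LDA collapsed Gibbs sampler, where each context of
    the target word is treated as a document and each topic as a sense.
    """
    rng = random.Random(seed)
    vocab_size = len({w for ctx in contexts for w in ctx})

    # Randomly assign a sense to every context word, then build count tables.
    z = [[rng.randrange(num_senses) for _ in ctx] for ctx in contexts]
    ctx_sense = [[0] * num_senses for _ in contexts]
    sense_word = [defaultdict(int) for _ in range(num_senses)]
    sense_total = [0] * num_senses
    for d, ctx in enumerate(contexts):
        for i, w in enumerate(ctx):
            k = z[d][i]
            ctx_sense[d][k] += 1
            sense_word[k][w] += 1
            sense_total[k] += 1

    for _ in range(iters):
        for d, ctx in enumerate(contexts):
            for i, w in enumerate(ctx):
                # Remove the current assignment from the counts.
                k = z[d][i]
                ctx_sense[d][k] -= 1
                sense_word[k][w] -= 1
                sense_total[k] -= 1
                # Resample a sense in proportion to how well it fits both
                # this context and this word.
                weights = [
                    (ctx_sense[d][s] + alpha)
                    * (sense_word[s][w] + beta)
                    / (sense_total[s] + vocab_size * beta)
                    for s in range(num_senses)
                ]
                k = rng.choices(range(num_senses), weights=weights)[0]
                z[d][i] = k
                ctx_sense[d][k] += 1
                sense_word[k][w] += 1
                sense_total[k] += 1

    # Smoothed per-context sense proportions.
    return [
        [(ctx_sense[d][s] + alpha) / (len(ctx) + num_senses * alpha)
         for s in range(num_senses)]
        for d, ctx in enumerate(contexts)
    ]

if __name__ == "__main__":
    for ctx, dist in zip(TOY_CONTEXTS, induce_senses(TOY_CONTEXTS)):
        print(ctx, [round(p, 2) for p in dist])
```

The induced senses are unlabeled clusters. Mapping a cluster to a meaning such as "bodily secretion" or "release from hospital" would still require a small amount of annotated data or a human reviewer.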
Potentially Useful Clinical Information Is Buried in Clinical Notes
“For years, pathologists dictated reports as large blocks of text and valuable information was locked inside those reports,” observed Michael J. Becich, M.D., Ph.D., in a previous Dark Daily, titled “Pathologist Becich IDs three most important developments in pathology informatics”.
Becich observed that there is growing use in pathology of synoptic pathology reports, where all the data is structured. Synoptic pathology reporting uses an electronic report that includes a discrete data field format. This arrangement allows computers to search the database of reports in useful ways.
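As a rough illustration of why discrete data fields make reports searchable, consider the sketch below. The record layout, the field names, and the `find_reports` helper are hypothetical and are not drawn from any real synoptic reporting standard or laboratory system. The sample records are invented.

```python
from dataclasses import dataclass

@dataclass
class SynopticReport:
    # Hypothetical discrete fields; real synoptic templates define their own.
    report_id: str
    specimen_site: str
    histologic_type: str
    tumor_size_mm: float
    margins_involved: bool

def find_reports(reports, **criteria):
    """Return reports whose fields exactly match every given criterion."""
    return [
        r for r in reports
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]

# Invented sample data, for illustration only.
SAMPLE_REPORTS = [
    SynopticReport("R1", "colon", "adenocarcinoma", 32.0, False),
    SynopticReport("R2", "breast", "ductal carcinoma", 18.5, True),
    SynopticReport("R3", "colon", "adenocarcinoma", 45.0, True),
]

if __name__ == "__main__":
    hits = find_reports(SAMPLE_REPORTS, specimen_site="colon", margins_involved=True)
    print([r.report_id for r in hits])
```

A free-text report offers no such field to filter on, which is why a dictated narrative has to be interpreted before it can be queried in this way.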
As noted in the MIT release, there could be a goldmine of information still buried in physicians’ notes. For instance, they could contain hidden correlations between symptoms, treatments, and outcomes. Or, they may indicate patients who would be good candidates for clinical trials to test new drugs.
“About 80% of clinical information is buried in clinical notes,” observed Hongfang Liu, Ph.D., Associate Professor of Medical Informatics, Mayo Clinic, in the news release. “A lot of words or phrases are ambiguous there. So in order to get the correct interpretation, you need to go through the word-disambiguation phase.”
New Approach More Effectively Distinguishes Word Meaning
Word-sense disambiguation (WSD) is one of the challenges in extracting data from unstructured text. As reported in the release, when computers can infer the intended meanings of words, it enables them to find useful patterns in vast quantities of data.
This is where Rumshisky’s work is useful. Her team’s algorithm identifies correlations not only between words but also between words and other textual “features,” the release reported. These include the words’ syntactic roles. As an example, if the ambiguous word “discharge” is preceded by an adjective, it likely refers to a bodily secretion, rather than to an administrative event.
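The "discharge" example can be sketched as a small feature extractor that emits both neighboring words and one syntactic cue. This is only an illustration under stated assumptions: the `extract_features` function, the window size, and the tiny `TOY_POS` lookup table are invented here, and a real system would use a trained part-of-speech tagger rather than a hand-written dictionary. The release does not describe the team's actual feature set in this detail.

```python
# Hypothetical, hand-made part-of-speech lookup; a stand-in for a real tagger.
TOY_POS = {
    "purulent": "ADJ", "bloody": "ADJ", "clear": "ADJ",
    "the": "DET", "after": "ADP", "before": "ADP",
    "patient": "NOUN", "wound": "NOUN",
}

def extract_features(tokens, target="discharge", window=3):
    """Build features for the first occurrence of `target` in `tokens`.

    Features are the surrounding words within `window` positions, plus the
    part of speech of the word immediately before the target.
    Raises ValueError if the target word is absent.
    """
    tokens = [t.lower() for t in tokens]
    idx = tokens.index(target)

    feats = []
    start = max(0, idx - window)
    stop = min(len(tokens), idx + window + 1)
    for j in range(start, stop):
        if j != idx:
            feats.append("w=" + tokens[j])

    # Syntactic cue: is the target preceded by an adjective?
    if idx == 0:
        feats.append("prev_pos=NONE")
    else:
        feats.append("prev_pos=" + TOY_POS.get(tokens[idx - 1], "UNK"))
    return feats

if __name__ == "__main__":
    print(extract_features("Purulent discharge from the wound".split()))
    print(extract_features("Instructions given before discharge".split()))
```

Features like `prev_pos=ADJ` can be fed to a sense model alongside the plain context words, so that the syntactic cue contributes evidence in the same way a neighboring word does.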
The researchers are planning to incorporate listings in the Unified Medical Language System (UMLS) into their existing algorithm. The UMLS is a huge thesaurus of medical terms, compiled by the National Institutes of Health.
The work of Rumshisky and her colleagues could prove useful in searching anatomic pathology reports to identify clinically relevant information. Even if pathologists in the United States were to adopt some form of synoptic reporting in future years, the large volume of unstructured anatomic pathology reports already in medical archives and EHRs would still exist, creating a continuing need for this type of word disambiguation capability.
Pathologists and clinical laboratory managers can expect this new WSD system to have direct application in anatomic pathology once it is fully developed and ready for prime time.
—Pamela Scherer McLeod