ABEE: automated bio entity extraction from biomedical text documents
DOI | https://doi.org/10.1108/DTA-04-2022-0151 |
Published date | 21 April 2023 |
Date | 21 April 2023 |
Pages | 222-244 |
Subject Matter | Library & information science,Librarianship/library management,Library technology,Information behaviour & retrieval,Metadata,Information & knowledge management,Information & communications technology,Internet |
Author | Ashutosh Kumar,Aakanksha Sharaff |
Ashutosh Kumar and Aakanksha Sharaff
Department of Computer Science and Engineering, National Institute of Technology
Raipur, Raipur, India
Abstract
Purpose – The purpose of this study was to design a multitask learning model so that biomedical entities
can be extracted from biomedical texts without ambiguity.
Design/methodology/approach – The proposed automated bio entity extraction (ABEE) model is
a multitask learning model built by combining single-task learning models. Each single-task learning model
was trained using Bidirectional Encoder Representations from Transformers (BERT). The outputs of these
models were then combined so that a variety of entities can be found in biomedical text.
Findings – The proposed ABEE model targeted unique gene/protein, chemical and disease entities in
biomedical text. The finding is important for biomedical research such as drug discovery and clinical
trials. This research helps to reduce not only the effort of researchers but also the cost of new drug
discoveries and new treatments.
Research limitations/implications – No limitations of the model have been identified as such, but the
research team plans to test the model with gigabytes of data and establish a knowledge graph so that
researchers can easily estimate the entities of similar groups.
Practical implications – As far as practical implications are concerned, the ABEE model will be helpful in
various natural language processing tasks. In information extraction (IE), it plays an important role in
biomedical named entity recognition and biomedical relation extraction, and it also supports information
retrieval tasks such as literature-based knowledge discovery.
Social implications – During the COVID-19 pandemic, the demand for this type of work
increased because of the increase in clinical trials at that time. If this type of research had been
introduced earlier, it would have reduced the time and effort needed for new drug discoveries in this area.
Originality/value – In this work, we proposed a novel multitask learning model that is capable of extracting
biomedical entities from biomedical text without any ambiguity. The proposed model achieved
state-of-the-art performance in terms of precision, recall and F1 score.
Keywords Biomedical entity extraction, Neural network, Single-task learning, Multitask learning,
Bio data mining
Paper type Research paper
1. Introduction
Biomedical named entity recognition (BioNER) is a critical task when it comes to obtaining
biomedical insights from unstructured biomedical texts. In recent research, BioNER has played
a vital role in the identification of biological entities and their associated extraction tasks.
The authors gratefully acknowledge the Department of Computer Science and Engineering of the National
Institute of Technology Raipur for providing infrastructure and facilities necessary for this work.
Funding: This research is not funded by any financial institution.
Authors’ contributions: A.K. and A.S. hypothesized and designed the idea of the ABEE model. A.K.
developed ABEE. A.K. and A.S. experimented and analyzed the results. A.S., as the supervisor of A.K.,
guided this research work. All authors read the final manuscript carefully and approved it.
Availability of data: All the corpora are openly licensed and available at
https://github.com/cambridgeltl/MTL-Bioinformatics-2016/tree/master/data and
https://github.com/SKumarAshutosh/ABEE.
Declaration of competing interests: The authors declare that they have no competing interests.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 11 April 2022
Revised 2 September 2022
Accepted 19 September 2022
Data Technologies and
Applications
Vol. 57 No. 2, 2023
pp. 222-244
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-04-2022-0151
The key task of BioNER is to recognize biological entities such as genes, proteins, chemicals,
symptoms and diseases. Most research has focused only on extracting biomedical named entities
because most biomedical systems depend heavily on these entities, and direct access to such
biomedical information is possible only after BioNER. Building such a BioNER system is also
very difficult because of the richness of biomedical literature. For training and evaluation,
a highly accurate BioNER system requires manually annotated biomedical data. Most annotated
biomedical datasets have been developed and released openly for BioNER research. Basically,
BioNER is the task of extracting biomedical entities from medical text documents. Earlier studies
investigating BioNER fall under three categories: statistical machine learning, dictionary and
rule-based methods (Wang et al., 2018; Yao et al., 2015). Rule-based methods are highly dependent
on a variety of separate class rules. Rule-based approaches can be defined simply as “Just go
ahead and write the rules” (Chiticariu et al., 2013). Many handcrafted and heuristic rules were
used to identify the combination of named entities and their context in earlier rule-based
systems (Li et al., 2020a). These techniques were predominant in early, as well as recent, BioNER
systems (Alfred et al., 2014). Although listing all the structural rules of a BioNER model is a
very important task, handcrafted techniques of this magnitude always entail a high cost of
system engineering.
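To make the rule-based idea concrete, the following is a minimal sketch and not a method taken from any of the cited systems. The two regular-expression rules, the labels and the helper name `rule_based_ner` are assumptions introduced only for illustration.

```python
import re

# Hypothetical handcrafted rules, written only for illustration.
RULES = [
    # Gene-like symbols: 2-5 capital letters followed by one or more digits.
    ("GENE", re.compile(r"\b[A-Z]{2,5}\d+\b")),
    # Disease-like mentions: a word ending in a common disease suffix.
    ("DISEASE", re.compile(r"\b\w+(?:itis|oma|emia)\b", re.IGNORECASE)),
]


def rule_based_ner(text):
    """Return (start, end, label, surface) tuples for every rule match."""
    found = []
    for label, pattern in RULES:
        for match in pattern.finditer(text):
            found.append((match.start(), match.end(), label, match.group()))
    return sorted(found)


# Both mentions are covered by the rules above.
entities = rule_based_ner("Mutations in BRCA1 are linked to carcinoma.")

# "p53" does not fit the capital-letter rule, so it is missed.
missed = rule_based_ner("p53 regulates apoptosis.")
```

Every entity pattern that the rules do not anticipate, such as the lowercase symbol in the second sentence, needs another handwritten rule, which is where the engineering cost comes from.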
Dictionary-based approaches are considered the basic approach, and they are highly
dependent on existing biological vocabularies and lexicons. Generally, the dictionary-based
method is used to identify biomedical entities hidden in the text: if a term in a list
matches a word or group of words in a document, it is identified as an entity. Their
performance and simplicity make these systems widely usable. Though this method
is found to be extremely reliable, it has weak recall. A BioNER framework based on dictionary
methods can extract from biological text only those biological entities that are described in
a dictionary. Such methods cannot handle biological entities that are not present in the
dictionary, which usually causes low recall (Tasneem and Archana, 2016). Tuason et al. (2004)
reported that spelling errors and character- and word-level differences caused low recall.
The problem with such a vocabulary is that it is of fixed size. New terms are added very
rapidly by research and scientific communities across the world, which renders the majority
of such vocabulary obsolete, and keeping it up to date is quite difficult.
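The recall problem of dictionary lookup can be illustrated with a short sketch. The toy lexicon, the labels and the function name `dictionary_ner` are assumptions made for this example and are not drawn from the cited works.

```python
import re

# Toy lexicon, invented for illustration; a real system would load a curated vocabulary.
LEXICON = {
    "aspirin": "CHEMICAL",
    "breast cancer": "DISEASE",
    "tumor necrosis factor": "GENE/PROTEIN",
}


def dictionary_ner(text, lexicon=LEXICON):
    """Exact, case-insensitive lookup of lexicon terms on word boundaries."""
    found = []
    for term, label in lexicon.items():
        pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for match in pattern.finditer(text):
            found.append((match.start(), label, match.group()))
    return sorted(found)


# Terms spelled exactly as in the lexicon are found.
hits = dictionary_ner("Aspirin use was studied in breast cancer patients.")

# A misspelling ("Asprin") and a spelling variant ("tumour") are both missed.
misses = dictionary_ner("Asprin lowered tumour necrosis factor levels.")
```

The second call returns nothing even though the sentence mentions two entities, which mirrors the kind of spelling and character-level differences that Tuason et al. (2004) associate with low recall.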
The low precision and recall of dictionary-based methods have required several
improvements. An example is the creation of orthographic variants of the terms in
a biomedical resource and their incorporation into the primary lists (Tsuruoka and Tsujii,
2003). The extended list can be used thereafter for exact string matching. Although most of
these improvements have been tested, dictionary-based methods are frequently paired with
more advanced methods of named entity recognition (NER). Statistical machine learning methods
treat BioNER as a sequence labeling problem, where the goal is to determine the right
sequence of labels for a specific input sentence. Theoretically, hidden Markov models
(HMMs), established by Deng et al. (2017) and promoted by others, have a strong ability to
model time signals, so much so that they have become a research hotspot. HMMs
generally deal with time-series data and have been successfully used in speech recognition,
behavior recognition, character recognition and fault diagnosis. The maximum entropy
models or Maximum Entropy Markov Models provide a probabilistic framework that can
combine diverse pieces of contextual evidence to estimate the probability of a certain class.
The essential principle behind the maximum entropy approach is to create a model that
satisfies all known constraints while treating the unknowns uniformly (Dong et al., 2005).
The conditional random fields (CRFs) approach is used for sequence labeling: the CRF
acts as the model, and the probability distribution is a function of variables that depend
on both observation features and state transitions. This model predicts the most likely label
sequence for a given observation set, and under conditional independence between
observations, it can use any arbitrary observational feature (Lee et al., 2018). Support