A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature

Authors: Carmen Galvez, Félix de Moya-Anegón
Journal: Journal of Documentation, Vol. 68 No. 1, 2012, pp. 5-30
DOI: https://doi.org/10.1108/00220411211200301
Received: 16 June 2010; Revised: 31 October 2010, 21 February 2011; Accepted: 27 February 2011
Published: 13 January 2012
Subject matter: Information & knowledge management; Library & information science
Carmen Galvez
Department of Information Science, Communication and Documentation Faculty, University of Granada, Granada, Spain, and
Félix de Moya-Anegón
SCImago Research Group (CSIC), Institute of Public Goods and Policies (IPP), Madrid, Spain
Abstract
Purpose – Gene term variation is a shortcoming of text-mining applications that rely on
literature-based knowledge discovery in biomedicine. The purpose of this paper is to propose a technique for
normalizing gene names in biomedical literature.
Design/methodology/approach – Under this proposal, the normalized forms can be characterized
as a unique gene symbol, defined as the official symbol or normalized name. The unification method
involves five stages: collection of the gene term, using the resources provided by the Entrez Gene
database; encoding of gene-naming terms in a table or binary matrix; design of a parametrized
finite-state graph (P-FSG); automatic generation of a dictionary; and matching based on dictionary
look-up to transform the gene mentions into the corresponding unified form.
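To make the dictionary look-up stage concrete, the following minimal Python sketch (not the authors' implementation; the variant table and the example sentence are invented) shows how a mapping from gene-term variants to an official symbol can be applied to text:

```python
import re

# Minimal sketch of dictionary-based gene name normalization.
# The variant-to-symbol table below is invented for illustration; in the
# paper the dictionary is generated automatically from Entrez Gene terms
# encoded in a binary matrix and a parametrized finite-state graph (P-FSG).
gene_dictionary = {
    "tumor necrosis factor alpha": "TNF",
    "tnf-alpha": "TNF",
    "tnfa": "TNF",
    "breast cancer 1": "BRCA1",
    "brca-1": "BRCA1",
}

# Longer variants first, so multi-word mentions match before shorter ones.
_pattern = re.compile(
    r"\b(?:" + "|".join(
        re.escape(v) for v in sorted(gene_dictionary, key=len, reverse=True)
    ) + r")\b",
    flags=re.IGNORECASE,
)

def normalize_gene_mentions(text: str) -> str:
    """Replace every dictionary variant found in the text with its official symbol."""
    return _pattern.sub(lambda m: gene_dictionary[m.group(0).lower()], text)

print(normalize_gene_mentions("Overexpression of TNF-alpha and BRCA-1 was observed."))
# -> Overexpression of TNF and BRCA1 was observed.
```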
Findings – The findings show that the approach yields a high percentage of recall. Precision is only
moderately high, basically due to ambiguity problems between gene-naming terms and words and
abbreviations in general English.
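Recall and precision are meant here in their usual evaluation sense (standard definitions, not formulas quoted from the paper):

```latex
% Standard definitions: TP, FP and FN count correctly normalized,
% spuriously normalized and missed gene mentions, respectively.
\[
  \mathrm{Recall} = \frac{TP}{TP + FN},
  \qquad
  \mathrm{Precision} = \frac{TP}{TP + FP}
\]
```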
Research limitations/implications – The major limitation of this study is that biomedical
abstracts were analyzed instead of full-text documents. The number of under-normalization and
over-normalization errors is reduced considerably by limiting the realm of application to biomedical
abstracts in a well-defined domain.
Practical implications – The system can be used for practical tasks in biomedical literature
mining. Normalized gene terms can be used as input to literature-based gene clustering algorithms, for
identifying hidden gene-to-disease, gene-to-gene and gene-to-literature relationships.
Originality/value – Few systems for gene term variation handling have been developed to date. The
technique described performs gene name normalization by dictionary look-up.
Keywords Linguistics, Dictionary, Gene name normalization, Genes
Paper type Research paper
Introduction
The growing amount of scientific discovery in genomic and related biomedical disciplines
has led to a corresponding growth in the number of databases available electronically.
The analysis of biomedical texts and available databases can help to interpret a
phenomenon, to detect gene relations, or to establish comparisons among similar genes in
different specific databases. All these processes are crucial for making sense of the
immense quantity of genomic information. Due to the overload of information, biomedical
scientists are faced with major challenges when tracking down new discoveries and the
results of research in their domain of interest. These challenges are intensified by the need
to follow developments in other domains that might be relevant to one’s own
research. A key challenge for biomedical researchers is how to access and manage this
ever-increasing quantity of information. These enormous collections of publications offer
an exceptional opportunity for data-mining and text-mining applications.
The identification of gene names is an important factor frustrating many of the
mining applications based on literature-based knowledge discovery, such as term
identification, the formulation of hypotheses about disease, or the visualization of
biological relationships. In many cases, implicit relationships are inferred simply by
combining the principle of co-occurrence of terms or concepts with some form of
graphic association (a minimal sketch of this principle follows the list below). This expansion of the literature has heightened interest in:
. information retrieval (IR) to gather, select and filter documents that may prove useful;
. natural language processing (NLP) to automatically process the texts; and
. information extraction (IE), a sub-area of NLP, to find relevant concepts, facts surrounding concepts, and relationships between relevant terms from the identified documents.
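As a rough illustration of the co-occurrence principle mentioned above (not part of the paper's method; the document identifiers and entity sets below are invented), entities mentioned in the same abstract can be linked, and a gene can then be tentatively associated with a disease it never co-occurs with via a shared intermediate term:

```python
from collections import defaultdict
from itertools import combinations

# Toy illustration of co-occurrence-based association (hypothetical data):
# entities mentioned in the same abstract are linked, and shared neighbours
# suggest implicit, Swanson-style gene-to-disease hypotheses.
abstracts = {
    "PMID:0000001": {"TNF", "IL6"},
    "PMID:0000002": {"IL6", "rheumatoid arthritis"},
    "PMID:0000003": {"TNF", "NFKB1"},
}

cooccurs = defaultdict(set)
for entities in abstracts.values():
    for a, b in combinations(sorted(entities), 2):
        cooccurs[a].add(b)
        cooccurs[b].add(a)

def implicit_links(term):
    """Yield (candidate, via) pairs: entities never co-mentioned with `term`
    but sharing at least one co-occurrence partner with it."""
    direct = cooccurs[term]
    for neighbour in direct:
        for candidate in cooccurs[neighbour]:
            if candidate != term and candidate not in direct:
                yield candidate, neighbour

print(list(implicit_links("TNF")))
# -> [('rheumatoid arthritis', 'IL6')]
```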
Data-mining is an analytical process entailing IR, NLP and IE, used to discover
unsuspected associations, that is, combining or linking facts and events for the
purpose of knowledge discovery in databases (KDD). When data-mining processes are
applied to texts in natural language, we speak of text-mining, also known as textual
data-mining, intelligent text analysis, text data-mining, unstructured data
management, or knowledge discovery in text (KDT). Text-mining, then, is the
discovery by computer of previously unknown information through the automatic
extraction of information from different written resources. A key element is the linking
together of this extracted information to form new facts or hypotheses that can be
further explored using more conventional experimental means. Hearst (1999)
characterizes text-mining as the process of discovering and extracting knowledge
from unstructured data, and contrasts it with data-mining, which discovers knowledge
from structured data. Text-mining is similar to data-mining in its goal and is organized
in stages whose major steps are (Fayyad and Uthurusamy, 1996):
. understanding the domain of application;
. creating a target dataset;
. pre-processing;
. the development of a model;
. the choice of suitable data-mining algorithms; and
. final interpretation of results.
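Purely as a schematic outline of these stages (every function below is a hypothetical placeholder, not a component of the system described in this paper), the workflow can be pictured as a linear pipeline:

```python
# Schematic outline of the KDD-style stages listed above; all functions
# are placeholders for illustration only.

def create_target_dataset(query: str) -> list:
    """Gather the abstracts relevant to the chosen domain (e.g. a literature query)."""
    return []

def preprocess(abstracts: list) -> list:
    """Tokenize, normalize terms and remove noise from each abstract."""
    return [a.lower().split() for a in abstracts]

def build_model_and_mine(features: list) -> dict:
    """Develop a model and apply a suitable data-mining algorithm."""
    return {}

def interpret(results: dict) -> None:
    """Inspect and validate the discovered associations."""
    print(results)

interpret(build_model_and_mine(preprocess(create_target_dataset("gene-disease associations"))))
```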
The main issues connected with the transfer of general data-mining techniques to the
textual domain are the representation of texts, their pre-processing, and the special
statistical characteristics of textual data, all of which have important implications for
the choice of data-mining algorithms. Pre-processing involves the elimination of