A dictionary‐based approach to normalizing gene names in one domain of knowledge from the biomedical literature

Authors: Carmen Galvez, Félix de Moya-Anegón
Journal: Journal of Documentation, Vol. 68 No. 1, 2012, pp. 5-30
DOI: https://doi.org/10.1108/00220411211200301
Received: 16 June 2010; Revised: 31 October 2010, 21 February 2011; Accepted: 27 February 2011
Published: 13 January 2012
Subject matter: Information & knowledge management; Library & information science
Carmen Galvez
Department of Information Science, Communication and Documentation Faculty, University of Granada, Granada, Spain, and
Félix de Moya-Anegón
SCImago Research Group (CSIC), Institute of Public Goods and Policies (IPP), Madrid, Spain
Abstract
Purpose – Gene term variation is a shortcoming of text-mining applications that rely on
literature-based knowledge discovery in biomedicine. The purpose of this paper is to propose a technique for
normalizing gene names in biomedical literature.
Design/methodology/approach – Under this proposal, the normalized forms can be characterized
as a unique gene symbol, defined as the official symbol or normalized name. The unification method
involves five stages: collection of the gene term, using the resources provided by the Entrez Gene
database; encoding of gene-naming terms in a table or binary matrix; design of a parametrized
finite-state graph (P-FSG); automatic generation of a dictionary; and matching based on dictionary
look-up to transform the gene mentions into the corresponding unified form.
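To make the dictionary look-up stage concrete, the following minimal Python sketch (not the authors' implementation; the variant table and the example sentence are invented) shows how a mapping from gene-term variants to an official symbol can be applied to text:

```python
import re

# Minimal sketch of dictionary-based gene name normalization.
# The variant-to-symbol table below is invented for illustration; in the
# paper the dictionary is generated automatically from Entrez Gene terms
# encoded in a binary matrix and a parametrized finite-state graph (P-FSG).
gene_dictionary = {
    "tumor necrosis factor alpha": "TNF",
    "tnf-alpha": "TNF",
    "tnfa": "TNF",
    "breast cancer 1": "BRCA1",
    "brca-1": "BRCA1",
}

# Longer variants first, so multi-word mentions match before shorter ones.
_pattern = re.compile(
    r"\b(?:" + "|".join(
        re.escape(v) for v in sorted(gene_dictionary, key=len, reverse=True)
    ) + r")\b",
    flags=re.IGNORECASE,
)

def normalize_gene_mentions(text: str) -> str:
    """Replace every dictionary variant found in the text with its official symbol."""
    return _pattern.sub(lambda m: gene_dictionary[m.group(0).lower()], text)

print(normalize_gene_mentions("Overexpression of TNF-alpha and BRCA-1 was observed."))
# -> Overexpression of TNF and BRCA1 was observed.
```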
Findings – The findings show that the approach yields a high percentage of recall. Precision is only
moderately high, basically due to ambiguity problems between gene-naming terms and words and
abbreviations in general English.
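Recall and precision are meant here in their usual evaluation sense (standard definitions, not formulas quoted from the paper):

```latex
% Standard definitions: TP, FP and FN count correctly normalized,
% spuriously normalized and missed gene mentions, respectively.
\[
  \mathrm{Recall} = \frac{TP}{TP + FN},
  \qquad
  \mathrm{Precision} = \frac{TP}{TP + FP}
\]
```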
Research limitations/implications – The major limitation of this study is that biomedical
abstracts were analyzed instead of full-text documents. The number of under-normalization and
over-normalization errors is reduced considerably by limiting the realm of application to biomedical
abstracts in a well-defined domain.
Practical implications – The system can be used for practical tasks in biomedical literature
mining. Normalized gene terms can be used as input to literature-based gene clustering algorithms, for
identifying hidden gene-to-disease, gene-to-gene and gene-to-literature relationships.
Originality/value – Few systems for gene term variation handling have been developed to date. The
technique described performs gene name normalization by dictionary look-up.
Keywords Linguistics, Dictionary, Gene name normalization, Genes
Paper type Research paper
Introduction
The growing amount of scientific discovery in genomic and related biomedical disciplines
has led to a corresponding growth in the number of databases available electronically.
The analysis of biomedical texts and available databases can help to interpret a
phenomenon, to detect gene relations, or to establish comparisons among similar genes in
different specific databases. All these processes are crucial for making sense of the
immense quantity of genomic information. Due to the overload of information, biomedical
scientists are faced with major challenges when tracking down new discoveries and the
results of research in their domain of interest. These challenges are intensified by the need
to follow developments in other domains that might be relevant to one’s own
research. A key challenge for biomedical researchers is how to access and manage this
ever-increasing quantity of information. These enormous collections of publications offer
an exceptional opportunity for data-mining and text-mining applications.
The identification of gene names is an important factor frustrating many of the
mining applications based on literature-based knowledge discovery, such as term
identification, the formulation of hypotheses about disease, or the visualization of
biological relationships. In many cases, implicit relationships are inferred simply by
combining the principle of co-occurrence of terms or concepts with some form of
graphic association (a minimal sketch of this principle follows the list below). This expansion of the literature has heightened interest in:
. information retrieval (IR) to gather, select and filter documents that may prove useful;
. natural language processing (NLP) to automatically process the texts; and
. information extraction (IE), a sub-area of NLP, to find relevant concepts, facts surrounding concepts, and relationships between relevant terms from the identified documents.
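As a rough illustration of the co-occurrence principle mentioned above (not part of the paper's method; the document identifiers and entity sets below are invented), entities mentioned in the same abstract can be linked, and a gene can then be tentatively associated with a disease it never co-occurs with via a shared intermediate term:

```python
from collections import defaultdict
from itertools import combinations

# Toy illustration of co-occurrence-based association (hypothetical data):
# entities mentioned in the same abstract are linked, and shared neighbours
# suggest implicit, Swanson-style gene-to-disease hypotheses.
abstracts = {
    "PMID:0000001": {"TNF", "IL6"},
    "PMID:0000002": {"IL6", "rheumatoid arthritis"},
    "PMID:0000003": {"TNF", "NFKB1"},
}

cooccurs = defaultdict(set)
for entities in abstracts.values():
    for a, b in combinations(sorted(entities), 2):
        cooccurs[a].add(b)
        cooccurs[b].add(a)

def implicit_links(term):
    """Yield (candidate, via) pairs: entities never co-mentioned with `term`
    but sharing at least one co-occurrence partner with it."""
    direct = cooccurs[term]
    for neighbour in direct:
        for candidate in cooccurs[neighbour]:
            if candidate != term and candidate not in direct:
                yield candidate, neighbour

print(list(implicit_links("TNF")))
# -> [('rheumatoid arthritis', 'IL6')]
```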
Data-mining is an analytical process entailing IR, NLP and IE, used to discover
unsuspected associations, that is, combining or linking facts and events for the
purpose of knowledge discovery in databases (KDD). When data-mining processes are
applied to texts in natural language, we speak of text-mining, also known as textual
data-mining, intelligent text analysis, text data-mining, unstructured data
management, or knowledge discovery in text (KDT). Text-mining, then, is the
discovery by computer of previously unknown information through the automatic
extraction of information from different written resources. A key element is the linking
together of this extracted information to form new facts or hypotheses that can be
further explored using more conventional experimental means. Hearst (1999)
characterizes text-mining as the process of discovering and extracting knowledge
from unstructured data, and contrasts it with data-mining, which discovers knowledge
from structured data. Text-mining is similar to data-mining in its goal and is organized
in stages whose major steps are (Fayyad and Uthurusamy, 1996):
. understanding the domain of application;
. creating a target dataset;
. pre-processing;
. the development of a model;
. the choice of suitable data-mining algorithms; and
. final interpretation of results.
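Purely as a schematic outline of these stages (every function below is a hypothetical placeholder, not a component of the system described in this paper), the workflow can be pictured as a linear pipeline:

```python
# Schematic outline of the KDD-style stages listed above; all functions
# are placeholders for illustration only.

def create_target_dataset(query: str) -> list:
    """Gather the abstracts relevant to the chosen domain (e.g. a literature query)."""
    return []

def preprocess(abstracts: list) -> list:
    """Tokenize, normalize terms and remove noise from each abstract."""
    return [a.lower().split() for a in abstracts]

def build_model_and_mine(features: list) -> dict:
    """Develop a model and apply a suitable data-mining algorithm."""
    return {}

def interpret(results: dict) -> None:
    """Inspect and validate the discovered associations."""
    print(results)

interpret(build_model_and_mine(preprocess(create_target_dataset("gene-disease associations"))))
```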
The main issues connected with the transfer of general data-mining techniques to the
textual domain are the representation of texts, their pre-processing, and the special
statistical characteristics of textual data, all of which have important implications for
the choice of data-mining algorithms. Pre-processing involves the elimination of