Selecting a text similarity measure for a content-based recommender system. A comparison in two corpora

Date	03 June 2019
Pages	506-527
Published date	03 June 2019
DOI	https://doi.org/10.1108/EL-08-2018-0165
Author	Manjula Wijewickrema, Vivien Petras, Naomal Dias
Subject Matter	Information & knowledge management

Manjula Wijewickrema and Vivien Petras
Berlin School of Library and Information Science, Humboldt University of Berlin, Germany, and

Naomal Dias
Department of Computer Systems Engineering, University of Kelaniya, Kelaniya, Sri Lanka

Abstract

Purpose – The purpose of this paper is to develop a journal recommender system, which compares the content similarities between a manuscript and the existing journal articles in two subject corpora (covering the social sciences and medicine). The study examines the appropriateness of three text similarity measures and the impact of numerous aspects of corpus documents on system performance.

Design/methodology/approach – Three similarity measures were implemented, one at a time, on a journal recommender system with two separate journal corpora. Two distinct samples of test abstracts were classified and evaluated based on the normalized discounted cumulative gain.

Findings – The BM25 similarity measure outperforms both the cosine and unigram language similarity measures overall. The unigram language measure shows the lowest performance. The performance results are significantly different between each pair of similarity measures, while the BM25 and cosine similarity measures are moderately correlated. The cosine similarity achieves better performance for subjects with higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of corpus journals in the domain of social sciences achieved better performance for cosine similarity and BM25.

Originality/value – This is the first work related to comparing the suitability of a number of string-based similarity measures with distinct corpora for journal recommender systems.

Keywords Publishing, Recommender systems, Content-based filtering, Journal selection, Manuscript submissions

Paper type Research paper

The authors would like to thank the National Centre for Advanced Studies in Humanities and Social Sciences, Colombo, Sri Lanka for providing partial funding to conduct this research under the reference number 16/NCAS/SUSL/Lib/08.

Received 22 August 2018
Revised 5 February 2019
Accepted 23 March 2019

The Electronic Library, Vol. 37 No. 3, 2019, pp. 506-527
ISSN 0264-0473
DOI 10.1108/EL-08-2018-0165

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0264-0473.htm

Introduction

The increasing amount of modern scholarly literature creates more editorial opportunities and motivates publishers to launch more journal outlets. As a result, an author today can find plenty of publication options for submitting an article. On the one hand, this is clearly a benefit for authors, as they have more chances to publish their research somewhere. On the other hand, multiple journal options confuse authors in selecting the most appropriate journal outlet for their articles. Experienced authors, who have published for a long time, may suggest an appropriate journal outlet based on their experience. However, experienced authors are not always willing or available, and the number of options means that no one can have perfect knowledge of all the existing journals in any given field, so the possibility of missing an appropriate journal is high. Alternative approaches, such as colleague recommendations or contacting journal editors to check for suitability, are not reliable either, since the advice again depends on the responder's experience of working in the field.

Selecting the wrong journal for publication can cause several problems. The immediate rejection of a submission discourages authors and may, consequently, lead to a decrease in the productivity of the researcher. A publication that is outdated after several attempts with inappropriate journals would have less impact on the research community. Even if the research is relevant, society will not be able to benefit from it unless it appears at the right time. Avoiding the lengthy process of manually finding and shortlisting several journal options will save authors time for more research. Hence, journal recommender systems will not only support authors in selecting the optimal journal for them, but will also implicitly encourage authors to conduct more research and publish their results to create a more knowledgeable society. In addition to authors, editors and publishers will also benefit from journal recommender systems. For example, lesser known journals will get the opportunity to emerge as an option suggested by the recommender systems. Publishers could also customize journal recommender systems to select an appropriate journal from their own collection, which would attract more authors to them.

The accuracy of a journal recommender system in suggesting appropriate journals depends on the text similarity measure it uses. Additionally, the performance of these systems may also depend on the nature of the training corpus used (Banko and Brill, 2001; Islam and Inkpen, 2008). For instance, the vocabulary used in independent domains, the lengths of the corpus documents, the number of documents in the corpus and the organization of the text (e.g. structured or unstructured) may all impact the recommendation algorithm. Thus, developing novel similarity measures or selecting an appropriate one for a given corpus from a number of existing similarity measures is important to increase the accuracy of a journal recommender system. The current study aims to compare the performance of three common text similarity measures to implement a journal recommender system for English language articles based on two separate disciplinary corpora. The study examines the performance in terms of relevance of retrieved results and rank position in the retrieved list.
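The abstract names normalized discounted cumulative gain (NDCG) as the evaluation metric, which rewards both the relevance of retrieved items and their rank position. The following is a minimal sketch of one common NDCG formulation, not necessarily the exact variant used in the study. The linear gain, the log2(rank + 1) discount and the function names `dcg` and `ndcg` are assumptions made for illustration.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: each graded relevance is divided by
    log2(rank + 1), so relevant items at lower ranks contribute less.
    (Assumed formulation; other gain/discount choices exist.)"""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances, k=None):
    """NDCG at cutoff k for a ranked list of graded relevance scores.

    `relevances` lists the relevance grade of each recommended journal
    in the order the system returned them. The ideal ordering is the
    same grades sorted from highest to lowest.
    """
    ranked = list(relevances)[:k]
    ideal = sorted(relevances, reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    # A list with no relevant items gets a score of 0 by convention here.
    return dcg(ranked) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical relevance grades for five recommended journals.
print(ndcg([3, 2, 3, 0, 1], k=5))
```

A perfectly ordered list scores 1.0, and the score drops as relevant journals are pushed further down the list.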

The unigram language, Okapi BM25 and cosine similarity measures were used in the present study as the text similarity measures for comparison. Efficiency, solid establishment over several years, ease of implementation, ease of interpretation and the capability to cope with the current problem were all considerations when selecting these three similarity measures for implementing the recommender system. The Lucene search engine library (https://lucene.apache.org/) was used as the software tool to implement the three similarity measures for evaluation. The two training corpora consisted of open access (OA) journal articles because it is difficult to find prior research on developing journal recommender systems for them. The number of professional OA publishers increased dramatically after the year 2000 (Björk et al., 2010). Also, the number of OA journal authors increased as they receive a number of benefits over closed access publications (Swan and Brown, 2004).
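To make the three measures concrete, the sketch below scores a query text against a toy corpus with TF-IDF cosine similarity, Okapi BM25 and a unigram language model. It is a pure-Python illustration under stated assumptions, not the authors' Lucene implementation: the class name `Corpus`, the whitespace tokenizer, the smoothed IDF variants, the parameter values `k1`, `b` and `mu`, and the choice of Dirichlet smoothing for the language model are all assumptions, and the example texts are invented.

```python
import math
from collections import Counter


class Corpus:
    """Toy in-memory corpus holding term statistics for three scorers."""

    def __init__(self, texts):
        # Naive lowercase whitespace tokenization (an assumption).
        self.docs = [Counter(t.lower().split()) for t in texts]
        self.n = len(self.docs)
        self.lengths = [sum(d.values()) for d in self.docs]
        self.avg_len = sum(self.lengths) / self.n
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for d in self.docs for term in d)
        # Collection frequency: total occurrences of each term.
        self.cf = Counter()
        for d in self.docs:
            self.cf.update(d)
        self.total = sum(self.lengths)

    def _tfidf(self, counts):
        # Smoothed IDF so that every weight stays positive.
        return {t: c * (math.log((self.n + 1) / (self.df[t] + 1)) + 1)
                for t, c in counts.items()}

    def cosine(self, query, i):
        """Cosine of the angle between TF-IDF vectors of query and doc i."""
        qv = self._tfidf(Counter(query.lower().split()))
        dv = self._tfidf(self.docs[i])
        dot = sum(w * dv.get(t, 0.0) for t, w in qv.items())
        norm = (math.sqrt(sum(w * w for w in qv.values()))
                * math.sqrt(sum(w * w for w in dv.values())))
        return dot / norm if norm else 0.0

    def bm25(self, query, i, k1=1.2, b=0.75):
        """Okapi BM25 with assumed, commonly used values for k1 and b."""
        doc, score = self.docs[i], 0.0
        for t in query.lower().split():
            tf = doc[t]
            if tf == 0:
                continue
            idf = math.log(1 + (self.n - self.df[t] + 0.5) / (self.df[t] + 0.5))
            length_norm = 1 - b + b * self.lengths[i] / self.avg_len
            score += idf * tf * (k1 + 1) / (tf + k1 * length_norm)
        return score

    def unigram_lm(self, query, i, mu=2000.0):
        """Query log-likelihood under a Dirichlet-smoothed unigram model
        (the smoothing method and mu are assumptions)."""
        doc, score = self.docs[i], 0.0
        for t in query.lower().split():
            p_coll = self.cf[t] / self.total
            if p_coll == 0:
                continue  # term unseen in the whole collection: skip it
            score += math.log((doc[t] + mu * p_coll) / (self.lengths[i] + mu))
        return score


# Invented example texts standing in for journal article abstracts.
texts = [
    "health survey of rural patients",
    "social media and political opinion",
    "clinical trial of patients with diabetes",
]
corpus = Corpus(texts)
query = "diabetes patients clinical"
for name in ("cosine", "bm25", "unigram_lm"):
    scores = [getattr(corpus, name)(query, i) for i in range(corpus.n)]
    print(name, scores.index(max(scores)))
```

The scores of the three measures live on different scales (cosine in [0, 1], BM25 unbounded above, the language model a negative log-likelihood), so they are comparable only through the rankings they induce, which is why a rank-based evaluation is a natural fit.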

Social sciences and medicine contribute a large amount of scholarly literature to academia. For example, the Elsevier database included 31 per cent social sciences literature, while 26 per cent of the database consisted of health sciences by August 2017 (Elsevier,
