Selecting a text similarity measure for a content-based recommender system. A comparison in two corpora

Date: 03 June 2019
Pages: 506-527
Published date: 03 June 2019
DOI: https://doi.org/10.1108/EL-08-2018-0165
Author: Manjula Wijewickrema, Vivien Petras, Naomal Dias
Subject Matter: Information & knowledge management
Manjula Wijewickrema and Vivien Petras
Berlin School of Library and Information Science,
Humboldt University of Berlin, Germany, and
Naomal Dias
Department of Computer Systems Engineering, University of Kelaniya,
Kelaniya, Sri Lanka
Abstract
Purpose: The purpose of this paper is to develop a journal recommender system, which compares the
content similarities between a manuscript and the existing journal articles in two subject corpora (covering
the social sciences and medicine). The study examines the appropriateness of three text similarity measures
and the impact of numerous aspects of corpus documents on system performance.
Design/methodology/approach: Three similarity measures were implemented, one at a time, on a journal
recommender system with two separate journal corpora. Two distinct samples of test abstracts were
classified and evaluated based on the normalized discounted cumulative gain.
Findings: The BM25 similarity measure outperforms both the cosine and unigram language similarity
measures overall. The unigram language measure shows the lowest performance. The performance results
are significantly different between each pair of similarity measures, while the BM25 and cosine similarity
measures are moderately correlated. The cosine similarity achieves better performance for subjects with
a higher density of technical vocabulary and shorter corpus documents. Moreover, increasing the number of
corpus journals in the domain of social sciences achieved better performance for cosine similarity and BM25.
Originality/value: This is the first work related to comparing the suitability of a number of string-based
similarity measures with distinct corpora for journal recommender systems.
Keywords: Publishing, Recommender systems, Content-based filtering, Journal selection,
Manuscript submissions
Paper type: Research paper
Received 22 August 2018
Revised 5 February 2019
Accepted 23 March 2019
The Electronic Library
Vol. 37 No. 3, 2019
pp. 506-527
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-08-2018-0165
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
The authors would like to thank the National Centre for Advanced Studies in Humanities and Social
Sciences, Colombo, Sri Lanka for providing partial funding to conduct this research under the
reference number 16/NCAS/SUSL/Lib/08.
Introduction
The increasing amount of modern scholarly literature creates more editorial opportunities
and motivates publishers to launch more journal outlets. As a result, an author can currently find
plenty of publication options for submitting an article. On the one hand, this is clearly a
benefit for authors as they have more chances to publish their research somewhere. On the
other hand, multiple journal options confuse authors in selecting the most appropriate
journal outlet to submit their articles. Experienced authors, who have published for a long
time, may suggest an appropriate journal outlet based on their experience. However,
experienced authors are not always willing or available, and the number of options means
that no one can have perfect knowledge of all the existing journal options in any
given field, so the possibility of missing an appropriate journal is high. Alternative
approaches, such as colleague recommendations or contacting journal editors to check for
suitability, are not reliable either since the advice again depends on the responder's
experience of working in the field.
Selecting the wrong journal for publication can cause several problems. The immediate
rejection of a submission always discourages authors and may, consequently, lead to a
decrease in the productivity of the researcher. A publication that is outdated after going through
several attempts with inappropriate journals would achieve less impact on the research
community. Even if the research is relevant, society will not be able to benefit from it unless it
appears at the right time. Avoiding a lengthy timespan for finding and shortlisting several
journal options manually will help authors to save time for more research. Hence, journal
recommender systems will not only support authors in selecting the optimal journal for them,
but will also implicitly encourage authors to conduct more research and publish their results
to create a more knowledgeable society. In addition to authors, editors and publishers will
also benefit from journal recommender systems. For example, lesser known journals will get
the opportunity to emerge as an option to be suggested by the recommender systems.
Publishers could also customize journal recommender systems to select an appropriate
journal from their own collection. This will attract more authors to them.
The accuracy of a journal recommender system in suggesting appropriate journals
depends on the text similarity measure it uses. Additionally, the performance of
these systems may also depend on the nature of the training corpus used (Banko and Brill,
2001; Islam and Inkpen, 2008). For instance, the vocabulary used in independent domains,
the lengths of the corpus documents, the number of documents in the corpus and the
organization of the text (e.g. structured or unstructured) may all impact the recommendation
algorithm. Thus, developing novel similarity measures or selecting an appropriate one for a
given corpus from a number of existing similarity measures is important to increase the
accuracy of a journal recommender system. The current study aims to compare the
performance of three common text similarity measures to implement a journal recommender
system for English language articles based on two separate disciplinary corpora. The study
examines the performance in terms of relevance of retrieved results and rank position in the
retrieved list.
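To illustrate how relevance and rank position can be combined into a single score, the following minimal Python sketch computes the normalized discounted cumulative gain named in the abstract. The graded relevance values, the log2 rank discount and the function names `dcg` and `ndcg` are assumptions made for illustration; this excerpt does not state the exact formulation used in the study.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain of relevance grades listed in rank order.

    Assumption: the common rel / log2(rank + 1) form, with ranks starting at 1.
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for five recommended journals, in rank order.
recommended = [3, 0, 2, 1, 0]
print(ndcg(recommended))
```

A list that already places the most relevant journals first scores 1.0, and the score drops as relevant journals appear lower in the list, which is why the measure captures both relevance and rank position.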
The unigram language, Okapi BM25 and cosine similarity measures were used in the present
study as the text similarity measures for comparison. The efficiency, solid establishment
over several years, ease of implementation, ease of interpretation, and the capability to cope
with the current problem were all considerations when selecting these three similarity
measures for implementing the recommender system. The Lucene search engine library
(https://lucene.apache.org/) was used as the software tool to implement the three similarity
measures for evaluation. The two training corpora were comprised of open access (OA)
journal articles because it is difficult to find prior research on developing journal
recommender systems for them. The number of professional OA publishers increased
dramatically after the year 2000 (Björk et al., 2010). Also, the number of OA journal authors
increased as they receive a number of benefits over closed access publications (Swan and
Brown, 2004).
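As a rough illustration of how the three measures differ, the sketch below scores a query abstract against a toy corpus in plain Python. It is not the Lucene implementation used in the study: the whitespace tokenizer, the TF-IDF weighting for cosine similarity, the BM25 parameters `k1=1.2` and `b=0.75`, the Dirichlet smoothing with `mu=2000` for the unigram language model, and all class and function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


class Corpus:
    """Term statistics shared by the three scoring functions."""

    def __init__(self, docs):
        self.docs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(c.values()) for c in self.docs]
        self.n = len(self.docs)
        self.total = sum(self.lengths)
        self.avg_len = self.total / self.n
        self.df = Counter(t for c in self.docs for t in c)  # document frequency
        self.cf = Counter()                                 # collection frequency
        for c in self.docs:
            self.cf.update(c)

    def idf(self, term):
        df = self.df[term]
        return math.log(1 + (self.n - df + 0.5) / (df + 0.5))


def cosine(corpus, query, i):
    """Cosine of the angle between TF-IDF vectors of the query and document i."""
    qv = {t: tf * corpus.idf(t) for t, tf in Counter(tokenize(query)).items()}
    dv = {t: tf * corpus.idf(t) for t, tf in corpus.docs[i].items()}
    dot = sum(w * dv.get(t, 0.0) for t, w in qv.items())
    norm = (math.sqrt(sum(w * w for w in qv.values()))
            * math.sqrt(sum(w * w for w in dv.values())))
    return dot / norm if norm else 0.0


def bm25(corpus, query, i, k1=1.2, b=0.75):
    """Okapi BM25 with term-frequency saturation and length normalization."""
    doc, score = corpus.docs[i], 0.0
    for t in set(tokenize(query)):
        tf = doc[t]
        denom = tf + k1 * (1 - b + b * corpus.lengths[i] / corpus.avg_len)
        score += corpus.idf(t) * tf * (k1 + 1) / denom
    return score


def unigram_lm(corpus, query, i, mu=2000):
    """Query log-likelihood under a Dirichlet-smoothed unigram document model."""
    doc, score = corpus.docs[i], 0.0
    for t in tokenize(query):
        p_coll = corpus.cf[t] / corpus.total
        if p_coll == 0:
            continue  # terms unseen in the whole corpus are skipped
        score += math.log((doc[t] + mu * p_coll) / (corpus.lengths[i] + mu))
    return score


# Hypothetical two-document corpus standing in for the journal corpora.
corpus = Corpus([
    "patients received clinical treatment in the hospital",
    "social survey of voter attitudes and policy",
])
query = "clinical treatment of patients"
for measure in (cosine, bm25, unigram_lm):
    ranking = sorted(range(corpus.n), key=lambda i: measure(corpus, query, i),
                     reverse=True)
    print(measure.__name__, ranking)
```

On this toy input all three measures rank the medical document first. Their scores are on different scales (a bounded cosine, an unbounded BM25 sum, a negative log-likelihood), so only the rankings they produce are comparable.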
Social sciences and medicine contribute a large amount of scholarly literature to
academia. For example, the Elsevier database included 31 per cent social sciences literature,
while 26 per cent of the database consisted of health sciences by August 2017 (Elsevier,