Web robot detection in scholarly Open Access institutional repositories

Published: 19 September 2016
Joseph W. Greene
James Joyce Library, University College Dublin, Dublin, Ireland
Purpose – The purpose of this paper is to investigate the impact of web robots on usage statistics collected by Open Access (OA) institutional repositories (IRs) and techniques for mitigating their effects.
Design/methodology/approach – A close review of the literature provides a comprehensive list of web robot detection techniques. Reviews of system documentation and open source code, along with personal interviews, provide a comparison of the robot detection techniques used in the major IR platforms. An empirical test based on a simple random sample of downloads (96.20 per cent confidence level) measures the accuracy of an IR's web robot detection at a large Irish university.
Findings – While web robot detection is not ignored in IRs, there are areas where the two main systems could be improved. The technique tested here successfully detected 94.18 per cent of web robots visiting the site over a two-year period (recall), with a precision of 98.92 per cent. Due to the high level of robot activity in repositories, correctly labelling more robots has an exponential effect on the accuracy of usage statistics.
Research limitations/implications – This study is performed on one repository using a single system. Future studies across multiple sites and platforms are needed to determine the accuracy of web robot detection in OA repositories generally.
Originality/value – This is the only study to date to have investigated web robot detection in IRs. It puts forward the first empirical benchmarking of accuracy in IR usage statistics.
Keywords – Usage statistics, Institutional repositories, Open access, Detection, Downloads, Web robots
Paper type – Research paper
1. Introduction
Usage metrics are commonly used in library and information service environments to
assist with decision making such as journal purchasing, collection building, and item
deselection, and to demonstrate the overall value of the services themselves. Scholarly
Open Access (OA) repositories, freely accessible full text repositories of scientific and
scholarly publications, are one such service within the higher education and research
information sector. Beginning around 1991 with arXiv.org, the electronic pre-print
archive of papers in physics and similar subjects (Cornell University Library, n.d.), the
number of OA repositories has grown to more than 4,000 worldwide in 2015 (University
of Southampton and EPrints.org, n.d.). Many of these repositories are hosted locally by
universities for self-archiving by the academic and research staff of those institutions
and are known within the community as institutional repositories (IRs).
As with other information services, OA repositories often collect usage statistics for
the items they host, typically as full text download counts.

Library Hi Tech, Vol. 34 No. 3, 2016, pp. 500-520. © Emerald Group Publishing Limited. DOI 10.1108/LHT-04-2016-0048

The author would like to thank Paul Needham (University of Cranfield and IRUS-UK) and Stefan Amshey, Ann Connolly, and Jean-Gabriel Bankier (BePress Digital Commons) for invaluable discussions and suggestions on the draft of this paper.

Opinions on download statistics are somewhat divided, with some arguing that they are problematic and
unhelpful (Cornell University Library, n.d.), while others make free use of download
statistics, ranking papers, and even authors, distributing them monthly to participants,
and advertising them broadly to the public (Gordon and Jensen, n.d.; Zimmerman and
Baum, n.d.). Download statistics have even been shown under certain conditions to be
predictors of future citations (Brody et al., 2006), arguably the most important metric
for scholarly and scientific research publications.
Regardless of which stance one takes, any data used as a metric or simply publicised
for promotional purposes must be accurate in order to be useful and credible. A great
challenge to this in any web environment is the use of web robots, operated by search
engines and comment spammers alike, and accounting for between 8.51 and 32.6 per
cent of web traffic (Doran and Gokhale, 2011). Robot traffic can vary widely depending on the type of website: a study of the Internet Archive found as much as 93 per cent of requests attributable to robots (AlNoamany et al., 2013).
Given the importance of accurate usage statistics, the sizable and widely variable
impact of web robots, and the complexity of detecting them, we endeavour to answer
the following questions: What techniques are commonly used for web robot detection?
How do the main IR software packages implement web robot detection out-of-the-box?
We then describe and test a web robot detection technique used in practice by an OA IR
at a large Irish University and discuss an effective and practical approach to web robot
detection for repositories that takes advantage of the theoretical models.
2. Web robot detection techniques
A close review of the existing literature on web robot detection yielded ten individual
studies (Tan and Kumar, 2002; Geens et al., 2006; Huntington et al., 2008; Duskin and
Feitelson, 2009; Stassopoulou and Dikaiakos, 2009; Doran and Gokhale, 2012;
AlNoamany et al., 2013; Song et al., 2013; Lamothe, 2014; Zabihi et al., 2014) and one
overview/review article (Doran and Gokhale, 2011) that describe and test the main
techniques and data used in web robot detection. Table I lists 23 distinct variables used
in these studies, categorised here according to a simplified version of the schema
proposed by Doran and Gokhale (2011). While the majority come from the field of
computer science, three studies were found that focus on scholarly information systems
(Bollen and Van de Sompel, 2006; Huntington et al., 2008; Lamothe, 2014).
None of these studies benchmarks detection techniques used in an OA repository, though Huntington et al.'s (2008) research on an OA journal is closely related in terms of content. The technique of investigating outliers in library e-resource usage data proposed by Lamothe (2014) is similar not only in content but also in method, being nearly identical to one of the techniques used by the repository investigated in this study.
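The outlier approach described by Lamothe (2014) can be illustrated with a minimal sketch. The daily counts, field names, and the three-standard-deviation threshold below are illustrative assumptions, not taken from the paper:

```python
from statistics import mean, stdev

# Hypothetical daily download counts for one repository item.
daily_downloads = [12, 9, 15, 11, 8, 14, 10, 13, 412, 9, 11, 16]

mu = mean(daily_downloads)
sigma = stdev(daily_downloads)

# Flag days whose count exceeds the mean by more than three standard
# deviations, a common rule of thumb for suspected robot-driven spikes.
outliers = [(day, n) for day, n in enumerate(daily_downloads)
            if n > mu + 3 * sigma]
print(outliers)  # the single anomalous day stands out
```

Flagged days would then be inspected manually (e.g. by IP address or user agent) rather than discarded automatically, since legitimate bursts of human interest produce similar spikes.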
Each study presents a different method for analysing the data, from matching data
in the server logs against known robots (Huntington et al., 2008) to complex machine
learning techniques (Stassopoulou and Dikaiakos, 2009; Tan and Kumar, 2002). What is
immediately clear is that no method is capable of accurately detecting all robots
visiting a given web server. The practical goal of robot detection therefore becomes detecting the highest percentage of all robots (recall) with the lowest number of false positives (precision), that is, capturing as many robots as possible while labelling the fewest
number of human sessions as robots (Geens et al., 2006). Table II summarises the recall,
precision, and F-score (harmonic mean of recall and precision) achieved in a number of
studies. Recall ranges between 0.85368 and 0.9751, precision between 0.82 and 0.95, and
the F-score between 0.84466 and 0.94.