Hit count estimate variability for website-specific queries in search engines. The case for rare disease association websites

Published date: 19 March 2018
Pages: 192-213
DOI: https://doi.org/10.1108/AJIM-10-2017-0226
Author: Cristina I. Font-Julian, José-Antonio Ontalba-Ruipérez, Enrique Orduña-Malea
Subject Matter: Library & information science, Information behaviour & retrieval, Information & knowledge management, Information management & governance, Information management
Cristina I. Font-Julian, José-Antonio Ontalba-Ruipérez and
Enrique Orduña-Malea
Universitat Politècnica de València, Valencia, Spain
Abstract
Purpose: The purpose of this paper is to determine the effect of the chosen search engine results page
(SERP) on the website-specific hit count estimation indicator.
Design/methodology/approach: A sample of 100 Spanish rare disease association websites is analysed,
obtaining the website-specific hit count estimation for the first and last SERPs in two search engines (Google
and Bing) at two different periods in time (2016 and 2017).
Findings: It has been empirically demonstrated that there are differences between the number of hits
returned on the first and last SERP in both Google and Bing. These differences are significant when they
exceed a threshold value on the first SERP.
Research limitations/implications: Future studies considering other samples, more SERPs and
generating different queries other than website page count (<site>) would be desirable to draw more
general conclusions on the nature of quantitative data provided by general search engines.
Practical implications: Selecting a wrong SERP to calculate some metrics (in this case, website-specific
hit count estimation) might provide misleading results, comparisons and performance rankings.
The empirical data suggest that the first SERP captures the differences between websites better because
it has a greater discriminating power and is more appropriate for webometric longitudinal studies.
Social implications: The findings allow improving future quantitative webometric analyses based on
website-specific hit count estimation metrics in general search engines.
Originality/value: The website-specific hit count estimation variability between SERPs has been
empirically analysed, considering two different search engines (Google and Bing), a set of 100 websites
focussed on a similar market (Spanish rare diseases associations), and two annual samples, making this study
the most exhaustive on this issue to date.
Keywords: Google, Search engines, Bing, Hit count estimates, Rare diseases, Website page count
Paper type: Research paper
1. Introduction
The use of search engines to automatically (or semi-automatically) extract data on website
page count (number of files hosted by a web domain) and web visibility (number of links or
mentions received by a web domain) has been one of the main applications of the
instrumental branch of cybermetrics (Orduna-Malea and Aguillo, 2015). These web
indicators may provide, in a way that complements other quantitative and qualitative
procedures and techniques, evidence for the greater or lesser presence and impact of web
content and, therefore, of the sites that generate it and of the individuals or legal entities
that manage it.
Cybermetric techniques that evaluate the impact of content hosted by web domains have
been applied to many types of websites, such as universities (Aguillo et al., 2008), academic
journals (Vaughan and Hysen, 2002; Vaughan and Thelwall, 2003; Thelwall, 2012), companies
(Vaughan, 2004; Vaughan and Wu, 2004; Orduna-Malea et al., 2015), the media (Gao and
Vaughan, 2005), political parties (Park et al., 2004; Romero-Frías and Vaughan, 2010),
local government (Holmberg and Thelwall, 2008), museums (Espadas et al., 2008; Gouveia and
Kurtenbach, 2009; Orduna-Malea, 2014) and even hospitals (Utrilla-Ramirez et al., 2009;
Utrilla-Ramirez et al., 2011).
Aslib Journal of Information Management, Vol. 70 No. 2, 2018, pp. 192-213
© Emerald Publishing Limited, 2050-3806
Received 11 October 2017
Revised 20 January 2018
Accepted 5 February 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2050-3806.htm
However, the precision of the analytical tool (search engines) has always been questioned
(Snyder and Rosenbaum, 1999) due to both the internal functioning of search engines
(not made for quantitative purposes) and the nature of web information itself (dynamic and
volatile). Although the literature has studied and proposed various methods for collecting
web data from a cybermetric perspective (Bar-Ilan, 2001; Thelwall, 2004, 2006, 2009;
Thelwall and Sud, 2011), the limitations of search engines have restricted the expansion and
evolution of cybermetrics as a discipline (Thelwall, 2010). The disappearance of specific
search commands, in particular, the command for finding out the number of hyperlinks that
a particular website receives, and of search engines and entire platforms, such as Altavista
and Yahoo Site Explorer, that were equipped with certain essential tools for cybermetric
analysis (Orduna-Malea and Aguillo, 2015), contributed to a gradual abandonment by
researchers of search engines as data sources. At the same time, other specialised platforms
emerged, such as Majestic, Open Site Explorer and Ahrefs, which, despite their undoubted
benefits and features for cybermetrics, offer less coverage and limit their services to the
gathering of large amounts of data.
Although these issues have greatly affected the use of general search engines as
sources of hyperlinks, the use of commercial search engines to calculate website page
count has also been questioned from the outset. Given the impossibility of externally
calculating the number of files hosted on a website without webmaster access privileges,
together with the added difficulty of quantifying dynamically generated content (no file
associated), website page count has traditionally been calculated by the number of URLs
that a search engine has indexed on the corresponding website, called website-specific hit
count estimation. In the case of Google and Bing, there is a search command (<site>)
that retrieves URLs from a particular website. The procedure is based on running a search
query (e.g. <site:nasa.gov>) and noting the number of results provided by the
search engine (hit count estimate (HCE)).
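The procedure just described (build a website-specific query, then note the reported number of results) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' tooling: the function names `site_query` and `parse_hce` are hypothetical, the result labels in the usage note are invented examples, and the parser assumes that dots and commas inside the number are thousands separators. Real result labels vary by search engine, language and date.

```python
import re


def site_query(domain: str) -> str:
    """Build the website-specific query for a domain, e.g. 'site:nasa.gov'."""
    return f"site:{domain.strip().lower()}"


def parse_hce(label: str) -> int:
    """Extract the hit count estimate (HCE) from a results label.

    Assumption: the first number in the label is the HCE, and any '.' or ','
    inside it is a thousands separator (not a decimal mark).
    """
    match = re.search(r"\d[\d.,]*", label)
    if match is None:
        raise ValueError(f"no number found in label: {label!r}")
    # Keep digits only, dropping the separators.
    return int(re.sub(r"\D", "", match.group()))
```

For instance, `site_query("nasa.gov")` yields the string `site:nasa.gov`, and `parse_hce` applied to a hypothetical label such as "About 1,230,000 results" returns the integer 1230000. The value still has to be read from the search engine by hand or by other means; the sketch does not query any service.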
However, the lack of precision in HCEs (Uyar, 2009; Satoh and Yamana, 2012), their
variability over time (Bar-Ilan, 1999), biased search coverage (Thelwall, 2000; Vaughan and
Thelwall, 2004; Lewandowski, 2015), the frequency of updates (Lewandowski, 2008) and
certain functional limitations (such as only returning a maximum number of results
regardless of the HCE value) have led to a certain rejection of the use of search engines as
tools for obtaining web impact data (Lawrence et al., 2010).
The fact that Google did not even offer an application programming interface (API) to
facilitate automated data collection led cybermetric studies to switch to Bing, which
did offer an API (http://datamarket.azure.com/dataset/bing/search), although it was
limited to a maximum of 5,000 free queries per month (this service was withdrawn from
the market on 31 December 2016). Numerous cybermetric studies (Thelwall, 2008;
Thelwall and Sud, 2012; Wilkinson and Thelwall, 2013) were performed using the
Bing API, due mainly to the fact that cybermetric applications such as Webometric
Analyst (http://lexiurl.wlv.ac.uk) worked with its API. However, Bing's lower coverage
(see www.worldwidewebsize.com) in comparison to Google, and various limitations
(e.g. inaccurate results for queries with more than 1,000 hits) greatly restricted the use of this API
for metric purposes.
One of the various limitations of search engines, summarised in a schematic but complete
way by Wilkinson and Thelwall (2013), is the variation in HCEs on each individual search
engine results page (SERP). For example, if we wanted to find out the number of pages indexed
by Bing for the Library of Congress, we could make the following query: <site:loc.gov>.
On the first SERP (configured to display ten results), the search engine informs us that