Predicting the quality of health web documents using their characteristics

Date: 12 November 2018
Pages: 1024-1047
DOI: https://doi.org/10.1108/OIR-01-2017-0028
Published date: 12 November 2018
Author: Melinda Oroszlányová, Carla Teixeira Lopes, Sérgio Nunes, Cristina Ribeiro
Subject Matter: Library & information science, Information behaviour & retrieval, Collection building & management, Bibliometrics, Databases, Information & knowledge management, Information & communications technology, Internet, Records management & preservation, Document management
Melinda Oroszlányová, Carla Teixeira Lopes, Sérgio Nunes and
Cristina Ribeiro
Department of Informatics Engineering, University of Porto, Porto, Portugal
Abstract
Purpose: The quality of consumer-oriented health information on the web has been defined and evaluated
in several studies. It is usually based on evaluation criteria identified by the researchers and, so far, there is no
agreed standard for the quality indicators to use. Based on such indicators, tools have been developed
to evaluate the quality of web information; HONcode is one such tool. The purpose of this paper is to
investigate the influence of web document features on document quality, using HONcode as ground truth, with the
aim of finding whether it is possible to predict the quality of a document from its characteristics.
Design/methodology/approach: The present work uses a set of health documents and analyzes how their
characteristics (e.g. web domain, last update, type, mention of places of treatment and prevention strategies) are
associated with their quality. Based on these features, statistical models are built which predict whether
health-related web documents have certification-level quality. Multivariate analysis is performed, using
classification to estimate the probability of a document having quality given its characteristics. This approach
tells us which predictors are important. Three types of full and reduced logistic regression models are built and
evaluated. The first one includes every feature, without any exclusion, the second one disregards the Utilization
Review Accreditation Commission variable, due to it being a quality indicator, and the third one excludes the
variables related to the HONcode principles, which might also be indicators of quality. The reduced models were
built with the aim to see whether they reach similar results with a smaller number of features.
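As an illustration of the modelling step described above, the following is a minimal sketch, not the authors' implementation. It fits a logistic regression by plain gradient descent on synthetic data and compares three feature sets. The feature names, the toy data generator and the training settings are all assumptions made for the example. This summary does not say whether the third model also drops the URAC variable, so the sketch drops only the HONcode placeholder. Accuracy is measured on the training data, so it says nothing about the accuracy reported in the paper.

```python
import math
import random

# Hypothetical binary document features (placeholders, not the paper's variables).
FEATURES = ["revision_process", "place_of_treatment", "urac", "hon_principle"]

# Three feature sets mirroring the three model types in the abstract (assumed mapping).
MODEL_FEATURES = {
    "all features": FEATURES,
    "without URAC": ["revision_process", "place_of_treatment", "hon_principle"],
    "without HONcode variables": ["revision_process", "place_of_treatment", "urac"],
}

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, x):
    # Estimated probability that a document has quality; w[0] is the intercept.
    return sigmoid(w[0] + sum(wj * xj for wj, xj in zip(w[1:], x)))

def fit_logistic(rows, labels, epochs=500, lr=0.5):
    # Batch gradient descent on the average log-loss.
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(rows, labels):
            err = predict_proba(w, x) - y
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wj - lr * gj / len(rows) for wj, gj in zip(w, grad)]
    return w

def accuracy(w, rows, labels):
    # Share of documents whose thresholded prediction matches the label.
    hits = sum((predict_proba(w, x) >= 0.5) == bool(y) for x, y in zip(rows, labels))
    return hits / len(rows)

def make_toy_data(n=200, seed=0):
    # Synthetic documents: quality is more likely when more indicators are present.
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        x = [float(rng.random() < 0.5) for _ in FEATURES]
        p = sigmoid(-2.0 + 1.5 * x[0] + 1.0 * x[1] + 2.0 * x[2] + 1.5 * x[3])
        rows.append(x)
        labels.append(1 if rng.random() < p else 0)
    return rows, labels

def select(rows, names):
    # Keep only the named feature columns.
    idx = [FEATURES.index(name) for name in names]
    return [[r[i] for i in idx] for r in rows]

if __name__ == "__main__":
    rows, labels = make_toy_data()
    for name, feats in MODEL_FEATURES.items():
        subset = select(rows, feats)
        w = fit_logistic(subset, labels)
        print(name, round(accuracy(w, subset, labels), 2))
```

Comparing the accuracies of the reduced feature sets against the full one is the kind of check the reduced models are meant to support; a real analysis would use held-out data and the annotated document set.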
Findings: The prediction models have high accuracy, even without including the characteristics of Health
on the Net code principles in the models. The most informative prediction model considers characteristics that
can be assessed automatically (e.g. split content, type, process of revision and place of treatment). It has an
accuracy of 89 percent.
Originality/value: This paper proposes models that automatically predict whether a document has quality
or not. Some of the features used (e.g. prevention, prognosis or treatment) have not yet been explicitly
considered in this context. The findings of the present study may be used by search engines to promote
high-quality documents. This would improve health information retrieval and may contribute to reducing the
problems caused by inaccurate information.
Keywords: Credibility, Prediction models, Online health information, Health information quality
Paper type: Research paper
Introduction
In the past decades, a huge amount of health-related information has become available on the
web. This is one of the reasons people prefer the internet as a source when seeking
information (Kim, 2009; Savolainen, 2008; Zhang et al., 2014). As a result, a growing
number of people are affected by online health information. Users' access to the internet
and their interest in health information are both influential factors in health search
surveys. The findings of a national survey from 2010 (Fox, 2011) about how internet users
in the USA search for health information on the web report that, on a daily basis, millions of
American adults have been using online resources for their health concerns. Among all
adults in the USA, 74 percent went online, and 59 percent looked online for health
information in 2010. These values were shown to be influenced by the health status of the
user as well. Users' findings may influence their decision making, depending on the
information retrieved.
Online Information Review
Vol. 42 No. 7, 2018
pp. 1024-1047
© Emerald Publishing Limited
1468-4527
DOI 10.1108/OIR-01-2017-0028
Received 31 January 2017
Accepted 27 March 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/1468-4527.htm
This work was supported by Project NORTE-01-0145-FEDER-000016 (NanoSTIMA), financed by
the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020
Partnership Agreement and through the European Regional Development Fund (ERDF).
Although the web is the largest source of health information available to users, it has
always been unregulated because of its distributed nature, which calls for attention to the quality
of health-related information. A further problem is that the rising popularity of the internet and the
growth of crowd-edited websites (e.g. Wikipedia) have made consumer-oriented health information
easier to access, but with potentially inconsistent
quality. As summarized by Eysenbach (2002a), researchers conclude that the quality of
health information varies significantly between sources.
People usually do not check the quality of health information on the internet (e.g. related
to some specific medical conditions). Therefore, the goal of current research is to provide
quality indicators with the perspective of helping users find trustworthy information
(Eysenbach, 2002b). The quality assessment is done by introducing quality indicators and
building reliable quality rating tools that can be used to improve search engine rankings. At
the end of the 1990s there were already more than 45 quality-rating instruments identified
(Jadad and Gagliardi, 1998), which used seals of approval for qualifying the websites.
Within a few years, this number rose above 250 instruments, with a focus on tools that
could be used by the consumers (Bernstam, 2005). The criteria and instruments used to
evaluate and rate the health-related websites can be easily accessed through the open web.
There is no agreed standard of quality indicators for web-information yet, nor are the
quality evaluation tools reliable in predicting high quality information (Fahy et al., 2014).
Based on the quality indicators, researchers established scoring systems for quality
evaluation (e.g. HONcode, Utilization Review Accreditation Commission (URAC)), that can
help searchers find more reliable information. In the present work, using a previously
annotated data set, composed of a set of annotated web pages and specific characteristics of
web documents, we analyze the impact of several document characteristics on their quality.
Our broader aim is to see whether it is possible to infer the quality of health information on the
internet automatically, besides using the already reported characteristics in the literature,
and without using the characteristics of HONcode and URAC criteria (Health on the
Net Foundation, 2015; Utilization Review Accreditation Commission, 2015).
The extensive use of search engines (Fahy et al., 2014) makes their ranking criteria
important indicators of the information reached by users. A manifold approach is needed in
order to improve the quality of health information that reaches the seekers on the internet,
along with better informing the searchers about the online health resources.
Assessing the quality of health information on the web
In the 1990s, Pallen (1995) introduced a guide to the internet that became a central concept
for healthcare providers of that time, urging them to share information on health topics with
the public. Later on, researchers started to focus on the information about specific medical
topics available on the web, and they saw that users searching for
health-related information might find it difficult to determine the reliability of web pages.
Several types of studies that assess the quality of web content started to appear. There are
studies focused on the evaluation of the quality of certain websites, studies proposing
guidelines for manual evaluation of websites and studies presenting tools to automatically
do this assessment. In the following subsections, we will describe some of the main
initiatives in manual and automatic methods.
Quality criteria
Several studies have evaluated the quality of the content, and several guidelines have been
published by international health organizations.