Metadata categorization for identifying search patterns in a digital library

Pages: 270-286
Published date: 06 March 2019
DOI: https://doi.org/10.1108/JD-06-2018-0087
Authors: Tessel Bogaard, Laura Hollink, Jan Wielemaker, Jacco van Ossenbruggen, Lynda Hardman
Subject matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Tessel Bogaard and Laura Hollink
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Jan Wielemaker and Jacco van Ossenbruggen
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands and
Vrije Universiteit Amsterdam, Amsterdam, The Netherlands, and
Lynda Hardman
Centrum Wiskunde & Informatica, Amsterdam, The Netherlands and
Universiteit Utrecht, Utrecht, The Netherlands
Abstract
Purpose: For digital libraries, it is useful to understand how users search in a collection. Investigating
search patterns can help them to improve the user interface, collection management and search algorithms.
However, search patterns may vary widely in different parts of a collection. The purpose of this paper is to
demonstrate how to identify these search patterns within a well-curated historical newspaper collection using
the existing metadata.
Design/methodology/approach: The authors analyzed search logs combined with metadata records
describing the content of the collection, using this metadata to create subsets in the logs corresponding to
different parts of the collection.
Findings: The study shows that faceted search is more prevalent than non-faceted search in terms of
number of unique queries, time spent, clicks and downloads. Distinct search patterns are observed in different
parts of the collection, corresponding to historical periods, geographical regions or subject matter.
Originality/value: First, this study provides deeper insights into search behavior at a fine granularity in a
historical newspaper collection, by the inclusion of the metadata in the analysis. Second, it demonstrates how
to use metadata categorization as a way to analyze distinct search patterns in a collection.
Keywords: Newspapers, Library users, Behaviour, Digital libraries, Case studies, Searching, Archives
Paper type: Research paper
1. Introduction
Log analysis is an unobtrusive technique for macro-analysis of user behavior in digital search
systems (Hollink et al., 2011; Spink and Jansen, 2004). It contributes to an understanding of the
information needs of users and to what extent these needs are met. Results based on log analysis
may be used for the evaluation of search algorithms, (re-)design of user interfaces and to identify
potential gaps in the underlying document collection. User behavior in general web search is
well-studied (Baeza-Yates et al., 2005; Beitzel et al., 2004; Downey et al., 2007; Jansen and Spink,
2006). However, in search engines providing access to a specific type of content or collection
(vertical search engines), the search functionality is often different; hence, user behavior can be
expected to differ. This has been shown, for example, for image archives (Han and Wolfram,
2015; Hollink et al., 2011), a medical knowledge portal (Callahan et al., 2015), a newspaper archive
(Gooding, 2016) and in a study of a digital library (Niu and Hemminger, 2015).
Journal of Documentation
Vol. 75 No. 2, 2019
pp. 270-286
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-06-2018-0087
Received 5 June 2018
Revised 12 October 2018
Accepted 17 October 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0022-0418.htm
This research was partially supported by the VRE4EIC project, a project that has received funding from the
European Union's Horizon 2020 research and innovation program under grant agreement No. 676247.
The computational part of the research has been carried out on the SWISH DataLab software infrastructure
developed within the VRE4EIC project (Bogaard et al., 2017). The authors thank the National Library of the
Netherlands for providing access to their data and feedback on earlier drafts of this paper.
Our work is carried out in the context of the online search interface to the historical
newspaper collection of the National Library of the Netherlands. The documents in the
collection are described with rich, professionally curated bibliographic metadata about their
format and origin. The search interface providing access to the documents is typical for a
digital library: in addition to regular query input for full text search, users can filter search
results based on selected metadata values using facets (Hearst et al., 2002). Curators at the
National Library of the Netherlands are interested in understanding how users search within
their historical newspaper collection. This will allow them to provide improved search features
for user groups with specific tasks searching in different parts of the collection. This study
therefore addresses the following research question:
RQ1. How do search patterns differ among users searching in different parts of the
collection?
Previous work has used categorizations of the queries found in logs to find distinct search
patterns, for example, in the study of religious search relating to five religions (Wan-Chik et al.,
2013), or an investigation of different types of learning in search (Eickhoff et al., 2014).
Query analysis, however, suffers from various disadvantages. Queries are ambiguous, as they
form an uncontrolled vocabulary with little context to interpret the underlying information
need. Most queries appear infrequently in the logs. As a consequence, when investigating
patterns of queries and clicks, even the most frequently occurring patterns occur infrequently.
Furthermore, queries may contain privacy-sensitive information (Jones et al., 2008).
We propose to use the metadata instead to investigate different search patterns in a historical
newspaper collection. The metadata values of clicked documents and the corresponding facet
values come from a controlled vocabulary. We can observe search patterns by grouping
individual, unique queries based on facet values. Likewise, (long tail) clicked documents can
be grouped by their associated metadata values. Moreover, metadata values of facets and
clicked documents are less privacy-sensitive than queries entered by users.
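
The grouping idea in this paragraph can be illustrated with a minimal sketch. It assumes a simplified log in which each record holds a session identifier, the free-text query and the facet values selected with it; the field layout, facet names and sample values are hypothetical and do not reflect the actual log schema of the National Library. Unique queries that would each occur only once are collected under the controlled facet value they share.

```python
from collections import defaultdict

# Hypothetical log records: (session_id, query_text, selected_facets).
# The layout and the values are illustrative assumptions, not the real schema.
log = [
    ("s1", "jan jansen", {"type": "family announcement"}),
    ("s2", "p. de vries overleden", {"type": "family announcement"}),
    ("s3", "bombardement rotterdam", {"period": "1940-1945"}),
    ("s4", "paramaribo", {"region": "Surinam"}),
    ("s5", "verzet", {"period": "1940-1945"}),
    ("s6", "voetbal", {}),  # non-faceted query: falls outside every subset
]


def group_by_facet(records):
    """Collect session ids under each (facet, value) pair selected in a query."""
    groups = defaultdict(set)
    for session_id, _query, facets in records:
        for facet, value in facets.items():
            groups[(facet, value)].add(session_id)
    return groups


groups = group_by_facet(log)
# Number of sessions per facet value; the query text itself is never inspected.
counts = {key: len(sessions) for key, sessions in groups.items()}
```

Because only the facet values are used as grouping keys, the free-text queries do not need to be read or stored for this step, which is in line with the privacy argument made above.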
We start with an analysis of faceted search vs non-faceted search to investigate the role
of facets in search. Our results show that faceted search (57 percent of all search) is
responsible for the larger part of time spent (median session duration of over an hour vs less
than 10 min), the majority of unique queries (79 percent) and documents clicked (78 percent)
and downloaded (72 percent). We create subsets based on the metadata of facets selected in
search, using the selected facet values as a proxy for user interest.
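
As a rough illustration of the comparison described above, the following sketch computes the share of faceted sessions, the median session duration per group and the share of clicks attributable to faceted search. The session tuples are invented sample data, and treating the session as the unit of analysis is an assumption made here for simplicity; neither the numbers nor the layout come from the study.

```python
from statistics import median

# Hypothetical sessions: (uses_facets, duration_minutes, clicks).
# Invented sample values, chosen only to make the computation concrete.
sessions = [
    (True, 75, 12),
    (True, 130, 30),
    (True, 40, 5),
    (False, 6, 1),
    (False, 9, 2),
]


def summarize(rows):
    """Compare faceted and non-faceted sessions on share, duration and clicks."""
    faceted = [r for r in rows if r[0]]
    plain = [r for r in rows if not r[0]]
    total_clicks = sum(r[2] for r in rows)
    return {
        "faceted_share": len(faceted) / len(rows),
        "median_minutes_faceted": median(r[1] for r in faceted),
        "median_minutes_plain": median(r[1] for r in plain),
        "faceted_click_share": sum(r[2] for r in faceted) / total_clicks,
    }


summary = summarize(sessions)
```

The median is used rather than the mean because session durations in search logs tend to be skewed by a small number of very long sessions.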
We find distinct search patterns based on the kind of facet selected: publication date, item
type or geographical region. For example, users searching within the Second World War keep
returning to the platform over an extended period of time (median session duration eight days)
and click and download many documents (median of 25 clicks, 31 percent of sessions include
a download). Many users are interested in family announcements (18 percent of all sessions),
with visits that are typically highly focused on the subject matter and contain relatively few
clicks. Search for Surinam, though not as popular, is also very focused, with almost all clicks
on documents from this part of the collection (84 percent) in these comparatively shorter visits
(median session duration of just under five hours).
The contribution of this paper is twofold. First, we provide detailed insights into user behavior
in a historical newspaper collection, observing distinct search patterns within different parts of
the collection. Based on our findings, we are able to formulate concrete suggestions for
improvement of the online search platform of the National Library: suggestions for improvements
to the user interface, recommendations for a different default setting of parameters and
recommendations for prioritization of their ongoing digitization efforts. Second, we illustrate how
metadata can be used to analyze behavior in a digital library or archive. As such, it enables us to
do a comparative analysis of: what users search for (from the faceted query log data), what they
find (from click log data) and what is or is not present in the collection (from collection metadata).