Large-scale analysis of query logs to profile users for dataset search

DOI: https://doi.org/10.1108/JD-12-2021-0245
Published date: 27 April 2022
Pages: 66-85
Subject Matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Author: Romina Sharifpour, Mingfang Wu, Xiuzhen Zhang
Romina Sharifpour
School of Computing Technologies, RMIT University, Melbourne, Australia
Mingfang Wu
ARDC, Caulfield East, Australia, and
Xiuzhen Zhang
School of Computing Technologies, RMIT University, Melbourne, Australia
Abstract
Purpose: With an explosion of datasets available on the Web, dataset search has gained attention as an
emerging research domain. Understanding users' dataset search behaviour is imperative for providing effective data
discovery services. In this paper, the authors present a study on users' dataset search behaviour through the
analysis of search logs from a research data discovery portal.
Design/methodology/approach: Using query- and session-based features, the authors apply cluster
analysis to discover distinct user profiles with different search behaviours. One particular behavioural
construct of interest is users' expertise, which the authors estimate by computing the semantic similarity
between users' search queries and the titles of metadata records in the displayed search results.
Findings: The findings revealed that there are six distinct classes of user behaviour for dataset search,
namely: Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore.
Research limitations/implications: The user profiles in this study are derived from an analysis of the search log of
a single research data catalogue. Further research is needed to generalise the user profiles to other
dataset search settings. Future research can take a confirmatory approach to verify these user groups and
establish a deeper understanding of their information needs.
Practical implications: The findings in this paper have implications for designing search systems that
tailor search results to the diverse information needs of different user groups.
Originality/value: We propose for the first time a taxonomy of users for dataset search based on their
domain expertise and search behaviour.
Keywords: Dataset search, Log analysis, Search behaviour, Clustering, Semantic text similarity,
Word embedding
Paper type: Research paper
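As an illustration of the expertise construct described in the abstract, the following is a minimal sketch of scoring the semantic similarity between a search query and the titles of displayed results. It is not the authors' implementation: the averaging of word vectors, the cosine measure, the toy three-dimensional vectors and the function names (`embed`, `cosine`, `query_title_similarity`) are all assumptions made for the example. A real analysis would load pretrained word embeddings and would feed the resulting score, together with other query and session features, into a clustering step that is not shown here.

```python
import math

# Hypothetical toy word vectors (assumption for illustration only);
# a real analysis would load pretrained word embeddings instead.
TOY_VECTORS = {
    "rainfall": [0.9, 0.1, 0.0],
    "precipitation": [0.8, 0.2, 0.1],
    "climate": [0.7, 0.3, 0.1],
    "data": [0.3, 0.3, 0.3],
    "census": [0.0, 0.9, 0.2],
    "population": [0.1, 0.8, 0.3],
}


def embed(text):
    """Average the vectors of the known words in text; None if no word is known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def query_title_similarity(query, titles):
    """Mean cosine similarity between a query and the displayed result titles."""
    query_vec = embed(query)
    if query_vec is None:
        return 0.0
    title_vecs = [vec for vec in map(embed, titles) if vec is not None]
    if not title_vecs:
        return 0.0
    return sum(cosine(query_vec, vec) for vec in title_vecs) / len(title_vecs)


on_topic = query_title_similarity("rainfall data", ["precipitation climate data"])
off_topic = query_title_similarity("rainfall data", ["census population"])
print(on_topic > off_topic)
```

Under these assumptions, a query whose vocabulary sits close to the vocabulary of the returned titles receives a higher score, which is the intuition behind using query-title similarity as a proxy for expertise.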
1. Introduction
Recent years have witnessed phenomenal growth in the amount of data produced, stored
and curated. Increased computational power and the ability to store massive amounts of data
at low cost have led to the emergence and collection of a massive number of open datasets
available on the Web. Dataset search is typically achieved through metadata; metadata
provides information about a dataset or a collection of datasets, such as title, description and
creator. Currently, there are thousands of online data repositories available, providing
metadata and access to millions of datasets from governments, research institutions,
scientific publishers as well as data brokers. The more datasets are published, the more
complex the problem of dataset discovery becomes (Brickley et al., 2019).
The authors thank the Australian Research Data Commons for making their search log dataset available
for the study; special thanks to Mr. Joel Benn for extracting and helping clean up the search log dataset.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0022-0418.htm
Received 21 December 2021
Revised 9 March 2022, 28 March 2022
Accepted 29 March 2022
Journal of Documentation
Vol. 79 No. 1, 2023
pp. 66-85
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-12-2021-0245
Understanding user behaviour is known to be central to improving data
discovery services (Arapakis et al., 2014), and for that reason extensive research has
investigated users' information-seeking behaviour. Previous research attempting to
understand users' search behaviour has primarily focused on searching for documents or
information within a Web search setting (Bhavnani, 2002; Jansen and Spink, 2006) or Digital
Libraries (Gross and Taylor, 2005) and searching for products on E-commerce websites
(Sondhi et al., 2018). There is a general consensus in past research that various user
characteristics, such as search expertise, domain knowledge of the search topics and
cognitive factors, influence the way users perform searches, formulate queries and assess their
search outcomes (White et al., 2009; Jansen et al., 2009; Wildemuth, 2004; Kathuria et al., 2010).
Past research also makes a distinction between domain expertise and search expertise.
While domain knowledge is related to searchers' knowledge of the search topic, search
expertise relates to knowledge of the search process and the ability to construct a query
that yields high-precision search results (Wildemuth, 2004; Hölscher and Strube, 2000).
Searchers' domain knowledge and Web search expertise are both known to impact the
search strategy as well as search success. Among the research concerning
domain knowledge, studies by Bhavnani (2002) and White and Drucker (2007) revealed that
domain experts often started their search on websites containing key resources, rather than
utilising general web search engines. Others suggested that domain knowledge affects
individuals' ability to choose more diverse and suitable search terms (Vakkari, 2002; White
and Drucker, 2007; White et al., 2009). Among the studies concerning web expertise, the results
indicated differences between search experts and novices in terms of their search process, with
experts being characterised by specific query formulation and search strategies (White
et al., 2009; White and Drucker, 2007).
Domain knowledge and search expertise are also known to have an effect on each other.
Past research suggested that domain knowledge influences users' search tactics, such as
adding concepts to or deleting concepts from the search query (Wildemuth, 2004), spending more time
preparing search queries as well as devising search queries that contain more specific
vocabulary from the domain-specific lexicon (Hsieh-Yee, 1993).
The majority of existing research in the literature focuses on user behaviour for searching textual
documents/web pages, images or videos. There is limited research but growing interest in the
research community in uncovering user dataset search behaviour, in light of the vast number of
datasets that are becoming available on the Web due to Open Data initiatives (Carevic et al.,
2020). Similarly, we have seen growing interest in designing information retrieval models
specifically for dataset search (Chapman et al., 2020; Brickley et al., 2019). Furthermore, this body
of emerging research suggests distinctions between the search for datasets and the search for information.
Dataset search is identified as more challenging than classical information search.
This is mainly due to the diverse and distinctive nature of dataset search, which embeds both users'
complex information needs and query formulation (Carevic et al., 2020). Additionally,
dataset search involves more difficult selection decisions compared to search for information
(Kern and Mathiak, 2015). This is partially because the data needed for relevance
judgement is not readily accessible within the metadata of datasets, making it difficult for users
to understand the structure of datasets. Consequently, research to fully understand users'
behaviour when seeking datasets is imperative to the successful establishment of effective
metadata or retrieval models that can satisfy the complex information needs of data search.
To date, a few studies have attempted to understand dataset users'
behaviour and intentions (Kacprzak et al., 2019; Carevic et al., 2020; Chen et al., 2019). These
studies, however, remain limited in explaining various aspects of users' behaviour.
Most of these studies, in particular, fail to consider individual differences, such as level
of domain knowledge and web search expertise, as well as how query formulation and
search behaviour can vary in light of such differences. On the other hand, the majority of the
research that characterises users' behaviour relies heavily on in-depth interviews,
quantitative surveys or lab experiments (Gregory et al., 2019; Jansen et al., 2009;
Koesten et al., 2017). While these methods are very valuable in uncovering users