Large-scale analysis of query logs to profile users for dataset search
DOI | https://doi.org/10.1108/JD-12-2021-0245 |
Published date | 27 April 2022 |
Pages | 66-85 |
Subject Matter | Library & information science,Records management & preservation,Document management,Classification & cataloguing,Information behaviour & retrieval,Collection building & management,Scholarly communications/publishing,Information & knowledge management,Information management & governance,Information management,Information & communications technology,Internet |
Author | Romina Sharifpour,Mingfang Wu,Xiuzhen Zhang |
Large-scale analysis of query logs
to profile users for dataset search
Romina Sharifpour
School of Computing Technologies, RMIT University, Melbourne, Australia
Mingfang Wu
ARDC, Caulfield East, Australia, and
Xiuzhen Zhang
School of Computing Technologies, RMIT University, Melbourne, Australia
Abstract
Purpose – With an explosion of datasets available on the Web, dataset search has gained attention as an
emerging research domain. Understanding users’ dataset search behaviour is imperative for providing effective data
discovery services. In this paper, the authors present a study of users’ dataset search behaviour through the
analysis of search logs from a research data discovery portal.
Design/methodology/approach – Using query- and session-based features, the authors apply cluster
analysis to discover distinct user profiles with different search behaviours. One behavioural
construct of particular interest is users’ expertise, which the authors estimate by computing the semantic similarity
between users’ search queries and the titles of metadata records in the displayed search results.
Findings – The findings reveal six distinct classes of user behaviour for dataset search,
namely Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore.
Research limitations/implications – The user profiles are derived from an analysis of the search log of
the research data catalogue in this study. Further research is needed to generalise the user profiles to other
dataset search settings. Future research can take a confirmatory approach to verify these user groups and
establish a deeper understanding of their information needs.
Practical implications – The findings in this paper have implications for designing search systems that
tailor search results to the diverse information needs of different user groups.
Originality/value – We propose for the first time a taxonomy of users for dataset search based on their
domain expertise and search behaviour.
Keywords Dataset search, Log analysis, Search behaviour, Clustering, Semantic text similarity,
Word embedding
Paper type Research paper
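The expertise construct described in the abstract (semantic similarity between a query and the titles of the displayed records) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy word vectors, the averaging of word vectors into a text vector, the mean aggregation over titles and all function names are hypothetical, and the embedding model actually used is not specified in this section.

```python
import math

# Hypothetical 3-dimensional word vectors, for illustration only; a real
# analysis would load pretrained word embeddings instead.
TOY_VECTORS = {
    "rainfall": [0.9, 0.1, 0.0],
    "precipitation": [0.8, 0.2, 0.1],
    "australia": [0.1, 0.9, 0.0],
    "victoria": [0.2, 0.8, 0.1],
    "survey": [0.0, 0.1, 0.9],
    "data": [0.3, 0.3, 0.3],
}


def embed(text):
    """Average the vectors of the known words; return None if none are known."""
    vectors = [TOY_VECTORS[w] for w in text.lower().split() if w in TOY_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def expertise_score(query, titles):
    """Mean query-title similarity over the displayed results.

    Taking the mean is an assumed aggregation chosen for this sketch.
    """
    query_vec = embed(query)
    if query_vec is None:
        return 0.0
    sims = []
    for title in titles:
        title_vec = embed(title)
        if title_vec is not None:
            sims.append(cosine(query_vec, title_vec))
    return sum(sims) / len(sims) if sims else 0.0


related = expertise_score("rainfall australia", ["precipitation data victoria"])
unrelated = expertise_score("rainfall australia", ["survey data"])
print(related > unrelated)
```

In this sketch, a query whose vocabulary is close to the titles of the returned records scores higher than one whose vocabulary is not. Such a score could then sit alongside query and session features as input to a cluster analysis.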
1. Introduction
Recent years have witnessed phenomenal growth in the amount of data produced, stored
and curated. Increased computational power and the ability to store massive amounts of data
at a low cost have led to the emergence and collection of a massive number of open datasets
available on the Web. Dataset search is typically achieved through metadata; metadata
provides information about a dataset or a collection of datasets, such as title, description and
creator. Currently, there are thousands of online data repositories available, providing
metadata and access to millions of datasets from governments, research institutions,
scientific publishers and data brokers. The more datasets are published, the more
complex the problem of dataset discovery becomes (Brickley et al., 2019).
Understanding user behaviour is known to be central to the improvement of data
discovery services (Arapakis et al., 2014), and for this reason extensive research has
investigated users’ information-seeking behaviour. Previous research attempting to
The authors thank the Australian Research Data Commons for making their search log dataset available
for the study; special thanks to Mr. Joel Benn, for extracting and helping clean up the search log dataset.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0022-0418.htm
Received 21 December 2021
Revised 9 March 2022
28 March 2022
Accepted 29 March 2022
Journal of Documentation
Vol. 79 No. 1, 2023
pp. 66-85
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-12-2021-0245
understand users’ search behaviour has primarily focused on searching for documents or
information within a Web search setting (Bhavnani, 2002; Jansen and Spink, 2006) or Digital
Libraries (Gross and Taylor, 2005) and searching for products on E-commerce websites
(Sondhi et al., 2018). There is a general consensus in past research that various user
characteristics, such as search expertise, domain knowledge of the search topics and
cognitive factors, influence the way users perform searches, formulate queries and assess their
search outcomes (White et al., 2009; Jansen et al., 2009; Wildemuth, 2004; Kathuria et al., 2010).
Past research also makes a distinction between domain expertise and search expertise.
While domain knowledge relates to the searcher’s knowledge of the search topic, search
expertise relates to knowledge of the search process and the ability to construct a query
that yields search results with high precision (Wildemuth, 2004; Hölscher and Strube, 2000).
A searcher’s domain knowledge and Web search expertise are both known to affect
search strategy as well as search success. Among the research concerning
domain knowledge, studies by Bhavnani (2002) and White and Drucker (2007) revealed that
domain experts often started their research on websites containing key resources, rather than
using general web search engines. Others suggested that domain knowledge affects
individuals’ ability to choose more diverse and suitable search terms (Vakkari, 2002; White
and Drucker, 2007; White et al., 2009). Among the studies concerning web expertise, the results
indicated differences between search experts and novices in terms of their search process, with
experts being characterised by specific query formulation and search strategies (White
et al., 2009; White and Drucker, 2007).
Domain knowledge and search expertise are also known to have an effect on each other.
Past research suggested that domain knowledge influences users’ search tactics, such as
adding concepts to or deleting concepts from the search query (Wildemuth, 2004), spending more time
preparing search queries as well as devising search queries that contain more specific
vocabulary from the domain-specific lexicon (Hsieh-Yee, 1993).
The majority of existing research in the literature focuses on user behaviour when searching for textual
documents/web pages, images or videos. Research on user dataset search behaviour is limited, but
interest in it is growing in the research community, in light of the vast number of
datasets becoming available on the Web due to Open Data initiatives (Carevic et al.,
2020). Similarly, there is growing interest in designing information retrieval models
specifically for dataset search (Chapman et al., 2020; Brickley et al., 2019). Furthermore, this body
of emerging research suggests distinctions between the search for datasets and the search for information.
Dataset search is identified as more challenging than classical information search.
This is mainly due to the diverse and distinctive nature of dataset search, which involves both users’
complex information needs and query formulation (Carevic et al., 2020). Additionally,
dataset search involves more difficult selection decisions than search for information
(Kern and Mathiak, 2015). This is partially because the data needed for relevance
judgement is not readily accessible within the metadata of datasets, making it difficult for users
to understand the structure of datasets. Consequently, research to fully understand users’
behaviour when seeking datasets is imperative to the successful establishment of effective
metadata or retrieval models that can satisfy the complex information needs of data search.
To date, a few studies have attempted to understand dataset users’
behaviour and intentions (Kacprzak et al., 2019; Carevic et al., 2020; Chen et al., 2019). These
studies, however, remain limited in explaining various aspects of users’ behaviour.
Most of these studies in particular fail to consider individual differences, such as level
of domain knowledge and web search expertise, as well as how query formulation and
search behaviour can vary in light of such differences. On the other hand, the majority of the
research that characterises users’ behaviour relies heavily on in-depth interviews,
quantitative surveys or lab experiments (Gregory et al., 2019; Jansen et al., 2009;
Koesten et al., 2017). While these methods are very valuable in uncovering users’