Extracting core questions in community question answering based on particle swarm optimization

Published date: 03 September 2019
DOI: https://doi.org/10.1108/DTA-02-2019-0025
Pages: 456-483
Author: Ming Li, Lisheng Chen, Yingcheng Xu
Subject Matter: Library & information science
Ming Li and Lisheng Chen
China University of Petroleum-Beijing, Beijing, China, and
Yingcheng Xu
China National Institute of Standardization, Beijing, China
Abstract
Purpose: A large number of questions are posted on community question answering (CQA) websites every
day. Providing a set of core questions eases the question overload problem. These core questions should
cover the main content of the original question set, have low redundancy among themselves and follow
a distribution consistent with that of the original question set. The paper aims to discuss these issues.
Design/methodology/approach: A method named QueExt for extracting core questions is proposed.
First, questions are modeled using a biterm topic model. Then, these questions are clustered based on
particle swarm optimization (PSO). With the clustering results, the number of core questions to be extracted from
each cluster can be determined. Afterwards, a multi-objective PSO algorithm is proposed to extract the core
questions. Both PSO algorithms are integrated with operators from genetic algorithms to avoid local optima.
Findings: Extensive experiments on real data collected from the well-known CQA website Zhihu have been
conducted, and the experimental results demonstrate superior performance over other benchmark methods.
Research limitations/implications: The proposed method provides new insight into and enriches research
on information overload in CQA. It performs better than other methods in extracting core short text documents,
and thus provides a better way to extract core data. PSO is a novel method for selecting core questions,
which expands research on the application of the PSO model. The study also contributes to research on
PSO-based clustering. With the integration of K-means++, the key parameter, the number of clusters, is optimized.
Originality/value: A novel core question extraction method for CQA is proposed, which provides an
efficient way to alleviate question overload. The PSO model is extended and applied in a novel way to selecting
core questions. The PSO model is integrated with the K-means++ method to optimize the number of clusters,
which is the key parameter in PSO-based text clustering. It provides a new way to cluster texts.
Keywords: Knowledge management, Social media, Particle swarm optimization, Text mining,
Community question answering, Core question extraction
Paper type: Research paper
1. Introduction
With the rapid development of the internet, individuals and organizations acquire
increasingly more knowledge online. Community question answering (CQA)
websites have become important knowledge-sharing platforms (Liu and Jansen, 2017). These
sites are online question-answering websites where users can post or answer questions
freely (Shah and Kitzie, 2012). Yahoo! Answers[1] and Zhihu[2] are popular CQA websites on
which users can seek knowledge by posting questions and share knowledge by answering
questions. Millions of questions have been posted and answered on these CQA websites.
Many users benefit from these CQA websites, and increasingly more users participate
in them.
Data Technologies and
Applications
Vol. 53 No. 4, 2019
pp. 456-483
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-02-2019-0025
Received 12 February 2019
Revised 9 May 2019
6 August 2019
Accepted 28 August 2019
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2514-9288.htm
The authors declare no conflict of interest. The research is supported by the National Natural Science
Foundation of China (Grant No. 71571191), the Humanity and Social Science Youth Foundation of the
Ministry of Education in China (Grant No. 15YJCZH081) and National Natural Science Foundation of
China (Grant No. 91646122).
With the increasing number of community users, the number of posted questions keeps
growing. Question overload makes it harder for both knowledge seekers and question
answerers to find questions (Li and King, 2010). Many studies have been conducted to
support question finding. For example, to help answerers find relevant unanswered
questions, question routing methods have been proposed. Unanswered questions are routed to the
answerers who can probably answer them, according to the answerers'
expertise (Cheng et al., 2017; Li et al., 2011; Srba et al., 2015; Zhao et al., 2015). Question search
refers to finding the questions that are related to a query (Cao et al., 2008). The main goal is to
bridge the gap between queries and existing questions (Cai et al., 2017; Wang et al., 2018;
Wu et al., 2014; Zhang et al., 2014). Question search is often used to help knowledge seekers who
are overwhelmed by the huge number of questions find appropriate questions.
Although these works have made great progress in overcoming the question overload
problem, some problems remain to be resolved. These methods focus on ranking
questions according to needs that are represented by evaluation functions. However,
there are many similar or even duplicate posts in CQA because of its openness (Singh et al.,
2018). In particular, the top-ranked questions are often quite similar, so users' needs can
only be partially satisfied because the results concentrate on a small part of the content,
and other important parts are missed. This gives rise to the requirement of maximally
meeting one's needs with a small
set of core questions. Some methods have been proposed to extract a subset of data to
represent the whole data set (Ma et al., 2011; Zhang et al., 2016). In these methods, texts are
modeled by the TF-IDF method, which performs well on long formal documents. However,
the performance of the TF-IDF method degrades on short texts because of
their sparsity (Cheng et al., 2014). Since most questions in CQA are short, these
extraction methods are not suitable for extracting questions. Moreover, these methods are
based on greedy algorithms, which extract documents one by one. As a result, better
combinations of documents may not be found, and the quality of the extraction suffers.
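The limitation of one-by-one greedy extraction can be seen in a toy example. The sketch below is an illustration only: it assumes each question is represented as a set of term IDs and uses the number of covered terms as a hypothetical selection objective, which is not the objective used by the cited methods. The function names and the data are likewise made up for illustration.

```python
from itertools import combinations


def coverage(term_sets, chosen):
    """Number of distinct terms covered by the chosen questions."""
    covered = set()
    for name in chosen:
        covered |= term_sets[name]
    return len(covered)


def greedy_select(term_sets, k):
    """Pick k questions one by one, each time taking the best single addition."""
    chosen = []
    for _ in range(k):
        remaining = [name for name in term_sets if name not in chosen]
        best = max(remaining, key=lambda name: coverage(term_sets, chosen + [name]))
        chosen.append(best)
    return chosen


def exhaustive_select(term_sets, k):
    """Evaluate every combination of k questions and keep the best one."""
    return list(max(combinations(term_sets, k), key=lambda c: coverage(term_sets, c)))


# Hypothetical questions, each reduced to a set of term IDs.
questions = {"A": {1, 2, 3, 4}, "B": {1, 2, 5}, "C": {3, 4, 6}}

greedy = greedy_select(questions, 2)          # starts with A, the largest set
optimal = exhaustive_select(questions, 2)     # B and C together cover more
print(greedy, coverage(questions, greedy))    # ['A', 'B'] 5
print(optimal, coverage(questions, optimal))  # ['B', 'C'] 6
```

Exhaustive search is only feasible for tiny inputs like this one, which is why a population-based search such as PSO is a natural candidate for exploring combinations at scale.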
To resolve the above problems, this paper proposes a method named QueExt for extracting core
questions on CQA websites. First, the questions are modeled using the biterm
topic model (BTM), which is better suited to short text modeling. Then, questions with similar topics
are automatically clustered. The clustering is modeled as a single-objective optimization problem,
which is solved using particle swarm optimization (PSO). With the clustering results, the
number of core questions that need to be extracted from each cluster is determined. Afterwards,
the core questions are extracted from each cluster according to the cluster size. The extraction is
modeled, in a novel way, as a multi-objective optimization problem, which is also solved using PSO.
To avoid local optima, both PSO algorithms are integrated with operators from genetic
algorithms. Finally, the experiments show that the proposed method performs better than the
benchmark methods.
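The step of determining how many core questions to take from each cluster can be sketched as a proportional allocation. This is a minimal sketch under an assumption: the introduction only states that the counts follow the cluster sizes, so the largest-remainder rounding and the function name `allocate_core_counts` are hypothetical choices, not the authors' formula.

```python
def allocate_core_counts(cluster_sizes, total_core):
    """Split total_core slots across clusters in proportion to cluster size.

    Assumption: largest-remainder rounding, chosen here only so that the
    per-cluster counts are integers that sum exactly to total_core.
    """
    total = sum(cluster_sizes)
    if total == 0:
        raise ValueError("clusters must contain at least one question")
    # Integer quotient and remainder of size * total_core / total per cluster.
    shares = [divmod(size * total_core, total) for size in cluster_sizes]
    counts = [quotient for quotient, _ in shares]
    leftover = total_core - sum(counts)
    # Hand the leftover slots to the clusters with the largest remainders.
    by_remainder = sorted(range(len(shares)), key=lambda i: shares[i][1], reverse=True)
    for i in by_remainder[:leftover]:
        counts[i] += 1
    return counts


print(allocate_core_counts([50, 30, 20], 10))  # [5, 3, 2]
print(allocate_core_counts([5, 3, 2], 4))      # [2, 1, 1]
```

Allocating in proportion to cluster size is one simple way to keep the distribution of the extracted questions consistent with that of the original question set.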
In the following section, studies on CQA, PSO and multi-objective optimization are
presented. Section 3 gives the clustering method and introduces the core question extraction
method. In Section 4, the experiments are described in detail. We conclude in Section 5 with
future work.
2. Related works
2.1 Community question answering
CQA systems are a typical Web 2.0 knowledge-sharing application (Srba and Bielikova,
2015). This type of application provides a community service where users post and answer
questions. Knowledge seekers need their questions to be answered so that they can obtain the
knowledge they seek from the corresponding answers. Experts or professional users need
suitable unanswered questions through which to share their knowledge. As the number of posted
questions rapidly increases over time, the massive number of questions leads to the
problem of information overload. It is therefore necessary to find the required questions
efficiently to improve the effectiveness of knowledge sharing.
