Identifying domain relevant user generated content through noise reduction: a test in a Chinese stock discussion forum

DOI: https://doi.org/10.1108/IDD-04-2017-0043
Published date: 20 November 2017
Pages: 181-193
Authors: Xiangbin Yan, Yumei Li, Weiguo Fan
Subject matter: Library & information science, Library & information services, Lending, Document delivery, Collection building & management, Stock revision, Consortia
Xiangbin Yan
University of Science and Technology, Beijing, China
Yumei Li
Harbin Institute of Technology, Harbin, China, and
Weiguo Fan
Department of Accounting and Information Systems, Virginia Polytechnic Institute and State University,
Blacksburg, Virginia, USA
Abstract
Purpose – Obtaining high-quality data by removing noise from user-generated content (UGC) is the first step toward data mining and
effective decision-making based on ubiquitous, unstructured social media data. This paper aims to design a framework for removing noisy data
from UGC.
Design/methodology/approach – The authors consider a classification-based framework to remove noise from unstructured
UGC in a social media community. They define noise as messages that are not relevant to the topic of concern and apply a text classification approach
to remove it. They introduce a domain lexicon to help distinguish on-topic messages from noise and compare the performance of several
classification algorithms combined with different feature selection methods.
Findings – Experimental results based on a Chinese stock forum show that 84.9 per cent of the noise in the UGC could be removed with
little loss of valuable information. The support vector machine classifier combined with information gain feature selection is the best choice
for this system. Message length also affects system performance: longer messages yield better classification
performance.
Originality/value – The proposed method could be used as a preprocessing step in text mining and new knowledge discovery from big data.
Keywords Social media, User-generated content, Feature selection, Text classification, Domain lexicon, Noise reduction
Paper type Research paper
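As an illustration of the feature selection step mentioned in the abstract, the following is a minimal Python sketch of information gain scoring for binary term-presence features. It is not the authors' implementation: the toy messages, the English tokens (the forum itself is in Chinese and would need word segmentation first), the labels and the function names `entropy`, `information_gain` and `select_features` are all assumptions made for illustration. The classifier itself (e.g. a support vector machine) is omitted.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (in bits) of a label distribution given as counts."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def information_gain(docs, labels, term):
    """Information gain of a binary term-presence feature w.r.t. the labels."""
    base = entropy(list(Counter(labels).values()))
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    conditional = 0.0
    for subset in (with_term, without_term):
        if subset:  # skip an empty side of the split
            weight = len(subset) / len(docs)
            conditional += weight * entropy(list(Counter(subset).values()))
    return base - conditional


def select_features(docs, labels, k):
    """Return the k terms with the highest information gain."""
    vocab = sorted({t for d in docs for t in d})
    ranked = sorted(vocab,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy messages, each already tokenised into a set of terms.
docs = [
    {"stock", "price", "rise"},
    {"stock", "dividend"},
    {"lunch", "weather"},
    {"weather", "holiday"},
]
labels = ["relevant", "relevant", "noise", "noise"]

top_terms = select_features(docs, labels, k=2)
```

On this toy data, terms that perfectly separate the two classes receive an information gain of 1 bit; the selected terms would then serve as the feature set for the classifier.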
1. Introduction
With the emergence of information technology and
its applications, the total amount of global data has shown
unprecedented, explosive growth. At the same time, data
have become increasingly complex. Low value density is
one of the characteristics of big data (Feng et al., 2013), which
means that the data we are really interested in are buried in
a great deal of noise. For example, in 1 h of
uninterrupted video monitoring, there may be
only a few seconds containing potentially useful information.
Removing the noise is an important step in going
through the huge volume of data to extract useful information.
With the advent of Web 2.0, the internet has evolved to
support multimedia-rich content delivery, end-user personal
content generation and community-based social interactions
(Zhang et al., 2011c). Online reviews, Web discussions and
blog articles have become important channels for users to
publish and share information, which has led to the explosive
growth of user-generated content (UGC). This provides us
with a great opportunity to explore the views of the masses on
the Web (Liu et al., 2007) and to assess the value of this UGC.
Web 2.0 sites may accumulate opinions from participants
of any type, such as customers and
investors of a company. These discussions could provide
insights from many perspectives, such as consumers’ views on
products, stock values and public opinion on social events.
This crowd-sourced information could answer our questions
in a better way and may
provide a new perspective on the whole world. For example,
Liu shows that online reviews offer significant explanatory
power for both aggregate and weekly box office revenue
(Liu, 2006). Previous research has shown that consumer
behavior is increasingly influenced by peer opinions (Smith,
2009), which means that consumers’ reviews are valuable to
both companies and consumers. The communities powered
by UGC have become prevalent and useful tools for
knowledge acquisition, exchange and collaborative
The current issue and full text archive of this journal is available on
Emerald Insight at: www.emeraldinsight.com/2398-6247.htm
Information Discovery and Delivery
45/4 (2017) 181–193
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-04-2017-0043]
Received 20 April 2017
Revised 24 June 2017
Accepted 30 June 2017