Inducing stock market lexicons from disparate Chinese texts

Published date: 23 December 2019
Pages: 508-525
DOI: https://doi.org/10.1108/IMDS-04-2019-0254
Authors: Futao Zhao, Zhong Yao, Jing Luan, Hao Liu
Subject matter: Information & knowledge management, Information systems, Data management systems, Knowledge management, Knowledge sharing, Management science & operations, Supply chain management, Supply chain information systems, Logistics, Quality management/systems
Futao Zhao
School of Economics and Management, Beihang University, Beijing, China
Zhong Yao
School of Economics and Management, Beihang University, Beijing, China and
Institute of Economics and Business, Beihang University, Beijing, China
Jing Luan
School of Economics and Management,
Beijing Jiaotong University, Beijing, China, and
Hao Liu
School of Business Administration, Northeastern University, Shenyang, China and
Northeastern University at Qinhuangdao, Qinhuangdao, China
Abstract
Purpose: The purpose of this paper is to propose a methodology to construct a stock market sentiment
lexicon by incorporating domain-specific knowledge extracted from diverse Chinese media outlets.
Design/methodology/approach: This paper presents a novel method to automatically generate financial
lexicons using a unique data set that comprises news articles, analyst reports and social media. Specifically, a
novel method based on keyword extraction is used to build a high-quality seed lexicon and an ensemble
mechanism is developed to integrate the knowledge derived from distinct language sources. Meanwhile, two
different methods, Pointwise Mutual Information and Word2vec, are applied to capture word associations.
Finally, an evaluation procedure is performed to validate the effectiveness of the method compared with four
traditional lexicons.
Findings: The experimental results from the three real-world testing data sets show that the ensemble
lexicons can significantly improve sentiment classification performance compared with the four baseline
lexicons, suggesting the usefulness of leveraging knowledge derived from diverse media in domain-specific
lexicon generation and corresponding sentiment analysis tasks.
Originality/value: This work appears to be the first to construct financial sentiment lexicons from over 2 million
posts and headlines collected from more than one language source. Furthermore, the authors believe that the
data set established in this study is one of the largest corpora used for Chinese stock market lexicon
acquisition. This work is valuable for extracting collective sentiment from multiple media sources and providing
decision-making support for stock market participants.
Keywords: Sentiment analysis, Stock market, Sentiment lexicon
Paper type: Research paper
Industrial Management & Data Systems, Vol. 120 No. 3, 2020, pp. 508-525
© Emerald Publishing Limited, 0263-5577
DOI 10.1108/IMDS-04-2019-0254
Received 26 April 2019
Revised 16 October 2019
Accepted 4 December 2019
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0263-5577.htm
This research is supported by the National Natural Science Foundation of China Nos 71271012,
71671011 and 71332003.
1. Introduction
Financial markets aggregate diverse types of information that are indispensable for
investors during their decision-making processes (Xu and Zhang, 2013). Practitioners and
researchers mostly utilize two predominant information sources: traditional media
(e.g. news articles and analyst reports) and social media (e.g. online communities and
blogs) (Kearney and Liu, 2014). As a complement to conventional media, the
tremendous amount of user-generated content on social media has been an important data
source for public mood mining (Fan and Gordon, 2014). Among the aforementioned channels,
most of the information is textual data (Chau and Xu, 2012). Although textual information is
prevalent, it is difficult to decode; thus, sentiment analysis (SA) is applied to convert opinion
in texts into a machine-friendly form (Loughran and McDonald, 2016). A sentiment lexicon
comprises words with certain semantic orientation (e.g. positive, negative or neutral), and it
is regarded as a fundamental component of the SA system (Feldman, 2013). Unsupervised
sentiment classification, also called the lexicon-scoring approach, calculates the sentiment
polarity of text according to the lexicon (Taboada et al., 2011). Moreover, the lexicon entries
are important features to construct supervised sentiment classifiers (Liu and Zhang, 2012).
Thus, the sentiment lexicon is favorable for both dominant SA approaches.
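The lexicon-scoring idea described above can be illustrated with a minimal Python sketch. It is not the procedure used in this paper: the function names, the zero threshold and the four toy lexicon entries are assumptions made purely for illustration, and the input is assumed to be already segmented into words (Chinese text would first need a word segmenter).

```python
from typing import Dict, Iterable


def lexicon_score(tokens: Iterable[str], lexicon: Dict[str, float]) -> float:
    """Sum the polarity scores of the tokens that appear in the lexicon."""
    # Words missing from the lexicon contribute nothing to the score.
    return sum(lexicon.get(token, 0.0) for token in tokens)


def classify(tokens: Iterable[str], lexicon: Dict[str, float],
             threshold: float = 0.0) -> str:
    """Map the aggregate score to a polarity label (threshold is an assumption)."""
    score = lexicon_score(tokens, lexicon)
    if score > threshold:
        return "positive"
    if score < -threshold:
        return "negative"
    return "neutral"


# Hypothetical toy lexicon; these entries are NOT taken from the paper.
toy_lexicon = {
    "上涨": 1.0,   # rise
    "利好": 1.0,   # favorable news
    "下跌": -1.0,  # fall
    "亏损": -1.0,  # loss
}

# A pre-segmented, made-up headline: "company / earnings / favorable / share price / rise".
headline = ["公司", "业绩", "利好", "股价", "上涨"]
print(classify(headline, toy_lexicon))
```

In this sketch every lexicon hit carries equal weight; a supervised classifier would instead use the lexicon entries as input features.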
SA has been frequently used to predict stock market variables (Pröllochs et al., 2016;
Oliveira et al., 2017). Moreover, conventional and social media are both useful data sources
to enhance market insights (Yu et al., 2013). Thus, SA systems used for market prediction
require sentiment lexicons adapted to the financial domain and diverse media sources.
However, few studies have focused on building such lexicons. Loughran and McDonald
(2011) manually create a financial dictionary from corporate disclosures collected from an
official website. Oliveira et al. (2016) construct a stock market sentiment lexicon by
exploiting a large-scale corpus of labeled microblog messages. These state-of-the-art
lexicons have three limitations. First, none of them incorporates knowledge from other
financial media into sentiment lexicon acquisition, which restricts their
performance in new language domains. Second, given the large volume of data, the
manual approach is rather costly. Finally, most of the studies use
English texts in the lexicon generation procedure. For other languages, direct translation
from English lexicons may be ineffective due to cultural differences. For example, in the stock
market context, the word "red" usually suggests a pessimistic opinion in English; however, in
Chinese the corresponding word "红" (red) signals a bullish, positive view.
To overcome these limitations, we propose an automated approach to build a stock
market lexicon, which is applicable to various media in the financial domain. The main
contributions of our work are as follows:
We propose a method that can incorporate domain-specific knowledge from diverse
media to generate a high-quality sentiment lexicon for the stock market.
We are among the first to build a Chinese financial lexicon using a unique data set
including news articles, analyst reports and stock comments from multiple media
sources.
We conduct experiments on three data sets derived from different media sources. The
results demonstrate that the lexicons we obtain outperform four prevalent traditional
dictionaries. Our work contributes to opinion mining research and provides
decision-making support for practitioners in the Chinese stock market.
The paper proceeds as follows. Section 2 reviews related work. Section 3 presents our
approach to generating Chinese domain-specific lexicons for the stock market. We describe the
experiments conducted and discuss the obtained results in Sections 4 and 5. Finally,
Section 6 concludes this paper.
2. Related work
2.1 Web media and stock market
With the popularity of Web 2.0, web media plays an important role in various business
problems due to its sharply increased content and rapid dissemination (Fan and Gordon,
2014; Agarwal et al., 2019). The user-generated content has been applied to summarize
opinions for many domains, including emergency events (Yates and Paquette, 2011;
Zhang et al., 2019; Wu et al., 2019), political campaigns (Cogburn and Espinoza-Vasquez,
2011; Enli, 2017) and brand management (Dessart et al., 2015; Nisar and Whitehead, 2016;
Greco and Polli, 2019).