Characters-based sentiment identification method for short and informal Chinese text

Date: 19 February 2018
Pages: 57-66
DOI: https://doi.org/10.1108/IDD-05-2017-0047
Published date: 19 February 2018
Author: Qiujun Lan, Haojie Ma, Gang Li
Subject Matter: Library & information science, Library & information services, Lending, Document delivery, Collection building & management, Stock revision, Consortia
Qiujun Lan and Haojie Ma
Business School, Hunan University, Changsha, China, and
Gang Li
School of Information Technology, Deakin University, Melbourne, Australia
Abstract
Purpose: Sentiment identification of Chinese text faces many challenges, such as requiring complex preprocessing steps, carefully preparing various word
dictionaries and dealing with many informal expressions, all of which lead to high computational complexity.
Design/methodology/approach: A method based on Chinese characters instead of words is proposed. This method represents the text as a
fixed-length vector and introduces the chi-square statistic to measure the categorical sentiment score of a Chinese character. Based on these,
sentiment identification can be accomplished in four main steps.
Findings: Experiments on corpora with various themes indicate that the proposed method performs slightly worse than existing Chinese
word-based methods on most texts, but better on short and informal texts. In particular, the computational complexity of the
proposed method is far lower than that of word-based methods.
Originality/value: The proposed method exploits the property that a Chinese character is a linguistic unit carrying semantic information. Compared
with word-based methods, the computational efficiency of this method is significantly improved at a slight loss of accuracy. It is more concise and
avoids the problems that arise from preparing predefined dictionaries and performing various data preprocessing steps.
Keywords: Information technology, Text mining, Data mining, Chinese character, Sentiment identification, Short text
Paper type: Research paper
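The abstract names the chi-square statistic as the measure of a character's categorical sentiment score but does not give the computation. As a rough illustration only, the sketch below uses the standard 2x2 chi-square statistic from text feature selection, computed per character over document counts. The function name `chi_square_scores`, the labels "pos" and "neg" and the toy reviews are assumptions made for this example and are not taken from the paper; the authors' exact scoring may differ.

```python
from collections import Counter

def chi_square_scores(docs, labels, target="pos"):
    """Chi-square association between each character and the target class.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    n = len(docs)
    n_target = sum(1 for y in labels if y == target)
    in_target = Counter()  # A: target-class documents containing the character
    in_other = Counter()   # B: other-class documents containing the character
    for text, y in zip(docs, labels):
        chars = {ch for ch in text if not ch.isspace()}  # count each char once per document
        (in_target if y == target else in_other).update(chars)
    scores = {}
    for ch in set(in_target) | set(in_other):
        a = in_target[ch]
        b = in_other[ch]
        c = n_target - a         # target-class documents without the character
        d = (n - n_target) - b   # other-class documents without the character
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        scores[ch] = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores

# Toy corpus (made up for this example): two positive and two negative reviews.
docs = ["好好", "很好", "差", "很差"]
labels = ["pos", "pos", "neg", "neg"]
scores = chi_square_scores(docs, labels)
print(scores["好"], scores["差"], scores["很"])  # 4.0 4.0 0.0
```

The statistic measures only the strength of association: in the toy corpus both 好 and 差 score 4.0, while 很, which appears equally in both classes, scores 0.0. Deciding which class a character leans toward needs an extra step, for example comparing its counts in the two classes.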
1. Introduction
With the rapid development of the internet and the advent of
Web 2.0, text sentiment identification has become a hot research
area with a range of applications in market intelligence,
recommendation systems and social public feelings analysis (Xia
et al., 2010; Zhong and Deng, 2012; Zhang et al., 2010; Tang
et al., 2007; Manek et al., 2017; David et al., 2016). As a
burgeoning technology, text sentiment identification can
automatically analyze documents drawn from the huge and expanding
body of text information, providing convenience for commodity
evaluation, public opinion control and investor sentiment
research.
Chinese, one of the official languages of the United Nations, is
widely used and has a long history. According to a report
released by the UN Broadband Commission, by 2015 the number
of internet users using Chinese had exceeded the number using
English. English is a phonetic, alphabetic language, while
Chinese is ideographic and written in graphic characters. The steps of
text sentiment identification therefore differ considerably between
Chinese and English, and the biggest difference lies in segmentation. English
segmentation can be divided into three parts (text splitting,
stop-word removal and stemming). Because English sentences are made up
of words separated by spaces, they are easy to divide using spaces alone. Chinese
sentences consist of characters and need a more complex splitting method.
Furthermore, Chinese text segmentation is often ambiguous,
which has become a major challenge. Scholars from various
regions/countries such as
Taiwan, Singapore, Hong Kong and Japan, as well as from
Mainland China, are interested in Chinese information
processing technologies and the related text sentiment
identification technology (Chou et al., 2015; Zagibalov and
Carroll, 2008; Hu and Chen, 2016).
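The contrast between the two languages can be seen in a few lines of code. The sketch below is illustrative only: the helper names `english_tokens` and `chinese_chars` and the sample strings are assumptions made for this example. It treats code points U+4E00 to U+9FFF (the basic CJK Unified Ideographs block) as Chinese characters, which is a simplification.

```python
def english_tokens(sentence):
    """Split an English sentence into lowercase tokens on whitespace."""
    return sentence.lower().split()

def chinese_chars(sentence):
    """Keep only characters in the basic CJK Unified Ideographs block."""
    return [ch for ch in sentence if "\u4e00" <= ch <= "\u9fff"]

# English: whitespace alone recovers the words.
print(english_tokens("The phone is great"))
# ['the', 'phone', 'is', 'great']

# Chinese: there are no spaces, and word boundaries can be ambiguous.
# 研究生命起源 can be read as 研究/生命/起源 ("study the origin of life")
# or as 研究生/命/起源 ("graduate student / fate / origin").
# Splitting into characters needs no dictionary and has only one result.
print(chinese_chars("研究生命起源"))
# ['研', '究', '生', '命', '起', '源']
```

Character-level splitting gives up word boundaries altogether, which is why it sidesteps both the segmentation dictionary and the ambiguity problem described above.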
Nowadays, e-commerce is developing rapidly, and more and more
comment text is being generated. Customers prefer to share their
opinions on BBS, microblogs, etc. In the financial field, some
researchers construct investor sentiment indices from the reviews of
financial forum users (Yi et al., 2016). As a part of behavioral
finance, such an index can be applied to high-frequency trading. Sun, Najand
and Shen explore the predictive relation between high-frequency
investor sentiment and stock market returns (Sun et al., 2016).
They found substantial evidence that intraday S&P 500 index
returns are predictable using lagged half-hour investor
sentiment. However, the reviews are short, informal and filled with
buzzwords, slang, typos, etc. After preprocessing, such text
has sparse features and carries little information. The traditional
sentiment classification methods do not perform well in these
The current issue and full text archive of this journal is available on
Emerald Insight at: www.emeraldinsight.com/2398-6247.htm
Information Discovery and Delivery
46/1 (2018) 57-66
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-05-2017-0047]
This research was sponsored by the Natural Science Foundation of China (Grant
No. 71171076) and the key project of the National Natural Science Fund of
China (Grant No. 71431008).
Received 2 May 2017
Revised 4 December 2017
Accepted 5 December 2017