Characters-based sentiment identification method for short and informal Chinese text
Date | 19 February 2018 |
Pages | 57-66 |
DOI | https://doi.org/10.1108/IDD-05-2017-0047 |
Published date | 19 February 2018 |
Author | Qiujun Lan, Haojie Ma, Gang Li |
Subject Matter | Library & information science, Library & information services, Lending, Document delivery, Collection building & management, Stock revision, Consortia |
Characters-based sentiment identification method for short and informal Chinese text
Qiujun Lan and Haojie Ma
Business School, Hunan University, Changsha, China, and
Gang Li
School of Information Technology, Deakin University, Melbourne, Australia
Abstract
Purpose – Sentiment identification of Chinese text faces several challenges: it requires complex preprocessing steps, carefully prepared word
dictionaries and the handling of many informal expressions, all of which lead to high computational complexity.
Design/methodology/approach – A method based on Chinese characters instead of words is proposed. This method represents the text as a
fixed-length vector and introduces the chi-square statistic to measure the categorical sentiment score of a Chinese character. Based on these,
sentiment identification is accomplished in four main steps.
Findings – Experiments on corpora with various themes indicate that the proposed method performs slightly worse than existing Chinese
word-based methods on most texts, but better on short and informal texts. In particular, the computational complexity of the
proposed method is far lower than that of word-based methods.
Originality/value – The proposed method exploits the property that a Chinese character is a linguistic unit carrying semantic information. In contrast
to word-based methods, the computational efficiency of this method is significantly improved at a slight loss of accuracy. It is more concise and
avoids the problems that result from preparing predefined dictionaries and various data preprocessing steps.
Keywords Information technology, Text mining, Data mining, Chinese character, Sentiment identification, Short text
Paper type Research paper
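The character-level chi-square scoring summarized in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 2x2 contingency-table form of the chi-square statistic is the standard one, while the sign attached to each score, the two-element `text_vector` representation, the function names and the toy corpus are all assumptions introduced here for illustration.

```python
from collections import Counter


def chi_square_scores(texts, labels):
    """Signed chi-square score per character (label 1 = positive, 0 = negative).

    The magnitude is the standard 2x2 chi-square statistic; the sign is an
    illustrative assumption: positive when the character is over-represented
    in positive texts, negative otherwise.
    """
    n = len(texts)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = n - n_pos
    pos_df, neg_df = Counter(), Counter()
    for text, y in zip(texts, labels):
        chars = set(text) - {" "}  # document frequency: count each char once
        (pos_df if y == 1 else neg_df).update(chars)

    scores = {}
    for ch in set(pos_df) | set(neg_df):
        a = pos_df[ch]   # positive texts containing ch
        b = neg_df[ch]   # negative texts containing ch
        c = n_pos - a    # positive texts without ch
        d = n_neg - b    # negative texts without ch
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
        scores[ch] = chi2 if a * d >= b * c else -chi2
    return scores


def text_vector(text, scores):
    """Hypothetical fixed-length representation: [positive mass, negative mass]."""
    vals = [scores.get(ch, 0.0) for ch in text]
    pos = sum(s for s in vals if s > 0)
    neg = sum(-s for s in vals if s < 0)
    return [pos, neg]


# Toy labelled corpus (made up for this example).
texts = ["好好", "很好", "差", "很差"]
labels = [1, 1, 0, 0]
scores = chi_square_scores(texts, labels)
print(scores)
print(text_vector("很好", scores))
```

The vector has the same length for any input text, so it can be passed to an ordinary classifier without word segmentation or a sentiment dictionary; the actual vector layout used in the paper is not given in this excerpt.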
1. Introduction
With the rapid development of the internet and the advent of
Web 2.0, text sentiment identification has become a hot research
area with a range of applications in market intelligence,
recommendation systems and public sentiment analysis (Xia
et al., 2010; Zhong and Deng, 2012; Zhang et al., 2010; Tang
et al., 2007; Manek et al., 2017; David et al., 2016). As a
burgeoning technology, text sentiment identification can
automatically analyze documents drawn from huge and expansive
bodies of text, providing convenience for commodity
evaluation, public opinion control and investor sentiment
research.
Chinese, one of the official languages of the United Nations, is
widely used and has a long history. According to a report
released by the UN Broadband Commission, by 2015 the number
of internet users using Chinese had exceeded the number using
English. English is a phonetic, alphabetic language, while
Chinese is ideographic and written in graphic characters. The text
sentiment identification steps for Chinese and English are therefore
very different, and the biggest difference lies in segmentation. English
segmentation can be divided into three parts (text splitting,
stop-word removal and stemming). Because English text is made up
of space-delimited words, sentences are easy to divide using spaces
alone. Chinese text, which consists of characters, needs a more
complex splitting method. Furthermore, Chinese text segmentation
is often ambiguous, which has become a big challenge.
Scholars from various regions/countries such as
Taiwan, Singapore, Hong Kong and Japan, as well as from
Mainland China, are interested in Chinese information
processing technologies and the related text sentiment
identification technology (Chou et al., 2015; Zagibalov and
Carroll, 2008; Hu and Chen, 2016).
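The segmentation contrast described above can be illustrated with a small sketch. It is only an illustration under stated assumptions: forward maximum matching is one common dictionary-based segmentation strategy and is not attributed to the paper, and the `forward_max_match` helper, the toy vocabulary and both example sentences are hypothetical.

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy dictionary segmentation: take the longest vocab word at each position.

    Falls back to a single character when no dictionary word matches.
    """
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                words.append(piece)
                i += size
                break
    return words


english = "the service was great"
chinese = "研究生命起源"  # intended reading: 研究 / 生命 / 起源
toy_vocab = {"研究", "研究生", "生命", "起源"}  # hypothetical dictionary

english_words = english.split()                        # whitespace is enough
chinese_words = forward_max_match(chinese, toy_vocab)  # needs a dictionary
chinese_chars = list(chinese)                          # character units, no dictionary

print(english_words)
print(chinese_words)
print(chinese_chars)
```

With this toy dictionary the greedy matcher yields 研究生 / 命 / 起源 rather than the intended 研究 / 生命 / 起源, which shows the kind of ambiguity that word segmentation has to resolve. Splitting into characters sidesteps the dictionary entirely.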
Nowadays, e-commerce is developing rapidly and more
comment text is being generated. Customers prefer to share their
opinions on BBS, microblogs, etc. In the financial field, some
researchers construct investor sentiment indices from the reviews
of financial forum users (Yi et al., 2016). As a part of behavioral
finance, such indices can be applied to high-frequency trading. Sun,
Najand and Shen explored the predictive relation between
high-frequency investor sentiment and stock market returns (Sun et al., 2016).
They found substantial evidence that intraday S&P 500 index
returns are predictable using lagged half-hour investor
sentiment. However, the reviews are short, informal and filled with
buzzwords, slang, typos, etc. After preprocessing, they have
sparse features and scant information. The traditional
sentiment classification methods do not perform well in these
The current issue and full text archive of this journal is available on
Emerald Insight at: www.emeraldinsight.com/2398-6247.htm
Information Discovery and Delivery
46/1 (2018) 57–66
© Emerald Publishing Limited [ISSN 2398-6247]
[DOI 10.1108/IDD-05-2017-0047]
This research was sponsored by the Natural Science Foundation of China (Grant
No. 71171076) and the key project of the National Natural Science Fund of
China (Grant No. 71431008).
Received 2 May 2017
Revised 4 December 2017
Accepted 5 December 2017