A paper-text perspective. Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era

DOI: https://doi.org/10.1108/EL-09-2016-0192
Pages: 689-708
Date: 07 August 2017
Published date: 07 August 2017
Author: Hao Wang, Sanhong Deng
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Hao Wang and Sanhong Deng
School of Information Management, Nanjing University, Nanjing, China
Abstract
Purpose: In the era of Big Data, network digital resources are growing rapidly, and short-text
resources such as tweets, comments and messages are showing particular vitality. This study aims to
compare the categories discriminative capacity (CDC) of Chinese language fragments with different
granularities and to explore and verify the feasibility, rationality and effectiveness of
low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).
Design/methodology/approach: This study takes discipline classification of journal articles from
CSSCI as a simulation environment. On the basis of sorting out the distribution rules of classification
features with various granularities, including keywords, terms and characters, the classification
effects obtained with the SVM algorithm are comprehensively compared and evaluated from three angles:
using the same experiment samples, testing before and after feature optimization, and introducing
external data.
Findings: The granularity of a classification feature has an important impact on CSTC. In general, the
larger the granularity is, the better the classification result is, and vice versa. However, a
low-granularity feature is also feasible, and its CDC can be improved by reasonable weight setting,
even exceeding that of a high-granularity feature when classification precision, computational
complexity and text coverage are considered together.
Originality/value: This is the first study to propose that Chinese characters are more suitable as
descriptive features in CSTC than terms and keywords and to demonstrate that the CDC of Chinese
character features can be strengthened by mixing frequency and position as weight.
Keywords: Categories discriminative capacity, Chinese character features,
Chinese short-text classification, Feature granularity, Feature optimization
Paper type: Research paper
1. Introduction
Text classication (TC) is a methodology that uses computers and natural language
processing technology to automatically label text categories (Baccianella et al., 2014;
Sebastiani, 2002;Sheydaei et al., 2015). In recent years, with the advent of the Big Data era,
various types of network information resources have grown rapidly, especially short-text
resources. In this work, short text is that text whose length is equal to or less than the length
of one sentence which ends with a mark of a period, question mark or exclamation, so, of
course, longer than the length of one phrase. Generally speaking, its length is not more than
50 Chinese characters or English words. Therefore, an article title, a twitter message or a
micro-blog comment can all be considered short texts. On the other hand, information
This work was supported by Jiangsu Province Natural Science Foundation Project named “Study on
Chinese Ontology Learning Oriented Patent Forewarning” (No. BK20130587) and the Major Program of
National Social Science Foundation of China named “Study on Rapid Response Information System of
Emergency Decision for Unexpected Events” (No. 13&ZD174).
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
A paper-text
perspective
689
Received 27 September 2016
Revised 13 April 2017
Accepted 15 April 2017
TheElectronic Library
Vol.35 No. 4, 2017
pp.689-708
©Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-09-2016-0192
resource digitization is constantly expanding, accompanied by a strengthening of the
requirement and extent of automatic processing. Hence, TC, as a means of predicting discrete
attributes for natural language fragments, plays an extremely important role in the eld of
intelligent processing of Big Data. There are many elds that can be modeled using the TC
approach. These include category generation of resources, such as bibliographies and theses
(Wang, 2009;Wang et al., 2010;Zhang and Clark, 2015), emotional division of Web comments
(Lima et al., 2015;Maks and Vossen, 2012;Wang et al., 2013) and grading of clinical care
(Botsis et al., 2011;Figueroa et al., 2012;Marano et al., 2014). Chinese short-text
classication (CSTC) uses Chinese short text as a processing object. However, due to the
characteristics of signicant discrimination in languages and the shortness of the text,
directly adopting character language processing techniques to achieve CSTC will create
many problems.
The traditional TC method represents categories and unclassified texts as feature vectors and
then calculates the similarities between category vectors and text vectors; a given text is
assigned to the category with the greatest similarity (Nagwani, 2015; Peng and Huang, 2007;
Zhang and Li, 2008). Machine learning (ML) quickly replaced this traditional method because the
feature vectors of categories were complicated to construct and difficult to operate on, whereas
ML uses algorithms to learn the feature distribution of classified texts. The resulting
classifier is then applied to unclassified texts to automatically generate their categories. The
most important step in ML-based TC is the construction of the text-feature matrix (TFM), which is
a unified description of all texts using one set of features. In the analysis of English texts,
the TFM is generally built with words/terms as the descriptive attributes of texts and the
frequency with which words occur in texts as the attribute values. These are ideal text features
because the number of common English words is not large, and common English words have wide
coverage. However, it is difficult to determine the ideal features for Chinese texts, especially
short texts: word features are very numerous and have narrow coverage in short texts, while
character features are subjectively considered to lack meaning and therefore to result in poor
classification performance.
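
To make the two ideas in this paragraph concrete, the following minimal Python sketch builds a
text-feature matrix with single Chinese characters as features and raw frequency as the attribute
value, and then applies the traditional similarity-based assignment (cosine similarity between a
text vector and summed category vectors). It is an illustration only, not the SVM-based
experiments of this study: the titles, the category labels and all function names are
hypothetical.

```python
import math
from collections import Counter


def char_features(text):
    """Use every non-whitespace character as one feature (lowest granularity)."""
    return [ch for ch in text if not ch.isspace()]


def build_tfm(texts, tokenize):
    """Build a text-feature matrix: one row per text, one column per feature,
    with the feature's frequency in that text as the attribute value."""
    counts = [Counter(tokenize(text)) for text in texts]
    vocabulary = sorted(set().union(*counts))
    matrix = [[count[feature] for feature in vocabulary] for count in counts]
    return vocabulary, matrix


def category_vectors(matrix, labels):
    """Sum the rows that share a label into one vector per category."""
    vectors = {}
    for row, label in zip(matrix, labels):
        acc = vectors.setdefault(label, [0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return vectors


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, vocabulary, vectors, tokenize):
    """Assign the text to the category whose vector is most similar to it."""
    counts = Counter(tokenize(text))
    row = [counts[feature] for feature in vocabulary]
    return max(vectors, key=lambda label: cosine(row, vectors[label]))


# Hypothetical toy data: four article titles with made-up discipline labels.
titles = ["图书馆信息服务", "数字图书馆建设", "货币政策分析", "金融市场风险"]
labels = ["library science", "library science", "economics", "economics"]

vocabulary, matrix = build_tfm(titles, char_features)
vectors = category_vectors(matrix, labels)
print(classify("图书馆管理", vocabulary, vectors, char_features))
```

Passing a word segmenter instead of `char_features` as the `tokenize` argument would give the
term-granularity variant of the same matrix, which is one simple way to compare how many columns
each granularity produces and how many features of an unseen short text are covered.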
Short text is a very important information resource in the Web environment. Examples of
short text include messages, micro letters, Web comments and news headlines. Short-text
messages are short in length but rich in information and easy to disseminate. Short text has
become a common content carrier and form of information exchange. However, it is difficult to
acquire the descriptive features of short texts, which seriously impedes their classification.
Classification of short texts is a key step in Web applications such as message screening,
comment filtering and news classifying. Effective CSTC has therefore become a pressing problem.
To sum up, against the background of Big Data, the digitization of information resources is
deepening, and automatic classification is being applied more widely. Therefore, as data scale
continues to increase, how to reveal classification features as fully as possible and how to
ensure sufficient learning are topics worthy of discussion. To help automatic classification
technology better serve Chinese digital information in the Big Data environment, this work
attempts to discover the distribution laws of features of various granularities in short texts
and, on this basis, to compare in depth the influences and effects of feature granularity on
CSTC in terms of reasonable feature set selection and feature weight setting, and to discuss
which feature set modes offer the best comprehensive classification results and practical
operability.