A paper-text perspective. Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era

DOI	https://doi.org/10.1108/EL-09-2016-0192
Pages	689-708
Date	07 August 2017
Published date	07 August 2017
Author	Hao Wang, Sanhong Deng
Subject Matter	Information & knowledge management, Information & communications technology, Internet

A paper-text perspective

Studies on the influence of feature granularity for Chinese short-text-classification in the Big Data era

Hao Wang and Sanhong Deng

School of Information Management, Nanjing University, Nanjing, China

Abstract

Purpose – In the era of Big Data, network digital resources are growing rapidly, and short-text resources in particular, such as tweets, comments and messages, are showing vigorous vitality. This study aims to compare the categories discriminative capacity (CDC) of Chinese language fragments with different granularities and to explore and verify the feasibility, rationality and effectiveness of low-granularity features, such as Chinese characters, in Chinese short-text classification (CSTC).

Design/methodology/approach – This study takes discipline classification of journal articles from CSSCI as a simulation environment. On the basis of sorting out the distribution rules of classification features with various granularities, including keywords, terms and characters, the classification effects achieved by the SVM algorithm are comprehensively compared and evaluated from three angles: using the same experiment samples, testing before and after feature optimization, and introducing external data.

Findings – The granularity of a classification feature has an important impact on CSTC. In general, the larger the granularity is, the better the classification result is, and vice versa. However, a low-granularity feature is also feasible, and its CDC can be improved by reasonable weight setting, even exceeding that of a high-granularity feature when classification precision, computational complexity and text coverage are considered together.

Originality/value – This is the first study to propose that Chinese characters are more suitable as descriptive features in CSTC than terms and keywords and to demonstrate that the CDC of Chinese character features can be strengthened by mixing frequency and position as weight.

Keywords Categories discriminative capacity, Chinese character features, Chinese short-text classification, Feature granularity, Feature optimization

Paper type Research paper

Received 27 September 2016
Revised 13 April 2017
Accepted 15 April 2017

The Electronic Library, Vol. 35 No. 4, 2017, pp. 689-708
© Emerald Publishing Limited, 0264-0473
DOI 10.1108/EL-09-2016-0192

The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0264-0473.htm

This work was supported by Jiangsu Province Natural Science Foundation Project named “Study on Chinese Ontology Learning Oriented Patent Forewarning” (No. BK20130587) and the Major Program of National Social Science Foundation of China named “Study on Rapid Response Information System of Emergency Decision for Unexpected Events” (No. 13&ZD174).

1. Introduction

Text classification (TC) is a methodology that uses computers and natural language processing technology to automatically label text categories (Baccianella et al., 2014; Sebastiani, 2002; Sheydaei et al., 2015). In recent years, with the advent of the Big Data era, various types of network information resources have grown rapidly, especially short-text resources. In this work, a short text is a text no longer than one sentence, that is, a text ending with a period, question mark or exclamation mark, and so, of course, longer than one phrase. Generally speaking, its length is not more than 50 Chinese characters or English words. Therefore, an article title, a Twitter message or a micro-blog comment can all be considered short texts. On the other hand, information resource digitization is constantly expanding, accompanied by a growing demand for, and wider use of, automatic processing. Hence, TC, as a means of predicting discrete attributes for natural language fragments, plays an extremely important role in the field of intelligent processing of Big Data. Many fields can be modeled using the TC approach. These include category generation for resources such as bibliographies and theses (Wang, 2009; Wang et al., 2010; Zhang and Clark, 2015), sentiment classification of Web comments (Lima et al., 2015; Maks and Vossen, 2012; Wang et al., 2013) and grading of clinical care (Botsis et al., 2011; Figueroa et al., 2012; Marano et al., 2014). Chinese short-text classification (CSTC) uses Chinese short text as its processing object. However, because of the significant differences between languages and the shortness of the text, directly adopting character language processing techniques to achieve CSTC will create many problems.
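The working definition of short text given above (at most one sentence, and generally no more than 50 Chinese characters or English words) can be sketched as a simple check. This is only an illustrative reading of that definition: the function names, the set of sentence-ending marks and the way length units are counted are assumptions made here, not part of the original study, and the check does not try to handle periods inside abbreviations or numbers.

```python
import re

# Assumed sentence-ending marks (Chinese and Western forms).
SENTENCE_END = "。？！.?!"


def length_units(text):
    """Count each Chinese character as one unit and each run of Latin
    letters or digits (roughly, an English word) as one unit."""
    return len(re.findall(r"[\u4e00-\u9fff]|[A-Za-z0-9]+", text))


def is_short_text(text, max_units=50):
    """Hypothetical check: at most one sentence and at most max_units units."""
    body = text.strip()
    # A sentence-ending mark anywhere before the last character means
    # the text contains more than one sentence.
    if any(ch in SENTENCE_END for ch in body[:-1]):
        return False
    return 0 < length_units(body) <= max_units


print(is_short_text("基于支持向量机的中文短文本分类研究"))
print(is_short_text("这是第一句。这是第二句。"))
```

An article title passes the check, while a two-sentence comment does not; the 50-unit limit is exposed as a parameter because the text presents it only as a general guideline.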

It should be pointed out that the traditional TC method represents categories and unclassified texts as feature vectors and then calculates the similarities between category vectors and text vectors; the category with the greatest similarity is the category a given text belongs to (Nagwani, 2015; Peng and Huang, 2007; Zhang and Li, 2008). Machine learning (ML) quickly replaced this traditional method because the feature vectors of categories were complex to construct and difficult to operate on, whereas ML uses algorithms to learn the feature distribution of classified texts. The resulting classifier is then applied to unclassified texts to automatically generate their categories. The most important step in ML-based TC is the construction of the text-feature matrix (TFM), which is a unified description of all texts using one set of features. In the analysis of English texts, the TFM is generally built with words/terms as the descriptive attributes of texts and the frequency with which words occur in texts as the attribute value. These are ideal text features because the number of common English words is not large, and common English words have wide coverage. However, it is difficult to determine the ideal features for Chinese texts, especially short texts, because in Chinese texts the set of word features is very large and has narrow coverage in short texts, while character features are subjectively considered to lack meaning, resulting in poor classification performance.
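The two ideas in this paragraph, a text-feature matrix with frequencies as values and the traditional comparison of text vectors against category vectors, can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the four toy titles and their labels are invented, overlapping character bigrams stand in for coarser word/term features (real word segmentation would need an external tool), category vectors are simple sums of member rows, and cosine similarity is one common choice of similarity measure. It is not the experimental setup of the study, which uses SVM.

```python
from collections import Counter
from math import sqrt


def char_features(text):
    """Fine granularity: every non-space character is a feature."""
    return [ch for ch in text if not ch.isspace()]


def bigram_features(text):
    """Coarser granularity: overlapping character bigrams (a stand-in for terms)."""
    chars = char_features(text)
    return [a + b for a, b in zip(chars, chars[1:])]


def build_tfm(texts, extract):
    """Return (vocab, matrix); matrix[i][j] is the frequency of feature j in text i."""
    counts = [Counter(extract(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[f] for f in vocab] for c in counts]


def category_vectors(matrix, labels):
    """Sum the rows of each category into one category vector."""
    cats = {}
    for row, label in zip(matrix, labels):
        acc = cats.setdefault(label, [0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
    return cats


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(text, vocab, cats, extract):
    """Assign the category whose vector is most similar to the text vector."""
    counts = Counter(extract(text))
    vec = [counts[f] for f in vocab]
    return max(cats, key=lambda c: cosine(vec, cats[c]))


def coverage(text, vocab, extract):
    """Share of the text's features that already occur in the vocabulary."""
    feats, known = extract(text), set(vocab)
    return sum(f in known for f in feats) / len(feats) if feats else 0.0


# Invented toy data: two library-science and two economics titles.
TEXTS = ["图书馆信息检索", "信息检索系统", "宏观经济增长", "经济增长模型"]
LABELS = ["LIS", "LIS", "ECON", "ECON"]
NEW_TEXT = "经济模型"

results = {}
for name, extract in [("character", char_features), ("bigram", bigram_features)]:
    vocab, matrix = build_tfm(TEXTS, extract)
    cats = category_vectors(matrix, LABELS)
    results[name] = (
        len(vocab),
        coverage(NEW_TEXT, vocab, extract),
        classify(NEW_TEXT, vocab, cats, extract),
    )
    print(name, results[name])
```

On this toy input every character of the new title is already in the character vocabulary, whereas one of its three bigrams is unseen, which is a small-scale picture of the coverage problem that coarser features face in short texts. Vocabulary sizes on four titles say nothing about real corpora and should not be read as evidence either way.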

Short text is a very important information resource in the Web environment. Examples of short text include messages, micro letters, Web comments and news headlines. Short-text messages are short in length but rich in information and easy to disseminate. Short text has become a common content carrier and form of information exchange. However, it is difficult to acquire the descriptive features of short texts, which seriously impedes their classification. Classification of short texts is a key aspect of Web applications such as message screening, comment filtering and news classification. Effective CSTC has become a pressing problem.

To sum up, against the background of Big Data, the digitization of information resources is deepening, and automatic classification is being applied ever more widely. Therefore, as the scale of data continues to increase, how to reveal classification features as fully as possible and how to ensure full learning are both topics worth discussing. To help automatic classification technology better serve Chinese digital information in the Big Data environment, this work attempts to discover the distribution laws of features of various granularities in short texts and, on this basis, to compare in depth the influences and effects of feature granularity on CSTC from the aspects of reasonable feature set selection and feature weight setting, and to discuss the effective modes of feature set that offer the best comprehensive classification results and practical operability.
