Exploring topics related to data mining on Wikipedia

Date: 07 August 2017
Pages: 667-688
DOI: https://doi.org/10.1108/EL-09-2016-0188
Published date: 07 August 2017
Author: Yanyan Wang, Jin Zhang
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Yanyan Wang and Jin Zhang
School of Information Studies, University of Wisconsin-Milwaukee,
Milwaukee, Wisconsin, USA
Abstract
Purpose: Data mining has been a popular research area in recent decades. Many researchers study
data-mining theories, methods, applications and trends; however, there are very few studies on
data-mining-related topics in social media. This paper aims to explore the topics related to data mining based
on data collected from Wikipedia.
Design/methodology/approach: In total, 402 data-mining-related articles were obtained from
Wikipedia. These articles were manually classified into several categories by the coding method. Each
category formed an article-term matrix. These matrices were analysed and visualized by the self-organizing
map approach. Several clusters were observed in each category. Finally, the topics of these clusters were
extracted by content analysis.
Findings: The articles obtained were classified into six categories: applications, foundation and concepts,
methodologies, organizations, related fields and topics and technology support. Business, biology and
security were the three prominent topics of the applications category. The technologies supporting data
mining were software, systems, databases, programming languages and so forth. The general public was
more interested in data-mining organizations than researchers were. The public also focused on the
applications of data mining in business more than in other fields.
Originality/value: This study will help researchers gain insight into the general public’s perceptions of
data mining and discover the gap between the general public and themselves. It will assist researchers in
finding new techniques and methods, which may in turn provide them with new data-mining methods and
research topics.
Keywords: Social media, Data mining, Social Web mining, Theme discovery
Paper type: Research paper
Introduction
With the development of internet technologies, information and data are produced, shared,
and stored much faster than before. The volume of data grows every day as companies
capture large amounts of data about markets, products, customers and suppliers.
Individuals also receive large quantities of data from their daily life and the internet.
Moreover, the evolution of mobile devices, social media and Web technologies boosts the
growth of data and information. It is difficult, however, to deal with huge data sets using
traditional data analysis approaches. Because of these circumstances, the concept of data
mining was created.
To explore the internal relationship and patterns of data, data mining was proposed in the
1990s. Since then, data mining has been studied and used as a useful research method. As
more and more people face the problems of data analysis and management, this concept has
been widely accepted and the related techniques and methods have been frequently used by
both researchers and general users. Research topics about data mining can be found in a
large number of publications. In addition, there are introductions and discussions of data
mining on the internet, especially on social media platforms. Unlike data-mining
research studies, data-mining content on social media platforms has its own features.
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
Received 22 September 2016
Revised 16 March 2017
Accepted 9 April 2017
The Electronic Library
Vol. 35 No. 4, 2017
pp. 667-688
© Emerald Publishing Limited
0264-0473
DOI 10.1108/EL-09-2016-0188
As the use of data-mining theories, methods and technologies continues to increase, it
is necessary to gain insight into data mining and related topics. Previous research papers
have studied various aspects of data mining, but few have explored data-mining-related
topics based on data collected from social media. To fill this gap, this study aims to explore
the data-mining-related topics on Wikipedia, which is the largest online knowledge
collaboration. The self-organizing map (SOM) approach, a machine learning
approach, was applied to the data analysis.
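The pipeline outlined in the abstract (an article-term matrix analysed by a SOM) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the whitespace tokenization, the raw term counts, the 2 x 2 grid, the linear decay schedules and every function name (`article_term_matrix`, `train_som`, `best_matching_unit`) are hypothetical, and the four sample texts are invented for the example.

```python
import math
import random
from collections import Counter


def article_term_matrix(articles):
    """Build a raw term-frequency matrix: one row per article, one column per term."""
    tokenized = [text.lower().split() for text in articles]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([float(counts[term]) for term in vocabulary])
    return vocabulary, rows


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def best_matching_unit(weights, row):
    """Return the grid coordinate of the node whose weight vector is closest to row."""
    return min(weights, key=lambda node: squared_distance(weights[node], row))


def train_som(rows, grid_width=2, grid_height=2, epochs=50, seed=0):
    """Train a small rectangular SOM; the decay schedules are illustrative choices."""
    rng = random.Random(seed)
    dim = len(rows[0])
    weights = {(i, j): [rng.random() for _ in range(dim)]
               for i in range(grid_height) for j in range(grid_width)}
    initial_radius = max(grid_width, grid_height) / 2.0
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs              # shrinks linearly from 1 towards 0
        learning_rate = 0.5 * frac
        radius = max(initial_radius * frac, 0.5)
        for row in rows:
            bmu = best_matching_unit(weights, row)
            for node, w in weights.items():
                grid_dist_sq = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
                # Nodes close to the winner on the grid are pulled harder.
                influence = math.exp(-grid_dist_sq / (2 * radius ** 2))
                for k in range(dim):
                    w[k] += learning_rate * influence * (row[k] - w[k])
    return weights


# Invented sample "articles" (assumption: real input would be Wikipedia article text).
articles = [
    "association rule mining market basket",
    "market basket analysis association rule",
    "gene expression clustering biology",
    "biology gene sequence clustering",
]
vocabulary, rows = article_term_matrix(articles)
weights = train_som(rows)
for text, row in zip(articles, rows):
    print(best_matching_unit(weights, row), text)
```

Each article is mapped to the grid node whose weight vector is nearest to its term vector, so articles that share vocabulary tend to land on the same or neighbouring nodes. A study-scale analysis would normally use a larger grid and weighted or normalized term frequencies.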
Literature review
Data mining
Data mining is a method to reveal previously unknown and reliable insights from large data
sets (Elkan, 2001). Because the massive volume of data from different fields keeps growing,
useful analysis methods and techniques are urgently needed. Therefore, data mining has
become an increasingly important research area (Liao et al., 2012).
With the development of data mining, a variety of methods and techniques from other
areas have been introduced to the data-mining area, such as classification, clustering and
database technology (Liao et al., 2012). Han and Kamber (2006) identified the
disciplines that have most influenced and improved data-mining methods: statistics,
machine learning, database systems, warehousing and information retrieval. Meanwhile,
data mining has impacted other research fields, such as chemistry, medicine, business and so
forth (Aljumah et al., 2013; Borghini et al., 2010; Zhang et al., 2013).
In addition to prediction, data mining has other functions. Han and Kamber (2006)
summarized the different patterns that can be mined: frequent patterns, associations and
correlations; classification and regression; clustering analysis; and outlier analysis. Fu (2011)
offered a similar view of time series data mining, stating that its main tasks are
pattern discovery and clustering, classification, rule discovery and
summarization. Different data-mining methods and techniques have been proposed and
applied to accomplish different tasks. For example, k-means, fuzzy c-means and SOM are
frequently used in clustering analysis. Moreover, there are specific methods to mine certain
types of data, such as the model-based sequence clustering methods for mining temporal data
(Law and Kwok, 2000).
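As an illustration of the clustering methods named above, the following is a minimal k-means sketch in plain Python. It is an assumption-laden example rather than a method taken from the cited works: the helper names, the six sample points and the deterministic farthest-first initialization (used here in place of the more common random initialization so that the result is reproducible) are all hypothetical choices.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return [sum(coord) / n for coord in zip(*points)]


def farthest_first_centroids(points, k):
    """Pick the first point, then repeatedly the point farthest from the chosen centroids."""
    centroids = [list(points[0])]
    while len(centroids) < k:
        farthest = max(
            points,
            key=lambda p: min(squared_distance(p, c) for c in centroids),
        )
        centroids.append(list(farthest))
    return centroids


def k_means(points, k, iterations=100):
    centroids = farthest_first_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            new_centroids.append(mean_point(members) if members else centroids[c])
        if new_centroids == centroids:  # no centroid moved, so the loop has converged
            break
        centroids = new_centroids
    return labels, centroids


# Invented two-dimensional sample data with two well-separated groups.
points = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]]
labels, centroids = k_means(points, k=2)
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Fuzzy c-means differs in that each point receives a degree of membership in every cluster instead of a single label, while SOM additionally arranges the cluster prototypes on a grid so that neighbouring nodes represent similar data.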
Social Web mining
Web mining, as a branch of data mining, is playing an increasingly important role in
research. Social Web mining is one of the primary components in studies related to Web
mining and social media. Social media is the way people generate, share and communicate
information in virtual communities and networks (Ahlqvist et al., 2008). Under the big
umbrella of social media, social media sites and applications vary a lot. For instance, Twitter,
which is regarded as a microblog, allows users to communicate and create posts of up to
140 characters (Kwak et al., 2010), while Wikipedia provides opportunities for collaborative
information and knowledge production (Bruns, 2006). With the development of mobile
devices, geo-mapping tools (e.g. Google Maps) and self-tracking applications (e.g. Quantified
Self) have been invented.
The methods applied in social Web mining can be classified into two groups: social
network analysis methods and sentiment analysis methods. Social network analysis tries to
reveal human relationships and connections (Hansen et al., 2010). In recent years, various
tools have been invented to analyse and visualize social networks, such as UCINET, Pajek,
NetworkX in Python and igraph in R (Borgatti et al., 2002; Kolaczyk and Csárdi, 2014; de
Nooy et al., 2011). Sentiment analysis, also known as opinion mining, is related to text
mining (Thelwall et al., 2011). Therefore, methods for text mining are also used in sentiment