A survey on mining stack overflow: question and answering (Q&A) community

DOI: https://doi.org/10.1108/DTA-07-2017-0054
Publication Date: 03 April 2018
Pages: 190-247
Authors: Arshad Ahmad, Chong Feng, Shi Ge, Abdallah Yousif
Subject: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Arshad Ahmad, Chong Feng, Shi Ge and Abdallah Yousif
Computer Science and Technology, Beijing Institute of Technology, Beijing, China
Abstract
Purpose: Software developers extensively use Stack Overflow (SO) for knowledge sharing on software
development. Thus, software engineering researchers have started mining the structured/unstructured data
present in certain software repositories, including the Q&A software developer community SO, with the aim of
improving software development. The purpose of this paper is to show how academics/practitioners can
benefit from the valuable user-generated content shared on various online social networks, specifically from
the Q&A community SO, for software development.
Design/methodology/approach: A comprehensive literature review was conducted, and 166 research
papers on SO concerning software development, from the inception of SO until June 2016, were categorized.
Findings: Most of the studies revolve around a limited number of software development tasks;
approximately 70 percent of the papers used data from millions of posts, applied basic machine learning methods,
and conducted semi-automatic investigations and quantitative studies. Thus, future research should
focus on overcoming the identified challenges and gaps.
Practical implications: The work on SO is classified into two main categories: SO design and usage,
and SO content applications. These categories not only give insights to Q&A forum providers about
the shortcomings in the design and usage of such forums but also provide ways to overcome them in future.
They also enable software developers to exploit such forums for the identified under-utilized tasks of
software development.
Originality/value: The study is the first of its kind to explore the work on SO about software development
and makes an original contribution by presenting a comprehensive review, design/usage shortcomings of
Q&A sites, and future research challenges.
Keywords: Mining, Software development, Information retrieval, Text mining, Software repositories,
Stack overflow
Paper type: Literature review
1. Introduction
1.1 Background
With the rapid advancement and growing use of diverse social media
technologies (Wang et al., 2017; Khusro et al., 2017; Jafari et al., 2012; Borjigen, 2015; Udanor
et al., 2016), e.g., Q&A forums/communities, blogs, and wikis, the software engineering
research community has also started to recognize and utilize them for software development.
Software development is a knowledge-intensive activity (Ahmad and Khan, 2008; Alnuem
et al., 2012; Khan et al., 2012; Khan et al., 2011) and the recent advancements in social media
technologies have further pushed software developers to leverage them for knowledge sharing,
learning, and collaborating with others. Besides, the innovations in social media
technologies have also changed the ways in which software is developed, challenging
old conventions about how developers learn and work with others. These different social
media technologies thus serve not only as software repositories holding structured/unstructured
data about the software development life cycle (SDLC), but also as a
professional user base (Storey et al., 2014; de Souza et al., 2016; Storey, 2015; Parnin et al.,
2013; Pagano and Maalej, 2011; Tian et al., 2012; MacLeod et al., 2015).
Data Technologies and
Applications
Vol. 52 No. 2, 2017
pp. 190-247
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-07-2017-0054
Received 22 July 2017
Revised 8 November 2017
Accepted 9 December 2017
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2514-9288.htm
The authors would like to acknowledge the support provided by the National 863 Project, China, under
Research Grant No. 2015AA015404.
In recent years, the software engineering research community has endeavored to
enhance software development by deeply mining and assessing software repositories,
e.g., archived e-mail communication, source code changes, bug repositories, execution logs
(Chen et al., 2015; Godfrey et al., 2008; Hassan, 2008), and online Q&A forums (Ponzanelli
et al., 2014a; Ponzanelli et al., 2015; Treude and Robillard, 2016). These research
efforts show that mining these repositories can yield useful, real-world results, thus
enabling software developers and project managers to comprehend their
software systems and eventually enhance the quality of their end products in a more timely
and cost-efficient way (Chen et al., 2015; Tichy, 2010).
Recently, specific achievements have been reported in mining and investigating the
available structured data in these software repositories (Chen et al., 2015). These achievements
are attributed to the research community's interest, the level of effort required for analysis, the presence
of useful knowledge, the tool support available for mining/analysis, the (structured) nature of the data
itself, and the availability of such repositories. Structured data refer to information that is
organized following some specific data model or known form/structure. For example, source
code parse trees, call graphs, inheritance graphs, execution logs, and traces are all structured
data in software repositories.
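To make the notion of structured data concrete, the following minimal sketch parses a small piece of source code into a parse tree and queries it by node type. It uses Python's standard-library `ast` module purely as an illustration; the sample function `add` and the variable names are assumptions introduced here and do not come from the surveyed studies.

```python
import ast

# A tiny, made-up source snippet used only for illustration.
source = "def add(a, b):\n    return a + b\n"

# Parsing turns the raw text into a tree that follows a known data model.
tree = ast.parse(source)

# Because the tree is structured, it can be queried by node type
# instead of by fragile text matching.
node_types = [type(node).__name__ for node in ast.walk(tree)]
function_names = [
    node.name for node in ast.walk(tree) if isinstance(node, ast.FunctionDef)
]

print(function_names)
print(sorted(set(node_types)))
```

The same query over free-form natural language has no such schema to rely on, which is the contrast the definition above draws.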
However, the recent exponential surge in availability and forms of unstructured data in
software repositories has also pushed the software engineering research community to mine
and analyze the useful knowledge present in such repositories, i.e., different versioning
systems, e.g., Git[1], SVN[2], CVS[3], archived communications, e.g., mailing lists, chat logs,
and online forums, e.g., Q&A websites, mobile app stores, software artifacts, online
video-sharing websites, e.g., programming tutorials shared on YouTube[4], and slide
hosting services, e.g., technical presentations shared on SlideShare[5] (Bavota, 2016).
Unstructured data refer to natural language text or information that is not organized by
following some specific data model or known form/structure (Chen et al., 2015). Despite the
rise in availability of unstructured data (approximately 80-85 percent of the data in software repositories)
and in researchers' focus on it, there are still some associated challenges, i.e., the lack of automated
techniques for mining and understanding, which are believed to be an impediment for
researchers and practitioners to efficiently utilize these repositories for software
development (Chen et al., 2015; Hassan, 2008; Blumberg and Atre, 2003).
Recent years have also witnessed enormous growth in the development of various software
development tools, languages, and platforms for various purposes, driven by diverse demands
from developers and users. With this demanding pace of evolution, software developers also
need to be skilled not only with their existing tools, languages, and platforms but also with
every newer version or feature released. Currently, most software developers use
several online development communities to solve their problems by posing/discussing Q&A
with other professionals in the community, e.g., Quora[6], Stack Exchange (SE)[7], Reddit[8],
and GitHub[9]. Among all these communities, SE is deemed to be the most popular Q&A
community due to the number of registered users, daily visits, and above all the satisfaction
level of users. The SE Q&A community itself has a diverse set of communities covering
different topics in mathematics, statistics, computer science/programming, and education.
In total, there are 150+ Q&A communities on SE, including Stack Overflow (SO), founded in
2008, the prominent site for software developers to discover and post Q&A about the entire
software development spectrum.
The importance and abundance of unstructured data in software repositories for software
engineering researchers can be seen from the following sample statistics: GitHub[10] hosted
about 46K software projects in 2009, 1M in 2010, 10M in 2013, 27M in 2015, and, in total,
more than 47M projects as of 2016 (the number of software projects grew by a factor of about
586 in around six years). Every hour, about 102K users search
for help on SE[11], and about 6K hours of fresh video content are shared on YouTube[12].
Similarly, the SO Q&A community aims to serve diverse software development topics on
tools, platforms, and other related software development issues. The SO[13] community itself
serves more than 40M professional and novice programmers every month and has more
than 6M registered users, approximately 12M questions, 20M answers, 51M comments, and
46K tags on various issues/topics of software development.
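The GitHub growth figure quoted earlier in this paragraph can be checked with a short calculation. This is a minimal sketch, assuming the roughly 586-fold figure compares the 2015 project count with the 2009 count; the dictionary and function names are illustrative only.

```python
# Approximate GitHub project counts by year, as quoted in the text.
projects_by_year = {
    2009: 46_000,
    2010: 1_000_000,
    2013: 10_000_000,
    2015: 27_000_000,
}


def growth_ratio(counts, start_year, end_year):
    """Return how many times larger the end-year count is than the start-year count."""
    return counts[end_year] / counts[start_year]


# Assumption: the quoted ratio spans 2009 to 2015 (six years).
ratio = growth_ratio(projects_by_year, 2009, 2015)
years = 2015 - 2009
print(f"Growth from 2009 to 2015: about {int(ratio)} times over {years} years")
```

Under that assumption, 27,000,000 / 46,000 is just under 587, which is consistent with the figure of about 586 times over six years.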
The questions posted by software developers on SO are usually long, unstructured
natural language texts containing code snippets and weblinks, which makes it more
challenging for researchers/practitioners to automatically mine and analyze those SO posts.
Some of the frequently reported challenges in automatically mining/analyzing
such natural language texts are vagueness, impreciseness, grammatical mistakes, spelling
errors or typos, noisiness, synonyms, and unknown acronyms (Carreno and Winbladh, 2013;
Chen et al., 2014; Iacob and Harrison, 2013; Chen et al., 2015). We present below some
sample chunks of SO posts (Q&A and comments) to illustrate their characteristics and the
associated challenges in automatically mining, analyzing, and understanding them.
In the following Question No. 1, titled "pragma mark equivalent in Android Studio", the poster
asks on SO whether Android Studio offers a feature similar to one available in Xcode.
Question No. 1: XCode have a feature called "pragma mark" It's very util and I'm looking
for anything similar into Android Studio it can be native or a plugin[14].
Answer X: In Android Studio you can add regions using the steps [].
Comment: Cool, I wish it showed in the Structure view (CMD+7) in bold like it did in the
Xcode dropdown but there's always going to be development tool differences.
In Question No. 1, there is some vagueness present, as reported in Chen et al. (2015).
The word "util" is incomplete and imprecise; hence, automatically mining and analyzing
such vague words used in SO questions is quite challenging.
The following Question No. 2, titled "Add unimplemented methods", was posted on SO about
adding/requesting a feature in Android Studio.
Question No. 2: In the Eclipse IDE there is a great feature allows you to add (implement)
all of the required methods of the particular class. I'm looking for this feature in the Android
Studio IDE, but without success so far. Is there something similar? For me it is one of the
key-features and can't live without[15].
Answer X: Of course there is. It is called Implement methods or Override Methods.
The default shortcut is CTRL-I and CTRL-O. See descrption of Implementing Methods and
Overriding Methods.
Comment: Ok, but this is not what I'm asking for. I don't want to choose methods to
implemet. I want IDE to do it for me like Eclipse were doing. For example when I clicked
"Add unimplemented methods" inside any Activity extented class all of these onCreate()
onPause() onResume() were generated.
Comment: the answer below by pbespechnyi is the right one.
Comment: Yup ALT+ENTER should be the right answer not CTRL-O.
Answer Y: Alt+Enter - on class definition; Ctrl+I in class body to show list of
unimplemented methods. Ctrl+O in class body to show list of override methods.
In Question No. 2, many difficulties exist for processing such texts. For instance, there are
several grammatical errors, typos, and spelling errors ("descrption", "implemet", "extented"), as reported in
Chen et al. (2015). Another difficulty is that the text in the comments or answers refers to
weblinks or to other answers. For example, in one of the above comments, the user refers the poster to
"pbespechnyi", so it is quite challenging to mine, analyze, and understand such
posts, e.g., to determine whether "pbespechnyi" is a person's name or some other word with a different meaning.
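The spelling-error problem described above can be illustrated with a small sketch. It is not the technique used in the surveyed studies; it simply uses Python's standard-library `difflib` to map out-of-vocabulary tokens to their closest known word. The tiny vocabulary, the sample sentence, and the function name `suggest_corrections` are assumptions made for illustration.

```python
import difflib
import re

# Tiny hand-picked vocabulary (an assumption for this sketch); a real pipeline
# would need a full dictionary plus a list of software-specific terms.
VOCABULARY = [
    "see", "description", "of", "methods", "to",
    "implement", "in", "extended", "class",
]


def suggest_corrections(text, vocabulary, cutoff=0.8):
    """Map each out-of-vocabulary token to its closest vocabulary word, or None."""
    suggestions = {}
    for token in re.findall(r"[A-Za-z]+", text.lower()):
        if token in vocabulary:
            continue  # known word, nothing to correct
        matches = difflib.get_close_matches(token, vocabulary, n=1, cutoff=cutoff)
        suggestions[token] = matches[0] if matches else None
    return suggestions


# Sample sentence assembled from the misspelled words quoted in the text.
sample = "See descrption of methods to implemet in extented class pbespechnyi"
result = suggest_corrections(sample, VOCABULARY)
for token, suggestion in result.items():
    print(f"{token!r} -> {suggestion!r}")
```

In this sketch the three misspelled words resolve to nearby vocabulary entries, while "pbespechnyi" has no close match and is left unresolved, which mirrors the ambiguity noted above: string similarity alone cannot tell whether such a token is a user name or some other word.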
The following Question No. 3, titled "Android Studio is missing permissions", was posted by a
software developer about a missing feature in Android Studio.
Question No. 3: I recently updated Android Studio to 2.2.2. The IDE is missing
permissions from both Manifest class and AndroidManifest.xml (code suggestion). I knew