Data Technologies and Applications
- Emerald Group Publishing Limited
- The discrete Fourier transformation for seasonality and anomaly detection of an application to rare data
Purpose: The discrete Fourier transformation (DFT) has proven to be a successful method for determining whether a discrete time series is seasonal and, if so, for detecting the period. This paper deals exclusively with rare data, in which instances occur periodically at a low frequency. Design/methodology/approach: Data based on real-world situations is simulated for analysis. Findings: Cycle number detection is done with spectral analysis, period detection is completed using DFT coefficients, and signal shifts in the time domain are found using the convolution theorem. Additionally, a new method for detecting anomalies in binary, rare data is presented: the sum of distances. Using this method, expected events which have not occurred and unexpected events which have occurred at various sampling frequencies can be detected. Anomalies which are not considered outliers can also be found. Research limitations/implications: Aliasing can contribute extra frequencies which point to extra periods in the time domain. This can be reduced or removed with techniques such as windowing, which will be explored in future work. Practical implications: Applications include determining seasonality and thus investigating the underlying causes of hard drive failure, power outages and other undesired events. This work will also lend itself well to finding patterns among missing desired events, such as a scheduled hard drive backup or an employee's regular login to a server. Originality/value: This paper has shown how seasonality and anomalies are successfully detected in seasonal, discrete, rare and binary data. Previously, the DFT has only been used for non-rare data.
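To make the period-detection idea concrete, here is a minimal sketch of reading a period off the DFT magnitude spectrum of a rare binary series. It is not the paper's procedure: the function names, the simulated series and the rule of taking the lowest-frequency bin among the strongest peaks (harmonics of an impulse train have equal magnitude) are all assumptions made for illustration, and the sum-of-distances anomaly method is not shown.

```python
import cmath

def dft_magnitudes(signal):
    """Return |X[k]| for k = 0..N-1 using the direct O(N^2) DFT definition."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(signal)))
        for k in range(n)
    ]

def dominant_period(signal, tolerance=0.99):
    """Estimate the period as N / k*, with k* the lowest strong non-zero frequency bin.

    Assumption for this sketch: a rare periodic binary series behaves like an
    impulse train, whose harmonics are as strong as the fundamental, so we take
    the smallest k whose magnitude is within `tolerance` of the largest peak.
    """
    n = len(signal)
    mags = dft_magnitudes(signal)
    half = range(1, n // 2 + 1)  # skip k = 0 (the mean); upper half mirrors the lower
    peak = max(mags[k] for k in half)
    k_star = min(k for k in half if mags[k] >= tolerance * peak)
    return n / k_star

# Hypothetical rare binary series: one event every 7 samples, 12 full cycles.
series = [1 if t % 7 == 0 else 0 for t in range(84)]
period = dominant_period(series)
print(period)
```

The series length here is an exact multiple of the period, so the spectrum has clean peaks; with a partial final cycle, leakage would blur them.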
- Age-specific survival in prostate cancer using machine learning
Purpose: The incidence of prostate cancer has been increasing over the past few decades. Various studies have tried to determine the survival of patients, but metastatic prostate cancer is still not extensively explored. The survival rate of metastatic prostate cancer is much lower than that of the earlier stages. The study aims to investigate the survivability of metastatic prostate cancer based on the age group to which a patient belongs, and the difference in the significance of the attributes across age groups. Design/methodology/approach: Data on metastatic prostate cancer patients was collected from a cancer hospital in India. Two predictive models were built for the analysis: one for the complete dataset, and the other for separate age groups. Machine learning was applied to both models and their accuracies were compared. Information gain was also evaluated for each model to determine the significant predictors for each age group. Findings: The ensemble approach gave the best result, 81.4%, for the complete dataset, and was therefore used for the age-specific models. The results showed that the age-specific model had a direct average accuracy of 83.74% and a weighted average accuracy of 79.9%, with the highest accuracy for patients younger than 60. Originality/value: The study developed a model that predicts the survival of metastatic prostate cancer based on age. The study will be able to assist clinicians in determining the best course of treatment for each patient based on ECOG, age and comorbidities.
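Information gain, which the abstract uses to rank predictors, can be sketched as follows. This is the standard entropy-based definition for a categorical attribute; the attribute names and the four toy records are hypothetical and are not taken from the study's data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy records (not the study's data).
survived = [1, 1, 0, 0]
informative_attr = ["low", "low", "high", "high"]   # perfectly separates the classes
uninformative_attr = ["a", "b", "a", "b"]           # tells us nothing about the class

gain_informative = information_gain(informative_attr, survived)
gain_uninformative = information_gain(uninformative_attr, survived)
print(gain_informative, gain_uninformative)
```

Computing this per age group, as the abstract describes, would amount to calling `information_gain` on each group's subset of records.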
- Identifying financial statement fraud with decision rules obtained from Modified Random Forest
Purpose: Financial statement fraud (FSF) committed by a company implies that its current status may not be healthy. It is therefore important to detect FSF, since such companies tend to conceal bad information, which causes great losses to various stakeholders. The objective of the paper is to propose a novel approach to building a classification model to identify FSF that shows high classification performance and from which human-readable rules can be extracted to explain why a company is likely to commit FSF. Design/methodology/approach: Having prepared multiple sub-datasets to cope with the class imbalance problem, we build a set of decision trees for each sub-dataset; select a subset of that set as the model for the sub-dataset by removing every tree whose accuracy is below the average accuracy of all trees in the set; and then select the model with the best accuracy among these models. We call the resulting model MRF (Modified Random Forest). Given a new instance, we extract rules from the MRF model to explain whether the company corresponding to the new instance is likely to commit FSF or not. Findings: Experimental results show that the MRF classifier outperformed the benchmark models. The results also revealed that all the variables related to profit belong to the set of the most important indicators of FSF, and two new variables related to gross profit, which had not been examined in previous studies on FSF, were identified. Originality/value: This study proposed a method of building a classification model that shows outstanding performance and provides decision rules that can be used to explain the classification results. In addition, a new way to resolve the class imbalance problem was suggested in this paper.
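The selection steps described in this abstract (drop below-average trees per sub-dataset, then keep the best resulting model) can be sketched as below. The tree training itself is omitted and per-tree accuracies are taken as given. Scoring a pruned model by the mean accuracy of its retained trees is an assumption of this sketch, since the abstract does not say how model accuracy is measured, and all names and numbers are hypothetical.

```python
def prune_below_average(tree_accuracies):
    """Keep only the trees whose accuracy is at least the average of the set."""
    average = sum(tree_accuracies.values()) / len(tree_accuracies)
    return {name: acc for name, acc in tree_accuracies.items() if acc >= average}

def select_best_model(forests):
    """Prune each sub-dataset's forest and return the best (name, retained trees).

    Assumption: a pruned model is scored by the mean accuracy of its retained
    trees; a real pipeline would evaluate the ensemble on held-out data.
    """
    best_name, best_trees, best_score = None, None, float("-inf")
    for name, tree_accuracies in forests.items():
        kept = prune_below_average(tree_accuracies)
        score = sum(kept.values()) / len(kept)
        if score > best_score:
            best_name, best_trees, best_score = name, kept, score
    return best_name, best_trees

# Hypothetical per-tree validation accuracies for two balanced sub-datasets.
forests = {
    "sub_dataset_1": {"tree_a": 0.60, "tree_b": 0.80, "tree_c": 0.90},
    "sub_dataset_2": {"tree_d": 0.50, "tree_e": 0.55, "tree_f": 0.95},
}
chosen_name, chosen_trees = select_best_model(forests)
print(chosen_name, sorted(chosen_trees))
```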
- Retrieval performance of Google, Yahoo and Bing for navigational queries in the field of “life science and biomedicine”
Purpose: The purpose of this study is to assess the retrieval performance of three search engines, i.e. Google, Yahoo and Bing, for navigational queries using two important retrieval measures, i.e. precision and relative recall, in the field of life science and biomedicine. Design/methodology/approach: The top three search engines, namely Google, Yahoo and Bing, were selected on the basis of their ranking by Alexa, an analytical tool that provides rankings of global websites. The scope of the study was confined to search engines with an English-language interface. Clarivate Analytics' Web of Science was used to extract navigational queries in the field of life science and biomedicine. Navigational queries (classified as one-word, two-word and three-word queries) were extracted from the keywords of the papers representing the top 100 contributing authors in the selected field. Keywords were also checked for duplication. Two important evaluation parameters, i.e. precision and relative recall, were used to calculate the performance of the search engines on the navigational queries. Findings: Mean precision is highest for Google (2.30), followed by Yahoo (2.29) and Bing (1.68), while mean relative recall is also highest for Google (0.36), followed by Yahoo (0.33) and Bing (0.31). Research limitations/implications: The study helps researchers and academia determine the retrieval efficiency of Google, Yahoo and Bing in terms of navigational query execution in the field of life science and biomedicine. It can help users focus on the search process and on how queries are structured and executed across the selected search engines to achieve the desired result list in a professional search environment. The study can also act as a ready reference source for exploring navigational queries and how these queries can be managed in the context of the information retrieval process. It will also help to showcase the retrieval efficiency of various search engines on the basis of subject diversity (life science and biomedicine), highlighting the same in terms of query intention. Originality/value: Though many studies have examined the retrieval efficiency of search engines, the current work is the first of its kind to study the retrieval effectiveness of Google, Yahoo and Bing on navigational queries in the field of life science and biomedicine. The study will help in understanding the methods and approaches users can adopt for navigational query execution in a professional search environment, i.e. “life science and biomedicine”.
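As an illustration of the relative recall measure, the sketch below uses the definition common in search engine evaluation studies: each engine's share of the pooled results across all engines compared. That the study uses exactly this definition is an assumption, although the reported values (0.36, 0.33, 0.31) do sum to 1, which is consistent with it. The engine names and counts are hypothetical.

```python
def relative_recall(retrieved_counts):
    """Share of the pooled results contributed by each engine.

    Assumed definition: results retrieved by one engine divided by the total
    retrieved by all engines under comparison.
    """
    total = sum(retrieved_counts.values())
    return {engine: count / total for engine, count in retrieved_counts.items()}

# Hypothetical counts of retrieved results for one query.
counts = {"engine_a": 50, "engine_b": 30, "engine_c": 20}
recall = relative_recall(counts)
print(recall)
```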
- Computational implementation and formalism of FAIR data stewardship principles
Purpose: The progress of life science and social science research is contingent on effective modes of data storage, data sharing and data reproducibility. In the present digital era, data storage and data sharing play a vital role. For productive data-centric tasks, the findable, accessible, interoperable and reusable (FAIR) principles have been developed as a standard convention. However, the FAIR principles pose specific challenges from a computational implementation perspective. The purpose of this paper is to identify the challenges related to computational implementations of the FAIR principles and then to solve them. Design/methodology/approach: This paper deploys a Petri net-based formal model and Petri net algebra to implement and analyze the FAIR principles. The proposed Petri net-based model, theorems and corollaries may assist computer system architects in implementing and analyzing FAIR principles. Findings: To demonstrate the use of the derived Petri net-based theorems and corollaries, two existing data stewardship platforms, FAIRDOM and Dataverse, have been analyzed in this paper. Moreover, a data stewardship model, “Datalection”, has been developed and discussed in the present paper. Datalection has been designed based on the Petri net-based theorems and corollaries. Originality/value: This paper aims to bridge information science and life science using the formalism of data stewardship principles. This paper not only provides new dimensions to data stewardship but also systematically analyzes two existing data stewardship platforms, FAIRDOM and Dataverse.
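For readers unfamiliar with the formalism, a Petri net can be sketched as places holding tokens and transitions that consume and produce them. The example below shows only the basic enabling and firing rule. It is not the paper's model, and the place and transition names are hypothetical labels chosen to suggest a data stewardship step.

```python
def is_enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in transition["inputs"].items())

def fire(marking, transition):
    """Return the new marking after firing; raise if the transition is not enabled."""
    if not is_enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["inputs"].items():
        new_marking[place] = new_marking.get(place, 0) - n   # consume tokens
    for place, n in transition["outputs"].items():
        new_marking[place] = new_marking.get(place, 0) + n   # produce tokens
    return new_marking

# Hypothetical net: assigning an identifier moves a dataset from "submitted" to "findable".
marking = {"dataset_submitted": 1, "dataset_findable": 0}
assign_identifier = {
    "inputs": {"dataset_submitted": 1},
    "outputs": {"dataset_findable": 1},
}
after = fire(marking, assign_identifier)
print(after)
```

After one firing the input place is empty, so the same transition is no longer enabled. Reachability arguments of this kind are what make the formalism useful for checking process properties.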
- Knowledge and data mining for recent and advanced applications using emerging technologies
- Scholarly publication venue recommender systems. A systematic literature review
Purpose: The purpose of this investigation is to identify, evaluate, integrate and summarize relevant and qualified papers by conducting a systematic literature review (SLR) on the application of recommender systems (RSs) to suggest a scholarly publication venue for a researcher's paper. Design/methodology/approach: To identify the relevant papers published up to August 11, 2018, an SLR study on four databases (Scopus, Web of Science, IEEE Xplore and ScienceDirect) was conducted. We followed the guidelines presented by Kitchenham and Charters (2007) for performing SLRs in software engineering. The papers were analyzed based on data sources, RS classes, techniques/methods/algorithms, datasets, evaluation methodologies and metrics, as well as future directions. Findings: A total of 32 papers were identified. The data sources most often exploited in these papers were textual (title/abstract/keywords) and co-authorship data. The RS classes were used almost equally across the selected papers. DBLP was the main dataset utilized. Cosine similarity, social network analysis (SNA) and the term frequency–inverse document frequency (TF–IDF) algorithm were frequently used. In terms of evaluation methodologies, 24 papers applied only offline evaluations. Precision, accuracy and recall were the most popular performance metrics. In the reviewed papers, “use more datasets” and “new algorithms” were frequently mentioned in the future work sections and conclusions. Originality/value: Given that no review study has been conducted in this area, this paper can provide insight into its current status and may also contribute to future research in this field.
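Two of the techniques the review found most frequently, TF–IDF and cosine similarity, combine naturally into a content-based venue recommender. The sketch below is a toy illustration only: the venue names and texts are hypothetical, tokenization is a plain whitespace split, and the weighting uses one common TF–IDF variant (relative term frequency times log(N/df)), none of which is taken from the reviewed papers.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict term -> weight) per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical venue profiles (e.g. concatenated titles) and a new paper's title.
venues = {
    "venue_ir": "search engine retrieval precision recall",
    "venue_ml": "neural network training optimization",
}
paper = "retrieval precision of search engines"

vectors = tfidf_vectors(list(venues.values()) + [paper])
paper_vector = vectors[-1]
scores = {name: cosine(vec, paper_vector) for name, vec in zip(venues, vectors)}
best_venue = max(scores, key=scores.get)
print(best_venue, scores)
```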
- Predicting corporate credit rating based on qualitative information of MD&A transformed using document vectorization techniques
Purpose: The purpose of this study is to investigate the effectiveness of qualitative information extracted from a firm’s annual report in predicting its corporate credit rating. In practice, qualitative information, such as published reports or management interviews, is known to be an important source for assigning corporate credit ratings, alongside quantitative information such as financial values. Nevertheless, prior studies leave room for further research in that they rarely employed qualitative information in developing prediction models for corporate credit rating. Design/methodology/approach: This study adopted three document vectorization methods, Bag-Of-Words (BOW), Word to Vector (Word2Vec) and Document to Vector (Doc2Vec), to transform unstructured textual data into numeric vectors that Machine Learning (ML) algorithms can accept as input. For the experiments, we used the corpus of the Management’s Discussion and Analysis (MD&A) section in 10-K financial reports as well as financial variables and corporate credit rating data. Findings: Experimental results from a series of multi-class classification experiments show that the predictive models trained on both financial variables and vectors extracted from MD&A data outperform the benchmark models trained only on traditional financial variables. Originality/value: This study proposed a new approach for corporate credit rating prediction by using qualitative information extracted from MD&A documents as an input to ML-based prediction models. This research also adopted and compared three textual vectorization methods in the domain of corporate credit rating prediction and showed that BOW mostly outperformed Word2Vec and Doc2Vec.
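The simplest of the three vectorization methods, Bag-Of-Words, can be sketched in a few lines: build a vocabulary over the corpus, then represent each document by its term counts in a fixed order. The two short texts below are hypothetical stand-ins for MD&A passages, and real pipelines would add tokenization, stop-word handling and vocabulary pruning that this sketch omits.

```python
from collections import Counter

def build_vocabulary(docs):
    """Sorted list of every distinct lower-cased token in the corpus."""
    return sorted({token for doc in docs for token in doc.lower().split()})

def bow_vector(doc, vocabulary):
    """Count vector of the document over the fixed vocabulary order."""
    counts = Counter(doc.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical snippets standing in for MD&A text.
docs = ["revenue increased revenue", "liquidity risk increased"]
vocabulary = build_vocabulary(docs)
vectors = [bow_vector(doc, vocabulary) for doc in docs]
print(vocabulary)
print(vectors)
```

Vectors of this kind can be concatenated with financial variables before being passed to a classifier, which is the combination the abstract reports as outperforming financial variables alone.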
- Patch antenna design optimization using opposition based grey wolf optimizer and map-reduce framework
Purpose: The microstrip patch antenna is widely used for communication purposes, particularly in military and civilian applications. Although existing techniques have achieved a great deal in several fields, some systems still need improvement to meet certain challenges, and application-specific refinement is required to design microstrip patch antennas optimally. The paper aims to discuss these issues. Design/methodology/approach: This paper adopts an advanced meta-heuristic search algorithm called grey wolf optimization (GWO), which is inspired by the hunting behaviour of grey wolves, for the design of patch antenna parameters. The search for the optimal antenna design is accelerated using opposition-based solution search. Moreover, the proposed model derives a nonlinear objective model to aid the design of the solution space of antenna parameters. After executing the simulation model, this paper compares the performance of the proposed GWO-based microstrip patch antenna with several conventional models. Findings: The gain of the proposed model is 27.05 per cent better than the WOAD model, 2.07 per cent better than AAD, 15.80 per cent better than GAD, 17.49 per cent better than PSAD and 3.77 per cent better than GWAD. Thus, the proposed antenna model attains high gain, leading to superior performance. Originality/value: This paper presents a technique for designing the microstrip patch antenna using the proposed GWO algorithm. This is the first work that utilizes GWO-based optimization for the microstrip patch antenna.
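The opposition-based search mentioned in this abstract usually rests on one rule: for a candidate x inside bounds [lower, upper], also evaluate its opposite, lower + upper - x, and keep whichever is better. The sketch below shows only that rule under an assumed minimization setting. The objective function and bounds are hypothetical and have nothing to do with real antenna parameters, and the grey wolf update equations are not shown.

```python
def opposite(position, lower, upper):
    """Opposite point of `position` within per-dimension bounds: lo + hi - x."""
    return [lo + hi - x for x, lo, hi in zip(position, lower, upper)]

def keep_better(position, lower, upper, objective):
    """Return whichever of the candidate and its opposite has the lower objective."""
    return min(position, opposite(position, lower, upper), key=objective)

def toy_objective(p):
    """Hypothetical cost to minimize: squared distance to the point (8, 1)."""
    return (p[0] - 8.0) ** 2 + (p[1] - 1.0) ** 2

lower, upper = [0.0, 0.0], [10.0, 10.0]
candidate = [2.0, 9.0]
mirrored = opposite(candidate, lower, upper)
chosen = keep_better(candidate, lower, upper, toy_objective)
print(mirrored, chosen)
```

Evaluating the opposite costs one extra objective call per candidate, which is why it is typically applied at initialization or at selected iterations rather than on every step.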
- GWLM–NARX. Grey Wolf Levenberg–Marquardt-based neural network for rainfall prediction
Purpose: Weather forecasting is a trending topic around the world, as it is the way to predict the threats posed by extreme rainfall conditions that damage human life and property. These issues can be managed only when severe weather is predicted in advance and sufficient warnings can be issued in time. Given the importance of rainfall prediction systems, the purpose of this paper is to propose an effective rainfall prediction model using the nonlinear auto-regressive with external input (NARX) model. Design/methodology/approach: The paper proposes a rainfall prediction model based on time-series prediction with the NARX model. Time-series prediction enables the rainfall in a particular area or locality to be predicted from the rainfall data of the previous term, month or year. The proposed NARX model serves as an adaptive prediction model, for which the rainfall data of the previous period is the input, and the optimal computation is based on the proposed algorithm. The adaptive prediction using the proposed algorithm is applied in the NARX model, and the proposed algorithm is developed from Grey Wolf Optimization and the Levenberg–Marquardt (LM) algorithm. The proposed algorithm inherits the advantages of both algorithms, with better computational time and accuracy. Findings: The analysis using two databases enables a better understanding of the proposed rainfall prediction method and demonstrates its effectiveness. The accuracy of the proposed method is better than that of the other existing methods, and its mean square error and percentage root mean square difference are around 0.0093 and 0.207, respectively. Originality/value: Rainfall prediction is enabled adaptively using the proposed Grey Wolf Levenberg–Marquardt (GWLM)-based NARX, wherein an algorithm named GWLM is proposed by integrating the Grey Wolf Optimizer and the LM algorithm.
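The two error measures reported in the findings can be sketched as follows. Mean square error is standard. For the percentage root mean square difference (PRD), the sketch assumes the commonly used form 100 · sqrt(Σ(y − ŷ)² / Σy²), which may differ from the exact normalization used in the paper. The sample values are hypothetical and are not rainfall data from the study.

```python
import math

def mean_square_error(actual, predicted):
    """Average of the squared prediction errors."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def percentage_rms_difference(actual, predicted):
    """PRD in the assumed form 100 * sqrt(sum((a - p)^2) / sum(a^2))."""
    error_energy = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    signal_energy = sum(a ** 2 for a in actual)
    return 100.0 * math.sqrt(error_energy / signal_energy)

# Hypothetical observed and predicted values.
actual = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
mse = mean_square_error(actual, predicted)
prd = percentage_rms_difference(actual, predicted)
print(mse, prd)
```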
- Applying big data analytics to support Kansei engineering for hotel service development
Purpose: Leisure and tourism activities have proliferated and become important parts of modern life, and the hotel industry plays a necessary role in the supply for and demand from consumers. The purpose of this paper is to develop guidelines for hotel service development by applying a service...
- FactQA: question answering over domain knowledge graph based on two-level query expansion
Purpose: With the advent of the era of Big Data, knowledge graphs (KGs) in various domains are growing rapidly and hold a huge amount of knowledge that can benefit question answering (QA) research. However, a KG, which always consists of entities and relations, is...
- M-banking barriers in Pakistan: a customer perspective of adoption and continuity intention
Purpose: The purpose of this paper is to determine barriers jeopardizing the adoption and usage intention of mobile banking (M-banking) in Pakistan and provide deeper insights to fix such deteriorating factors. Design/methodology/approach: Data was collected in countrywide regional headquarters to ...
- Automatic meeting summarization and topic detection system
Purpose: Producing meeting documents requires an instantaneous recorder during meetings, which costs extra human resources and takes time to amend the file. However, a high-quality meeting document can enable users to recall the meeting content efficiently. The paper aims to discuss these issues. D...
- Feature intersection for agent-based customer churn prediction
Purpose: Telecommunication has a decisive role in the development of technology in the current era. The number of mobile users with multiple SIM cards is increasing every second. Hence, telecommunication is a significant area in which big data technologies are needed. Competition among the...
- Multidimensional appropriate clustering and DBSCAN for SAT solving
Purpose: This paper is an extended version of Hireche and Drias (2018) presented at the WORLD-CIST’18 conference. The major contribution, in this work, is defined in two phases. First of all, the use of data mining technologies and especially the tools of data preprocessing for instances of hard...
- Selection methods and diversity preservation in many-objective evolutionary algorithms
Purpose: One of the main components of multi-objective, and therefore, many-objective evolutionary algorithms, is the selection mechanism. It is responsible for performing two main tasks simultaneously. First, it has to promote convergence by selecting solutions which are as close as possible to...
- CoboChild: a blended mobile game-based learning service for children in museum contexts
Purpose: The purpose of this paper is to develop a blended mobile game-based learning service called CoboChild Mobile Exploration Service (hereinafter CoboChild) to support children’s learning in an environment blending virtual game worlds and a museum’s physical space. The contextual model of...
- Fuzzy-based MTD. A fuzzy decisive approach for moving target detection in multichannel SAR framework
Purpose: Synthetic aperture radar exploits the signals received at the antenna to detect moving targets and estimate the motion parameters of the moving objects. A limitation of the existing methods is the poor power density, such that those received signals are essentially to...