Data Technologies and Applications
- Emerald Group Publishing Limited
- The discrete Fourier transformation for seasonality and anomaly detection of an application to rare data
Purpose: The discrete Fourier transformation (DFT) has been proven to be a successful method for determining whether a discrete time series is seasonal and, if so, for detecting the period. This paper deals exclusively with rare data, in which instances occur periodically at a low frequency. Design/methodology/approach: Data based on real-world situations is simulated for analysis. Findings: Cycle number detection is done with spectral analysis, period detection is completed using DFT coefficients and signal shifts in the time domain are found using the convolution theorem. Additionally, a new method for detecting anomalies in binary, rare data is presented: the sum of distances. Using this method, expected events which have not occurred and unexpected events which have occurred at various sampling frequencies can be detected. Anomalies which are not considered outliers can also be found. Research limitations/implications: Aliasing can contribute extra frequencies which point to extra periods in the time domain. This can be reduced or removed with techniques such as windowing, which will be explored in future work. Practical implications: Applications include determining seasonality and thus investigating the underlying causes of hard drive failure, power outages and other undesired events. This work will also lend itself well to finding patterns among missing desired events, such as a scheduled hard drive backup or an employee's regular login to a server. Originality/value: This paper has shown how seasonality and anomalies are successfully detected in seasonal, discrete, rare and binary data. Previously, the DFT has only been used for non-rare data.
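As an illustration of the cycle-number and period detection described in this abstract, the following is a minimal sketch and not the authors' implementation. It assumes a clean, simulated binary series with one event every 12 samples; the function names, the tolerance `tol` and the simulated series are all hypothetical. Because an impulse train produces equally strong harmonics, the sketch takes the lowest-frequency bin that reaches (nearly) the peak magnitude as the cycle number. The paper's sum-of-distances anomaly method is not specified in the abstract and is not attempted here.

```python
import cmath

def dft_magnitudes(x):
    """Naive O(n^2) DFT; returns |X[k]| for k = 0 .. n // 2."""
    n = len(x)
    return [
        abs(sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)))
        for k in range(n // 2 + 1)
    ]

def detect_cycles_and_period(x, tol=0.99):
    """Return (number of cycles in the window, period in samples).

    tol is an assumed tolerance: the lowest bin whose magnitude reaches
    tol * peak is treated as the fundamental, so harmonics are ignored.
    """
    n = len(x)
    mean = sum(x) / n
    # Remove the mean so the zero-frequency bin does not dominate.
    mags = dft_magnitudes([v - mean for v in x])
    peak = max(mags[1:])
    cycles = next(k for k in range(1, len(mags)) if mags[k] >= tol * peak)
    return cycles, n / cycles

# Hypothetical rare binary series: one event every 12 samples, 120 samples.
series = [1 if t % 12 == 0 else 0 for t in range(120)]
cycles, period = detect_cycles_and_period(series)
```

For this simulated series the sketch finds 10 cycles in the window, which corresponds to a period of 12 samples. Real data with missing or extra events would smear the spectrum, so the tolerance would need tuning.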
- Age-specific survival in prostate cancer using machine learning
Purpose: The incidence of prostate cancer has been increasing over the past few decades. Various studies have tried to determine the survival of patients, but metastatic prostate cancer is still not extensively explored. The survival rate of metastatic prostate cancer is much lower than that of the earlier stages. The study aims to investigate the survivability of metastatic prostate cancer based on the age group to which a patient belongs, and the difference between the significance of the attributes for different age groups. Design/methodology/approach: Data of metastatic prostate cancer patients was collected from a cancer hospital in India. Two predictive models were built for the analysis: one for the complete dataset, and the other for separate age groups. Machine learning was applied to both the models and their accuracies were compared for the analysis. Also, information gain for each model has been evaluated to determine the significant predictors for each age group. Findings: The ensemble approach gave the best result (81.4%) for the complete dataset, and thus was used for the age-specific models. The results showed that the age-specific model had a direct average accuracy of 83.74% and a weighted average accuracy of 79.9%, with the highest accuracy levels for age less than 60. Originality/value: The study developed a model that predicts the survival of metastatic prostate cancer based on age. The study will be able to assist clinicians in determining the best course of treatment for each patient based on ECOG, age and comorbidities.
- Identifying financial statement fraud with decision rules obtained from Modified Random Forest
Purpose: Financial statement fraud (FSF) committed by companies implies that the current status of the companies may not be healthy. As such, it is important to detect FSF, since such companies tend to conceal bad information, which causes great loss to various stakeholders. Thus, the objective of the paper is to propose a novel approach to building a classification model to identify FSF, which shows high classification performance and from which human-readable rules are extracted to explain why a company is likely to commit FSF. Design/methodology/approach: Having prepared multiple sub-datasets to cope with the class imbalance problem, we build a set of decision trees for each sub-dataset; select a subset of the set as a model for the sub-dataset by removing every tree whose accuracy is below the average accuracy of all trees in the set; and then select the one model which shows the best accuracy among the models. We call the resulting model MRF (Modified Random Forest). Given a new instance, we extract rules from the MRF model to explain whether the company corresponding to the new instance is likely to commit FSF or not. Findings: Experimental results show that the MRF classifier outperformed the benchmark models. The results also revealed that all the variables related to profit belong to the set of the most important indicators of FSF and that two new variables related to gross profit, which were not considered in previous studies on FSF, were identified. Originality/value: This study proposed a method of building a classification model which shows outstanding performance and provides decision rules that can be used to explain the classification results. In addition, a new way to resolve the class imbalance problem was suggested in this paper.
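The tree-selection step described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' MRF code: each tree is represented only by its 0/1 predictions on a shared validation set, the pruned forest is scored by majority vote, and the function names and toy data are hypothetical. Rule extraction is not covered.

```python
def accuracy(pred, truth):
    """Fraction of predictions that match the true labels."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def majority_vote(tree_preds):
    """Combine 0/1 predictions of several trees; ties resolve to class 0."""
    n_trees = len(tree_preds)
    n_items = len(tree_preds[0])
    return [
        1 if 2 * sum(p[i] for p in tree_preds) > n_trees else 0
        for i in range(n_items)
    ]

def prune_forest(tree_preds, truth):
    """Drop every tree whose accuracy is below the mean accuracy of the set."""
    accs = [accuracy(p, truth) for p in tree_preds]
    mean_acc = sum(accs) / len(accs)
    return [p for p, a in zip(tree_preds, accs) if a >= mean_acc]

def select_model(forests, truth):
    """Prune each sub-dataset's forest, then keep the most accurate one."""
    best = None
    for name, tree_preds in forests.items():
        kept = prune_forest(tree_preds, truth)
        acc = accuracy(majority_vote(kept), truth)
        if best is None or acc > best[1]:
            best = (name, acc, len(kept))
    return best

# Hypothetical validation labels and per-tree predictions (1 = fraud).
truth = [1, 0, 1, 0, 1, 0]
forests = {
    "sub1": [[1, 0, 1, 0, 1, 0], [1, 0, 1, 0, 1, 1], [0, 1, 0, 1, 1, 0]],
    "sub2": [[1, 0, 1, 1, 1, 0], [1, 1, 1, 0, 0, 0], [0, 1, 0, 0, 0, 0]],
}
best_name, best_acc, n_kept = select_model(forests, truth)
```

In this toy data the weakest tree of each sub-dataset falls below the mean and is removed, and the pruned "sub1" forest is selected. How the paper scores a pruned forest is not stated in the abstract; majority voting is an assumption made here.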
- Retrieval performance of Google, Yahoo and Bing for navigational queries in the field of “life science and biomedicine”
Purpose: The purpose of this study is to assess the retrieval performance of three search engines, i.e. Google, Yahoo and Bing, for navigational queries using two important retrieval measures, i.e. precision and relative recall, in the field of life science and biomedicine. Design/methodology/approach: The top three search engines, namely Google, Yahoo and Bing, were selected on the basis of their ranking as per Alexa, an analytical tool that provides ranking of global websites. Furthermore, the scope of the study was confined to those search engines having an interface in English. Clarivate Analytics' Web of Science was used for the extraction of navigational queries in the field of life science and biomedicine. Navigational queries (classified as one-word, two-word and three-word queries) were extracted from the keywords of the papers representing the top 100 contributing authors in the select field. Keywords were also checked for duplication. Two important evaluation parameters, i.e. precision and relative recall, were used to calculate the performance of search engines on the navigational queries. Findings: Google has the highest mean precision (2.30), followed by Yahoo (2.29) and Bing (1.68); Google also has the highest mean relative recall (0.36), followed by Yahoo (0.33) and Bing (0.31). Research limitations/implications: The study is of great help to researchers and academia in determining the retrieval efficiency of Google, Yahoo and Bing in terms of navigational query execution in the field of life science and biomedicine. The study can help users to focus on various search processes and on query structuring and execution across the select search engines for achieving the desired result list in a professional search environment. The study can also act as a ready reference source for exploring navigational queries and how these queries can be managed in the context of the information retrieval process. It will also help to showcase the retrieval efficiency of various search engines on the basis of subject diversity (life science and biomedicine), highlighting the same in terms of query intention. Originality/value: Though many studies have been conducted highlighting the retrieval efficiency of search engines, the current work is the first of its kind to study the retrieval effectiveness of Google, Yahoo and Bing on navigational queries in the field of life science and biomedicine. The study will help in understanding various methods and approaches to be adopted by users for navigational query execution across a professional search environment, i.e. “life science and biomedicine”.
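To make the two measures named in this abstract concrete, here is a minimal sketch using textbook definitions, which are an assumption: precision as the share of retrieved results judged relevant, and relative recall as an engine's relevant results divided by the pooled relevant results of all engines. The reported mean precision values exceed 1, so the study evidently uses a graded scoring scheme that the abstract does not specify; the sketch does not reproduce it. Engine names, document identifiers and relevance judgments below are hypothetical.

```python
def precision(retrieved, relevant):
    """Share of retrieved documents that are relevant (0.0 if none retrieved)."""
    if not retrieved:
        return 0.0
    return sum(1 for doc in retrieved if doc in relevant) / len(retrieved)

def relative_recall(results_by_engine, relevant):
    """Relevant hits per engine divided by relevant hits pooled over all engines."""
    pool = set()
    for docs in results_by_engine.values():
        pool |= set(docs) & relevant
    return {
        engine: (len(set(docs) & relevant) / len(pool) if pool else 0.0)
        for engine, docs in results_by_engine.items()
    }

# Hypothetical result lists for one query and hypothetical relevance judgments.
relevant = {"d1", "d2", "d3", "d4"}
results = {
    "engine_a": ["d1", "d2", "x1", "d3"],
    "engine_b": ["d1", "x2"],
    "engine_c": ["d4", "x3", "x4", "x5"],
}
precisions = {engine: precision(docs, relevant) for engine, docs in results.items()}
recalls = relative_recall(results, relevant)
```

Averaging these per-query values over all queries would give mean scores comparable in kind, though not in scale, to those reported above.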
- Computational implementation and formalism of FAIR data stewardship principles
Purpose: The progress of life science and social science research is contingent on effective modes of data storage, data sharing and data reproducibility. In the present digital era, data storage and data sharing play a vital role. For productive data-centric tasks, findable, accessible, interoperable and reusable (FAIR) principles have been developed as a standard convention. However, FAIR principles pose specific challenges from a computational implementation perspective. The purpose of this paper is to identify the challenges related to computational implementations of FAIR principles and then to solve the identified challenges. Design/methodology/approach: This paper deploys a Petri net-based formal model and Petri net algebra to implement and analyze FAIR principles. The proposed Petri net-based model, theorems and corollaries may assist computer system architects in implementing and analyzing FAIR principles. Findings: To demonstrate the use of the derived Petri net-based theorems and corollaries, existing data stewardship platforms – FAIRDOM and Dataverse – have been analyzed in this paper. Moreover, a data stewardship model, “Datalection”, has been developed and discussed in the present paper. Datalection has been designed based on the Petri net-based theorems and corollaries. Originality/value: This paper aims to bridge information science and life science using the formalism of data stewardship principles. This paper not only provides new dimensions to data stewardship but also systematically analyzes two existing data stewardship platforms, FAIRDOM and Dataverse.
- Knowledge and data mining for recent and advanced applications using emerging technologies
- Scholarly publication venue recommender systems. A systematic literature review
Purpose: The purpose of this investigation is to identify, evaluate, integrate and summarize relevant and qualified papers through conducting a systematic literature review (SLR) on the application of recommender systems (RSs) to suggest a scholarly publication venue for a researcher's paper. Design/methodology/approach: To identify the relevant papers published up to August 11, 2018, an SLR study on four databases (Scopus, Web of Science, IEEE Xplore and ScienceDirect) was conducted. We followed the guidelines presented by Kitchenham and Charters (2007) for performing SLRs in software engineering. The papers were analyzed based on data sources, RS classes, techniques/methods/algorithms, datasets, evaluation methodologies and metrics, as well as future directions. Findings: A total of 32 papers were identified. The data sources most often exploited in these papers were textual (title/abstract/keywords) and co-authorship data. The RS classes in the selected papers were almost equally used. DBLP was the main dataset utilized. Cosine similarity, social network analysis (SNA) and the term frequency–inverse document frequency (TF–IDF) algorithm were frequently used. In terms of evaluation methodologies, 24 papers applied only offline evaluations. Furthermore, precision, accuracy and recall were the popular performance metrics. In the reviewed papers, “use more datasets” and “new algorithms” were frequently mentioned in the future work sections as well as the conclusions. Originality/value: Given that a review study has not been conducted in this area, this paper can provide an insight into the current status of this area and may also contribute to future research in this field.
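Two of the techniques this review names as frequently used, TF–IDF weighting and cosine similarity, can be combined into a simple content-based venue ranking. The sketch below is only an illustration of that combination and does not reproduce any of the reviewed systems. The whitespace tokenizer, the unsmoothed `log(n / df)` weighting, the function names and the venue profiles are all assumptions made for the example.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity of two sparse vectors (0.0 if either is all zeros)."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank_venues(paper_text, venue_profiles):
    """Rank venues by similarity between the paper text and each venue profile."""
    names = list(venue_profiles)
    vectors = tfidf_vectors([venue_profiles[name] for name in names] + [paper_text])
    query = vectors[-1]
    scores = [(name, cosine(query, vec)) for name, vec in zip(names, vectors)]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Hypothetical venue profiles built from titles/keywords of past papers.
venues = {
    "Venue A": "neural network sentiment analysis text mining",
    "Venue B": "antenna design optimization wireless",
}
ranking = rank_venues("sentiment analysis with neural network", venues)
```

Here the paper text shares terms only with the first profile, so "Venue A" ranks first and "Venue B" scores zero. Systems that also use co-authorship data would need a graph-based component that this sketch omits.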
- Predicting corporate credit rating based on qualitative information of MD&A transformed using document vectorization techniques
Purpose: The purpose of this study is to investigate the effectiveness of qualitative information extracted from a firm’s annual report in predicting corporate credit rating. In practice, qualitative information such as published reports or management interviews has been known as an important source, in addition to quantitative information such as financial values, in assigning corporate credit ratings. Nevertheless, prior studies leave room for further research in that they rarely employed qualitative information in developing prediction models of corporate credit rating. Design/methodology/approach: This study adopted three document vectorization methods, Bag-Of-Words (BOW), Word to Vector (Word2Vec) and Document to Vector (Doc2Vec), to transform unstructured textual data into numeric vectors, so that Machine Learning (ML) algorithms accept them as input. For the experiments, we used the corpus of the Management’s Discussion and Analysis (MD&A) section in 10-K financial reports as well as financial variables and corporate credit rating data. Findings: Experimental results from a series of multi-class classification experiments show that the predictive models trained on both financial variables and vectors extracted from MD&A data outperform the benchmark models trained only on traditional financial variables. Originality/value: This study proposed a new approach for corporate credit rating prediction by using qualitative information extracted from MD&A documents as an input to ML-based prediction models. Also, this research adopted and compared three textual vectorization methods in the domain of corporate credit rating prediction and showed that BOW mostly outperformed Word2Vec and Doc2Vec.
- Patch antenna design optimization using opposition based grey wolf optimizer and map-reduce framework
Purpose: The microstrip patch antenna is generally used for several communication purposes, particularly in military and civilian applications. Even though several techniques have made numerous achievements in several fields, some systems require additional improvements to meet a few challenges. In particular, they require application-specific improvement for optimally designing a microstrip patch antenna. The paper aims to discuss these issues. Design/methodology/approach: This paper adopts an advanced meta-heuristic search algorithm called grey wolf optimization (GWO), which is inspired by the hunting behaviour of grey wolves, for the design of patch antenna parameters. The search for the optimal design of the antenna is sped up using the opposition-based solution search. Moreover, the proposed model derives a nonlinear objective model to aid the design of the solution space of antenna parameters. After executing the simulation model, this paper compares the performance of the proposed GWO-based microstrip patch antenna with several conventional models. Findings: The gain of the proposed model is 27.05 per cent better than the WOAD, 2.07 per cent better than the AAD, 15.80 per cent better than the GAD, 17.49 per cent better than the PSAD and 3.77 per cent better than the GWAD model. Thus, the proposed antenna model attains high gain, which leads to superior performance. Originality/value: This paper presents a technique for designing the microstrip patch antenna using the proposed GWO algorithm. This is the first work that utilizes GWO-based optimization for the microstrip patch antenna.
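The opposition-based solution search mentioned in this abstract is commonly built on the idea that each candidate x within bounds [lower, upper] has an opposite point lower + upper - x, and that evaluating both doubles the coverage of the search space. The sketch below shows only that opposition step, applied to population initialization; it is not the paper's algorithm and omits the grey wolf position updates. The sphere cost function stands in for the paper's nonlinear objective, which the abstract does not give, and all names and bounds are hypothetical.

```python
import random

def opposite(x, lower, upper):
    """Opposite point of x inside the box [lower, upper], per dimension."""
    return [lo + hi - xi for xi, lo, hi in zip(x, lower, upper)]

def opposition_init(n, lower, upper, cost, rng):
    """Sample n random candidates, add their opposites, keep the n cheapest."""
    population = [
        [rng.uniform(lo, hi) for lo, hi in zip(lower, upper)]
        for _ in range(n)
    ]
    candidates = population + [opposite(x, lower, upper) for x in population]
    return sorted(candidates, key=cost)[:n]

def sphere(x):
    """Placeholder cost (assumed): squared distance from the origin."""
    return sum(v * v for v in x)

# Hypothetical two-parameter design space, e.g. two patch dimensions.
lower, upper = [0.0, 0.0], [10.0, 5.0]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
wolves = opposition_init(6, lower, upper, sphere, rng)
```

The same `opposite` function could also be applied after each iteration of an optimizer to test whether the mirrored positions improve on the current ones, which is one common way such searches are accelerated.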
- GWLM–NARX. Grey Wolf Levenberg–Marquardt-based neural network for rainfall prediction
Purpose: Weather forecasting is a trending topic around the world, as it is the way to predict the threats posed by extreme rainfall conditions that damage human life and property. These issues can be managed only when the occurrence of severe weather is predicted in advance and sufficient warnings can be issued in time. Thus, keeping in mind the importance of the rainfall prediction system, the purpose of this paper is to propose an effective rainfall prediction model using the nonlinear auto-regressive with external input (NARX) model. Design/methodology/approach: The paper proposes a rainfall prediction model using time-series prediction that is enabled using the NARX model. The time-series prediction ensures the effective prediction of the rainfall in a particular area or locality based on the rainfall data of the previous term, month or year. The proposed NARX model serves as an adaptive prediction model, for which the rainfall data of the previous period is the input, and the optimal computation is based on the proposed algorithm. The adaptive prediction using the proposed algorithm is exhibited in the NARX, and the proposed algorithm is developed based on Grey Wolf Optimization and the Levenberg–Marquardt (LM) algorithm. The proposed algorithm inherits the advantages of both algorithms, with better computational time and accuracy. Findings: The analysis using two databases enables a better understanding of the proposed rainfall detection methods and proves the effectiveness of the proposed prediction method. The accuracy is found to be better compared with the other existing methods, and the mean square error and percentage root mean square difference of the proposed method are found to be around 0.0093 and 0.207, respectively. Originality/value: The rainfall prediction is enabled adaptively using the proposed Grey Wolf Levenberg–Marquardt (GWLM)-based NARX, wherein an algorithm, named GWLM, is proposed by the integration of the Grey Wolf Optimizer and the LM algorithm.
- A data-driven neural network architecture for sentiment analysis
Purpose: The fabulous results of convolutional neural networks in image-related tasks attracted the attention of text mining, sentiment analysis and other text analysis researchers. It is, however, difficult to find enough data for feeding such networks, optimize their parameters, and make the right...
- An innovative citation recommendation model for draft papers with varying degrees of information completeness
Purpose: When researchers are writing a draft paper with incomplete structure or text, one of the burdensome tasks is to deliberate about which references should be cited for one sentence or paragraph of this draft. In view of the rapid increase in the number of research papers, researchers desire to...
- Collaborative knowledge management for corporate ecological responsibility
Purpose: Knowledge has become the basis of enhancing the core competitiveness of enterprises in this era of knowledge-driven economies. Collaborative knowledge management not only realizes the real-time exchange and communication of knowledge among different enterprises, but also facilitates the...
- Data science from a library and information science perspective
Purpose: Data science is a relatively new field which has gained considerable attention in recent years. This new field requires a wide range of knowledge and skills from different disciplines including mathematics and statistics, computer science and information science. The purpose of this paper...
- Domain-specific word embeddings for patent classification
Purpose: Patent offices and other stakeholders in the patent domain need to classify patent applications according to a standardized classification scheme. The purpose of this paper is to examine the novelty of an application it can then be compared to previously granted patents in the same class....
- Factors that enable knowledge creation in higher education: a structural model
Purpose: Knowledge is recognized as a valuable asset and universities are in search of a new strategy that allows them to build their knowledge and experience. To achieve this goal, it seems essential to find the factors associated with knowledge creation (KC) in universities. There is currently no ...
- Generalised grey target decision method based on the Gini–Simpson index involving mixed attributes and uncertain numbers
Purpose: The purpose of this paper is to investigate a novel generalised grey target decision method (GGTDM) with index and weight involving mixed attribute values. Design/methodology/approach: The mixed attribute values are transformed into binary connection numbers and also comprised of two-tuple...
- Identification of operational demand in law enforcement agencies. An application based on a probabilistic model of topics
Purpose: The purpose of this paper is to develop a methodology for knowledge discovery in emergency response service databases based on police occurrence reports, generating information to help law enforcement agencies plan actions to investigate and combat criminal activities. Design/methodology/a...
- Kairotic and chronological knowing: diabetes logbooks in-and-out of the hospital
Purpose: The paper reflects on the role of knowledge artefacts in the patient-provider relationship across the organisational boundaries of the clinical setting. Drawing on the analysis of the diabetes logbook, the purpose of this paper is to illustrate the role of knowledge artefacts in a...