Data Technologies and Applications

Publisher:: Emerald Group Publishing Limited
Publication date:: 2021-02-01

ISSN:: 2514-9288

Latest documents

Impact on recommendation performance of online review helpfulness and consistency
Purpose: The existing collaborative filtering algorithm may select an insufficiently representative customer as the neighbor of a target customer, which reduces the accuracy of its recommendations. This study aims to investigate the impact on recommendation performance of selecting influential and representative customers. Design/methodology/approach: Some studies have shown that review helpfulness and consistency significantly affect purchase decision-making. Thus, this study focuses on customers who have written helpful and consistent reviews to select influential and representative neighbors. To achieve the purpose of this study, the authors apply a text-mining approach to analyze review helpfulness and consistency. In addition, they evaluate the performance of the proposed methodology using several real-world Amazon review data sets for experimental utility and reliability. Findings: This study is the first to propose a methodology to investigate the effect of review consistency and helpfulness on recommendation performance. The experimental results confirmed that recommendation performance was better when neighbors were selected from customers who wrote consistent or helpful reviews than when neighbors were selected from all customers. Originality/value: This study investigates the effect of review consistency and helpfulness on recommendation performance. Online reviews can enhance recommendation performance because they reflect the purchasing behavior of customers who consider reviews when purchasing items. The experimental results indicate that review helpfulness and consistency can enhance the performance of personalized recommendation services, increase customer satisfaction and increase confidence in a company.
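The neighbor-selection idea in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the helpfulness scores, the 0.5 cut-off, the customer names and the function names are all hypothetical, and the text-mining step that would produce the scores is assumed to have run already.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity over the items both customers rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[i] * b[i] for i in common)
    norm_a = sqrt(sum(a[i] ** 2 for i in common))
    norm_b = sqrt(sum(b[i] ** 2 for i in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_neighbors(target, ratings, helpfulness, k=2, min_helpfulness=0.5):
    """Return the k most similar customers whose reviews pass the cut-off."""
    candidates = [
        u for u in ratings
        if u != target and helpfulness.get(u, 0.0) >= min_helpfulness
    ]
    ranked = sorted(
        candidates,
        key=lambda u: cosine(ratings[target], ratings[u]),
        reverse=True,
    )
    return ranked[:k]

def predict(target, item, ratings, neighbors):
    """Similarity-weighted average of the neighbors' ratings for one item."""
    num = den = 0.0
    for u in neighbors:
        if item in ratings[u]:
            s = cosine(ratings[target], ratings[u])
            num += s * ratings[u][item]
            den += abs(s)
    return num / den if den else None

# Made-up ratings (item -> stars) and made-up helpfulness scores in [0, 1].
ratings = {
    "alice": {"a": 5, "b": 3},
    "bob": {"a": 5, "b": 3, "c": 4},
    "carol": {"a": 1, "b": 5, "c": 2},
    "dave": {"a": 5, "b": 3, "c": 1},
}
helpfulness = {"bob": 0.9, "carol": 0.8, "dave": 0.1}

neighbors = select_neighbors("alice", ratings, helpfulness)
prediction = predict("alice", "c", ratings, neighbors)
```

In this toy data "dave" rates exactly like "alice" but is excluded because his reviews score low on helpfulness, so the prediction for item "c" draws only on "bob" and "carol".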
ABEE: automated bio entity extraction from biomedical text documents
Purpose: The purpose of this study was to design a multitask learning model so that biomedical entities can be extracted from biomedical texts without ambiguity. Design/methodology/approach: The proposed automated bio entity extraction (ABEE) model is a multitask learning model built by combining single-task learning models. The model used Bidirectional Encoder Representations from Transformers to train the single-task learning models and then combined the models' outputs to find the variety of entities in biomedical text. Findings: The proposed ABEE model targeted unique gene/protein, chemical and disease entities in biomedical text. The finding is important for biomedical research such as drug discovery and clinical trials. This research helps reduce not only the effort of researchers but also the cost of new drug discoveries and new treatments. Research limitations/implications: The authors report no limitations of the model, but the research team plans to test the model with gigabytes of data and establish a knowledge graph so that researchers can easily estimate the entities of similar groups. Practical implications: The ABEE model will be helpful in various natural language processing tasks. In information extraction (IE), it plays an important role in biomedical named entity recognition and biomedical relation extraction; it is also useful in information retrieval tasks such as literature-based knowledge discovery. Social implications: During the COVID-19 pandemic, the demand for this type of work increased because of the increase in clinical trials at that time. Had this type of research been introduced earlier, it would have reduced the time and effort needed for new drug discoveries in this area.
Originality/value: This work proposed a novel multitask learning model that is capable of extracting biomedical entities from biomedical text without ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
Research on the generalization of social bot detection from two dimensions: feature extraction and detection approaches
Purpose: The proliferation of bots in social networks has profoundly affected the interactions of legitimate users. Detecting and rejecting these unwelcome bots has become part of the collective Internet agenda. Unfortunately, as bot creators use more sophisticated approaches to avoid being discovered, it has become increasingly difficult to distinguish social bots from legitimate users. Therefore, this paper proposes a novel social bot detection mechanism to adapt to new and different kinds of bots. Design/methodology/approach: This paper proposes a research framework to enhance the generalization of social bot detection from two dimensions: feature extraction and detection approaches. First, 36 features are extracted from four views for social bot detection. Then, this paper analyzes the feature contribution for different kinds of social bots and proposes the features with stronger generalization. Finally, this paper introduces outlier detection approaches to enhance the detection of ever-changing social bots. Findings: The experimental results show that the more important features generalize more effectively to different social bot detection tasks. Compared with the traditional binary-class classifier, the proposed outlier detection approaches adapt better to ever-changing social bots, with an F1 score of 89.23 per cent. Originality/value: Based on the visual interpretation of the feature contribution, the features with stronger generalization in different detection tasks are found. Outlier detection approaches are introduced for the first time to enhance the detection of ever-changing social bots.
Social support on Reddit for antiretroviral therapy
Purpose: Social media platforms such as Reddit can be used as a place for people with shared health problems to share knowledge and support. Previous studies have focused on the overall picture of how much social support people who live with HIV/AIDS (PLWHA) receive from online interactions. Yet, only a few studies have examined the impact of social support from social media platforms on antiretroviral therapy (ART), which is a necessary lifelong therapy for PLWHA. This study used social support theory to examine related Reddit posts. Design/methodology/approach: This study used content analysis to analyze ART-related Reddit posts. Each Reddit post was manually coded by two coders for social support type. A computational text analysis tool, Linguistic Inquiry and Word Count, was used to generate linguistic features. ANOVAs were conducted to compare differences in user engagement and well-being across the types of social support. Findings: Results suggest that most of the posts were informational support posts, followed by emotional support posts and instrumental support posts. Results indicate that there are no significant differences in the user engagement variables across social support types, but there are significant differences in several well-being variables, including analytic score, clout score, health word usage and negative emotional word usage. Originality/value: This study contributes to further understanding of social support theory in an online context used predominantly by a younger generation. Practical advice for public health researchers and practitioners is discussed.
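For readers unfamiliar with the ANOVA step mentioned in this abstract, the following sketch computes a one-way ANOVA F statistic from scratch. The group values are invented for illustration (they are not the study's data), the function name is hypothetical, and the p-value lookup against the F distribution is omitted because it would need a statistics library.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means)
    )
    # Variation of the observations around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for g, m in zip(groups, means) for x in g
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within

# Invented scores for three post types (informational, emotional, instrumental).
scores_by_type = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_stat, df_between, df_within = one_way_anova_f(scores_by_type)
```

A large F relative to the critical value for (df_between, df_within) degrees of freedom would indicate that at least one group mean differs from the others.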
Do SEC filings indicate any trends? Evidence from the sentiment distribution of forms 10-K and 10-Q with FinBERT
Purpose: This study quantified companies' views on the COVID-19 pandemic with sentiment analysis of US public companies' disclosures. The study aims to provide timely insights to shareholders, investors and consumers by exploring sentiment trends and changes in the industry and the relationship with stock price indices. Design/methodology/approach: From more than 50,000 Form 10-K and Form 10-Q published between 2020 and 2021, over one million texts related to the COVID-19 pandemic were extracted. Applying the FinBERT fine-tuned for this study, the texts were classified into positive, negative and neutral sentiments. The correlations between sentiment trends, differences in sentiment distribution by industry and stock price indices were investigated by statistically testing the changes and distribution of quantified sentiments. Findings: First, there were quantitative changes in texts related to the COVID-19 pandemic in the US companies' disclosures. In addition, the changes in the trend of positive and negative sentiments were found. Second, industry patterns of positive and negative sentiment changes were similar, but no similarities were found in neutral sentiments. Third, in analyzing the relationship between the representative US stock indices and the sentiment trends, the results indicated a positive relationship with positive sentiments and a negative relationship with negative sentiments. Originality/value: Performing sentiment analysis on formal documents like Securities and Exchange Commission (SEC) filings, this study was differentiated from previous studies by revealing the quantitative changes of sentiment implied in the documents and the trend over time. Moreover, an appropriate data preprocessing procedure and analysis method were presented for the time-series analysis of the SEC filings.
A new approach for histological classification of breast cancer using deep hybrid heterogenous ensemble
Purpose: Hundreds of thousands of deaths each year in the world are caused by breast cancer (BC). An early-stage diagnosis of this disease can reduce the morbidity and mortality rate by helping to select the most appropriate treatment options, especially by using histological BC images for the diagnosis. Design/methodology/approach: The present study proposes and evaluates a novel approach which consists of 24 deep hybrid heterogenous ensembles that combine the strength of seven deep learning techniques (DenseNet 201, Inception V3, VGG16, VGG19, Inception-ResNet-V3, MobileNet V2 and ResNet 50) for feature extraction and four well-known classifiers (multi-layer perceptron, support vector machines, K-nearest neighbors and decision tree) by means of hard and weighted voting combination methods for histological classification of BC medical images. Furthermore, the best deep hybrid heterogenous ensembles were compared to the deep stacked ensembles to determine the best strategy to design the deep ensemble methods. The empirical evaluations used four classification performance criteria (accuracy, sensitivity (recall), precision and F1-score) with fivefold cross-validation over the public BreakHis histological dataset at four magnification factors (40×, 100×, 200× and 400×). The Scott–Knott (SK) statistical test and the Borda count voting method were used to cluster the designed techniques and to rank the techniques belonging to the best SK cluster, respectively. Findings: Results showed that the deep hybrid heterogenous ensembles outperformed both their single (non-ensemble) counterparts and the deep stacked ensembles and reached accuracy values of 96.3, 95.6, 96.3 and 94 per cent across the four magnification factors 40×, 100×, 200× and 400×, respectively.
Originality/value: The proposed deep hybrid heterogenous ensembles can be applied for the BC diagnosis to assist pathologists in reducing the missed diagnoses and proposing adequate treatments for the patients.
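The combination rules named in this abstract (hard voting, weighted voting and Borda count ranking) can be sketched in a few lines. This is a generic illustration, not the authors' implementation: the labels, weights, ensemble names and tie-breaking rules below are assumptions.

```python
from collections import Counter, defaultdict

def hard_vote(predictions):
    """Majority label; ties go to the label that appears first."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Label with the largest total weight across classifiers."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)

def borda_rank(rankings):
    """Rank candidates by Borda points.

    Each ranking lists candidates best-first; position p out of n
    earns n - 1 - p points. Ties are broken alphabetically.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, candidate in enumerate(ranking):
            scores[candidate] += n - 1 - position
    return sorted(scores, key=lambda c: (-scores[c], c))

# Hypothetical predictions of four base classifiers for one image.
predictions = ["malignant", "benign", "benign", "benign"]
# Hypothetical weights, e.g. derived from validation accuracy.
weights = [0.7, 0.1, 0.1, 0.1]

hard_label = hard_vote(predictions)
weighted_label = weighted_vote(predictions, weights)

# Hypothetical rankings of three ensembles under four criteria.
rankings = [
    ["E1", "E2", "E3"],
    ["E2", "E1", "E3"],
    ["E1", "E3", "E2"],
    ["E1", "E2", "E3"],
]
final_ranking = borda_rank(rankings)
```

The toy data is chosen so that the two voting rules disagree: the majority says "benign", while the heavily weighted first classifier tips the weighted vote to "malignant".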
Property Assertion Constraints for ontologies and knowledge graphs
Purpose: The curation of ontologies and knowledge graphs (KGs) is an essential task for industrial knowledge-based applications, as they rely on the contained knowledge to be correct and error-free. Often, a significant amount of a KG is curated by humans. Established validation methods, such as Shapes Constraint Language, Shape Expressions or Web Ontology Language, can detect wrong statements only after their materialization, which can be too late. Instead, an approach that avoids errors and adequately supports users is required. Design/methodology/approach: For solving that problem, Property Assertion Constraints (PACs) have been developed. PACs extend the range definition of a property with additional logic expressed with SPARQL. For the context of a given instance and property, a tailored PAC query is dynamically built and triggered on the KG. It can determine all values that will result in valid property value assertions. Findings: PACs can avoid the expansion of KGs with invalid property value assertions effectively, as their contained expertise narrows down the valid options a user can choose from. This simplifies the knowledge curation and, most notably, relieves users or machines from knowing and applying this expertise, but instead enables a computer to take care of it. Originality/value: PACs are fundamentally different from existing approaches. Instead of detecting erroneous materialized facts, they can determine all semantically correct assertions before materializing them. This avoids invalid property value assertions and provides users an informed, purposeful assistance. To the author's knowledge, PACs are the only such approach.
Mining the determinants of review helpfulness: a novel approach using intelligent feature engineering and explainable AI
Purpose: This paper aims to find determinants that can predict the helpfulness of online customer reviews (OCRs) with a novel approach. Design/methodology/approach: The approach consists of feature engineering using various text mining techniques including BERT and machine learning models that can classify OCRs according to their potential helpfulness. Moreover, explainable artificial intelligence methodologies are used to identify the determinants for helpfulness. Findings: The important result is that the boosting-based ensemble model showed the highest prediction performance. In addition, it was confirmed that the sentiment features of OCRs and the reputation of reviewers are important determinants that augment the review helpfulness. Research limitations/implications: Each online community has different purposes, fields and characteristics. Thus, the results of this study cannot be generalized. However, it is expected that this novel approach can be integrated with any platform where online reviews are used. Originality/value: This paper incorporates feature engineering methodologies for online reviews, including the latest methodology. It also includes novel techniques to contribute to ongoing research on mining the determinants of review helpfulness.
A hybrid approach for predicting missing follower–followee links in social networks using topological features with ensemble learning
Purpose: Social networking platforms increasingly use follower–followee link prediction (FFLP) to expand their user base. FFLP facilitates the discovery of previously unidentified individuals and can be employed to determine the relationships among the nodes in a social network. Choosing the appropriate person to follow becomes crucial as the number of users increases. A hybrid model employing the Ensemble Learning algorithm for FFLP (HMELA) is proposed to advise the formation of new follower links in large networks. Design/methodology/approach: HMELA includes fundamental classification techniques for treating link prediction as a binary classification problem. The data sets are represented using a variety of machine-learning-friendly hybrid graph features. The HMELA is evaluated using six real-world social network data sets. Findings: The first set of experiments used exploratory data analysis on a di-graph to produce a balanced matrix. The second set of experiments compared the benchmark and hybrid features on data sets. This was followed by using benchmark classifiers and ensemble learning methods. The experiments show that the proposed HMELA method predicts missing links better than other methods. Practical implications: A hybrid model for link prediction is proposed in this paper. The HMELA model makes use of AUC scores to predict new future links. The proposed approach facilitates comprehension and insight into the domain of link prediction. This work is aimed primarily at academics, practitioners and others involved in the field of social networks. The model is also effective in product recommendation and in recommending new friends and users on social networks.
Originality/value: The results on six benchmark data sets revealed that when the HMELA strategy was applied to all of the selected data sets, the area under the curve (AUC) scores were greater than when individual techniques were applied to the same data sets. Using the HMELA technique, the maximum AUC score on the Facebook data set increased by 0.103 (10.3 percentage points), from 0.8449 to 0.9479. There was also an 8.53 per cent increase in accuracy on the Net Science, Karate Club and USAir data sets. As a result, the HMELA strategy outperforms every other strategy tested in the study.
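To make "topological features" concrete, the sketch below computes four standard neighborhood-based scores for a candidate link. It is an assumption-laden illustration: the abstract does not list HMELA's actual features, the edge list is invented, and direction is ignored for simplicity even though follower links are directed.

```python
from collections import defaultdict
from math import log

def build_neighbors(edges):
    """Map each node to the set of nodes it is linked to (undirected)."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    return nbrs

def link_features(nbrs, u, v):
    """Standard topological scores for the candidate link (u, v)."""
    nu, nv = nbrs.get(u, set()), nbrs.get(v, set())
    common = nu & nv
    union = nu | nv
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Shared neighbors with few links of their own count for more.
        "adamic_adar": sum(
            1.0 / log(len(nbrs[z])) for z in common if len(nbrs[z]) > 1
        ),
        "preferential_attachment": len(nu) * len(nv),
    }

# Invented toy network; ("a", "d") is the missing link to score.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"), ("b", "d")]
nbrs = build_neighbors(edges)
features = link_features(nbrs, "a", "d")
```

In a binary-classification setup such as the one the abstract describes, a feature dictionary like this would be computed for each linked and unlinked node pair and passed to the base classifiers and the ensemble.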
A cascaded deep-learning-based model for face mask detection
Purpose: This work aims to present a deep learning model for face mask detection in surveillance environments such as automatic teller machines (ATMs), banks, etc. to identify persons wearing face masks. In surveillance environments, complete visibility of the face area is a guideline, and criminals and law offenders commit crimes by hiding their faces behind a face mask. The face mask detector model proposed in this work can be used as a tool and integrated with surveillance cameras in autonomous surveillance environments to identify and catch law offenders and criminals. Design/methodology/approach: The proposed face mask detector is developed by integrating the residual network (ResNet)34 feature extractor on top of three You Only Look Once (YOLO) detection layers along with the usage of the spatial pyramid pooling (SPP) layer to extract a rich and dense feature map. Furthermore, at the training time, data augmentation operations such as Mosaic and MixUp have been applied to the feature extraction network so that it can get trained with images of varying complexities. The proposed detector is trained and tested over a custom face mask detection dataset consisting of 52,635 images. For validation, comparisons have been provided with the performance of YOLO v1, v2, tiny YOLO v1, v2, v3 and v4 and other benchmark work present in the literature by evaluating performance metrics such as precision, recall, F1 score, mean average precision (mAP) for the overall dataset and average precision (AP) for each class of the dataset. Findings: The proposed face mask detector achieved 4.75–9.75 per cent higher detection accuracy in terms of mAP, 5–31 per cent higher AP for detection of faces with masks and, specifically, 2–30 per cent higher AP for detection of face masks on the face region as compared to the tested baseline variants of YOLO. 
Furthermore, the usage of the ResNet34 feature extractor and SPP layer in the proposed detection model reduced the training time and the detection time. The proposed face mask detection model can perform detection over an image in 0.45 s, which is 0.15–0.2 s less than that of the other tested YOLO variants, so the proposed model performs detections at a higher speed. Research limitations/implications: The proposed face mask detector model can be utilized as a tool to detect persons with face masks who are a potential threat to automatic surveillance environments such as ATMs, banks, airport security checks, etc. The other research implication of the proposed work is that it can be trained and tested for other object detection problems such as cancer detection in images, fish species detection, vehicle detection, etc. Practical implications: The proposed face mask detector can be integrated with automatic surveillance systems and used as a tool to detect persons with face masks who are potential threats to ATMs, banks, etc. and, during the COVID-19 pandemic, to detect whether people in public areas are following the COVID-appropriate behavior of wearing a face mask. Originality/value: The novelty of this work lies in the usage of the ResNet34 feature extractor with YOLO detection layers, which makes the proposed model a compact and powerful convolutional neural-network-based face mask detector model. Furthermore, the SPP layer has been applied to the ResNet34 feature extractor to make it able to extract a rich and dense feature map. The other novelty of the present work is the implementation of Mosaic and MixUp data augmentation in the training network that provided the feature extractor with 3× images of varying complexities and orientations and further aided in achieving higher detection accuracy.
The proposed model is novel in terms of extracting rich features, performing augmentation at the training time and achieving high detection accuracy while maintaining the detection speed.