On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality
| Published date | 10 January 2024 |
| Pages | 517-544 |
| DOI | https://doi.org/10.1108/DTA-01-2023-0005 |
| Authors | Sara El-Ateif, Ali Idri, José Luis Fernández-Alemán |
On the differences between CNNs and vision transformers for COVID-19 diagnosis using CT and chest x-ray mono- and multimodality
Sara El-Ateif
Software Project Management Research Team, ENSIAS, Mohammed V
University, Rabat, Morocco
Ali Idri
Software Project Management Research Team, ENSIAS, Mohammed V
University, Rabat, Morocco and
Mohammed VI Polytechnic University, Ben Guerir, Morocco, and
José Luis Fernández-Alemán
Informatica y Sistemas, Universidad de Murcia, Murcia, Spain
Abstract
Purpose – COVID-19 continues to spread and to cause deaths. Physicians diagnose COVID-19
using not only real-time polymerase chain reaction but also the computed tomography (CT) and chest x-ray
(CXR) modalities, depending on the stage of infection. However, with so many patients and so few doctors, it
has become difficult to keep abreast of the disease. Deep learning models have been developed to
assist in this respect, and vision transformers are currently state-of-the-art methods, but most techniques
currently focus on only one modality (CXR).
Design/methodology/approach – This work aims to leverage the benefits of both CT and CXR to
improve COVID-19 diagnosis. This paper studies the differences between the convolutional
MobileNetV2, ViT DeiT and Swin Transformer models when trained from scratch and when pretrained on the
MedNIST medical dataset rather than the ImageNet dataset of natural images. The comparison is made by
reporting six performance metrics, the Scott–Knott Effect Size Difference, the Wilcoxon statistical test and the
Borda Count method. We also use the Grad-CAM algorithm to study the models' interpretability. Finally, the
models' robustness is tested by evaluating them on Gaussian-noised images.
Findings – Although the pretrained MobileNetV2 achieved the best raw performance, the best model in
terms of performance, interpretability and robustness to noise is the Swin Transformer trained from scratch,
using the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.
Originality/value – Models compared are pretrained on MedNIST and leverage both the CT and CXR modalities.
Keywords COVID-19, Multimodality, Feature fusion, Deep convolutional neural networks, Vision
transformers
Paper type Research paper
1. Introduction
Recent advances in deep learning (DL) when applied to computer vision have proven that
convolutional neural networks (CNNs) consider only local features and find it difficult to
The authors would like to express their gratitude for the support provided by the Google Ph.D.
Fellowship. This research is part of the OASSIS-UMU (PID2021-122554OB-C32) project (supported by
the Spanish Ministry of Science and Innovation). This project is also funded by the European Regional
Development Fund (ERDF).
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/2514-9288.htm
Received 3 January 2023
Revised 28 May 2023
Accepted 28 October 2023
Data Technologies and Applications
Vol. 58 No. 3, 2024
pp. 517-544
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-01-2023-0005
generalize when compared to vision transformers (ViT). ViT were inspired by natural
language processing (NLP) transformers and tackle computer vision tasks by dividing
images into patches rather than tokens, as is usual in NLP. ViT use multi-head attention
(MHA), which models long-range pixel relationships and thus gives ViT the ability to
capture global information and generalize (Dosovitskiy et al., 2020; Shamshad et al., 2022).
Although ViT require huge datasets in order to converge and outperform CNNs, they
consume fewer computational resources during training (Dosovitskiy et al., 2020) and are
more robust: they are less sensitive to adversarial perturbations at high frequencies and
obtain significantly better certified robustness than CNN-based models (Shao et al., 2021).
Moreover, recent models such as DeiT (Touvron et al., 2020) and the Swin Transformer
(Liu et al., 2021) have been developed to converge on smaller datasets (DeiT-S and Swin-T
scored 79.8 per cent and 81.3 per cent Top-1 accuracy on ImageNet-1K, respectively).
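The patch-and-attend mechanism described above can be sketched in a few lines. The following is a minimal illustrative example, not the implementation used in this or any cited work: the 16 × 16 image size, 4 × 4 patch size, single attention head and random projection matrices are all assumptions chosen for brevity. It shows how an image becomes a sequence of patch tokens and how each token's output mixes information from every other patch, which is the long-range (global) behaviour MHA provides.

```python
import numpy as np

def image_to_patches(img, patch=4):
    """Split an (H, W) image into flattened non-overlapping patch tokens."""
    h, w = img.shape
    rows, cols = h // patch, w // patch
    return (img[:rows * patch, :cols * patch]
            .reshape(rows, patch, cols, patch)
            .swapaxes(1, 2)                    # group pixels by patch
            .reshape(rows * cols, patch * patch))

def self_attention(tokens, d_k):
    """One attention head: every patch token attends to every other patch,
    so each output row is a globally informed summary of the image."""
    rng = np.random.default_rng(0)
    n, d = tokens.shape
    Wq, Wk, Wv = (rng.standard_normal((d, d_k)) / np.sqrt(d) for _ in range(3))
    Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = Q @ K.T / np.sqrt(d_k)            # (n, n) patch-to-patch affinities
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)  # softmax over patches
    return weights @ V

img = np.arange(16 * 16, dtype=float).reshape(16, 16)
tokens = image_to_patches(img, patch=4)   # 16 tokens, each of dimension 16
out = self_attention(tokens, d_k=8)       # (16, 8): one global-context vector per patch
```

A real ViT stacks many such heads and layers, adds learned (not random) projections, positional embeddings and a classification token; the point here is only the tokenisation and the all-to-all attention pattern that CNN convolutions lack.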
Several studies concerning medical imaging currently use ViT variants for disease
prognosis, diagnosis, detection and tracking, or to enhance the medical workflow by means
of segmentation, classification, detection, reconstruction, synthesis, registration and clinical
report generation, as listed in Shamshad et al. (2022). With regard to classification, and
especially when using 2D images, the review reports several studies on COVID-19
diagnosis (Jiang and Lin, 2021; Krishnan and Krishnan, 2021; Liu and Yin, 2021; Perera
et al., 2021; Shome et al., 2021; Fan et al., 2022; Le Dinh et al., 2022; Mondal et al., 2022).
COVID-19 is an acute respiratory disease that, according to the World Health Organization,
had, as of 5 August 2022, caused 6,407,556 deaths and infected 579,092,623
individuals. Physicians diagnose this illness using real-time polymerase chain reaction
and use the chest x-ray (CXR) and computed tomography (CT) imaging modalities for
further analysis, depending on the availability of the modality and the severity and stage of
the disease (Aljondi and Alghamdi, 2020). With regard to the CXR modality,
Krishnan and Krishnan (2021) compared the ImageNet-pretrained DenseNet, InceptionV3
and WideResNet101 with the ViT B/32 model, which outperformed the three CNN-based
models and obtained an accuracy score of 97.61 per cent. Shome et al. (2021) proposed
COVID-Transformer, based on ViT L-16, and compared it with the fine-tuned baselines
EfficientNetB0, InceptionV3, ResNet50, MobileNetV3, Xception and DenseNet-121; it
obtained an accuracy score of 92 per cent for the detection of COVID-19 versus normal
cases and an area under the curve (AUC) score of 98 per cent for the detection of COVID-19,
normal cases and pneumonia. Liu and Yin (2021) tailored and fine-tuned the Vision
Outlooker-D3 (VOLO-D3) transformer and trained it on a large, balanced CXR image
dataset for COVID-19 detection. Jiang and Lin (2021) combined in parallel, using weighted
averaging, the results of a trained Swin Transformer and a Transformer in Transformer
to distinguish among COVID-19, pneumonia and healthy cases, obtaining an
accuracy of 94.75 per cent. In order to distinguish between COVID-19, pneumonia and
normal patients and assess the severity of the disease, Le Dinh et al. (2022) trained
DenseNet121, ResNet50, InceptionNet, Swin Transformer and Hybrid EfficientNet-DOLG
on a customized dataset; the last model (Hybrid EfficientNet-DOLG) outperformed the four
former ones with an F1 macro-average of 95 per cent in the classification task and of
80 per cent in the severity assessment task. With regard to the CT modality, Fan
et al. (2022) proposed Trans-CNN Net, a parallel bi-branch model containing a transformer
and a CNN module in each branch, with bidirectional fusion of global (from the ViT) and
local (from the CNN) features. Trans-CNN Net outperformed ResNet-152 and
DeiT-B with an accuracy of 96.7 per cent. In the case of using both CT and CXR, Mondal
et al. (2022) designed xViTCOS for the CT (accuracy = 98 per cent) and CXR
(accuracy = 96 per cent) modalities using a multistage learning approach.
Meanwhile, Perera et al. (2021) used the point-of-care (POC) ultrasound modality to train