Band gap information extraction from materials science literature – a pilot study

Date: 26 August 2022
Pages: 438-454
DOI: https://doi.org/10.1108/AJIM-03-2022-0141
Published date: 26 August 2022
Subject Matter: Library & information science, Information behaviour & retrieval, Information & knowledge management, Information management & governance, Information management
Author: Satanu Ghosh, Kun Lu
Satanu Ghosh and Kun Lu
School of Library and Information Studies, University of Oklahoma, Norman,
Oklahoma, USA
Abstract
Purpose: This paper presents preliminary work on extracting band gap information of
materials from academic papers. With increasing demand for renewable energy, band gap information will
help materials scientists design and implement novel photovoltaic (PV) cells.
Design/methodology/approach: The authors collected 1.44 million titles and abstracts of scholarly articles
related to materials science, and then filtered the collection to 11,939 articles that potentially contain relevant
information about materials and their band gap values. ChemDataExtractor was extended to extract
information about PV materials and their band gaps. Evaluation was performed on randomly
sampled information records of 415 papers.
Findings: The current system correctly extracts information for 51.32% of articles,
with partially correct extraction for 36.62% of articles and incorrect extraction for 12.04%. The authors
also identified three main categories of errors, pertaining to chemical entity identification,
band gap information and interdependency resolution. Future work will focus on addressing these errors to
improve the performance of the system.
Originality/value: The authors did not find any literature to date on band gap information extraction from
academic text using automated methods. This work is unique and original. Band gap information is of
importance to materials scientists in applications such as solar cells, light emitting diodes and laser diodes.
Keywords: Band gap information extraction, Photovoltaic cell, Solar cell, Renewable energy, Text mining,
Academic text, ChemDataExtractor
Paper type: Research paper
Introduction
Global energy consumption is expected to increase nearly 50% by 2050 as a result of
economic and population growth according to the US Energy Information Administration
(Nalley and LaRose, 2021). Renewable and clean energy needs to play a bigger role in meeting
the demand due to the growing evidence of climate change that has been associated with
recent catastrophic events across the world (Pidcock and McSweeney, 2021). Photovoltaic
(PV) materials can convert light into electricity, which allows us to harness the abundant
clean solar energy. Existing PV materials have significant drawbacks: limited efficiency,
toxic metal content and/or reliance on scarce elements (Todorov et al., 2010; Mitzi et al.,
2011; Saparov and Mitzi, 2016; Correa-Baena et al., 2017). To address these problems, novel PV
materials are needed. However, traditional avenues for the discovery and implementation of
energy materials are inefficient, partially due to the reliance on trial-and-error methods.
Recent advances in data-driven approaches offer new opportunities for more efficient
materials design and discovery.
The authors appreciate the contribution of Dr. Bayram Saparov and Dr. Bin Wang to the ideation of this
work. This work is partially supported by the University of Oklahoma Data Institute for Societal
Challenges Seed Grant and College of Arts and Sciences Data Scholarship Initiative Grant.
The current issue and full text archive of this journal are available on Emerald Insight at:
https://www.emerald.com/insight/2050-3806.htm
Received 15 March 2022; revised 14 June 2022, 19 July 2022 and 11 August 2022; accepted 12 August 2022.
Aslib Journal of Information Management, Vol. 75 No. 3, 2023, pp. 438-454.
© Emerald Publishing Limited, 2050-3806. DOI 10.1108/AJIM-03-2022-0141
PV materials can convert light energy to electric current due to a physical phenomenon
called the photovoltaic effect. For a PV material to convert light into electricity, photons in
the light need to carry enough energy to excite electrons in the material into a free state to
create electric current. Band gap is the minimum amount of energy required to excite an
electron in a material into such a free state. Band gap is an intrinsic property of materials.
Materials whose band gaps are too high are not suited for PV cells because photons will not have
enough energy to excite the electrons in these materials. On the other hand, materials whose
band gaps are too low are not ideal for PV cells either, because photons will carry more
energy than is needed to excite the electrons and the excess energy will be converted to heat, which is
undesired. Knowing the band gap is therefore very important for materials scientists to
determine candidate materials for PV cells. This information has been widely reported in
scientific literature from experimental and computational studies, and continues to appear
in upcoming publications, but the volume of the literature prevents scientists from gaining
a complete view of the band gaps of various materials. Manually collecting this
information has been attempted (e.g. Kasap, 2006), but is inefficient and unable to keep up
with the ever-increasing volume of scientific literature. As a result, most scientific
decisions are made based on partial information, which can lead to missed opportunities
for discovering novel solar materials.
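The band gap criterion described above can be made concrete with a small worked example. The sketch below converts a photon's wavelength to energy using E = hc/λ and compares it with a material's band gap. The function names are illustrative and do not come from the study, and the band gap of 1.1 eV used in the demonstration is only an assumed example value, roughly that of crystalline silicon.

```python
from typing import Tuple

# Planck constant times the speed of light, expressed in eV·nm,
# so that E [eV] = HC_EV_NM / wavelength [nm].
HC_EV_NM = 1239.84193


def photon_energy_ev(wavelength_nm: float) -> float:
    """Return the energy of a photon (in eV) for a wavelength given in nm."""
    if wavelength_nm <= 0:
        raise ValueError("wavelength must be positive")
    return HC_EV_NM / wavelength_nm


def classify_photon(wavelength_nm: float, band_gap_ev: float) -> Tuple[str, float]:
    """Return (outcome, excess energy in eV) for one photon and one band gap.

    A photon below the band gap cannot excite an electron; a photon above it
    can, and the energy beyond the band gap is the part that ends up as heat.
    """
    energy = photon_energy_ev(wavelength_nm)
    if energy < band_gap_ev:
        return "not absorbed", 0.0
    return "absorbed", energy - band_gap_ev


if __name__ == "__main__":
    assumed_band_gap_ev = 1.1  # assumed example value, not taken from the paper
    for wavelength in (500.0, 1000.0, 1500.0):
        outcome, excess = classify_photon(wavelength, assumed_band_gap_ev)
        print(f"{wavelength:.0f} nm: {outcome}, excess energy {excess:.2f} eV")
```

With a higher assumed band gap, more of the long-wavelength photons fall into the "not absorbed" case; with a lower one, the excess energy per absorbed photon grows. This is the trade-off the paragraph above describes.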
This study develops an automated method to extract band gap information from
materials science literature. The method is evaluated based on the extraction results on a
random sample of 415 articles from a collection of 11,939 materials science articles potentially
containing band gap information. Text mining for materials science is still in its early stage
(Kononova et al., 2021). The closest tool available for extracting such chemical information
from scientific literature is ChemDataExtractor, which was developed to extract
spectroscopic attributes and experimental properties (Swain and Cole, 2016), and recently
extended to extract material properties relevant to battery materials (Huang and Cole, 2020).
We extend the ChemDataExtractor tool to extract band gap information. Machine-learning-
based approaches could also be used to extract information from text. However, training data
are very scarce in the materials science domain, and no specific training data can be found for
the band gap information extraction task.
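To illustrate what rule-based extraction of band gap values involves, the following minimal sketch uses a plain regular expression. It is a deliberately simplified stand-in, not the authors' ChemDataExtractor extension and not the ChemDataExtractor API; the pattern, the `BandGapMention` type and the `extract_band_gaps` function are hypothetical names introduced only for this example.

```python
import re
from typing import List, NamedTuple


class BandGapMention(NamedTuple):
    """One extracted value together with the sentence it came from."""
    value_ev: float
    sentence: str


# Hypothetical pattern: "band gap", "bandgap" or "band-gap", followed within
# the same clause (at most 80 characters, no "." or ";") by a number and "eV".
BAND_GAP_PATTERN = re.compile(
    r"band[\s-]?gap[^.;]{0,80}?(\d+(?:\.\d+)?)\s*eV",
    re.IGNORECASE,
)

# Naive sentence splitter: break after ., ! or ? when followed by whitespace.
SENTENCE_SPLIT = re.compile(r"(?<=[.!?])\s+")


def extract_band_gaps(text: str) -> List[BandGapMention]:
    """Return every band gap value (in eV) that the pattern finds in the text."""
    mentions = []
    for sentence in SENTENCE_SPLIT.split(text.strip()):
        for match in BAND_GAP_PATTERN.finditer(sentence):
            mentions.append(BandGapMention(float(match.group(1)), sentence))
    return mentions


if __name__ == "__main__":
    # Made-up example text, not taken from the evaluated collection.
    abstract = (
        "CH3NH3PbI3 films were deposited by spin coating. "
        "The optical band gap of the film is 1.55 eV. "
        "The device efficiency reached 15%."
    )
    for mention in extract_band_gaps(abstract):
        print(mention.value_ev, "eV <-", mention.sentence)
```

A pattern like this finds the value but does not decide which material it belongs to: in the example text the compound is named in a different sentence from the band gap. Linking values to chemical entities requires entity recognition and dependency resolution on top of value matching.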
As far as we know, no existing study has developed automated methods to extract band
gap information from materials science literature. This study aims to fill this gap. In addition
to solar cells, the band gap information is also useful for other applications, such as light
emitting diodes and laser diodes.
Related work
Text mining for scientific literature
Generation of new knowledge is of utmost importance for scientific progress [1]. Scientific
publications remain the primary channel for scientists to communicate new ideas and
discoveries. As the volume of scientific publications continues to grow rapidly, it has become
increasingly challenging for scientists to keep up with the latest developments in their fields.
This can lead to suboptimal decisions based on incomplete information. Text mining relies on
natural language processing techniques and/or manually curated ontologies to analyze large
amounts of text automatically in order to offer more efficient ways for scientists to harness
the existing knowledge in scientific literature. This may involve extracting (Mooney and
Bunescu, 2005), summarizing (Nenkova and McKeown, 2012), aggregating (Serrano et al.,
2013), categorizing (Brindha et al., 2016) and inferring (Erraguntla et al., 2012) information
from text. In addition, by analyzing and synthesizing what has been reported in the literature,
literature-based discoveries may also be achieved (Gordon and Dumais, 1998). Information
extraction plays an important role in transforming the unstructured text into structured
information that is easy to query and access. While the influx of data and expanding volume
of scientific literature can lead to data-intensive scientific discovery (Tolle et al., 2011), the