Design and implementation of crawling algorithm to collect deep web information for web archiving

Date: 03 April 2018
Publication Date: 03 April 2018
DOI: https://doi.org/10.1108/DTA-07-2017-0053
Pages: 266-277
Author: Hyo-Jung Oh, Dong-Hyun Won, Chonghyuck Kim, Sung-Hee Park, Yong Kim
Subject: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Metadata, Information & knowledge management, Information & communications technology, Internet
Hyo-Jung Oh
Graduate School of Archives and Records Management,
Chonbuk National University, Jeonju, The Republic of Korea
Dong-Hyun Won
Center for Disaster Safety Information, Chonbuk National University, Jeonju,
The Republic of Korea
Chonghyuck Kim
Department of English Language and Literature, Chonbuk National University,
Jeonju, The Republic of Korea
Sung-Hee Park
Physical Medicine and Rehabilitation, Chonbuk National University, Jeonju,
The Republic of Korea, and
Yong Kim
Department of Library and Information Science, Chonbuk National University,
Jeonju, The Republic of Korea
Abstract
Purpose: This paper describes the development of an algorithm for web crawlers that
automatically collect dynamically generated webpages from the deep web.
Design/methodology/approach: This study proposes and develops an algorithm that manages script
commands as links, so that the web crawler collects web information as if it were gathering static
webpages. The algorithm is tested experimentally with a web crawler that collects deep webpages.
Findings: When a website provides search results as script pages, a conventional crawling process
collects only the first page. The proposed algorithm can collect the deep webpages in this case.
Research limitations/implications: To use a script as a link, a human must first analyze the web
document. This study uses the web browser object provided by Microsoft Visual Studio as a script launcher,
so it cannot collect deep webpages if the web browser object cannot launch the script or if the web document
contains script errors.
Practical implications: The research results show that the deep web is estimated to hold 450 to 550 times
more information than the surface web, and that its web documents are difficult to collect. The proposed
algorithm helps enable deep web collection by running scripts.
Originality/value: This study presents a new method that uses script links instead of the keywords
adopted in previous approaches. Under the proposed algorithm, a script link can be used like an ordinary
URL. The experiment shows that the scripts on each individual website must be analyzed before they can be
employed as links.
Keywords: Archives, Web archiving, Automatic crawler, Deep web, Link, Web information
Paper type: Research paper
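The idea summarized in the abstract, managing script commands as links so that the crawler proceeds as if it were gathering static pages, can be illustrated with a minimal sketch. It is not the authors' implementation: the paper uses the web browser object provided by Microsoft Visual Studio as the script launcher, whereas this sketch is written in Python and replaces both the network and the script launcher with a small in-memory fake site. The names `PAGES`, `SCRIPT_RESULTS`, `fetch`, `run_script`, `goPage`, and the example.org URLs are all hypothetical and exist only for illustration.

```python
from collections import deque
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect every href on a page, including javascript: commands."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value.strip())


def crawl(start_url, fetch, run_script):
    """Breadth-first crawl in which script commands are queued like URLs.

    A frontier entry is (kind, target, context): for kind "url" the target is
    fetched directly; for kind "script" the command is run on the page whose
    URL is stored in context. Both callables are supplied by the caller.
    """
    start = ("url", start_url, None)
    frontier = deque([start])
    seen = {start}
    collected = []
    while frontier:
        kind, target, context = frontier.popleft()
        if kind == "url":
            html, page_url = fetch(target), target
        else:
            # The address does not change when a script renders the next page,
            # so the page the script ran on stays the context.
            html, page_url = run_script(context, target), context
        collected.append((kind, target))

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            if href.lower().startswith("javascript:"):
                entry = ("script", href[len("javascript:"):], page_url)
            else:
                entry = ("url", href, None)
            if entry not in seen:
                seen.add(entry)
                frontier.append(entry)
    return collected


# Hypothetical in-memory site standing in for real HTTP and a real browser.
START = "http://example.org/search"
PAGES = {
    START: '<a href="javascript:goPage(2)">2</a>'
           '<a href="http://example.org/about">about</a>',
    "http://example.org/about": "<p>about</p>",
}
SCRIPT_RESULTS = {
    (START, "goPage(2)"): '<a href="javascript:goPage(3)">3</a>',
    (START, "goPage(3)"): "<p>last result page</p>",
}


def fetch(url):
    return PAGES[url]


def run_script(context_url, command):
    return SCRIPT_RESULTS[(context_url, command)]


if __name__ == "__main__":
    for kind, target in crawl(START, fetch, run_script):
        print(kind, target)
```

A crawler that discarded the `javascript:` links would stop after the first result page and the static "about" page. Queuing the script commands alongside URLs is what lets this sketch reach the second and third result pages. In a real setting, `run_script` would have to drive an actual browser engine, and, as the abstract notes, a person would first need to analyze which scripts on a given site behave as links.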
Introduction
Background and objective
According to a survey conducted by Netcraft in April 2017, there are 1,816,416,499 websites
in the world. Considering information such as content provided by individual websites, the
Data Technologies and Applications
Vol. 52 No. 2, 2018
pp. 266-277
© Emerald Publishing Limited
2514-9288
DOI 10.1108/DTA-07-2017-0053
Received 7 July 2017
Revised 13 January 2018
Accepted 23 January 2018
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/2514-9288.htm
This work was supported by the Ministry of Education of the Republic of Korea and the National Research
Foundation of Korea (NRF-2016S1A5B8913575). The chief of this project was Professor Kim, who has been
with us for the past few years but will be remembered in our hearts for the coming countless years.
