Inverse local context analysis. A method for exhaustively gathering documents from limited accessible data sources
DOI | https://doi.org/10.1108/EL-12-2014-0211 |
Date | 06 June 2016 |
Published date | 06 June 2016 |
Pages | 405-418 |
Author | Wei Lu, Xinghu Yue, Qikai Cheng, Rui Meng |
Subject Matter | Information & knowledge management, Information & communications technology, Internet |
Inverse local context analysis
A method for exhaustively gathering documents from limited accessible data sources
Wei Lu, Xinghu Yue, Qikai Cheng and Rui Meng
School of Information Management, Wuhan University, Wuhan, China
Abstract
Purpose – The purpose of this paper is to explore the use of inverse local context analysis (ILCA) to
obtain data from limited accessible data sources.
Design/methodology/approach – The experimental results show that the method the authors
proposed can obtain all retrieved documents from the limited accessible data source using the
fewest queries.
Findings – The experimental results show that the method the authors proposed can obtain all
retrieved documents from the limited accessible data source using the fewest queries.
Originality/value – To the best of the authors’ knowledge, this paper provides the first attempt to
gather all the retrieved documents from a limited accessible data source, and the efficiency and ease of
implementation of the proposed solution make it feasible for practical applications. The method the
authors proposed can also benefit the construction of web corpora.
Keywords Query expansion, Limited accessible data source, Local context analysis, Total recall,
Exhaustive search
Paper type Research paper
Introduction
The World Wide Web provides a large and heterogeneous data corpus for scholars from
different domain areas to investigate diverse research questions (Robert, 2009). It has
been utilized as a language corpus for linguists (Baroni and Bernardini, 2004; Kilgarriff
and Grefenstette, 2003) and successfully applied in many natural language processing
(NLP) applications, such as machine translation (Grefenstette, 1999), term extraction
and grammar checking (Liu and Curran, 2006). In information retrieval, web data have
been successfully applied in web page classification (Qi and Davison, 2009), image
retrieval and annotation (Wang et al., 2008) and user query classification (Hu et al., 2009).
Search engines and online databases provide a convenient way of accessing these
resources (Zhu and Xie, 2003). Many successful studies created corpora, partially or
fully, by utilizing search engine results. Examples include public opinion monitoring
and trend analysis (Sharoff, 2006; Wang, 2011; Zhu, 2012). When creating corpora from
search engines or online databases, a common method is to submit a query, manually or
automatically, to the search system and then gather documents from the search results.
However, to reduce system cost, modern commercial search engines, such as Google and
Baidu (the largest Chinese search engine), usually do not provide all of the retrieved
documents in their results. According to 2014 statistics for general search engines,
for each query, Google returns fewer than 1,000 results and Baidu returns, at most,
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0264-0473.htm
Received 10 December 2014
Revised 5 April 2015
9 April 2015
Accepted 25 May 2015
The Electronic Library
Vol. 34 No. 3, 2016
pp. 405-418
© Emerald Group Publishing Limited
0264-0473
DOI 10.1108/EL-12-2014-0211