Inverse local context analysis. A method for exhaustively gathering documents from limited accessible data sources

DOI: https://doi.org/10.1108/EL-12-2014-0211
Date: 06 June 2016
Published date: 06 June 2016
Pages: 405-418
Authors: Wei Lu, Xinghu Yue, Qikai Cheng, Rui Meng
Subject Matter: Information & knowledge management, Information & communications technology, Internet
Wei Lu, Xinghu Yue, Qikai Cheng and Rui Meng
School of Information Management, Wuhan University, Wuhan, China
Abstract
Purpose – The purpose of this paper is to explore the use of inverse local context analysis (ILCA) to
obtain data from limited accessible data sources.
Design/methodology/approach – The experimental results show that the method the authors
proposed can obtain all retrieved documents from the limited accessible data source using the
fewest queries.
Findings – The experimental results show that the proposed method can obtain all retrieved
documents from the limited accessible data source using the fewest queries.
Originality/value – To the best of the authors’ knowledge, this paper provides the first attempt to
gather all the retrieved documents from a limited accessible data source, and the efficiency and ease of
implementation of the proposed solution make it feasible for practical applications. The method the
authors proposed can also benefit the construction of web corpora.
Keywords Query expansion, Limited accessible data source, Local context analysis, Total recall,
Exhaustive search
Paper type Research paper
Introduction
The World Wide Web provides a large and heterogeneous data corpus for scholars from
different domain areas to investigate diverse research questions (Robert, 2009). It has
been utilized as a language corpus for linguists (Baroni and Bernardini, 2004; Kilgarriff
and Grefenstette, 2003) and successfully applied in many natural language processing
(NLP) applications, such as machine translation (Grefenstette, 1999), term extraction
and grammar checking (Liu and Curran, 2006). In information retrieval, web data have
been successfully applied in web page classification (Qi and Davison, 2009), image
retrieval and annotation (Wang et al., 2008) and user query classification (Hu et al., 2009).
Search engines and online databases provide a convenient way of accessing these
resources (Zhu and Xie, 2003). Many successful studies created corpora, partially or
fully, by utilizing search engine results. Examples include public opinion monitoring
and trend analysis (Sharoff, 2006; Wang, 2011; Zhu, 2012). When creating corpora from
search engines or online databases, a common method is to submit a query, manually or
automatically, to the search system and then gather documents from the search results.
However, to reduce system cost, modern commercial search engines, such as Google and
Baidu (the largest Chinese search engine), usually do not provide all of the retrieved
documents in their results. According to the statistics of the general search engines in
2014, for each query, Google returns fewer than 1,000 results and Baidu returns, at most,
Received 10 December 2014
Revised 5 April 2015
9 April 2015
Accepted 25 May 2015
The Electronic Library
Vol. 34 No. 3, 2016
pp. 405-418
© Emerald Group Publishing Limited
0264-0473
DOI 10.1108/EL-12-2014-0211
