Clustering search results. Part I: web‐wide search engines

Date: 27 February 2007
Published date: 27 February 2007
Pages: 85-91
DOI: https://doi.org/10.1108/14684520710731056
Author: Péter Jacsó
Subject Matter: Information & knowledge management, Library & information science
SAVVY SEARCHING
Clustering search results. Part I:
web-wide search engines
Péter Jacsó
University of Hawaii, Hawaii, USA
Abstract
Purpose – The purpose of this paper is to examine the clustering of search results. Traditionally,
professional online information services presented search results in reverse chronological
order. Later, relevance ranking was introduced to order the hits on the result list and so
separate the wheat from the chaff.
Design/methodology/approach – The need for better presentation of search results retrieved from
millions, then billions, of highly unstructured and untagged Web pages became obvious. Clustering
became a popular software tool to enhance relevance ranking by grouping items in the typically very
large result list. The clusters of items with common semantic and/or other characteristics can guide the
users in refining their original queries, to zoom in on smaller clusters and drill down through
sub-groups within the cluster.
Findings – Despite its proven efficiency, clustering is not available in the primary
Web-wide search engines (Windows Live, Yahoo and Google), with the exception of Ask.
Originality/value – Smaller, secondary Web-wide search engines (WiseNut, Gigablast, and
especially Exalead) offer good clustering options.
Keywords Search engines, Cluster analysis, Worldwide web
Paper type General review
Introduction
Although traditionally search results from professional online information services were
presented in reverse chronological order, this could be changed to sort the results by
author, journal name, article title and some other data elements. The scope and choice of
data elements for sorting depend on the host system. Dialog, for example, has offered
many sort options but no sorting by the article title. The sort options also depend on the
type of database. In a business directory the results typically can be sorted by postal
code, NAICS code, value of assets and liabilities, number of employees, etc.
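The sort options described above can be illustrated with a minimal sketch. The record fields and sample values below are hypothetical and are not taken from Dialog or any other host system; the sketch only shows a result list that defaults to reverse chronological order and can be re-sorted by another data element.

```python
from operator import itemgetter

# Hypothetical bibliographic records; field names and values are illustrative only.
records = [
    {"author": "Nagy", "journal": "Journal B", "title": "Ranking", "year": 2005},
    {"author": "Adams", "journal": "Journal C", "title": "Clustering", "year": 2004},
    {"author": "Lee", "journal": "Journal A", "title": "Sorting", "year": 2006},
]


def sort_results(results, field="year", descending=None):
    """Return the results sorted by a single data element.

    When no direction is given, sorting by year is descending
    (reverse chronological) and every other field is ascending.
    """
    if descending is None:
        descending = field == "year"
    return sorted(results, key=itemgetter(field), reverse=descending)


default_order = sort_results(records)            # newest first
by_author = sort_results(records, "author")      # alphabetical by author
```

Which fields a real host accepts as sort keys depends on the system and the database type, as noted above; a business directory would expose keys such as postal code or number of employees instead.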
Later, relevance ranking was introduced for ordering the display sequence of the
hits on the result list, which is supposed to separate the wheat from the chaff. The
relevance ranking algorithm is intended to determine which document or document
surrogate best matches the subject as presented by the user’s query. The different
search programs use very different ranking algorithms. The details of the algorithm
are not revealed, but the widely different rank position of the same items in the same
set retrieved from the same datafile on the various hosts clearly shows that relevance is
in the eye of the search software (Jacsó, 2005). An item which is top-listed for a
query in the result list generated from the implementation of a datafile on
System A may have a much lower rank in System B for the same query and the same
datafile, and thus may not even be seen by the average user, who looks only at the
first page of the result list, which usually shows 10 items by default.
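The effect of differing rank positions can be made concrete with a small sketch. The two ranked lists below are invented for illustration (System B simply reverses System A) and do not reproduce any real host's ranking algorithm, which, as noted, is not disclosed. The function only reports where each item falls on each system and whether it would appear on a first page of 10 items.

```python
def rank_positions(ranked_ids):
    """Map each document id to its 1-based position in a ranked list."""
    return {doc_id: pos for pos, doc_id in enumerate(ranked_ids, start=1)}


def compare_rankings(system_a, system_b, page_size=10):
    """For items retrieved by both systems, return tuples of
    (doc_id, rank in A, rank in B, visible on first page of B)."""
    pos_a = rank_positions(system_a)
    pos_b = rank_positions(system_b)
    return [
        (doc_id, pos_a[doc_id], pos_b[doc_id], pos_b[doc_id] <= page_size)
        for doc_id in system_a
        if doc_id in pos_b
    ]


# Hypothetical result sets: the same 12 items, ranked in opposite order.
system_a = [f"doc{i}" for i in range(1, 13)]
system_b = list(reversed(system_a))

report = compare_rankings(system_a, system_b)
```

In this invented case the item ranked first by System A is ranked twelfth by System B, so a user who reads only the first page of 10 hits on System B would never see it.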
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/1468-4527.htm
Online Information Review
Vol. 31 No. 1, 2007
pp. 85-91
© Emerald Group Publishing Limited
1468-4527
DOI 10.1108/14684520710731056
