Document file searching

DOI: https://doi.org/10.1108/eb045639
Published date: 01 March 1998
Pages: 199-209
Author: Howard Falk
Subject Matter: Information & knowledge management, Library & information science
1. Finding the Desired Document
Wherever computer use becomes established, document files begin to accumulate, and the problem of finding the desired document becomes increasingly important as the number of files increases.
If each file name provided a complete description of the essential nature of the document contents, this finding problem would not exist. But file names are almost always short. For desktop computers, file names have been mainly limited to 8 characters plus a 3-character suffix. The truth is, even if file names were much longer than that, they would not be a suitable vehicle to answer the rich variety of document search questions that arise.
2. Automatic Word Indexes
The traditional approach to finding desired documents has been to index the document collection, and then use the index (in printed form, or in the form of a database) to identify desired documents.
When the documents are stored as computer files, the finding process can be considerably streamlined. The basic approach is to create a complete index of the words contained in all the documents in the collection. (By word, I mean a string of text characters that follows a space, and also ends with a space.) This word index is compiled automatically, by software. To keep the index small, words that are not useful for searching are excluded. The index relates each occurrence of a word with the document file that contains that word.
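The word index described above can be sketched in a few lines of Python. This is only an illustration, not the method of any particular search package: the stop-word list, the function name `build_index`, the sample file names, and the choice to lowercase words and strip surrounding punctuation are all assumptions. For brevity the sketch records which files contain a word rather than every individual occurrence.

```python
from collections import defaultdict

# Hypothetical stop-word list; a real package would use a longer one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def build_index(documents):
    """Map each indexed word to the set of file names that contain it.

    `documents` is a dict of {file_name: text}. Words are split on
    whitespace, following the article's definition of a word.
    """
    index = defaultdict(set)
    for file_name, text in documents.items():
        for raw in text.split():
            # Assumption: ignore case and surrounding punctuation.
            word = raw.strip(".,;:!?()\"'").lower()
            if word and word not in STOP_WORDS:
                index[word].add(file_name)
    return index


docs = {
    "budget.txt": "The annual budget report for the library.",
    "memo.txt": "A memo to the library staff.",
}
index = build_index(docs)
print(sorted(index["library"]))  # files that contain "library"
```

Excluding the stop words keeps entries such as "the" out of the index entirely, which is the size saving the text refers to.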
Search software is used to bring the word index into action. You enter search words, and the search software flips through the index making comparisons. When a word match is found, the files referred to by the index are displayed on a hit list. Most search software can handle strings of characters (parts of words), and phrases containing more than a single word, as well as individual words.
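The three kinds of search mentioned here (whole words, parts of words, and phrases) might look roughly as follows. The index contents, document texts, and function names are hypothetical, and the phrase search uses a simple substring check on the stored text, which is only one of several ways a real package could confirm a phrase.

```python
# Hypothetical word index: word -> set of file names containing it.
INDEX = {
    "library": {"budget.txt", "memo.txt"},
    "librarian": {"memo.txt"},
    "budget": {"budget.txt"},
    "report": {"budget.txt"},
}

# Hypothetical document texts, needed only to confirm phrases.
TEXTS = {
    "budget.txt": "annual budget report for the library",
    "memo.txt": "memo from the librarian to the library staff",
}


def search_word(index, word):
    """Whole-word match: a single lookup in the index."""
    return sorted(index.get(word.lower(), set()))


def search_substring(index, fragment):
    """Part-of-word match: compare the fragment with every indexed word."""
    fragment = fragment.lower()
    hits = set()
    for word, files in index.items():
        if fragment in word:
            hits |= files
    return sorted(hits)


def search_phrase(index, texts, phrase):
    """Phrase match: narrow candidates with the index, then check the text."""
    words = phrase.lower().split()
    candidates = set(texts)
    for word in words:
        if word in index:  # unindexed words (e.g. stop words) do not narrow
            candidates &= index[word]
    joined = " ".join(words)
    return sorted(f for f in candidates if joined in texts[f].lower())


print(search_word(INDEX, "library"))
print(search_substring(INDEX, "rarian"))
print(search_phrase(INDEX, TEXTS, "budget report"))
```

Each function returns a sorted list of file names, which plays the role of the hit list.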
The hit list can provide direct access to the desired documents. Many search packages display text lines from within the files on the hit list. And most allow you to view the files on the hit list by launching the appropriate application (for example, Microsoft Word to view a Word file).
3. Indexing Should Be Fast
How long does it take to index a collection of document files? This is one of several important questions that should be asked about any document file searching facility. You really do not want to have your computer taken over for hours, or even for longer than a few minutes, by indexing software.
The need for rapid indexing is compounded by the fact that a collection of documents is usually not a static thing. New documents may be added and other documents removed. Each time a change occurs, the collection should be indexed all over again. If not, the index will be incomplete, obsolete, and not too useful.
One way of meliorating the need for indexing speed is to make the indexing process a background task. Then, indexing can take place in the gaps of time when current computer tasks are idle. For example, when a word processing user pauses to think while typing material into a document, an idle moment is created during which indexing could take place in the background. In fact, there is usually much more idle time than there is active time while most word processing tasks are underway.
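One possible way to picture background indexing is a worker thread that takes documents from a queue while the foreground program carries on. The sketch below is an assumption-laden illustration: the names `index_worker` and `pending` are made up, and the thread simply relies on the operating system's scheduler rather than detecting idle moments itself.

```python
import queue
import threading


def index_worker(pending, index, lock):
    """Index queued documents one at a time until a None sentinel arrives."""
    while True:
        item = pending.get()
        if item is None:
            break
        file_name, text = item
        words = set(text.lower().split())
        with lock:  # protect the shared index while updating it
            for word in words:
                index.setdefault(word, set()).add(file_name)


pending = queue.Queue()
index = {}
lock = threading.Lock()

worker = threading.Thread(
    target=index_worker, args=(pending, index, lock), daemon=True
)
worker.start()

# The foreground program is free to continue while these are indexed.
pending.put(("memo.txt", "library staff memo"))
pending.put(("budget.txt", "library budget"))
pending.put(None)  # sentinel: no more work
worker.join()      # only so the example can print a finished index

with lock:
    print(sorted(index["library"]))
```

In an interactive application the foreground code would not call `join()`; it is used here only so that the example finishes with a complete index.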
4. Reindexing Should Be Automatic
Indexing, whenever it is needed, should be an automatic process. Otherwise the integrity of the index, and of the file-finding mechanism, will depend on the ability of each and every user to remember to reindex every time a document is added, revised, or deleted.
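To make reindexing automatic, software first has to notice that a document was added, revised, or deleted. A minimal sketch, assuming the software keeps a snapshot of file names and modification times (which could be gathered with `os.scandir` and `st_mtime`), is to compare the previous snapshot with the current one. The function names and sample values are hypothetical.

```python
def diff_snapshots(old, new):
    """Compare two {file_name: modification_time} snapshots.

    Returns (added, revised, deleted) as sorted lists of file names.
    """
    added = sorted(f for f in new if f not in old)
    deleted = sorted(f for f in old if f not in new)
    revised = sorted(f for f in new if f in old and new[f] != old[f])
    return added, revised, deleted


def needs_reindex(old, new):
    """True when any document was added, revised, or deleted."""
    return any(diff_snapshots(old, new))


before = {"memo.txt": 100.0, "budget.txt": 200.0}
after = {"memo.txt": 150.0, "notes.txt": 300.0}

print(diff_snapshots(before, after))
print(needs_reindex(before, after))
```

A check like this lets the software decide for itself when the index is out of date, so no user has to remember to reindex.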
At the same time, reindexing should not interfere with processing of other computer activities. Making the indexing task a background task helps, but may not be sufficient. Even in the background, the computing load imposed by automatic indexing can significantly slow such tasks as Internet access. Therefore, it is important to minimize the effect of automatic indexing on computer operation.
One approach is to have the user select the time of day when reindexing is scheduled to take place. This works well if the computer is left running at hours when no one is using it. However, those in charge of the computer may be reluctant to allow it to run all day and night.
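Scheduling by time of day comes down to working out when the next run should start. The following sketch assumes a daily schedule and a user-chosen time; the function name `next_run` and the 2:30 a.m. setting are illustrative only.

```python
from datetime import datetime, time, timedelta


def next_run(now, run_at):
    """Return the next datetime at which the daily reindex should start."""
    candidate = datetime.combine(now.date(), run_at)
    if candidate <= now:
        # Today's slot has already passed, so wait until tomorrow.
        candidate += timedelta(days=1)
    return candidate


chosen = time(2, 30)  # hypothetical user choice: 2:30 a.m.
print(next_run(datetime(1998, 6, 1, 14, 0), chosen))
```

The scheduled run can only happen if the computer is still switched on at the chosen hour, which is the limitation the text points out.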
HARDWARE CORNER
The Electronic Library, Vol. 16, No. 3, June 1998 199
