A study of metadata element co‐occurrence

Published date: 01 July 2006
DOI: https://doi.org/10.1108/14684520610686319
Pages: 428-453
Author: Jin Zhang, Iris Jastram
Subject Matter: Information & knowledge management, Library & information science
Jin Zhang and Iris Jastram
School of Information Studies, University of Wisconsin, Milwaukee,
Wisconsin, USA
Abstract
Purpose – This paper investigates metadata usage behavior in internet web pages in terms of
metadata element co-occurrence. Metadata are designed to help web publishers/authors
organize their web pages and to help search engines index those pages accurately.
Design/methodology/approach – This study examines the types of metadata elements employed
by different professional groups of web authors, the number of elements they prefer to use, and the
types of element combinations they typically embed in their pages’ HTML code.
Findings – The findings reveal that the “keyword” and “description” elements were the most popular
single elements. The most popular combination of two elements was that of “keyword and
description”. Very few authors included combinations of five elements. This study also shows that
preferences for element combinations varied by domains.
Originality/value – This approach will enhance the current understanding of metadata usage
behavior and may help search engine designers as they continue their quest for improved indexing
and retrieval of web pages.
Keywords Behaviour, Internet, Information organizations
Paper type Research paper
1. Introduction
The proliferation of information on the internet has made information retrieval from
that resource a challenging discussion topic for researchers, search engine and subject
index developers, and Internet users alike. Metadata could help this situation if it were
used consistently and well. There is no centralized control over the form or content of
embedded metadata, however, which causes many to fear that it is too easily misused
or abused. Given the vast potential metadata possesses to enhance internet information
organization and retrieval, and given the equally vast potential for internet resource
creators to misrepresent their pages through metadata (either accidentally or
maliciously), researchers have focused their efforts either on the theoretical side or the
practical side of metadata implementation: how metadata can or should be used and
how metadata is being used.
The current issue and full text archive of this journal is available at
www.emeraldinsight.com/1468-4527.htm
Refereed article received 28 February 2006
Revision approved for publication 15 April 2006
Online Information Review, Vol. 30 No. 4, 2006, pp. 428-453
© Emerald Group Publishing Limited, 1468-4527
DOI 10.1108/14684520610686319
The most commonly used metadata scheme on the Internet is the HTML “meta” tag.
The researchers found that of the 2,400 pages visited, 62.83 percent included this type
of metadata embedded in the HTML code. This is a much greater percentage than the
7.42 percent of pages containing Dublin Core and the 44.12 percent containing any
other scheme of metadata. This scheme has no standardized element set, leaving the
choice of the type and quantity of elements entirely up to the resource author. This type
of metadata can be found in the source code of a web page in the format <META
name="[tag name, such as keywords]" content="[metadata content, such as a list of
keywords]"/>. Because the author has complete control over the type and quantity of
metadata elements, this scheme allows metadata to be as simple or as complex as the
author wishes.
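To make the “meta” tag format described above concrete, the following minimal Python sketch shows one way the element names embedded in a page could be extracted and their pairwise co-occurrences counted. It is an illustration only and is not the procedure used in this study: the class `MetaNameCollector`, the functions `meta_element_names` and `pair_cooccurrence`, and the three sample pages are all hypothetical, and the sketch assumes that an element counts as present whenever a “meta” tag carries both a `name` and a `content` attribute.

```python
from collections import Counter
from html.parser import HTMLParser
from itertools import combinations


class MetaNameCollector(HTMLParser):
    """Collect the lowercased `name` attribute of every <meta> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.names = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <META NAME=...> also matches.
        if tag != "meta":
            return
        attr_map = dict(attrs)
        name = attr_map.get("name")
        # Assumption: an element is "used" only if it has both a name and a content value.
        if name and attr_map.get("content") is not None:
            self.names.add(name.strip().lower())


def meta_element_names(html):
    """Return the set of meta element names found in one page's HTML source."""
    collector = MetaNameCollector()
    collector.feed(html)
    collector.close()
    return collector.names


def pair_cooccurrence(pages):
    """Count, over all pages, how often each pair of element names appears together."""
    counts = Counter()
    for html in pages:
        # Sorting gives each pair one canonical order, e.g. ("description", "keywords").
        names = sorted(meta_element_names(html))
        counts.update(combinations(names, 2))
    return counts


# Invented sample pages, for illustration only.
SAMPLE_PAGES = [
    '<html><head><META name="keywords" content="metadata, web">'
    '<META name="description" content="A sample page"/></head></html>',
    '<html><head><meta name="Keywords" content="indexing">'
    '<meta name="description" content="Another page">'
    '<meta name="author" content="A. Writer"></head></html>',
    '<html><head><title>No metadata</title></head></html>',
]

if __name__ == "__main__":
    for pair, count in pair_cooccurrence(SAMPLE_PAGES).most_common():
        print(pair, count)
```

Counting combinations of three or more elements would only require changing the second argument of `combinations`. Because the scheme has no standardized element set, the sketch lowercases names but does not otherwise reconcile variant spellings chosen by different authors.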
It is this flexibility that causes much of the discussion among researchers and
search engine developers. How much granularity is beneficial and how much dilutes
the effectiveness of the scheme or renders the scheme too complex for the average web
author? As Campbell points out, metadata is pulled in two directions: that of traditional
information organization and bibliographic description on one side, and that of “the
emerging standards that will form the web of the future” on the other (Campbell, 2002).
On the one hand, metadata stems from a long history of describing resources using
standardized formats and vocabularies. On the other hand, metadata of the future is as
yet undetermined; this type of metadata is still evolving quietly on the web.
Those researchers who believe metadata should be governed by stricter standards
argue that the lack of controlled vocabulary fundamentally dilutes metadata’s
effectiveness. Chepesuik, for example, argues that metadata is really “cataloging by
another name” (Chepesuik, 1999). As such, he maintains, controlled vocabulary is
necessary to fend off “bibliographic chaos” (Chepesuik, 1999). He quotes Michael
Gorman as saying:
There is no third way between cataloging, controlled vocabularies, etc. (expensive and
effective) and the chaos of keyword searching on the web (inexpensive and utterly ineffective)
(Chepesuik, 1999).
Other researchers also note that without some standardization and centralized control,
metadata will have little value and therefore will not be used by search engines (see
Sokvitne, 2000; Henshaw and Valauskas, 2001; Tennant, 2003, 2004). The stakes are
therefore quite high. Lack of bibliographic control could lead to such inconsistent
metadata that search engines completely disregard it, which would make the use of
metadata by authors publishing pages to the open internet an exercise in futility.
Proponents of a looser metadata scheme, however, argue that if metadata is too
difficult for the average web author to create, those authors will not use the scheme or
will misuse the scheme, each of which could result in the ultimate demise of metadata
as a tool of internet resource discovery. Carl Lagoze (2001), for example, argues that
even though there is a place for greater granularity in metadata, there is also a strong
argument for “pidgin” metadata on the internet. This type of metadata would be simple
enough that multiple and diverse search algorithms could access its contents, and in
this way the pidgin scheme would allow for basic cross-domain resource discovery
(Lagoze, 2001). Diane Hillman (2003) agrees, saying that there are such differences in
vocabulary preferences between spheres of knowledge that pidgin metadata schemes
provide better cross-domain retrieval possibilities.
Those who give advice and do research on how to increase web page visibility seem
to agree with Lagoze and Hillman. Most advocate using only “keyword”, “description”,
or a combination of those two elements (see Richardson, 2003; Search Engine
Optimization, n.d.; Search Engine Optimization 1-2-3, n.d.; Sullivan, 2003; Yahoo.com,
n.d.). Other research indicates that the “keyword”, “description”, and “title” elements
influence retrieval and ranking more than other elements do (Zhang and Dimitroff,
2005). This type of research tends to support the proposal to keep metadata simple.
Creating metadata is expensive, requiring time and thought. Every element added
costs money, so in an age of tightening profit margins metadata’s strength is often seen