Metadata mega mess in Google Scholar

Published date: 23 February 2010
Pages: 175-191
DOI: https://doi.org/10.1108/14684521011024191
Author: Péter Jacsó
Journal: Online Information Review, Vol. 34 No. 1, 2010
Subject matter: Information & knowledge management; Library & information science
SAVVY SEARCHING
Péter Jacsó
University of Hawaii, Hawaii, USA
Abstract
Purpose – Google Scholar (GS) has shed the beta label on the fifth anniversary of launching its
service. This paper aims to assess how ready the production version is for metadata-based
searching and for bibliometric use.
Design/methodology/approach – As good as GS is – through its keyword search option – at finding
information about tens of millions of documents, many of them in open access full text format, it is as
bad for metadata-based searching when, beyond keywords in the title, abstract, descriptor and/or full
text, the searcher also has to use author name, journal title and/or publication year in specifying the
query. This paper provides a review of recent developments in Google Scholar.
Findings – GS is especially inappropriate for bibliometric searches, for evaluating the publishing
performance and impact of researchers and journals.
Originality/value – Even if the clean-up of Google Scholar accelerates, it should not be forgotten that
those evaluations of individuals and journals that have been based on Google Scholar in the past
few years have grossly handicapped many authors and journals whose names were replaced by
phantom entries.
Keywords Information searches, Text retrieval
Paper type Conceptual paper
Google Scholar (GS) has shed the beta label on the fifth anniversary of launching its
service. As good as GS is – through its keyword search option – at finding information
about tens of millions of documents, many of them in open access full text format, it is as
bad for metadata-based searching when, beyond keywords in the title, abstract, descriptor
and/or full text, the searcher also has to use author name, journal title and/or publication
year in specifying the query. GS is especially inappropriate for bibliometric searches, for
evaluating the publishing performance and impact of researchers and journals.
Current results of testing this production version clearly indicate that GS is not
ready even for calculating the fairly simple indicators of publishing productivity of
individuals and journals in a given time period. Some systematic flaws (reported in
earlier issues), which resulted in several million erroneously attributed records, have
been fixed (corrected or deleted) but many others remain and huge volumes of records
keep being generated by the grossly under-trained and inferior web crawlers of GS.
Often these errors deprive the real authors of their authorship, decreasing their
productivity and massively distorting their citedness counts by attributing
publications to phantom/putative authors, bumping the real authors. GS’s parsers
keep creating authors from section headings in articles, from menu options on web
pages and/or from journal names, and from MeSH (Medical Subject Headings) terms
assigned to documents. In other cases phantom authors do not replace the real ones,
just join them as co-author(s). In both scenarios the indicators may be further distorted
by erroneous publication year data construed by GS from page numbers, parts of
phone/fax numbers, postal codes and other numeric data. These errors taint
bibliometric indicators such as articles per journal, publications per author,
publications per author per year, number of authors per publication, and others. The
major patterns and sample volumes of these errors in traditional bibliographic
metadata are presented in this paper; a separate paper will follow on the metadata
mess in the citedness counts calculated by GS for individual researchers,
publications and lately also for partial or complete journal runs.
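The distortion that such misparsed numeric data can cause is easy to demonstrate. The sketch below is illustrative only: GS's actual parser is not public, and the "first plausible four-digit number" heuristic is an assumption made for the demonstration. A year-extractor built this way will happily report part of a phone number as the publication year:

```python
import re

def naive_year(text):
    """Return the first 4-digit number that looks like a year.

    Illustrative only: GS's real parser is not public, and this
    regex heuristic is an assumption made for the demonstration.
    """
    m = re.search(r"\b(1[89]\d{2}|20\d{2})\b", text)
    return int(m.group(1)) if m else None

# A reference string whose first plausible "year" is part of a phone number:
ref = ("Dept. of Information Science, tel. +1 808 1956 4321, "
       "Honolulu, HI 96822. Published 2008.")
print(naive_year(ref))  # 1956 -- taken from the phone number, not 2008
```

Any indicator computed per publication year (publications per author per year, journal output per volume) inherits whatever such a heuristic happens to grab.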
Introduction
The launch of the Google Scholar service was an important milestone because it offered
unprecedented free access to bibliographic data (often with abstract) and also to
millions of full text articles. It grew out of the CrossRef project, initiated in 2000, to
create a citation-linking network that allows the quick look-up of the full text of cited
references in the digital full text archives of scholarly publishers – by subscribers.
Google Inc. was the software partner in the pilot phase of the project that started with
the cooperation of 45 scholarly publishers. As of Fall 2009 the number of participating
publishers is about 2,900 (including societies) and the number of items assigned a DOI
(digital object identifier) is well above 38 million. The number of serial publications in
CrossRef is close to 21,000.
I have regularly reviewed GS since its launch and have praised its huge advantage
in finding information at no charge about scholarly papers through keyword searching
and often also leading to a free full text version of good quality papers on any topic.
However, I have also kept warning researchers about several problems in GS, such as:
• large-scale problems with the primitive parsing of the digital mega collections of
the largest academic publishers;
• insane handling of the simplest Boolean OR operation, and the lack of word
truncation;
• lumping together in its result list the master records for the papers written by
the author and the mini records labelled as [citation] extracted from the cited
references; and
• production of highly inflated hit counts and often absurdly high citation counts
(due to the extremely loose criteria for citation matching).
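The last point – loose citation matching – is worth a toy illustration. The matching rule below is made up for the demonstration (it is not GS's actual code); it shows how a permissive word-overlap criterion lets two distinct papers both count as citations of the same target, inflating its citedness:

```python
def loose_match(a, b):
    """Treat two reference strings as the same work if they share
    three or more words -- a deliberately permissive made-up rule,
    here only to illustrate the failure mode (not GS's actual code)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    return len(words_a & words_b) >= 3

cited_refs = [
    "Metadata quality in digital libraries",            # a genuine citation
    "Digital libraries and metadata quality control",   # a different paper
]
target = "On metadata quality in large digital libraries"
# Both strings "match" the target, so its citation count doubles:
print(sum(loose_match(target, ref) for ref in cited_refs))  # 2
```

The looser the criterion, the more false matches accumulate, which is one way hit counts and citation counts drift upward.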
Most of these problems originate from a mix of incompetence, carelessness and
reckless negligence in essential quality control tests. GS developers choose to ignore
the readily available metadata tags that identify the various bibliographic data
elements (author, journal, publication year, etc.) and to let their very under-educated
parsers figure out who the authors are, what the publication year is, and what the names
of the journals, conference proceedings and books are.
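The metadata tags referred to above are trivial to read programmatically. In the sketch below the sample page head is invented, but the Highwire-style `citation_*` tag names are the ones Google Scholar's own inclusion guidelines ask publishers to embed; Python's standard-library HTML parser suffices to collect them:

```python
from html.parser import HTMLParser

class MetaTagParser(HTMLParser):
    """Collect <meta name="..." content="..."> pairs from a page head."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d and "content" in d:
                self.meta.setdefault(d["name"], []).append(d["content"])

# Invented sample page head with Highwire-style citation_* tags:
page_head = """
<head>
  <meta name="citation_title" content="Metadata mega mess in Google Scholar">
  <meta name="citation_author" content="Jacso, Peter">
  <meta name="citation_publication_date" content="2010">
  <meta name="citation_journal_title" content="Online Information Review">
</head>
"""
parser = MetaTagParser()
parser.feed(page_head)
print(parser.meta["citation_author"])  # ['Jacso, Peter']
```

A service that consulted such explicit tags, where publishers provide them, would not need to guess authors and years from the body text at all.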
The GS team unleashed parsers that were trained a decade ago to retrieve and index
the unstructured web, focusing on identifying the URL and the title of the web page from
the HTML commands, when indexing pages with practically no metadata before the
introduction of the Dublin Core standard. This worked fine in the revolutionary general
Google search engine for finding the needles in the haystack and for creating result sets
matching the typical queries of the users, which consisted of 1.2 words on average. Indeed,
the very smart relevance-ranking algorithm of Google offered many of the most relevant
