Cultural heritage as digital noise: nineteenth century newspapers in the digital archive

DOI: https://doi.org/10.1108/JD-09-2016-0106
Pages: 1228-1243
Date: 09 October 2017
Published date: 09 October 2017
Author: Johan Jarlbrink, Pelle Snickars
Subject Matter: Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet
Johan Jarlbrink and Pelle Snickars
Department of Culture and Media Studies, Umeå University, Umeå, Sweden
Abstract
Purpose: The purpose of this paper is to explore and analyze the digitized newspaper collection at the
National Library of Sweden, focusing on cultural heritage as digital noise. In what specific ways are
newspapers transformed in the digitization process? If the digitized document is not the same as the source
document, is it still a historical record, or is it transformed into something else?
Design/methodology/approach: The authors have analyzed the XML files from Aftonbladet 1830 to 1862.
The most frequent newspaper words not matching a high-quality reference corpus were selected
to zoom in on the noisiest part of the paper. The variety of the interpretations generated by optical character
recognition (OCR) was examined, as well as texts generated by auto-segmentation. The authors have made a
limited ethnographic study of the digitization process.
Findings: The research shows that the digital collection of Aftonbladet contains extreme amounts of noise:
millions of misinterpreted words generated by OCR, and millions of texts re-edited by the auto-segmentation
tool. How the tools work is mostly unknown to the staff involved in the digitization process. Sticking to any
idea of a provenance chain is hence impossible, since many steps have been outsourced to unknown factors
affecting the source document.
Originality/value: The detailed examination of digitally transformed newspapers is valuable to scholars
depending on newspaper databases in their research. The paper also highlights the fact that libraries
outsourcing digitization processes run the risk of losing control over the quality of their collections.
Keywords: Sweden, Archives, Documents, Accuracy, Print media, Auto-segmentation,
Character recognition equipment, Large-scale digitization
Paper type: Research paper
Introduction
In October 1847, the telegraphic wire in St Germain outside of Paris was struck by lightning.
The Swedish newspaper Aftonbladet reported that a telegraph assistant at a station nearby
had discovered the demolished telegraph printing several letters on its own. Yet, according
to the paper, since they were not coherent, he decided to signal the phrase used for "I do not
understand." In doing so, however, he received a heavy electric shock, which was followed
by a loud bang, sounding like a gunshot (Aftonbladet, 1847).
Within a digitization project initiated by the National Library of Sweden, the October 1847
copy of Aftonbladet was digitized in 2013 at the Swedish Media Conversion Centre.
The newspaper Aftonbladet, founded in 1830, was one of the key titles in nineteenth century
Sweden. It is often described as the first modern newspaper; consequently, it was also the
first newspaper to be completely digitized by the National Library. Then again, if a telegraph
struck by lightning in the late 1840s produced some truly uncanny results, the same can be said
of present day digitization processes. The digital version of the paper with the lightning-telegraph
incident, in fact, literally reported that the struck assistant saw a dazzling
light along the wires on the walls conducting electricity de visI devärdigavid värdigavid
dejemte fullkomen ihåförvintparkerslagna förvintparkerslagna parkerslagna kentas till70
70 misvårt fruktarsnart tAf eoch sisrans njes ej [] which fell down in pieces, burning the
table and the floor.
Journal of Documentation
Vol. 73 No. 6, 2017
pp. 1228-1243
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-09-2016-0106
Received 5 September 2016
Revised 2 March 2017
Accepted 19 March 2017
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0022-0418.htm
This research has been funded by The Torsten Söderberg Foundation.
The mysterious words in the middle of the quote are not Swedish, and no reader of
Aftonbladet in 1847 would have found them in any newspaper copy (hence no reference to
the quote). Like the non-coherent letters printed by the telegraph, they seem to have been
generated by an external disturbance, which, however, occurred 166 years later
through the very act of digitization at the Swedish Media Conversion Centre. Today, these
sentences can be found in the newspaper database, Svenska dagstidningar, at the
National Library. Many texts and words now part of similar digitized newspaper databases
share the same fate, and some are of a similarly weird kind. What was never printed in old
newspapers has today become part of the historical record.
In this paper we argue that the digitization of historical newspapers is not a neutral
process where data are transferred from one medium to another. On the contrary, when
newspapers are digitized, they are transformed. Like telegraphic signals, they usually
resemble what was transmitted, but sometimes they do not. In this paper, we are consequently
interested in noisy media, and the ways that digitized nineteenth century Swedish
newspapers can today be perceived as a sort of waste(d) heritage. Likewise, there exists a
fascinating media historical analogy since basically all of the printed newspaper issues
reporting on the failing St Germain telegraph in 1847 were also turned into waste (or waste
paper) after a few days. And the very few copies that survived are now slowly
disintegrating in library repositories.
As is well known, libraries all over the world are today digitizing their historical
newspapers for preservation, as well as making them digitally available for research
(and pleasure). The results of these efforts are useful databases that make it possible to search
millions of newspaper pages online. Yet, as we argue in this paper, contemporary digitization
processes can also be seen as a continuation of the process turning newspapers into waste.
As the digitized newspaper report from 1847 displays, digitization can generate its own sort of
waste, usually in the form of digital noise. In addition, digitization today results in physical
paper copies being wasted, almost as soon as the pages have been scanned. In most cases
preservation of cultural heritage is the opposite of destruction (Assmann, 2008). But for
newspaper digitization, preservation and destruction go hand in hand.
In this paper, prompted by a media archeological interest to dig deeper, we ask
ourselves: in what ways are newspaper documents being transformed in the digitization
process? What kinds of errors actually occur? If the digitized document is not the same as
the source document, is it still a historical record, or is it transformed into something else?
How is it possible to practice source criticism when the mechanisms and algorithms for
selecting, capturing, processing and storing the historical data are hidden behind graphical
user interfaces? In short, the process of digitization, optical character recognition (OCR),
article segmentation, modes of presentation, etc. are all infrastructural settings that
transform old newspapers into new objects with a media specificity different from the
original paper prints. As a consequence, the growing reliance on digital reproductions of old
newspapers raises questions, from both a heritage and a research perspective, regarding
the function of such scanned documents, especially the relation between newspaper source
documents and digital reproductions (Mussell, 2012).
Media technologies are seen as user friendly as long as one does not have to bother about
the way the underlying technologies work. The archive, as Jussi Parikka (2012) makes clear,
could be seen as such a media technology, since it is the implicit starting point for so much
historical research that it itself, as a place and a media form, has been neglected, become
almost invisible (p. 113). It can thus be argued that digital archives are more invisible than
traditional archives, since the mechanisms regulating them are virtually hidden behind a
graphical user interface. This is obviously in conflict with historical methodologies
emphasizing source criticism, and questions regarding the selection and processing of
sources. The way digitized data (as newspaper files) are created, stored, processed and