The medium-term prospects for long-term storage systems

Date: 20 March 2017
DOI: https://doi.org/10.1108/LHT-11-2016-0128
Published date: 20 March 2017
Pages: 11-31
Author: David Stuart Holmes Rosenthal
Subject Matter: Library & information science, Librarianship/library management, Library technology, Information behaviour & retrieval, Information user studies, Metadata, Information & knowledge management, Information & communications technology, Internet
David Stuart Holmes Rosenthal
LOCKSS Program, Stanford University, Palo Alto, California, USA
Abstract
Purpose: Increasingly, the content that libraries collect is no longer on paper, a long-lived medium whose
technology changes very slowly and with which they have centuries of experience. Instead, it is stored on
relatively short-lived digital media whose technology appears to change rapidly and with which they have
little history. The paper aims to discuss this issue.
Design/methodology/approach: The storage media industry is highly competitive and is currently
evolving rapidly as flash, a solid state medium, displaces spinning disk from many applications. Long-term
archival storage is a small part of the total storage market. It typically re-uses media and systems intended for
more general bulk storage.
Findings: What are the medium-term prospects for change in this market?
Originality/value: Much of this material has appeared in blog posts and talks aimed at storage experts,
such as the recent DARPA workshop on the future of storage. It is presented here for a librarian audience with
the necessary additional exposition and background.
Keywords: Digital libraries, Data mining, Data storage, Digital preservation, Archiving, Digital storage
Paper type: Technical paper
What is long-term storage?
The storage of a computer system is usually described as a hierarchy. Newly created
or recently accessed data reside at the top of the hierarchy in relatively small amounts of
very fast, very expensive media (Figure 1). As data age, they migrate down the hierarchy to
larger, slower and cheaper media.
Long-term storage implements the base layers of the hierarchy, often called "bulk"
or "capacity" storage. Most discussions of storage technology focus on the higher, faster
layers, which these days are the territory of all-flash arrays holding transactional databases,
search indexes, breaking news pages and so on. The data in those systems are always just a
cache. Long-term storage is where old blog posts, cat videos and most research data sets
spend their lives.
What temperature is your data?
If everything is working as planned, data in the top layers of the hierarchy will be accessed
much more frequently, be "hotter," than data further down. At scale, this effect can be
extremely strong.
Subramanian Muralidhar and a team from Facebook, USC and Princeton have an OSDI
paper, "f4: Facebook's Warm BLOB Storage System" (Muralidhar et al.),
describing the warm layer between Facebook's Haystack (Beaver et al.) hot storage layer
and their cold storage layers. The third section describes the behavior of BLOBs (Binary
Large Objects) of different types in Facebook's storage system. Each type of BLOB contains
a single type of immutable binary content, such as photos, videos or documents. The access
rates for different types of BLOB drop differently, but all nine types have dropped by two
orders of magnitude within eight months, and all but one (profile photos) have dropped by an
order of magnitude within the first week.
Library Hi Tech, Vol. 35 No. 1, 2017, pp. 11-31
© Emerald Publishing Limited, 0737-8831
DOI 10.1108/LHT-11-2016-0128
Received 9 November 2016
Accepted 29 November 2016
The current issue and full text archive of this journal is available on Emerald Insight at:
www.emeraldinsight.com/0737-8831.htm
The author is grateful to Seagate, and in particular to Dave B. Anderson, for (twice) allowing the author
to pontificate about their industry, to Brian Berg for his encyclopedic knowledge of the history of flash,
and to Tom Coughlin for illuminating discussions and the graph of exabytes shipped. This is not to say
that they agree with any of the above. This work was supported by the institutional members of the
LOCKSS Alliance and the CLOCKSS Archive.
The Facebook data make two really strong arguments for hierarchical storage
architectures at scale:
(1) that significant kinds of data should be moved from expensive, high-performance
hot storage to cheaper warm and then cold storage as rapidly as feasible; and
(2) that the I/O rate that warm storage should be designed to sustain is so different from
that of hot storage, at least two and often many more orders of magnitude, that
attempting to re-use hot storage technology for warm and even worse for cold
storage is futile.
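The first argument can be made concrete with a small sketch. The Python below is a hypothetical illustration, not Facebook's actual placement policy: the names `BlobStats` and `choose_tier` and both thresholds are assumptions, chosen only to mirror the one and two orders of magnitude quoted above.

```python
from dataclasses import dataclass

# Hypothetical thresholds (assumptions, not taken from the f4 paper):
# a tenfold drop in access rate suggests warm storage,
# a hundredfold drop suggests cold storage.
WARM_DROP = 10.0
COLD_DROP = 100.0


@dataclass
class BlobStats:
    initial_rate: float  # accesses per day when the BLOB was new
    current_rate: float  # accesses per day now


def choose_tier(stats: BlobStats) -> str:
    """Pick a storage tier from how far the access rate has fallen."""
    if stats.current_rate <= 0:
        # No accesses at all: treat as cold and avoid dividing by zero.
        return "cold"
    drop = stats.initial_rate / stats.current_rate
    if drop >= COLD_DROP:
        return "cold"
    if drop >= WARM_DROP:
        return "warm"
    return "hot"


if __name__ == "__main__":
    for current in (500.0, 100.0, 5.0):
        print(current, choose_tier(BlobStats(initial_rate=1000.0, current_rate=current)))
```

A production system would also weigh migration cost and object size. The sketch shows only that the decision can be driven by the observed decay in access rate rather than by age alone.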
The argument that the long-term bulk storage layers will need their own technology is
encouraging, because (see below) there is not going to be enough of the flash media that are
taking over the performance layers for bulk storage.
But there is a caveat. Typical at-scale systems such as Facebook's do show infrequent
access to old data. This used to be true in libraries and archives. But the advent of data mining
and other "big data" applications means that increasingly scholars want not to access a few
specific items, but instead to ask statistical questions of an entire collection. The implications
of this change in access patterns for long-term storage architectures are discussed below.
How long is the medium term?
Iain Emsley's talk at PASIG2016 on planning the storage requirements of the 1 PB/day
Square Kilometer Array mentioned that the data were expected to be used for 50 years.
How hard a problem is planning with this long a horizon? Looking back 50 years can
provide a clue (Plate 1).
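To get a feel for the scale such planning involves, the sketch below works out a cumulative volume under two assumptions that are not part of the talk as reported here: that the 1 PB/day rate stays constant, and that every byte is retained for the full 50 years. The function name `cumulative_petabytes` is hypothetical.

```python
PB_PER_EB = 1000     # decimal units: 1 EB = 1000 PB
DAYS_PER_YEAR = 365  # leap days ignored for a rough estimate


def cumulative_petabytes(pb_per_day: float, years: int) -> float:
    """Total volume ingested at a constant daily rate, with nothing deleted."""
    return pb_per_day * DAYS_PER_YEAR * years


if __name__ == "__main__":
    total_pb = cumulative_petabytes(1, 50)
    print(f"{total_pb:,.0f} PB = {total_pb / PB_PER_EB:.2f} EB")
```

Under those assumptions the total is 18,250 PB, or roughly 18 EB.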
[Figure 1. Storage hierarchy. Computer memory hierarchy, from top to bottom: processor registers (very fast, very expensive); processor cache (very fast, very expensive); random access memory (fast, affordable); flash/USB memory (slower, cheap); hard drives (slow, very cheap); tape backup (very slow, affordable). Size and capacity grow from small at the top to very large at the bottom. Source: Wikipedia (XXXXe)]
