Comparing “parallel passages” in digital archives

Pages	271-289
Published date	02 September 2019
Date	02 September 2019
DOI	https://doi.org/10.1108/JD-10-2018-0175
Author	Martyn Harris, Mark Levene, Dell Zhang, Dan Levene
Subject Matter	Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet

Comparing “parallel passages” in digital archives

Martyn Harris, Mark Levene and Dell Zhang
Department of Computer Science and Information Systems, Birkbeck, University of London, London, UK, and

Dan Levene
Department of History, Southampton University, Southampton, UK

Abstract

Purpose – The purpose of this paper is to present a language-agnostic approach to facilitate the discovery of “parallel passages” stored in historic and cultural heritage digital archives.

Design/methodology/approach – The authors explore a novel and relatively simple approach, using a character-based statistical language model combined with a tailored version of the Basic Local Alignment Search Tool (BLAST) to extract exact and approximate string patterns shared between groups of documents.

Findings – The approach is applicable to a wide range of languages, and compensates for variability in the text of the documents arising from differences in dialect, authorship and language change over time, and from errors introduced during digitisation by inaccurate transcription and optical character recognition.

Research limitations/implications – A number of case studies demonstrate that the approach is practical and generalisable to a wide range of archives with documents in different languages, domains and of varying quality.

Practical implications – The approach described can be applied to any digital archive of modern and contemporary texts. This makes the approach applicable not only to digital archives recording historic texts, but also to those composed of more recent news articles, for example.

Social implications – The analysis of “parallel passages” enables researchers to quantify the presence and extent of text-reuse in a collection of documents, which can provide useful data on author style, text genres and cultural contexts.

Originality/value – The approach is novel and addresses a need by humanities researchers for tools that can identify similar documents and local similarities represented by shared text sequences in a potentially vast archive of documents. As far as the authors are aware, no existing tools provide the same level of tolerance to the language of the documents.

Keywords Digital libraries, Computer applications, Archives, Linguistics, Probabilistic analysis, Language and literature

Paper type Research paper

1. Introduction

The term “parallel passage” refers to identical or approximate text patterns of variable length, which could be regarded as semantically equivalent. “Parallel-passages” represent alternative surface representations that exhibit identical wording, such as those representing reported speech and direct quotations, or that show some small variation in grammatical structure or vocabulary choice as a result of paraphrasing. On the one hand, differences in vocabulary choice may be the result of synonymy, or of hyperonymy, where a general or higher-level concept has been selected (Madnani and Dorr, 2010). On the other hand, paraphrasing on the part of the author may provide evidence of text-reuse, or intertextuality (Fairclough, 1992), where the author has summarised the main concepts, or meaning, encoded by one or more texts that preceded it.
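
To make the notion of identical and approximate shared text patterns concrete, the following minimal sketch lists the character runs that two short passages have in common. It relies on Python's standard `difflib.SequenceMatcher` purely for illustration and is not the method developed in this paper; the function name `shared_segments` and the `min_length` threshold are hypothetical choices made for the example.

```python
from difflib import SequenceMatcher


def shared_segments(text_a, text_b, min_length=5):
    """Return the exact character runs of at least min_length
    that appear, in the same order, in both texts."""
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters in longer sequences.
    matcher = SequenceMatcher(None, text_a, text_b, autojunk=False)
    return [
        text_a[m.a:m.a + m.size]
        for m in matcher.get_matching_blocks()
        if m.size >= min_length
    ]


if __name__ == "__main__":
    # Two illustrative wordings of the same sentence (invented for this example).
    a = "And God said, Let there be light: and there was light."
    b = "God then said: let there be light, and there was light."
    for segment in shared_segments(a, b):
        print(repr(segment))
```

Identical wording shows up as a single long run, whereas a paraphrase or a transcription error breaks the passage into several shorter runs separated by small gaps; this is the kind of variation that an approximate comparison has to tolerate.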

Further differences between passages may arise due to a shift in authorship, dialect, the natural evolution of language over time (Buchler et al., 2010), and errors introduced by optical character recognition (OCR) during digitisation. The task of comparing equivalent or similar shared text patterns in text corpora stored in digital archives has become increasingly challenging and time consuming due to the current scale of digital text data, which makes comparing shared text patterns across multiple documents practically impossible to do manually. Identifying parallel-passages, such as those exemplified by paraphrases, also supports a range of natural language tasks, including text generation, information retrieval and extraction, and summarisation.

Journal of Documentation, Vol. 76 No. 1, 2020, pp. 271-289, 0022-0418, DOI 10.1108/JD-10-2018-0175
Received 29 October 2018; Revised 10 July 2019; Accepted 14 July 2019
The current issue and full text archive of this journal is available on Emerald Insight at: www.emeraldinsight.com/0022-0418.htm

This paper presents an overview of the text mining tools developed to compare parallel-passages. The tools were deployed in a system known as the Search And Mining Tools for Language Archives (Samtla), which was developed to support research on historic and cultural heritage collections of documents stored in digital archives. The paper is organised as follows. In Section 2, we review the related work. Section 3 describes the corpora used as test cases to explore the results generated by our proposed approach. In Section 4, we describe the model used as a basis for extracting and scoring the contents of documents according to their shared text patterns. In Section 5, we describe the approach used for identifying related documents according to our proposed model, where we measure the similarity of pairs of documents based on their character-level n-gram probability distributions. Section 6 presents an approach for visualising local similarities between the content of related documents in the form of variable length parallel-passages extracted from the document content. We briefly discuss the motivation behind the user interface in Section 7 and, in Section 8, some of the language and corpus dependent issues that the document comparison tool addresses, to demonstrate the flexibility of the approach across different domains, languages, authors and time periods. We conclude the paper in Section 9 with a summary of the work and directions for future research and development.
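
As a rough illustration of what comparing documents by their character-level n-gram probability distributions can look like, the sketch below estimates a trigram distribution for each text and compares two distributions with the Jensen-Shannon divergence. This is a minimal sketch under stated assumptions: the unsmoothed relative-frequency estimate, the choice of n = 3, the use of the Jensen-Shannon divergence, and the function names are all assumptions made for this example and are not taken from the description of the model in Sections 4 and 5.

```python
from collections import Counter
from math import log2


def char_ngram_distribution(text, n=3):
    """Relative frequency of each character n-gram in text (no smoothing)."""
    grams = [text[i:i + n] for i in range(len(text) - n + 1)]
    total = len(grams)
    return {gram: count / total for gram, count in Counter(grams).items()}


def jensen_shannon_divergence(p, q):
    """Base-2 Jensen-Shannon divergence between two distributions.

    0 means the distributions are identical; 1 means they share no n-grams.
    """
    divergence = 0.0
    for gram in set(p) | set(q):
        p_g, q_g = p.get(gram, 0.0), q.get(gram, 0.0)
        mean = 0.5 * (p_g + q_g)
        if p_g > 0:
            divergence += 0.5 * p_g * log2(p_g / mean)
        if q_g > 0:
            divergence += 0.5 * q_g * log2(q_g / mean)
    return divergence


if __name__ == "__main__":
    # Invented sample texts: two near-identical wordings and one unrelated text.
    doc_a = char_ngram_distribution("in the beginning God created the heaven and the earth")
    doc_b = char_ngram_distribution("in the begining God created the heavens and the earth")
    doc_c = char_ngram_distribution("stock markets closed lower on Friday")
    print(jensen_shannon_divergence(doc_a, doc_b))
    print(jensen_shannon_divergence(doc_a, doc_c))
```

Because the comparison works on characters rather than words, it needs no tokeniser or language-specific resources, and a single misspelt or misrecognised character only disturbs the few n-grams that overlap it.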

2. Related work

Books, web pages, articles and reports are all examples of unstructured text data, where relevant information can exist anywhere within the document. Unstructured text data is often managed and retrieved via a search engine (Levene, 2010). Search engines provide the means to retrieve information but not to analyse it. This is where text mining techniques are useful, as they provide different views of the data to facilitate the discovery and subsequent analysis of textual patterns (Aggarwal and Zhai, 2012). These patterns can then be examined more closely through traditional research techniques such as close reading of the text, but this is generally only possible for small-scale digital archives.

One text analysis problem that is of great interest to researchers, particularly those analysing the content of digital archives, is to find parallel-passages – text segments describing the same concept (entity or event, etc.) over large corpora. Parallel-passages are semantically similar and could exhibit identical wording, but quite often they exhibit some small variation in structure or vocabulary choice. The differences are due to the normal rephrasing of the text within the same context, but may also arise from the use of reported speech, a change in authorship, dialect, the natural evolution of language over time and errors introduced by OCR. In this paper, the concept of parallel passage is defined in the general sense. Roughly speaking, one can regard parallel-passages as a variable length structural text pattern. However, the term parallel passage probably originated from Christian theology, where the comparison of parallel-passages, or hermeneutics, in the context of the Bible is a major area of Biblical scholarship (see Strauss and Eliot, 1860). The aim of hermeneutics is to render a translation of a text by comparing examples of the same word, phrase, or context across several texts. The researchers’ task is to identify whether the differences present between one or more texts that are regarded as similar are significant or relevant to the research hypothesis, whether the focus of research is on the stylistic differences between authors or on providing evidence of the evolution of textual sequences over time. The technique often involves comparing corresponding passages located across more than one text, by laying out the texts side-by-side. The Bible often describes the same event from different perspectives across different canonical books, which can yield a more complete picture of the event than a single passage, or point of view on

