Identification of social scientifically relevant topics in an interview repository: a natural language processing experiment
| Date | 13 October 2023 |
| Pages | 354-377 |
| DOI | https://doi.org/10.1108/JD-12-2022-0269 |
| Published date | 13 October 2023 |
| Subject Matter | Library & information science, Records management & preservation, Document management, Classification & cataloguing, Information behaviour & retrieval, Collection building & management, Scholarly communications/publishing, Information & knowledge management, Information management & governance, Information management, Information & communications technology, Internet |
| Author | Judit Gárdos, Julia Egyed-Gergely, Anna Horváth, Balázs Pataki, Roza Vajda, András Micsik |
Judit Gárdos, Julia Egyed-Gergely and Anna Horváth
Centre for Social Sciences, Budapest, Hungary
Balázs Pataki
Department of Distributed Systems, Institute for Computer Science and Control,
Budapest, Hungary
Roza Vajda
Centre for Social Sciences, Budapest, Hungary, and
András Micsik
Department of Distributed Systems, Institute for Computer Science and Control,
Budapest, Hungary
Abstract
Purpose – The present study is about generating metadata to enhance thematic transparency and facilitate
research on interview collections at the Research Documentation Centre, Centre for Social Sciences (TK
KDK) in Budapest. It explores the use of artificial intelligence (AI) in producing, managing and processing
social science data and its potential to generate useful metadata to describe the contents of such archives on a
large scale.
Design/methodology/approach – The authors combined manual and automated/semi-automated methods
of metadata development and curation. The authors developed a suitable domain-oriented taxonomy to
classify a large text corpus of semi-structured interviews. To this end, the authors adapted the European
Language Social Science Thesaurus (ELSST) to produce a concise, hierarchical structure of topics relevant in
social sciences. The authors identified and tested the most promising natural language processing (NLP) tools
supporting the Hungarian language. The results of manual and machine coding will be presented in a user
interface.
Findings – The study describes how an international social scientific taxonomy can be adapted to a specific
local setting and tailored to be used by automated NLP tools. The authors show the potential and limitations of
existing and new NLP methods for thematic assignment. The current possibilities of multi-label classification
in social scientific metadata assignment are discussed, i.e. the problem of automated selection of relevant labels
from a large pool.
Originality/value – Interview materials have not yet been used for building manually annotated training
datasets for automated indexing of scientifically relevant topics in a data repository. Comparing various
automated-indexing methods, this study shows a possible implementation of a researcher tool supporting
custom visualizations and the faceted search of interview collections.
Keywords Sociology, Research data repository, Natural language processing (NLP), Thesaurus,
Multi-label classification, Exploratory UI, Text visualization
Paper type Article
The project presented in this publication, implemented by the Research Documentation Centre of the Centre
for Social Sciences (TK KDK) and the Department of Distributed Systems of the Institute for Computer Science
and Control (SZTAKI DSD), was supported by the European Union project RRF-2.3.1-21-2022-00004 within
the framework of the Artificial Intelligence National Laboratory. The authors are thankful to Veronika Lipp,
Dániel Martin, Attila Marx, Márton Matyasovszky-Németh, Enikő Meiszterics, Mária Neményi, Tamás P.
Tóth, Bálint Sass and Melinda Siket for their valuable contributions to this project.
The current issue and full text archive of this journal is available on Emerald Insight at:
https://www.emerald.com/insight/0022-0418.htm
Received 30 December 2022
Revised 11 September 2023
Accepted 16 September 2023
Journal of Documentation
Vol. 80 No. 2, 2024
pp. 354-377
© Emerald Publishing Limited
0022-0418
DOI 10.1108/JD-12-2022-0269
Introduction
In recent years, the various dimensions of the emerging open science framework have had a
lasting impact on discussions about archival practices. Most notably, from the point of view
of creating suitable and sufficient information for accessing archival documents, the
Findable, Accessible, Interoperable and Reusable (FAIR) guiding principles (Wilkinson et al.,
2016) of scientific data management and stewardship have been gaining considerable
ground. Metadata, tailored to specific needs, plays a vital role in making the FAIR principles a
reality in oral history archives as well as other digital data environments. The “digital turn”
has transformed many aspects of scientific research too, including our ways of engagement
with its processes and results.
Digitalization has ushered in a new era not only in conducting sociological and
historical research, but also in conserving such research and making it available to
interested audiences. Thus the discipline of digital curation has emerged. Data curators
and researchers, possessing accumulated knowledge in their specific field and, ideally,
also in other sciences, play a vital role in providing access to research data (Château et al.,
2012). Digital data curation and research focuses on the interplay between the
professional, academic and technical dimensions of the research process in order to
ensure that information created digitally remains accessible and useable (Higgins, 2018).
In this article, we explore how the thematic analysis and sharing of qualitative research
data can be achieved in a multi-language setting. The European social scientific research
field is highly interconnected through numerous international funding schemes, yet
at the same time fragmented by language barriers. Most
prominently, non-English qualitative research data, such as interviews, cannot be used in
the way the FAIR principles of data sharing would suggest. The FAIR principles describe and
prescribe technical characteristics of research archives. However, how to actually
share data in smaller languages is still an unresolved issue, although it has been tackled
recently in some projects concerning data in other small languages (such as Finnish; Skenderi
et al., 2021).
Our archivists at the Research Documentation Centre (Kutatási Dokumentációs
Központ, KDK [1]) at the Centre for Social Sciences in Budapest, Hungary, have long been
working towards more and better metadata for the collections to serve researchers and
connect with other similar enterprises worldwide. KDK is a social scientific data repository
collecting, organizing and making digitally available the documents (interviews,
questionnaires, survey data, etc.) generated during research conducted in the four
institutes of the Centre for Social Sciences, the largest social scientific research facility of
the country. Its digital holdings cover various disciplines, including political science,
sociology, minority studies and law. KDK also hosts the Voices of the 20th Century
Archive and Research Group [2], which collects, digitizes and curates the dispersed
materials of qualitative sociological research in Hungary between 1960 and 2010 (mainly
interview transcripts but also photographs, drawings, video interviews, notes, study
drafts, etc.). The two online repositories together make tens of thousands of digital files
available free of charge for researchers and other interested audiences. In addition to
providing repository services, KDK also conducts research projects, participates in policy
making and actively promotes a culture of data management.
An opportunity for testing new methods to explore what data sharing could look like for
qualitative social scientific research data appeared with a nationwide project, funded by
the European Union, encouraging the use of artificial intelligence (AI) in scientific research
[3]. We began in 2020 by laying the groundwork for a project that would, in the coming
years, grow to be a comprehensive effort to add content-related metadata to the documents and
collections in our archives. The staff has worked in cooperation with the computer
scientists at the Institute for Computer Science and Control (Számítástechnikai és