How do the kids speak? Improving educational use of text mining with child-directed language models
Peter Organisciak
Department of Research Methods and Information Science, University of Denver,
Denver, Colorado, USA
Michele Newman
Information School, University of Washington, Seattle, Washington, USA
David Eby
School of Information Sciences, University of Illinois at Urbana-Champaign,
Champaign, Illinois, USA
Selcuk Acar
Department of Educational Psychology, University of North Texas,
Denton, Texas, USA, and
Denis Dumas
Department of Educational Psychology, University of Georgia,
Athens, Georgia, USA
Abstract
Purpose: Most educational assessments tend to be constructed in a close-ended format, which is easier to score consistently and more affordable. However, recent work has leveraged computational text methods from the information sciences to make open-ended measurement more effective and reliable for older students. The purpose of this study is to determine whether models used by computational text mining applications need to be adapted when used with samples of elementary-aged children.
Design/methodology/approach: This study introduces domain-adapted semantic models for child-specific text analysis, to allow better elementary-aged educational assessment. A corpus compiled from a multimodal mix of spoken and written child-directed sources is presented, used to train a children's language model and evaluated against standard non-age-specific semantic models.
Findings: Child-oriented language is found to differ in vocabulary and word sense use from general English, while exhibiting lower gender and race biases. The model is evaluated in an educational application of divergent thinking measurement and shown to improve on generalized English models.
Research limitations/implications: The findings demonstrate the need for age-specific language models in the growing domain of automated divergent thinking and strongly encourage the same for other educational uses of computational text analysis by showing a measurable difference in the language of children.
Social implications: Understanding children's language more representatively in automated educational assessment allows for fairer and more equitable testing. Furthermore, child-specific language models have fewer gender and race biases.
Originality/value: Research in computational measurement of open-ended responses has thus far used models of language trained on general English sources or domain-specific sources such as textbooks. To the best of the authors' knowledge, this paper is the first to study age-specific language models for educational assessment. In addition, while there have been several targeted, high-quality corpora of child-created or child-directed speech, the corpus presented here is the first developed with the breadth and scale required for large-scale text modeling.
Keywords: Educational data mining, Text mining, Learning, Assessment, Language modeling, Divergent thinking
Paper type: Research paper

Acknowledgements: The authors thank Kelly Berthiaume, Maggie Ryan and the full MOTES team for additional contributions and advice.
Funding: This study was funded by the Institute of Education Sciences (IES) (Grant No. R305A200519).
Research data: The MOTES Corpus model, as well as code for reproducing data collection and modeling, is available at https://osf.io/pwvda.

Received 22 June 2022; revised 30 September 2022 and 23 November 2022; accepted 30 November 2022; published 19 January 2023.
Information and Learning Sciences, Vol. 124 No. 1/2, 2023, pp. 25-47. © Emerald Publishing Limited. ISSN 2398-5348. DOI 10.1108/ILS-06-2022-0082.
1. Introduction
Recent advances in natural language processing are enabling a key capability in education: the ability to parse open-ended measurement responses reliably and consistently. Doing so greatly expands the ability to measure knowledge and abilities that have traditionally not been well suited to close-ended testing, such as measures of originality and divergent thinking, which have been costly and uneven in the past (Acar et al., 2021; Dumas and Dunbar, 2014; Dumas et al., 2020). However, realizing the possibility of computational methods for improving assessment, particularly in contexts that involve children, requires tools that meet the needs of educational domains while fulfilling expectations of transparency and interpretability.
In this work, a corpus of child-directed language is developed and modeled to better understand how the linguistic profile of children's language differs from general English. The MOTES Corpus, emergent from the Measurement of Original Thinking in Elementary Students project, differs from past children's corpora by focusing on scale, which allows it to be used in modern computational text mining applications. This scale is achieved by focusing on child-directed resources, which are more comprehensively available than child-spoken or child-produced corpora. The child-directed model is compared to a general-language model in three ways: through an empirical linguistic analysis, where notable differences in language use are observed; through a bias analysis, where the children's corpus is found to lead to lower race and gender stereotyping; and in an applied context, on tests of children's divergent thinking from the MOTES project, where child-focused text models are found to outperform traditional models.
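
To make the applied comparison concrete: automated scoring of divergent thinking responses in this line of work typically rates a response's originality by its semantic distance from the prompt in a vector-space language model. The sketch below illustrates that general idea only; it is a minimal example assuming the open-source gensim library and one of its openly distributed general-English embedding models, and the function and model names are illustrative rather than the project's actual scoring code.

    # Illustrative sketch of originality scoring by semantic distance,
    # assuming gensim word vectors; not the MOTES project's actual pipeline.
    import numpy as np
    import gensim.downloader as api

    model = api.load("glove-wiki-gigaword-100")  # placeholder general-English model

    def semantic_distance(prompt, response):
        """Cosine distance between the prompt vector and the mean response vector."""
        if prompt not in model:
            return float("nan")  # prompt word not in the model's vocabulary
        words = [w for w in response.lower().split() if w in model]
        if not words:
            return float("nan")  # no scorable words in the response
        response_vec = np.mean([model[w] for w in words], axis=0)
        prompt_vec = model[prompt]
        cos = np.dot(prompt_vec, response_vec) / (
            np.linalg.norm(prompt_vec) * np.linalg.norm(response_vec))
        return 1.0 - cos  # larger distance = more original response

    # A less typical use of "brick" should score as more original.
    print(semantic_distance("brick", "build a wall"))
    print(semantic_distance("brick", "grind it into paint pigment"))

In a child-focused variant of this setup, the same scoring function would simply load vectors trained on a child-directed corpus in place of the general-English model.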
Applications of natural language processing often rely on models of relationships in general English. Entirely different words may mean very similar or identical things, and the relatedness of words needs to be represented for systems to understand that people talk or write about similar concepts in varying ways. In other words, models of relationships between words help applications focus on latent meanings rather than the specific words used in conveying those meanings. The best models are learned by observing large amounts of text, so it has become commonplace for scholars and practitioners to use pretrained, openly distributed models in a process called transfer learning (Zhuang et al., 2020). Doing so aids reproducibility while avoiding the complex task of corpus building and model training.
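
As a concrete illustration of this transfer-learning pattern, the short sketch below loads a pretrained, openly distributed general-English word embedding model and queries it for word relatedness. It is a minimal example assuming the gensim library; the model named is one of gensim's stock downloads, not the child-directed model developed in this study.

    # Minimal sketch of reusing a pretrained semantic model (transfer learning),
    # assuming the gensim library. The named model is one of gensim's openly
    # distributed general-English models, not the MOTES Corpus model.
    import gensim.downloader as api

    # Downloads on first use, then loads pretrained general-English word vectors.
    vectors = api.load("glove-wiki-gigaword-100")

    # Different surface forms conveying similar concepts score as related.
    print(vectors.similarity("kid", "child"))      # relatively high
    print(vectors.similarity("kid", "algorithm"))  # relatively low

    # Nearest neighbors in the embedding space approximate latent meaning.
    print(vectors.most_similar("school", topn=5))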
In computational approaches to education and learning, there is reason to expect that out-of-the-box models are insufficient, particularly when working with younger children.