How do the kids speak? Improving educational use of text mining with child-directed language models
Peter Organisciak
Department of Research Methods and Information Science, University of Denver,
Denver, Colorado, USA
Michele Newman
Information School, University of Washington, Seattle, Washington, USA
David Eby
School of Information Sciences, University of Illinois at Urbana-Champaign,
Champaign, Illinois, USA
Selcuk Acar
Department of Educational Psychology, University of North Texas,
Denton, Texas, USA, and
Denis Dumas
Department of Educational Psychology, University of Georgia,
Athens, Georgia, USA
Abstract
Purpose – Most educational assessments tend to be constructed in a close-ended format, which is easier to score consistently and more affordable. However, recent work has leveraged computational text methods from the information sciences to make open-ended measurement more effective and reliable for older students. The purpose of this study is to determine whether the models used by computational text mining applications need to be adapted when used with samples of elementary-aged children.
Design/methodology/approach – This study introduces domain-adapted semantic models for child-specific text analysis, to enable better educational assessment of elementary-aged students. A corpus compiled from a multimodal mix of spoken and written child-directed sources is presented, used to train a children's language model and evaluated against standard non-age-specific semantic models.
Findings – Child-oriented language is found to differ in vocabulary and word sense use from general English, while exhibiting lower gender and race biases. The model is evaluated in an educational application of divergent thinking measurement and shown to improve on generalized English models.
The authors thank Kelly Berthiaume, Maggie Ryan and the full MOTES team for additional
contributions and advice.
Funding: This study was funded by the Institute of Education Sciences (IES) (Grant No.
R305A200519).
Research data: The MOTES Corpus model as well as code for reproducing data collection and
modeling is available at https://osf.io/pwvda.
Received 22 June 2022; Revised 30 September 2022 and 23 November 2022; Accepted 30 November 2022; Published 19 January 2023
Information and Learning Sciences, Vol. 124 No. 1/2, 2023, pp. 25-47
© Emerald Publishing Limited, ISSN 2398-5348
DOI 10.1108/ILS-06-2022-0082
Research limitations/implications – The findings demonstrate the need for age-specific language models in the growing domain of automated divergent thinking measurement, and strongly encourage the same for other educational uses of computational text analysis by showing a measurable difference in the language of children.
Social implications – Understanding children's language more representatively in automated educational assessment allows for fairer and more equitable testing. Furthermore, child-specific language models have fewer gender and race biases.
Originality/value – Research in computational measurement of open-ended responses has thus far used models of language trained on general English sources or domain-specific sources such as textbooks. To the best of the authors' knowledge, this paper is the first to study age-specific language models for educational assessment. In addition, while there have been several targeted, high-quality corpora of child-created or child-directed speech, the corpus presented here is the first developed with the breadth and scale required for large-scale text modeling.
Keywords Educational data mining, Text mining, Learning, Assessment, Language modeling,
Divergent thinking
Paper type Research paper
1. Introduction
Recent advancements in natural language processing are enabling a key capability in education: the ability to parse open-ended measurement responses reliably and consistently. Doing so greatly expands the ability to measure knowledge and abilities that have traditionally not been well suited to close-ended testing, such as measures of originality and divergent thinking, which have been costly and uneven to score in the past (Acar et al., 2021; Dumas and Dunbar, 2014; Dumas et al., 2020). However, realizing the potential of computational methods for improving assessment, particularly in contexts that involve children, requires tools that meet the needs of educational domains while fulfilling expectations of transparency and interpretability.
In this work, a corpus of child-directed language is developed and modeled to better understand how the linguistic profile of children's language differs from general English. The MOTES Corpus, emergent from the Measurement of Original Thinking in Elementary Students project, differs from past children's corpora by focusing on scale, which allows it to be used in modern computational text mining applications. This scale is achieved by focusing on child-directed resources, which are more comprehensively available than child-spoken or child-produced corpora. The child-directed model is compared to a general-language model in three ways: through an empirical linguistic analysis, where notable differences in language use are observed; through a bias analysis, where the children's corpus is found to lead to lower race and gender stereotyping; and in an applied context, on tests of children's divergent thinking from the MOTES project, where child-focused text models are found to outperform traditional models.
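The applied comparison builds on the semantic-distance paradigm common to the automated divergent-thinking literature (Dumas and Dunbar, 2014; Acar et al., 2021), in which a response is scored as more original the farther it sits from its prompt in a semantic model's vector space. The sketch below illustrates that general idea only; it is not the MOTES pipeline, and it assumes the gensim library with an off-the-shelf GloVe model (glove-wiki-gigaword-100) standing in for whichever semantic model is being evaluated.

    import numpy as np
    import gensim.downloader as api

    # Any word-embedding model can stand in here; the study's point is that
    # the choice of model (general English vs child-directed) affects scores.
    model = api.load("glove-wiki-gigaword-100")  # pretrained general-English GloVe

    def embed(text):
        """Average the word vectors of all in-vocabulary tokens."""
        vecs = [model[w] for w in text.lower().split() if w in model]
        return np.mean(vecs, axis=0) if vecs else None

    def originality(prompt, response):
        """Semantic distance: 1 minus cosine similarity of prompt and response."""
        p, r = embed(prompt), embed(response)
        if p is None or r is None:
            return None  # no overlap with the model's vocabulary
        cos = np.dot(p, r) / (np.linalg.norm(p) * np.linalg.norm(r))
        return 1.0 - cos

    # A semantically distant (more original) response should score higher.
    print(originality("uses for a paper clip", "hold papers together"))
    print(originality("uses for a paper clip", "sculpt a tiny antenna for a robot"))

Under any such scheme, a model whose vector space poorly reflects children's vocabulary will misjudge distances for child responses, which motivates the comparison reported here.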
Applications of natural language processing often rely on models of relationships in general English. Entirely different words may mean very similar or identical things, and the relatedness of words needs to be represented for systems to understand that people talk or write about similar concepts in varying ways. In other words, models of relationships between words help applications focus on latent meanings rather than the specific words used in conveying those meanings. The best models are learned by observing large amounts of text, so it has become commonplace for scholars and practitioners to use pretrained, openly distributed models in a process called transfer learning (Zhuang et al., 2020). Doing so aids reproducibility while avoiding the complex task of corpus building and model training.
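As a concrete illustration of this transfer-learning workflow, the few lines below download an openly distributed pretrained model and query it for word relatedness. The gensim library and the glove-wiki-gigaword-300 vectors are illustrative assumptions, not the specific tooling of this study.

    import gensim.downloader as api

    # Transfer learning in its simplest form: reuse vectors pretrained on a
    # large general-English corpus instead of building a corpus and training.
    vectors = api.load("glove-wiki-gigaword-300")

    # The model encodes relatedness: different surface words, similar meaning.
    print(vectors.similarity("couch", "sofa"))     # high: near-synonyms
    print(vectors.similarity("couch", "algebra"))  # low: unrelated concepts
    print(vectors.most_similar("couch", topn=3))   # nearest neighbours in the space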
In computational approaches to education and learning, there is reason to expect that
out-of-the-box models are insufficient, particularly when working with younger children.