Elsevier

Infant Behavior and Development

Volume 46, February 2017, Pages 178-193

Full length article
Acquisition of vowel articulation in childhood investigated by acoustic-to-articulatory inversion

https://doi.org/10.1016/j.infbeh.2017.01.007

Highlights

  • We estimated developmental changes in articulatory states during vowel production.

  • We applied an acoustic-to-articulatory inversion technique to recorded sounds.

  • The jaw and tongue apex contributed to production of all vowels by young children.

  • Development appears to proceed through gradual functionalization of the articulators.

  • Initial states were differentiated and refined to adjust to native language.

Abstract

While the acoustical features of speech sounds in children have been extensively studied, limited information is available about their articulation during speech production. Instead of directly measuring articulatory movements, this study used an acoustic-to-articulatory inversion model with scalable vocal tract size to estimate developmental changes in articulatory state during vowel production. The formant frequencies of each vowel produced by three Japanese children over time at ages between 6 and 60 months were transformed into articulatory parameters using a pseudo-inverse Jacobian matrix of a model mapping seven articulatory parameters to acoustic ones. We conducted a discriminant analysis to reveal differences in articulatory states for the production of each vowel. The analysis suggested that the development of vowel production proceeds through gradual functionalization of articulatory parameters. At 6–9 months, coordination of tongue body position and lip aperture forms three vowels: front, back, and central. At 10–17 months, recruitment of the jaw and tongue apex enables differentiation of these three vowels into five. At 18 months and older, recruitment of tongue shape produces more distinct vowels specific to Japanese. These results suggest that the jaw and tongue apex contribute to speech production by young children regardless of the kind of vowel. Moreover, initial articulatory states for each vowel could be distinguished by the manner of coordination between lip and tongue, and these initial states are differentiated and refined into articulations adjusted to the native language over the course of development.

Introduction

Speech sounds are generated by complex motor coordination among the articulatory organs. The developmental process of speech production has previously been depicted mainly on the basis of evidence derived from acoustical phenomena and their consequences, such as spectral envelope, fundamental frequencies (Amano, Nakatani, & Kondo, 2006; Ishizuka, Mugitani, Kato, & Amano, 2007; Kent & Murray, 1982; Vorperian & Kent, 2007), and phonetic transcriptions (Ingram, 1974; MacNeilage, 2000; MacNeilage & Davis, 2000; Oller, 2000; Stoel-Gammon & Cooper, 1984). However, the development of the articulatory system by which these acoustics are produced remains an open question because of limitations on the measurement of the articulatory system, especially of tongue movements. In the present paper, we investigated longitudinal changes in children's articulation by estimating the parameters of an articulatory model on the basis of the acoustical features of speech sounds.

The development of speech production during the first year of life has been characterized as following a particular course (Kuhl, 2004; Oller, 2000; Stoel-Gammon & Cooper, 1984). Infants are born able to produce spontaneous sounds, such as sneezing and crying. Infants then produce cooing, that is, quasivocalic sounds similar to vowels. Subsequently, coos expand into clear vowel sounds characterized by full resonance and wide variety. At an early stage of babbling, a large portion of sounds produced by infants can be heard as repetitions of the same consonant–vowel (CV) units such as “papapa” and “mamama.” After that stage, infants combine different consonant- and vowel-like sounds to produce variegated sequences. Finally, beginning around the end of the first year of life, infants produce meaningful speech.

Acoustical studies show that as children grow up, their vowel clusters become more distinct, and the fundamental frequency and spectral peaks (formant frequencies) of their utterances become lower (Amano et al., 2006; Ishizuka et al., 2007; Kent & Murray, 1982; Vorperian & Kent, 2007). Moreover, analyses of phonetic transcriptions show a modification process at work in infants' vocalizations (MacNeilage, 2000; MacNeilage & Davis, 2000). At the babbling stage, infants prefer to repeat three predominant CV sequences, that is, labial–central, coronal–front, and dorsal–back CV patterns. With development, children begin to chain variegated CVs, with a fronting tendency in which the first consonant in words has a more anterior place of articulation than the second one (Ingram, 1974). These phenomena are observed crosslinguistically (Amano et al., 2006; Ishizuka et al., 2007; Kent & Murray, 1982; MacNeilage, 2000; MacNeilage & Davis, 2000; Vorperian & Kent, 2007).

These changes are likely to be caused mainly by the development of vocal tract anatomy, respiration, and motor control of the articulators. In order to investigate the anatomical structure of the articulatory system and its dynamics during speech production, previous studies have adopted a variety of methods, such as radiographic imaging (Chiba & Kajiyama, 1942; Fant, 1960; Kiritani, 1986), electromagnetic articulography and electropalatography (Byrd & Tan, 1996; Hixon, 1971), magnetic resonance imaging (Fitch & Giedd, 1999; Masaki et al., 1999; Vorperian, Kent, Gentry, & Yandell, 1999; Vorperian et al., 2005), ultrasound (Geddes, Kent, Mitoulas, & Hartmann, 2008; Zharkova, Hewlett, & Hardcastle, 2011), and motion-capture systems (Green, Moore, Higashikawa, & Steeve, 2000; Green, Moore, & Reilly, 2002; Goffman & Smith, 1999; Nip, Green, & Marx, 2009). With regard to anatomy, previous studies reveal that children's vocal tracts, especially during the first year of life, are shaped differently from those of adults (Fitch & Giedd, 1999; Goldstein, 1980; Sasaki, Levine, Laitman, & Crelin, 1977; Vorperian et al., 1999; Vorperian et al., 2005). Infants' vocal tracts are not only smaller than adults', but also have a relatively larger oral cavity than pharyngeal cavity, a flat tongue, and a more gradually sloping pharyngeal tract. These properties of the infant vocal tract should raise formant frequencies and lead to less clear vowel clusters. In addition, the limited range of tongue movement prevents complex consonantal articulations. While these anatomical changes in the vocal tract are certainly responsible for the changes in the filter properties of speech sounds, phonation, by contrast, is affected mostly by the development of respiration (Boliek, Hixon, Watson, & Morgan, 1996; Reilly & Moore, 2009). For instance, a decrease in the compliance of the chest wall results in more rapid modulation of respiratory muscle movements.

As for the development of motor control of the articulators, transcription analysis suggests that infants have relatively independent control over their jaw and that the ability to carry out tongue movements depends largely on jaw control (MacNeilage, 2000; MacNeilage & Davis, 2000). On the basis of these findings, it has been argued that mandibular oscillations have a crucial role in the early development of articulation. Studies using motion capture partly support this idea by reporting that jaw movements mature earlier than lip movements (Green et al., 2002; Nip et al., 2009). Another study, using electromagnetic articulography and acoustical analysis, reports that fronting tendencies that are predominant in both adults and children are caused by coordination among articulators (Rochet-Capellan & Schwartz, 2007).

Thus, as described above, acoustical analysis and empirical measurement of the articulatory system reveal much about the development of speech production. Given that vowel production accounts for a large portion of speech by young children, tongue movements are likely to play a crucial role in the development of speech production. However, many aspects of the development of articulation, especially tongue movements during speech production until the second year of life, remain open questions. This is because of limitations on the empirical measurement of articulatory movements in young children.

Another approach to investigating articulatory movements is to estimate articulatory states from acoustical features; this is called acoustic-to-articulatory inversion (Atal, Chang, Mathews, & Tukey, 1978; Hiroya & Honda, 2004; Ménard, Schwartz, & Boë, 2004; Ouni & Laprie, 2005; Shirai, 1993; Toda, Black, & Tokuda, 2008; Uchida, Saito, Minematsu, & Hirose, 2015; Uria, Renals, & Richmond, 2011; Wakita, 1973). This technique relies on a mapping function from acoustical to articulatory space. Previous studies have proposed several such mapping functions (Atal et al., 1978; Hiroya & Honda, 2004; Ouni & Laprie, 2005; Shirai, 1993; Wakita, 1973) and, on their basis, articulatory models (Maeda, 1990; Mermelstein, 1973; Story, 2009). When it comes to applying this technique to sounds produced by infants, however, some problems arise. First, because of anatomical differences between infants' vocal tracts and those of adults, the articulatory model used must be scalable to the child's vocal tract size. Second, we cannot calculate a mapping function from acoustical to articulatory features directly, since acoustical features cannot be paired with empirically obtained articulatory features in this case. Third, although the model should approximate the vocal tract shape, it is desirable for it to have a small number of parameters.

Taking into consideration the need for scalability of the vocal tract and for parameters to specify articulatory states, we adopted Maeda's model (Maeda, 1990; Ménard et al., 2004; Serkhane, Schwartz, Boë, Davis, & Matyear, 2007). This model was proposed to approximate midsagittal slices of the vocal tract during adults' vowel production (Maeda, 1990). Subsequent studies (Ménard et al., 2004; Serkhane et al., 2007) propose two scaling factors to incorporate growth data (Goldstein, 1980) into the model and apply it to non-adult-sized vocal tracts. A previous study (Serkhane et al., 2007) compares simulated formant frequencies with actual ones produced by infants at 4 and 7 months of age and argues that the jaw plays only a minor role before the babbling stage but a major role at the onset of rhythmic syllable-like output in canonical babbling.

We hypothesized that initial articulatory states for vowels in babbling periods would not be well clustered, and that the states would later be differentiated and refined into clusters adjusted to the native language. To test this hypothesis without empirical measurement of children's articulatory movements, we estimated articulatory states using an acoustic-to-articulatory inversion technique based on a scalable version of Maeda's model with seven articulatory parameters. Note that, because the mapping from acoustical to articulatory space is one-to-many (different articulatory configurations can yield the same acoustics), the precise estimation of articulatory parameters from acoustical ones is an ill-posed problem. However, within the assumed articulatory model, it is possible to reveal a plausible range of articulatory states underlying the acoustical distribution of young children’s sounds. For materials, we used the vowel-like sounds of Japanese, which consist of high-front /i/, mid-front /e/, low-center /a/, high-back /u/ and mid-back /o/, produced by three children over time from 6 to 60 months of age. In particular, we analyzed longitudinal changes in combinations of multiple articulatory organs to show how flexible coordination among them develops.
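
As an illustration of the kind of inversion named in the abstract (a pseudo-inverse Jacobian mapping formant frequencies back to seven articulatory parameters), the following is a minimal sketch in Python. It is not the authors' implementation and does not use Maeda's model: the forward map `forward`, its weights `W` and `BASE`, the finite-difference Jacobian, the iteration scheme, and the tolerances are all hypothetical stand-ins chosen only so that the example runs on its own.

```python
import math

N_PARAMS = 7    # the source model has seven articulatory parameters
N_FORMANTS = 3  # F1-F3

# Hypothetical toy weights (Hz); an assumption for illustration, NOT Maeda's model.
BASE = [500.0, 1500.0, 2500.0]
W = [
    [120.0, -60.0, 40.0, 30.0, -20.0, 10.0, 15.0],
    [-80.0, 300.0, -150.0, 60.0, 90.0, -40.0, 20.0],
    [30.0, -50.0, 200.0, -70.0, 40.0, 110.0, -60.0],
]

def forward(p):
    """Toy articulatory-to-acoustic map: 7 parameters -> 3 formants (Hz)."""
    return [BASE[i] + sum(W[i][j] * math.tanh(p[j]) for j in range(N_PARAMS))
            for i in range(N_FORMANTS)]

def jacobian(p, eps=1e-5):
    """Central-difference Jacobian J[i][j] = d(formant i) / d(parameter j)."""
    J = [[0.0] * N_PARAMS for _ in range(N_FORMANTS)]
    for j in range(N_PARAMS):
        hi, lo = list(p), list(p)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = forward(hi), forward(lo)
        for i in range(N_FORMANTS):
            J[i][j] = (f_hi[i] - f_lo[i]) / (2 * eps)
    return J

def solve(A, b):
    """Solve a small square system A y = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            k = M[r][c] / M[c][c]
            for q in range(c, n + 1):
                M[r][q] -= k * M[c][q]
    y = [0.0] * n
    for r in reversed(range(n)):
        y[r] = (M[r][n] - sum(M[r][q] * y[q] for q in range(r + 1, n))) / M[r][r]
    return y

def pinv_step(J, df):
    """Minimum-norm update J^+ df, with J^+ = J^T (J J^T)^-1 (J assumed full row rank)."""
    JJt = [[sum(J[a][k] * J[b][k] for k in range(N_PARAMS))
            for b in range(N_FORMANTS)] for a in range(N_FORMANTS)]
    y = solve(JJt, df)
    return [sum(J[i][j] * y[i] for i in range(N_FORMANTS)) for j in range(N_PARAMS)]

def invert(target, p0=None, tol=1e-6, max_iter=50):
    """Iterate p <- p + J^+ (target - forward(p)) until the formant error is below tol."""
    p = list(p0) if p0 else [0.0] * N_PARAMS
    for _ in range(max_iter):
        df = [t - f for t, f in zip(target, forward(p))]
        if max(abs(d) for d in df) < tol:
            break
        p = [a + b for a, b in zip(p, pinv_step(jacobian(p), df))]
    return p

# Hypothetical "true" articulation used only to generate target formants.
target = forward([0.5, -0.3, 0.2, 0.0, 0.4, -0.1, 0.3])
estimate = invert(target)
residual = [t - f for t, f in zip(target, forward(estimate))]
print([round(x, 3) for x in estimate])
print(max(abs(r) for r in residual))
```

Because three formants constrain seven parameters only partially, the recovered `estimate` reproduces the target formants but need not coincide with the parameters that generated them; the pseudo-inverse simply selects the smallest update at each step. This is one concrete face of the ill-posedness noted above. With a numerical library, the explicit `pinv_step` could be replaced by a library pseudo-inverse routine.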

Section snippets

Materials

We used the NTT Japanese infant speech database (Amano et al., 2006; Amano et al., 2009; Ishizuka et al., 2007) for this study. This database contains the utterances of five normally developing children and their parents, recorded with 16-bit quantization at a sampling rate of 16 kHz. The database also provides time series of fundamental frequencies (F0), phoneme labels, and property tags. In order to attach phoneme labels, two well-trained transcribers segmented and labeled the speech data in

Validation of inversely estimated articulatory parameters

The averages and standard deviations of the differences between the formant frequencies extracted from vowel sounds and those regenerated by forward transformation of the area function inversely estimated from those formant frequencies were as follows (mean ± 1 S.D.): F1: 8.4 ± 20.2 Hz, F2: 9.4 ± 15.1 Hz, and F3: 15.0 ± 30.6 Hz.
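
A round-trip check of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function name `round_trip_error` and the formant values are hypothetical, and the use of absolute differences with the sample standard deviation is an assumption, since the excerpt does not specify either.

```python
from statistics import mean, stdev

def round_trip_error(measured, resynthesized):
    """Per-formant (mean, sample S.D.) of absolute differences in Hz.

    `measured` and `resynthesized` are equal-length lists of (F1, F2, F3) tuples.
    """
    n_formants = len(measured[0])
    stats = []
    for k in range(n_formants):
        diffs = [abs(m[k] - r[k]) for m, r in zip(measured, resynthesized)]
        stats.append((mean(diffs), stdev(diffs)))
    return stats

# Hypothetical formant triples (Hz); not data from the study.
measured = [(800.0, 1200.0, 2900.0), (400.0, 2500.0, 3300.0), (600.0, 1800.0, 3000.0)]
resynthesized = [(810.0, 1190.0, 2920.0), (395.0, 2510.0, 3290.0), (600.0, 1805.0, 3030.0)]

stats = round_trip_error(measured, resynthesized)
for name, (m, s) in zip(("F1", "F2", "F3"), stats):
    print(f"{name}: {m:.1f} ± {s:.1f} Hz")
```

Small round-trip errors indicate only that the estimated articulation reproduces the observed acoustics within the model; they do not by themselves show that the estimated articulation matches the child's actual one.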

We also evaluated the inversion technique based on the area functions. Fig. 4 suggests that the area functions generating formant frequencies were similar to the inversely estimated ones from the formant

Discussion

In the present study, we have described developmental changes in articulatory state during vowel production on the basis of the acoustic-to-articulatory inversion technique.

As shown by the longitudinal changes in the mean values of the articulatory parameters, the distribution of the articulatory parameters was biased toward positive or negative values in early development and moved closer to zero with age. These biased distributions would disagree with the assumption of the previous study (

Conclusions

We described the application of an acoustic-to-articulatory inversion technique to identify and analyze the development of vowel articulation. First, we validated inversely estimated articulatory parameters. Although the classical study in this area proposed that infants start by vocalizing all possible speech sounds of the world’s languages (Jakobson, 1968), other studies have shown that infants produce only limited kinds of speech sounds (MacNeilage, 2000, MacNeilage and Davis, 2000, Oller,

Acknowledgements

The study was supported by a Grant for the Fellows of the Japan Society for the Promotion of Science (No. 12J08436) awarded to H.O. and by Japan Society for the Promotion of Science Grants-in-Aid for Scientific Research (No. 20670001 and No. 24119002) awarded to G.T. The authors declare that they have no competing financial interests.

References (65)

  • T. Toda et al.

    Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model

    Speech Communication

    (2008)
  • H.K. Vorperian et al.

    Magnetic resonance imaging procedures to study the concurrent anatomic development of vocal tract structures: preliminary results

    International Journal of Pediatric Otorhinolaryngology

    (1999)
  • W. Zuidema et al.

    The evolution of combinatorial phonology

    Journal of Phonetics

    (2009)
  • P. Alku et al.

    Formant frequency estimation of high-pitched vowels using weighted linear prediction

    The Journal of the Acoustical Society of America

    (2013)
  • S. Amano et al.

    Fundamental frequency of infants’ and parents’ utterances in longitudinal recordings

    The Journal of the Acoustical Society of America

    (2006)
  • B.S. Atal et al.

    Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique

    The Journal of the Acoustical Society of America

    (1978)
  • C.M. Bishop

    Pattern recognition and machine learning

    (2007)
  • L.J. Boë et al.

    Anatomy and control of developing human vocal tract: a response to Lieberman

    Journal of Phonetics

    (2013)
  • P. Boersma et al.

    Praat: doing phonetics by computer [Computer program] Ver. 5.2.35

    (2013)
  • T. Chiba et al.

    The vowel: its nature and structure

    (1942)
  • T.M. Cover et al.

    Elements of information theory

    (2006)
  • B. de Boysson-Bardies

    How language comes to children: from birth to two years

    (1999)
  • B. de Boer et al.

    Computer models of vocal tract evolution: an overview and critique

    Adaptive Behavior

    (2010)
  • B. de Boysson-Bardies et al.

    A crosslinguistic investigation of vowel formants in babbling

    Journal of Child Language

    (1989)
  • G. Fant

    Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations

    (1960)
  • W.T. Fitch et al.

    Morphology and development of the human vocal tract: a study using magnetic resonance imaging

    The Journal of the Acoustical Society of America

    (1999)
  • J.L. Flanagan

    Speech analysis, synthesis and perception

    (1972)
  • L. Goffman et al.

    Development and phonetic differentiation of speech movement patterns

    Journal of Experimental Psychology: Human Perception and Performance

    (1999)
  • U.G. Goldstein

    An articulatory model for the vocal tracts of growing children. Dissertation

    (1980)
  • J.R. Green et al.

    The physiologic development of speech motor control: lip and jaw coordination

    Journal of Speech Language and Hearing Research

    (2000)
  • J.R. Green et al.

    The sequential development of jaw and lip control for speech

    Journal of Speech Language and Hearing Research

    (2002)
  • J.M. Heinz et al.

    On the relations between lateral cineradiographs, area functions, and acoustic spectra of speech

    (1965)