Full length article
Acquisition of vowel articulation in childhood investigated by acoustic-to-articulatory inversion
Introduction
Speech sounds are generated by complex motor coordination among the articulatory organs. The developmental process of speech production has previously been depicted mainly on the basis of acoustical evidence, such as spectral envelopes and fundamental frequencies (Amano, Nakatani, & Kondo, 2006; Ishizuka, Mugitani, Kato, & Amano, 2007; Kent and Murray, 1982, Vorperian and Kent, 2007), and phonetic transcriptions (Ingram, 1974, MacNeilage, 2000, MacNeilage and Davis, 2000, Oller, 2000, Stoel-Gammon and Cooper, 1984). The development of the articulatory system by which these acoustics are produced, however, remains an open question because of limitations on the measurement of the articulatory system, especially tongue movements. In the present paper, we investigated longitudinal changes in children's articulation by estimating the parameters of an articulatory model on the basis of the acoustical features of speech sounds.
The development of speech production during the first year of life has been characterized as following a particular course (Kuhl, 2004, Oller, 2000, Stoel-Gammon and Cooper, 1984). Infants are born able to produce spontaneous sounds, such as sneezing and crying. Infants then produce cooing, that is, quasivocalic sounds similar to vowels. Subsequently, coos expand into clear vowel sounds characterized by full resonance and wide variety. At an early stage of babbling, a large portion of sounds produced by infants can be heard as repetitions of the same consonant–vowel (CV) units such as “papapa” and “mamama.” After that stage, infants combine different consonant- and vowel-like sounds to produce variegated sequences. Finally, beginning around the end of the first year of life, infants produce meaningful speech.
Acoustical studies show that as children grow up, their vowel clusters become more distinct, and the fundamental frequency and spectral peaks (formant frequencies) of their utterances become lower (Amano et al., 2006, Ishizuka et al., 2007, Kent and Murray, 1982, Vorperian and Kent, 2007). Moreover, analyses of phonetic transcriptions show a modification process at work in infants' vocalizations (MacNeilage, 2000, MacNeilage and Davis, 2000). At the babbling stage, infants prefer to repeat three predominant CV sequences, that is, labial–central, coronal–front, and dorsal–back CV patterns. With development, children begin to chain variegated CVs, with a fronting tendency in which the first consonant in a word has a more anterior place of articulation than the second one (Ingram, 1974). These phenomena are observed crosslinguistically (Amano et al., 2006, Ishizuka et al., 2007, Kent and Murray, 1982, MacNeilage, 2000, MacNeilage and Davis, 2000, Vorperian and Kent, 2007).
These changes are likely to be caused mainly by the development of vocal tract anatomy, respiration, and motor control of the articulators. In order to investigate the anatomical structure of the articulatory system and its dynamics during speech production, previous studies have adopted a variety of methods, such as radiographic imaging (Chiba and Kajiyama, 1942, Fant, 1960, Kiritani, 1986), electromagnetic articulography and electropalatography (Byrd and Tan, 1996, Hixon, 1971), magnetic resonance imaging (Fitch & Giedd, 1999; Masaki et al., 1999; Vorperian, Kent, Gentry, & Yandell, 1999; Vorperian et al., 2005), ultrasound (Geddes, Kent, Mitoulas, & Hartmann, 2008; Zharkova, Hewlett, & Hardcastle, 2011), and motion-capture systems (Green, Moore, Higashikawa, & Steeve, 2000; Green, Moore, & Reilly, 2002; Goffman & Smith, 1999; Nip, Green, & Marx, 2009). With regard to anatomy, previous studies reveal that children's vocal tracts, especially during the first year of life, are shaped differently from those of adults (Fitch and Giedd, 1999, Goldstein, 1980; Sasaki, Levine, Laitman, & Crelin, 1977; Vorperian et al., 1999, Vorperian et al., 2005). Infants' vocal tracts are not only smaller than adults', but they also have a larger oral cavity relative to the pharyngeal cavity, a flat tongue, and a more gradually sloping pharyngeal tract. These properties of the infant vocal tract should raise formant frequencies and lead to less clear vowel clusters. In addition, the limited range of tongue movement prevents complex consonantal articulations. While these anatomical changes in the vocal tract are certainly responsible for the changes in the filter properties of speech sounds, phonation is affected mostly by the development of respiration (Boliek, Hixon, Watson, & Morgan, 1996; Reilly & Moore, 2009). For instance, a decrease in the compliance of the chest wall results in more rapid modulation of respiratory muscle movements.
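The link between a smaller vocal tract and higher formant frequencies can be illustrated with the textbook uniform-tube approximation, in which a tube closed at the glottis and open at the lips resonates at F_n = (2n - 1)c / (4L). The sketch below is not the model used in this study: the function name, the speed of sound, and both tract lengths are illustrative assumptions, and a uniform tube captures only overall length, not the oral-to-pharyngeal proportions or tongue shape described above.

```python
def tube_formants(length_cm, n_formants=3, c_cm_per_s=35000.0):
    """Resonances (Hz) of a uniform tube closed at one end and open at the other.

    Quarter-wavelength approximation: F_n = (2n - 1) * c / (4 * L).
    c_cm_per_s is an assumed speed of sound in warm, moist air.
    """
    return [(2 * n - 1) * c_cm_per_s / (4.0 * length_cm)
            for n in range(1, n_formants + 1)]


# Both lengths are illustrative assumptions, not values from the study.
adult_formants = tube_formants(17.5)   # longer tract
infant_formants = tube_formants(8.0)   # shorter tract

for label, formants in (("adult", adult_formants), ("infant", infant_formants)):
    print(label, [round(f) for f in formants])
```

Under this approximation, halving the tube length doubles every resonance, which is consistent with the direction of the developmental lowering of formant frequencies reported in the acoustical studies.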
As for the development of motor control of the articulators, transcription analysis suggests that infants have relatively independent control over their jaw and that the ability to carry out tongue movements depends largely on jaw control (MacNeilage, 2000, MacNeilage and Davis, 2000). On the basis of these findings, it has been convincingly argued that mandibular oscillations have a crucial role in the early development of articulation. One study using motion capture partly supports this idea by reporting that jaw movements mature earlier than lip movements (Green et al., 2002, Nip et al., 2009). Another study, using electromagnetic articulography and acoustical analysis, reports that the fronting tendencies that are predominant in both adults and children are caused by coordination among articulators (Rochet-Capellan & Schwartz, 2007).
Thus, as described above, acoustical analysis and empirical measurement of the articulatory system reveal much about the development of speech production. Given that vowel production accounts for a large portion of speech by young children, tongue movements are likely to play a crucial role in the development of speech production. However, many aspects of the development of articulation, especially tongue movements during speech production up to the second year of life, remain open questions because of limitations on the empirical measurement of articulatory movements in young children.
Another approach to investigating articulatory movements is to estimate articulatory states from acoustical features; this is called acoustic-to-articulatory inversion (Atal, Chang, Mathews, & Tukey, 1978; Hiroya & Honda, 2004; Ménard, Schwartz, & Boë, 2004; Ouni and Laprie, 2005, Shirai, 1993; Toda, Black, & Tokuda, 2008; Uchida, Saito, Minematsu, & Hirose, 2015; Uria, Renals, & Richmond, 2011; Wakita, 1973). This technique relies on a mapping function from acoustical to articulatory space. Previous studies have proposed several such mapping functions (Atal et al., 1978, Hiroya and Honda, 2004, Ouni and Laprie, 2005, Shirai, 1993, Wakita, 1973) and, on their basis, articulatory models (Maeda, 1990, Mermelstein, 1973, Story, 2009). When this technique is applied to sounds produced by infants, however, some problems arise. First, because of anatomical differences between infants' vocal tracts and those of adults, the articulatory model used must be scalable to the child's vocal tract size. Second, we cannot calculate a mapping function from acoustical to articulatory features, since it is impossible to pair acoustical features with empirically obtained articulatory features in this case. Third, although the model should approximate the vocal tract shape, it is desirable to have a small number of parameters.
Taking into consideration the need for scalability of the vocal tract and for a small set of parameters to specify articulatory states, we adopted Maeda's model (Maeda, 1990, Ménard et al., 2004; Serkhane, Schwartz, Boë, Davis, & Matyear, 2007). This model was proposed to approximate midsagittal slices of the vocal tract during adults' vowel production (Maeda, 1990). Subsequent studies (Ménard et al., 2004, Serkhane et al., 2007) propose two scaling factors to incorporate growth data (Goldstein, 1980) into the model and apply it to non-adult-sized vocal tracts. A previous study (Serkhane et al., 2007) compares simulated formant frequencies with actual ones produced by infants at 4 and 7 months of age and argues that the jaw plays only a minor role before the babbling stage but a major role at the onset of rhythmic syllable-like output in canonical babbling.
We hypothesized that the initial articulatory states for vowels in the babbling period would not be well clustered, and that these states would later be differentiated and refined into clusters adjusted to the native language. To test this hypothesis without direct measurement of children's articulatory movements, we estimated articulatory states with an acoustic-to-articulatory inversion technique using the scalable version of Maeda's model, which has seven articulatory parameters. Note that, because of the one-to-many relationship between acoustical and articulatory spaces, the precise estimation of articulatory parameters from acoustical ones is an ill-posed problem. However, within the assumed articulatory model, it is possible to reveal the range of articulatory states that could underlie the acoustical distribution of young children’s sounds. For materials, we used the vowel-like sounds of Japanese, which consist of high-front /i/, mid-front /e/, low-center /a/, high-back /u/ and mid-back /o/, produced by three children over time from ages 6–60 months. In particular, we analyzed longitudinal changes in combinations of multiple articulatory organs to show how flexible coordination among them develops.
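The ill-posed nature of inversion can be made concrete with a toy example. The sketch below is not Maeda's model and not the inversion procedure used in this study: `toy_forward` is a hypothetical three-parameter articulatory-to-acoustic map invented purely for illustration, the parameter range of -3 to 3 is an assumption, and `invert` is a plain exhaustive grid search. It shows how several articulatory states can reproduce the same formants, so that inversion yields a set of candidate states rather than a single answer.

```python
import itertools


def toy_forward(jaw, tongue_height, tongue_front):
    """Hypothetical articulatory-to-acoustic map (NOT Maeda's model).

    F1 depends only on jaw opening minus tongue height, so a lowered jaw
    can be compensated by a raised tongue: many states, one F1.
    """
    f1 = 500.0 + 120.0 * (jaw - tongue_height)
    f2 = 1500.0 + 400.0 * tongue_front
    return f1, f2


def invert(target_f1_f2, tol_hz=30.0, steps=13):
    """Return every grid point whose formants fall within tol_hz of the target."""
    # Assumed parameter range: -3 to 3 in equal steps.
    grid = [-3.0 + 6.0 * i / (steps - 1) for i in range(steps)]
    candidates = []
    for jaw, height, front in itertools.product(grid, repeat=3):
        f1, f2 = toy_forward(jaw, height, front)
        if (abs(f1 - target_f1_f2[0]) <= tol_hz
                and abs(f2 - target_f1_f2[1]) <= tol_hz):
            candidates.append((jaw, height, front))
    return candidates


# Illustrative target formants in Hz, not data from the study.
candidates = invert((740.0, 1500.0))
jaw_values = sorted({c[0] for c in candidates})
print(len(candidates), "candidate states; jaw ranges from",
      jaw_values[0], "to", jaw_values[-1])
```

In this toy case the tongue-front parameter is pinned down by F2, while jaw and tongue height trade off against each other. The candidate set therefore traces a range of compensating configurations, which is the kind of range the analysis in this paper aims to characterize.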
Materials
We used the NTT Japanese infant speech database (Amano et al., 2006, Amano et al., 2009, Ishizuka et al., 2007) for this study. This database contains the utterances of five normally developing children and their parents, recorded with 16-bit quantization at a sampling rate of 16 kHz. This database also provided time-series of fundamental frequencies (F0), phoneme labels and property tags. In order to attach phoneme labels, two well-trained transcribers segmented and labeled the speech data in
Validation of inversely estimated articulatory parameters
The averages and standard deviations of the differences between the formant frequencies extracted from vowel sounds and those forward-transformed from the inversely estimated area functions were as follows (mean ± 1 S.D.): F1: 8.4 ± 20.2 Hz, F2: 9.4 ± 15.1 Hz, and F3: 15.0 ± 30.6 Hz.
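A round-trip check of this kind reduces to summarizing, for each formant, the differences between measured values and values resynthesized from the estimated articulatory state. The sketch below is one way to compute such a summary; the function name and the four sample values are illustrative assumptions, and because the text does not state whether signed or absolute differences were used, the choice is exposed as a flag.

```python
from statistics import mean, stdev


def resynthesis_error(measured_hz, resynthesized_hz, absolute=False):
    """Mean and sample S.D. of measured-minus-resynthesized differences (Hz).

    Whether the source used signed or absolute differences is not stated,
    so `absolute` is left as an explicit choice.
    """
    diffs = [m - r for m, r in zip(measured_hz, resynthesized_hz)]
    if absolute:
        diffs = [abs(d) for d in diffs]
    return mean(diffs), stdev(diffs)


# Made-up F1 values for four tokens, for illustration only.
measured_f1 = [800.0, 820.0, 790.0, 805.0]
resynthesized_f1 = [795.0, 830.0, 788.0, 800.0]

err_mean, err_sd = resynthesis_error(measured_f1, resynthesized_f1)
print(f"F1 error: {err_mean:.1f} ± {err_sd:.1f} Hz")
```

A small mean with a comparatively large S.D., as in the values reported above, would indicate little systematic bias but noticeable token-to-token scatter in the round trip.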
We also evaluated the inversion technique based on the area functions. Fig. 4 suggests that the area functions generating formant frequencies were similar to the inversely estimated ones from the formant
Discussion
In the present study, we have described developmental changes in articulatory state during vowel production on the basis of the acoustic-to-articulatory inversion technique.
As shown in the longitudinal changes in the mean values of the articulatory parameters, the distribution of the articulatory parameters was biased toward positive or negative values in early development and moved closer to zero with age. These biased distributions would disagree with the assumption of the previous study (
Conclusions
We described the application of an acoustic-to-articulatory inversion technique to identify and analyze the development of vowel articulation. First, we validated inversely estimated articulatory parameters. Although the classical study in this area proposed that infants start by vocalizing all possible speech sounds of the world’s languages (Jakobson, 1968), other studies have shown that infants produce only limited kinds of speech sounds (MacNeilage, 2000, MacNeilage and Davis, 2000, Oller,
Acknowledgements
The study was supported by a Grant for the Fellows of the Japan Society for the Promotion of Science (No. 12J08436) awarded to H.O. and by Japan Society for the Promotion of Science Grant-in-Aid for Scientific Research (No. 20670001 and No. 24119002) awarded to G.T. The authors declare that they have no competing financial interests.
References (65)
et al. Development of Japanese infant speech database from longitudinal recordings. Speech Communication (2009)
et al. Vocalization and breathing during the first year of life. Journal of Voice (1996)
et al. Saying consonant clusters quickly. Journal of Phonetics (1996)
et al. Tongue movement and intra-oral vacuum in breastfeeding infants. Early Human Development (2008)
X-ray microbeam method for measurement of articulatory dynamics: techniques and results. Speech Communication (1986)
Vocal tract anatomy and the neural bases of talking. Journal of Phonetics (2012)
et al. Early speech motor development: cognitive and linguistic considerations. Journal of Communication Disorders (2009)
et al. Infants’ vocalizations analyzed with an articulatory model: a preliminary report. Journal of Phonetics (2007)
Estimation and generation of articulatory motion using neural networks. Speech Communication (1993)
et al. Mid-sagittal cut to area function transformations: direct measurements of mid-sagittal distance and area with MRI. Speech Communication (2002)
Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication
Magnetic resonance imaging procedures to study the concurrent anatomic development of vocal tract structures: preliminary results. International Journal of Pediatric Otorhinolaryngology
The evolution of combinatorial phonology. Journal of Phonetics
Formant frequency estimation of high-pitched vowels using weighted linear prediction. The Journal of the Acoustical Society of America
Fundamental frequency of infants’ and parents’ utterances in longitudinal recordings. The Journal of the Acoustical Society of America
Inversion of articulatory-to-acoustic transformation in the vocal tract by a computer-sorting technique. The Journal of the Acoustical Society of America
Pattern recognition and machine learning
Anatomy and control of developing human vocal tract: a response to Lieberman. Journal of Phonetics
Praat: doing phonetics by computer [Computer program] Ver. 5.2.35
The vowel: its nature and structure
Elements of information theory
How language comes to children: from birth to two years
Computer models of vocal tract evolution: an overview and critique. Adaptive Behavior
A crosslinguistic investigation of vowel formants in babbling. Journal of Child Language
Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations
Morphology and development of the human vocal tract: a study using magnetic resonance imaging. The Journal of the Acoustical Society of America
Speech analysis, synthesis and perception
Development and phonetic differentiation of speech movement patterns. Journal of Experimental Psychology: Human Perception and Performance
An articulatory model for the vocal tracts of growing children. Dissertation
The physiologic development of speech motor control: lip and jaw coordination. Journal of Speech Language and Hearing Research
The sequential development of jaw and lip control for speech. Journal of Speech Language and Hearing Research
On the relations between lateral cineradiographs, area functions, and acoustic spectra of speech