Automatic Documentation of Faetar’s [i]: A Methodology for Discovering Vowel Space Using Artificial Neural Networks

Consider a huge, untagged speech corpus from a language without a written tradition. How can we measure its vowel space quickly and accurately, without expending large amounts of labour and funds? We present a methodology for measuring probabilistic variation across large corpora of natural spoken language, one that is particularly useful for under-resourced and lesser-documented languages. A heuristic function identifies the optimal vowel sample for any given phone category. This heuristic is trained through machine learning, in this case with an unsupervised neural network. The process allows us to test large amounts of raw data and create a vowel space without hand-tagging many hours of recordings. We aim to model how speakers from different dialect groups speak: which phonetic patterns are they most likely to show, and can we differentiate and categorize unknown samples using models built from natural language? This work uses spontaneous speech data in the endangered language Faetar, from the Heritage Language Variation and Change Corpus.