Grapheme to Phone Conversion

Researcher, National University of Singapore, May - Dec 2012
Conditional Random Field, Joint Multigram Model, Speech Recognition, Python, Linux, Shell, C++

Grapheme-to-phone (G2P) conversion refers to predicting the pronunciation of a word from its orthography. It has important applications in human language technologies, especially speech synthesis, speech recognition, and open-vocabulary spoken term detection. We present a hybrid system that combines the Joint Multi-gram Model (JMM) and the Conditional Random Field (CRF) classifier to solve the G2P conversion problem. Experiments on the CMUDict and CELEX databases show that the proposed hybrid system consistently outperforms the individual JMM and CRF systems. We also incorporate syllabic features into the CRFs as additional features and achieve a further performance improvement.

JMM-CRF hybrid system

JMM is a generative language model over n-grams of joint letter-phoneme units, which allows it to model longer phonetic context. However, it is difficult to incorporate complex features, such as syllabification structures, into JMM. CRFs, on the other hand, can perform G2P by formulating the task as a sequence-labeling problem. CRFs are discriminative classifiers that can incorporate complex feature functions. However, modeling with CRFs requires an alignment between letters and phonemes. Furthermore, traditional linear-chain CRFs usually employ only bigram output information for practical reasons, which is not sufficient for this task. In our proposed system, JMM and CRFs are combined in tandem to yield the JMM-CRF hybrid system, which benefits from both individual approaches.
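To make the sequence-labeling formulation concrete, here is a minimal Python sketch. It assumes a one-to-one letter-phoneme alignment is already available, with "_" marking a letter that produces no phoneme and "+" joining several phonemes on one letter. The alignment for "phone", the feature names, and the helper functions are all illustrative assumptions, not the feature set used in the actual system.

```python
from typing import Dict, List, Tuple

# Hypothetical alignment for the word "phone" (an assumption for illustration):
# each letter is paired with one label; "_" means the letter is silent.
ALIGNED_PHONE: List[Tuple[str, str]] = [
    ("p", "F"), ("h", "_"), ("o", "OW"), ("n", "N"), ("e", "_"),
]


def letter_features(letters: List[str], i: int, window: int = 2) -> Dict[str, str]:
    """Build context-window features for the letter at position i."""
    feats = {"bias": "1"}
    for offset in range(-window, window + 1):
        j = i + offset
        # Positions outside the word are padded so every letter has the same keys.
        feats[f"letter[{offset:+d}]"] = letters[j] if 0 <= j < len(letters) else "<pad>"
    return feats


def to_sequence_labeling(
    aligned: List[Tuple[str, str]],
) -> Tuple[List[Dict[str, str]], List[str]]:
    """Turn an aligned word into (per-letter features, per-letter labels)."""
    letters = [grapheme for grapheme, _ in aligned]
    labels = [phoneme for _, phoneme in aligned]
    features = [letter_features(letters, i) for i in range(len(letters))]
    return features, labels


def labels_to_phonemes(labels: List[str]) -> List[str]:
    """Recover the phoneme sequence by dropping "_" and splitting "+" labels."""
    return [p for label in labels for p in label.split("+") if p != "_"]
```

The feature dictionaries and label lists are in the shape that typical linear-chain CRF toolkits accept as training input. Syllabic features could be added as further keys in the same dictionaries.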

The JMM-CRF system works by first performing G2P conversion with the generative JMM to obtain multiple hypotheses in the form of a lattice. Multiple letter-phoneme alignments are then extracted from the JMM lattice. For each alignment, a linear-chain CRF model is used to obtain a confusion network. A CRF lattice can then be constructed by taking the union of these confusion networks. Finally, the JMM and CRF lattices are combined to form a final decoding lattice from which the best phoneme sequence is obtained.
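The final combination step can be illustrated with a heavily simplified Python sketch. It replaces the lattices with small n-best lists of scored hypotheses and assumes a log-linear interpolation of the two model scores. The interpolation weight, the floor score for hypotheses missing from one model, and the example scores are all assumptions made for illustration; the text above does not specify how the lattices are actually combined.

```python
import math
from typing import Dict, Tuple

Hypothesis = Tuple[str, ...]  # a phoneme sequence


def combine_scores(
    jmm: Dict[Hypothesis, float],
    crf: Dict[Hypothesis, float],
    weight: float = 0.5,   # assumed interpolation weight, not from the source
    floor: float = -20.0,  # assumed log-score for a hypothesis one model lacks
) -> Dict[Hypothesis, float]:
    """Log-linearly interpolate two sets of log-scores over their union."""
    combined = {}
    for hyp in set(jmm) | set(crf):
        combined[hyp] = weight * jmm.get(hyp, floor) + (1.0 - weight) * crf.get(hyp, floor)
    return combined


def best_hypothesis(combined: Dict[Hypothesis, float]) -> Hypothesis:
    """Return the phoneme sequence with the highest combined score."""
    return max(combined, key=combined.get)


# Made-up scores for the word "phone", used only to exercise the functions.
jmm_scores = {("F", "OW", "N"): math.log(0.6), ("P", "HH", "OW", "N"): math.log(0.4)}
crf_scores = {("F", "OW", "N"): math.log(0.7), ("F", "AA", "N"): math.log(0.3)}
best = best_hypothesis(combine_scores(jmm_scores, crf_scores))
```

A hypothesis supported by both models keeps a high combined score, while one proposed by only a single model is penalized by the floor. A real lattice combination would operate on arcs and paths rather than whole sequences.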


Xiaoxuan Wang and Khe Chai Sim, Integrating Conditional Random Fields and Joint Multi-gram Model with Syllabic Features for Grapheme-to-Phone Conversion, INTERSPEECH, 2013, Lyon, France.