Estimation of HMM-GMM Parameter for Tamil Words

An isolated-word Hidden Markov Model (HMM) is built for Tamil spoken words. Utterances of monosyllabic, disyllabic and trisyllabic Tamil words are considered. The Hidden Markov Model Toolkit (HTK) is used to build the models and estimate the HMM parameters of the acoustic units. Mel-Frequency Cepstral Coefficients (MFCC) are extracted from the speech utterances, and a multivariate Gaussian Mixture Model with different numbers of components is used to estimate the state emission probabilities of the HMM. Finally, a Viterbi decoder is employed to recognize the test speech utterances. The performance of the models is measured in terms of percentage of correctness, percentage of accuracy and Word Error Rate (WER). It is found that the five-state model with four Gaussian components per state gives its best results on monosyllabic words, compared with disyllabic and trisyllabic words of the Tamil language.


I. INTRODUCTION
A speech recognition system (SRS) aims to convert a speaker's spoken utterance into a text string [1]. Although many researchers have worked on speech recognition systems and applications, it is still far from a solved problem. Continuous HMMs are the best choice for constructing the acoustic model of a speech recognition system [2]. With this statistical model, a sufficient amount of training data allows a large number of parameters to be estimated, which leads to better recognition accuracy. However, evaluating a large number of elementary Gaussians during recognition reduces the speed of the system. Further, if the training data are insufficient, system performance degrades [3].
The Subspace Gaussian Mixture Model (SGMM) is a model in which the state-dependent GMM parameters are derived from a globally shared model subspace and low-dimensional state-dependent vectors. This type of statistical model is suitable for low-resource speech recognition systems [4] [5].
Indian languages such as Tamil are syllabic in nature. Tamil also has a close correspondence between what is spoken and what is written. A word that consists of a single syllable is called a monosyllabic word [6]. Words containing two and three syllables are called disyllabic and trisyllabic words, respectively. A polysyllable is either a word that contains more than three syllables or, more loosely, any word of more than one syllable.

II. RELATED WORK
Although developing a speech recognition system poses many issues and challenges, a considerable body of research has grown around it, and some of the related works are presented here. Riedhammer et al. [7] compared classic and multiple-codebook semi-continuous models using diagonal and full covariance matrices with continuous HMMs and subspace Gaussian mixture models. They experimented on the RM and WSJ corpora and found that, while a classical semi-continuous system does not perform as well as a continuous one, multiple-codebook semi-continuous systems can perform better, particularly when using full-covariance Gaussians.
Povey et al. [4] described an acoustic model in which all phonetic states share a common Gaussian Mixture Model structure. The means and mixture weights vary in a subspace of the total parameter space, and globally shared parameters define that subspace; the model is called a Subspace Gaussian Mixture Model (SGMM). This type of acoustic model is suitable when training data are scarce, since it allows a compact representation of the signal and gives better results.
Many researchers are working on Tamil speech recognition and synthesis systems. The Phoneme Recognition System [8] is improved by using language models in the recognition phase of the system. Speech signals were segmented using language models, and recognition was done using a similarity measure based on the acoustic characteristics of the phoneme signal. The errors in the recognized phoneme sequence were detected and corrected using an integrated model combining a variable-length phoneme model and an inter-word hybrid language model.
Karpagavalli and Sabitha [9] implemented a small-vocabulary, speaker-independent isolated word recognition system trained with ten Tamil words uttered five times by 20 speakers and achieved an accuracy of 93.5 percent.

III. HIDDEN MARKOV MODEL
A Hidden Markov Model is a statistical framework that can be defined as a finite set of states, each of which is associated with a probability distribution. Transitions among the states are governed by a set of probabilities called transition probabilities. In a particular state, an outcome or observation can be generated according to the associated probability distribution [10] [11]. Three basic problems arise when an HMM is applied. Problem 1 is known as the evaluation problem: computing the probability of an observation sequence given the model. Problem 2 is used in testing, to find the optimal state sequence that fits the observations under the probabilistic model. Problem 3 adjusts the parameters of the HMM and is also known as training the HMM.
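Problems 1 and 2 can be illustrated on a toy model. The following is a minimal sketch, assuming a discrete-observation HMM with two states and two symbols; the function names and all probabilities are hypothetical and are not taken from this work, which uses continuous densities and HTK. The forward algorithm addresses Problem 1 and the Viterbi algorithm addresses Problem 2.

```python
import numpy as np

def forward_likelihood(pi, A, B, obs):
    """Problem 1 (evaluation): P(obs | model) via the forward algorithm."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return float(alpha.sum())

def viterbi(pi, A, B, obs):
    """Problem 2 (decoding): most likely state sequence and its probability."""
    delta = pi * B[:, obs[0]]
    backptr = []
    for o in obs[1:]:
        scores = delta[:, None] * A          # scores[i, j]: best path ending in i, then i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) * B[:, o]
    state = int(delta.argmax())
    path = [state]
    for bp in reversed(backptr):             # trace the best predecessors backwards
        state = int(bp[state])
        path.append(state)
    return path[::-1], float(delta.max())

# Toy left-right model; every number below is an illustrative assumption.
pi = np.array([0.6, 0.4])                    # initial state probabilities
A = np.array([[0.7, 0.3],                    # transition probabilities
              [0.0, 1.0]])
B = np.array([[0.9, 0.1],                    # B[state, symbol]: emission probabilities
              [0.2, 0.8]])
obs = [0, 0, 1]

likelihood = forward_likelihood(pi, A, B, obs)
best_path, best_prob = viterbi(pi, A, B, obs)
```

The probability of the single best path can never exceed the total likelihood, since the forward algorithm sums over all paths while Viterbi keeps only the maximum. Problem 3 (training) is not sketched here.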

C. Continuous Density HMM (HMM-GMM)
The observations generated by all distinct states of this type are infinite and continuous, i.e. $O = \{ o_t \mid o_t \in \mathbb{R}^L \}$. The observation probability distribution in state j, $b_j(o_t)$, is a continuous probability density function and is often a mixture of L-dimensional multivariate Gaussian distributions given by Eq. 1.

$$ b_j(o_t) = \sum_{k=1}^{M} c_{jk} \, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk}) \qquad (1) $$

where $M$ is the number of Gaussians assigned to state j, $\mu_{jk}$ and $\Sigma_{jk}$ are the means and covariance matrices of the mixtures, and $c_{jk}$ is the mixture coefficient for the k-th mixture in state j.
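The mixture density of Eq. 1 can be evaluated directly for one state. The following is a minimal sketch, assuming full covariance matrices and mixture coefficients that sum to one; the function names and the example values are hypothetical and only serve to show the computation.

```python
import numpy as np

def gaussian_pdf(o, mean, cov):
    """Density of an L-dimensional Gaussian with full covariance, evaluated at o."""
    L = mean.shape[0]
    diff = o - mean
    norm = np.sqrt((2.0 * np.pi) ** L * np.linalg.det(cov))
    return float(np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / norm)

def emission_density(o, weights, means, covs):
    """State emission density: weighted sum of the component Gaussian densities."""
    return sum(w * gaussian_pdf(o, m, c) for w, m, c in zip(weights, means, covs))

# Illustrative two-component mixture in L = 2 dimensions (made-up values).
weights = np.array([0.5, 0.5])                      # mixture coefficients
means = np.array([[0.0, 0.0], [3.0, 3.0]])          # component means
covs = np.array([np.eye(2), 2.0 * np.eye(2)])       # component covariance matrices
density_at_origin = emission_density(np.zeros(2), weights, means, covs)
```

In practice, speech recognizers usually work with log densities and often with diagonal covariances to avoid numerical underflow and reduce cost; the direct form above is kept only because it mirrors the equation.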

A. Data
Data are recorded with a high-quality microphone in a closed room using the recording tool Audacity. The recorded audio files are saved as HTK transcription. The sampling rate used for recording is 16 kHz. 30 isolated Tamil words in each category (monosyllable, disyllable and trisyllable), spoken by five native speakers (3 male and 2 female), were recorded. Each word is uttered five times by each speaker. 80% of the utterances are used for training and 20% for testing. Sample words uttered by the speakers are presented in Table I.
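The size of the corpus follows from these figures. The short calculation below is a sketch that assumes every speaker recorded every word the stated five times; how individual utterances were assigned to the training and test sets is not stated, so only the totals are computed. The variable names are illustrative.

```python
words_per_category = 30
speakers = 5
repetitions = 5
categories = ["monosyllable", "disyllable", "trisyllable"]

# Utterances recorded for one syllable category.
utterances_per_category = words_per_category * speakers * repetitions

# 80 % / 20 % split of the utterances in one category.
train_per_category = utterances_per_category * 80 // 100
test_per_category = utterances_per_category - train_per_category

total_utterances = utterances_per_category * len(categories)
```

Under these assumptions each category has 750 utterances, of which 600 are used for training and 150 for testing.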

B. Feature Extraction
The steps involved in feature extraction are pre-emphasis, frame blocking, windowing, filter bank analysis, logarithmic compression, and discrete cosine transform [12] [13]. The overall process of MFCC feature computation is illustrated in Fig. 2. The input speech data are pre-emphasized with a coefficient of 0.97 using a first-order digital filter. The signal is then segmented into 20 ms frames with an overlap of 50% between adjacent frames, and each frame is windowed using a Hamming window. Filter bank analysis is performed to convert each time-domain frame of samples into the frequency domain. The log energy and twelve MFCC coefficients are then extracted. Delta and acceleration coefficients are also derived. Models are constructed with three feature sets: twelve MFCC coefficients with log energy (MFCC); twelve MFCC coefficients with log energy and their velocity coefficients (MFCC+D); and twelve MFCC coefficients with log energy, velocity and acceleration coefficients (MFCC+D+A).

C. HMM-GMM
MFCC feature vectors extracted from the frames of the speech signal corresponding to each word are given as input to estimate the parameters of the HMM. The number of states (N) and the number of Gaussian mixture components per state are varied to test the performance of each configuration of the implemented system. Left-right models with 4, 5 and 6 states are implemented, in which state 1 and state N are treated as non-emitting states. Fig. 3 shows the Bakis model with 6 states, in which state 1 and state 6 act as non-emitting states.
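The left-right topology with non-emitting first and last states can be written down as a transition matrix. The following is a minimal sketch, assuming no skip transitions and an arbitrary initial self-loop probability of 0.6; neither value is given in this work, and the function name is hypothetical. The all-zero last row reflects that the exit state has no outgoing transitions.

```python
import numpy as np

def left_right_transitions(num_states, self_loop=0.6):
    """Initial transition matrix for a left-right HMM with non-emitting entry and exit states."""
    A = np.zeros((num_states, num_states))
    A[0, 1] = 1.0                        # entry state moves straight to the first emitting state
    for i in range(1, num_states - 1):   # emitting states
        A[i, i] = self_loop              # stay in the same state
        A[i, i + 1] = 1.0 - self_loop    # advance to the next state
    return A                             # last row stays zero: the exit state has no outgoing arcs

A6 = left_right_transitions(6)           # 6-state model with 4 emitting states
```

These values serve only as a starting point; training re-estimates the non-zero entries, while entries initialized to zero remain zero, which preserves the left-right structure.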
Initially, a flat start mechanism is used to initialize the HMM-GMM model with one Gaussian mixture component per emitting state of the 4-state model, and the parameters for each word are then generated. Once an initial set of models is created, re-estimation over the entire training set is carried out. Each of the models is then re-estimated while the number of mixture components is increased until it reaches six. The same procedure is followed to evaluate the 5-state and 6-state models.
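The initialization and mixture-incrementing schedule can be sketched as follows. This is an illustration under stated assumptions, not the HTK procedure used here: diagonal covariances are assumed, the rule for adding a component (clone the heaviest component, halve its weight, and shift the two copies apart by 0.2 standard deviations) is one common heuristic chosen for the example, the re-estimation step is omitted, and random numbers stand in for real MFCC frames. All names are hypothetical.

```python
import numpy as np

def flat_start(frames, num_emitting_states):
    """Give every emitting state one Gaussian with the global mean and variance."""
    mean, var = frames.mean(axis=0), frames.var(axis=0)
    return [{"weights": np.array([1.0]),
             "means": mean[None, :].copy(),
             "vars": var[None, :].copy()} for _ in range(num_emitting_states)]

def split_heaviest(state, perturb=0.2):
    """Add one component by cloning the heaviest one and shifting the two copies apart."""
    k = int(state["weights"].argmax())
    offset = perturb * np.sqrt(state["vars"][k])
    weights = state["weights"].copy()
    weights[k] /= 2.0                         # the clone and the original share the weight
    means = state["means"].copy()
    new_mean = means[k] + offset              # clone shifted up
    means[k] = means[k] - offset              # original shifted down
    return {"weights": np.append(weights, weights[k]),
            "means": np.vstack([means, new_mean]),
            "vars": np.vstack([state["vars"], state["vars"][k]])}

rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))           # stand-in for 13-dimensional MFCC frames
states = flat_start(frames, num_emitting_states=2)   # 4-state model: 2 emitting states

for num_components in range(2, 7):            # grow from 1 to 6 components per state
    states = [split_heaviest(s) for s in states]
    # Re-estimation over the training set would run here before the next split.
```

After the loop each emitting state holds six components whose weights still sum to one.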

D. Experimental Results
The performance of the system is tested with the speakers used in training. The formula for evaluating the performance on monosyllabic, disyllabic and trisyllabic Tamil words is given in eqn.
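The three measures named in this work (percentage of correctness, percentage of accuracy and WER) can be computed from error counts. The following is a minimal sketch, assuming the standard definitions based on the number of reference words N and the numbers of deletions D, substitutions S and insertions I; that these are the exact definitions used here is an assumption, and the counts in the example are hypothetical, not results from this work.

```python
def recognition_scores(n, deletions, substitutions, insertions):
    """Percent correct, percent accuracy and word error rate from error counts."""
    hits = n - deletions - substitutions      # correctly recognized words
    return {
        "percent_correct": 100.0 * hits / n,
        "percent_accuracy": 100.0 * (hits - insertions) / n,
        "word_error_rate": 100.0 * (deletions + substitutions + insertions) / n,
    }

# Hypothetical counts for illustration only.
scores = recognition_scores(n=150, deletions=0, substitutions=6, insertions=0)
```

For isolated-word recognition, deletions and insertions are typically zero, in which case percent correct and percent accuracy coincide and the WER is their complement.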
The percentage of accuracy for the isolated syllabic Tamil words, using MFCC with log energy and four Gaussian mixture components, is presented in Table II. The results show that the five-state model with double-delta (acceleration) coefficients is best suited to designing a syllable-based Tamil recognition system.

V. CONCLUSION
Automatic speech recognition is a challenging area of research in developing interaction between man and machine. Since the Tamil language is syllabic in nature, this work is carried out to fix the HMM parameters for developing a syllable-based Tamil speech recognition system using the Gaussian Mixture Model. In this paper, experiments are carried out to estimate the HMM parameters for constructing a syllable-based speech recognition system for the Tamil language. HMMs for monosyllabic, disyllabic and trisyllabic Tamil words are constructed and tested with different MFCC feature sets, numbers of states and numbers of Gaussian mixture components. The performance is evaluated, and the results show that the five-state model with four mixture components is sufficient to implement a syllable-based Tamil isolated word recognition system.