Feature Reduction Method for Speaker Identification Systems Using Particle Swarm Optimization

Abstract: Feature selection (FS) is a process in which the most informative and descriptive characteristics of a signal, those that lead to better classification, are chosen. The process is utilized in many areas, such as machine learning, pattern recognition and signal processing. FS reduces the dimensionality of a signal and preserves the most informative features for further processing. A speech signal can consist of thousands of features. Feature extraction methods such as Average Framing Linear Prediction Coding (AFLPC) using the wavelet transform reduce the number of features from thousands to hundreds. However, the resulting feature vector still contains some redundancy. In addition, some features are similar across classes and do not discriminate between them. Taking such features into consideration in the classification process does not help to identify classes; on the contrary, they only confuse the classifier and inhibit accurate identification. This paper proposes an FS method that uses evolutionary optimization techniques to select the most informative features that maximize the classification rates of Bayesian classifiers. The classification rate is also maximized by modeling the features with the proper number of Gaussian distributions. The results of the comparative analysis show that selection based on individual speaker models gives the best classification rate.

features in the speaker model [11]. Therefore, valuable features as well as some redundancy should be available in the selection to improve performance. In this context, searching for such features requires a heuristic random search and evolutionary computation techniques.
In general, the Gaussian Mixture Model [23] is used to model features in speaker models for speaker identification, usually by means of 5-10 Gaussian densities. In this method, all features are forced to have a specific number of Gaussians. However, the classification rates vary according to the number and type of Gaussians. An iterative process for choosing the final estimation with the highest likelihood [24] has also been investigated. Various other methods have also been proposed [25]-[27] to estimate the optimal number of mixtures; however, they estimate the optimal number of mixtures only between two values (minimum and maximum). Depending on their nature, some features require more or fewer Gaussian densities than other features for accurate modeling.
Feature extraction methods transform the speech signal from the time domain into another feature space domain, with the number of coefficients in the new domain being smaller than the frame size of the speech signal. The process of transformation produces a set of features representing the speech signal; however, some features produced in the speaker models do not discriminate effectively between classes. This paper proposes a method that improves classifier performance by selectively removing or preserving redundancy in features within the same class and among all classes in speaker identification systems, regardless of the feature extraction method employed. The proposed feature selection method is based on the available features that maximize the classification rate. The selection can be achieved individually (each class has its own set of features) or globally (classes settle on a set of features as a group).
Determining the effective set of features is achieved by using the Particle Swarm Optimization (PSO) evolutionary optimization technique. We also use PSO to determine the optimal number of Gaussian densities with which to model each feature in the feature vector space, which leads to a better classification rate when coupled with wavelet-based feature extraction methods [28] and the Bayes classifier. PSO was proposed in [29], [30] for feature selection in speaker verification systems using a binary classification process. In that work, the selection was considered from different aspects of the speaker identification system, the selection process was realized on all speaker models, and the best feature models were chosen.
The length of the feature vector at the input of the classifier is crucial in the classification process, because these features contribute the most to the recognition rate. There is no gain in considering extra features in the classification process unless they are informative. In this paper, we propose a method that selects the most informative features for a given feature vector. Two selection approaches are presented: first, all classes settle on a group of features that maximize the classification rate; second, each speaker model selects its own set of features, which may differ from those of other speakers. Further, features are modeled as one or two Gaussian densities when they are extracted using AFLPC [28]. Consequently, selection of the exact number of Gaussian densities that maximize the classification rate is also considered in this paper.
The remainder of this paper is organized as follows. Section II discusses the wavelet-based feature AFLPC. Section III outlines the proposed feature selection method. Section IV presents the experimental results obtained. Finally, Section V concludes this paper.
II. THE AFLPC FEATURE EXTRACTION METHOD
Wavelet packets can be used to extract additional features to guarantee a higher recognition rate. Avci et al. [31] proposed a method that calculates the entropy value of the wavelet norm in digital modulation recognition. A robust speech recognition scheme that uses wavelet-based energy as a threshold for denoising estimation has also been proposed for noisy environments [32]. Wu and Lin [33] proposed a method that uses the energy indexes of Wavelet Packet (WP) for speaker identification. Entropy calculation for the waveforms at the terminal node signals obtained from Discrete Wavelet Transform (DWT) has also been used in speaker identification [34].
Avci [35] investigated a feature extraction method for speaker recognition based on a combination of three entropy types (sure, logarithmic energy, and norm). Daqrouq and Al Azzawi [28] and Wu and Lin [36] also proposed using DWT instead of the Discrete Cosine Transform (DCT) to solve the problem of high frequency artifacts being introduced as a result of abrupt changes at window boundaries. The features based on DWT were chosen to evaluate the effectiveness of the selected feature for speaker identification [28], [37]. Several levels of DWT approximation sub-signals exhibited good performance in the presence of Additive White Gaussian Noise (AWGN) [37].
Before the feature extraction stage, the speech data are processed by a silence-removing algorithm followed by application of a pre-processing technique. In AFLPC, features are extracted from the frames of each WT speech sub-signal:
$u_q(t) = \{y_{q,1}(t), y_{q,2}(t), \dots, y_{q,N}(t)\}$ (1)
where $N$ is the number of considered frames (each frame of 20 ms duration) for the WT sub-signal $u_q(t)$ and $t$ is the discrete time. The average of the LPC coefficients calculated for the $N$ frames of $u_q(t)$ is utilized to extract a wavelet sub-signal feature vector as follows:
$w_q = \frac{1}{N} \sum_{n=1}^{N} \mathrm{LPC}\big(y_{q,n}(t)\big)$ (2)
The feature vector of the entire given speech signal is
$W = \{w_1, w_2, \dots, w_S\}$ (3)
where $S$ is the number of WT sub-signals. In this paper, the combination of AFLPC and WP is denoted WPAFL and that of AFLPC and DWT is denoted DWTAFL.
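The averaging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the wavelet sub-signals are taken as already computed, LPC coefficients are estimated with the autocorrelation method (the paper does not state which estimator is used), and the function names, the frame length in samples, and the LPC order are hypothetical choices rather than values from the paper.

```python
import numpy as np

def lpc(frame, order):
    """LPC coefficients of one frame via the autocorrelation method (assumed estimator)."""
    frame = np.asarray(frame, dtype=float)
    # Autocorrelation at lags 0..order
    r = np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(order + 1)])
    # Toeplitz normal equations R a = r[1:], with a tiny ridge for numerical safety
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    R += 1e-9 * (r[0] + 1.0) * np.eye(order)
    return np.linalg.solve(R, r[1:])

def average_framing_lpc(sub_signal, frame_len, order):
    """Average the LPC vectors over the non-overlapping frames of one sub-signal."""
    sub_signal = np.asarray(sub_signal, dtype=float)
    n_frames = len(sub_signal) // frame_len
    if n_frames == 0:
        raise ValueError("sub-signal is shorter than one frame")
    frames = sub_signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.mean([lpc(f, order) for f in frames], axis=0)

def feature_vector(sub_signals, frame_len, order):
    """Concatenate the averaged LPC vectors of all wavelet sub-signals."""
    return np.concatenate([average_framing_lpc(u, frame_len, order) for u in sub_signals])
```

For example, `feature_vector([u1, u2], frame_len=160, order=4)` would return eight features, four per sub-signal. A 160-sample frame corresponds to 20 ms only for a signal still sampled at 8000 Hz; decimated wavelet sub-signals would need a proportionally shorter frame.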

III. FEATURE SELECTION
A. Feature Modeling
In AFLPC, features are extracted from the speech signal based on wavelet transforms. A naïve Bayes classifier can then be used to perform recognition, and the distribution of features can be modeled as a Gaussian; the distributions of some selected features are shown in Fig. 1(a). However, some features can also be modeled as two Gaussians, as shown in Fig. 1(b). These two models are sufficient for realization of classification after determining the indices of the features in order to consider the exact model, as will be shown in the results obtained. In AFLPC, some features should be modeled as one Gaussian and other features with more than one. The choice is crucial for building likelihood functions, as better representation of features leads to better classification rates.
For the Bayesian fusion process, let $f_1, \dots, f_Q$ be the features that have been produced from AFLPC, and $c_1, \dots, c_M$ be the available classes. The probability $P(c_m \mid f_1, \dots, f_Q)$ is calculated using Bayes rule:
$P(c_m \mid f_1, \dots, f_Q) = \dfrac{P(f_1, \dots, f_Q \mid c_m)\, P(c_m)}{P(f_1, \dots, f_Q)}$ (4)
where $Q$ is the total number of features in AFLPC, $M$ is the total number of available classes, and $P(f_1, \dots, f_Q \mid c_m)$ is the likelihood function. Surprisingly, the naïve Bayes model performs well, even in situations where the independence assumptions are clearly false [38]. Using the assumption of conditional independence to reduce the number of parameters, we get
$P(f_1, \dots, f_Q \mid c_m) = \prod_{q=1}^{Q} P(f_q \mid c_m)$ (5)
The posterior $P(c_m \mid f_1, \dots, f_Q)$ is computed by multiplying the feature probabilities in the speaker signal:
$P(c_m \mid f_1, \dots, f_Q) \propto P(c_m) \prod_{q=1}^{Q} P(f_q \mid c_m)$ (6)
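A minimal sketch of this naïve Bayes scoring is given below. It works in the log domain to avoid underflow when many feature likelihoods are multiplied, which is an implementation choice rather than something stated in the paper. The data layout (a list of `(weight, mean, std)` tuples per feature and class) and all function names are hypothetical; the optional mask reflects the rule, described later for the selection cases, that an ignored feature contributes a likelihood of one.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Density of a univariate Gaussian."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

def feature_likelihood(x, model):
    """Likelihood of one feature value under a one- or two-Gaussian model.

    model is a list of (weight, mean, std) tuples with one or two entries.
    """
    return sum(w * gaussian_pdf(x, m, s) for w, m, s in model)

def log_posteriors(features, class_models, priors, selected=None):
    """Unnormalized log posterior of every class under conditional independence.

    class_models[m][q] is the model of feature q in class m. Features whose
    mask entry is False are skipped, i.e. their likelihood is taken as one.
    """
    features = np.asarray(features, dtype=float)
    if selected is None:
        selected = np.ones(len(features), dtype=bool)
    scores = []
    for models, prior in zip(class_models, priors):
        score = np.log(prior)
        for q, x in enumerate(features):
            if selected[q]:
                # Small constant guards against log(0) for extreme values
                score += np.log(feature_likelihood(x, models[q]) + 1e-300)
        scores.append(score)
    return np.array(scores)

def classify(features, class_models, priors, selected=None):
    """Index of the class with the highest posterior."""
    return int(np.argmax(log_posteriors(features, class_models, priors, selected)))
```

Dropping the evidence term of Bayes rule is harmless here because it is the same for every class and does not change the arg max.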

B. Evolutionary Optimization of the Classifier
Choosing the number of Gaussians for the feature vector cannot be achieved by observing every single feature and deciding on the number of suitable Gaussians in the feature model. There should be an iterative process to settle on the most suitable number of Gaussians. With this in mind, we considered more advanced methods that focus on minimization of the classification error (maximization of classification rate). Consequently, we decided on evolutionary optimization or population-based optimization owing to the flexibility of the fitness function supported by the nature of the optimization process.
To make the fitness function fully reflective of the performance of the classifier (so that the classifier can be effectively optimized), we determine the number of Gaussian mixture distributions (one or two) for each feature in the feature vector $f_1, \dots, f_Q$ for a given speech signal in the classes $c_1, \dots, c_M$, such that the classification error is minimized or, equivalently, the classification accuracy ($R$) is maximized. In other words, we look for the number of Gaussians in each feature, in each speech signal, in the training set that maximizes the classification rate $R$. In our experiments, we use PSO owing to its simplicity, relatively low computing overhead, and high effectiveness.
In certain locations of the features $f_1, \dots, f_Q$, the statistics are alike for all classes, and some of these features do not discriminate between classes. Considering such features in the classification process will not enhance recognition decisions but instead may decrease the probability of some classes or increase the probability of other classes, so that misclassification might occur.
We wish to maximize the classification rate based on the available feature set. Therefore, a process that will select the most effective set of features that discriminate all classes is required. Reducing the number of features while increasing the classification rate reduces system complexity. Thus, the question is whether or not to consider each feature $f_q$ in the classification process. The number of features is minimized in all classes, but these features are considered more representative in the sense that the probability of the desired class will increase and the probability of other classes will decrease. PSO is also used for this maximization problem.
$F_s = \arg\max_{F_s \subseteq \{f_1, \dots, f_Q\}} R$ (7)
It should be noted that the indices of $F_s$ are the same for all speakers. The posterior of all speakers $m$ (Eqs. 5 and 6) is calculated for the given selected features $F_s$. The number of selected features $n$ is determined by the optimization process. It can also be noted that there are distinctive features for individual speakers that are different in indices from those of other speakers. With this in mind, each speaker (class) can have its own selected features that maximize the classification rate for that particular class. In this case, the selection process is more complex in terms of the number of features, as each class has its own selected features, compared to the problem in Eq. 7. To reduce the optimization complexity, the number of selected features is first predetermined in each class. Then, PSO can be executed to maximize $R$, with the result being a group of selected features for each class. It is not necessary for the indices of the features to be the same for all classes.
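The global selection problem can be illustrated with a binary PSO in which each particle is a 0/1 mask over the features and the fitness is any function returning the classification rate for that mask. The sketch below assumes the sigmoid-based binary PSO variant and an inertia weight of 0.7; the paper does not specify either, so both are assumptions. The population size, iteration count and acceleration coefficients default to the values reported in the experiments section. The function name `binary_pso` and the fitness callback signature are hypothetical.

```python
import numpy as np

def binary_pso(fitness, n_features, n_particles=25, n_iter=100,
               c1=0.5, c2=1.25, inertia=0.7, seed=0):
    """Maximize fitness(mask) over boolean feature masks (sigmoid binary PSO sketch)."""
    rng = np.random.default_rng(seed)
    # Positions are 0/1 floats; velocities are real-valued
    pos = rng.integers(0, 2, size=(n_particles, n_features)).astype(float)
    vel = rng.uniform(-1.0, 1.0, size=(n_particles, n_features))
    pbest = pos.copy()
    pbest_fit = np.array([fitness(p > 0.5) for p in pos], dtype=float)
    g = int(np.argmax(pbest_fit))
    gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    for _ in range(n_iter):
        r1 = rng.random(pos.shape)
        r2 = rng.random(pos.shape)
        # Cognitive term pulls toward each particle's best, social term toward the swarm's best
        vel = inertia * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        vel = np.clip(vel, -6.0, 6.0)
        # The sigmoid of the velocity gives the probability that a bit is set
        prob = 1.0 / (1.0 + np.exp(-vel))
        pos = (rng.random(pos.shape) < prob).astype(float)
        fit = np.array([fitness(p > 0.5) for p in pos], dtype=float)
        improved = fit > pbest_fit
        pbest[improved] = pos[improved]
        pbest_fit[improved] = fit[improved]
        g = int(np.argmax(pbest_fit))
        if pbest_fit[g] > gbest_fit:
            gbest, gbest_fit = pbest[g].copy(), pbest_fit[g]
    return gbest > 0.5, float(gbest_fit)
```

In practice the fitness callback would classify the held-out training signals using only the masked features and return the fraction classified correctly. As with any PSO run, the returned mask is the best one found, not a guaranteed optimum.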
We considered three cases:
1) Features are modeled as one Gaussian: In this case, $R$ is maximized with regard to features $f_1, \dots, f_Q$. The most representative features are considered by running PSO on the training set. Feature $f_q$ is either selected and its likelihood $P(f_q \mid c_m)$ considered in Eq. 5, or ignored and $P(f_q \mid c_m)$ taken as one for all $m$. In the worst case, if the selection does not produce a better classification rate, it will at least give the same performance with fewer features, which in the end reduces system complexity.
2) Features are modeled as one or two Gaussians: In this case, $R$ is maximized with regard to features $f_1, \dots, f_Q$, as in case 1, and the choice of the suitable model (one or two Gaussians) for each feature is also included in the optimization process.
3) Considering feature modeling in case 2, each class has its own selected feature set. Here, the number of features is given before running PSO. If $n$ features of each class are assumed to be selected among $f_1, \dots, f_Q$, then the number of parameters that have to be optimized is $M \times n$.
Note that considering more than two Gaussians will not improve the classification rate as most features are distributed normally and few of them have the nature of bimodal Gaussian distribution, as shown in Fig. 1.
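To make the two candidate feature models concrete, the sketch below fits a single Gaussian by its sample mean and standard deviation, and a two-component mixture with a short expectation-maximization loop. This is an illustration only: the paper chooses between the two models with PSO by maximizing the classification rate and does not state how the Gaussian parameters themselves are estimated, so the EM fit, the median-split initialization and the function names are assumptions.

```python
import numpy as np

def fit_one_gaussian(x):
    """Single-Gaussian model as a list with one (weight, mean, std) tuple."""
    x = np.asarray(x, dtype=float)
    return [(1.0, float(x.mean()), float(x.std() + 1e-6))]

def fit_two_gaussians(x, n_iter=50):
    """Two-component 1-D Gaussian mixture fitted with EM (assumed estimator)."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    lo, hi = x[x <= med], x[x > med]
    if len(hi) == 0:  # all values identical: fall back to one Gaussian
        return fit_one_gaussian(x)
    # Initialize from the lower and upper halves of the data
    w = np.array([len(lo), len(hi)], dtype=float) / len(x)
    mu = np.array([lo.mean(), hi.mean()])
    sd = np.array([lo.std(), hi.std()]) + 1e-6
    for _ in range(n_iter):
        # E-step: responsibility of each component for each sample
        dens = w * np.exp(-0.5 * ((x[:, None] - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        # M-step: re-estimate weights, means and standard deviations
        nk = resp.sum(axis=0) + 1e-12
        w = nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / nk
        sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
    return [(float(w[k]), float(mu[k]), float(sd[k])) for k in range(2)]
```

Both functions return the same `(weight, mean, std)` layout, so a per-feature flag selected by the optimizer could switch between the two models without changing the likelihood computation.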

IV. EXPERIMENTS AND RESULTS
The experimental setup was as follows: speech signals were recorded via a PC sound card at a spectral frequency of 4000 Hz and sampling frequency of 8000 Hz. Forty-seven persons participated in the recordings. Each participant recorded a minimum of 20 different utterances in Arabic. The age of the speakers ranged from 20 to 45 years and the participants comprised 25 males and 22 females. The recordings were made in normal university office conditions. The speech signal data set was split into a training set (496 signals) and a test set (530 signals). Two speech signals in the training set for each class were reserved for feature selection using PSO and the rest were used to form likelihood functions. It should be noted that the presented method improves the classification rate of existing feature extraction methods by providing only the informative features to the classifier.
A neural network (NN) was also used to evaluate the feature selection performance. The same optimization process was used to optimize the number of considered features at the input of the neural network. The features produced from AFLPC were fed directly into the input of a probabilistic neural network (PNN).
For PSO, the size of the population was set to 25 and the number of iterations was set to 100; we experimentally found that this number was sufficient to achieve the convergence of the method (Fig. 2). PSO has been shown to be robust and effective in solving the optimization problem. The values of the cognitive acceleration coefficient and the social acceleration coefficient were set to 0.5 and 1.25, respectively. These two values are commonly recommended in the literature (see [39]). PSO is used because it outperforms other heuristic search methods (e.g., genetic algorithms) in terms of convergence speed and complexity [29].
A. Case 1:
Table 1 shows the classification rates of various feature extraction methods considering two classifiers, the Bayes classifier (BC) and GMM, as well as the probabilistic neural network (PNN), along with the number of features selected by PSO. The performance of all the feature extraction methods improved when feature selection was applied, even though fewer features were considered. In our evaluation of the proposed method, several published methods were analysed.
Feature extraction methods such as wavelet packet and Shannon entropy (WPS) [37], wavelet packet and log energy entropy (WPLE) [35], MFCC and GMM (MFGMM) [23], WPAFL, DWTAFL, as well as the fusion between WPAFL and DWTAFL (FWAFL) were used to test the proposed method. In general, there were improvements in the classification rates regardless of the feature extraction method and the type of classifier. The number of AFLPC features can be reduced by as much as 50%, with at least the same performance in terms of classification rate. The improvement does not depend on the choice of classifier, as feature selection also improves the classification rate for the NN. The speaker identification system is consequently less complex, as the number of features at the input of the NN is reduced to approximately 50%. MFCC with GMM has the best performance, with a classification rate of 0.9815.
B. Case 2:
Table 2 shows the feature selection performance achieved for one Gaussian, two Gaussians, and a group comprising one and two Gaussians. The optimization process determines the number of Gaussians that should be considered in the realization. As in case 1, the feature reduction is as much as 50% for the AFLPC methods, with improvements of 3.04% and 1.55% for DWTAFL and WPAFL, respectively. Note that the more features of the AFLPC (FWAFL) method there are, the better the performance with respect to classification rates, because more information is provided to the classes. With feature selection in this situation, we not only get better performance but also a reduction in the number of features used in the classification process. It was found that modeling the features with more than two Gaussians does not improve the classification rate, as most features are best modeled with only one Gaussian.

C. Case 3:
In this case, the number of parameters (selected features) was increased as each class had n selected features. In other words, each class settled on a set of features that discriminate it the most from other classes; the selected features in each class may overlap with the selected features in other classes (overlap means that the indices of the selected features could be the same). Table 3 shows the performance of the AFLPC methods at n = 160 for DWTAFL and WPAFL and n = 180 for FWAFL. The performance is improved when each class selects its own set of features (local) in contrast to all classes settling on a set of features (global). Allowing each class to choose its own features offers advantages over features that are selected for all classes, as can be seen in Fig. 3. It should be noted that the solution of the global feature selection is used as the initial population for the local feature selection. Note also that only slight improvements in R are obtained beyond n = 80 for DWTAFL and WPAFL and beyond n = 100 for FWAFL. The complexity is increased as the number of features in the optimization process is increased (47 × n). PSO does not guarantee an optimal solution, but the results show that there exist groups of features that improve the classification rate when they are selected, regardless of the feature extraction method and the type of classifier used. The best performance was 98.59% for FWAFL in local feature selection at n = 180 for each class, with a total number of features of 8460. A genetic algorithm (GA) was also tested for FWAFL; its best performance was 97.98% at n = 240 (11280 features in total).
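The size of the local selection problem follows directly from the 47 classes and the per-class count n, which the short sketch below reproduces. It also shows one possible way to decode a flat particle into per-class feature indices. That encoding (values in [0, 1) scaled to an index) and the function names are hypothetical, since the paper does not describe how particles are encoded, and the total feature count passed as `n_features` is a placeholder.

```python
import numpy as np

def n_parameters_local(n_classes, n_per_class):
    """Number of optimized parameters when every class selects its own n features."""
    return n_classes * n_per_class

def decode_local_particle(particle, n_classes, n_per_class, n_features):
    """Map a flat particle with values in [0, 1) to per-class feature indices.

    Hypothetical encoding: row m of the result holds the feature indices
    selected for class m. Duplicate indices within a row are not removed.
    """
    grid = np.asarray(particle, dtype=float).reshape(n_classes, n_per_class)
    idx = np.floor(grid * n_features).astype(int)
    return np.clip(idx, 0, n_features - 1)

# Totals quoted in the text: PSO at n = 180 and GA at n = 240, with 47 classes
pso_total = n_parameters_local(47, 180)   # 8460
ga_total = n_parameters_local(47, 240)    # 11280
```

Because the search space grows linearly with n for every one of the 47 classes, seeding the swarm with the global solution, as the text describes, gives the local search a reasonable starting point.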

V. CONCLUSION
In this paper, features that are more informative among classes were selected and considered in the classification process. The features were selected by maximizing the classification rate. An evolutionary optimization technique (PSO) was used in the selection process. Feature selection guarantees at least the same performance with fewer features, especially when there is redundant information in the features. Feature extraction methods such as AFLPC produce features with redundant information. Further, some features are uninformative, in that they do not enhance the classification rate. Consequently, ignoring such features enhanced the performance of the Bayes classifier. The Bayes classifier is sensitive to how features are modeled, but choosing the right number of Gaussian models during feature selection improved its performance. In AFLPC, a 50% feature reduction rate was achieved with no impact on the classification rate. Each class was also identified by its own set of features. The best classification rate was 98.59% for FWAFL in local feature selection.