Low-Complexity Face Recognition using a Multilevel DWT and Two States of Continuous HMM to recognize Noisy Images

— Face recognition has become an important subject in modern life, especially in security and surveillance applications. This work introduces a face recognition method characterized by high speed, low complexity, and high efficiency in noisy environments. The performance of the method is greatly improved by a median filter: each image is filtered to eliminate the influence of noise and uneven illumination. Multiple levels of the discrete wavelet transform are then applied to the filtered image to reduce its size and suppress further noise. Subsequently, the resultant image is scanned in raster fashion by a window with a predefined overlap to construct a sequence of observation vectors that forms the basis of a model. The model is a two-state continuous hidden Markov model, a configuration not previously used in the face recognition field, which accounts for the novelty and low complexity of the method. Despite impulse noise densities of 0.15, 0.4, and 0.25 in images from the ORL, Yale, and EURECOM Kinect face databases, respectively, the proposed method achieves a recognition rate of 100%.

In Section II, a brief description of the related work is introduced. Section III presents theoretical backgrounds of the main topics. The proposed method is described in Section IV. Experimental results are discussed in Section V. Finally, Section VI documents our conclusions and future work.

II. RELATED WORK
Several attempts to solve the face recognition problem using the HMM have been presented over the last 20 years. These approaches differ in the structure of the HMM and in the technique used to generate the observation vectors. In [9], the authors proposed the first HMM face recognition system. They used two types of 1D-HMM, ergodic and left-to-right, with eight and five states, respectively. Pixel intensities are the elements of the observation vectors produced from a given image. The disadvantage of using pixel intensities is the associated high computational complexity; therefore, the tendency is to employ low-complexity methods. The discrete wavelet transform (DWT) [10], local binary pattern (LBP) [11], singular value decomposition (SVD) [12,13,14], and discrete cosine transform (DCT) [15] are the most widely used methods for dimensionality reduction.
Different types of wavelets have been used in HMM-based face recognition. In [10,16,17], the Haar wavelet was used to extract features from face images. The third level of the wavelet transform (WT) and a continuous 1D-HMM classifier with five states and two mixture components were used in [18]. The sensitivity of that system to white Gaussian noise was examined using two types of Daubechies wavelets: db4 and db10. In [19], the authors used a DWT instead of a DCT, believing that the DCT may not be the best technique to describe an image and that wavelets may be a better choice. Three types of wavelets were used in [20], Haar, Biorthogonal 9/7, and Coif3, with a structural HMM. In [21], the authors used a WT and PCA with a five-state HMM classifier. A discrete Gabor wavelet transform (DGWT) was used in [22] to obtain observation vectors from face images with a seven-state 2D-HMM. The same authors used a DWT and a seven-state 1D-HMM in [23], but the experiments were conducted on a different database. In [24], the authors presented an HMM face recognition system based on the Gabor wavelet for feature extraction and LDA for dimensionality reduction. They used 15 states for the ORL database and 9 states for the Yale database.

III. BACKGROUND
The theoretical backgrounds of the main topics relevant to the proposed work are explained in the following sections.

A. Median Filter
A median filter is a nonlinear spatial filter used in image processing tasks such as noise elimination. The center point of a mask window slides over the image from top to bottom and left to right in steps of one pixel. At each position, the window contents are converted to a 1D vector and sorted in descending or ascending order, and the mid-value (median) replaces the pixel at the center of the window [25]. The following example illustrates the operation of a median filter with a mask window of size (3 × 3) pixels. At one window position, the center value is 61, and the nine pixel values covered by the mask sort into the descending vector

V = (80, 80, 79, 76, 64, 63, 61, 60, 60).

The mid-value of V is 64, which is the median of that window; it replaces the previous center value (i.e., 61). Repeating this at every position yields the median-filtered image.
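The sliding-window procedure above can be sketched as follows (a minimal NumPy version, assuming a grayscale image and a 3 × 3 mask; border pixels are simply copied for brevity):

```python
import numpy as np

def median_filter3(img):
    """Apply a 3x3 median filter; border pixels are copied unchanged."""
    h, w = img.shape
    out = img.copy()
    for i in range(1, h - 1):
        for j in range(1, w - 1):
            # Flatten the 3x3 window, sort it, and take the mid-value.
            window = img[i - 1:i + 2, j - 1:j + 2].ravel()
            out[i, j] = np.sort(window)[4]  # index 4 = median of 9 values
    return out

# A flat image with one impulse-corrupted pixel: the spike is removed.
img = np.full((5, 5), 60, dtype=np.uint8)
img[2, 2] = 255
filtered = median_filter3(img)
```

Because the impulse value (255) sits at an extreme of the sorted window, it never becomes the median unless many neighbors are also corrupted, which is why the filter suits impulse noise.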

B. Discrete Wavelet Transform (DWT)
The DWT was invented by Alfred Haar in 1909; it decomposes a signal into approximation and detail coefficients. The DWT employs two functions: a scaling function and a wavelet function [25]. The scaling function is associated with a low-pass filter, while the wavelet function corresponds to a high-pass filter. The Haar wavelet function $\psi(t)$ and scaling function $\phi(t)$ are defined, respectively, as

$$\psi(t) = \begin{cases} 1, & 0 \le t < 1/2 \\ -1, & 1/2 \le t < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (1)$$

$$\phi(t) = \begin{cases} 1, & 0 \le t < 1 \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Filter banks decompose the signal into different frequency levels using multiple stages of high-pass and low-pass filters; the desired resolution is obtained by further splitting the signal at each stage. A 1D-DWT can be viewed as a convolution between the signal $x[n]$ and the impulse responses $g[n]$ and $h[n]$ of the low-pass and high-pass filters, respectively. The approximation coefficients are obtained as follows [26]:

$$a[n] = \sum_{k} x[k]\, g[n-k].$$

In the same manner, the detail coefficients are generated by

$$d[n] = \sum_{k} x[k]\, h[n-k].$$

The output of each filter is then downsampled by 2, and the corresponding convolution expressions become

$$a[n] = \sum_{k} x[k]\, g[2n-k], \qquad d[n] = \sum_{k} x[k]\, h[2n-k].$$

An image is a 2D signal, and two separate 1D-DWT passes must be used to achieve a 2D-DWT. The decomposition is first applied to the rows of the image, and the outputs are then decomposed along the columns. This operation splits the image into four sub-images, denoted LL, LH, HL, and HH. The next resolution level is obtained by further splitting LL into four new sub-images, and so on, as shown in Fig. 1.
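One 2D decomposition level with the Haar filters can be sketched as follows (a minimal NumPy version, assuming even image dimensions; the filtering and downsampling by 2 are merged into pairwise sums and differences):

```python
import numpy as np

def haar_dwt2(x):
    """One level of the 2D Haar DWT: returns LL, LH, HL, HH sub-images."""
    x = x.astype(float)
    # Rows: low-pass (pairwise sum) and high-pass (pairwise difference),
    # already downsampled by 2.
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    # Columns: the same filters applied to each intermediate result.
    LL = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)
    LH = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)
    HL = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    HH = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return LL, LH, HL, HH

# Each level halves both dimensions; the detail bands of a constant
# image vanish, illustrating how smooth content concentrates in LL.
img = np.full((8, 8), 5.0)
LL, LH, HL, HH = haar_dwt2(img)
```

Repeating the call on LL yields the next resolution level; with suitable boundary handling, three levels reduce a 112 × 92 image to roughly 14 × 12, as used later in the feature-extraction step.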

C. Hidden Markov Model (HMM)
The HMM is a one-dimensional model consisting of interconnected non-observable (hidden) states that manifest themselves through an observable vector sequence. The parameters of an HMM are [7]: the number of states $N$, the state-transition probability matrix $A = \{a_{ij}\}$, the observation probability distributions $B = \{b_j(O)\}$, and the initial state distribution $\pi = \{\pi_i\}$. The model is indicated by the compact notation $\lambda = (A, B, \pi)$. For continuous output observation densities of the hidden states, the observation matrix elements satisfy

$$b_j(O) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(O, \mu_{jm}, \Sigma_{jm}), \qquad 1 \le j \le N, \qquad (15)$$

where $O$ is the observation vector, $c_{jm}$ is the coefficient of the $m$th mixture in state $j$, and $\mathcal{N}$ is any log-concave or elliptically symmetric density (e.g., Gaussian) with mean vector $\mu_{jm}$ and covariance matrix $\Sigma_{jm}$. The mixture gains must satisfy the stochastic constraint

$$\sum_{m=1}^{M} c_{jm} = 1, \qquad c_{jm} \ge 0. \qquad (16)$$
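Equation (15) can be evaluated directly; the sketch below (a minimal NumPy version, assuming diagonal covariance matrices) computes $b_j(O)$ for one state from its mixture gains, means, and variances:

```python
import numpy as np

def b_j(o, c, mus, variances):
    """Continuous observation density: weighted sum of diagonal Gaussians."""
    o = np.asarray(o, dtype=float)
    d = o.size
    total = 0.0
    for m in range(len(c)):
        var = np.asarray(variances[m], dtype=float)
        # Normalizing constant of a d-dimensional diagonal Gaussian.
        norm = np.sqrt((2 * np.pi) ** d * np.prod(var))
        expo = np.exp(-0.5 * np.sum((o - mus[m]) ** 2 / var))
        total += c[m] * expo / norm
    return total

# One mixture component (M = 1), as used later in the training step;
# the gains trivially satisfy the stochastic constraint sum(c) = 1.
c = [1.0]
mus = [np.zeros(2)]
variances = [np.ones(2)]
density = b_j(np.zeros(2), c, mus, variances)  # peak of a standard Gaussian
```

At the mean of a 2D standard Gaussian the density equals $1/(2\pi)$, which gives a quick sanity check of the normalization.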

D. HMM Face Recognition
1) 1D-HMM Face Recognition: The configuration of a 1D-HMM model accepts vertical (1D) observation vectors, obtained by sampling each face image with a sliding window. This approach was used in [9,10,12,13,16,17,18,21,23], where a sequence of observation vectors is constructed from each image. The length of the sequence depends on the size of the image, and the length of each vector varies according to the size of the scanning window. Fig. 2 shows an example of a two-state 1D-HMM topology.

2) 2D-HMM Face Recognition:
The configuration of a 2D-HMM model is a composition of two 1D-HMM models, where each model is overlaid onto the other. It consists of a number of superstates, which represent the main 1D-HMM, and each superstate consists of a number of states of the 1D-HMM model. In [11,15,22], 2D-HMMs were employed with a variation in the number of states and also the number of superstates. Fig. 3 shows a structure of a left-to-right 2D-HMM.

IV. THE PROPOSED WORK
The proposed face recognition system consists of four steps: (1) pre-processing, (2) feature extraction, (3) training, and (4) testing. For each individual, a proper subset of images is selected from the database for training and the remaining images are used for testing. The accuracy of the system is usually affected by the training set selection; thus, the best recognition rate may be achieved by including different poses in the training images. The main operations of the system are shown in Fig. 4.

A. Pre-processing
A pre-processing step is used to enhance the operation of the system; it includes image cropping and median filtering. Image cropping is applied only to large images to reduce the data to be processed while retaining sufficient data to represent the entire face. The images in the ORL face database [27] remain uncropped because their size, 112 × 92, is not large, whereas cropping is performed on the images in the Yale database (243 × 320) [28] and the EURECOM Kinect database (256 × 256) [29]. Every image in the Yale and EURECOM Kinect databases contains a dispensable background that is not relevant to the face; the important image region is extracted and resized to 192 × 192 pixels. Median filtering, on the other hand, is highly relevant, as it contributes to the elimination of noise and illumination effects in the images. This step greatly improves the performance of the system, especially in noisy environments.

B. Feature extraction
The filtered image is decomposed by applying either the third or the fourth level of the DWT, according to the size of the image. The third level is applied to images of size 112 × 92 (ORL database), and the resultant image size is 14 × 12, whereas images of size 192 × 192 (Yale and EURECOM Kinect databases) are decomposed to the fourth level to produce images of size 12 × 12. Note that the detail coefficients are excluded and only the approximation coefficients are retained. For each image, a sequence of blocks is generated as shown in Fig. 5, using a sliding window of size 5 × 4 for the ORL database and 5 × 5 for the two other databases. The maximum overlap between two consecutive blocks is selected, i.e., the overlap in height and width is one pixel less than the window dimensions, so the window advances by one pixel per step. The blocks are then converted to column vectors and concatenated sequentially. For a given person, the number of sequences equals the number of training images, arranged in the form accepted by the HMM.
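The block-scanning step can be sketched as follows (a minimal NumPy version; the 5 × 4 window with a one-pixel step matches the ORL settings described above):

```python
import numpy as np

def observation_sequence(img, win_h, win_w, step=1):
    """Scan img in raster order, flattening each window into one vector."""
    h, w = img.shape
    seq = []
    for i in range(0, h - win_h + 1, step):
        for j in range(0, w - win_w + 1, step):
            seq.append(img[i:i + win_h, j:j + win_w].ravel())
    return np.array(seq)

# A 14 x 12 approximation image (ORL, third DWT level) with a 5 x 4
# window and maximum overlap (step of 1) yields
# (14 - 5 + 1) * (12 - 4 + 1) = 90 observation vectors of length 20.
img = np.arange(14 * 12, dtype=float).reshape(14, 12)
obs = observation_sequence(img, 5, 4)
```

The resulting matrix, one row per window position, is the per-image observation sequence passed to the HMM.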

C. Training process
The proposed method is distinct from previous HMM face recognition methods in that it uses only two states of the ergodic HMM, as shown in Fig. 2. The continuous output observations are modeled by a Gaussian probability density function with one mixture component, such that M = 1 in (15) and (16). One model is trained for each individual using the Baum-Welch algorithm [30]. Each model is initialized with random parameters satisfying the constraints cited in Sec. III-C. Training is considered converged when the difference between two successive iterations falls below (10 ); otherwise, training stops after four iterations.
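Random initialization under the constraints of Sec. III-C can be sketched as follows (a minimal NumPy version; the vector dimension and seed are illustrative, not values from the paper):

```python
import numpy as np

def init_two_state_hmm(dim, seed=0):
    """Random two-state ergodic HMM with one Gaussian per state.

    Rows of A and the vector pi are normalized to satisfy the
    stochastic constraints; the mixture gain is 1 since M = 1.
    """
    rng = np.random.default_rng(seed)
    pi = rng.random(2)
    pi /= pi.sum()                      # initial-state probabilities sum to 1
    A = rng.random((2, 2))
    A /= A.sum(axis=1, keepdims=True)   # each row of A sums to 1 (ergodic)
    mus = rng.random((2, dim))          # Gaussian mean vectors
    variances = np.ones((2, dim))       # diagonal covariances
    return pi, A, mus, variances

pi, A, mus, variances = init_two_state_hmm(dim=20)
```

Baum-Welch updates would then be applied to these parameters until the change between successive iterations falls below the threshold, subject to the four-iteration cap described above.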

D. Testing process
A sequence of observation vectors is obtained from the image under test after applying the median filter and the DWT, as previously described. The Viterbi algorithm [31] is used to calculate the probability of this sequence against all trained HMM models, and the unknown image is identified as the person whose model yields the highest probability. Several wavelet families were examined to determine the type that provides the best results: Daubechies, Coiflet, Biorthogonal, Reverse biorthogonal, and Symlet. The test encompassed all subfamilies, and the Haar wavelet proved the most suitable. It is worth noting that the wavelet and scaling functions of the Haar wavelet, described in (1) and (2) respectively, are identical to those of the db1, Biorthogonal 1.1 (bior1.1), Reverse biorthogonal 1.1 (rbio1.1), and Symlet 1 (sym1) wavelets. Therefore, these wavelets also produce the same high recognition rate (100%).
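The scoring step can be sketched as follows (a minimal log-domain Viterbi for a two-state Gaussian HMM, assuming diagonal covariances; the model parameters here are illustrative, not trained values):

```python
import numpy as np

def log_gauss(x, mu, var):
    """Log-density of a diagonal Gaussian at observation x."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def viterbi_score(obs, pi, A, mus, variances):
    """Log-probability of the best state path for an observation sequence."""
    n = len(pi)
    delta = np.log(pi) + np.array(
        [log_gauss(obs[0], mus[j], variances[j]) for j in range(n)])
    for t in range(1, len(obs)):
        emit = np.array(
            [log_gauss(obs[t], mus[j], variances[j]) for j in range(n)])
        # Best predecessor for each state, then add the emission term.
        delta = np.max(delta[:, None] + np.log(A), axis=0) + emit
    return np.max(delta)

# Two illustrative person models with different means; the test sequence
# lies near the first model's means, so model 0 should score higher.
pi = np.array([0.5, 0.5])
A = np.full((2, 2), 0.5)
model0 = (np.zeros((2, 3)), np.ones((2, 3)))      # (means, variances)
model1 = (np.full((2, 3), 4.0), np.ones((2, 3)))
obs = np.zeros((6, 3))
s0 = viterbi_score(obs, pi, A, *model0)
s1 = viterbi_score(obs, pi, A, *model1)
```

The unknown image is then assigned to the person whose model attains the maximum score over all trained models.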

V. EXPERIMENTAL RESULTS
A. Experiment 1
This experiment evaluates the number of training images sufficient to achieve the highest recognition rate. If the number of images for each person in the database is $N$ and the number of training images is $T$, the number of testing images is $N - T$. The best recognition rates obtained by varying the number of training images taken from the ORL, Yale, and session-2 RGB images of the EURECOM Kinect database are tabulated in Table I. It is evident from Table I that five training images per person suffice for error-free testing on the ORL and Yale databases, whereas six training images per individual are needed to achieve a 100% recognition rate on the EURECOM Kinect database. This is explained by the nature of the images in each database. The ORL and Yale databases contain primarily frontal images with some facial and lighting variations, while the EURECOM Kinect database includes non-frontal and partially occluded images. The recognition process depends strongly on the nature of the training images, and when a database contains heterogeneous poses, a pose outside the (assumed) normal distribution may need to be included in the training set to maintain a successful discrimination process.

B. Experiment 2
To examine the efficiency of the system on noisy images, impulse noise (also called salt-and-pepper noise) and white Gaussian noise are added to the testing images, whereas the training process is conducted under more tightly controlled conditions using the noise-free images. For images with real noise, the noise density contained in the image cannot be computed, so such images cannot be used to quantify the efficiency of the system; adding noise with a predefined density, in contrast, allows the effectiveness of the system to be measured against a known noise level. One test was conducted without the median filter in order to examine its usefulness in noise reduction. Fig. 6 shows the results of using impulse noise and white Gaussian noise (zero mean and different variances) with and without median filtering. According to the results illustrated in Fig. 6, the median filter is clearly more efficient at removing the effect of impulse noise. The recognition rate remains at 100% as the noise density is increased up to 0.15 for the ORL database, 0.4 for the Yale database, and 0.25 for the EURECOM Kinect database. The disparity in rates among the databases results from differences in their sizes and the nature of their images. The testing sets comprise 200 images from the ORL database (five images for each of 40 persons), 90 images from the Yale database (six images for each of 15 persons), and 156 images from the EURECOM Kinect database (three images for each of 52 persons). Impulse noise turns a pixel fully on or off, i.e., the pixel assumes a value of either 255 or zero, and, unless the noise density is high, these extreme values do not become the median of the mask window.
To explain this observation, a mask window of size 3 × 3 was experimentally overlaid onto the image, and the pixel values at predefined positions were fetched into a sorted vector of nine elements; the index of the mid-value is 5, and the median value in this example is 196. The window may be infected by low noise density (fewer than five infected pixels in the window) or by high noise density (five or more infected pixels). In the case of low noise density, the median value either remains unchanged or moves to a nearby value, depending on its position in the sorted vector and on whether it is itself infected. It can be concluded that, for an image infected by a certain noise density, the resulting median value is non-deterministic; therefore, each experiment is repeated four times and the average rate is reported. The median filter thus maintains the characteristics of the image despite some degree of noise density, which leads to the conclusion that it greatly improves the performance of the system, especially in the case of impulse noise.
White Gaussian noise differs from impulse noise in that it shifts pixel values anywhere within the range 0 to 255, so the median value is likely to change. Therefore, the usefulness of a median filter in the case of white Gaussian noise is limited.
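The two noise models used in this experiment can be sketched as follows (a minimal NumPy version; the density value matches the ORL test, while the image and variance are illustrative):

```python
import numpy as np

def add_impulse_noise(img, density, rng):
    """Set a fraction `density` of pixels to 0 or 255 (salt and pepper)."""
    noisy = img.copy()
    mask = rng.random(img.shape) < density
    noisy[mask] = rng.choice(np.array([0, 255], dtype=img.dtype),
                             size=int(mask.sum()))
    return noisy

def add_gaussian_noise(img, variance, rng):
    """Add zero-mean white Gaussian noise on a [0, 1] scale, then rescale."""
    noisy = img / 255.0 + rng.normal(0.0, np.sqrt(variance), img.shape)
    return np.clip(noisy, 0.0, 1.0) * 255.0

rng = np.random.default_rng(1)
img = np.full((200, 200), 128, dtype=np.uint8)
salted = add_impulse_noise(img, 0.15, rng)   # density used for ORL tests
blurred = add_gaussian_noise(img, 0.01, rng)
```

Note that impulse noise leaves unaffected pixels untouched, whereas Gaussian noise perturbs every pixel, which is consistent with the differing effectiveness of the median filter in the two cases.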

C. Comparative results
The advantage of the proposed work relative to other research can be observed through comparisons performed under similar conditions, i.e., with methods that also use an HMM and the same face databases.
In the case of noisy images, the research presented in [18] used the third level of the WT in conjunction with an HMM, and several tests were performed on the ORL database. The authors used white Gaussian noise with zero mean and a maximum variance of 0.1 to examine the efficiency of their system. Fig. 7 shows a comparison between the results reported in [18] and the results of the proposed work with and without a median filter. From Fig. 7, it can be observed that the proposed work achieves the best results, whether a median filter is used or not. This supports the conclusion that Haar is the best wavelet type for noise elimination. Noise may be regarded as a high-frequency component that can be filtered out using a low-pass filter. As mentioned previously, the DWT splits the image into approximation (low-frequency) coefficients and detail (high-frequency) coefficients. The approximation coefficients form the basis of the feature extraction step, while the detail coefficients are ignored. Therefore, the DWT can be considered a good noise eliminator.
It is also instructive to compare the proposed method with other HMM face recognition work. Table II shows comparative results on the ORL database. The performance reported in [14] is the closest to that of this work; however, there are three major differences between our results and those presented in [14], as explained below: 1. If the database is updated, the entire system must be retrained in [14], whereas the proposed work retrains only the model that requires an update while all other models remain unchanged. 2. In [14], the quantization and weight vectors are chosen experimentally by trial and error, a process that consumes a large amount of time, whereas the proposed work uses a different feature extraction method that requires no additional time.
3. For noisy images, the proposed work outperforms the method introduced in [14] owing to its use of the DWT, which is an efficient noise eliminator. Fig. 8 shows comparative results for the two types of noise.
Fig. 8. Comparative results between the proposed work and the method employed in [14] on noisy test images: (a) impulse noise, and (b) white Gaussian noise with zero mean.
The wide margin between the two curves in both cases of Fig. 8 demonstrates the clear superiority of the proposed work as a face recognition system in noisy environments.
Some HMM approaches employ the Yale face database; therefore, a comparison is made in Table III to show the effectiveness of the proposed algorithm on the Yale database. As can be observed in Table III, the proposed work achieves a 100% recognition rate with five training images, whereas the other methods reach this rate only with six training images. Additionally, the proposed work is more efficient, consuming fewer processor cycles and less memory owing to its use of only two HMM states, compared with the other methods, which use seven or nine states.
To the best of our knowledge, the EURECOM Kinect database has not yet been used in HMM face recognition, as it has only recently gained recognition in the face recognition field. In [29], the authors presented the EURECOM Kinect database and employed four face recognition methods: PCA, LBP, the scale-invariant feature transform (SIFT), and local Gabor binary patterns (LGBP). Table IV shows a comparison with the results reported in [29] for the 2D RGB images of session 2. In [29], only the recognition rates are listed and the processing times are unrecorded. Based on the available information, therefore, the proposed algorithm outperforms the methods used in [29], achieving the maximum recognition rate (100%), as is clear in Table IV. The training time for one image from the EURECOM Kinect database is 0.017 s, including the median filtering and the DWT, and the elapsed time required to test a single image is 0.07 s.
The experiments were performed using MATLAB (R2016a) with the Bayes Net Toolbox for the HMM [32]. The operating system was Windows 7 (64-bit), running on a machine with a 2.53 GHz dual-core processor and 4 GB of RAM.
VI. CONCLUSION
A face recognition method for noisy images is presented using a median filter and a multilevel DWT in combination with the HMM. The novelty of the algorithm lies in the use of a two-state ergodic HMM, a unique model in the face recognition field that has not yet been used by other researchers. Moreover, the use of the median filter together with a multilevel DWT achieves a 100% recognition rate despite noise densities of 0.15, 0.4, and 0.25 on the images of the ORL, Yale, and EURECOM Kinect databases, respectively. In addition to noise elimination, a large reduction in image size is achieved using the third and fourth levels of the DWT. Processing time and memory utilization are greatly reduced thanks to the reduction in image size and the number of states, which accounts for the simplicity and low complexity of the system. Comparative results show the superiority of the proposed technique over others.
In the future, a discrete HMM will be used instead of the continuous model; we also plan to use only one state of HMM.