A Novel Pose and Light Invariant Face Recognition System in Videos Using Advanced Local Directional Patterns

Abstract—Face recognition systems have been widely adopted in a wide range of applications, from mobile authentication to attendance systems, owing to the availability of cameras on most devices and the ease of developing and deploying such systems. The accuracy of many modern commercial and research face recognition systems is on par with other biometrics such as fingerprints. Alongside still-image face biometrics, video based face recognition is gaining popularity in applications ranging from content serving to surveillance. Such systems are affected by uncontrolled environments, pose variation, light intensity variation, motion artefacts, blurring and high frame rates; existing isolated facial biometric techniques therefore do not transfer well to video based face recognition. Video systems are computationally expensive and demand a recognition technique that is fast as well as accurate, because detection and recognition must be performed on every frame. Although past research has proposed several techniques for video based face recognition, none is yet as widely standardized and accepted as isolated face biometrics. This paper proposes a novel technique for real time face recognition using the local directional pattern (LDP). Our method selects a frame as the reference frame and extracts the face in it, which is recognized by matching against the database. Faces in subsequent frames are matched with the reference frame only, reducing the number of comparisons significantly. When the face in a video frame does not match the current reference frame, it is matched against the database and the recognized face becomes the new reference face. We validate our technique by constructing random one-minute video sequences at 2 fps from the Yale face dataset.
The technique is validated against two popular video based face recognition techniques, the Ensemble based technique and the Exemplar based technique, and against a traditional isolated face recognition technique in the form of a PCA based face biometric. Results show that the proposed technique outperforms its peers in both recognition speed and recognition accuracy.


Fig 1: A Generic Face Recognition System
This method is capable of detecting a face extremely fast by combining the results of many weak classifiers that classify objects in a scene as face or non-face from a learning dataset; the technique also lends itself to parallel processing like few before it. Depth images can eliminate the background with minimal computational overhead, and efficient face detection and recognition techniques have been proposed that combine depth and colour images [10]. Once a face is localized and segmented from the rest of the image, the next major task is to represent the face with high dimensional features. Gabor and LDA based techniques [11] have been used for their efficient and invariant face representation properties [12], and the Local Binary Pattern (LBP) face descriptor has gained popularity over the last decade. Although several variations of LBP have been used successfully for face recognition, the family has now been adopted extensively for facial expression recognition, owing to the dependency of facial expressions on local textures and the ability of LBP based techniques to distinguish local texture variations. Several studies have nevertheless shown that LBP remains a good candidate for face recognition; since expression independent recognition is one of the major challenges in any real time system, LBP can be used effectively. The local derivative pattern (LDP) has been shown to be a better model than LBP in the face recognition context, and an LDP based face descriptor was proposed by Zhang [13]. Zhang's work compared various orders of binary and derivative pattern representation and their corresponding accuracy; moderate accuracy was observed on the constrained FERET face database for the local derivative pattern. The local directional number pattern (LDN) [14] is another variant of the local binary pattern that encodes local texture variations using directional numbers and signs.
The extracted features of a given face image are then matched against the features of the images in the database. Recognizing the class or label associated with a high dimensional feature vector is also known as a classification problem. Various classifiers have been proposed for face recognition systems; Convolutional Neural Networks [15], distance based classifiers [16], SVM [17] and fuzzy logic [18] are among the most widely used.
Increased focus on camera networks and surveillance systems in the last few years has led to increased work on video face recognition. Along with the traditional challenges of face recognition, video based face recognition must also deal with temporal and motion variations [19].
In the related work section we investigate the issues and challenges of real time face recognition in video in more detail and elaborate our approach, defining the problem by comparing past solutions to these challenges.

II. RELATED WORKS
Gaurav Agarwal et al. [20] argue that comparing the feature space distance of the test frame with a gallery of face frames is not sufficient, owing to the underlying subspace dependency. They therefore use an ARMA based model to describe the features of a set of face image sequences corresponding to a face; ARMA captures the correlation of face motion better than an Eigenspace model. Hadid's comprehensive comparison of image based and video based face recognition shows that video based techniques are less sensitive to image quality than static image based techniques [21]. That work also shows that video based face recognition depends more on the length of the sequence in which a face appears, and that a face can be represented effectively by a spatio-temporal model. This line of thought provides a theoretical basis for using a gallery face video sequence, rather than an individual face image, as the training model for a face. The Exemplar based method [22] is one such technique, providing a relational model between a gallery of face sequences and a real time video. Matching a set of video frames containing a particular face against database face video sequences is explored further by Rowden et al. in their work on video-to-video face matching [23]. That work presents another novel approach: it first locates all the unique faces present in the video along with the spatio-temporal information of their occurrence, and these faces are then used as references to track the faces in consecutive frames. As the technique requires a priori knowledge of the whole video, it is not applicable to live video. An important observation can nevertheless be drawn from it: video based face matching can be modelled as still image face matching in a single frame, with subsequent matching performed against already recognized faces.
If a face is not similar enough to an already matched face, a new face recognition event against the database can be triggered. Most of the techniques mentioned above stress one common problem in real time face recognition: reducing the computational complexity. Various methods have been proposed to address it. The popularity of GPU programming and state of the art parallel programming has made it possible to model face recognition with GPU based techniques; CUDA based face recognition [24] is an important milestone in this direction. Real time face recognition has been gaining popularity in two major application areas: face based attendance systems and face recognition in surveillance. Few empirical results are available for face based attendance systems. Face recognition in surveillance, on the other hand, has several commercial and semi-commercial grade applications and methods. A surveillance system adds another important complexity, multiple camera frames, to an already challenging unconstrained face recognition problem. An et al. [25] used a dynamic Bayesian network to resolve multi-camera frame dependency in the face recognition problem. The interdependency of faces in consecutive sequences and the low quality of face data in surveillance systems are addressed through an ensemble based technique [26].

Contribution:
We propose a face recognition technique capable of recognizing faces in video sequences with superior accuracy and speed. Unlike past works, which are mainly either set based or sequence based, we split our technique into two distinct phases: global face recognition and local face similarity matching. In local similarity matching, whenever a face appears in a video sequence it is checked for similarity against the current reference face extracted from the previous sequences. When the reference face is null (the first video frame containing a face), the face is matched against the complete dataset to recognize it, and this face along with its facial class (person ID) becomes the reference frame. As subsequent video frames do not differ significantly from their predecessors, the probability of the same face appearing in consecutive frames is extremely high, so subsequent faces are matched with the current reference frame. If the similarity distances of the faces in the subsequent frames are small, those faces are identified with the current reference face label. When the reference distance varies significantly, the current reference set is cleared and the new face is matched against the database faces; we use a feed forward neural network for this database matching. This strategy retains the appeal of classical isolated face recognition techniques within video based facial recognition. Our system targets personalized applications such as face based content serving, locking a PC when the user leaves, and face based USB access. In these applications every frame has at most one face, the number of users is limited, the face image is of high quality, and recognition must be fast enough to trigger an event on the presence or absence of a particular face in the live feed.
Document security using face recognition [27] is one such application of interest. Hence our first contribution is the design of a face recognition system that supports specific event triggers. Our method first searches the ensemble of LDP features extracted from the Eigen vectors of previously detected faces and only then falls back to a database search, which reduces the computational complexity significantly. Our contribution can therefore be summarized as solving the pose and scale variant face recognition problem, and overcoming computational complexity through ensemble based learning, by converting video based face recognition into a combination of local facial similarity matching and global face recognition. The overall system is presented in detail in the methodology section.
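The two phase strategy described above can be sketched as follows. This is a minimal illustration, assuming a generic `extract` feature function, a Euclidean distance, and a hypothetical threshold value; the paper's actual implementation uses a feed forward neural network for the database match rather than a nearest neighbour search.

```python
import numpy as np

def recognize_stream(frames, db_features, db_labels, extract, threshold=0.25):
    """
    Two-phase recognition: match each frame's face against the current
    reference face first (local matching); fall back to a full database
    search (global recognition) only when the similarity test fails.
    `extract` maps a frame to a feature vector (e.g. an LDP histogram).
    """
    ref_feat, ref_label = None, None
    labels = []
    for frame in frames:
        feat = extract(frame)
        if ref_feat is not None and np.linalg.norm(feat - ref_feat) < threshold:
            labels.append(ref_label)            # local match: reuse reference label
        else:
            dists = np.linalg.norm(db_features - feat, axis=1)
            ref_label = db_labels[int(np.argmin(dists))]  # global recognition
            labels.append(ref_label)
        ref_feat = feat                          # newest face becomes the reference
    return labels
```

Most frames take the cheap local branch; the expensive database search runs only when the face in the frame drifts away from the reference.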

III. METHODOLOGY

Fig 2: Training Phase
We divide the entire process into two distinct phases: a training phase and a testing phase. The complete training phase is shown in figure 2. In training, multiple face sequences of a person are extracted from the database, the Local Directional Pattern is obtained from each sequence, and the LDP histogram is extracted as the feature vector. Feature vectors are stored in the database along with the face class. LDP creates an LDP face by calculating local derivatives: LDP features are extracted by applying Kirsch masks to each 3x3 block of pixels and replacing the pixel value with the corresponding binary code. We use the second order LDP defined by equation (8) of [13], obtained by concatenating all first order LDPs given by equations (2) to (5) of [13]. The testing process is presented in figure 3. As we learnt from the related work section, video based face recognition can follow two different approaches: a) sequence based recognition and b) set based recognition. In sequence based recognition, once a frame is captured, face detection is triggered, features are extracted from the detected face, and the features are classified against the database. This is the same as image based face recognition, but applied to every frame. The method is not very popular in video face recognition because it fails to utilize spatio-temporal information from previous frames: in a video sequence, if a face appears at a point P(X,Y) in frame n, it will appear in many subsequent frames n+1, n+2, ..., N at positions P(X±δx, Y±δy). This property is extremely important in any video processing task, including object tracking and motion detection.
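A minimal sketch of the LDP feature extraction step, assuming the common Kirsch mask formulation in which the k strongest directional responses form the binary code (k = 3 is an illustrative choice, not a value taken from the paper):

```python
import numpy as np

# Eight 3x3 Kirsch masks (east, north-east, ..., south-east).
KIRSCH = [
    np.array([[-3, -3, 5], [-3, 0, 5], [-3, -3, 5]]),   # E
    np.array([[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]]),   # NE
    np.array([[5, 5, 5], [-3, 0, -3], [-3, -3, -3]]),   # N
    np.array([[5, 5, -3], [5, 0, -3], [-3, -3, -3]]),   # NW
    np.array([[5, -3, -3], [5, 0, -3], [5, -3, -3]]),   # W
    np.array([[-3, -3, -3], [5, 0, -3], [5, 5, -3]]),   # SW
    np.array([[-3, -3, -3], [-3, 0, -3], [5, 5, 5]]),   # S
    np.array([[-3, -3, -3], [-3, 0, 5], [-3, 5, 5]]),   # SE
]

def ldp_code(patch, k=3):
    """LDP code of a 3x3 patch: set the k strongest Kirsch responses to 1."""
    responses = np.array([np.sum(patch * m) for m in KIRSCH])
    top = np.argsort(np.abs(responses))[-k:]    # indices of the k largest responses
    code = 0
    for i in top:
        code |= 1 << int(i)
    return code

def ldp_histogram(image, k=3):
    """Histogram of LDP codes over all interior pixels; used as the face feature."""
    h, w = image.shape
    hist = np.zeros(256, dtype=np.float64)
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            hist[ldp_code(image[r-1:r+2, c-1:c+2], k)] += 1
    return hist / hist.sum()                    # normalize to a distribution
```

The normalized 256-bin histogram serves as the per-face feature vector that is stored in the database alongside the face class.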
In order to leverage this spatio-temporal relationship, most state of the art video based face recognition techniques use set based recognition, in which a set of consecutive face images is matched against a test set of consecutive face images so as to take temporal dependency into account. The matching algorithm sets a spatial boundary for detection: if the known set of faces appears in the top left corner of the screen, the test set cannot appear in the bottom right if the sets are continuous in time. As this relationship does not hold for sequences that lack temporal continuity, an update algorithm is necessary to keep refreshing the reference spatio-temporal space. Defining the relationship between the face sequences within a set then becomes the next big challenge. Model based representations of such a set have been proposed (for example [20]), arguing that sequences in a face set preserve spatial correlation between the face geometry points. However, this belief rests on the assumption of a pose consistent face within a sequence. In a real scenario a user may present different poses or postures during face acquisition, so if the face orientation, emotion, posture or alignment differs, the model fails because its underlying assumption fails. Addressing this is the most important contribution of our paper: our work eliminates the need for geometric correlation among the feature points in a set of reference faces by introducing a local matching technique that uses only one face as the training face. Once a face in a video frame is recognized as a database face, subsequent faces are checked only for similarity with this face, eliminating the need to trigger a full matching cycle for every frame.
When the spatial relationship does not hold within a set, the temporal dependency automatically becomes obsolete: the reference is then a single face that either belongs to a face class entirely different from the one in the current sequence or represents the same face class. One-to-one similarity measurement greatly improves the recognition accuracy, which otherwise suffers from pose, posture and intensity variation in a video sequence.
Combining the principal components of the sequences in the reference set into a single vector can therefore provide a reference feature for the set. Based on this idea we obtain an Eigen face of the faces in a sequence. Texture features extracted from the Eigen face effectively represent a pose independent face model, and since LDP is by definition nearly colour invariant, the descriptor automatically becomes a pose, posture and colour invariant descriptor, one of the most essential properties for feature representation of a real time video face sequence.
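The Eigen face of a sequence can be sketched as follows. This assumes the standard PCA construction, taking the leading singular vector of the mean-centred, flattened faces; the paper does not spell out its exact computation, so this is an illustrative reading.

```python
import numpy as np

def sequence_eigenface(faces):
    """
    Eigen face of a short face sequence: flatten each frame, mean-centre,
    and take the leading principal component (first right singular vector
    of the centred data matrix), reshaped back to image size.
    """
    h, w = faces[0].shape
    X = np.stack([f.ravel().astype(np.float64) for f in faces])  # (n, h*w)
    X -= X.mean(axis=0)
    # The leading right singular vector spans the dominant variation mode.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[0].reshape(h, w)
```

The LDP histogram of this single Eigen face then stands in for the whole sequence, which is what makes the per-set feature pose tolerant.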
We obtain such a sequence from each user who will use the system. When the live camera starts, the system acquires a few subsequent frames, an Eigen face is extracted from this test sequence, and the LDP features extracted from the Eigen face are classified against the LDP features of all reference training sets.
Different classifiers can be used, as discussed in the related work section. However, as the feature vectors are not linearly separable, owing to the complexity of the sets that generate them, distance based classifiers fail. Support vector machines and neural networks with optimization are the most suitable design choices for this work. Owing to the exponential growth of SVM complexity with the number of classes, SVM is not well suited to a real time face recognition problem in which recognition speed is as important as accuracy.
Once the current sequence is recognized as one of the learnt classes, it is used as the reference set. When a new frame is captured and a face is tracked, its spatial correlation with the recognized set is checked. If the face appears at a considerably large distance in 2D space from the recognized sequence, the recognized sequence is discarded and a new sequence is started from the current face. If the face is located close enough in space to the recognized face sequence, the new frame is added to the recognized face sequence and the oldest frame is removed.
The new face sequence may belong to the same person whose sequence is already recognized or to a different user. By comparing the Euclidean distance between the Eigen vectors of the recognized set and the test face and thresholding it, we can determine whether this face belongs to the recognized set. If the new sequence is classified as the person of the recognized set, the recognized set sequences are replaced by the new sequence; otherwise the new sequence is classified again. If the set is recognized as the same user as the first recognized set, the first recognized set is replaced by the new set; otherwise the new set is kept in memory as another user's sequence. This process continues. Our work thus takes into account both spatio-temporal correlation and pose-posture variation in a live video.
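The reference set update decision above, combining the spatial gate with the Euclidean feature threshold, can be sketched as below. The threshold values and the `classify` callback are illustrative assumptions, not values from the paper.

```python
import numpy as np

def update_recognized_set(ref, new_face_pos, new_vec, classify,
                          space_thresh=50.0, feat_thresh=0.3):
    """
    ref: dict with 'pos' (2D face location), 'vec' (eigen-feature) and 'label'.
    A face far away in the frame breaks spatial continuity and starts a fresh
    sequence; a nearby face is accepted as the same user only when its
    eigen-feature is also close to the reference.
    """
    if np.linalg.norm(np.asarray(new_face_pos) - ref['pos']) > space_thresh:
        label = classify(new_vec)       # spatial continuity broken: re-recognize
    elif np.linalg.norm(new_vec - ref['vec']) < feat_thresh:
        label = ref['label']            # same user: refresh the reference set
    else:
        label = classify(new_vec)       # different user at a nearby location
    return {'pos': np.asarray(new_face_pos, dtype=float),
            'vec': new_vec, 'label': label}
```

Either failure path triggers a global recognition event, while the common case reuses the already recognized label at negligible cost.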
We analyze the results in detail in the next section and compare the technique with the current state of the art.

IV. RESULT ANALYSIS

Dataset
We work with the Yale face dataset, version 1 and version B. First, individual faces are extracted from the dataset and a set of unique persons is created. The set is then divided into training and testing sets by separating five sequences of each face into training and testing sets.
A runtime video sequence is created during training and testing by randomly extracting faces from the training and testing sets respectively. We use a Gaussian distribution to pick faces from the database so that all faces appear with nearly equal frequency (the standard deviation of appearance counts approaches zero). Once a face ID is generated, a sequence of between two and five faces is extracted, which ensures that the video contains at least two frames in which each face appears individually. The video is generated at runtime for an overall period of one minute at a 2 fps rate, then saved and rendered as an AVI file.
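The runtime test video construction can be sketched as follows. For simplicity this sketch draws persons uniformly at random rather than with the Gaussian scheme described above, and the function and parameter names are assumptions for illustration only.

```python
import random

def build_test_sequence(face_sets, total_frames=120, seed=7):
    """
    face_sets: dict mapping person id -> list of face images for that person.
    Emits (person_id, face) pairs: a random person is drawn, then 2-5 of
    that person's faces are appended in a run, until the sequence covers
    one minute at 2 fps (120 frames).
    """
    rng = random.Random(seed)
    ids = list(face_sets)
    sequence = []
    while len(sequence) < total_frames:
        pid = rng.choice(ids)
        run = rng.randint(2, 5)         # 2-5 consecutive frames per person
        faces = face_sets[pid]
        for _ in range(run):
            sequence.append((pid, rng.choice(faces)))
    return sequence[:total_frames]
```

The 2-5 frame runs reproduce the property that every face appears in at least two consecutive frames, which the local matching phase relies on.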

Apparatus
Five Yoga devices with Intel 4th generation Core i7 processors and 8 GB RAM, running the Windows 10 operating system, were used for the tests. The methods were tested in MATLAB 14.

Design and Procedure
We compared our technique with the PCA based real time face recognition of Raikoti [44], the Exemplar based technique [22], and the Ensemble based technique [26]. Raikoti's method is sequence based recognition, while the Exemplar based and Ensemble based techniques are set based recognition.
All methods were implemented in Matlab to compensate for performance variations across programming languages. We analyzed the techniques for a) time complexity, b) recognition rate, and c) FAR and FRR.

Tests
We first created a database of two users with five sets of training samples per user. Each training set contained ten faces, sampled at 0.5 fps to incorporate sufficient expression and pose variation. The result is presented in figure 4.
Increasing the number of classes expectedly reduced accuracy; however, the proposed system's accuracy remained better than that of all the other techniques. As PCA does not take spatio-temporal variations into account, it was the least efficient of all. The Exemplar based technique, on the other hand, lacks a dynamic update, so different expressions resulted in a poor recognition rate. Next we analyzed the effect of training on overall accuracy, testing the system for five users. Keeping the number of users constant, we varied the number of training sets from two to ten, with each set containing ten instances. Results are shown in figure 6. PCA performed better than the other methods for small numbers of training samples, and the proposed system performed poorly with little training data. Almost all methods showed optimum accuracy at six training sets; increasing the number of training samples beyond that reduced accuracy.
False acceptance and rejection rates are among the most important factors in determining the acceptability of any biometric system. The false acceptance rate (FAR) is the ratio of the number of times an impostor is incorrectly accepted as the desired user to the total number of tests; the false rejection rate (FRR) is the ratio of the number of times a correct user is rejected by the system to the total number of test instances. For a good biometric system it is desirable to have a higher FRR than FAR.

We measured FAR and FRR over a two minute test with five sequences per test set for a flat ten user database. We started with five sequences per training set and five training sets, giving a low FAR, and gradually decreased the training data to increase the FAR. Figure 7 presents the FAR versus FRR graph of all the methods. The proposed method clearly shows better FAR versus FRR performance: due to local crosschecking of detected faces, it achieves a very good rejection ratio. As PCA does not capture correlation between the sequences in a test set, its performance was the least acceptable among the methods. We also observed that as FAR increased, FRR reduced; when FAR is close to one (almost 100% false acceptance), the rejection rate is close to zero.

As one of the primary objectives of this work was a fast, real time system for a limited number of users, a speed test is an essential component of the analysis. The Exemplar based method does not perform face detection in the recognition phase; it maps the generated Exemplars in the database directly onto the video sequence. All other techniques, including the proposed one, detect and segment faces in every test sequence. We compare the performance of all methods for varying numbers of users. Time complexity was calculated as the absolute average drop in frame rate when the capture frame rate is set to 2 fps.
This drop was calculated as the percentage difference between the actual and the average frames per second. The results are presented in figure 8. PCA was the fastest for the two user scenario, and the frame rate of every method dropped as the number of users grew; however, PCA became the computationally most expensive method in the ten user case. The proposed technique showed the smallest growth in computation time of all the methods tested. Dividing the problem space into smaller problems and solving for fewer users each time helped the proposed technique offer a better frame rate.
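The FAR/FRR definitions and the frame rate drop measure used above can be expressed directly; the helper names below are illustrative, not taken from the paper's implementation.

```python
def far_frr(results):
    """
    results: list of (is_genuine: bool, accepted: bool) trial outcomes.
    FAR = impostor trials wrongly accepted / total impostor trials.
    FRR = genuine trials wrongly rejected / total genuine trials.
    """
    imp = [a for g, a in results if not g]
    gen = [a for g, a in results if g]
    far = sum(imp) / len(imp) if imp else 0.0
    frr = sum(1 for a in gen if not a) / len(gen) if gen else 0.0
    return far, frr

def frame_rate_drop(capture_fps, measured_fps):
    """Percentage drop in frame rate relative to the capture rate (2 fps here)."""
    return 100.0 * (capture_fps - measured_fps) / capture_fps
```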
V. CONCLUSION

Face biometrics is one of the most widely adopted forms of low cost yet efficient biometric systems, used everywhere from mobile authentication to attendance systems in small organizations. However, most past research and commercial products in facial recognition focus on authentication and verification in still facial images. With the growing adoption of live video streaming, surveillance and video content streaming, facial recognition in video sequences is gaining popularity. Most popular facial biometric systems for still images can also be applied to video on a per-frame basis, but such a design requires every acquired frame to be matched against the set of database faces, which significantly reduces the frame rate (as we saw in figure 8) because the inter-frame correlation of the faces is not exploited. In this work we investigated the existing state of the art in video based face recognition and identified the design issues of these techniques. We proposed a unique local directional pattern based facial recognition system for video sequences that provides better accuracy than both ensemble based and exemplar based video face recognition (74% for 10 faces at 30 fps, against 72% for the exemplar based method and 66% for the ensemble based technique) and is also more efficient in time complexity (an improvement of about 1.7 fps over the current state of the art). Although the present research is aimed at fast detection of faces in video sequences over a comparatively small dataset, the method provides a good benchmark and a standard platform for video based face recognition.
We believe that better models for the correlation of facial as well as general frame information between frames can improve the performance of the system significantly. Instead of matching a set of sequences against a set of reference sequences, a standalone face model is used to recognize the face in the first reference frame; in consecutive frames, faces are matched only against the recognized face in the reference frame, and a match against the database is triggered only when a face does not match the recognized reference face. Once a new face appears, that frame becomes the new reference frame. Such a design is helpful in scenarios where the number of users in the video sequences is limited; in a complex video frame with multiple users the technique will not hold. Our system is more suitable than the current state of the art in terms of speed and accuracy for a limited database size. However, with the present testing we were unable to determine the effect of motion blur and artefacts on recognition accuracy.