Hand Written Telugu Character Recognition Using Bayesian Classifier

Abstract: Identifying handwritten Telugu characters is a difficult problem in machine vision because of the complex shapes of the individual characters and the size of the Telugu character set. In this paper, an efficient algorithm is presented to identify individual handwritten Telugu characters based on Histogram of Oriented Gradients (HOG) features and Bayesian classification. The proposed system exploits features of the Telugu script to identify handwritten Telugu text efficiently. The recognition rate for Telugu script is 87.5%.

U. Pal and B.B. Chaudhuri [2] suggested an enhanced quadratic classifier for the identification of handwritten numerals in Devnagri, Bangla, Telugu, Oriya, Kannada and Tamil scripts. The primary idea behind automatic character recognition is to first train the machine classifier with instances of the patterns that may occur and their appearance. In OCR these patterns are letters, numerals and some special symbols such as commas and question marks. Using these samples, the machine constructs a prototype or description of each class of characters. During recognition, new character samples are compared to the previously obtained descriptions and assigned to the class that gives the best match. In most commercial systems for character recognition, the training process is performed in advance.
A typical OCR system consists of several components; Figure 2 illustrates a common setup. The first step in the process is to digitize the analog document using an optical scanner. The regions of the document containing text are then located and extracted through a segmentation process.

Figure 2. Components of an OCR system
Each symbol can be identified by comparing the extracted features with the descriptions of the symbol classes created previously in the training phase. Finally, textual information is used to construct the words of the original text. The next section describes these steps and methodologies in detail.
The image produced by the scanning process may contain a certain amount of noise. Characters may be broken or smeared, depending on the resolution of the scanner and the method applied for thresholding. The most widely accepted smoothing method moves a window over the binary image of the character and applies specific rules to the contents of the window.
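The window-based smoothing described above can be sketched as follows. The text does not state the specific rules applied to the window, so this sketch assumes a 3x3 window with a simple majority rule; the function name `smooth_binary` and the threshold of 5 are illustrative assumptions, not the authors' method.

```python
def smooth_binary(image):
    """Smooth a binary image (list of rows of 0/1) with a 3x3 majority rule.

    Assumed rule (not from the paper): a pixel becomes 1 when at least 5 of
    the 9 pixels in its window are 1; pixels outside the image count as 0.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    result = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            ones = 0
            # Count ink pixels inside the 3x3 window centred on (r, c).
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < height and 0 <= cc < width:
                        ones += image[rr][cc]
            result[r][c] = 1 if ones >= 5 else 0
    return result
```

With this rule an isolated ink pixel is removed and a one-pixel hole inside a stroke is filled, which corresponds to the smeared and broken characters mentioned above.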
The aim of feature extraction is to capture the basic properties of the character set, and it is widely acknowledged that this is the most challenging issue in pattern recognition. The most obvious way of describing a character is by its original raster image. Another approach is to compute specific features that still describe the characters but leave out insignificant attributes. Classification is the process of identifying each character and assigning it to the appropriate character class.
OCR has many advantages; in particular, it increases the efficiency and effectiveness of office work. The ability to search through a document instantly is very useful, especially in an office that handles high-volume scanning or a high inflow of documents.
This paper describes a method for recognizing characters in a document image. Section II describes the Telugu character set and existing methods for handwritten character recognition. Section III describes the experimental data and the classifier used. Section IV describes the step-by-step algorithm. Section V reports the final conclusions.

II. TELUGU SCRIPT
There are 18 vowel characters and 36 consonant characters in the Telugu language. Telugu characters and their pronunciations are given by

III. EXPERIMENTAL DATA
The flow chart mainly consists of two parts: training of images and testing of images.

A. Pre-processing:
The basic principle of recognition-based character segmentation is to use a moving window to propose tentative segmentations, which are confirmed (or not) by the classification. Image acquisition and pre-processing are two relatively simple stages, and they are introduced first. Image acquisition is at the image representation level of pattern recognition (PR). It is the process of acquiring a machine representation of the document to be recognized. A digital scanner is used at this stage to acquire 200 dpi, 8-bit gray-level images. Pre-processing is at the image-to-image transformation level and compensates for a poor-quality original and/or poor-quality scanning. The proposed system uses two techniques to enhance the acquired image: binarization and smoothing.
Character segmentation can be performed by dissection or treated as a recognition-based procedure. Dissection means the decomposition of the image into a sequence of sub-images using general features. Each sub-image is treated as a character for recognition. It is worth noting that the classification of characters is carried out at a later stage. Projection analysis, connected component processing, and white space and pitch finding are some of the common dissection techniques used by OCR systems. These techniques are suitable for scripts with large spaces between characters. When dissection is used for cursive scripts, a more intelligent, script-specific dissection method is required, and there is no guarantee that high segmentation accuracy can be achieved.
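As an illustration of projection analysis, one of the dissection techniques listed above, the following minimal sketch splits a binary text-line image at columns that contain no ink. The function name `split_by_projection` and the convention that 1 denotes ink are assumptions made for this example; the text does not specify which dissection technique, if any, the proposed system uses.

```python
def split_by_projection(image):
    """Return (start, end) column ranges of characters in a binary image.

    `image` is a list of rows of 0/1 with 1 = ink. A column whose vertical
    projection (count of ink pixels) is zero is treated as white space
    between characters. `end` is exclusive.
    """
    if not image:
        return []
    width = len(image[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[c] for row in image) for c in range(width)]
    segments = []
    start = None
    for c, count in enumerate(profile):
        if count > 0 and start is None:
            start = c                      # a character begins
        elif count == 0 and start is not None:
            segments.append((start, c))    # a character ends at a gap
            start = None
    if start is not None:
        segments.append((start, width))    # character touching the right edge
    return segments
```

This kind of split only works when characters are separated by empty columns, which is why the text notes that dissection suits scripts with large inter-character spaces.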
The basic principle of recognition-based character segmentation is to use a moving window of variable width to provide tentative segmentations, which are confirmed (or not) by the classification. In systems that use this principle, characters are by-products of character recognition. The main advantage of this approach is that it bypasses serious character segmentation problems. In principle, no segmentation algorithm specific to the script is required, and recognition errors are mainly due to failures in the classification stage. Consequently, more cursive-script OCR systems use this approach to improve recognition accuracy. This approach is also called segmentation-free recognition because of the virtual absence of a character segmentation stage.
Binarization is a special case of thresholding in which the resulting image has only two output states, black or white. It reduces the computational requirements of the system and may enable the removal of some noise. A document can be binarized globally or adaptively. Unless the document is on unevenly coloured paper, global thresholding is sufficient for binarization. Two global thresholding algorithms were studied and implemented.
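The two global thresholding algorithms are not named in the text. As a hedged illustration of global thresholding in general, the sketch below uses Otsu's method, a widely used global technique chosen here as an assumption; the function names `otsu_threshold` and `binarize` are likewise illustrative.

```python
def otsu_threshold(pixels, levels=256):
    """Return the gray level that maximises between-class variance (Otsu).

    `pixels` is a flat list of integer gray levels in [0, levels).
    """
    histogram = [0] * levels
    for p in pixels:
        histogram[p] += 1
    total = len(pixels)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_low, sum_low = 0, 0
    for t in range(levels):
        # Class "low" holds levels <= t, class "high" holds levels > t.
        weight_low += histogram[t]
        sum_low += t * histogram[t]
        weight_high = total - weight_low
        if weight_low == 0 or weight_high == 0:
            continue
        mean_low = sum_low / weight_low
        mean_high = (total_sum - sum_low) / weight_high
        variance = weight_low * weight_high * (mean_low - mean_high) ** 2
        if variance > best_variance:
            best_threshold, best_variance = t, variance
    return best_threshold


def binarize(image, threshold):
    """Map a gray-level image to 0/1, assuming dark ink (1) on light paper (0)."""
    return [[1 if p <= threshold else 0 for p in row] for row in image]
```

A single threshold is computed for the whole page, which is adequate when the paper background is uniform; an adaptive method would instead compute a threshold per neighbourhood.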

B. Feature Extraction: Histogram of Oriented Gradients
HOG [13] is a feature extraction method; its parameters are shown in Table I. The parameters of these features are flexible and are represented without any bias or variance. HOG starts by detecting the boundary or shape of a character and normalizing it to gray-scale. The gradient of the intensity of the detected character is then calculated. Weighted voting is applied to spatial and orientation cells. Normalization over the overall areas is performed using the L2 norm. The extracted features are rescaled to the range [0, 1] using the formula
$$D' = \frac{D - \min}{\max - \min}$$
where D is a row vector containing the features, D' is the rescaled vector, min is the minimum value of D and max is the maximum value in D.
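The HOG steps described above (gradient computation, weighted orientation voting, L2 normalization and rescaling to [0, 1]) can be sketched in simplified form. This is not the full HOG descriptor: it treats the whole image as a single cell, uses 9 unsigned orientation bins as an assumed parameter (Table I is not reproduced here), and omits vote interpolation. All function names are illustrative.

```python
import math


def orientation_histogram(image, bins=9):
    """Magnitude-weighted histogram of unsigned gradient orientations.

    `image` is a list of rows of gray levels. Central differences are used
    for interior pixels; border pixels are skipped for simplicity.
    """
    hist = [0.0] * bins
    bin_width = 180.0 / bins
    for r in range(1, len(image) - 1):
        for c in range(1, len(image[0]) - 1):
            gx = image[r][c + 1] - image[r][c - 1]
            gy = image[r + 1][c] - image[r - 1][c]
            magnitude = math.hypot(gx, gy)
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            # Each pixel votes for its orientation bin, weighted by magnitude.
            hist[int(angle // bin_width) % bins] += magnitude
    return hist


def l2_normalize(vector, eps=1e-12):
    """Divide by the L2 norm; eps avoids division by zero on flat images."""
    norm = math.sqrt(sum(v * v for v in vector)) + eps
    return [v / norm for v in vector]


def min_max_rescale(vector):
    """Rescale to [0, 1] with (D - min) / (max - min)."""
    lo, hi = min(vector), max(vector)
    if hi == lo:
        return [0.0] * len(vector)  # constant vector: no spread to rescale
    return [(v - lo) / (hi - lo) for v in vector]
```

For example, `min_max_rescale(l2_normalize(orientation_histogram(image)))` yields a 9-element feature vector whose smallest entry is 0 and largest is 1.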

BAYESIAN CLASSIFIER
A. Bayesian Network:
Bayesian classification is based on Bayes' theorem. Bayesian classifiers are statistical classifiers that can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Let X be a data tuple and H some hypothesis. According to Bayes' theorem,
$$P(H \mid X) = \frac{P(X \mid H)\,P(H)}{P(X)}$$
Bayesian networks represent a set of variables as nodes on a directed acyclic graph (DAG) and map the conditional independencies of these variables (D. Heckerman, 1995). Bayesian networks bring four advantages as a data modeling tool. Firstly, Bayesian networks are able to handle incomplete or noisy data, which occur very frequently in image analysis (D. Bellot 2002). Secondly, Bayesian networks are able to ascertain causal relationships through conditional independencies, allowing the modeling of relationships between variables. The last advantage is that Bayesian networks are able to incorporate existing knowledge, or previously known data, into their learning, allowing more accurate results.
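To make the use of Bayes' theorem for classification concrete, the following minimal sketch computes P(H | X) for each class H with a Gaussian naive Bayes model. Naive Bayes corresponds to the simplest Bayesian network structure, in which every feature depends only on the class node; the structure of the network actually used in this work is not given in the text, so this choice, the Gaussian likelihood and all names below are assumptions for illustration.

```python
import math
from collections import defaultdict


def train_gaussian_nb(samples, labels):
    """Estimate the prior P(H) and per-feature mean/variance for each class."""
    grouped = defaultdict(list)
    for x, y in zip(samples, labels):
        grouped[y].append(x)
    model = {}
    for label, rows in grouped.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-6  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model


def posteriors(model, x):
    """Return P(H | X = x) for every class H via Bayes' theorem."""
    log_joint = {}
    for label, (prior, means, variances) in model.items():
        # log P(X | H) under independent Gaussian features.
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
            for v, m, var in zip(x, means, variances)
        )
        log_joint[label] = math.log(prior) + log_likelihood
    # P(X) is the sum of P(X | H) P(H) over classes; shift by the max for stability.
    peak = max(log_joint.values())
    unnormalised = {k: math.exp(v - peak) for k, v in log_joint.items()}
    evidence = sum(unnormalised.values())
    return {k: v / evidence for k, v in unnormalised.items()}


def classify(model, x):
    """Assign x to the class with the highest posterior probability."""
    post = posteriors(model, x)
    return max(post, key=post.get)
```

In a recognition setting, each sample would be a rescaled feature vector and each label a character class; the denominator P(X) only normalizes the posteriors and does not affect which class is chosen.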

V. EXPERIMENTAL RESULTS
The figures given below show the procedure of the program and its outputs at certain stages, with explanations.

V. CONCLUSION
This paper suggested a method that applies HOG features and Bayesian network classification to Telugu characters. The experimental results show the performance characteristics of the Bayesian network. Furthermore, the Bayesian network can produce higher recognition rates using the Nearest Neighbourhood method. The average of the standard deviations of the testing recognition rate for the Bayesian-network-based model is less than that of the NN-based classifier. The recognition accuracy was calculated based on the number of positives and negatives in the training and testing datasets.