An Analysis and Comparison of Various Missing Data Imputation Tools and Techniques

Abstract: Missing and noisy data are common in data sets, and it is important to determine their effect on classification accuracy. In statistics, missing data (or missing values) occur when no value is stored for a field in a data set. Missing data are a common occurrence and can have a significant effect on the conclusions drawn from the data, whether the data are supplied directly or taken from data warehouses. Missing data reduce the representativeness of the sample and can therefore distort inferences and conclusions about the population. This study measures the effect of missing values on the Naïve Bayes algorithm using two data sets, lymphoma and breast cancer. Values are skipped in a certain order in both data sets, the accuracy is computed, and the results are compared in a table. Naïve Bayes is based on a probabilistic model.


III. METHODOLOGY
The naïve Bayesian classifier predicts the class label of a tuple under the assumption that the values of the attributes are conditionally independent of one another given the class. The naïve Bayesian classifier is reported in [6] to be among the most accurate classifiers. Bayesian classifiers are statistical classifiers: for each new sample they provide the probability that the sample belongs to a given class, for example the sample john (age=27, income=high, student=no, credit_rating=fair). Bayesian belief networks specify joint conditional probability distributions, which allow class conditional independencies to be defined between subsets of variables, and they provide a graphical model of relationships on which learning can be performed. A belief network has two components: a directed acyclic graph and a set of conditional probability tables. Each node in the directed acyclic graph represents a random variable, which can be discrete- or continuous-valued. The variables may be actual attributes given in the data or "hidden variables" believed to form a relationship. Each arc in the directed acyclic graph represents a probabilistic dependence. If an arc is drawn from a node A to a node B, then A is a parent or immediate predecessor of B, and B is a descendant of A. Every variable in the graph is conditionally independent of its nondescendants, given its parents [12]. The reasons for choosing a Bayesian network among the various classification techniques are:
1. The probabilistic nature of the Bayesian network: it calculates explicit probabilities for hypotheses, which is among the most practical approaches to certain types of problems.
2. The incremental nature of the Bayesian network: each training example can incrementally increase or decrease the probability that a hypothesis is correct.
3. Prior knowledge of various kinds can be combined with observed data.
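The incremental property in item 2 can be illustrated with a minimal sketch. It assumes a simple running count of class labels; the class name `IncrementalPrior` and the toy labels are hypothetical and are not taken from the study.

```python
from collections import Counter


class IncrementalPrior:
    """Running class counts; each new example shifts the estimated P(h)."""

    def __init__(self):
        self.counts = Counter()
        self.total = 0

    def update(self, label):
        # One training example raises the count of its own class only.
        self.counts[label] += 1
        self.total += 1

    def prob(self, label):
        # Relative frequency of `label` among the examples seen so far.
        return self.counts[label] / self.total if self.total else 0.0


prior = IncrementalPrior()
history = []
for label in ["yes", "no", "yes", "yes"]:  # hypothetical label stream
    prior.update(label)
    history.append(round(prior.prob("yes"), 3))

print(history)
```

Each example moves the estimate for the hypothesis "yes" up or down without retraining from scratch.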
Assume that there is a sample A. The probability of a hypothesis h given A, P(h|A), follows Bayes' theorem, stated mathematically as the following equation:

$$P(h \mid A) = \frac{P(A \mid h)\,P(h)}{P(A)}$$
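As a small numerical sketch of Bayes' theorem for a hypothesis h and a sample A, the snippet below computes P(h|A) from P(h), P(A|h) and P(A). The function name `posterior` and all probabilities are illustrative assumptions, not values from the study.

```python
def posterior(prior_h, likelihood_a_given_h, evidence_a):
    """Bayes' theorem: P(h|A) = P(A|h) * P(h) / P(A)."""
    if evidence_a <= 0:
        raise ValueError("P(A) must be positive")
    return likelihood_a_given_h * prior_h / evidence_a


# Illustrative numbers only (assumed, not from the paper).
p_h = 0.3               # P(h)
p_a_given_h = 0.8       # P(A|h)
p_a_given_not_h = 0.2   # P(A|not h)

# P(A) by the law of total probability over h and not-h.
p_a = p_a_given_h * p_h + p_a_given_not_h * (1 - p_h)

p_h_given_a = posterior(p_h, p_a_given_h, p_a)
print(round(p_h_given_a, 4))
```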
During the training of a belief network, several scenarios are possible. The network topology may be constructed by human experts or inferred from the data, and the network variables may be observable or hidden in some of the training tuples. The hidden data case is also referred to as missing values. Several algorithms exist for learning the missing values from the training data. The problem is one of discrete optimization, and Bayesian classification is considered the best choice for solving it. When the pattern of missingness is known and the variables of the training tuples are observable, training is straightforward. It consists of computing the conditional probability table (CPT) entries, similar to computing the probabilities involved in naïve Bayesian classification [11]. The naïve Bayesian classifier is a special case of a Bayesian network in which the attributes are assumed to be conditionally independent.
The data sets were taken from the UCI repository. One data set is breast cancer and the other is lymphography. The first data set has 699 rows and 9 attributes, whereas the other has 142 rows and 18 attributes. The data sets were converted to CSV (comma-delimited) files and imported through phpMyAdmin as RDBMS tables. The IDE used was NetBeans 8.0 with JDK 6.0. The language used is JSP (JavaServer Pages).
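The counting of conditional probabilities for the naïve Bayesian case can be sketched as follows. This is a minimal illustration in Python rather than the JSP implementation used in the study. It assumes that a missing value is marked with `None` and is simply skipped in both training and prediction, and it adds Laplace smoothing, which the text does not mention. The toy rows reuse the attribute names of the john example, but the data are hypothetical.

```python
from collections import Counter, defaultdict

MISSING = None  # assumed marker for a missing attribute value


def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attr index, class) -> value counts
    domains = defaultdict(set)           # attr index -> observed values
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            if value is MISSING:
                continue  # a missing cell contributes nothing to the counts
            value_counts[(i, label)][value] += 1
            domains[i].add(value)
    return class_counts, value_counts, domains


def predict(model, row):
    """Return the class with the highest naive Bayes score for `row`."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, n in class_counts.items():
        score = n / total  # class prior
        for i, value in enumerate(row):
            if value is MISSING:
                continue  # skip attributes with no observed value
            counts = value_counts[(i, label)]
            k = len(domains[i] | {value})  # number of distinct values
            # Laplace-smoothed estimate of P(value | label); an assumption.
            score *= (counts[value] + 1) / (sum(counts.values()) + k)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data: (age, income, student, credit_rating)
rows = [
    ("youth", "high", "no", "fair"),
    ("youth", "high", "no", "excellent"),
    ("middle", "high", "no", "fair"),
    ("senior", "medium", "no", "fair"),
    ("senior", "low", "yes", "fair"),
    ("senior", "low", "yes", None),
    ("middle", None, "yes", "excellent"),
    ("youth", "medium", "no", "fair"),
]
labels = ["no", "no", "yes", "yes", "yes", "no", "yes", "no"]

model = train(rows, labels)
john = ("youth", "high", "no", "fair")
print(predict(model, john))
print(predict(model, ("middle", None, "yes", "fair")))
```

Skipping a missing cell amounts to leaving that attribute's factor out of the product, so a tuple with gaps is still classified from the attributes that are present.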
VI. RESULT ANALYSIS
The data sets are reduced by 5%, 10%, 20%, 30% and 45%. For the 5% reduction, rows are deleted; for reductions greater than 20%, some attributes are deleted. Both data sets showed that the accuracy initially increases and starts decreasing after a certain point. Accuracy is calculated as the number of values whose predicted label matches the correct label, divided by the total number of values.

VII. CONCLUSION
Imputing missing values is one of the major tasks of data pre-processing in data mining. Simply removing the records that contain missing values from the original data sets can bring more problems than solutions. A suitable method for imputing missing values can help to produce good-quality data sets for better analysis of trials. Mean/mode imputation, the fuzzy unordered rule induction algorithm, decision tree imputation and other machine learning algorithms are used for imputing the missing values, and the final data sets are classified using K-means clustering. The experiment shows that performance improves when the fuzzy unordered rule induction algorithm is used to predict missing attribute values. According to the results and observations, the accuracy initially increases up to a certain point and then decreases gradually. Because the Naïve Bayes algorithm is based on a probabilistic model and all values are considered during testing, removing some values can have a positive impact, but only up to a limit.
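The accuracy measure and the row-reduction levels described in the result analysis can be sketched as below. The helper names `accuracy` and `drop_rows` are hypothetical, and random selection of the rows to remove is an assumption: the study states only that values are skipped in a certain order.

```python
import random


def accuracy(predicted, actual):
    """Fraction of predictions that match the correct label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("no labels to compare")
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


def drop_rows(rows, fraction, seed=0):
    """Return a copy of `rows` with about `fraction` of them removed.

    Rows are chosen at random with a fixed seed (an assumption).
    """
    rng = random.Random(seed)
    keep = len(rows) - int(round(fraction * len(rows)))
    kept_indices = sorted(rng.sample(range(len(rows)), keep))
    return [rows[i] for i in kept_indices]


rows = list(range(100))  # stand-in for 100 data rows
sizes = {f: len(drop_rows(rows, f)) for f in (0.05, 0.10, 0.20, 0.30, 0.45)}
print(sizes)

score = accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"])
print(score)
```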