Prediction of Diabetes Disease Using Classification Data Mining Techniques

Classification is one of the core techniques of data mining and is widely accepted for analyzing large databases. It is applied to problems in many fields of science, business, and industry. Classification works by finding rules that assign data items to classes, and several classification and statistical methods exist. This paper demonstrates the use of a Decision Tree algorithm to classify patient data and predict diabetes. Further, it attempts to develop a Decision Tree algorithm for predicting diabetes in patients.


DATA MINING TECHNIQUES
The aim of any predictive model can be achieved using a number of data mining techniques [7].

Classification: Classification assigns data items to predefined categories and is a supervised learning technique.
The classifier is built from a training set whose class values are known.

Regression:
Regression maps a data item to a real-valued prediction variable [8]. It is used to predict numeric results.

Clustering:
Clustering groups similar data objects by finding similarities in the data.
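
As a minimal sketch of how grouping by similarity can work, the following implements a tiny one-dimensional k-means, one of the clustering methods named in the related works. The function name `kmeans_1d`, the deterministic initialisation, and the glucose readings are assumptions made for illustration; they are not taken from the study.

```python
def kmeans_1d(values, k, iterations=20):
    """Tiny k-means for 1-D data; returns (centroids, assignments)."""
    # Deterministic initialisation (an assumption of this sketch):
    # spread the starting centroids evenly over the sorted data.
    ordered = sorted(values)
    centroids = [ordered[i * (len(ordered) - 1) // max(k - 1, 1)] for i in range(k)]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            new_centroids.append(sum(members) / len(members) if members else centroids[c])
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments


# Hypothetical fasting glucose readings (mg/dL), invented for illustration.
glucose = [85, 90, 92, 88, 150, 160, 155, 148]
centroids, labels = kmeans_1d(glucose, k=2)
print(centroids, labels)
```

On this toy input the low and high readings separate into two groups without any class labels being supplied, which is what distinguishes clustering from classification.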

Association Rule:
Association rule mining is an important data mining tool that finds the most frequent item sets. It discovers patterns in the database based on relationships among the items in transactions [8].
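
To make the idea of frequent item sets concrete, here is a small sketch that counts the support of every item set up to a given size by brute-force enumeration. It is not Apriori or any specific algorithm from the cited works, and the function name `frequent_itemsets` and the patient records are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Return {itemset: support} for itemsets with support >= min_support."""
    counts = Counter()
    for transaction in transactions:
        items = sorted(set(transaction))  # canonical order for tuple keys
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    n = len(transactions)
    # Support = fraction of transactions that contain the itemset.
    return {itemset: c / n for itemset, c in counts.items() if c / n >= min_support}


# Hypothetical patient records, invented for illustration.
records = [
    {"obesity", "high_glucose", "diabetes"},
    {"obesity", "high_glucose"},
    {"high_glucose", "diabetes"},
    {"obesity", "diabetes", "high_glucose"},
]
frequent = frequent_itemsets(records, min_support=0.75)
for itemset, support in sorted(frequent.items()):
    print(itemset, support)
```

Brute-force counting is only practical for small item vocabularies; dedicated algorithms prune candidates whose subsets are already infrequent.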

Prediction:
Prediction is a technique that discovers relationships between independent and dependent variables [6].

2.1.6 Time Series Analysis:
Time series analysis is a statistical technique used to model and explain the dependency of data points on time [9].

2.1.7 Summarization:
Summarization produces abstractions of data. It provides an overview of the data and a set of relevant tasks.

3. RELATED WORKS
Mohd Fauzi bin Othman and Thomas Moh Shan Yau [10] examined the performance of different classification and clustering methods on a large data set. Kawsar Ahmed et al. [11] proposed a system to detect lung cancer in patients. Kaur [12] reviewed six clustering techniques of data mining, including K-means clustering, hierarchical clustering, DBSCAN clustering, OPTICS, and STING. Bharat Chaudharil and Manan Parikh [13] analyzed the performance of the K-Means, hierarchical, and density-based clustering algorithms using the Weka tool. Manish Verma et al. [14] analyzed six clustering techniques in their study: k-Means clustering, hierarchical clustering, DBSCAN clustering, density-based clustering, OPTICS, and the EM algorithm. Shraddha K. Popat et al. [15] surveyed different clustering techniques. P. Thangaraju and B. Deepa [16] surveyed the prevention and detection of skin melanoma in patients using clustering techniques. Khaled Hammouda and Prof. Fakhreddine Karray [17] reviewed four off-line clustering algorithms. Pradeep Rai and Shubha Singh [18] provided a comprehensive review of different clustering techniques of data mining. Amandeep Kaur Mann and Navneet Kaur [19] also reviewed different clustering techniques in data mining. Dr. N. Rajalingam and K. Ranjini [20] compared implementations of hierarchical clustering algorithms, agglomerative and divisive, for various attributes.

4. DATA SET DESCRIPTION
The MV dataset, collected from various districts, is used to predict diabetes using data mining classification techniques. It contains 1024 complete instances with 26 parameters. The data was gathered from answers to questionnaires administered during the research work. The main objective of the questionnaire was to cover a set of parameters for diagnosing diabetes in patients. Table 1 lists the characteristics of the MV dataset, while Table 2 describes the attributes used in the study.

METHODOLOGY
5.1.1 Classification
Classification is categorical: data is classified on the basis of the training set and its class values in order to predict diabetes. Five classifiers are compared:
Multilayer Perceptron (MLP): A commonly used neural network classification algorithm. The MLP architectures used in simulations on the PIDD dataset consist of a three-layer feed-forward neural network with an input, a hidden, and an output layer.
BayesNet: Bayesian networks are used under the assumption that the attributes are nominal and contain no missing values.
JRip: JRip implements RIPPER, a basic and important rule-learning algorithm in data mining classification.
C4.5: A decision tree classifier. To classify a new item, it first creates a decision tree from the training data.
Fuzzy Lattice Reasoning (FLR): This classifier is used for descriptive and decision-making tasks.
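
C4.5 builds its tree by repeatedly choosing the attribute with the best gain ratio, that is, information gain divided by split information. The sketch below computes that criterion for a single nominal attribute. It illustrates the splitting measure only, not the full algorithm with pruning and continuous-attribute handling, and the attribute names and values are hypothetical because the study's attribute table is not reproduced here.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of nominal values."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def gain_ratio(attribute_values, labels):
    """Information gain of a nominal attribute divided by its split information."""
    total = len(labels)
    # Partition the class labels by attribute value.
    partitions = {}
    for value, label in zip(attribute_values, labels):
        partitions.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    # Split information penalises attributes with many distinct values.
    split_info = entropy(attribute_values)
    return gain / split_info if split_info > 0 else 0.0


# Hypothetical attributes and outcomes, invented for illustration.
outcome = ["diabetic", "diabetic", "healthy", "healthy"]
family_history = ["yes", "yes", "no", "no"]   # separates the classes perfectly
regular_exercise = ["yes", "no", "yes", "no"]  # carries no information here

print(gain_ratio(family_history, outcome))
print(gain_ratio(regular_exercise, outcome))
```

An attribute that separates the classes cleanly scores high and would be chosen for the split, while an attribute that leaves each branch as mixed as the parent scores zero.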

6. PERFORMANCE ANALYSIS OF ALGORITHMS
Of the 1024 instances in the MV dataset, 272 were persons with diabetes. The dataset was split into training and test sets as listed in Table 3. Table 4 lists the performance of the classification algorithms on multiple factors. It is evident from Table 4 that BayesNet, MLP, and FLR have the lowest computation time on the MV dataset. A confusion matrix was obtained to calculate the specificity, sensitivity, and accuracy, since a confusion matrix shows how far the predicted results agree with the actual ones. The accuracy results are depicted in Figure 1.
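
The three measures follow directly from the four cells of a binary confusion matrix: sensitivity = TP / (TP + FN), specificity = TN / (TN + FP), and accuracy = (TP + TN) / (TP + FP + TN + FN). The sketch below shows the calculation. The function names and the example labels are assumptions for illustration and do not reproduce the study's results.

```python
def confusion_counts(actual, predicted, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1  # diabetic, predicted diabetic
            else:
                fp += 1  # healthy, predicted diabetic
        else:
            if a == positive:
                fn += 1  # diabetic, predicted healthy
            else:
                tn += 1  # healthy, predicted healthy
    return tp, fp, tn, fn


def metrics(tp, fp, tn, fn):
    """Sensitivity, specificity, and accuracy from confusion-matrix cells.

    Assumes both classes occur at least once, so no denominator is zero.
    """
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
    }


# Hypothetical test labels (1 = diabetic, 0 = not diabetic), invented for illustration.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 0, 0, 0, 1, 0]
counts = confusion_counts(actual, predicted)
result = metrics(*counts)
print(counts, result)
```

Because only 272 of the 1024 instances are diabetic, accuracy alone can look favourable for a classifier that misses many positive cases, which is one reason to report sensitivity and specificity alongside it.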

CONCLUSION
Five data mining classification techniques were compared on multiple factors using the same set of attributes in the MV database. Results were obtained for the MLP, BayesNet, JRip, FLR, and C4.5 classification techniques. The techniques were compared on time, accuracy, recall, and error rate. BayesNet, MLP, and FLR had lower computation times, while C4.5 and JRip achieved accuracy above 85%. This work therefore concludes that C4.5 and JRip are the most suitable algorithms for classification-based prediction on datasets of diabetes patients. Medical predictions need high accuracy, and accuracy above 85% is good for early detection and prediction of diabetes, helping doctors take preventive and early action on treatment.

8. REFERENCES