PREDICTION OF HEART DISEASE USING K-MEANS AND ARTIFICIAL NEURAL NETWORK AS HYBRID APPROACH TO IMPROVE ACCURACY

Abstract: The heart is one of the most important organs of the human body, and life depends entirely on its efficient working. When the heart suffers a disorder, the resulting cardiovascular diseases (CVD) are among the most challenging diseases in terms of reducing the patient count. According to a survey conducted by the WHO, about 17 million people around the globe die of cardiovascular diseases, i.e. 29.20% of all deaths, mostly in developing countries. There is therefore a need to tackle the complicated task of CVD using advanced data mining techniques in order to discover knowledge for heart disease prediction. In this paper, we propose an efficient hybrid algorithmic approach for heart disease prediction. The approach extracts previously unknown knowledge about heart disease using a hybrid combination of the K-means clustering algorithm and an artificial neural network. In our proposed model we consider 14 of the 74 attributes of the UCI Heart Disease Data Set [19]. The technique uses medical parameters such as age, weight, gender, blood pressure and cholesterol level for prediction. The K-means algorithm is used to group the attributes, and the back-propagation technique in neural networks is used for prediction. The main objective of this paper is to develop a prototype for predicting heart disease with a higher accuracy rate.


II. RELATED WORK
In this section, data mining techniques used for decision making in heart disease are analysed. Ankita Dhewan and Meghana Sharma proposed a methodology that hybridizes two data mining techniques, Artificial Neural Network and Genetic Algorithm, to achieve high accuracy with the least error [1].

Limitations:
A major disadvantage of GA is its unguided mutation. The mutation operator in GA works by adding a randomly generated number to a parameter of an individual in the population [10]. This is a main reason for the very slow convergence of the genetic algorithm, and the time consumed for optimization is very high.
M. Akhil Jabbar and B. L. Deekshatulu proposed an algorithm in two parts: the first part evaluates attributes using genetic search, and the second part builds the classifier and measures its accuracy. The paper compares the accuracy on the datasets with and without GA. The results show that accuracy increases by 5% when the two are combined.
Limitations: Accuracy is very low with K-nearest neighbour, and the genetic algorithm takes much more time for optimization [7].
Rovina Dbritto and Aniruddha compared several data mining techniques, namely Naïve Bayes, Support Vector Machine, K-nearest neighbour and Logistic Regression. The results show that Naïve Bayes gives higher accuracy than the other classifiers.

Limitations:
The disadvantage is that the Naïve Bayes classifier makes a very strong assumption about the data distribution, namely that any two features are independent given the output class. When this assumption does not hold, the results can be very poor. Dependencies among attributes cannot be modelled using this Bayesian classifier [2].
Humar Kahramanli and Novruz Allahverdi used a hybrid neural network that combines an artificial neural network and a fuzzy neural network. A dataset of 303 samples taken from patients with heart disease was used, giving 87.4% accuracy on the attributes of the UCI repository [5] [12].
Limitations: When a fuzzy system is combined with a neural network, the fuzzy system needs to be tuned, which is very time consuming and error-prone.
Sudha and Sarath Kumar evaluated two algorithms, KNN and K-means [15]. The measured accuracy shows that KNN achieves 100% accuracy for different clusters with the nearest value, while K-means achieves 100% accuracy when the number of clusters K is very high.

Limitations:
The computation cost is very high, as the distance from each query instance to all training samples has to be calculated [8].
Mai Shouman and Tim Turner observed that when single data mining techniques are applied to different datasets, the results cannot be compared because the datasets differ. When single and hybrid data mining techniques are applied to the Cleveland dataset for heart disease diagnosis, the results show that hybrid techniques perform better than single techniques. The hybrid technique used was a neural network ensemble [3].
Limitations: Ensemble training is several times slower than training a traditional neural network. For some rare problems, the ensemble error is greater than the error of a traditional neural network.
Limitations: PLS-DA is a complex algorithm that is very difficult to use [4].
III. PROPOSED SYSTEM
This section describes the system architecture. Fig. 1 gives an overview of the architecture. The core modules of the proposed system are:
a) Understanding the input data and selecting the attributes related to heart disease.
b) Data preparation: transformation of the data and pre-processing of missing data are carried out.
c) Processing module: this specifies the algorithmic approach applied in the system to obtain highly accurate results. The pre-processing modules are discussed separately in the upcoming section.
d) Evaluation and deployment: the final analysis module provides information about the generated output. It compares and draws conclusions about measurable results such as sensitivity and accuracy.
For diagnostic purposes we consider these 14 attributes [14]: age in years, sex (male, female), chest pain type, resting blood pressure, serum cholesterol in mg/dl, fasting blood sugar, resting electrocardiographic results, maximum heart rate achieved, exercise induced angina, ST depression induced by exercise relative to rest, the slope of the peak exercise ST segment, thalassemia, number of major vessels and angiographic disease status.
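The data preparation module (b) can be illustrated with a minimal sketch. The paper does not state how missing data are handled, so the median imputation, the "?" missing-value marker and the function names below are assumptions made for illustration only; the min-max scaling is one common way to obtain the normalized data that the later sections refer to.

```python
from statistics import median


def impute_missing(rows, missing="?"):
    """Replace missing entries with the median of the observed values in that column.

    Assumption: median imputation and the "?" marker are illustrative choices,
    not the method stated in the paper.
    """
    columns = list(zip(*rows))
    fills = [median(float(v) for v in col if v != missing) for col in columns]
    return [
        [fills[j] if v == missing else float(v) for j, v in enumerate(row)]
        for row in rows
    ]


def min_max_normalize(rows):
    """Rescale every column to [0, 1]; a constant column maps to 0.0."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Toy records (age, resting blood pressure, serum cholesterol); values are made up.
raw = [
    ["63", "145", "233"],
    ["41", "?", "204"],
    ["57", "120", "?"],
    ["50", "130", "250"],
]
clean = impute_missing(raw)
scaled = min_max_normalize(clean)
```

A column in which every value is missing has no median, so this sketch would raise an error there; a real pipeline would have to drop or otherwise handle such an attribute.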

IV. ALGORITHMIC DESCRIPTION
A. K-means Algorithm
The main goal of using the K-means clustering technique is to organize the data into classes such that there is
• high intra-class similarity
• low inter-class similarity
K-means [15] [16] is a well-known clustering algorithm widely used in data mining projects. Its aim is to find the positions µ_i, i = 1...k, of the cluster centroids that minimize the within-cluster sum of squared distances to the centroid. The result of K-means depends on the choice of the k clusters, and the algorithm may get stuck in different solutions. To remove this dependency, modified or improved versions of K-means have been proposed. Here K-means is used together with Lloyd's algorithm to reduce such dependencies. Using this method, the results show that the quality of the clusters is not compromised.
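The within-cluster sum-of-squares objective described above can be written explicitly. In the sketch below, c_i denotes the set of data points currently assigned to cluster i; this symbol is introduced here for illustration and does not appear in the paper.

```latex
\underset{\mu_1,\dots,\mu_k}{\arg\min}\;
\sum_{i=1}^{k} \sum_{x \in c_i} \lVert x - \mu_i \rVert^{2},
\qquad
\mu_i = \frac{1}{\lvert c_i \rvert} \sum_{x \in c_i} x
```

The second expression is the centroid update: for a fixed assignment, the mean of the points in c_i is the position that minimizes the inner sum.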
The steps of the K-means algorithm are [15]:
1. Initialize the centers of the clusters from the n data points x_i, i = 1...n, that have to be partitioned into k clusters.
2. Assign each data point to the closest cluster using the Euclidean distance.
3. Set the position of each cluster to the mean of all data points belonging to that cluster.
4. Repeat steps 2-3 until convergence.
In our system the K-means algorithm plays a crucial role in obtaining the appropriate number of data groups. Using this algorithm with the Euclidean distance, centroids are calculated for the different patient attributes. The mean value of the sample data is taken into account and is then used to predict the patient's status. If the mean value of the patient is nearest to the sample mean value, the patient is more likely to be affected by heart disease.
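Steps 1 to 4 can be sketched in a few lines of plain Python. This is a minimal illustration rather than the implementation used in the paper: the random initialisation from the data points, the stopping rule (no assignment changes), the function names and the toy data are all assumptions.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd-style K-means; returns (centroids, labels)."""
    rng = random.Random(seed)
    # Step 1: initialise the k centres from the data points themselves.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every point to its closest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        # Step 4: stop once no assignment changes (convergence).
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty cluster keeps its previous centre
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels


# Toy, already normalized two-attribute records; the values are made up.
points = [[0.10, 0.20], [0.15, 0.10], [0.20, 0.25],
          [0.80, 0.90], [0.85, 0.80], [0.90, 0.95]]
centroids, labels = kmeans(points, k=2)
```

On this toy data the two well-separated groups end up in different clusters; which group receives label 0 depends on the random initialisation.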

B. Artificial Neural Network
An artificial neural network (ANN), usually called a neural network (NN), is a mathematical or computational model inspired by the structure and/or functional aspects of biological neural networks [17] [18]. An ANN has three layers: an input layer, a hidden layer (also called the intermediate layer) and an output layer. The hidden layer lies between the input and output layers.

Input layer: The input units in this layer represent the raw information that is fed into the network.
Hidden layer: The activity of each hidden unit is determined by the activities of the input units and the weights on the connections between them.
Output layer: The activity of each output unit is determined by the activities of the hidden units and the weights on the connections between them.
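The layer descriptions above can be illustrated with a small forward pass in which each unit applies a sigmoidal function to the weighted sum of its inputs. This is a sketch under assumptions: the network size, the weight values and the bias terms are hypothetical and are not taken from the paper, and training by back-propagation is not shown.

```python
import math


def sigmoid(z):
    """Logistic (sigmoidal) activation, mapping any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def layer_forward(inputs, weights, biases):
    """Activity of each unit: sigmoid of the weighted sum of its inputs.

    weights[j][i] plays the role of w_ij, the weight connecting input x_i
    to unit y_j. The bias terms are an assumption added for generality.
    """
    return [
        sigmoid(sum(w * x for w, x in zip(unit_weights, inputs)) + b)
        for unit_weights, b in zip(weights, biases)
    ]


def forward(x, w_hidden, b_hidden, w_out, b_out):
    """Input layer -> hidden layer -> output layer."""
    hidden = layer_forward(x, w_hidden, b_hidden)
    return layer_forward(hidden, w_out, b_out)


# Hypothetical network: 3 inputs, 2 hidden units, 1 output; arbitrary weights.
x = [0.5, 1.0, 0.0]
w_hidden = [[0.2, -0.4, 0.1], [0.7, 0.3, -0.5]]
b_hidden = [0.0, 0.1]
w_out = [[0.6, -0.3]]
b_out = [0.05]
y = forward(x, w_hidden, b_hidden, w_out, b_out)
```

Because the output passes through the sigmoid, y[0] lies strictly between 0 and 1 and could be thresholded to obtain a two-class prediction.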
The ANN algorithm proceeds as follows:
1. The data from the input layer is given to the hidden layer.
2. The input values from the input layer are modified by weight values and sent on to the output layer.
3. The values are modified again by the weights on the connections between the hidden and output layers.
4. This information is processed and the output layer gives the final output.
Finally, this output is processed by an activation function. The ANN follows a trial-and-error method in order to reach an optimal solution. The structure of the neural network is shown in Fig. 2, where
y_j represents an output neuron,
x_i is an input neuron,
w_ij is the weight connecting x_i and y_j,
∑ is the sigmoidal function.
As shown in Fig. 2, the ANN consists of three layers: an input layer, a hidden layer (also called the intermediate layer) and an output layer. In this system the previously clustered and normalized data groups are fed as input to the neurons. The patterns vital to heart attack prediction are selected on the basis of the computed significance weightage. Weightages are assigned based on the range defined for each selected attribute of the dataset. For example:
• sex: (1 = male; 0 = female)
• fbs: (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)
V. CONCLUSION
As the number of heart disease patients increases every year, a huge amount of medical data is available. Researchers are applying data mining techniques to this data to diagnose heart disease. The analysis shows that the artificial neural network algorithm is best for classifying knowledge from large amounts of medical data. The population is growing exponentially, and the death rate due to cardiovascular diseases is also increasing. The only way to control this is to predict heart disease and treat it before it gets worse. Our hybrid approach gives a higher accuracy rate, 97% of disease detection, than earlier proposed methods.