Efficient Outlier Detection by Integration of Clustering and Classification

Abstract: This paper presents a new method that integrates the clustering method DBSCAN with the classification method KNN to improve the quality of real data sets by removing noisy data. DBSCAN is a versatile density-based clustering algorithm. The proposed method first applies DBSCAN to the data set and then applies KNN. The experiments conducted show that the proposed approach improves accuracy.

In the same year, Breunig et al. [12] proposed a method in which a Local Outlier Factor (LOF) is determined for each object in the data set, indicating its degree of outlierness. The factor measures how far a candidate outlier differs from the other points in the data set. It is considered more efficient than the instance-based scheme. The conclusion is that the density-based scheme does not explicitly categorize the data into outliers and normal data. Subsequently, in 2002, Angiulli and Pizzuti [13]-[15] proposed a method that identifies noisy data by considering all the objects in the neighborhood; the points are ranked by the sum of the distances to their k nearest neighbors. Other clustering methods are also helpful in detecting unwanted data in a data set, for example CLARANS [17], DBSCAN [18], BIRCH [19] and CURE [20]. However, the primary objective of these methods is to improve the clustering, not to detect unwanted data.
III. DBSCAN
DBSCAN is a data clustering algorithm. Given a set of points, it groups together the points that are closely packed, i.e., that have many nearby neighbors, whereas points in low-density regions, whose neighbors lie far away, are marked as noise. In 2014, this density-based clustering algorithm received the test of time award (an award given to algorithms that have gained high recognition in theory and practice) at KDD, the highly recognized data mining conference [21].
The DBSCAN algorithm requires two parameters: epsilon (eps), which specifies the radius within which points are considered neighbors, and the minimum number of points required to form a dense region (minPts). The algorithm begins at a random starting point. The neighborhood of this point within radius epsilon is retrieved and the points in it are counted. If the neighborhood contains a sufficient number of points, a cluster is started; otherwise the point is marked as noise.
The value of epsilon can be chosen by plotting a k-distance graph, in which the distance to the k = minPts nearest neighbor is plotted. The point where the plot shows a strong bend, as shown in Figure 1, is the most appropriate value of epsilon. Choosing epsilon too small leaves a large number of data points unclustered, whereas too large a value merges two or more clusters and puts the majority of the objects in the same cluster. A small value is, however, generally preferred.
A. Advantages of DBSCAN
1. Unlike k-means clustering, the value of k need not be specified beforehand.
2. As opposed to other clustering algorithms such as K-means and K-medoids, it can find arbitrarily shaped clusters, including clusters connected by thin lines and clusters totally surrounded by other clusters.
3. It is robust to noise and outliers.
4. The ordering of the points in the database has no effect on DBSCAN, and it requires only two parameters, epsilon and minimum points.
5. DBSCAN is designed to work with index structures that accelerate region queries, for example the R* tree.
6. The parameters can also be set by a domain expert if the data in the data set is well understood.

B. Drawbacks of DBSCAN
1. The algorithm is not entirely deterministic: a border point that is reachable from more than one cluster can be assigned to either cluster, depending on the order in which the data are processed. The core and noise points, however, are largely unaffected.
2. The accuracy of the algorithm depends on the distance measure used, and the algorithm becomes difficult and complex to apply to high-dimensional data.
3. Choosing an appropriate distance threshold ε is difficult if the data and its scale are not well understood.
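The DBSCAN procedure described in this section can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in the paper. The names `region_query`, `dbscan`, `NOISE` and `UNVISITED` are hypothetical. Euclidean distance is assumed, points are scanned in index order rather than from a random start, and a point is treated as a core point when its eps-neighborhood (the point itself included) holds at least `min_pts` points.

```python
import math

NOISE = -1        # label for points that end up in no cluster (assumed convention)
UNVISITED = None  # label for points that have not been processed yet


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (itself included)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE otherwise."""
    labels = [UNVISITED] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later be relabeled as a border point
            continue
        cluster += 1                   # points[i] is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins the cluster, not expanded
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point: keep expanding
    return labels
```

Each call to `region_query` here is a linear scan, so the sketch is quadratic in the number of points; an index structure such as the R* tree mentioned above would replace that scan. The relabeling of a noise point as a border point is also where the order dependence noted in drawback 1 arises.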

IV. KNN Algorithm
It is one of the classification methods used to detect unwanted data in a data set. In this method, the k closest training examples in feature space are taken as input, and a class label is obtained as output. The notions of closest neighbor and k-distance are used here: the set of the k closest data points of a point p is the k-neighborhood of that point. For outlier detection, the distance from each point to its k closest data points is calculated, the points are sorted in decreasing order of this value, and the first n points are considered outliers. Determining the value of k is critical: for a small value of k, noise has a higher impact on the result, whereas a large value of k is computationally expensive. The simplest and most widely used approach is to set k = n^(1/2), where n here is the number of samples in the training set.
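The distance-based ranking and the square-root rule described above can be sketched as follows. This is an illustrative reading, not the paper's code. It assumes Euclidean distance and takes the k-distance of a point to be the distance to its k-th nearest neighbor. The function names `k_distance`, `rank_by_k_distance` and `sqrt_rule_k` are hypothetical.

```python
import math


def k_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (itself excluded)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]


def rank_by_k_distance(points, k):
    """Point indices sorted by decreasing k-distance (most isolated first)."""
    return sorted(range(len(points)),
                  key=lambda i: k_distance(points, i, k),
                  reverse=True)


def sqrt_rule_k(n_samples):
    """Rule-of-thumb k = n^(1/2), rounded to the nearest integer and at least 1."""
    return max(1, round(math.sqrt(n_samples)))
```

Taking the first n indices returned by `rank_by_k_distance` gives the outlier candidates. The same sorted k-distance values, with k set to the minimum number of points, are the values a k-distance graph for choosing epsilon would plot.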
A. Advantages of KNN
1. KNN is robust to noisy data, especially if the neighbors are weighted by the inverse square of their distance.
2. It can be applied effectively to large data sets.

B. Drawbacks of KNN
1. The parameter k needs to be determined in advance.
2. It is difficult to choose the distance measure and the attributes that give the best result.
3. It must also be decided whether to use all the attributes or only some of them.
4. The distance from each testing sample to all training samples has to be computed, so the computation cost is very high.

V. Proposed Algorithm
The proposed algorithm combines two data mining techniques. The first phase applies the density-based clustering algorithm DBSCAN to the data set. The clusters obtained are then marked as separate classes, each denoted Cl_k (k = 1, 2, …, m). The KNN algorithm is then applied, with the clusters obtained from DBSCAN as the training set and the noise points as the testing set.

Input: The original data set DS, the radius eps, the minimum number of objects in a neighborhood mpts, and an appropriate value of K chosen according to the data set.
Output: The set N, which includes all the noise detected in the original data set DS.
Step1: Set the values of eps and mpts.
Step2: Select a point randomly from the data set DS and count the number of points in its neighborhood. If the number is higher than mpts, the point is marked as a core object. Else, it is marked as noise.
Step3: If the point is a core object, create a cluster with radius eps around the core point. Then add the objects in the cluster to a list container and check the objects recursively. If an object in the container is a core object, assign it to the same class as the point and add its neighborhood points to the container. Else, mark it and delete it from the list container.
Step4: Repeat Step 2 and Step 3 until every object in the data set DS is marked as belonging to a class, or as a noise point if it is found not to belong to any class.
Step6: Take an appropriate value of K.
Step7: Train and build KNN on the training set DS_t.
Step8: Randomly consider a point that is not included in the training set (i.e., a noise point from DBSCAN) and find its k nearest neighbors. The following three situations may arise:
i) If 50% or more of its k neighbors belong to class i, include the point in class i.
ii) If 50% of its neighbors belong to class i and the other 50% belong to class j, the point can be included in either class.
iii) If there is no class to which at least 50% of the neighbors belong, the point is marked as an outlier and included in N.
Step9: Output the set N, which contains the noise data.
By introducing the KNN algorithm, the DBS-KNN algorithm can effectively reduce the error samples of DBSCAN clustering and significantly improve the clustering accuracy.
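The voting rule of Step 8 can be sketched as below. This is a hedged illustration under stated assumptions, not the authors' R code. It expects one label per point from a previous clustering pass, with `-1` assumed to mark noise, and it uses Euclidean distance. Noise points are visited in index order rather than randomly. The function name `reassign_noise` is hypothetical.

```python
import math
from collections import Counter


def reassign_noise(points, labels, k, noise=-1):
    """Apply the Step 8 vote to every noise point; return (new_labels, outliers)."""
    # Training set: every clustered point together with its class label.
    train = [(p, c) for p, c in zip(points, labels) if c != noise]
    new_labels = list(labels)
    outliers = []
    for i, (p, c) in enumerate(zip(points, labels)):
        if c != noise:
            continue
        # The k nearest clustered points of this noise point.
        nearest = sorted(train, key=lambda pc: math.dist(p, pc[0]))[:k]
        votes = Counter(cls for _, cls in nearest)
        top = votes.most_common(1)  # empty only if there are no clustered points
        if top and 2 * top[0][1] >= k:
            # At least 50% of the k neighbors agree; an exact 50/50 tie goes
            # to whichever of the tied classes was encountered first.
            new_labels[i] = top[0][0]
        else:
            outliers.append(i)      # no class reaches 50%: keep as outlier
    return new_labels, outliers
```

With only two classes and an odd k, one class always reaches the 50% threshold, so case iii) can only occur when three or more classes are present among the neighbors or when fewer than k clustered points exist.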

VI. Result Analysis
We performed experiments to compare the effectiveness of the DBS-KNN algorithm with that of the DBSCAN algorithm alone, without KNN. The algorithms are implemented in R. The UCI data set Iris is used as the experimental data. The data set consists of 50 samples from each of three species of Iris flower, described by four features: (i) sepal length, (ii) sepal width, (iii) petal length and (iv) petal width. The class label given in the data set is used to test the efficiency and effectiveness of the DBS-KNN noise detection algorithm. 75% of the data set is taken as the training set and the remaining 25% as the testing set.
The basic flow of the experiment is as follows. First, the training set is clustered with the DBSCAN algorithm. Then, according to the clustering results, KNN is trained on each identified category to obtain a discriminant model for each category. The resulting models are used to classify the training and testing sets. The performance of the DBS-KNN algorithm can be judged from the experimental results shown in Table 1. The results show that, as the minimum points and epsilon values increase, the accuracy of the algorithm increases and the number of noise points decreases. The value of k for the KNN algorithm can also be varied.
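The 75%/25% split and the accuracy measure used in this evaluation can be sketched as follows. This is a generic illustration, not the experimental script behind Table 1. The function names `train_test_split` and `accuracy`, the fixed `seed`, and the choice to round the training size down are all assumptions.

```python
import random


def train_test_split(samples, train_fraction=0.75, seed=0):
    """Shuffle a copy of samples and split it into (train, test)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)      # seeded for repeatable splits
    cut = int(len(shuffled) * train_fraction)  # training size, rounded down
    return shuffled[:cut], shuffled[cut:]


def accuracy(true_labels, predicted_labels):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)
```

For the 150 Iris samples this split yields 112 training and 38 testing samples. Cluster identifiers produced by a clustering step are arbitrary, so they would have to be mapped to the species labels before `accuracy` gives a meaningful value.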

VII. Conclusions
The new method based on DBSCAN and KNN proposed in this paper is more efficient because it reduces the error of the DBSCAN algorithm. As is evident from the experimental output, the algorithm improves the accuracy of the clustering results. The parameters that must be selected are (i) the minimum points, (ii) epsilon and (iii) the parameter k. These parameters can be adjusted manually according to the application area. Future work will consist of finding a method for automatic selection of the optimal parameters.