Comparative Analysis of R Package Classifiers Using Breast Cancer Dataset

Abstract: Breast cancer is one of the most common cancers among women and remains the leading form of cancer among women around the globe. Lack of awareness and detection only at an advanced stage put patients' lives at very high risk. Of the two types, non-invasive and invasive, invasive cancer has the potential to spread beyond the affected part to other parts of the body. This paper performs breast cancer data analysis using R packages. Decision trees are widely used data mining algorithms for classification because they are fast, scalable and distributable. Among the many tools available for data analysis, R has recently become popular among data analysts for the study of large, unstructured and dynamic datasets. The three classifiers taken for study are 'rpart', 'ctree' and 'randomforest'. The algorithms are compared on performance measures such as accuracy, precision, recall, sensitivity and specificity. Based on the results, the classification approach best suited to cancer data analytics is recommended.


I. INTRODUCTION
Cancer is a major disease threatening human lives all over the world. Out of the cancer-affected community, more than 90% of the people have a very remote chance of survival. Breast cancer has been identified as the second leading cause of death among women worldwide. It is also observed to be a common type of cancer among women in both developed and developing countries. A recent study says that one woman dies every 13 minutes due to this disease. Detecting breast cancer at an early stage offers a good chance of saving patients, which matters because the survival rate of cancer-affected patients is reported to be only 5%.
Cancer data present in different sources are heterogeneous in nature, and many systems have been developed to collect, analyze and learn from data for cancer care. Many open-source tools are available for data analysis. Due to the volume of data involved, data mining techniques are applied for the early detection of cancer. Cancer detected at an early stage allows better diagnosis and treatment and can improve patient survival. The fundamental data mining methods are classification, clustering, regression, artificial intelligence, association rules and decision trees. Classification methods are the most intensively used data mining methods in health care. A classification model devised for a specific application is trained with an existing set of data, and this learning then helps the model predict and classify new data. The concept can be applied successfully to cancer research, since diagnosis and treatment draw on past data. With this idea and motivation, the decision tree classification methods available in R packages are studied. Three significant classification methods, namely 'rpart', 'ctree' and 'randomforest', are implemented using R.
The three methods are analyzed on specific performance metrics: accuracy of detection, precision, recall, specificity and sensitivity. The BreastCancer data set available in the 'mlbench' package on CRAN is used for testing.
Significant contributions of this paper: i) Study of the three classification methods, namely 'rpart', 'ctree' and 'randomforest'.
ii) Data analysis using performance metrics for the breast cancer data set taken.
The work is organized as follows: the "Related Work" section discusses the significant literature available on breast cancer data analysis. The "Methodology" section presents the proposed methodology. The "Results and Discussions" section explains the experimental setup and the results of the experiments.

II. RELATED WORK
Many works exist in the area of breast cancer data analysis, of which a few significant ones are reviewed here. Predictive models are discussed in [1]; the analysis of their results shows that integrating multidimensional heterogeneous data with different techniques for feature selection and classification can provide promising tools for inference in the cancer domain. The study in [2] shows that data mining techniques are a good means of predicting breast cancer recurrence; it presents an efficient pre-classification method and extracts information on breast cancer recurrence from the SEER dataset. The authors in [3] compare three classification techniques in the Weka software, and the results show that Sequential Minimal Optimization (SMO) has a higher prediction accuracy (96.2%) than the IBK and BF Tree methods. The work in [4] presents a diagnosis system for detecting breast cancer based on RepTree, RBF Network and Simple Logistic. The research in [5] finds that the k-means clustering algorithm and the FF algorithm are helpful in the early diagnosis of breast cancer patients. The purpose of the research in [6] is to develop a novel prototype for the clinical problem of diagnosing and managing patients with breast cancer. Different methods for breast cancer detection are explored in [7] and their accuracies compared; the authors infer that SVMs are more suitable for the classification problem of breast cancer prediction. This review indicates that the classification techniques in R packages have so far received little exploration, although they may reveal interesting performance improvements. Hence, our study focuses on these approaches.

III. METHODOLOGY
The major objective of the work is to perform breast cancer data analysis using specific classifiers. The secondary objective is to compare performance and identify the best-suited decision tree classification algorithm among the three taken for study. Fig. 1 shows the methodology used for the study. The work has been implemented using the decision tree algorithms 'rpart', 'ctree' and 'randomforest' available in R, with the BreastCancer dataset of the 'mlbench' package. Decision trees were chosen among the classification techniques available in R because they are fast, scalable and distributable.
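The overall procedure can be illustrated with a minimal sketch, given here under stated assumptions rather than as the exact code used in this study. It assumes the CRAN packages 'mlbench', 'rpart', 'party' and 'randomForest' are installed; the seed, the variable names and the use of default tuning parameters are illustrative choices. The Id column is dropped and incomplete rows are removed, which should leave the 683 observations reported for the dataset.

```r
# Packages are assumed to be installed from CRAN beforehand.
library(mlbench)       # provides the BreastCancer data
library(rpart)         # rpart()
library(party)         # ctree()
library(randomForest)  # randomForest()

data(BreastCancer)
# Drop the Id column and rows with missing values (assumption: this is how
# the 683 observations are obtained).
bc <- na.omit(BreastCancer[, -1])

# 80% / 20% split into training and test data; the seed is arbitrary.
set.seed(1)
train_idx <- sample(nrow(bc), size = floor(0.8 * nrow(bc)))
train <- bc[train_idx, ]
test  <- bc[-train_idx, ]

# Fit the three classifiers with default settings on the same training data.
fit_rpart <- rpart(Class ~ ., data = train, method = "class")
fit_ctree <- ctree(Class ~ ., data = train)
fit_rf    <- randomForest(Class ~ ., data = train)

# Predicted class labels on the held-out test data.
preds <- list(
  rpart        = predict(fit_rpart, newdata = test, type = "class"),
  ctree        = predict(fit_ctree, newdata = test),
  randomforest = predict(fit_rf, newdata = test)
)

# Share of test records classified correctly by each method.
accuracy <- sapply(preds, function(p) mean(p == test$Class))
print(round(accuracy, 3))
```

Using one common split for all three methods keeps the comparison fair; the accuracies obtained depend on the seed and need not match the figures reported in this paper.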
A. About R
R is a software environment used by data analysts for data mining and statistical analysis. R has recently become more popular among data analysts for the study of their large and dynamic databases. R is part of the GNU project, and its source code is written in C, Fortran and R itself. R has both a command line interface and graphical interfaces. R comes with many packages installed by default, and each of these packages has a set of functions that perform certain analyses or graphical representations.
B. Package "mlbench"
Some packages are available in R by default and others need to be installed using the Install Package option in R. The package 'mlbench' needs to be installed; it supplies benchmark datasets rather than analysis functions. It consists of a set of artificial and real-world machine learning datasets, many of them taken from the UCI repository.
C. About the Dataset
The dataset used for this work is the BreastCancer dataset, which has 683 observations and the attributes listed in Table I. The dataset is divided into training and test datasets in the ratio of 80% to 20% respectively.
D. About 'rpart'
The decision tree technique 'rpart' stands for recursive partitioning for classification and regression trees. These algorithms build classification or regression models of a very general structure using a two-stage procedure.
E. About 'ctree'
The decision tree technique 'ctree' stands for conditional inference trees, which are available in the package 'party'. It is a non-parametric class of regression trees embedding tree-structured regression models into a well-defined theory of conditional inference procedures.
F. About 'randomforest'
The decision tree technique 'randomforest' performs classification and regression based on a forest of trees using random inputs. This method implements Breiman's random forest algorithm for classification and regression. It can also be used in unsupervised mode for assessing proximities among data points.
Before executing the code for the 'ctree' classification method in R, the package 'party' must be installed.
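A 'ctree' run of this kind might look like the following sketch. It is an illustration under assumptions, not the original listing: the data preparation (dropping the Id column and incomplete rows), the seed and the object names are hypothetical, and the tree is fitted with the default settings of ctree() from 'party'.

```r
# install.packages(c("mlbench", "party"))  # run once if not yet installed
library(mlbench)
library(party)

data(BreastCancer)
bc <- na.omit(BreastCancer[, -1])   # assumed preprocessing: no Id, no NAs

set.seed(1)                         # arbitrary seed for a repeatable split
train_idx <- sample(nrow(bc), size = floor(0.8 * nrow(bc)))
train <- bc[train_idx, ]
test  <- bc[-train_idx, ]

# Conditional inference tree predicting Class from all other attributes.
bc_ctree <- ctree(Class ~ ., data = train)

# Draw the fitted tree when working in an interactive session.
if (interactive()) plot(bc_ctree)

# Confusion matrix on the test data: rows are predictions, columns the truth.
pred <- predict(bc_ctree, newdata = test)
conf_mat <- table(Predicted = pred, Actual = test$Class)
print(conf_mat)
```

The diagonal of the printed table counts the correctly classified test records and the off-diagonal cells count the errors.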
The result for this code is shown in Fig. 4. The resulting graph is interpreted using the confusion matrix below, which shows the class error and the number of records that were classified correctly and wrongly (error rate).
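The performance measures used in this study can be derived from such a confusion matrix as sketched below. The function name classification_metrics and the example counts are hypothetical and serve only to illustrate the arithmetic; they are not results of this study. The sketch assumes that rows hold predicted classes, columns hold actual classes, and 'malignant' is the positive class. Recall and sensitivity denote the same quantity.

```r
# Compute the performance measures from a 2 x 2 confusion matrix whose rows
# are predicted classes and whose columns are actual classes.
classification_metrics <- function(conf_mat, positive = "malignant") {
  negative <- setdiff(rownames(conf_mat), positive)
  tp <- conf_mat[positive, positive]  # predicted positive, actually positive
  fp <- conf_mat[positive, negative]  # predicted positive, actually negative
  fn <- conf_mat[negative, positive]  # predicted negative, actually positive
  tn <- conf_mat[negative, negative]  # predicted negative, actually negative
  c(
    accuracy    = (tp + tn) / sum(conf_mat),
    precision   = tp / (tp + fp),
    recall      = tp / (tp + fn),     # identical to sensitivity
    specificity = tn / (tn + fp)
  )
}

# Made-up counts for illustration only (filled column by column).
example <- matrix(
  c(85, 3, 2, 47), nrow = 2,
  dimnames = list(Predicted = c("benign", "malignant"),
                  Actual    = c("benign", "malignant"))
)
metrics <- classification_metrics(example)
print(round(metrics, 3))
```

A precision of 1 corresponds to a matrix with no false positives, that is, no benign record predicted as malignant.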

V. CONCLUSION
Breast cancer can be predicted accurately at an early stage by using the classification methods in R packages. This may reduce health costs and shorten the time required for a patient to receive treatment. In this paper, three decision tree classification methods were compared in order to suggest the one that performs best for breast cancer data analysis. The software tool used for this purpose is R, which is currently one of the most popular tools among data analysts and performs well enough that run times are very short. Among the three techniques compared, namely 'rpart', 'ctree' and 'randomforest', 'randomforest' is observed to be the best based on precision, the performance measure that differs most distinctly across the methods. The 'randomforest' method shows the highest precision, which is 1, and hence we conclude that the 'randomforest' method best suits cancer analytics. As future work, we shall explore other classification techniques such as SVM, KNN and Naïve Bayes and compare their performance with that of the 'randomforest' decision tree approach.