A Framework for the Evolutionary Clustering of Dynamic Heterogeneous Information Networks Based on Hadoop MapReduce

- Evolutionary clustering is a form of dynamic clustering that takes the time period into account, which makes it well suited to clustering dynamic social networks and information networks. Compared with traditional static clustering, evolutionary clustering provides information instantly while considering the time stamps, which helps in diagnosing the entire dynamic network and in retrieving data according to user needs. Dynamic networks are huge, which makes clustering the data difficult and time consuming, whether the clustering is supervised or unsupervised. In this paper, we propose a framework that overcomes this difficulty by using Hadoop MapReduce. The framework combines an evolutionary algorithm with the MapReduce framework. The MapReduce module processes the input, splits the data into fragments, and reduces the data size. The evolutionary algorithm groups the data that satisfy the given criteria in parallel. The MapReduce model benefits the clustering technique not only by reducing the data size and storage but also by processing the data simultaneously across a cluster of computers, which produces the output faster.


EVOLUTIONARY CLUSTERING OF DYNAMIC INFORMATION NETWORKS BASED ON HADOOP MAP REDUCE
The framework consists of the following steps:
Step 0: Store the input dataset in HDFS.
Step 1: The sorted key-value pairs are distributed to mappers, which apply the map function.
Step 2: The output of the map function is consolidated per key using reduce.
Step 3: The similarity function is applied to determine the clusters based on the labels provided.
Step 4: The output clusters are evaluated with the evolutionary patterns.

IV. EXPERIMENTS AND RESULTS
Experiments were performed with the four-area dataset of the DBLP network. Papers from the areas DB, DM, IR, and ML published from 2010 to 2016 are used; this is called the four-area dataset. It consists of data objects such as authors, papers, conferences, and terms. Our experiment initially uses 10K papers from the four-area dataset; we then increase the dataset size progressively to 20K, 30K, 40K, and 50K papers. In the first step, the input is stored in HDFS in the form of key-value pairs, <First Author, Paper Title>. The sorted key-value pairs are distributed to mappers by the job tracker. The map() function determines the count of papers written by each author, and the reduce function, according to the parameter provided, sums the total number of papers written by an author, removing redundant records. To determine the clusters of papers belonging to the same research area based on labels such as DB and DM, a similarity function is applied, and the output clusters are written to stable storage.
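The map and reduce stages described above can be sketched in plain Python. This is a minimal in-process illustration, not the Hadoop job used in the experiments: the function names `map_record`, `reduce_author`, and `count_papers` are hypothetical, and it assumes that "removing redundant records" means counting each distinct paper title once per first author.

```python
from itertools import groupby
from operator import itemgetter


def map_record(first_author, paper_title):
    """Map step: emit one (author, title) pair per input record."""
    # Assumption: surrounding whitespace is not significant in keys or values.
    yield first_author.strip(), paper_title.strip()


def reduce_author(first_author, paper_titles):
    """Reduce step: count the distinct titles seen for one author."""
    # Building a set drops redundant (duplicate) records before counting.
    return first_author, len(set(paper_titles))


def count_papers(records):
    """Run map, shuffle/sort, and reduce in-process over (author, title) records."""
    mapped = [pair for author, title in records
              for pair in map_record(author, title)]
    # Stand-in for the shuffle/sort phase: bring equal keys together.
    mapped.sort(key=itemgetter(0))
    counts = {}
    for author, group in groupby(mapped, key=itemgetter(0)):
        key, total = reduce_author(author, (title for _, title in group))
        counts[key] = total
    return counts
```

Calling `count_papers` on a list of (first author, paper title) tuples returns a dictionary of per-author paper counts. In an actual Hadoop deployment the sort and grouping are performed by the framework between the map and reduce phases rather than by user code.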
The experiment was run on Hadoop version 1.1.0, on a cluster consisting of one master node and two slave nodes, each with 7.5 GB of memory. In many evolutionary algorithms, the running time is directly proportional to the data size. However, the Hadoop framework, by subdividing the work among multiple nodes, speeds up execution even though it incurs the overhead of job tracking and task tracking. Figure 1 shows the running time for the different dataset sizes. The execution time for the 50K dataset is reduced compared with that for the 10K dataset. This shows that the framework is suitable for handling large datasets and that execution speed increases as the dataset size increases. Figure 2 shows the measure of scaling for our experiment. As the results show, performance increases as the job size increases. This is a result of Hadoop's distributed architecture for storing and processing data. The results show that the framework based on Hadoop MapReduce for evolutionary clustering of heterogeneous dynamic networks is suitable for huge social networks.
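The text does not define the scaling measure plotted in Figure 2. One common way to compare runs across dataset sizes is throughput (records processed per second), normalized to the smallest run. The sketch below assumes that interpretation; the function names `throughput` and `relative_throughput` are hypothetical, and the inputs are meant to be measured values, none of which are reproduced here.

```python
def throughput(sizes, seconds):
    """Return records processed per second for each (size, time) pair."""
    if len(sizes) != len(seconds):
        raise ValueError("sizes and seconds must have the same length")
    if any(t <= 0 for t in seconds):
        raise ValueError("running times must be positive")
    return [n / t for n, t in zip(sizes, seconds)]


def relative_throughput(sizes, seconds):
    """Return throughput of each run divided by that of the first run."""
    rates = throughput(sizes, seconds)
    if not rates:
        return []
    base = rates[0]
    if base == 0:
        raise ValueError("the first run must process at least one record")
    # A value above 1.0 means the run processed records faster than the first.
    return [rate / base for rate in rates]
```

Under this assumed measure, values that grow with dataset size would correspond to the statement that performance increases as the job size increases, since the fixed overhead of job and task tracking is spread over more records.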

V. CONCLUSION
In the broad area of evolutionary clustering, various techniques have been proposed so far. However, handling the huge datasets of social information networks requires a technique that reduces execution time and space. In this paper, we propose a framework to implement evolutionary clustering based on Hadoop MapReduce. The results show that the execution time is much reduced as a result of splitting the work among multiple nodes, and that generating clusters over time periods instantly has a promising effect in the field of evolutionary clustering.