Multi-View Data Visualization Based Big Data Analysis Using Clustering Similarity Measure

Abstract—In big data, visualization is a powerful way to represent data for efficient analysis of high-dimensional data. Data visualization has three main requirements: (i) data patterns must be represented without loss; (ii) attributes must be displayed without altering the underlying patterns; and (iii) both structured and unstructured data attributes must be supported for analysis. Many kinds of visualization are used in practice, including topic-based, attribute-based, audio-based, and text-based visualization over different data sets. Parallel coordinates are an efficient and effective visualization tool for analyzing and handling multi-attribute, high-dimensional data; the 5Ws density model extends them with sending and receiving densities, exposing data patterns and attributes while reducing pattern overlap. A similarity measure categorizes data by the relationships between objects in a data set, evaluated over different pairs of attributes. To extend the parallel coordinate tool with support for multi-attribute object relations, we propose and implement a novel method, Similarity Measure Centered with Multi Viewpoint (SMCMV), together with related clustering approaches for representing data. Using multiple viewpoints, we obtain an assessment-based similarity index for visualization and present a theoretical analysis of multi-attribute presentation. Our experimental results on real document collections show that the approach gives better data representation and a more effective similarity measure than well-known clustering approaches.


I. INTRODUCTION
Structured and unstructured data in big data visualization take many forms, such as images, audio, and video, collected from multiple data sets and evaluated in terms of size, time, and space complexity. For example, Facebook generates 25 GB of data containing users' personal details and the content they share with mutual and personal friends, yielding hundreds to thousands of attribute dimensions per month to analyze and visualize. Because of the rapidly increasing use of big data in various applications, many authors have proposed association, classification, and clustering techniques to analyze high-dimensional data. Parallel coordinate visualization is one of the most promising approaches because it represents data without altering the underlying data patterns. A sample visualization with different dimensions is shown in Fig. 1, along with neighbor-attribute partitioning over the node axis values of the dimension set. Several researchers have sought effective visualization of different data nodes that share the same attributes. Broadly, data visualization takes three forms in practice: topic-based visualization, which concerns a particular topic and its algorithmic process (e.g., network-traffic or cloud-data visualization); data-type-based visualization, which targets a specific type of data (e.g., text, audio, or video in different formats); and data-set visualization, which targets particular data sets (e.g., social-network data sets with different data patterns).
To represent data in these three ways, the traditional approach is the parallel-coordinate 5Ws density model, which analyzes data attributes and represents them on parallel axes for multiple data sets, data types, and topics in real-time presentation. However, when the similarity between attributes and their relationships must be measured, the 5Ws density model is not sufficient. In this paper we therefore propose and develop a novel Similarity Measure Centered with Multi Viewpoint (SMCMV) approach, together with related clustering approaches, to represent data. The approach follows a multi-view representation of dimensions and attribute relationships. Clustering is a central topic in data retrieval: it exploits the intrinsic structure of the data and formulates it into a meaningful presentation. Our approach therefore uses clustering properties to read and present data in different dimensions with attribute relations, and it exclusively follows multi-view presentation with respect to attributes. We also compute a similarity measure during attribute partitioning in data-set exploration; the similarity measure plays a decisive role in the success or failure of data representation in a clustering procedure. The main objectives of our approach are as follows: 1. We propose an approach to find the similarity between data objects with different relations in high-dimensional data. 2. We combine the proposed similarity measure with different clustering computations of provable quality and consistent performance. 3. We display multi-view visualizations of different data patterns. 4. We provide efficient visualization of multiple attributes with clustering computations.
The remainder of this paper is organized as follows: Section 2 reviews related work on visual data presentation techniques; Section 3 discusses the parallel-coordinate density model for data visualization; Section 4 describes the proposed SMCMV approach and its implementation procedure; Section 5 evaluates the computation and performance of the proposed approach on real data sets and plots the results; Section 6 concludes the paper.

II. RELATED WORK
Parallel coordinate plots, one of the most prominent techniques, were first proposed by Inselberg [3], and Wegman recommended them as a tool for high-dimensional data analysis [4]. Coordinates of n-dimensional data are laid out on parallel axes in a 2-dimensional plane and connected by line segments. As noted in the literature [5], many strategies have been proposed to aid the understanding of multivariate data using such visualization techniques. Parallel coordinate plots are a simple yet powerful geometric technique that represents N-dimensional data in a 2-dimensional area with statistical rigor. Visual clustering, axis reordering, and view focusing are common ways to reduce the clutter that arises in parallel coordinates. Dasgupta et al. [6] proposed a screen-space-metrics approach that chooses the axis layout by optimizing pairs of axes. Huh et al. used a correlation-based area between two adjacent axes instead of the proportional area of conventional parallel-coordinate axes, and shapes encoding statistical properties of the data on adjacent axes have also been described in the literature [7]. Zhou et al. [8] converted the straight line segments into curves to reduce the visual clutter of clustered plots; they also used a splatting framework [2] to identify groups and lessen visual clutter. Aiming to prevent over-plotting while preserving density information, Kai Lun Chung and Wei Zhuo [2] designed two visual analytic tools to reduce clutter in parallel coordinates: a selection chart, a brushing tool that helps users highlight the regions they pick, and a relation graph that arranges groups and offers interactions for users to explore the associations between groups.
Julian Heinrich et al. [9] designed BiCluster Viewer, which combines heat maps and parallel coordinate plots to find data patterns. BiCluster Viewer offers many interactive features, such as axis selection, range coloring, and navigation, that reduce clutter in the visual diagram. Matej Novotny and Helwig Hauser [8] separated the data into outliers and trends and then focused the view in clustered parallel coordinates to mitigate the clutter problem. Xiaoru Yuan et al. [7] scattered points inside parallel coordinates to merge parallel coordinates with scatterplots, which reduced data crowding; users can reorder the axes by dragging them in the interface for a better view. To the best of our knowledge, no previous work has created two extra axes from densities for parallel coordinate visualization. We first analyze the data patterns in order to obtain the values of SD and RD, and then visualize the data nodes. These data patterns reduce data overlapping and crowding, and 5Ws density parallel coordinates have considerably reduced clutter for big data analysis and visualization.

5Ws Parallel Coordinate Model
The main components of this model are as follows. 2.2.1. Dimensional Model: As the name (5Ws dimensions) suggests, the model captures when the data occur, where the data come from, what the data contain, why the data occur, and who receives the data. The 5Ws dimensions are illustrated by parallel axes that define the data under various conditions; each axis presents the density levels of the data patterns over the values of p, x, y, z, and q.
The mapping function represents these parameters as f(t, p, x, y, z, q), where t ∈ T is the time stamp of each data occurrence; p ∈ P denotes where the data came from, such as "Twitter", "Facebook" or "Sender"; x ∈ X denotes what the data content was, such as "like", "dislike" or "attack"; y ∈ Y denotes how the data was moved, such as "by the Internet", "by phone" or "by email"; z ∈ Z denotes why the data occurred, such as "sharing photos", "finding friends" or "spreading a virus"; and q ∈ Q denotes who received the data, such as "friend", "bank account" or "receiver".
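Concretely, each data occurrence can be modeled as one 6-tuple f(t, p, x, y, z, q). The following Python sketch is illustrative only: the field values are the example strings above, and the notion of a "pattern" as the time-independent (p, x, y, z, q) combination is a simplifying assumption of ours, not the paper's formal definition.

```python
from collections import namedtuple, Counter

# One 5Ws record: f(t, p, x, y, z, q)
Event = namedtuple("Event", ["t", "p", "x", "y", "z", "q"])

events = [
    Event(1, "Twitter",  "like",   "Internet", "sharing photos",    "friend"),
    Event(2, "Facebook", "attack", "email",    "spreading a virus", "receiver"),
    Event(3, "Twitter",  "like",   "Internet", "sharing photos",    "friend"),
]

# Treat a data pattern as the (p, x, y, z, q) combination, ignoring time;
# counting patterns is the raw material for density-based axes.
patterns = Counter(e[1:] for e in events)
most_common_pattern, count = patterns.most_common(1)[0]
```

Each distinct pattern then contributes a single polyline across the five dimensional axes.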

Parallel Data Visualization
Axes: The model creates two extra axes from two densities, SD (sending density) and RD (receiving density), each of which has a value for every data pattern, in order to improve accuracy in the parallel coordinate representation. The values of SD and RD on these two axes describe the data-flow patterns shown as polylines across the 5Ws dimensions. This reduces clutter in the chart, since each subset has only a single polyline. The 5Ws density parallel axes, combined with the alphabetical and numerical axes, provide more analytical methods for big data representation, and no data patterns are lost during analysis and visualization. 2.4. Re-ordering with Clustering: The 5Ws density parallel coordinates with re-ordering and clustered views provide visual structures and patterns that outline the close relationship between the axes in a graphical layout. They clearly exhibit big data patterns for different data sets, different topics, and different data types.
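The paper does not restate the exact SD and RD formulas, so the sketch below uses a simplified stand-in definition of our own: SD of a sender p is the fraction of all events originating from p, and RD of a receiver q is the fraction of all events arriving at q. All names and numbers are illustrative.

```python
from collections import Counter

# (p, x, y, z, q) pattern for each event; values are illustrative
events = [
    ("Twitter",  "like",    "Internet", "sharing photos",    "friend"),
    ("Twitter",  "like",    "Internet", "sharing photos",    "friend"),
    ("Facebook", "attack",  "email",    "spreading a virus", "receiver"),
    ("Twitter",  "dislike", "phone",    "sharing photos",    "friend"),
]

n = len(events)
sender_counts = Counter(e[0] for e in events)     # p axis (where from)
receiver_counts = Counter(e[-1] for e in events)  # q axis (who received)

def sending_density(p):
    # simplified SD: fraction of all events originating from sender p
    return sender_counts[p] / n

def receiving_density(q):
    # simplified RD: fraction of all events arriving at receiver q
    return receiver_counts[q] / n

print(sending_density("Twitter"), receiving_density("friend"))  # 0.75 0.75
```

Plotting these two values as two extra parallel axes, alongside the five dimensional axes, gives each pattern a single polyline whose position encodes its flow density.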

III. SYSTEM DESIGN & IMPLEMENTATION
In this section we discuss our proposed similarity-measure procedure over different attributes, relations, and indexes with practical examples. Following the procedure discussed in Section 3, data are represented with multi-view clusters based on the similarity measure. The implementation requires the following modules to define efficient attribute relations.
3.1. Similarity Measure: Based on term and document frequencies in the uploaded data sets, we compute the Euclidean distance between words and the similarity between documents with attribute relations. The parameters used in our approach are described in Table 1. Treating documents as vectors, the cosine similarity is Sim(di, dj) = cos(di − 0, dj − 0) = (di − 0)^T (dj − 0), where 0 denotes the origin. In this measure the origin is the one and only reference point: the similarity between two documents di and dj is determined by the angle between the two vectors as seen from the origin. To build a richer notion of similarity, more than one reference point can be used; we obtain a more accurate assessment of how close or distant a pair of points is if we look at them from many different viewpoints. An assumption of cluster membership is made before the measure is applied: the two objects being compared must be in the same cluster, while the viewpoints from which to measure must lie outside that cluster. We call the result the multi-viewpoint-based similarity. The similarity between two points di and dj inside cluster Sr, viewed from a point dh outside this cluster, equals the product of the cosine of the angle between di and dj as seen from dh and the Euclidean distances from dh to the two points: Sim(di, dj | dh) = cos(di − dh, dj − dh) · ||di − dh|| · ||dj − dh|| = (di − dh)^T (dj − dh). 3.3. Implementation: In this section we present the implementation procedure of our approach, which defines efficient data presentation in different dimensions with effective similarity measures between data objects. Comparing two documents against all others with the multi-viewpoint measure, MVS(di, dj) and MVS(di, dl), document dj is more similar to document di than document dl is if and only if MVS(di, dj) > MVS(di, dl).
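The two measures can be sketched directly: the cosine similarity uses the origin as its single viewpoint, while the multi-viewpoint similarity averages (di − dh)^T (dj − dh) over viewpoints dh outside the cluster. A minimal sketch; the example vectors are invented.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sub(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    # similarity seen from the origin: cos(a - 0, b - 0)
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def mvs(di, dj, outside):
    # average of (di - dh) . (dj - dh) over viewpoints dh outside the cluster,
    # i.e. cos(di - dh, dj - dh) * ||di - dh|| * ||dj - dh|| per viewpoint
    return sum(dot(sub(di, dh), sub(dj, dh)) for dh in outside) / len(outside)

di, dj = [1.0, 0.0], [0.8, 0.6]      # unit vectors assumed in the same cluster
outside = [[0.0, 1.0], [-1.0, 0.0]]  # viewpoint documents outside the cluster
print(round(cosine(di, dj), 3))      # 0.8
print(round(mvs(di, dj, outside), 3))  # 2.4
```

Note that the multi-viewpoint value is not bounded by 1 as the cosine is, because each viewpoint term also carries the Euclidean distances from dh to the two documents.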
The implementation procedure of MVS with similar attributes is shown in Fig. 3. First, the quantities external to each cluster are computed. Then, for each row ai of A, i = 1, . . ., n: if the pair of documents di and dj, j = 1, . . ., n, are in the same cluster, aij is computed as in line 10 of Fig. 3; otherwise, dj is assumed to be in di's cluster and aij is computed as in line 12. This similarity-matrix procedure defines the pairwise attribute relations between documents in the data sets.
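A minimal sketch of this matrix construction, under our simplification that aij is the plain average of (di − dh)^T (dj − dh) over the chosen viewpoints (the exact expressions of lines 10 and 12 in Fig. 3 are not reproduced here); the documents and labels are invented.

```python
def build_similarity_matrix(docs, labels):
    """A[i][j]: multi-viewpoint similarity of di and dj, where the viewpoints
    are all documents outside the cluster taken to contain both documents.
    If labels differ, dj is assumed to be in di's cluster (the two cases of
    Fig. 3). Assumes at least two clusters, so viewpoints are never empty."""
    n = len(docs)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            cluster = labels[i]  # dj assumed in di's cluster when labels differ
            viewpoints = [docs[h] for h in range(n) if labels[h] != cluster]
            s = 0.0
            for dh in viewpoints:
                s += sum((a - c) * (b - c)
                         for a, b, c in zip(docs[i], docs[j], dh))
            A[i][j] = s / len(viewpoints)
    return A

docs = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
labels = [0, 0, 1]
A = build_similarity_matrix(docs, labels)
```

Entries for same-cluster pairs use genuinely external viewpoints, so higher values of A[i][j] indicate pairs that look close from many outside vantage points.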

Cluster Label Data Presentation:
Two real document data sets are used as examples in this validity test. The first is reuters7, a subset of the famous collection Reuters-21578 Distribution 1.0 of Reuters newswire articles. Reuters-21578 is one of the most widely used test collections for text categorization. In our validity test we selected 2,500 documents from the seven largest categories, "acq", "crude", "interest", "earn", "money-fx", "ship" and "trade", to form reuters7; some of the documents may appear in more than one category. The second data set is k1b, a collection of 2,340 web pages from the Yahoo! subject hierarchy covering 6 topics: "health", "entertainment", "sport", "politics", "tech" and "business". It was created for a previous study in information retrieval called WebACE [6] and is now available with the CLUTO toolkit [9]. Both data sets were preprocessed by stop-word removal and stemming. In addition, we removed words that appear in fewer than two documents or in more than 99.5% of the documents. Finally, the documents were weighted by TF-IDF and normalized to unit vectors. The full characteristics of reuters7 and k1b are displayed in Fig. 4. The validity test shows the potential benefits of the new multi-viewpoint-based similarity measure compared with the cosine measure.
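The preprocessing pipeline above (document frequencies, TF-IDF weighting, unit-length normalization) can be sketched as follows; the tiny corpus is invented, and the rare/frequent-term filtering step is omitted for brevity.

```python
import math
from collections import Counter

# toy corpus of already stemmed, stop-word-free documents
docs = [
    ["oil", "price", "crude", "oil"],
    ["trade", "deficit", "price"],
    ["crude", "oil", "trade"],
]

n = len(docs)
vocab = sorted({w for d in docs for w in d})
# document frequency: number of documents containing each term
df = Counter(w for d in docs for w in set(d))

def tfidf_unit(doc):
    """TF-IDF vector over the vocabulary, normalized to unit length."""
    tf = Counter(doc)
    vec = [tf[w] * math.log(n / df[w]) for w in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

vectors = [tfidf_unit(d) for d in docs]
```

With every document on the unit sphere, the cosine similarity of two documents reduces to their dot product, which is the form used throughout the similarity computations.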

IV. COMPUTATIONAL EVALUATION
In this section we discuss the performance-evaluation procedure for data visualization, for both the parallel-coordinate density model and our proposed Similarity Measure Centered with Multi Viewpoint approach, over different data objectives. We used JDK 1.8 and NetBeans 8.0 to build the user interface for uploading and processing data sets with different parameters in a reliable data-stream evaluation. 4.1. Data set collection: The corpora used in our experiments consist of 20 benchmark document data sets. Besides reuters7 and k1b, which were described in detail above, we included another 18 text collections so that the study of the clustering techniques is more thorough and comprehensive. Like k1b, these data sets are provided with CLUTO by the toolkit's authors [19]; they have been used for experimental testing over documents, and their sources have also been described in detail. Table 2 summarizes their characteristics. The corpora present a variety of sizes, numbers of classes, and class balances. They were all preprocessed by standard techniques, including stop-word removal, stemming, removal of too-rare and too-frequent terms, and normalization [13]. Our MVSC-IR and MVSC-IV applications are implemented in Java. The regulating factor α in IR was set to 0.3 throughout the tests. None of the clustering algorithms is guaranteed to find the global optimum, and all of them are initialization-dependent. Hence, for each method we performed clustering several times with randomly initialized values and picked the best trial in terms of the corresponding objective-function value. In all of the experiments each test consisted of 10 trials, and the result reported for each data set by a particular clustering method is the average over the 10 trials.
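The trial protocol above (random initialization, several runs, keep the best run by objective value) can be illustrated with a toy 1-D k-means. The paper's actual system clusters TF-IDF vectors in Java, so this Python sketch only mirrors the protocol, not the real algorithms or data.

```python
import random

def kmeans_1d(points, k, seed, iters=20):
    """Lloyd's algorithm on 1-D points with random initial centers."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[nearest].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    # objective: total squared distance of each point to its nearest center
    objective = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, objective

points = [1.0, 1.2, 0.9, 8.0, 8.3, 7.9]
# ten trials with different random seeds; keep the best objective value
trials = [kmeans_1d(points, k=2, seed=s) for s in range(10)]
best_centers, best_obj = min(trials, key=lambda t: t[1])
```

Averaging the per-trial quality scores, as the experiments above do, then summarizes how a method behaves across initializations rather than in a single lucky run.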
Figure 6 shows the accuracy of our proposed approach on text documents for the different data sets, with the parameter values given in Table 3, and also reports time efficiency in real-time data-set processing. We conclude that the SMCMV approach gives better and more efficient results than the 5Ws density model across the different types of documents.

V. CONCLUSION
In this paper we discussed data visualization over different data sets, including parallel-coordinate visualization of data organized by topic, data type, and data set. To measure the similarity of data objects within data sets, we proposed a novel method, Similarity Measure Centered with Multi Viewpoint (SMCMV), alongside cosine similarity, for text, image, and video documents. We also compared the parallel-coordinate density model with our proposed approach, both theoretically and practically, on large document collections. The key point of our approach is to present data sets in a multi-view representation. A further improvement would be to process documents in parallel using advanced machine-learning approaches on real-time data sets.