A Hybrid Technique to Classify Trending Topic on Twitter Dataset

Abstract: Text mining is an active area of research and development, and a number of recent contributions have been made in this domain. Various social media text analysis techniques are used to find trending topics, and both supervised and unsupervised data mining techniques are applied in text mining. The proposed study focuses on the supervised learning approach. In the first step the dataset is prepared: data on different subjects or domains is extracted from Twitter, and class labels are appended to the data according to these subjects. The dataset is then split into two sets, the training data and the testing data. In the next step the data is pre-processed by removing stop words and special characters. The resulting dataset is used with three different classification algorithms: KNN (k-nearest neighbor), SVM (support vector machine), and a combination of SVM and KNN. The hybrid approach is used to build the classification model, which is then applied to the test dataset to discover the Twitter subjects. The results of the proposed work are compared with the traditional KNN and SVM algorithms. According to the results obtained, the proposed hybrid technique provides more precise results than the conventional classifiers.


B. Summarization
The primary objective of this procedure is to extract the essential content from a large number of text documents. Summarizing large document collections manually is unrealistic [4]. In many research settings it is impractical to read every document, because the researcher has no time to read them all. Summarization condenses the documents and builds a summary from their main points.
C. Topic Tracking
Topic tracking refers to keeping track of a user's previous searches in order to find the user's search pattern. The topics an individual has previously searched for are kept in a file known as the user profile. This profile is used to anticipate the future searches that the individual may need. One drawback is that it also returns unnecessary data. Topic tracking is used in many areas, such as radio and news broadcasts.
D. Classification
Classification is the process of finding the main subject of a document by adding metadata and analysing the document. The technique counts the words present in a document and, from those counts, decides the subject of the text. This method assigns text files to predefined class labels [5].
The remainder of this paper is organized as follows. Related work is given in Section II, followed by the proposed work and methodology, and then the results are discussed. Finally, the conclusion and future work are given.
II. LITERATURE SURVEY
The variety of ways in which the problem has been studied, and the many techniques that have been developed, make comparison between these studies challenging. The literature reports on the feasibility of predicting several user and message attributes in text-based, real-time, online messaging services. For this purpose, a large collection of chat messages is examined, and the applicability of various supervised classification techniques for extracting information from the tweets is assessed. A term-based approach is used to investigate the user and message attributes in the context of vocabulary use, while a style-based approach is used to analyse the tweets according to the variations in the authors' writing styles. Among 100 authors, the identity of an author is correctly predicted with 99.7% accuracy [6].
"Text classification methodologies applied to micro-text in military chat" presents techniques to classify lines of military chat, or posts, that contain items of interest. The authors evaluated several current text classification and feature selection techniques on chat posts. These chat posts are examples of 'micro-text', that is, text that is generally short in length, semi-structured, and characterized by unstructured or informal grammar and language. Although this study focused specifically on tactical updates via chat, the authors believe the findings are applicable to content of a similar linguistic structure. Completing this milestone is a significant first step towards more complex classification and information extraction [7].
Sentiment classification aims at mining people's reviews of a specific event, topic or product by automatically classifying the reviews into positive or negative sentiments. Automatic opinion mining is beneficial for decision makers and ordinary people alike. There are mainly two kinds of approaches to sentiment classification: machine learning techniques and semantic orientation techniques. Although some pioneering studies have explored approaches for classifying English reviews, little work has been done on sentiment classification for Chinese reviews [8].
Another study considers the problem of automatically classifying human sentiment from natural language written text. In this opinion mining domain, the authors compare the accuracy of ensemble models, which take advantage of groups of learners to yield greater performance. They show that these ensemble machine learning models can significantly improve sentiment classification for free-form text [9].
Microblogging has become a very popular communication tool among Internet users. A large number of users share opinions on different aspects of life every day, so microblogging websites are rich sources of data for opinion mining and sentiment analysis. Because microblogging appeared relatively recently, only a few researchers are devoted to this topic. One such study focuses on using Twitter, the most popular microblogging platform, for the task of emotion analysis. The authors show how to automatically collect a corpus for emotion analysis and opinion mining purposes, and then perform a linguistic analysis of the collected corpus and explain the phenomena discovered. Using the corpus, they build an emotion classifier that is able to determine the emotion class of a written text [10].

III. PROPOSED WORK
Text mining approaches are used in various applications to extract essential information from text data. This work is dedicated to topic classification on social media data, which helps to identify the trending topics in Twitter text. This section describes the proposed technique for sentiment-based topic tracking.
A. System Overview
Data mining is performed on any kind of data for automated data analysis. In this analysis, different kinds of algorithms are used to process the data and to find similar patterns in it. The algorithm is selected according to the requirements of the application, and it may be supervised or unsupervised. Supervised algorithms are employed when some patterns of the data are predefined and new patterns need to be distinguished among previously known classes. Unsupervised techniques, on the other hand, are used when the data is not initially assigned to predefined classes and is available in huge volumes. In addition, supervised techniques are used when accurate recognition of data is required. In the presented work, a supervised classification technique is used for analysis and text pattern classification. The main aim of this text classification technique is to discover the trending topics in social media according to their hidden sentiments. The approach also helps to distinguish the patterns of data, in other words the topic information in any tweet. To perform this text classification, the pre-processing technique is adapted, and three different classification models are used: first the SVM (support vector machine), next the KNN (k-nearest neighbor) classifier, and finally a hybrid approach of SVM and KNN, which is used to improve the precision and other parameters of classification.
B. Methodology
The method applied in the proposed system is described in this section. Figure 2.1 shows the proposed technique for data processing and classification of text.
1) Training Dataset: To demonstrate the capability of the proposed technique for topic tracking, a Twitter dataset is used. Before this data is classified using the proposed system, the different topics in the data are identified and a class label is assigned to each tweet. This data is then used for classification.
2) Test data: The complete dataset is divided into training and testing data in such a way that the results are not biased. The test data is used to evaluate the trained model and to measure its accuracy.
3) Pre-processing: Pre-processing is the task of data transformation and correction. Different techniques of data cleaning and refinement are applied so that the data is acceptable to the algorithm used. Two different techniques are applied in this phase.

4) Removing the special character:
This is the first phase of data pre-processing. A list of special characters is prepared, and a find-and-replace operation uses this list to remove the unwanted special characters from the data.

5) Removal of stop words:
After the unnecessary special characters are removed, the stop words are removed from the text data. Stop words are words that are used frequently in sentence formation but have no impact on recognizing the domain, e.g. this, that, is, are, who, and so on. A list of the considered stop words is prepared, and all the unwanted stop words are removed using the "find and replace" method.
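The two pre-processing steps above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the character list and the stop-word list are hypothetical (the paper does not publish its lists), and lowercasing the text before matching stop words is an added assumption.

```python
import re

# Hypothetical lists for illustration only; the paper does not give its own.
SPECIAL_CHARACTERS = "!@#$%^&*()_+-=[]{};:'\",.<>/?\\|`~"
STOP_WORDS = {"this", "that", "is", "are", "who", "the", "a", "an"}


def remove_special_characters(text):
    """Replace every listed special character with a space."""
    pattern = "[" + re.escape(SPECIAL_CHARACTERS) + "]"
    return re.sub(pattern, " ", text)


def remove_stop_words(text):
    """Drop every token that appears in the stop-word list.

    Lowercasing is an assumption made so that "Who" and "who" match.
    """
    tokens = text.lower().split()
    return " ".join(token for token in tokens if token not in STOP_WORDS)


def preprocess(tweet):
    """Apply both steps in the order described: characters first, then stop words."""
    return remove_stop_words(remove_special_characters(tweet))


if __name__ == "__main__":
    print(preprocess("Who is watching this movie?!"))
```

Replacing special characters with a space, instead of deleting them, keeps words that were joined only by punctuation from being merged into one token.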
6) Data mapping: After pre-processing, the data is passed as input to the data mapping phase, where it is transformed into a different format: the text contents are mapped into symbolic form. Table I illustrates the mapping with an example.

Table I
No.  Sentence               Encoded sequence
1.   hello i m online       "1", "2", "3", "4"
2.   i m enjoying online    "2", "3", "5", "4"

Table I shows that each individual word is converted into a definite symbol, and when a word is repeated the same symbol is reused. For example, the word "online" appears in both sentences at the fourth place, so in both places it is replaced with the number "4".
7) SVM training: The encoded strings, all of the same length, are provided together with their class label symbols as the training dataset for SVM training. The training process is considered complete once the maximum number of training iterations has been reached. After training, the SVM is ready to perform classification on data with similar patterns.
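The data mapping step in 6) can be sketched in a few lines. The function names `build_mapping` and `encode` are hypothetical, and the sketch assumes that symbols are assigned in order of first appearance, which is consistent with the two example sentences but is not stated explicitly in the text.

```python
def build_mapping(sentences):
    """Assign each distinct word the next unused symbol, in order of first appearance."""
    mapping = {}
    for sentence in sentences:
        for word in sentence.split():
            if word not in mapping:
                mapping[word] = str(len(mapping) + 1)
    return mapping


def encode(sentence, mapping):
    """Replace every word of a sentence with its symbol."""
    return [mapping[word] for word in sentence.split()]


sentences = ["hello i m online", "i m enjoying online"]
mapping = build_mapping(sentences)
encoded = [encode(sentence, mapping) for sentence in sentences]

if __name__ == "__main__":
    for sentence, symbols in zip(sentences, encoded):
        print(sentence, "->", symbols)
```

How words unseen during training are encoded, and how shorter sequences are padded to a common length, is not described in the text, so this sketch leaves both out.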

8) Trained SVM:
The trained SVM classifier is now used with the testing dataset. Tweets similar to those in the training data are selected to prepare the testing dataset. To create input patterns of the same kind as the training inputs of the SVM classifier, the testing data is also converted to the same encoded form. In this phase the encoded test data is supplied to the classifier to generate the class labels for the input tweets. These class labels are the names of the domains that need to be identified. Table II illustrates the classified outcomes with an example.

Table II
No.  Input to SVM          Predicted outcome   Label actual name
1.   "2", "3", "5", "4"    A                   Politics
2.   "1", "3", "2", "4"    A                   Politics
3.   "2", "6", "4", "1"    B                   Movies
4.   "2", "3", "3", "4"    B                   Movies

According to Table II, the first column gives the input sequences to be classified by the SVM. The predicted class label is given in the second column, and the third column gives the actual class name that corresponds to the prediction. Only the first two columns are used for further refinement of the outcome.
9) KNN: This phase takes the outcome of the previous phase as input. It sub-divides each class into two sub-classes, the first called more likely and the second called less likely. Patterns with the same class label are compared to find the distance between them, and a distance threshold of 75% is used to separate the two groups of patterns.
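One possible reading of the KNN refinement in 9) is sketched below. The text does not specify the distance measure, the value of k, or how the 75% threshold is applied, so everything here is an assumption: similarity is taken as the fraction of positions holding the same symbol, only the single nearest neighbour with the same predicted label is considered, and a pattern is marked "more likely" when that similarity is at least 0.75. The names `similarity` and `refine` are hypothetical.

```python
def similarity(a, b):
    """Fraction of positions at which two equal-length sequences hold the same symbol."""
    if not a or len(a) != len(b):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


def refine(predictions, threshold=0.75):
    """Split each predicted class into "more likely" and "less likely" patterns.

    predictions: list of (encoded_sequence, predicted_label) pairs.
    Returns a list of (encoded_sequence, predicted_label, sub_class) triples.
    """
    refined = []
    for i, (sequence, label) in enumerate(predictions):
        # Similarity to every other pattern that received the same label.
        neighbours = [
            similarity(sequence, other)
            for j, (other, other_label) in enumerate(predictions)
            if j != i and other_label == label
        ]
        nearest = max(neighbours, default=0.0)
        sub_class = "more likely" if nearest >= threshold else "less likely"
        refined.append((sequence, label, sub_class))
    return refined


if __name__ == "__main__":
    sample = [
        (["2", "3", "5", "4"], "A"),
        (["2", "3", "1", "4"], "A"),
        (["2", "6", "4", "1"], "B"),
        (["2", "3", "3", "4"], "B"),
    ]
    for row in refine(sample):
        print(row)
```

The sample data is made up for the sketch; with the sequences of Table II no pair in the same class reaches the 0.75 similarity used here.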
10) Performance: After classification, the performance of the classifier is measured using different performance parameters.
C. Proposed Algorithm
The designed classification process is summarized as the sequence of steps involved in training and testing. Table III gives the steps of the proposed algorithm.

IV. RESULT ANALYSIS
A. Precision
The classifiers are compared on the basis of the output they produce. For the sentiment of the Twitter dataset of user chat data, the proposed hybrid classifier is the most efficient and accurate of the evaluated classifiers in terms of precision rate.
B. Recall
Recall is the ratio of the number of correct positive results to the number of positive results that should have been returned. It measures the completeness, or sensitivity, of the sentiment classifier. A higher recall means fewer false negatives (FN), whereas a lower recall means more false negatives. The following recall rate formula is used to evaluate the classification of social chat using SVM and KNN:

$$\text{RecallRate} = \frac{TP}{TP + FN}$$

where TP is the number of true positives and FN is the number of false negatives.
Figure 3 and Table V show the comparative recall rate of the implemented classification algorithms. The X-axis shows the different executions of the project for training performance and the Y-axis shows the performance in terms of recall rate percentage. The recall rate of the traditional classifier is given by the green line, that of the proposed classification technique by the blue line, and that of the base classifier by the red line. The performance of the proposed classifier is better and more efficient, and it decreases as the amount of data increases.
C. F-Measure
The F-Measure, or F1 score, is the harmonic mean of precision and recall. It is often used as a weighted average for balancing the quality and quantity of the selected true positives, and is given by:

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$

Figure 4 and Table VI show the performance of the proposed hybrid text classification system in terms of F-measure. The X-axis shows the total runs of data execution and the Y-axis shows the performance in terms of F-measure. According to the results obtained, the performance of the implemented system is considerably more stable, and the improvement in user chat text classification comes from the combination of the SVM and KNN classifiers. In addition, the results improve progressively as the amount of data increases. The obtained results are therefore acceptable and efficient for sentiment analysis of text data.
D. Training Time
Training is the process of taking content that is known to belong to specified classes and creating a classifier on the basis of that known content. Training time is the time taken to train the classifier on the Twitter dataset. It is measured for datasets of different sizes, together with the classification and processing of the data.
$$\text{TrainTime} = \text{ClassifierEndTime} - \text{ClassifierStartTime}$$

The training time of the implemented algorithms is given in Figure 5 and Table VII. The X-axis shows the different executions of the system and the Y-axis shows the training time in milliseconds. According to the results, the time required by the proposed system to train on the dataset is less than that of the other traditional algorithms. The training time increases in a similar manner as the amount of data for analysis increases.
E. Test Time
In terms of test time, the proposed approach and the base classifier take approximately the same time to test the data, while the traditional KNN consumes more time. Thus the combined approach minimizes test time relative to the two other classifiers.
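The evaluation measures used in this section can be sketched as below. This is an illustrative helper set, not the authors' evaluation code. Recall, F-measure and elapsed time follow the definitions given in the text; the precision formula TP / (TP + FP) is the standard definition and is an addition here, since it is not stated in the text. All function names are hypothetical.

```python
import time


def count_outcomes(actual, predicted, positive):
    """Count true positives, false positives and false negatives for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, fp, fn


def precision(tp, fp):
    """Standard definition (assumed): TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Recall rate as defined in the text: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def timed(func, *args):
    """Run func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    end = time.perf_counter()
    return result, (end - start) * 1000.0


if __name__ == "__main__":
    actual = ["Politics", "Politics", "Movies", "Movies"]
    predicted = ["Politics", "Movies", "Movies", "Movies"]
    tp, fp, fn = count_outcomes(actual, predicted, "Politics")
    p, r = precision(tp, fp), recall(tp, fn)
    print(p, r, f_measure(p, r))
```

Wrapping a training or testing call in `timed` gives the end time minus start time difference that the train time formula describes, in the same millisecond unit as the figures.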
F. Memory Consumption
Memory consumption is the amount of memory required by the system to execute properly. It is sometimes also expressed as the space complexity. The formula is given below:

$$\text{MemoryConsumption} = \text{TotalMemory} - \text{FreeMemory}$$

Memory utilization is the amount of information stored in main memory, and thus it affects the cost of computation. The performance of the proposed classifier is given in Figure 7 and Table IX. In this figure, the X-axis shows the total number of runs of the algorithms and the Y-axis shows the memory utilization in kilobytes (KB). The results show that the algorithm behaves similarly across the different runs, but the amount of memory consumed increases with the amount of data.
V. CONCLUSION
Online social networking communities often display complex collective behaviour. Since emotions play a relevant part in human decision making, social media plays a big role in driving human interest from one subject to another. One of the most relevant tasks in sentiment analysis is polarity classification, which aims at classifying the sentiment behind texts. Social networks and microblogging tools such as Twitter allow individuals to express their opinions, feelings, and thoughts on a variety of topics as short text messages. These short messages (commonly known as tweets) may also include the emotional states of individuals (such as satisfaction, nervousness, sorrow, and happiness) as well as the emotions of a larger group. The proposed Hybrid Social Chat Classification Approach using SVM and KNN uses the best features of both traditional classifiers and analyses social network text for sentiment- and orientation-based text classification. It involves all the data mining stages, i.e.
pre-processing, feature extraction, tagging, learning from the training dataset, creating the classification model, and classifying newly arrived patterns. Pre-processing involves the removal of punctuation and of unnecessary frequent words that are not important for deciding the class of the tweet; tagging involves converting the data in terms of the parts of speech of the sentence and selecting features from the raw data. After appropriate features are selected, three different data mining models are applied to the data; each model learns from the training patterns and is then used to classify new incoming patterns. The performance of the system is estimated using the parameters Precision, Recall, F-Measure, Train Time, Test Time and Memory. The performance comparison of the system with the other traditional methods is given in the following table.

[Table: performance comparison with traditional methods; surviving row: Memory Consumption: Low, High, Average]

The proposed model for text classification is desirable and efficient for classifying text data from various domains and for finding out which domain or subject is more popular among users. The proposed work makes empirical contributions to this research area by comparing the performance of different popular sentiment-based user chat classification approaches and by developing a combined approach that further improves sentiment classification performance. The overall performance of the proposed methodology is satisfactory and better than that of the traditional classification algorithms.
VI. FUTURE WORK
The proposed work is effective for classifying text data into classes and for examining system performance. The proposed technique combines several methods to improve the classification rate. In the near future, the proposed work can be extended into a web service that, on request, accesses various social networking websites, gathers reviews and summaries, and displays them to the end user. A future study will also concentrate on the use of certain opinion analysis techniques within sentiment analysis in social networks.