TM-SGTD: Text Mining Based Semantic Graph for Text Document Approach for Text Representation

Abstract: Text representation is an essential step in text mining tasks. To represent textual information more expressively, a Text Mining based Semantic Graph approach is proposed that incorporates more semantic and ordering information among terms, as well as the structural information of the text. The model is constructed by extracting representative terms from texts together with their mutual semantic relationships. The proposed work is implemented in the Java and Python environments. WordNet is used to obtain the relationships among word nodes, and the Gephi tool is used to construct the semantic graph more effectively. The proposed approach is also compared with a traditional approach, taking memory consumption and time consumption as the standard parameters. The experimental results show better performance of the proposed text representation model in terms of time and space complexity.


B. Classification of Graph-based Model
A graph G is represented by the tuple $G = (V, E)$, where V is the set of nodes and E is the set of edges of G. Similarly, when text is represented as a graph, the vertices (nodes) represent a domain, a subject, or a word, and the edges represent the relationships among these nodes. Graphs can be classified in more detail according to their node representation and their edge representation.
1.1 Node Representation Method
The nodes of graph G represent the meaningful components of the text, such as words, subjects, domains, sentences, and paragraphs. All of these components represent concepts and can be regarded as semantic components. In a graph model, a node may represent a single component or more than one component. If a node represents one component, the representation is called homogeneous; if a node represents more than one component, it is called heterogeneous. To capture the influence of one subject on another, the graph is either directed or labeled. In this context, text representation graphs may also carry weights, so text graph representation supports both weighted and unweighted techniques, according to requirements.
1.2 Homogeneous Representation vs. Heterogeneous Representation
In homogeneous representations, a word is typically represented as a node [4]. Relationships such as the co-occurrence of words are commonly expressed in graph notation. Here co-occurrence denotes that words appear together in the same document or subject, so co-occurring words can be connected by an edge. Beyond co-occurrence, a homogeneous text graph can also represent grammatical associations or semantic similarities among words. Since this representation is simple, the cost of building and analysing the model is low. Its key advantage is that any existing graph algorithm can be used without modification. Some studies also use homogeneous representations in which sentences, paragraphs, or concepts are the nodes [5].
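As an illustration of a homogeneous word graph, the following minimal Python sketch links words that occur next to each other. It assumes that co-occurrence means two words falling inside a small sliding window; the function name `build_cooccurrence_graph` and the `window` parameter are hypothetical and are not taken from the cited works.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Return an undirected weighted graph as {word: {neighbour: count}}.

    Assumption: two words co-occur when they fall inside the same
    sliding window of `window` consecutive tokens.
    """
    graph = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        # Link the current word to the next `window - 1` tokens.
        for other in tokens[i + 1 : i + window]:
            if other == word:
                continue  # skip self-loops
            graph[word][other] += 1
            graph[other][word] += 1
    return {w: dict(nbrs) for w, nbrs in graph.items()}


tokens = "graph models represent text and graph models relate words".split()
cooc = build_cooccurrence_graph(tokens, window=2)
```

Every node here is a single word, so the graph is homogeneous in the sense described above, and the edge counts can serve as edge weights.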

Weighted and Unweighted
In weighted representations, a weight is assigned to each node, whereas in unweighted representations the nodes carry no weights. Most studies assume weighted nodes, where the weight indicates the importance of the node in the graph. To evaluate node weights, some methods, for example PageRank, exploit the number of edges, the weights of edges, or the weights of the nodes connected by an edge.
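The way PageRank derives node weights from edges can be sketched with a plain power iteration. This is a simplified illustration, not the weighting scheme of any cited work; the damping factor 0.85, the iteration count, and the toy graph are assumptions.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """graph: {node: {neighbour: edge_weight}}; returns {node: score}.

    Assumption: every neighbour also appears as a key of `graph`.
    """
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    out_weight = {node: sum(graph[node].values()) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[u] for u in nodes if out_weight[u] == 0)
        new_rank = {}
        for v in nodes:
            incoming = sum(
                rank[u] * graph[u][v] / out_weight[u]
                for u in nodes
                if v in graph[u] and out_weight[u] > 0
            )
            new_rank[v] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = new_rank
    return rank


# Toy star-shaped word graph: "graph" is linked to every other word.
toy = {
    "graph": {"text": 1, "node": 1, "edge": 1},
    "text": {"graph": 1},
    "node": {"graph": 1},
    "edge": {"graph": 1},
}
pr_scores = pagerank(toy)
top_word = max(pr_scores, key=pr_scores.get)
```

In this toy graph the hub word receives the largest score, which is the sense in which a node weight reflects the importance of the node.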
II. LITERATURE SURVEY
This section reviews recent efforts and contributions toward the design of semantic graph techniques for text documents, drawing on a range of articles and research papers.
Jinghua Wang et al. [6] introduced the language network and described three kinds of networks. Keyword extraction is an important technology in many areas of document processing. In particular, a keyword extraction algorithm based on the language network and PageRank is proposed. First, a semantic network is built for a single document; then PageRank is applied to the network to decide the importance of each word; finally, the top-ranked words are selected as the keywords of the document. The algorithm is tested on the CISTR corpus, and the experimental results show it to be practical and effective.
Chuntao Jiang et al. [7] described a graph-based approach to document classification. The graph representation offers the advantage that it allows a much more expressive document encoding than the more standard bag-of-words/phrases approach, and consequently gives improved classification accuracy. Document sets are represented as graph sets, to which a weighted graph mining algorithm is applied to extract frequent sub-graphs, which are then further processed to produce feature vectors (one per document) for classification. Weighted sub-graph mining is used to ensure classification effectiveness and computational efficiency; only the most significant sub-graphs are extracted. The approach is validated and evaluated using several popular classification algorithms together with a real-world textual data set. The results demonstrate that the approach can outperform existing text classification algorithms on some data sets. As the size of the data set increases, further processing of the extracted frequent features becomes essential.
Automatic keyword extraction is the task of identifying a small set of words, key phrases, keywords, or key segments from a document that can describe the meaning of the document. Keywords are useful tools because they give the shortest summary of the document. Vishal Gupta et al. [8] focus on automatic keyword extraction for Punjabi language text. The method includes several phases: removing stop words, identification of Punjabi nouns and noun stemming, calculation of Term Frequency and Inverse Sentence Frequency (TF-ISF), selection of Punjabi keywords as nouns with a high TF-ISF score, and a title/headline feature for Punjabi text. The extracted keywords are helpful in automatic indexing, text summarization, information retrieval, classification, clustering, topic detection and tracking, web searches, and so on.
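A TF-ISF score of the kind mentioned above can be sketched as follows. The exact formula used in [8] is not given here, so this assumes the common form TF(t) × log(N / SF(t)), where N is the number of sentences and SF(t) is the number of sentences containing term t; the function name `tf_isf` is hypothetical.

```python
import math
from collections import Counter


def tf_isf(sentences):
    """sentences: list of token lists. Returns {term: TF * ISF}.

    Assumed formula: TF(t) * log(N / SF(t)).
    """
    n_sentences = len(sentences)
    # TF: total occurrences of each term in the document.
    tf = Counter(tok for sent in sentences for tok in sent)
    # SF: number of sentences in which each term appears at least once.
    sf = Counter(tok for sent in sentences for tok in set(sent))
    return {t: tf[t] * math.log(n_sentences / sf[t]) for t in tf}


sentences = [
    ["graph", "text", "graph"],
    ["text", "mining"],
    ["text", "node"],
]
tfisf_scores = tf_isf(sentences)
best_term = max(tfisf_scores, key=tfisf_scores.get)
```

A term that occurs in every sentence gets a score of zero under this formula, while a term concentrated in few sentences is ranked higher.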
Marina Litvak et al. [9] introduce and compare two novel approaches, supervised and unsupervised, for identifying the keywords to be used in extractive summarization of text documents. Both approaches are based on the graph-based syntactic representation of text and web documents, which enhances the traditional vector-space model by taking some structural document features into account. In the supervised approach, they train classification algorithms on a summarized collection of documents with the purpose of inducing a keyword identification model. In the unsupervised approach, they run the HITS algorithm on document graphs under the assumption that the top-ranked nodes should represent the document keywords. Their experiments on a collection of benchmark summaries show that, given a set of summarized training documents, supervised classification provides the highest keyword identification accuracy, while the highest F-measure is reached with a simple degree-based ranking. In addition, it is sufficient to perform only the first iteration of HITS rather than running it to convergence.
Chien-Liang Liu et al. [10] proposed a semi-supervised clustering method called Constrained-PLSA to cluster tagged documents with a small amount of labeled documents, and used two data sets for system performance evaluation. The first data set is a document set whose boundaries among the clusters are not clear, while the second has clear boundaries among clusters. The study uses abstracts of papers and the tags annotated by users to cluster documents. Four combinations of tags and words are used for feature representation. The experimental results show that most of the methods can benefit from tags.
Nonetheless, unsupervised learning methods fail to function properly on the data set with noisy information, whereas Constrained-PLSA functions properly. In many real applications, background knowledge is available, making it appropriate to employ background knowledge in the clustering process to make the learning faster and more effective.
III. PROPOSED WORK
This section introduces the proposed work and the contributions made during the study. In addition, the proposed model and its working are described.
A. Problem Domain
Technology platforms are becoming increasingly capable of interpreting and responding to domain-specific text problems that are not easy to present clearly. Ontologies of words can help represent the relationships between entities, so that they can be used to improve the accuracy and effectiveness of a system in meeting its users' information needs. At the same time, when the user's information or a general database is very large, it is difficult to lay out the words effectively in graphical form and thus to construct a semantic graph. Moreover, the vocabulary of a text is limited, while the volume of words held in databases keeps growing. There is therefore a need for an effective semantic, ontology-based representation model that uses the co-occurrence of words. Most ontologies are created manually by experts, so their design, development, maintenance, and updating require significant effort and time. To combat this, ontology learning systems, which attempt to automatically learn relationships from a domain and then map them into an ontology, are becoming more prevalent.
B. Methodology
To refine and advance the traditional graph scenario, we present the proposed Text Mining based Semantic Graph approach for text documents, a workflow that shows how text data can be effectively depicted in a graphical view.
Description: The construction of the semantic graph for text documents is shown in Figure 1. First, input data is taken in textual form; here the data is taken from different newspapers. Second, after the input file is read, the loaded file is preprocessed using normalization and stemming. Preprocessing identifies and removes noisy content from the input data file before the learning procedure. After indexing, a grouped concept is prepared for the indexed words.
All grouped concepts are combined into a grouped word list using WordNet. The output of this step is a word list, which can be represented as $w_1, w_2, \ldots, w_n$.
In the next phase, these words are treated as nodes of the graph, and the relations between words are found using WordNet. It is significant to note that WordNet does not capture proper nouns such as the name of a place or a person, so such a word is treated as a single node for which no association with other words can be found. Finally, the semantic graph is constructed using the given algorithm. Once the semantic graph is constructed, the weight of each node is calculated. Each node relation has a different weight, which shows that each node is unique with respect to the others. The nodes with the highest weights are then selected, and a graph file is generated for analysis of the semantic graph.
C. Proposed Algorithm
The previous section described the overall system design and the proposed system architecture. In this system model, the text-based semantic graph modules are implemented to construct the graph accurately. The complete data processing of TM-SGTD, including its training and testing phases, is described here. Table I shows the proposed algorithm for text representation:
WordList ← getList(GroupedConcept)
8: SemanticGraph ← ConstructGraph(WordList)
9: Return S
Graphs are a well-studied class of data structures used to model relationships (edges) between entities (nodes). Semantic graphs in general, and ontologies specifically, model a domain by defining how different entities within the domain are related. To make text data easier to understand, it can be visualized as a graph, in which the text data is modeled in terms of nodes and edges, and the edges represent the relationships between nodes. This process is performed either manually by a domain expert or automatically by an ontology-based system, because building such a semantic graph typically requires explicitly modeling the nodes and edges ahead of time.
A graph representation of text can have any combination of nodes and edges toward other nodes, which serve to show the relationships between nodes. Knowledge representation using graphs offers a number of advantages, such as automatic creation from real-world data, the ability to express different combinations of nodes and edges, including pre-existing nodes or domains, complete modeling of the semantic relationships between all entities in a domain, and dynamic traversal of the graph.
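The workflow described in this section (word list, relation lookup, graph construction, node weights) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: `TOY_RELATIONS` is a hand-written, hypothetical stand-in for WordNet, the stop-word list is minimal, stemming is omitted, and node weight is assumed to be the node degree. The helper names `get_word_list` and `construct_graph` only mirror the `getList` and `ConstructGraph` steps of the algorithm.

```python
import re

STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "to"}

# Hypothetical stand-in for WordNet: each word maps to its related words.
TOY_RELATIONS = {
    "car": {"vehicle", "engine"},
    "vehicle": {"car", "truck"},
    "truck": {"vehicle", "engine"},
    "engine": {"car", "truck"},
}


def get_word_list(text):
    """Normalize, drop stop words, and keep each word once, in order."""
    seen, words = set(), []
    for tok in re.findall(r"[a-z]+", text.lower()):
        if tok not in STOP_WORDS and tok not in seen:
            seen.add(tok)
            words.append(tok)
    return words


def construct_graph(word_list, relations):
    """Create one node per word and one edge per known relation."""
    graph = {w: set() for w in word_list}
    for w in word_list:
        for other in relations.get(w, ()):
            if other in graph and other != w:
                graph[w].add(other)
                graph[other].add(w)
    return graph


def node_weights(graph):
    """Assumed weighting: the degree (number of neighbours) of each node."""
    return {w: len(nbrs) for w, nbrs in graph.items()}


text = "The engine of the car and the truck is a vehicle in Indore"
word_list = get_word_list(text)
semantic_graph = construct_graph(word_list, TOY_RELATIONS)
weights = node_weights(semantic_graph)
```

The place name has no entry in the toy lexicon, so it remains an isolated node, which matches the remark above about words for which no association can be found.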

IV. RESULT ANALYSIS
A. Time Consumption

$\text{Time Consumed} = \text{End Time} - \text{Start Time}$
The time consumption of the proposed algorithm is reported in Figure 2 and Table II. In the figure, the X axis shows the different experiments and the Y axis shows the time consumed to process the algorithm for the respective input data file.
The performance of the proposed TM-SGTD is shown by the blue line and that of the traditional approach by the orange line. According to the results, the proposed system consumes less time than the traditional algorithm. Here the time is that required to match the text keywords using indexing. The results also show that the time consumed depends on the amount of data provided to the algorithm. Nevertheless, the relative performance of the system shows its effectiveness over the traditional algorithm. Moreover, TM-SGTD represents text data in graphical form efficiently.

Table II. Time consumption (in milliseconds)
Experiment   Proposed TM-SGTD   Base Approach
1            1108               1515
2            925                1331
3            920                1871
4            1142               1350
5            1573               1518
6            1498               1386
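The end-time-minus-start-time measurement can be sketched in Python as follows. The helper `timed` and the workload `count_words` are hypothetical placeholders; the actual measured step in the experiments would be the graph construction.

```python
import time


def timed(func, *args):
    """Run func(*args) and return (result, elapsed time in milliseconds)."""
    start = time.perf_counter()
    result = func(*args)
    end = time.perf_counter()
    # Time consumed = end time - start time, converted to milliseconds.
    return result, (end - start) * 1000.0


def count_words(text):
    # Hypothetical workload standing in for semantic graph construction.
    return len(text.split())


result, elapsed_ms = timed(count_words, "semantic graph for text documents " * 1000)
```

Using a monotonic clock such as `time.perf_counter` avoids distortions from system clock adjustments; repeated runs would still vary, which is consistent with the run-to-run differences in the table.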

B. Memory Consumption
The memory consumption is the amount of main memory required to process the algorithm for the given amount of input data; it is also referred to as the space complexity of the algorithm. The memory consumption is computed using the following formula.
$\text{Memory Consumed} = \text{Total Memory} - \text{Free Memory}$
Figure 3 and Table III show the memory consumption, or space complexity, of the system as the number of runs on text documents increases. The X axis gives the experiments performed with the text dataset and the Y axis gives the memory required by the algorithms during experimentation, measured in kilobytes (KB). According to the experimental results, the memory used by the proposed system is not much higher and does not fluctuate strongly. The graph thus indicates that the proposed system does not consume much more memory than the traditional system. Constructing the semantic graph takes additional space for evaluating nodes and edges, but the traditional approach has a large memory overhead for processing the data.
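The formula above has the same shape as the difference between the JVM's `Runtime.totalMemory()` and `Runtime.freeMemory()`; whether the measurements were taken that way is an assumption. As a rough Python analogue, the sketch below uses the standard `tracemalloc` module, which reports memory allocated by Python objects rather than total minus free memory, so it is a swapped-in technique for illustration only. The helper `measure_peak_kb` and the workload `build_nodes` are hypothetical.

```python
import tracemalloc


def measure_peak_kb(func, *args):
    """Run func(*args) and return (result, peak traced memory in KB)."""
    tracemalloc.start()
    try:
        result = func(*args)
        _, peak_bytes = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak_bytes / 1024.0


def build_nodes(n):
    # Hypothetical workload: allocate n graph nodes with empty edge sets.
    return {f"word{i}": set() for i in range(n)}


nodes, peak_kb = measure_peak_kb(build_nodes, 10_000)
```

Reporting the peak rather than the final value captures temporary structures built during graph construction, which is where node and edge evaluation would cost extra space.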

IV. CONCLUSION
The rapid development of modern techniques has enabled a large amount of text data to be published on the web. However, effective representation of text for various text mining tasks is still an open problem. In popular classical text representation models for text analysis and visualization, graphs are built from units such as single words or phrases, and all terms are assumed to be independent units of the graph. In this research work, we present a new idea for mining documents by exploiting the semantic information of their texts. We have proposed TM-SGTD, i.e. the Text Mining based Semantic Graph approach for Text Document, a novel approach that presents a semantic graph constructed from text data. A formal semantic representation of linguistic inputs is introduced and used to build a semantic representation scheme for documents. The graph representation of text allows both the structure and the content of documents to be represented.
V. FUTURE WORK
The primary goal of the proposed work has been achieved. The following suggestions are made for further improvement and for extending the work to other applications.
• In the near future, semantic-similarity-based techniques and word sense disambiguation can be used to improve the connectivity and relations between nodes.
• Another future direction is to investigate the use of WordNet to extract synonyms, hypernyms, and hyponyms and their effect on document clustering, categorization, and retrieval results, compared with traditional methods.