Design of a Framework for Knowledge Based Web Page Ranking

- The Web is growing exponentially, so search mechanisms must provide relevant, high-quality web pages to the internet user in a short time. Standard search engines use the link structure of the web to measure the quality of web pages; however, it has been observed that some less popular and lowly ranked, yet significantly important, web pages go missing from the results. In this paper, a framework for knowledge-based web page ranking is presented. It provides relevant, high-quality information in acceptable time with the help of a proxy server, and it exploits the content of web pages to measure their quality.


A. Web Usage Mining (WUM)
The data obtained from web server access logs, browser logs, proxy server logs, user profiles, registration data, user queries, and user sessions or transactions is called secondary data. Secondary data are mined through web usage mining. Usage data captures the identity or origin of web users along with their browsing activities at a web site. Web usage mining applies data mining techniques to discover important usage patterns from web data in order to understand and better serve the needs of web-based applications.
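As a concrete illustration of the idea, web usage mining typically starts from access-log records. The following is a minimal Python sketch, using invented log lines in Common Log Format (not data from this paper), of extracting one simple usage pattern: per-page visit counts.

```python
import re
from collections import Counter

# Common Log Format: host, identity, user, [timestamp], "request", status.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" (?P<status>\d+)'
)

def page_visit_counts(log_lines):
    """Count successful (status 200) requests per path - a simple usage pattern."""
    counts = Counter()
    for line in log_lines:
        m = LOG_PATTERN.match(line)
        if m and m.group("status") == "200":
            counts[m.group("path")] += 1
    return counts

# Invented example log entries, for illustration only.
sample_log = [
    '10.0.0.1 - - [01/Jan/2024:10:00:00 +0000] "GET /index.html HTTP/1.1" 200',
    '10.0.0.2 - - [01/Jan/2024:10:00:05 +0000] "GET /about.html HTTP/1.1" 200',
    '10.0.0.1 - - [01/Jan/2024:10:00:09 +0000] "GET /index.html HTTP/1.1" 200',
]
print(page_visit_counts(sample_log))  # Counter({'/index.html': 2, '/about.html': 1})
```

Real usage-mining systems would go on to group requests into sessions per host and mine sequential patterns, but the same log-parsing step is the starting point.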

B. Web Structure Mining (WSM)
Web structure mining discovers the structural summary of a web site and its underlying web pages. Through web structure mining, the link structure of hyperlinked resources is discovered. This model is used to classify and compare web pages or to integrate different web pages. Web structure mining is carried out in one of the following ways: 1. Extracting patterns from hyperlinks in the web: a hyperlink is a structural component that connects a web page to another web page at a different location. 2. Mining the document structure: analysis of the tree-like structure of pages to clarify HTML or XML tag usage.
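The first approach above, extracting the hyperlink structure of a page, can be sketched with Python's standard `html.parser` module. The HTML snippet is a made-up example, not from the paper.

```python
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    """Collect the href targets of all <a> tags - the page's outgoing link structure."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

html = ('<html><body>'
        '<a href="http://a.example/">A</a> '
        '<a href="http://b.example/">B</a>'
        '</body></html>')
parser = LinkExtractor()
parser.feed(html)
print(parser.links)  # ['http://a.example/', 'http://b.example/']
```

Collecting such link lists over many pages yields the web graph on which the connectivity-based ranking algorithms discussed later operate.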

C. Web Content Mining (WCM)
Web content mining extracts useful information such as text, images, audio, video, and records from the contents of web documents. It mines supplied web documents as well as the result pages produced by a search engine. There are two basic approaches to content mining: 1. Agent-based approach: relies on searching for relevant information using the characteristics of a particular domain to interpret and organize the collected information. 2. Database-based approach: used to retrieve semi-structured data from the web.

Web page Ranking Algorithms
Web page ranking algorithms are categorized into two parts on the basis of web links and web content.
1. Content-based algorithms: these algorithms return all web pages whose content matches the words of the user query. Vector space [2], TF-IDF [3] and BM25 [4] are examples. They are used for searching structured pages/documents within digital libraries rather than unstructured web pages. 2. Connectivity-based algorithms: these algorithms work on the basis of the links between web pages, i.e. the importance and relevancy of web pages is computed from the links. They are further categorized into two major parts: Query-independent - PageRank [5], HostRank [6] and DistanceRank [7]; these algorithms use the entire web graph and compute the score of web pages offline. Query-dependent - the HITS [8] algorithm creates a query-specific graph online and thereafter computes the rank of the web pages.
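The query-independent, connectivity-based idea can be illustrated with a toy power-iteration PageRank. The four-page graph below is invented for the example; the damping factor d = 0.85 is the value commonly used in the PageRank literature.

```python
def pagerank(graph, d=0.85, iters=50):
    """graph: {page: [outlinked pages]}; returns {page: score}.

    Each iteration spreads a page's current rank over its outlinks,
    with a (1 - d) teleportation share given to every page.
    """
    n = len(graph)
    rank = {p: 1.0 / n for p in graph}
    for _ in range(iters):
        new = {p: (1 - d) / n for p in graph}
        for p, outlinks in graph.items():
            if outlinks:
                share = rank[p] / len(outlinks)
                for q in outlinks:
                    new[q] += d * share
            else:  # dangling page: spread its rank evenly over all pages
                for q in graph:
                    new[q] += d * rank[p] / n
        rank = new
    return rank

# Invented four-page web graph: C is pointed to by A, B and D.
graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
scores = pagerank(graph)
print(max(scores, key=scores.get))  # C collects the most link weight
```

Because the graph, not the query, determines the scores, they can be computed offline once and reused for every query, which is exactly the trade-off the query-independent family makes.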

RELATED WORKS
Wenpu Xing and Ali Ghorbani modified the original PageRank algorithm into the Weighted Page Rank (WPR) algorithm [14], in which the rank score is decided by the popularity of the pages. The rank of the pages is computed at indexing time, assigning higher rank values to more popular pages. Every outlinked page is given a rank value based on its popularity, which is determined by the number of its in-links and out-links.

The Weight Links Rank (WLRank) algorithm [15], proposed by Ricardo Baeza-Yates and Emilio Davis, is a modification of the standard PageRank algorithm. It assigns weight to web links based on three attributes: relative position in the page, the tag in which the link is contained, and the length of the anchor text. Relative position proved not very effective, indicating that the logical position does not always match the physical position.

HITS [16] is the oldest formal page ranking algorithm. It divides pages into two categories: authorities, pages that are pointed to by many hyperlinks, and hubs, pages that point to many hyperlinks. It is primarily a link-based algorithm, in which a web page is also assessed by analyzing its textual content with respect to a given query string. Probabilistic HITS (PHITS) is a modification of HITS in which a weight value is assigned to every link depending on the query terms and the endpoints of the link [17]. PHITS provides a probabilistic explanation of the term-document relationship.

The TagRank (TR) algorithm [18] is a web content mining algorithm for page ranking. It is a comparison-based approach built on social comments, which calculates the heat of tags using the time factor of the new data-source tag and the commenting behavior of web users. In the TimeRank (TIR) algorithm [19], the default rank of a web page is computed on the basis of the visiting time of the page; visiting time is considered a factor that shows the degree of importance to the users.
This algorithm utilizes the time factor to increase the accuracy of web page ranking.

The EigenRumor (ER) algorithm [20] is proposed for ranking blogs. The rank scores of blog entries as computed by the PageRank algorithm are often very low, so blog entries cannot be given rank scores according to their importance; the EigenRumor algorithm was proposed to resolve this limitation.

The relation-based algorithm [21], regarded as the most accurate page ranking algorithm among those using web content mining, proposes a relation-based page rank algorithm for semantic web search engines that depends on information extracted from user queries and annotated resources. Query Dependent Page Ranking (QDR) [22] is a powerful semantic search approach that takes keywords into account and returns a page only if both keywords are present within the page and are related to the associated concept as described in the relational note associated with each page.

The Distance Ranking Algorithm (DRA) [23], proposed by Ali Mohammad ZarehBidoki and Nasser Yazdani, is an intelligent ranking algorithm based on reinforcement learning, in which the distance between pages is considered a punishment factor. Ranking is done on the basis of the shortest logarithmic distance between two pages.

A critical look at the available literature reveals that the following issues need to be addressed: 1. The need to identify less popular and lowly ranked but important pages. 2. The delay perceived by users in responses to their requests over the web. 3. The need to minimize the problem of information overkill.

With the exponential growth of the WWW, coupled with the delay perceived by users, it becomes imperative to prefetch the information sought by a particular group of users.

PROPOSED FRAMEWORK FOR KNOWLEDGE BASED WEB PAGE RANKING
In this work, a framework for knowledge-based web page ranking, as shown in Fig. 2, is proposed. It computes the relevancy of a web page in response to a user query by exploiting a proxy server that sits between the user and the search engine. The proxy server is a potential tool that can be employed to intercept all requests to the search engine and determine whether it can fulfill them by itself; only if it cannot does it forward the request to the search engine.

Components of Knowledge Based Web Page Ranking Using Web Mining
The proposed framework is composed of the following components.

A.

B. Search Engine - A search engine is the popular term for an information retrieval (IR) system. It allows users to submit a query consisting of words/phrases and retrieves a list of references that contain those keywords/phrases.

C. Proxy Server - Due to the exponential growth of the WWW, a large number of users interact with servers through millions of interconnected networks, leading to a significant increase in internet traffic. The proxy server is a potential tool that can be employed to intercept all requests to the search engine and determine whether it can fulfill them by itself; only if it cannot does it forward the request to the search engine. In fact, proxy servers can be employed for three main purposes: 1. Increasing the relevancy of information. 2. Reducing perceived latency. 3. Minimizing the problem of information overkill.

Mapper - The mapper looks the query up in the query log. If the query is found, it invokes the web page extractor; otherwise it consults the search engine for the web pages.

Query Log - The query log is a data structure that stores successful queries fired over a period of time. Each query points to its resulting web pages stored in the database.

Query Log Updater - On receiving a signal from the web page analyzer, the query log updater updates the query log with the successful query.

Knowledge Base - Stores the ranked web pages pointed to by the queries in the query log.

Web Page Analyzer - Computes the relevance of a web page on the basis of its contents with respect to the user query. It follows these steps. STEP 1: Extract web pages from the buffer for a particular query and store them into the Web Page Repository (WPR). The structure of the WPR is given below.
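The control flow of the mapper, query log, and knowledge base described above can be sketched as follows. All names here, including the `fetch_from_search_engine` fallback, are illustrative assumptions rather than the paper's actual implementation.

```python
query_log = {}        # query -> list of page ids held in the knowledge base
knowledge_base = {}   # page id -> stored (ranked) page record

def fetch_from_search_engine(query):
    """Placeholder for forwarding the request to the real search engine."""
    return [f"result-for-{query}"]

def handle_query(query):
    # Mapper: a query-log hit is served directly from the knowledge base.
    if query in query_log:
        return [knowledge_base[pid] for pid in query_log[query]]
    # Miss: consult the search engine, then store the results
    # (the role of the query log updater and knowledge base).
    pages = fetch_from_search_engine(query)
    page_ids = []
    for i, page in enumerate(pages):
        pid = f"{query}:{i}"
        knowledge_base[pid] = page
        page_ids.append(pid)
    query_log[query] = page_ids
    return pages

first = handle_query("human survival in society")   # miss -> search engine
second = handle_query("human survival in society")  # hit  -> knowledge base
print(first == second)  # True: repeated queries are answered by the proxy
```

The second call never reaches the search engine, which is how the framework aims to reduce perceived latency for recurring queries.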

Freq_Ti = Σ_{j=1}^{m} Freq_TKi(j)
where Freq_Ti is the total frequency of all m query keywords in the title of the i-th web page in the WPR.

Freq_Hi = Σ_{j=1}^{m} Freq_HKi(j)
where Freq_Hi is the total frequency of all m query keywords in the headings of the i-th web page in the WPR.

Freq_Pi = Σ_{j=1}^{m} Freq_PKi(j)
where Freq_Pi is the total frequency of all m query keywords in the paragraphs of the i-th web page in the WPR. The structure of the total frequency of query keywords (FQK) is given below.

EXPERIMENTAL RESULTS
A user query, "human survival in society", was fired on the Google search engine, and the 10 web pages obtained from it are listed in Table 1. The proposed framework was tested on these 10 web pages. The keywords of the given query (QKL) were used to obtain the data about title, heading and paragraph words from the pages listed in Table 1; the data obtained are provided in Table 2 and Table 3. Expression 1 was applied to these data to obtain the total relevancy and rank of each web page, as listed in Table 4. It may be noted that pages with high relevancy have been ranked high, where rank 1 > 2 > 3 .... The same set of web pages was given to different experts to compute the rank of each page manually, and the resulting ranks were compared with the ranks obtained from the proposed work, as shown in Table 5.

Table 5 - Comparison between system ranking and manual ranking of the web pages

Performance evaluation of the proposed mechanism is done on the basis of the precision of the downloaded pages, as given below.

Precision = r / n
where r is the number of relevant documents and n is the total number of documents. A comparison between the ranks obtained from a standard search engine (Google), the manual ranks from experts, and the ranks provided by the proposed mechanism is given in Table 5.
Precision for Google = 3/10 = 0.3. Precision for the proposed mechanism = 6/10 = 0.6. Hence the performance of the proposed method is higher than that of the existing approach.
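The per-page frequency counts and the precision measure used above can be sketched as follows. The page texts are invented toy data, and the simple sum of title, heading and paragraph frequencies stands in for the paper's Expression 1, whose exact form is not reproduced here.

```python
keywords = ["human", "survival", "society"]

# Invented title/heading/paragraph text for three toy pages.
pages = {
    "P1": {"title": "human survival", "heading": "survival in society",
           "paragraph": "human society depends on survival skills"},
    "P2": {"title": "cooking tips", "heading": "kitchen basics",
           "paragraph": "how to cook rice"},
    "P3": {"title": "society today", "heading": "human behavior",
           "paragraph": "survival of norms in society"},
}

def keyword_freq(text, kws):
    """Total occurrences of the query keywords in a piece of text."""
    words = text.lower().split()
    return sum(words.count(k) for k in kws)

def relevancy(page):
    """Freq_T + Freq_H + Freq_P - a simple stand-in for Expression 1."""
    return sum(keyword_freq(page[part], keywords)
               for part in ("title", "heading", "paragraph"))

ranked = sorted(pages, key=lambda p: relevancy(pages[p]), reverse=True)
print(ranked)  # ['P1', 'P3', 'P2'] - P1 mentions the keywords most often

def precision(r, n):
    """Precision = r / n, as defined in the evaluation above."""
    return r / n

print(precision(3, 10), precision(6, 10))  # 0.3 0.6 (Google vs. proposed)
```

Weighting the title, heading and paragraph counts differently, as a full implementation presumably would, only changes the `relevancy` function; the ranking and precision machinery stays the same.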

CONCLUSION
Standard search engines return a large number of web pages in response to a user's query, while the user always seeks relevant web pages, and in a short time. Page ranking mechanisms play an important role in this direction, as the search engine can choose the best-ranked documents for the user. In this paper we proposed a framework that computes the rank of a web page based on a knowledge base maintained at a proxy server. The performance of this mechanism was compared with the performance of a standard search engine and with ranks obtained manually from experts; the comparison suggests that the results obtained are better than the performance of the existing search engine.