A CLUSTERING BASED WEB PREFETCHING IN HIGH TRAFFIC ENVIRONMENT
The continued increase in demand for objects on the Internet causes high web traffic and, consequently, slow user response times, which is one of the major bottlenecks in the network world. Increasing bandwidth is a possible solution to the problem, but it comes at increased economic cost. An alternative solution is web prefetching. Web prefetching is the process of predicting and fetching web pages in advance by a proxy server before a user sends a request. Prefetching is performed during server idle time. Most of the literature based on the classical prefetch algorithm assumes that the server idle time is large enough to prefetch all of the users' predicted requests, which is not true in real-life situations. This research aims to improve web prefetching by developing a prefetching technique that remains effective in a high traffic environment, when the server idle time is very low. Log files were collected and preprocessed for several client groups within a domain. The preprocessed log files were used to create a web navigation graph, which shows the transitions from one web page to another. Support and confidence thresholds were used to remove web pages with values below the thresholds. Several clusters were formed within each client group. When the prefetch time is predicted to be too small for prefetching, all the clusters formed from the various domains are used to create a prioritized cluster based on the requests of several users. The model was evaluated based on hit rate, byte rate, precision, accuracy of prediction and usefulness of prediction. The results show that the proposed WebClustering algorithm performs better than the classical prefetch technique when the server idle time is small, and behaves the same as the classical algorithm when the server idle time becomes large enough to prefetch all users' predicted requests.
WNG:Web Navigation Graph
BFS:Breadth First Search
LRU:Least Recently Used
LFU:Least Frequently Used
ISP:Internet Service Provider
HTTP: Hypertext Transfer Protocol
HTML:Hypertext Markup Language
ART:Adaptive Resonance Theory
NASA:National Aeronautics and Space Administration
IPGDSF#:Intelligent Predictive Greedy Dual Size Frequency
ANN:Artificial Neural Network
PSO:Particle Swarm Optimization
XML:Extensible Markup Language
SVM:Support Vector Machine
PPM:Prediction by Partial Matching
URL:Uniform Resource Locator
FIFO:First In First Out
GIF:Graphics Interchange Format
JPEG: Joint Photographic Experts Group
CPU:Central Processing Unit
CLR:Common Language Runtime
1.1 Background of Study
The web is a collection of text documents and other resources, linked by hyperlinks and Uniform Resource Locators (URLs), usually accessed by web browsers from web servers. The web started as a simple information sharing system and has now grown into a rich collection of dynamic and interactive services. The tremendous growth of the web has resulted in high demand for bandwidth and in delays when fetching user requests (Neha, 2013). Users sometimes experience unpredictable delays while retrieving web pages from the server. Increasing bandwidth is a possible solution to the problem, but it involves high economic cost. Web caching reduces the latency perceived by the user, reduces bandwidth utilization and reduces the load on the origin servers (Pallis, 2007). Latency refers to the time that elapses between sending a request and receiving the requested information.
Many latency-tolerant techniques have been developed over the years to solve this problem without necessarily increasing the bandwidth. The most notable are caching and prefetching. Web prefetching fetches and caches user requests during server idle time, which reduces the load on the origin server. To reduce the access delay experienced by users, it is advisable to predict and prefetch web objects based on user access patterns and to cache them. Studies on web prefetching are mostly based on the history of user access patterns. If the history information shows an access pattern in which URL address A is followed by B with a high probability, then B will be prefetched once A is accessed (Cheng-Zhong, 2000).
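The history-based rule described above (prefetch B once A is accessed, if B follows A with high probability) can be illustrated with a minimal sketch. It is not the algorithm of any cited work: the session data, the function names and the 0.5 probability threshold are all assumptions made for illustration, and the probability is estimated simply as the fraction of observed transitions out of a page.

```python
from collections import Counter, defaultdict


def build_transition_counts(sessions):
    """Count how often each page is immediately followed by another page."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current_page, next_page in zip(session, session[1:]):
            counts[current_page][next_page] += 1
    return counts


def predict_next(counts, current_page, min_probability=0.5):
    """Return (page, probability) pairs for pages that followed current_page
    in at least min_probability of its observed transitions, most likely first.
    The threshold value is an illustrative assumption."""
    followers = counts.get(current_page)
    if not followers:
        return []
    total = sum(followers.values())
    ranked = [(page, n / total) for page, n in followers.most_common()]
    return [(page, p) for page, p in ranked if p >= min_probability]


# Hypothetical access history: each inner list is one user's page sequence.
sessions = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "D"],
    ["B", "C"],
]
counts = build_transition_counts(sessions)
candidates = predict_next(counts, "A", min_probability=0.5)
print(candidates)
```

In this toy history, B follows A in two of three transitions, so B is the only page that passes the threshold and would be prefetched when A is requested.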
Web prefetching is the process of obtaining web pages in advance by a proxy server before a request is sent by a user. When a client requests a web object, the object may be fetched from the cache rather than by sending the request to the web server. The main factor in selecting a web prefetching algorithm is its ability to predict the web objects to be prefetched in order to reduce latency. Web prefetching exploits the spatial locality of web pages, i.e. pages that are linked to the current page will be accessed with higher probability than other pages. Web prefetching can be applied at three points in a web environment: between clients and the web server, between proxy servers and the web server, and between clients and the proxy server (Greeshma, 2012).
Web prefetching techniques are categorized into probability-based techniques and clustering-based techniques that use weight functions. In probability-based prefetching, probabilities are calculated using the history of data access. This method assumes that the request sequence follows a pattern and calculates the probability of that pattern being followed. Clustering-based prefetching methods make decisions using information about the web pages that have been fetched previously, and assume that pages that are close to the previously fetched pages are more likely to be requested in the near future (Greeshma, 2012).
Moreover, web prefetching is a research topic that has gained increasing attention in recent years. Web prefetching fetches web objects before users actually request them, and thus helps to reduce the user-perceived latency. Many studies have shown that combining caching and prefetching doubles the performance compared with caching alone (Waleed, 2012).
Web caching is a well-known strategy for improving the performance of web-based systems by keeping web objects that are likely to be used in the near future in a location closer to the user. Web caching mechanisms are implemented at three levels: client level, proxy level and origin server level. Proxy servers in particular play a key role between users and web sites in reducing the response time of user requests and saving network bandwidth. Therefore, to achieve better response time, an efficient caching approach should be built into the proxy server (Waleed, 2011).
Due to the limitation of cache space, an intelligent mechanism is required to manage the web cache content efficiently. The classical caching policies are not efficient for web caching because each considers only one of recency, frequency or size and ignores combinations of the factors that affect the efficiency of web caching. As a result, the cache hit ratio is not improved much by classical caching schemes. Even with a cache of infinite size, the hit ratio is still limited regardless of the caching scheme. This is because most people browse and explore new web pages in search of new information. To improve the cache hit ratio, web prefetching is integrated with web caching to overcome these limitations.
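The limit on hit ratio under an infinite cache can be made concrete with a small sketch. It uses the common definitions of hit ratio (hits divided by requests) and byte hit ratio (bytes served from cache divided by total bytes requested); the request trace, URLs and object sizes are hypothetical, and the cache is assumed to be empty at the start and never to evict anything.

```python
def infinite_cache_ratios(requests):
    """Replay (url, size_in_bytes) requests against a cache that never evicts.

    Returns (hit_ratio, byte_hit_ratio). The first request for any URL is
    always a miss, so repeated requests are the only possible hits.
    """
    if not requests:
        return 0.0, 0.0
    cached = set()
    hits = 0
    hit_bytes = 0
    total_bytes = 0
    for url, size in requests:
        total_bytes += size
        if url in cached:
            hits += 1
            hit_bytes += size
        else:
            cached.add(url)
    byte_hit_ratio = hit_bytes / total_bytes if total_bytes else 0.0
    return hits / len(requests), byte_hit_ratio


# Hypothetical trace in which users mostly visit pages they have not seen before.
trace = [
    ("/a", 100), ("/b", 300), ("/a", 100),
    ("/c", 500), ("/b", 300), ("/d", 700),
]
hit_ratio, byte_hit_ratio = infinite_cache_ratios(trace)
print(hit_ratio, byte_hit_ratio)
```

Only two of the six requests in this trace are repeats, so no replacement policy could raise the hit ratio above one third without fetching pages before they are first requested, which is the gap prefetching is meant to close.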
Knowing the user’s browsing history provides extra information, such as the type of user or his/her preferences. This information about the user can help to improve prediction accuracy in the prefetching process (Lenka, 2010).
1.2 Problem Statement
As the Internet continues to grow in size and popularity, web traffic and network bottlenecks have become major issues in the network world. The continued increase in demand for objects on the Internet causes severe traffic and leaves too little idle time to prefetch all the clusters generated from users' requests. Clustering-based prefetching has been explored in several ways, all of which assume that the server idle time is large enough to accommodate the prefetching. In a real scenario this is not always the case, since under high traffic the idle time may not be long enough to prefetch large volumes of data. This work therefore seeks to address this lack of consideration of high traffic volume during prefetching.
Internet users expect the web to be friendlier and more meaningful, with reduced network traffic. Every user wants a channel with high bandwidth and low traffic. To reduce web server load and access latency, and to relieve network bandwidth under heavy traffic, this work considers a web prefetching scheme that consumes little bandwidth during periods of high traffic.
1.4 Aim and Objectives
The aim of this research is to improve web prefetching by developing a prefetching technique that is effective in a high traffic environment, when the server idle time is very low.
The specific objectives are to:
a) predict user requests based on user access history.
b) determine which pages will be requested by the majority of users in the near future.
c) prioritize the prefetching based on the frequency of the server idle time.
d) evaluate the algorithm against an existing prefetching algorithm.
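Objectives (b) and (c) can be pictured with a minimal sketch of prioritized prefetching under a limited idle-time budget. This is an illustration only, not the proposed algorithm: the ranking rule (pages predicted for more users first, cheaper pages first on ties), the per-page fetch-time estimates, the URLs and the budget values are all assumptions introduced here.

```python
def prioritize_prefetch(candidates, idle_time):
    """Choose pages to prefetch within an idle-time budget (in seconds).

    candidates maps url -> (predicted_user_count, estimated_fetch_seconds).
    Pages are ranked by predicted user count, highest first, with shorter
    fetch time breaking ties; a page is selected only if it still fits in
    the remaining idle time.
    """
    ranked = sorted(candidates.items(), key=lambda item: (-item[1][0], item[1][1]))
    selected = []
    remaining = idle_time
    for url, (_users, fetch_seconds) in ranked:
        if fetch_seconds <= remaining:
            selected.append(url)
            remaining -= fetch_seconds
    return selected


# Hypothetical predictions gathered from several client groups.
candidates = {
    "/news": (5, 2.0),
    "/sports": (3, 1.5),
    "/video": (4, 4.0),
    "/weather": (1, 0.5),
}
short_idle = prioritize_prefetch(candidates, idle_time=3.0)
long_idle = prioritize_prefetch(candidates, idle_time=10.0)
print(short_idle)
print(long_idle)
```

With a 3-second budget only the most widely requested page that fits, plus one small page, is prefetched; with a 10-second budget every predicted page is fetched, which corresponds to the classical case where idle time is assumed sufficient for all predictions.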
1.5 Research Method
In order to meet the objectives of this work, the following steps will be taken in the proposed inter clustering scheme:
a) Review of existing literature in the field of study.
b) Log files of user requests will be collected using the Squid proxy server. The log files will pass through several cleaning stages to remove irrelevant information, and user identification will be created for the pages requested by users during a visit to a particular site.
c) The preprocessed log file will be used to construct a weighted Web Navigation Graph (WNG). The nodes of the graph represent the web pages, while its edges represent the movement from one web page to another. The edges are assigned weights based on the frequency of visiting a page. Support and confidence thresholds will be applied to the WNG to eliminate pages with low support and confidence values.
d) The graph will be traversed using the Breadth First Search (BFS) algorithm to form several clusters within a domain.
e) In a high traffic environment, clusters will be formed in favour of the requested web objects by setting the support and confidence values to accommodate the requested web objects from several domains. An inter-domain cluster will then be reconstructed from these clusters.
f) C# will be used to implement the algorithm.
g) The proposed technique will be compared with that of Thulase et al. (2014) based on hit ratio, byte ratio, usefulness of prediction, accuracy of prediction and precision.
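Steps (c) and (d) above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the implementation used in this work: support is taken to be the raw count of a page-to-page transition, confidence is taken to be that count divided by all transitions leaving the source page, and a cluster is taken to be the set of pages reached by BFS over the surviving edges treated as undirected. The session data, function names and threshold values are hypothetical.

```python
from collections import Counter, defaultdict, deque


def build_wng(sessions):
    """Weighted navigation graph: (u, v) -> number of observed u-to-v transitions."""
    edges = Counter()
    for session in sessions:
        for u, v in zip(session, session[1:]):
            if u != v:  # ignore page reloads
                edges[(u, v)] += 1
    return edges


def filter_edges(edges, min_support, min_confidence):
    """Keep edges that meet both thresholds (definitions assumed, see lead-in)."""
    out_totals = Counter()
    for (u, _v), weight in edges.items():
        out_totals[u] += weight
    kept = {}
    for (u, v), weight in edges.items():
        confidence = weight / out_totals[u]
        if weight >= min_support and confidence >= min_confidence:
            kept[(u, v)] = weight
    return kept


def bfs_clusters(edges):
    """Group pages into clusters by breadth-first search over undirected edges."""
    neighbours = defaultdict(set)
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    seen = set()
    clusters = []
    for start in sorted(neighbours):
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        cluster = []
        while queue:
            page = queue.popleft()
            cluster.append(page)
            for nxt in sorted(neighbours[page]):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(cluster)
    return clusters


# Hypothetical preprocessed sessions from one client group.
sessions = [
    ["/home", "/news", "/sports"],
    ["/home", "/news"],
    ["/home", "/news", "/sports"],
    ["/shop", "/cart"],
    ["/shop", "/cart"],
    ["/home", "/shop"],
]
wng = build_wng(sessions)
strong = filter_edges(wng, min_support=2, min_confidence=0.5)
clusters = bfs_clusters(strong)
print(clusters)
```

The single /home to /shop transition falls below both thresholds and is dropped, so the news pages and the shopping pages end up in separate clusters. Lowering the thresholds, as step (e) suggests for high traffic, would keep that edge and merge the two groups.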
1.6 Organization of Dissertation
The rest of the work is organized as follows: Chapter 2 reviews the literature, Chapter 3 discusses the proposed web prefetching scheme, Chapter 4 presents the results and analysis, and Chapter 5 concludes, summarizes and recommends future work.