Location Based Optimal Package Selection in Multi-Cloud

Cloud computing is a ubiquitous platform that can be used to access services and resources. The requirements for accessing a cloud resource are minimal; however, the users of a cloud platform are initially required to configure the type of services they need to access in the cloud. Even though a cloud platform is elastic, frequent upscaling is costly. This paper presents an effective technique that can be used to automatically identify a user's requirements and to allocate appropriate packages depending on those requirements. Usage logs of the client are grouped geographically and group-based requirements are identified. This helps to determine the resource requirement for each group, which addresses the problem of underutilization or overutilization of resources. These requirements are passed to Particle Swarm Optimization (PSO), along with the packages offered by multiple cloud operators in the same region, to identify the best package for the current requirement. Experiments conducted on this architecture show that the system works effectively in minimal time and with a low QoS difference.


II. RELATED WORKS
Cloud computing is one of the sought after techniques due to its elastic nature. The major requirement for a cloud computing environment is to provide the appropriate requirement specifications such that the appropriate resource is allocated for the user. Several contributions exist in literature to identify the most optimal resource specifications for the current requirement. This section presents some of the most recent contributions towards cloud resource provisioning.
A summary of the best practices for cloud optimization was presented by Ferry et al. in [10]. This work stressed the need for a model-driven optimization architecture for multi-cloud systems. A behaviour framework for automatic resource allocation in the cloud (ROAR) was proposed by Sun et al. in [11]. This technique not only performs resource allocation, it also provides techniques for optimizing the resource allocation decisions. ROAR is defined as a domain-specific language that can be used to define the configurations of web applications. This technique is presented as a cost-optimal resource configuration system. Other model-based cloud provisioning systems include [12], [13]. Draheim et al. [12] proposed a real-time approach that models the user's behaviour to provision resources effectively. A Load Testing Automation Framework (LTAF) was proposed by Wang et al. in [8]. This technique concentrates on modelling the user's behaviour to test the load in each of the workflows and provision resources accordingly.
An automatic resource allocation technique that focuses on the dependability and security aspects of services was presented by Marrone et al. in [14]. This technique also uses a model-driven principle to support cloud brokers in identifying the optimal configurations. It uses UML and Bayesian Networks to perform the optimization process. Energy-efficient resource allocation is currently a major requirement due to the increased usage of resources. Techniques that focus on attaining energy efficiency were proposed by Beloglazov et al. and Hung et al. in [15], [16]. A model-based and energy-efficient approach for resource allocation was proposed by Dougherty et al. in [17]. An energy-efficient approach that incorporates energy utility as a major parameter of its allocation algorithm was presented by Gupta et al. in [18]. A reinforcement learning based dynamic resource provisioning technique for the cloud was presented by Bahrpeyma et al. in [19]. The major advantage of this technique is that reinforcement learning enables dynamic resource provisioning, which supports elasticity in a cloud environment. Its ability to deal with the uncertainty of the cloud environment is another major advantage. Several approaches in the literature deal with the uncertainty in a cloud environment to perform effective resource provisioning. They include a response-time-based provisioning strategy proposed by Islam et al. in [20], a model predictive control approach used to improve efficiency, proposed by Zhang et al. in [21], and an anticipatory model proposed by Huang et al. in [22]. A dynamic resource provisioning system operating in a multi-agent architecture was proposed by Ayyoub et al. in [23]. This technique concentrates mainly on minimizing the cost and the time of operations for the customers.
A technique concentrating on scientific jobs in the cloud was proposed by Shi et al. in [24]. This technique proposes resource provisioning and task scheduling mechanisms to effectively perform scientific workflows in a cloud environment. An optimal resource provisioning technique that concentrates on software, providing SaaS, was presented by Li et al. in [25]. This technique minimizes the payment for VMs and maximizes the provider's profit by accommodating more users of the software service. Several such techniques exist in the literature to perform resource provisioning. However, it is to be noted that all these approaches provision resources according to the user's requirements, where the users are required to define their own requirements. Automated identification of user requirements is not considered in any of the discussed approaches. A technique to provision resources based on the user's requirements was presented by Madhumathi et al. in [33]. This technique utilizes Ant Colony Optimization, a metaheuristic technique, for the package selection process. The major focus of this approach is to avoid vendor lock-in in order to provide effective service to academic institutions.

III. GEOGRAPHIC LOCATION BASED OPTIMAL PACKAGE SELECTION IN MULTI-CLOUD
Automatic identification of a user's requirements is a feasible process; however, it requires base data about the user and the usage scenario. This refers to the number and type of users for whom the service is provided, the geographic locations of access and the level of access in each of the locations. First-time users will not be aware of these details initially, hence these are obtained as fuzzy inputs directly from the user. However, for users migrating to cloud environments, these details can be mined from their usage/web logs. This concept serves as the base for this paper. This paper proposes an architecture (Fig. 1) to effectively perform automatic package selection in a multi-cloud environment.
The proposed architecture utilizes the usage logs of a user to identify their requirements. The process is carried out in three broad phases. The first phase deals with identifying the usage levels on the basis of geographical locations and grouping them. The second phase deals with identifying the SMI parameters for each of these groups, where the SMI parameters correspond to the usage requirements for the group. The third phase deals with selecting optimal packages for each of these groups by considering a multi-cloud environment such that the requirements are matched to the maximum extent.

A. Geographic Location Based Usage Level Identification
Resource utilization levels corresponding to a user might actually correspond to their aggregated access levels. The resource has a high probability of being used over several geographic locations, and each of these locations tends to have varied demands. When a single resource is used for shared access, the capability of the resource is determined by the highest access requirement. Whenever that peak demand is absent, the resource is underutilized. The proposed architecture addresses this issue by grouping the regions according to their level of access, then determining the resource requirement for each group. Though this scheme leads to finer granularity, appropriate resource allocations can be performed and the probability of up-scaling can be reduced to a large extent.
The initial process is to divide the dataset on the basis of IP address. IP-based grouping aggregates the transmissions originating from the same IP. The groups are then sorted by IP, so that groups with close IP addresses occur consecutively. The usage level, in terms of the level of access, is identified for each group. This phase is followed by the aggregation phase, which analyzes the usage levels of each group and aggregates geographically close groups whose usage levels are less than the base threshold (thresh). The base threshold is a user-defined parameter that determines the maximum usage level of the cloud resources that can be provided to each of the groups. This process finally yields the defined number of groups, each of which exhibits a usage requirement defined by the user.
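The grouping and aggregation steps can be illustrated with a minimal Python sketch. The function names `group_by_ip` and `aggregate` are hypothetical, the usage level is assumed to be the request count per IP, and the merge rule (join consecutive groups while the combined usage stays within `thresh`) is one possible reading of the aggregation phase rather than the authors' exact procedure.

```python
import ipaddress
from collections import Counter


def group_by_ip(ips):
    """Count requests per IP and sort the groups by numeric IP value,
    so that numerically close addresses end up next to each other."""
    counts = Counter(ips)
    return sorted(counts.items(),
                  key=lambda kv: int(ipaddress.ip_address(kv[0])))


def aggregate(groups, thresh):
    """Merge consecutive IP groups while the merged usage stays within thresh.

    Assumption: a group that exceeds thresh on its own is kept as a
    separate group; no splitting is attempted in this sketch.
    """
    merged = []
    current_ips, current_usage = [], 0
    for ip, usage in groups:
        # Close the running group if adding this IP would exceed the limit.
        if current_ips and current_usage + usage > thresh:
            merged.append((current_ips, current_usage))
            current_ips, current_usage = [], 0
        current_ips.append(ip)
        current_usage += usage
    if current_ips:
        merged.append((current_ips, current_usage))
    return merged
```

Sorting on the integer value of the address, rather than on the string, avoids placing `10.0.0.1` before `9.9.9.9`. Numeric IP proximity is only a proxy for geographic proximity, which is the same simplification the text makes.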

B. Cluster based SMI Parameters Identification
The base threshold defined in the previous phase provides a rough estimate for the aggregation process. However, in order to allocate cloud resources appropriately, the Quality of Service (QoS) parameters must be identified from the web usage logs. The quality parameters considered in the current architecture are presented in Table I. The QoS parameters are divided into two broad sections: parameters obtained from web logs and parameters obtained from the user. Although many QoS parameters can be obtained from the web logs, not all requirements can be directly obtained from them. Parameters such as security, portability and reliability need to be explicitly obtained from the user. Five parameters are selected for user input in this paper. However, the architecture is flexible and further parameters can be added to the current list depending on the requirements. The process of identifying quality parameters from web logs was proposed in a previous contribution by the authors. Fuzzified inputs are obtained from the users for the quality parameters categorized under user input. The inputs take five levels, with 1 indicating the lowest and 5 the highest.

TABLE I. QoS Parameters Considered for Evaluation. Columns: "Obtained From", parameter name; recoverable entry: Bandwidth (Bw).

C. SMI Ranking using AHP (Analytical Hierarchy Process)
The next phase deals with ranking the quality parameters to identify the weight associated with each parameter. Parameters usually cannot be ranked directly, as several issues are associated with doing so. Further, as the parameter set grows, the difficulty of the ranking process also increases. This paper uses the Analytical Hierarchy Process (AHP) to determine the ranks.
AHP [26], [27] is a collaborative technique used to make complex decisions. The decisions are usually made on the basis of the user's preferences. The application of AHP can be extended to several areas. In general, AHP can be used in any situation that requires a user to make decisions on the basis of several dependent or independent attributes. There exist two methods to identify weights using AHP: (i) pairwise comparison and (ii) direct user-assigned weights.

1) Pair wise Comparison:
The pairwise comparison technique considers each pair of attributes and ranks them with respect to each other. In this paper, quality parameters such as availability, reliability, security, latency and so on are used as attributes. A square matrix is created, representing each attribute in a row and in a column. Each intersection is marked with the rank given to the row element when compared with the column element. Since the diagonal entries compare an attribute with itself, no comparison can be performed there, and the diagonal positions are assigned the value 1. This technique is useful if ranking needs to be performed on a large number of attributes, or if the user does not possess detailed knowledge about the QoS requirements of the current scenario. Since the process of cloud package selection faces both these issues, AHP can be considered a strong candidate for ranking the QoS parameters considered for cloud resource allocation. Comparisons are made on a 5-point scale and the comparison matrix depicting the priority set P is shown below.

$$
P = \begin{bmatrix}
S_{11} & S_{12} & \cdots & S_{1n} \\
S_{21} & S_{22} & \cdots & S_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
S_{n1} & S_{n2} & \cdots & S_{nn}
\end{bmatrix},
\qquad S_{xy} = \frac{W_x}{W_y}, \quad S_{xx} = 1 \tag{1}
$$

where $S_{xy}$ represents the comparison score of attribute $x$ when compared with attribute $y$, and $W_x / W_y$ compares attribute $x$ against $y$. An attribute compared to itself returns a value of 1. These values are then integrated to provide the final attribute weights ($WA_i$).
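The text does not state how the comparison scores are integrated into the final weights. As an illustration only, the sketch below uses a common AHP approximation: normalize each column of the comparison matrix to sum to 1, then average each row. The function name `ahp_weights` and the example weights are assumptions, not taken from the paper.

```python
def ahp_weights(matrix):
    """Approximate AHP weights from a square pairwise comparison matrix.

    Assumed integration rule: column normalization followed by row
    averaging. matrix[x][y] holds the score of attribute x against y.
    """
    n = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(n)) for c in range(n)]
    normalized = [[matrix[r][c] / col_sums[c] for c in range(n)]
                  for r in range(n)]
    return [sum(row) / n for row in normalized]


# Hypothetical, perfectly consistent example: scores built as W_x / W_y
# from known weights, so the recovered weights should match them.
true_weights = [0.5, 0.3, 0.2]
comparison = [[wx / wy for wy in true_weights] for wx in true_weights]
weights = ahp_weights(comparison)
```

For a perfectly consistent matrix this rule recovers the underlying weights; for scores elicited from users on a 5-point scale the matrix is usually only approximately consistent, and the result is an estimate.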

2) Direct User Assigned Weights
Though the user might not be able to appropriately provide ranks for all the QoS properties, several properties might be significant enough that their ranks can be directly determined by the user. Properties that are of least importance can also be ranked in this phase. Providing ranks for such attributes directly is the best choice, as eliminating properties from the pairwise ranking process reduces the number of comparisons. Hence this process is performed prior to the pairwise comparison phase.
The weights obtained from AHP are used to identify the requirement levels of a particular package, specified by either the customer or the cloud provider while proposing packages. These requirement levels serve as the basis for identifying the fitness of a package.

D. Parameter Normalization
The process of attribute ranking provides appropriate weights for the attributes. These weights, in conjunction with the actual parameter values are used to determine the quality of service. The actual values of the quality parameters can occur in various ranges. Using them directly to obtain the final quality measure is not appropriate. Hence it becomes mandatory to transform them into a fixed range for effective operations. This paper uses min-max normalization [28] to perform the conversion process.

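$$
A' = \frac{A - \min(A)}{\max(A) - \min(A)} \times (D - C) + C \tag{2}
$$

where $A'$ contains the min-max normalized data, $A$ is the actual data, and $C$ and $D$ are the pre-defined boundaries of the target range $[C, D]$.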
The final quality parameter is calculated using the weighted sum method [29], [30]. The weighted sum method calculates the quality of a requirement by aggregating the products of its weights and the corresponding normalized attribute values; that is, the quality of requirement scenario $i$ is $\sum_{j} w_j a_{ij}$ (eq. 3), where $w_j$ corresponds to the weight of attribute $j$ and $a_{ij}$ corresponds to the value of the $j$-th attribute of requirement scenario $i$.
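The normalization and weighted-sum steps can be sketched in a few lines of Python. The function names `min_max` and `weighted_sum` are illustrative, and the default target range of [0, 1] is an assumption, since the boundaries C and D used in the experiments are not stated.

```python
def min_max(values, c=0.0, d=1.0):
    """Min-max normalize values into the range [c, d]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to the lower boundary.
        return [c for _ in values]
    return [(v - lo) / (hi - lo) * (d - c) + c for v in values]


def weighted_sum(weights, attrs):
    """Weighted sum of normalized attribute values for one scenario."""
    return sum(w * a for w, a in zip(weights, attrs))


# Hypothetical example: three raw bandwidth readings and three weights.
normalized = min_max([10, 20, 30])
quality = weighted_sum([0.5, 0.3, 0.2], [1.0, 0.5, 0.0])
```

Normalizing first keeps an attribute with a large raw range, such as bandwidth, from dominating an attribute rated on the 1 to 5 scale.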

E. Optimal Package Selection in MultiCloud using PSO
The process of identifying the optimal package is performed using Particle Swarm Optimization (PSO) [31], [32]. PSO is a metaheuristic technique that optimizes a problem by iteratively improving the candidate solution. The improvement is measured in the form of a fitness function that is used to select the best solution among the list of available solutions. PSO operates by moving the particles in the search space to find the optimal solution.

1) PSO Search Space Creation and Particle representation
Particles are the operating agents that identify solutions in the search space. A search space is created using the data containing package information. Package details correspond to the values provided for each of the quality parameters. Because package assignment is location based, packages from multiple clouds are used for constructing the search space. The dimension of the search space is determined by the number of quality parameters used for analysis. Particles are represented using the SMI parameter values of the package data and the user requirement. PSO operates in three major phases. Distribution of particles forms the first phase, followed by determining the initial velocity and triggering the movement of the particles. This is followed by iteratively varying the velocity on the basis of the best solutions identified in the previous iterations and determining the convergence of the system.

2) Particle Distribution and Movement
After the creation of the search space, the particles are usually distributed at random locations in the search space. However, the current process requires finding the best solution for a particular requirement. Hence, along with the package details, the current user requirement is also added as another node in the search space. All the particles are placed on the requirement node. The process of particle movement is initiated by defining the initial velocity for the particles. The initial velocity is determined using eq. 4, where $b_{up}$ and $b_{lo}$ are the upper and lower bounds of the search space. A random velocity is assigned to each particle and the movement is triggered. After a single migration, the particles occupy a position in the search space that does not correspond to any node, because PSO operates on a continuous space. However, the current process requires discrete particle movement. Hence the particles are discretized and moved to a defined node using eq. 5, where $P_{ik}$ refers to particle $i$'s current location in dimension $k$ and $N_{jk}$ refers to the $k$-th dimension of node $N_j$.
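For orientation, a conventional form of these two steps is given below. Both expressions are assumptions: the first is the usual uniform velocity initialization of PSO, and the second snaps a particle to the nearest node under Euclidean distance. Neither is confirmed by the text as the exact form of eq. 4 or eq. 5, and the symbols $V_i$ and $j^{*}$ are introduced here for illustration.

```latex
% Assumed initial velocity: uniform over the width of the search space
V_i \sim U\bigl(-\lvert b_{up} - b_{lo}\rvert,\; \lvert b_{up} - b_{lo}\rvert\bigr)

% Assumed discretization: move particle i to the closest node N_{j*}
j^{*} = \arg\min_{j} \sqrt{\sum_{k} \bigl(P_{ik} - N_{jk}\bigr)^{2}}
```

Under this reading, a particle that lands between two packages is assigned to whichever package is closer across all quality dimensions taken together.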

3) Fitness identification and Optimal Package Selection
This process is followed by fitness identification. PSO operates by using two fitness values, namely pbest and gbest. The best solution identified by a particle so far is recorded as its pbest; each particle has its own best solution among those it has visited. The best solution for the entire search space is recorded as the gbest. All the pbest solutions are compared and the best among them is set as the gbest, hence a search space contains only one gbest. The initial iteration sets the current solution as the pbest for each particle. All the pbest values are compared and the best value is chosen as the gbest. The comparison of solutions is performed using the fitness function. The fitness function for this approach is modified to incorporate the difference between the QoS values of the requirement and the current node representing the package. This marks the end of the initial iteration. After this stage, the velocity of the particles is determined by the pbest and the gbest values identified so far. The velocity of a particle is calculated using eq. 6,

$$
V_{i,d} \leftarrow \omega V_{i,d} + \varphi_p r_p \,(P_{i,d} - X_{i,d}) + \varphi_g r_g \,(g_d - X_{i,d}) \tag{6}
$$

where $r_p$ and $r_g$ are random numbers, $P_{i,d}$ and $g_d$ are the particle best and the global best values, $X_{i,d}$ is the current particle position, $V_{i,d}$ is the particle velocity, and the parameters $\omega$, $\varphi_p$ and $\varphi_g$ are selected by the practitioner. The current particle position is then updated by the following equation.

$$
X_{i,d} \leftarrow X_{i,d} + V_{i,d} \tag{7}
$$
The process of discretized movement and determining the pbest and gbest is continued until the termination criterion is met. The termination criterion is determined by either a stagnation condition or a time lapse. After deployment, a maximum time of operation is set by the user. After this time lapse, PSO is terminated and the solution contained in gbest is taken as the optimal solution. However, in the production phase, the termination criterion is defined by the stagnation behavior. A maximum termination level (maxTerm) is set by the user. If the gbest does not change for maxTerm iterations, the process is terminated. In this paper, maxTerm is set to 1000. As the maxTerm value gets higher, the probability of stopping at a local optimum is reduced. After termination, the value contained in gbest is taken as the package best suited to the current user requirement.

Algorithm (PSO based Optimal Package Selection):
1. Identify the search space boundary using the package data and the user requirement.
2. Initialize the number of particles p.
3. For each particle i = 1…p:
   a. Initialize the particle on the user node using the parameter values of the package data and the user requirement.
   b. Initialize pbest and gbest.
   c. Initialize the velocity using equation (4).
4. Until the termination criterion is met, perform the following:
   a. For each particle i = 1…p:
      i. Generate r_p and r_g using a normal distribution.
      ii. Discretize the particle movement using equation (5).
      iii. Identify the velocity of the particle using equation (6).
      iv. Update the particle's position to P' using equation (7).
      v. If the current fitness is better than pbest, assign the current fitness to be the pbest.
      vi. If pbest is better than gbest, assign pbest as the new gbest.
5. gbest contains the optimal package for the current user requirement.

Packages used for creating the search space are obtained from multiple service providers, such that the best provider, offering the best package, is selected for each location.
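The algorithm can be sketched in Python as follows. This is an illustrative reading, not the authors' C#.NET implementation. Assumptions are marked in the comments: fitness is taken to be the Euclidean distance between a package node and the requirement (lower is better), discretization snaps to the nearest node, `r_p` and `r_g` are drawn uniformly from [0, 1) as in conventional PSO, and the default `max_term` is kept small for a quick run. The names `select_package` and `nearest_node` are hypothetical.

```python
import math
import random


def nearest_node(position, nodes):
    """Index of the package node closest to a continuous position
    (assumed Euclidean nearest-node discretization)."""
    return min(range(len(nodes)), key=lambda j: math.dist(position, nodes[j]))


def select_package(nodes, requirement, n_particles=10, max_term=50,
                   omega=0.7, phi_p=1.5, phi_g=1.5, seed=0):
    """Return the index of the package selected for the requirement."""
    rng = random.Random(seed)
    dims = len(requirement)
    # Step 1: search space boundary from package data and the requirement.
    points = list(nodes) + [requirement]
    b_lo = [min(p[k] for p in points) for k in range(dims)]
    b_up = [max(p[k] for p in points) for k in range(dims)]

    def fitness(j):
        # Assumed fitness: QoS difference as Euclidean distance.
        return math.dist(nodes[j], requirement)

    # Step 3: particles start on the requirement node, receive a random
    # velocity, migrate once and are snapped to a package node.
    pos, vel, p_best = [], [], []
    for _ in range(n_particles):
        v = [rng.uniform(b_lo[k] - b_up[k], b_up[k] - b_lo[k])
             for k in range(dims)]
        j = nearest_node([requirement[k] + v[k] for k in range(dims)], nodes)
        vel.append(v)
        pos.append(list(nodes[j]))
        p_best.append(j)
    g_best = min(p_best, key=fitness)

    # Step 4: iterate until gbest has not changed for max_term iterations.
    stagnant = 0
    while stagnant < max_term:
        improved = False
        for i in range(n_particles):
            for k in range(dims):
                r_p, r_g = rng.random(), rng.random()
                vel[i][k] = (omega * vel[i][k]
                             + phi_p * r_p * (nodes[p_best[i]][k] - pos[i][k])
                             + phi_g * r_g * (nodes[g_best][k] - pos[i][k]))
                # Keep the particle inside the search space boundary.
                pos[i][k] = min(max(pos[i][k] + vel[i][k], b_lo[k]), b_up[k])
            j = nearest_node(pos[i], nodes)
            pos[i] = list(nodes[j])
            if fitness(j) < fitness(p_best[i]):
                p_best[i] = j
            if fitness(p_best[i]) < fitness(g_best):
                g_best = p_best[i]
                improved = True
        stagnant = 0 if improved else stagnant + 1
    return g_best
```

A call such as `select_package([[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]], [0.6, 0.6])` returns the index of the chosen package in the list. Because PSO is a metaheuristic, the result is a near-optimal choice and is not guaranteed to be the closest package for every seed.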

A. Workflow and Dataset Analysis
The workflow corresponding to location based optimal package selection in multi-cloud is performed in two phases. The first phase deals with grouping requirements based on location and identifying quality parameters for each of the groups independently, and the second phase deals with identifying the best package for the current requirement in a multi-cloud environment. An access log dataset is used for the identification process. A sample screenshot of the access log is shown in the accompanying figure. The access log is made up of seven basic components, known as the Common Log Format (CLF), as shown in Table II; the last of these is the size of the object returned to the client. Client ID and User ID have been removed from the file to provide anonymity. The access log dataset used for this contribution is made up of 4.4 million transmission records, spanning approximately 2.5 years.
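As an illustration of how the seven CLF components can be read from a log line, the following sketch uses a regular expression. The field names (`host`, `ident`, `user`, `time`, `request`, `status`, `size`) and the function name `parse_clf` are chosen here for readability and are not taken from the paper; the sample line is a generic example, not a record from the dataset.

```python
import re

# Common Log Format: host ident user [time] "request" status size
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_clf(line):
    """Parse one CLF line into a dict, or return None if it does not match."""
    match = CLF_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A size of "-" means no content was returned.
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


sample = ('127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] '
          '"GET /index.html HTTP/1.0" 200 2326')
record = parse_clf(sample)
```

In an anonymized log, the `ident` and `user` fields appear as `-`, which matches the removal of Client ID and User ID described above. The `host` field is the value used for IP-based grouping.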
B. Implementation Details
The two phases correspond to very distinct requirements. The first phase is implemented in Python. The access log is provided as input to the Python code. The transactions are read and IP address based grouping is performed. IP-based sorting is carried out to bring entries with similar locations together. The value for the maximum density threshold (thresh) is set to 5000. Aggregation and splitting are carried out depending on this value and the final location-based groups are created. Each of these groups is analyzed and its quality parameters are identified. User-based quality parameters are obtained as fuzzy inputs, and the final group-based requirements are identified and written to property files.
The package selection phase is performed next. This process is implemented using C#.NET. Particle Swarm Optimization is performed using the property files created in the previous phase. These files provide the user requirements on the basis of location. The search space is constructed from the package details of multiple clouds. Package details corresponding to 20 distinct requirements were used for building the search space. PSO is executed by adding the requirements for each cluster to the search space and placing all the particles on the requirement node. Particle movement is triggered, and the gbest obtained after termination is considered to be the best package for the current requirement.

C. Results and Discussion
The time taken for each of the phases individually and the aggregated time are presented in Fig. 3 and 4. Scalability of the proposed approach is assessed by varying the size of the access log from 0.5 million records to 4.4 million records. The aggregated time for grouping and identifying the best packages in a multi-cloud environment is presented in Fig. 6. It can be observed that the line representing time is almost constant, requiring approximately 12 ms to 14 ms. This indicates the high scalability of the proposed technique. The difference between the required QoS and the provided QoS is presented in Fig. 7. It can be observed that none of the transactions exhibited a perfect match with zero QoS difference. This is due to the predefined package structures provided by the cloud service providers, because of which a user's requirements do not exactly match the defined packages. Hence a near-optimal algorithm that identifies the closest possible solution is sufficient for the problem. The difference in quality is calculated using eq. 8.
It can be observed that the differences in the quality parameters are quite low and mostly in the positive direction, meaning that the provided quality matches the requirement as closely as possible and, in most instances, the allocated resource exceeds the requirement. It can be concluded that the requirement of a user is satisfied to the maximum extent for approximately 85% of the requirements.

IV. CONCLUSION
Cloud service selection is one of the major components of effectively utilizing a cloud architecture. However, this is the only component that has not been automated. The reason for not automating this component is that the user's parameters play a major role in this process. Hence the cloud providers have left the process of decision making to the users themselves. However, not all users are technically sound enough to identify and provide accurate inputs during the package selection process. Further, the packages themselves are predefined, which complicates the selection process further. This paper presents an automated package selection system for multi-clouds that first identifies the user's requirements from usage logs and then identifies the package that matches the user's requirements to the maximum extent. The user's web log is obtained as input, followed by geographic location based clustering and cluster based requirement identification. PSO is used for identifying the best package due to its metaheuristic nature and because the problem requires a near-optimal solution rather than the best solution. Experiments revealed that the proposed approach exhibits acceptable time limits, with very low time requirements (<1 sec) for both the grouping and the package selection process. The quality difference between the requirements and the assigned services was also observed to be low. This makes the proposed approach well suited for automated service selection. Future directions include incorporating prediction and game theoretic approaches to identify the requirements of the user and provide effective packages in order to avoid customer churn or package shifting scenarios. Future directions also include incorporating the optimization module into the architecture such that the system recommends packages according to the resource usage levels observed in the system.