1 Introduction

Cloud Computing is a model that provides many on-demand services. XaaS ("anything as a service") is the common terminology used for this purpose: anything can be outsourced and integrated with an existing application to improve performance or quality of service. The Cloud model comprises three main services: Infrastructure as a Service, Platform as a Service, and Software as a Service. We also have offerings such as Communication as a Service, Database as a Service, and Monitoring as a Service. This work emphasises Monitoring as a Service, which is used to improve quality of service and to meet service level agreements accurately. Apart from that, it also helps to troubleshoot, perform root cause analysis, and identify various vulnerabilities and threats in the application environment.

With each deployment model, i.e. public cloud, private cloud, and hybrid cloud, comes the need to monitor computing resources so that they are used in an optimised way. In a public cloud, resources are made available over the internet to multiple known or unknown users. A private cloud, in contrast, exposes and provides access to a limited and mostly known audience. A hybrid cloud is a combination of the above-mentioned clouds, but the monitoring aspects remain constant irrespective of the deployment model. Scaling resources up or down plays an important part; organizations often term this capacity management or capacity planning. The entire motive of the proposed system is to monitor the environment at the least cost and with the best quality of service, with minimal or no failure of computing resources.

2 Motivation to Design

With increasingly complex cloud infrastructure, it is especially important to have a stable monitoring solution applied over the cloud cluster. Moreover, existing resources are on a rented basis, i.e., a cost is already being paid for them. Resources here refers to processors, storage, or network [1]. There is a need to identify and design an architecture that can monitor the cloud infrastructure efficiently at minimal or no cost. This solution should be effective and reliable, as very little or negligible cost is associated with building and maintaining it [2].

Cloud Computing involves many activities for which monitoring is an essential task. The most important ones are:

  1. Capacity and Resource Planning and Management
  2. Security Management
  3. Data Center Management
  4. SLA Management
  5. Billing
  6. Performance Management

Traditional monitoring, or the monitoring services readily available, are mostly domain specific in nature. This is an attempt to provide a versatile, low-cost and effective solution for cloud infrastructure [3]. Using Monitoring as a Service from a cloud vendor ultimately incurs two kinds of cost. The first covers all the rented resources or platform, or the software service that is used directly [4]. The second comprises the monitoring service itself, which runs continuously over the cloud infrastructure and is billed per minute [1, 5]. In order to provide a cost-effective open-source solution that can be designed and used as per domain-specific needs, one can combine various automation tools to create a complete open-source solution. Hence monitoring has become an important aspect to consider while designing as well as maintaining cloud infrastructures.

Monitoring the cloud infrastructure and analysing the gathered statistics helps with capacity management. Resources or instances can be provisioned based on need and decommissioned when no longer required. This ultimately defines the on-demand, pay-as-you-use model [6]. Automation helps to upscale and downscale the system; no human intervention is required in the entire process, and the accuracy is high. This realises the rapid elasticity that the monitoring solution will provide. Used services can be measured, monitored, and reported [5].

3 Architecture of Monitoring System

Monitoring architecture can be broadly classified into three major steps:

  1. Data Collection
  2. Data Analysis
  3. Data Visualization

These three steps show a high-level picture of any monitoring solution. From a granular perspective, multiple operations happen within these phases. These include collecting data, transporting data, processing the data within the required processing time, and presenting the data in the most effective way [1, 7]. Apart from that, every phase must be associated with some security protocol so that security breaches can be avoided. The cloud monitoring architecture is shown in Fig. 1.

Fig. 1 Cloud monitoring architecture

The basic requirements to set up any monitoring solution are: an environment to be monitored, lightweight shippers to ship data, an on-demand parser, storage for the gathered stats, and a UI to visualize the gathered data. These are the primary requirements for any monitoring solution to be built [8]. The initial-stage design of the architecture plays a significant role. The infrastructure to be monitored needs to be measured, as all the instances will forward their statistics to a main centralized server, and the average number of hits per minute needs to be checked in order to size that server [1, 5, 7]. One can design it as a master–slave architecture, where the master comprises a universally available and efficient parser, and the slaves are the servers or devices that need to be monitored [9].

Phase one comprises data collection. Here, data can refer to files, log files, metrics, information stats of cloud services, health rules for the instances, etc. [3]. Broadly, a monitoring solution can be classified into two types based on the data: application monitoring or infrastructure monitoring. Application monitoring is the collection of application logs, the application of grok patterns to them, and storage in attribute-value form. For example, a Java application throws various exceptions; one can set up an alert if any such exception string occurs in a log, as sketched below. Infrastructure monitoring is where system-related metrics are gathered, e.g. CPU, memory, disk, processes, etc. Alerts can be added if any of the mentioned parameters breaches an already-defined threshold value [10].
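
As an illustration, the snippet below shows how a grok-style pattern could extract attribute-value pairs from a Java application log line and flag exception strings. It is a minimal Python sketch; the log format and field names are assumptions, not a prescribed standard.

```python
import re

# Grok-style pattern for a hypothetical Java log line, e.g.:
# "2023-04-01 12:00:03 ERROR OrderService - java.lang.NullPointerException: id was null"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+"
    r"(?P<level>INFO|WARN|ERROR)\s+"
    r"(?P<logger>\S+)\s+-\s+"
    r"(?P<message>.*)"
)

def parse_line(line):
    """Segregate a raw log line into attribute-value pairs."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

def should_alert(event):
    """Alert when an exception string occurs in the message."""
    return event["level"] == "ERROR" and "Exception" in event["message"]

line = "2023-04-01 12:00:03 ERROR OrderService - java.lang.NullPointerException: id was null"
event = parse_line(line)
if event and should_alert(event):
    print("ALERT:", event["message"])
```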

Overall, any event that is to be captured, monitored, or visualized can be considered part of the data set, although it depends on the design architecture what data to collect and what part of the data to extract. To gather this data one can use Beat agents [1, 11]. These agents are lightweight programs that run on the server as a service, continuously in the background, without consuming significant resources. The main function of such an agent is to push data to the centralized server or the parser; in architectures where the parser is absent, it can push directly to storage, which may be a static database or a cloud database service. One can also gather events from automation scripts, and various APIs can be used to fetch data. Collection of data can also be done with plugins or protocols such as Nagios NRPE. Ready-to-install Beat agents, mainly written in Java or the Go language, are available, so one can ship data without compatibility constraints [2, 6, 7, 12]. Transportation of events or logs is mainly taken care of by the shippers; a minimal shipper is sketched below. It is important to have an encryption mechanism if data is shipped with hops over the cluster, which helps avoid security breaches [13].
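
For concreteness, the following Python sketch plays the role of a lightweight infrastructure shipper: it samples CPU, memory and disk metrics and pushes them to a central endpoint. The endpoint URL and the 2-second interval are assumptions for illustration; a production setup would use a ready-made Beat agent and TLS-encrypted transport instead.

```python
import socket
import time

import psutil    # system metrics (third party: pip install psutil)
import requests  # HTTP transport (third party: pip install requests)

# Hypothetical centralized endpoint; in practice this would be the
# parser or a storage service, ideally reached over HTTPS.
COLLECTOR_URL = "https://monitor.example.com/ingest"

def collect_metrics():
    """Gather basic system statistics as attribute-value pairs."""
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
        "disk_percent": psutil.disk_usage("/").percent,
    }

def main():
    # Run continuously in the background, shipping one event per cycle.
    while True:
        try:
            requests.post(COLLECTOR_URL, json=collect_metrics(), timeout=5)
        except requests.RequestException:
            pass  # keep running; a real shipper would buffer and retry
        time.sleep(2)  # shipping frequency; tune per use case

if __name__ == "__main__":
    main()
```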

Phase two comprises a centralized server where all the data is accumulated and sent on to a storage device or storage service. Each event generated with the help of a Beat agent passes through a parser, where various grok patterns can be placed. Segregation of data can be done at this level, and a key-value pair format can be used to store the data. This parsing makes it easy to visualize the data on a graphical UI; one can plot various visualizations on the segregated data and analyse its behaviour [2, 14].

The centralized server designed in this phase should be capable of handling the entire network load, where network load refers to the events transferred from across the cloud infrastructure. The parser, or the number of parsers, should be sized based on the number of events to be received. One can receive any number of events at any frequency; shipping intervals mostly start from 2 s, but this is relative to the use case: if the application is critical, the interval can be kept at seconds, otherwise it can be at minute intervals [15]. The infrastructure should be designed accordingly, and the parser should be able to handle all of this load. Multiple parser nodes can be installed, and they should run in HA mode, i.e., high-availability mode: if one of the instances goes down, the others should be capable of handling all the traffic in the environment. For larger clusters, load-balanced parsers can be applied, so that the event load is routed and distributed evenly across them [2, 16].

Phase three mainly focuses on storing and visualizing the data. As mentioned earlier, the storage should be efficient in terms of volume; it can be static storage, a simple database, or a cloud service such as S3. An Elasticsearch database can be used here to store the data in JSON format. Once the data is segregated into key-value pairs, we can easily use it to analyse the data's behaviour [1, 2].
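
As a sketch of this storage step, the snippet below indexes one parsed key-value event into Elasticsearch as a JSON document over its REST API. The host, index name and document fields are illustrative assumptions.

```python
import requests  # pip install requests

# Hypothetical Elasticsearch endpoint and index name.
ES_URL = "http://localhost:9200"
INDEX = "app-logs"

# A parsed event, already segregated into attribute-value pairs.
event = {
    "@timestamp": "2023-04-01T12:00:03Z",
    "level": "ERROR",
    "logger": "OrderService",
    "message": "java.lang.NullPointerException: id was null",
    "host": "web-01",
}

# POST /<index>/_doc stores the document and assigns it an id.
resp = requests.post(f"{ES_URL}/{INDEX}/_doc", json=event, timeout=5)
resp.raise_for_status()
print("stored with id:", resp.json()["_id"])
```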

Another important aspect is to choose storage that can handle a substantial number of read/write operations at low latency and provide efficient, near-real-time output. This time-lapse parameter should be considered, as further alerting and visualization depend on it. Data can be aggregated and stored in attribute-value format and depicted in the form of graphs, line charts, pie charts, heat maps, etc.; various dashboards and canvas reports can be made out of it [17]. Stored data can be aggregated on any configured attribute present in the logs, which makes fault finding and debugging easy. Alerts can be configured on the stored data in the form of email, and the integration can be extended to any destination or ticketing tool for further analysis. Machine learning modules can also be applied over the data: a module can predict the number of resources required based on prior knowledge, and it can predict failures in applications [1, 2, 18].
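
To make the aggregation idea concrete, the sketch below runs a terms aggregation over a configured attribute (here a hypothetical `level.keyword` field) for the last hour via Elasticsearch's `_search` API; results like these are what dashboards and charts are plotted from.

```python
import requests  # pip install requests

ES_URL = "http://localhost:9200"
INDEX = "app-logs"  # assumed index name

# Count events per log level over the last hour.
query = {
    "size": 0,  # only the aggregation buckets are wanted, not raw hits
    "query": {"range": {"@timestamp": {"gte": "now-1h"}}},
    "aggs": {"by_level": {"terms": {"field": "level.keyword"}}},
}

resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
resp.raise_for_status()
for bucket in resp.json()["aggregations"]["by_level"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```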

4 Results and Performance Analysis

This architecture can be built in both ways, i.e., centralized or decentralized, as in Fig. 2. The solution can run on a hardware device with a CPU of more than two cores and more than 4 GB of memory; any Linux flavour will suffice. A lightweight agent, built in the Go language, is installed on the system. The agent varies as per need: if system statistics are to be gathered, a Metricbeat type of agent is installed; if file data or log files need to be stored, parsed, and monitored, a Filebeat type of agent is required; similarly, if application health or URLs are to be monitored, a Heartbeat type of agent is installed.

Fig. 2 Designed architecture

Events that are collected can be parsed: the required strings are captured and stored in the database, and unusual data can be discarded at the initial stage. The parser collects the data on a specific port, e.g., 5044. Once an event is received, the string is segregated so that the data is stored in attribute-value format; timestamp data or real-time data is parsed. The parser's configuration can be considered in three important parts, namely input, filter, and output. The input part listens on the mentioned port, the filter part segregates the data, and the output part directs it to the desired index. Here the output is an Elasticsearch type of database where the data is stored, and all graphical interfaces run on top of it. A sketch of this three-part pipeline follows.
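
The following Python sketch mirrors that input/filter/output structure: it listens on port 5044, filters each line into attribute-value pairs with a grok-style pattern, and outputs the result to an Elasticsearch index. The port, index name and pattern are assumptions; a Logstash-style parser would express the same three parts in its configuration file.

```python
import re
import socketserver

import requests  # pip install requests

ES_URL = "http://localhost:9200"
INDEX = "app-logs"           # output: the desired index
PATTERN = re.compile(        # filter: segregates the string
    r"(?P<timestamp>\S+ \S+)\s+(?P<level>INFO|WARN|ERROR)\s+"
    r"(?P<logger>\S+)\s+-\s+(?P<message>.*)"
)

class ParserHandler(socketserver.StreamRequestHandler):
    def handle(self):
        for raw in self.rfile:  # input: one event per line
            match = PATTERN.match(raw.decode(errors="replace").strip())
            if not match:
                continue        # discard unusual data at the initial stage
            requests.post(f"{ES_URL}/{INDEX}/_doc",
                          json=match.groupdict(), timeout=5)

if __name__ == "__main__":
    # input: listen on the configured port
    with socketserver.TCPServer(("0.0.0.0", 5044), ParserHandler) as srv:
        srv.serve_forever()
```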

Gathered data is seen in the form of indices, which are used to create aggregated dashboards; canvas reporting, i.e. real-time reporting, is built on top of them. Watcher programs (in Java, for example) can be used to monitor the data: every 5 min a program executes to identify whether all nodes are reporting and whether all the nodes in the environment are performing well. Integration with an SMTP server is required if Elasticsearch data alerts are to be sent in email format. This gives a complete pictorial view of the application and infrastructure on a single dashboard. A watcher-style check is sketched below.
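
As an illustration of such a watcher, this sketch asks Elasticsearch which hosts have reported in the last 5 min, compares that against an expected node list, and emails an alert over SMTP for any silent node. All hostnames, addresses and index names are placeholders.

```python
import smtplib
from email.message import EmailMessage

import requests  # pip install requests

ES_URL = "http://localhost:9200"
INDEX = "metrics"                         # assumed metrics index
EXPECTED = {"web-01", "web-02", "db-01"}  # nodes that should report
SMTP_HOST = "smtp.example.com"            # placeholder SMTP server

def reporting_hosts():
    """Hosts that shipped at least one event in the last 5 minutes."""
    query = {
        "size": 0,
        "query": {"range": {"@timestamp": {"gte": "now-5m"}}},
        "aggs": {"hosts": {"terms": {"field": "host.keyword", "size": 1000}}},
    }
    resp = requests.post(f"{ES_URL}/{INDEX}/_search", json=query, timeout=10)
    resp.raise_for_status()
    buckets = resp.json()["aggregations"]["hosts"]["buckets"]
    return {b["key"] for b in buckets}

def alert(missing):
    msg = EmailMessage()
    msg["Subject"] = f"Monitoring alert: {len(missing)} node(s) silent"
    msg["From"] = "monitor@example.com"
    msg["To"] = "ops@example.com"
    msg.set_content("Not reporting in last 5 min: " + ", ".join(sorted(missing)))
    with smtplib.SMTP(SMTP_HOST) as smtp:
        smtp.send_message(msg)

# Run this from a scheduler (e.g. cron) every 5 minutes.
missing = EXPECTED - reporting_hosts()
if missing:
    alert(missing)
```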

Replication of the proposed solution is easy, as the development effort is a one-time cost. Being scalable in nature, the solution can be scaled using various automation tools: agents or monitoring shippers can be installed across the entire infrastructure using a single playbook, and expansion of the parser only requires replication of an existing parser. To generalize, the time needed to expand the solution is minimal.

5 Advancement

We now have past-dated information that is parsed and stored over the cloud, which is nothing but a data set for further use. This data set can be used in two ways: prediction and dynamic alert configuration.

Prediction comprises analysing the data, identifying trends over past dates, and predicting the future based on past data. This can be used for early resource planning. For example, on a streaming platform where traffic increases on holidays, we can refer to the trends predicted by the machine learning model to plan server capacity that can fulfil requests without delay. Similarly, this can be used for application business management [4]. A simple trend sketch follows.
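
As a deliberately simple stand-in for such a model, the sketch below fits a linear trend to daily peak request counts and extrapolates the next day's load. The sample numbers and the per-server capacity are invented, and a real deployment would use a proper forecasting model.

```python
from statistics import linear_regression  # Python 3.10+

# Hypothetical daily peak requests/min over the past week.
days = [1, 2, 3, 4, 5, 6, 7]
peaks = [820, 860, 905, 950, 1010, 1055, 1100]

# Fit peak = slope * day + intercept and extrapolate one day ahead.
slope, intercept = linear_regression(days, peaks)
forecast = slope * 8 + intercept
print(f"expected peak tomorrow: {forecast:.0f} requests/min")

# Capacity planning: provision servers ahead of demand, assuming
# (for illustration) one server handles 300 requests/min.
servers = -(-int(forecast) // 300)  # ceiling division
print(f"plan for {servers} servers")
```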

With the same predictions, thresholds can be adjusted dynamically. The alerting discussed in phase three above has threshold values that were defined statically; these thresholds are generalized as a certain amount of resources that an active application should consume. However, a positive or negative deviation in this consumption is a sign of some discrepancy. Dynamic threshold definition will help with proactive monitoring [19], as sketched below.
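
One common way to derive such dynamic thresholds, sketched below, is to take the mean of recent consumption plus or minus a few standard deviations, so that alerts fire on deviation from learned behaviour rather than on a fixed number. The window size and the factor of 3 are assumptions.

```python
from statistics import mean, stdev

def dynamic_band(history, k=3.0):
    """Return (low, high) alert thresholds from recent consumption."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

# Hypothetical recent CPU readings (%) from the stored metrics.
recent_cpu = [41.0, 43.5, 40.2, 44.1, 42.8, 39.9, 43.0, 41.7]
low, high = dynamic_band(recent_cpu)

current = 67.4  # latest reading
if not (low <= current <= high):
    # Both spikes and unusual drops count as discrepancies.
    print(f"ALERT: cpu {current}% outside dynamic band [{low:.1f}, {high:.1f}]")
```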

6 Comparison of Cloud Monitoring Solutions

In comparison to existing solutions, the highlights mentioned below make the system more effective. Existing solutions here mainly refer to the on-demand services provided by cloud service providers such as Amazon, Google, or Microsoft, or any monitoring tool such as Nagios or AppDynamics. Nagios uses an NRPE-style protocol that acts like a Beat agent to transfer data from source to destination, and it allows users to design and deploy customised plugins over the infrastructure; designing such plugins is easy. AppDynamics has a generic architecture wherein shippers are associated to transfer data; dynamic real-time monitoring can be obtained using it. In comparison to all these tools, the given solution has the benefits mentioned below [8, 11].

Cost Effective: To use any such service one has to pay under a pay-per-use model, and for a monitoring service the usage is very high, as the infrastructure or application should always be under monitoring to avoid outages. The given solution is totally open source and requires no cost, hence the monitoring is free. However, minimal storage cost is required if data storage is to be done.

Customized Monitoring: One can monitor application logs and system logs. Grok patterns can be applied to the data to find errors, warnings, or any desired string, and one can parse and find various error codes. Application endpoints can be polled at a certain interval to check whether they are up. All this leads to proactive monitoring and zero downtime.

Easy to Setup and Maintain: Installation of agents over the servers is a simple job with automation tools such as Ansible or Chef. A single server with a playbook can install agents on multiple servers within little time. Similarly, maintenance and upgrades become a one-step activity.

Scalable Solution: This solution can be implemented on any cloud infrastructure. Scaling the application is quite easy, as one has to replicate the parser behind load balancers if required; storage, too, can be scaled if traffic is high. With all these changes, the implementation logic remains the same.

Static/Dynamic Proactive Alerting: As mentioned earlier, alerting works on static thresholds as well as dynamic thresholds. Alerting is proactive, i.e., it alerts the user before an issue occurs. This provides minimum or no outage for the application business [1, 4].

7 Conclusion

Client machines are the small nodes that sit mostly idle but are treated as application servers. These machines have lightweight shippers placed on them, which can be installed manually or in an automated way. Small programs run all the time to monitor each instance. The solution, being open source, is easy to manage and maintain in a cloud infrastructure. We thus have multiple clients residing in the cloud, along with a centralized server in the same environment. Metrics are fetched from the clients and stored on the server, where analysis can be done on the saved data. Data can be visualized as gauges, graphs, line charts, and bar charts, and dashboards can be plotted as and when required. Moreover, alerting, including proactive alerting, can be set up based on data requirements.

This alerting feature helps the user to get alerts when something is going wrong or a few features in the application are not working as expected, allowing the end user to identify and fix the failure before the system crashes. One can analyse the entire cluster on one screen. The solution is scalable and, being open source, can be replicated over any cluster, from two-node clusters to two thousand nodes, given an infrastructure design capable of handling the load.