Hierarchical Monitoring Services for Efficient Distributed System Management
An essential part of an SLA-aware infrastructure is a scalable and self-sufficient monitoring system capable of monitoring large distributed systems in real time. The monitoring system must support two distinct perspectives arising from the Service Level Agreement (SLA): the customer’s perspective and the infrastructure/service provider’s perspective. The customer is interested in the SLA alone, while the provider also needs to be able to optimize the utilization of the infrastructure. To help process and manage the volume and variety of monitoring data, a multi-layer monitoring architecture has been proposed by Infrastructure Management.
Multi-Layer Monitoring Architecture
The distributed multi-layer monitoring architecture may comprise as many layers as necessary to support the monitoring of the underlying infrastructure. These layers are grouped into three logical layers according to their primary purpose, the volume of input and output events, and the degree of processing.

The lowest layer of the hierarchy, the data collection layer (L0), is mainly used for the collection of raw input data. Basic filtering and pre-processing of collected information can also be applied at L0 to reduce network traffic. However, processing at L0 should be kept to a minimum to limit the resources consumed by monitoring itself.

The second logical layer is the event evaluation layer (L1), which supports the integration of monitors into a cascade of increasingly complex monitors, ranging from simple metric checks to composed monitors. Composed monitors re-use other monitoring agents to process complex rules, e.g. monitoring an entire cluster while taking the relationships between its nodes into account.

The top-most layer, the service layer (L2), configures the lower layers and defines the meaning of the monitoring events they generate. The architecture prevents top-level monitors from connecting directly to the data collection layer and bypassing the event evaluation layer. The L2 layer is a collection of conceptually similar functions that provide services to any service dealing with the infrastructure, and it receives its inputs from the layers below it.
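The cascade of monitors at L1 can be illustrated with a minimal sketch. The class names `MetricCheck` and `ComposedMonitor`, the `readings` dictionary layout and the "at least N healthy nodes" rule are all assumptions made for illustration; the post does not specify how composed rules are expressed.

```python
from typing import Dict, List, Protocol

# Hypothetical layout: readings[node][metric] -> latest value collected at L0.
Readings = Dict[str, Dict[str, float]]


class Monitor(Protocol):
    def evaluate(self, readings: Readings) -> bool: ...


class MetricCheck:
    """Simple monitor: one metric on one node must stay at or below a limit."""

    def __init__(self, node: str, metric: str, limit: float) -> None:
        self.node = node
        self.metric = metric
        self.limit = limit

    def evaluate(self, readings: Readings) -> bool:
        return readings[self.node][self.metric] <= self.limit


class ComposedMonitor:
    """Composed monitor: re-uses other monitors and combines their results.

    The rule used here (at least `min_healthy` children must hold) is only
    one possible way to relate the nodes of a cluster to each other.
    """

    def __init__(self, children: List[Monitor], min_healthy: int) -> None:
        self.children = children
        self.min_healthy = min_healthy

    def evaluate(self, readings: Readings) -> bool:
        healthy = sum(1 for child in self.children if child.evaluate(readings))
        return healthy >= self.min_healthy
```

Because a `ComposedMonitor` accepts any object with an `evaluate` method, composed monitors can themselves be children of other composed monitors, which gives the cascade described above.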

The following subsections describe the logical layers of the architecture in further detail.
Data Collection Layer (L0)
Layer 0 represents the data-collecting monitoring agents, which produce low-level and (mostly) unprocessed events. These agents are wrappers around specialized data collectors, such as Ganglia or Munin at the infrastructure layer. They can also use probes into the Virtual Machine Monitor (xentop), /proc, the kernel, or middleware components (web server, application server, database).

Data collection monitors support three modes of operation: timed push, conditional push and pull. Timed push publishes the configured metrics at uniform intervals, regardless of whether the value has changed since it was last published. Agents support an arbitrary number of timed pushes, i.e. different metrics can be requested at different intervals. Conditional push provides a basic mechanism for reducing network load: a metric value is pushed on the channel only when it exceeds the last published value by a specified threshold, given as an absolute value or as a percentage. Agents from higher layers can also opt to query metrics from data collectors only when they need them for their own calculations, so L0 agents also allow metrics to be pulled on request. Each agent can be configured to work in one or more of these modes simultaneously.
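The three modes of operation can be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `L0Agent`, `ConditionalPush`, the `read_metric` and `publish` callbacks and the explicit `tick(now)` call are hypothetical names. The sketch also assumes that a conditional push fires on a change in either direction, which is an interpretation of the threshold rule.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class ConditionalPush:
    threshold: float
    relative: bool = False  # True: threshold is a fraction of the last published value
    last_published: Optional[float] = None

    def should_publish(self, value: float) -> bool:
        if self.last_published is None:
            return True  # nothing published yet, so the first reading goes out
        limit = self.threshold
        if self.relative:
            limit = self.threshold * abs(self.last_published)
        # Assumption: rises and drops are both treated as significant changes.
        return abs(value - self.last_published) > limit


class L0Agent:
    def __init__(self, read_metric: Callable[[str], float],
                 publish: Callable[[str, float], None]) -> None:
        self.read_metric = read_metric  # wraps the underlying data collector
        self.publish = publish          # sends a value on the event channel
        self.timed: List[list] = []     # entries of [metric, interval, next_due]
        self.conditional: Dict[str, ConditionalPush] = {}

    def add_timed_push(self, metric: str, interval: float, now: float = 0.0) -> None:
        self.timed.append([metric, interval, now + interval])

    def add_conditional_push(self, metric: str, threshold: float,
                             relative: bool = False) -> None:
        self.conditional[metric] = ConditionalPush(threshold, relative)

    def pull(self, metric: str) -> float:
        """Pull mode: return the current value on request."""
        return self.read_metric(metric)

    def tick(self, now: float) -> None:
        """Run one scheduling step; a real agent would call this from a timer."""
        for entry in self.timed:
            metric, interval, due = entry
            if now >= due:
                self.publish(metric, self.read_metric(metric))
                entry[2] = due + interval
        for metric, rule in self.conditional.items():
            value = self.read_metric(metric)
            if rule.should_publish(value):
                rule.last_published = value
                self.publish(metric, value)
```

Keeping the timed and conditional configurations in separate collections lets one agent serve several modes at once, and lets the same metric be published at different intervals for different consumers.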
Evaluation Layer (L1)
The event evaluation layer is a dynamic network of distributed agents. These agents may be deployed on a dedicated infrastructure node to reduce the overhead on the nodes offered to customers/users. All agents are subscribed to a single configuration channel, to which service layer monitors publish monitoring requests. Monitoring requests are represented as rules that L1 monitors are expected to validate. Every L1 monitor can verify whether it supports validation of the requested rules and whether it has sufficient resources to accept additional monitoring work. Agents willing to start monitoring the solicited rules notify the requester, which then decides, based on these responses, which monitor to send the configuration to. It is also possible to select several monitors for the same validation rule.
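The request and response exchange over the configuration channel might look like the sketch below. Everything named here is hypothetical: `Rule`, `L1Monitor`, `ConfigurationChannel`, the `cost` and `capacity` resource units, and the selection policy (prefer the volunteers with the most spare capacity), which the post leaves open.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Rule:
    rule_id: str
    kind: str  # hypothetical rule category, e.g. "threshold" or "cluster"
    cost: int  # hypothetical resource units needed to validate the rule


@dataclass
class L1Monitor:
    name: str
    supported_kinds: Set[str]
    capacity: int
    active: List[Rule] = field(default_factory=list)

    def used(self) -> int:
        return sum(rule.cost for rule in self.active)

    def offers(self, rule: Rule) -> bool:
        """Volunteer only if the rule is supported and resources suffice."""
        return (rule.kind in self.supported_kinds
                and self.used() + rule.cost <= self.capacity)

    def configure(self, rule: Rule) -> None:
        self.active.append(rule)


class ConfigurationChannel:
    """Single channel to which every L1 monitor is subscribed."""

    def __init__(self) -> None:
        self.subscribers: List[L1Monitor] = []

    def subscribe(self, monitor: L1Monitor) -> None:
        self.subscribers.append(monitor)

    def solicit(self, rule: Rule) -> List[L1Monitor]:
        """Publish a request and collect the monitors willing to take it."""
        return [m for m in self.subscribers if m.offers(rule)]


def assign_rule(channel: ConfigurationChannel, rule: Rule,
                replicas: int = 1) -> List[L1Monitor]:
    """Requester side: choose among volunteers and send them the configuration."""
    volunteers = channel.solicit(rule)
    # Assumed policy: prefer the monitors with the most spare capacity.
    volunteers.sort(key=lambda m: m.capacity - m.used(), reverse=True)
    chosen = volunteers[:replicas]
    for monitor in chosen:
        monitor.configure(rule)
    return chosen
```

Passing `replicas` greater than one corresponds to selecting several monitors for the same validation rule; if fewer monitors volunteer than requested, the sketch simply configures those that did.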
Service Layer (L2)
Every request for monitoring has to enter through one or more service layer (L2) monitors, which form the boundary of the entire monitoring architecture. The activities of these monitors are composed according to the users’ requests. Each L2 monitor implies a certain configuration of the L1 monitors, which is managed dynamically. Users of L2 monitors may be infrastructure providers, service providers or even service customers who require immediate notification when the agreement is breached. Events from L1 monitors are used as triggers for the execution of the required actions. Examples of service layer tasks are:
- Auditing task – logging and monitoring customer’s usage of resources.
- Accounting task – producing information used for billing.
- Autonomous management – providing self re-organization of the infrastructure in order to improve the utilization without breaking any SLA.
- Notification task – notifying various stakeholders (service provider, customer) of a broken rule.
- Historical information repository task – storing various monitoring information, ranging from raw data about infrastructure and/or services to events raised by different monitors.
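How L1 events trigger service layer tasks can be sketched with a small dispatcher. The `ServiceLayerMonitor` class, the event dictionary fields (`type`, `rule`, `customer`) and the two example tasks are assumptions for illustration only; the post does not define an event format.

```python
from typing import Callable, Dict, List

Event = Dict[str, str]  # hypothetical event shape, e.g. {"type": ..., "rule": ...}
Task = Callable[[Event], None]


class ServiceLayerMonitor:
    """L2 monitor: maps event types coming from L1 to the tasks they trigger."""

    def __init__(self) -> None:
        self.tasks: Dict[str, List[Task]] = {}

    def on(self, event_type: str, task: Task) -> None:
        self.tasks.setdefault(event_type, []).append(task)

    def handle(self, event: Event) -> int:
        """Run every task registered for the event type; return how many ran."""
        triggered = self.tasks.get(event["type"], [])
        for task in triggered:
            task(event)
        return len(triggered)


# Two of the tasks listed above, reduced to in-memory stand-ins.
history: List[Event] = []        # historical information repository
notifications: List[str] = []    # messages for stakeholders


def store_event(event: Event) -> None:
    history.append(event)


def notify_stakeholders(event: Event) -> None:
    notifications.append(f"rule {event['rule']} broken for {event['customer']}")


l2 = ServiceLayerMonitor()
l2.on("rule_broken", store_event)
l2.on("rule_broken", notify_stakeholders)
l2.on("usage_sample", store_event)
```

A breach event then reaches both the repository and the notification task, while a plain usage sample is only stored.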