From 32c20132d014c9735483e1475e0c9864f8290e29 Mon Sep 17 00:00:00 2001
From: Michael Friedrich
Date: Sun, 24 Aug 2014 11:21:54 +0200
Subject: [PATCH] Documentation: Rewrite cluster docs

* Re-organize structure
* New section with HA features
* Permissions and security
* How to add a new node
* Cluster requirements
* Additional hints on installation
* More troubleshooting

fixes #6743
fixes #6703
fixes #6997
---
 doc/4-monitoring-remote-systems.md | 409 +++++++++++++++++++----------
 doc/7-troubleshooting.md           |   8 +
 etc/icinga2/constants.conf.cmake   |   2 +-
 3 files changed, 278 insertions(+), 141 deletions(-)

diff --git a/doc/4-monitoring-remote-systems.md b/doc/4-monitoring-remote-systems.md
index 366585077..089e4dabe 100644
--- a/doc/4-monitoring-remote-systems.md
+++ b/doc/4-monitoring-remote-systems.md
@@ -144,48 +144,125 @@ passing the check results to Icinga 2.
 remote sender to push check results into the Icinga 2 `ExternalCommandListener`
 feature.
 
-## Distributed Monitoring and High Availability
+> **Note**
+>
+> This addon works in a similar fashion to the Icinga 1.x distributed model. If you
+> are looking for a real distributed architecture with Icinga 2, scroll down.
+
-An Icinga 2 cluster consists of two or more nodes and can reside on multiple
-architectures. The base concept of Icinga 2 is the possibility to add additional
-features using components. In case of a cluster setup you have to add the api feature
-to all nodes.
+## Distributed Monitoring and High Availability
 
-An Icinga 2 cluster can be used for the following scenarios:
+Building distributed environments with high availability included is fairly easy with Icinga 2.
+The cluster feature is built-in and allows you to build many scenarios based on your requirements:
 
 * [High Availability](#cluster-scenarios-high-availability). All instances in the `Zone` elect one active master and run as Active/Active cluster.
 * [Distributed Zones](#cluster-scenarios-distributed-zones). A master zone and one or more satellites in their zones.
 * [Load Distribution](#cluster-scenarios-load-distribution). A configuration master and multiple checker satellites.
 
+You can combine these scenarios into a global setup fitting your requirements.
+
+Each instance has its own event scheduler and does not depend on a centralized master
+coordinating and distributing the events. In case of a cluster failure, all nodes
+continue to run independently. Make sure you get alerted when your cluster fails and a
+split-brain scenario is in effect - all surviving instances continue to do their job,
+and their histories will begin to differ.
+
+> **Note**
+>
+> Before you start, make sure to read the [requirements](#distributed-monitoring-requirements).
+
+
+### Cluster Requirements
+
+Before you start deploying, keep the following things in mind:
+
+* Your [SSL CA and certificates](#certificate-authority-certificates) are mandatory for secure communication
+* Get pen and paper or a drawing board and design your nodes and zones!
+    * all nodes in a cluster zone provide high availability functionality and trust each other
+    * cluster zones can be built in a top-down design where the child trusts the parent
+    * communication between zones is bi-directional, which means that a DMZ-located node can still reach the master node, or vice versa
+* Update firewall rules and ACLs
+* Decide whether to use the built-in [configuration synchronization](#cluster-zone-config-sync) or use an external tool (Puppet, Ansible, Chef, Salt, etc.) to manage the configuration deployment
+
+
 > **Tip**
 >
 > If you're looking for troubleshooting cluster problems, check the general
 > [troubleshooting](#troubleshooting-cluster) section.
 
-Before you start configuring the diffent nodes it is necessary to setup the underlying
-communication layer based on SSL.
+#### Cluster Naming Convention
+
+The SSL certificate common name (CN) will be used by the [ApiListener](#objecttype-apilistener)
+object to determine the local authority. This name must match the local [Endpoint](#objecttype-endpoint)
+object name.
+
+Example:
+
+    # icinga2-build-key icinga2a
+    ...
+    Common Name (e.g. server FQDN or YOUR name) [icinga2a]:
+
+    # vim cluster.conf
+
+    object Endpoint "icinga2a" {
+      host = "icinga2a.icinga.org"
+    }
+
+The [Endpoint](#objecttype-endpoint) name is further referenced as the `endpoints` attribute on the
+[Zone](#objecttype-zone) object.
+
+    object Endpoint "icinga2b" {
+      host = "icinga2b.icinga.org"
+    }
+
+    object Zone "config-ha-master" {
+      endpoints = [ "icinga2a", "icinga2b" ]
+    }
+
+Specifying the local node name using the [NodeName](#configure-nodename) variable requires
+the same name as used for the endpoint name and common name above. If not set, the FQDN is used.
+
+    const NodeName = "icinga2a"
+
 ### Certificate Authority and Certificates
 
 Icinga 2 ships two scripts assisting with CA and node certificate creation
 for your Icinga 2 cluster.
 
-The first step is the creation of CA running the following command:
-
-    # icinga2-build-ca
+> **Note**
+>
+> You're free to use your own method to generate a valid CA and signed client
+> certificates.
 
 Please make sure to export the environment variable `ICINGA_CA` pointing to
 an empty folder for the newly created CA files:
 
     # export ICINGA_CA="/root/icinga-ca"
 
+The scripts will put all generated data and the required certificates in there.
+
+The first step is the creation of the certificate authority (CA) by running the
+following command:
+
+    # icinga2-build-ca
+
 Now create a certificate and key file for each node running the following command
 (replace `icinga2a` with the required hostname):
 
     # icinga2-build-key icinga2a
 
-Repeat the step for all nodes in your cluster scenario. Save the CA key in case
-you want to set up certificates for additional nodes at a later time.
+Repeat the step for all nodes in your cluster scenario.
+
+Save the CA key in a secure location in case you want to set up certificates for
+additional nodes at a later time.
+
+Navigate to the location of your newly generated certificate files, and manually
+copy/transfer them to `/etc/icinga2/pki` in your Icinga 2 configuration folder.
+
+> **Note**
+>
+> The certificate files must be readable by the user Icinga 2 is running as. Also,
+> the private key file must not be world-readable.
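+
+For example - this is only a sketch, assuming the `ICINGA_CA` directory from above,
+a node common name of `icinga2a` and an `icinga` daemon user (file names, paths and
+the user/group depend on your distribution and setup):
+
+    # mkdir -p /etc/icinga2/pki
+    # cp $ICINGA_CA/ca.crt $ICINGA_CA/icinga2a.crt $ICINGA_CA/icinga2a.key /etc/icinga2/pki/
+    # chown icinga:icinga /etc/icinga2/pki/*
+    # chmod 644 /etc/icinga2/pki/ca.crt /etc/icinga2/pki/icinga2a.crt
+    # chmod 600 /etc/icinga2/pki/icinga2a.key
+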
 Each node requires the following files in `/etc/icinga2/pki` (replace `fqdn-nodename` with
 the host's FQDN):
 
@@ -195,58 +272,49 @@ the host's FQDN):
 * <fqdn-nodename>.crt
 * <fqdn-nodename>.key
 
+### Cluster Configuration
+
+The following sections describe which configuration must be updated/created
+in order to get your cluster running with basic functionality.
+
+* [configure the node name](#configure-nodename)
+* [configure the ApiListener object](#configure-apilistener-object)
+* [configure cluster endpoints](#configure-cluster-endpoints)
+* [configure cluster zones](#configure-cluster-zones)
+
+Once you're finished with the basic setup, the following sections describe
+how to use [zone configuration synchronisation](#cluster-zone-config-sync)
+and configure [cluster scenarios](#cluster-scenarios).
 
-### Configure the Icinga Node Name
+#### Configure the Icinga Node Name
 
 Instead of using the default FQDN as node name you can optionally set
 that value using the [NodeName](#global-constants) constant.
+
+> **Note**
+>
+> Skip this step if your FQDN already matches the default `NodeName` set
+> in `/etc/icinga2/constants.conf`.
+
 This setting must be unique for each node, and must also match
 the name of the local [Endpoint](#objecttype-endpoint) object and the
-SSL certificate common name.
+SSL certificate common name as described in the
+[cluster naming convention](#cluster-naming-convention).
+
+    vim /etc/icinga2/constants.conf
+
+    /* Our local instance name. By default this is the server's hostname as returned by `hostname --fqdn`.
+     * This should be the common name from the API certificate.
+     */
+    const NodeName = "icinga2a"
+
 Read further about additional [naming conventions](#cluster-naming-convention).
 
 Not specifying the node name will make Icinga 2 using the FQDN. Make sure that all
 configured endpoint names and common names are in sync.
 
-### Cluster Naming Convention
-
-The SSL certificate common name (CN) will be used by the [ApiListener](#objecttype-apilistener)
-object to determine the local authority. This name must match the local [Endpoint](#objecttype-endpoint)
-object name.
-
-Example:
-
-    # icinga2-build-key icinga2a
-    ...
-    Common Name (e.g. server FQDN or YOUR name) [icinga2a]:
-
-    # vim cluster.conf
-
-    object Endpoint "icinga2a" {
-      host = "icinga2a.icinga.org"
-    }
-
-The [Endpoint](#objecttype-endpoint) name is further referenced as `endpoints` attribute on the
-[Zone](objecttype-zone) object.
-
-    object Endpoint "icinga2b" {
-      host = "icinga2b.icinga.org"
-    }
-
-    object Zone "config-ha-master" {
-      endpoints = [ "icinga2a", "icinga2b" ]
-    }
-
-Specifying the local node name using the [NodeName](#global-constants) variable requires
-the same name as used for the endpoint name and common name above. If not set, the FQDN is used.
-
-    const NodeName = "icinga2a"
-
-
-### Configure the ApiListener Object
+#### Configure the ApiListener Object
 
 The [ApiListener](#objecttype-apilistener) object needs to be configured on every node in the
 cluster with the following settings:
 
@@ -272,8 +340,7 @@ synchronisation enabled for this node.
 > The certificate files must be readable by the user Icinga 2 is running as. Also,
 > the private key file must not be world-readable.
 
-
-### Configure Cluster Endpoints
+#### Configure Cluster Endpoints
 
 `Endpoint` objects specify the `host` and `port` settings for the cluster nodes.
 This configuration can be the same on all nodes in the cluster only containing
 
@@ -292,8 +359,7 @@ A sample configuration looks like:
 If this endpoint object is reachable on a different port, you must configure the
 `ApiListener` on the local `Endpoint` object accordingly too.
 
-
-### Configure Cluster Zones
+#### Configure Cluster Zones
 
 `Zone` objects specify the endpoints located in a zone. That way your distributed setup can be
 seen as zones connected together instead of multiple instances in that specific zone.
 
@@ -324,7 +390,7 @@ the defined parent zone `config-ha-master`.
     }
 
 
-#### Zone Configuration Synchronisation
+### Zone Configuration Synchronisation
 
 By default all objects for specific zones should be organized in
 
@@ -376,12 +442,19 @@ process.
 > determines the required include directory. This can be overridden using the
 > [global constant](#global-constants) `ZonesDir`.
 
-#### Global Configuration Zone
+#### Global Configuration Zone for Templates
 
 If your zone configuration setup shares the same templates, groups, commands, timeperiods, etc.
 you would have to duplicate quite a lot of configuration objects making the merged configuration
 on your configuration master unique.
 
+> **Note**
+>
+> Only put templates, groups, etc. into this zone. DO NOT add checkable objects such as
+> hosts or services here. If they are checked by all instances globally, this will lead
+> to duplicated check results and an unclear state history that is also hard to
+> troubleshoot - you've been warned.
+
 That is not necessary by defining a global zone shipping all those templates. By setting
 `global = true` you ensure that this zone serving common configuration templates will be
 synchronized to all involved nodes (only if they accept configuration though).
 
@@ -406,11 +479,11 @@ your zone configuration visible to all nodes.
 > **Note**
 >
 > If the remote node does not have this zone configured, it will ignore the configuration
-> update, if it accepts configuration.
+> update if it accepts synchronized configuration.
 
 If you don't require any global configuration, skip this setting.
 
-#### Zone Configuration Permissions
+#### Zone Configuration Synchronisation Permissions
 
 Each [ApiListener](#objecttype-apilistener) object must have the `accept_config` attribute
 set to `true` to receive configuration from the parent `Zone` members. Default value is `false`.
 
@@ -422,15 +495,13 @@ set to `true` to receive configuration from the parent `Zone` members. Default v
       accept_config = true
     }
 
-### Initial Cluster Sync
+If `accept_config` is set to `false`, this instance won't accept configuration from remote
+master instances anymore.
 
-In order to make sure that all of your cluster nodes have the same state you will
-have to pick one of the nodes as your initial "master" and copy its state file
-to all the other nodes.
-
-You can find the state file in `/var/lib/icinga2/icinga2.state`. Before copying
-the state file you should make sure that all your cluster nodes are properly shut
-down.
+> **Tip**
+>
+> Look into the [troubleshooting guides](#troubleshooting-cluster-config-sync) for debugging
+> problems with the configuration synchronisation.
 
 ### Cluster Health Check
 
@@ -441,12 +512,12 @@ one or more configured nodes are not connected.
 
 Example:
 
-    apply Service "cluster" {
+    object Service "cluster" {
       check_command = "cluster"
       check_interval = 5s
       retry_interval = 1s
 
-      assign where host.name == "icinga2a"
+      host_name = "icinga2a"
     }
 
 Each cluster node should execute its own local cluster health check to
 
@@ -458,78 +529,33 @@ connected zones.
 
 Example for the `checker` zone checking the connection to the `master` zone:
 
-    apply Service "cluster-zone-master" {
+    object Service "cluster-zone-master" {
       check_command = "cluster-zone"
       check_interval = 5s
       retry_interval = 1s
       vars.cluster_zone = "master"
 
-      assign where host.name == "icinga2b"
+      host_name = "icinga2b"
     }
 
-### Host With Multiple Cluster Nodes
-
-Special scenarios might require multiple cluster nodes running on a single host.
-By default Icinga 2 and its features will place their runtime data below the prefix
-`LocalStateDir`. By default packages will set that path to `/var`.
-You can either set that variable as constant configuration
-definition in [icinga2.conf](#icinga2-conf) or pass it as runtime variable to
-the Icinga 2 daemon.
-
-    # icinga2 -c /etc/icinga2/node1/icinga2.conf -DLocalStateDir=/opt/node1/var
-
-### High Availability with DB IDO
-
-All instances within the same zone (e.g. the `master` zone as HA cluster) must
-have the DB IDO feature enabled.
-
-Example DB IDO MySQL:
-
-    # icinga2-enable-feature ido-mysql
-    The feature 'ido-mysql' is already enabled.
-
-By default the DB IDO feature only runs on the elected zone master. All other nodes
-disable the active IDO database connection at runtime.
-
-> **Note**
->
-> The DB IDO HA feature can be disabled by setting the `enable_ha` attribute to `false`
-> for the [IdoMysqlConnection](#objecttype-idomysqlconnection) or
-> [IdoPgsqlConnection](#objecttype-idopgsqlconnection) object on all nodes in the
-> same zone.
->
-> All endpoints will enable the DB IDO feature then, connect to the configured
-> database and dump configuration, status and historical data on their own.
-
-If the instance with the active DB IDO connection dies, the HA functionality will
-re-enable the DB IDO connection on the newly elected zone master.
-
-The DB IDO feature will try to determine which cluster endpoint is currently writing
-to the database and bail out if another endpoint is active. You can manually verify that
-by running the following query:
-
-    icinga=> SELECT status_update_time, endpoint_name FROM icinga_programstatus;
-     status_update_time     | endpoint_name
-    ------------------------+---------------
-     2014-08-15 15:52:26+02 | icinga2a
-    (1 Zeile)
-
-This is useful when the cluster connection between endpoints breaks, and prevents
-data duplication in split-brain-scenarios. The failover timeout can be set for the
-`failover_timeout` attribute, but not lower than 60 seconds.
-
 ### Cluster Scenarios
 
 All cluster nodes are full-featured Icinga 2 instances. You only need to enabled the features
 for their role (for example, a `Checker` node only requires the `checker` feature enabled,
 but not `notification` or `ido-mysql` features).
 
-Each instance got their own event scheduler, and does not depend on a centralized master
-coordinating and distributing the events. In case of a cluster failure, all nodes
-continue to run independently. Be alarmed when your cluster fails and a Split-Brain-scenario
-is in effect - all alive instances continue to do their job, and history will begin to differ.
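+
+For example, a dedicated checker node in a load distribution setup could enable only the
+features required for its role. This is only a sketch, assuming the default feature names
+shipped with Icinga 2 and a distribution using `service` for managing the daemon:
+
+    # icinga2-enable-feature api
+    # icinga2-enable-feature checker
+    # icinga2-disable-feature notification
+    # service icinga2 restart
+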
+#### Security in Cluster Scenarios
+
+While there are certain capabilities to ensure safe communication between all
+nodes (firewalls, policies, software hardening, etc.), the Icinga 2 cluster also provides
+additional security itself:
+
+* [SSL certificates](#certificate-authority-certificates) are mandatory for cluster communication.
+* Child zones only receive event updates (check results, commands, etc.) for their configured objects.
+* Zones cannot influence/interfere with other zones. Each checked object is assigned to only one zone.
+* All nodes in a zone trust each other.
+* [Configuration sync](#zone-config-sync-permissions) is disabled by default.
 
 #### Features in Cluster Zones
 
@@ -539,11 +565,13 @@ Even further all commands are distributed amongst connected nodes. For example, you can
 re-schedule a check or acknowledge a problem on the master, and it gets replicated to
 the actual slave checker node.
 
-DB IDO on the left, graphite on the right side - works.
+DB IDO on the left, graphite on the right side - works (if you disable
+[DB IDO HA](#high-availability-db-ido)).
 Icinga Web 2 on the left, checker and notifications on the right side - works too.
-Everything on the left and on the right side - make sure to deal with duplicated notifications
-and automated check distribution.
-
+Everything on the left and on the right side - make sure to deal with
+[load-balanced notifications and checks](#high-availability-features) in a
+[HA zone](#cluster-scenarios-high-availability).
 
 #### Distributed Zones
 
 That scenario fits if your instances are spread over the globe and they all report
 
@@ -612,7 +640,6 @@ The zones would look like:
 The `nuremberg-master` zone will only execute local checks, and receive check results
 from the satellite nodes in the zones `berlin` and `vienna`.
 
-
 #### Load Distribution
 
 If you are planning to off-load the checks to a defined set of remote workers
 
@@ -663,17 +690,13 @@ Zones:
       global = true
     }
 
-
-#### High Availability
+#### Cluster High Availability
 
 High availability with Icinga 2 is possible by putting multiple nodes into
-a dedicated `Zone`. All nodes will elect their active master, and retry an
+a dedicated `Zone`. All nodes will elect one active master, and retry an
 election once the current active master failed.
 
-Selected features (such as [DB IDO](#high-availability-db-ido)) will only be
-active on the current active master.
-All other passive nodes will pause the features without reload/restart.
-
+Selected features provide advanced [HA functionality](#high-availability-features).
 Checks and notifications are load-balanced between nodes in the high availability
 zone.
 
@@ -693,7 +716,6 @@ Two or more nodes in a high availability setup require an [initial cluster sync]
 > configuration files in the `zones.d` directory. All other nodes must not
 > have that directory populated. Detail in the [Configuration Sync Chapter](#cluster-zone-config-sync).
 
-
 #### Multiple Hierachies
 
 Your master zone collects all check results for reporting and graphing and also
 
@@ -717,3 +739,110 @@ department instances. Furthermore the master NOC is able to see what's going on.
 The instances in the departments will serve a local interface, and allow the administrators
 to reschedule checks or acknowledge problems for their services.
+
+
+### High Availability for Icinga 2 Features
+
+All nodes in the same zone require the same features enabled for High Availability (HA)
+amongst them.
+
+By default the following features provide advanced HA functionality:
+
+* [Checks](#high-availability-checks) (load balanced, automated failover)
+* [Notifications](#high-availability-notifications) (load balanced, automated failover)
+* DB IDO (run once, automated failover)
+
+#### High Availability with Checks
+
+All nodes in the same zone automatically load-balance the check execution. When one instance
+fails, the other nodes will automatically take over the remaining checks.
+
+> **Note**
+>
+> If a node should not check anything, disable the `checker` feature explicitly and
+> reload Icinga 2.
+
+    # icinga2-disable-feature checker
+    # service icinga2 reload
+
+#### High Availability with Notifications
+
+Notifications are load-balanced amongst all nodes in a zone. By default this functionality
+is enabled.
+If your nodes should notify independently of any other nodes (this will cause
+duplicated notifications if not properly handled!), you can set `enable_ha = false`
+in the [NotificationComponent](#objecttype-notificationcomponent) feature.
+
+#### High Availability with DB IDO
+
+All instances within the same zone (e.g. the `master` zone as HA cluster) must
+have the DB IDO feature enabled.
+
+Example DB IDO MySQL:
+
+    # icinga2-enable-feature ido-mysql
+    The feature 'ido-mysql' is already enabled.
+
+By default the DB IDO feature only runs on the elected zone master. All other passive
+nodes disable the active IDO database connection at runtime.
+
+> **Note**
+>
+> The DB IDO HA feature can be disabled by setting the `enable_ha` attribute to `false`
+> for the [IdoMysqlConnection](#objecttype-idomysqlconnection) or
+> [IdoPgsqlConnection](#objecttype-idopgsqlconnection) object on all nodes in the
+> same zone.
+>
+> All endpoints will enable the DB IDO feature then, connect to the configured
+> database and dump configuration, status and historical data on their own.
+
+If the instance with the active DB IDO connection dies, the HA functionality will
+re-enable the DB IDO connection on the newly elected zone master.
+
+The DB IDO feature will try to determine which cluster endpoint is currently writing
+to the database and bail out if another endpoint is active. You can manually verify that
+by running the following query:
+
+    icinga=> SELECT status_update_time, endpoint_name FROM icinga_programstatus;
+     status_update_time     | endpoint_name
+    ------------------------+---------------
+     2014-08-15 15:52:26+02 | icinga2a
+    (1 row)
+
+This is useful when the cluster connection between endpoints breaks, and prevents
+data duplication in split-brain scenarios. The failover timeout can be set using the
+`failover_timeout` attribute, but not lower than 60 seconds.
+
+
+### Add a New Cluster Endpoint
+
+These steps are required for integrating a new cluster endpoint:
+
+* generate a new [SSL client certificate](#certificate-authority-certificates)
+* identify its location in the zone hierarchy
+* update the `zones.conf` file on each involved node ([endpoint](#configure-cluster-endpoints), [zones](#configure-cluster-zones))
+    * a new slave zone node requires updates for the master and slave zones
+* if the node requires the existing zone history: [initial cluster sync](#initial-cluster-sync)
+* add a [cluster health check](#cluster-health-check)
+
+#### Initial Cluster Sync
+
+In order to make sure that all of your cluster nodes have the same state you will
+have to pick one of the nodes as your initial "master" and copy its state file
+to all the other nodes.
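+
+A minimal sketch of this procedure - assuming the other cluster nodes are named `icinga2b`
+and `icinga2c`, are reachable via SSH, and Icinga 2 has been stopped on all nodes first
+(host names and paths are placeholders for your own setup):
+
+    # service icinga2 stop
+    # scp /var/lib/icinga2/icinga2.state root@icinga2b:/var/lib/icinga2/
+    # scp /var/lib/icinga2/icinga2.state root@icinga2c:/var/lib/icinga2/
+    # service icinga2 start
+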
+
+You can find the state file in `/var/lib/icinga2/icinga2.state`. Before copying
+the state file, make sure that all your cluster nodes are properly shut down.
+
+
+### Host With Multiple Cluster Nodes
+
+Special scenarios might require multiple cluster nodes running on a single host.
+Icinga 2 and its features place their runtime data below the `LocalStateDir` prefix,
+which packages set to `/var` by default.
+You can either set that variable as a constant configuration
+definition in [icinga2.conf](#icinga2-conf) or pass it as a runtime variable to
+the Icinga 2 daemon.
+
+    # icinga2 -c /etc/icinga2/node1/icinga2.conf -DLocalStateDir=/opt/node1/var
diff --git a/doc/7-troubleshooting.md b/doc/7-troubleshooting.md
index 4a4e81f26..60a07911d 100644
--- a/doc/7-troubleshooting.md
+++ b/doc/7-troubleshooting.md
@@ -164,6 +164,14 @@ they remain in a Split-Brain-mode and history may differ.
 Although the Icinga 2 cluster protocol stores historical events in a replay log for
 later synchronisation, you should make sure to check why the network connection failed.
 
+### Cluster Troubleshooting Config Sync
+
+If the cluster zones do not sync their configuration, make sure to check the following:
+
+* Within a config master zone, only one configuration master is allowed to have its config in `/etc/icinga2/zones.d`.
+    * The master syncs the configuration to `/var/lib/icinga2/api/zones/` during startup and only syncs valid configuration to the other nodes.
+    * The other nodes receive the configuration into `/var/lib/icinga2/api/zones/`.
+* The `icinga2.log` log file will indicate whether this ApiListener [accepts config](#zone-config-sync-permissions) or not.
 
 ## Debug Icinga 2
 
diff --git a/etc/icinga2/constants.conf.cmake b/etc/icinga2/constants.conf.cmake
index b8951929d..c3ceb7322 100644
--- a/etc/icinga2/constants.conf.cmake
+++ b/etc/icinga2/constants.conf.cmake
@@ -8,7 +8,7 @@ const PluginDir = "@ICINGA2_PLUGINDIR@"
 
 /* Our local instance name. By default this is the server's hostname as returned by `hostname --fqdn`.
  * This should be the common name from the API certificate.
-*/
+ */
 //const NodeName = "localhost"
 
 /* Our local zone name. */
-- 
2.40.0