   1 # <a id="distributed-monitoring-high-availability"></a> Distributed Monitoring and High Availability
   2
   3 Building distributed environments with high availability included is fairly easy with Icinga 2.
   4 The cluster feature is built-in and allows you to build many scenarios based on your requirements:
   5
   6 * [High Availability](13-distributed-monitoring-ha.md#cluster-scenarios-high-availability). All instances in the `Zone` run as an Active/Active cluster.
   7 * [Distributed Zones](13-distributed-monitoring-ha.md#cluster-scenarios-distributed-zones). A master zone and one or more satellites in their zones.
   8 * [Load Distribution](13-distributed-monitoring-ha.md#cluster-scenarios-load-distribution). A configuration master and multiple checker satellites.
   9
  10 You can combine these scenarios into a global setup fitting your requirements.
  11
  12 Each instance has its own event scheduler and does not depend on a centralized master
  13 coordinating and distributing the events. In case of a cluster failure, all nodes
  14 continue to run independently. Be alarmed when your cluster fails and a split-brain scenario
  15 is in effect: all alive instances continue to do their job, and history will begin to differ.
  16
  17
  18 ## <a id="cluster-requirements"></a> Cluster Requirements
  19
  20 Before you start deploying, keep the following things in mind:
  21
  22 Your [SSL CA and certificates](13-distributed-monitoring-ha.md#manual-certificate-generation) are mandatory for secure communication.
  23
  24 Communication between zones requires one of these connection directions:
  25
  26 * The parent zone nodes are able to connect to the child zone nodes (`parent => child`).
  27 * The child zone nodes are able to connect to the parent zone nodes (`parent <= child`).
  28 * Both connection directions work.
  29
  30 Update firewall rules and ACLs.
  31
  32 * Icinga 2 master, satellite and client instances communicate using the default TCP port `5665`.
  33
  34 Get pen and paper or a drawing board and design your nodes and zones!
  35
  36 * Keep the [naming convention](13-distributed-monitoring-ha.md#cluster-naming-convention) for nodes in mind.
  37 * All nodes (endpoints) in a cluster zone provide high availability functionality and trust each other.
  38 * Cluster zones can be built in a Top-Down-design where the child trusts the parent.
  39
  40 Decide whether to use the built-in [configuration synchronisation](13-distributed-monitoring-ha.md#cluster-zone-config-sync) or use an external tool (Puppet, Ansible, Chef, Salt, etc.) to manage the configuration deployment.
  41
  42
  43 > **Tip**
  44 >
  45 > If you need to troubleshoot cluster problems, check the general
  46 > [troubleshooting](17-troubleshooting.md#troubleshooting-cluster) section.
  47
  48 ## <a id="manual-certificate-generation"></a> Manual SSL Certificate Generation
  49
  50 Icinga 2 provides [CLI commands](8-cli-commands.md#cli-command-pki) assisting with CA
  51 and node certificate creation for your Icinga 2 distributed setup.
  52
  53 > **Tip**
  54 >
  55 > You can also use the master and client setup wizards to install the cluster nodes
  56 > using CSR-Autosigning.
  57 >
  58 > The manual steps are helpful if you want to use your own and/or existing CA (for example
  59 > Puppet CA).
  60
  61 > **Note**
  62 >
  63 > You're free to use your own method to generate a valid CA and signed client
  64 > certificates.
  65
  66 The first step is the creation of the certificate authority (CA) by running the
  67 following command:
  68
  69     # icinga2 pki new-ca
  70
  71 Now create a certificate and key file for each node by running the following commands
  72 (replace `icinga2a` with the required hostname):
  73
  74     # icinga2 pki new-cert --cn icinga2a --key icinga2a.key --csr icinga2a.csr
  75     # icinga2 pki sign-csr --csr icinga2a.csr --cert icinga2a.crt
  76
  77 Repeat the step for all nodes in your cluster scenario.
  78
  79 Save the CA key in a secure location in case you want to set up certificates for
  80 additional nodes at a later time.
  81
  82 Navigate to the location of your newly generated certificate files, and manually
  83 copy/transfer them to `/etc/icinga2/pki` in your Icinga 2 configuration folder.
  84
  85 > **Note**
  86 >
  87 > The certificate files must be readable by the user Icinga 2 is running as. Also,
  88 > the private key file must not be world-readable.
  89
  90 Each node requires the following files in `/etc/icinga2/pki` (replace `fqdn-nodename` with
  91 the host's FQDN):
  92
  93 * ca.crt
  94 * &lt;fqdn-nodename&gt;.crt
  95 * &lt;fqdn-nodename&gt;.key
  96
  97 If you're planning to use your existing CA and certificates please note that you *must not*
  98 use wildcard certificates. The common name (CN) is mandatory for the cluster communication and
  99 therefore must be unique for each connecting instance.
 100
 101 ## <a id="cluster-naming-convention"></a> Cluster Naming Convention
 102
 103 The SSL certificate common name (CN) will be used by the [ApiListener](6-object-types.md#objecttype-apilistener)
 104 object to determine the local authority. This name must match the local [Endpoint](6-object-types.md#objecttype-endpoint)
 105 object name.
 106
 107 Certificate generation for host with the FQDN `icinga2a`:
 108
 109     # icinga2 pki new-cert --cn icinga2a --key icinga2a.key --csr icinga2a.csr
 110     # icinga2 pki sign-csr --csr icinga2a.csr --cert icinga2a.crt
 111
 112 Add a new `Endpoint` object named `icinga2a`:
 113
 114     # vim zones.conf
 115
 116     object Endpoint "icinga2a" {
 117       host = "icinga2a.icinga.org"
 118     }
 119
 120 The [Endpoint](6-object-types.md#objecttype-endpoint) name is further referenced as `endpoints` attribute on the
 121 [Zone](6-object-types.md#objecttype-zone) object.
 122
 123     object Endpoint "icinga2b" {
 124       host = "icinga2b.icinga.org"
 125     }
 126
 127     object Zone "config-ha-master" {
 128       endpoints = [ "icinga2a", "icinga2b" ]
 129     }
 130
 131 Specifying the local node name using the [NodeName](13-distributed-monitoring-ha.md#configure-nodename) variable requires
 132 the same name as used for the endpoint name and common name above. If not set, the FQDN is used.
 133
 134     const NodeName = "icinga2a"
 135
 136 If you're using the host's FQDN everywhere, you're on the safe side. The setup wizards
 137 will do the very same.
 138
 139 ## <a id="cluster-configuration"></a> Cluster Configuration
 140
 141 The following sections describe which configuration must be updated or created
 142 in order to get your cluster running with basic functionality.
 143
 144 * [configure the node name](13-distributed-monitoring-ha.md#configure-nodename)
 145 * [configure the ApiListener object](13-distributed-monitoring-ha.md#configure-apilistener-object)
 146 * [configure cluster endpoints](13-distributed-monitoring-ha.md#configure-cluster-endpoints)
 147 * [configure cluster zones](13-distributed-monitoring-ha.md#configure-cluster-zones)
 148
 149 Once you're finished with the basic setup the following section will
 150 describe how to use [zone configuration synchronisation](13-distributed-monitoring-ha.md#cluster-zone-config-sync)
 151 and configure [cluster scenarios](13-distributed-monitoring-ha.md#cluster-scenarios).
 152
 153 ### <a id="configure-nodename"></a> Configure the Icinga Node Name
 154
 155 Instead of using the default FQDN as node name you can optionally set
 156 that value using the [NodeName](20-language-reference.md#constants) constant.
 157
 158 > **Note**
 159 >
 160 > Skip this step if your FQDN already matches the default `NodeName` set
 161 > in `/etc/icinga2/constants.conf`.
 162
 163 This setting must be unique for each node, and must also match
 164 the name of the local [Endpoint](6-object-types.md#objecttype-endpoint) object and the
 165 SSL certificate common name as described in the
 166 [cluster naming convention](13-distributed-monitoring-ha.md#cluster-naming-convention).
 167
 168     vim /etc/icinga2/constants.conf
 169
 170     /* Our local instance name. By default this is the server's hostname as returned by `hostname --fqdn`.
 171      * This should be the common name from the API certificate.
 172      */
 173     const NodeName = "icinga2a"
 174
 175
 176 Read further about additional [naming conventions](13-distributed-monitoring-ha.md#cluster-naming-convention).
 177
 178 If you do not specify the node name, Icinga 2 uses the FQDN. Make sure that all
 179 configured endpoint names and common names are in sync.
 180
 181 ### <a id="configure-apilistener-object"></a> Configure the ApiListener Object
 182
 183 The [ApiListener](6-object-types.md#objecttype-apilistener) object needs to be configured on
 184 every node in the cluster.
 185
 186 A sample configuration looks like:
 187
 188     object ApiListener "api" {
 189       cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
 190       key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
 191       ca_path = SysconfDir + "/icinga2/pki/ca.crt"
 192       accept_config = true
 193       accept_commands = true
 194     }
 195
 196 You can simply enable the `api` feature using
 197
 198     # icinga2 feature enable api
 199
 200 Edit `/etc/icinga2/features-enabled/api.conf` if you require the configuration
 201 synchronisation enabled for this node. Set the `accept_config` attribute to `true`.
 202
 203 If you want to use this node as [remote client for command execution](11-icinga2-client.md#icinga2-client-configuration-command-bridge)
 204 set the `accept_commands` attribute to `true`.
 205
 206 > **Note**
 207 >
 208 > The certificate files must be readable by the user Icinga 2 is running as. Also,
 209 > the private key file must not be world-readable.
 210
 211 ### <a id="configure-cluster-endpoints"></a> Configure Cluster Endpoints
 212
 213 `Endpoint` objects specify the `host` and `port` settings a node uses to connect
 214 to another cluster node.
 215 Because these objects only contain connection information, this configuration
 216 can be the same on all nodes in the cluster.
 217
 218 A sample configuration looks like:
 219
 220     /**
 221      * Configure config master endpoint
 222      */
 223
 224     object Endpoint "icinga2a" {
 225       host = "icinga2a.icinga.org"
 226     }
 227
 228 If this endpoint is reachable on a different port, you must set the `port` attribute on the
 229 `Endpoint` object and configure the `ApiListener` on that endpoint's node accordingly too.
 230
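As a minimal sketch of the port setup described above, assume the node `icinga2a` listens on the non-default port `5666` (a value chosen purely for illustration). The `Endpoint` object carries the `port` used by connecting nodes, while the `ApiListener` on `icinga2a` itself sets the matching `bind_port`:

```icinga2
/* zones.conf on all nodes: where to reach icinga2a.
 * Port 5666 is an assumed example value, not a recommendation. */
object Endpoint "icinga2a" {
  host = "icinga2a.icinga.org"
  port = 5666
}

/* features-enabled/api.conf on icinga2a only: listen on the same port. */
object ApiListener "api" {
  cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
  key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
  ca_path = SysconfDir + "/icinga2/pki/ca.crt"
  bind_port = 5666
}
```

Remember to adjust firewall rules for the changed port as well.
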
 231 If you don't want the local instance to connect to the remote instance, remove the
 232 `host` attribute locally. Keep in mind that the configuration then differs between
 233 instances and depends on the point of view.
 234
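For illustration, a node that should wait for `icinga2a` to open the connection, rather than connecting itself, could define the endpoint without a `host` attribute. This sketch shows that node's local view only; `icinga2a` still needs its own way to reach this node:

```icinga2
/* Local view on a node that must not connect to icinga2a itself.
 * Without `host`, this instance does not attempt an outgoing connection
 * and relies on icinga2a connecting to it. */
object Endpoint "icinga2a" {
}
```
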
 235 ### <a id="configure-cluster-zones"></a> Configure Cluster Zones
 236
 237 `Zone` objects specify the endpoints located in a zone. That way your distributed setup can be
 238 seen as zones connected together instead of multiple instances in that specific zone.
 239
 240 Zones can be used for [high availability](13-distributed-monitoring-ha.md#cluster-scenarios-high-availability),
 241 [distributed setups](13-distributed-monitoring-ha.md#cluster-scenarios-distributed-zones) and
 242 [load distribution](13-distributed-monitoring-ha.md#cluster-scenarios-load-distribution).
 243 Furthermore zones are used for the [Icinga 2 remote client](11-icinga2-client.md#icinga2-client).
 244
 245 Each Icinga 2 `Endpoint` must be put into its respective `Zone`. In this example, you will
 246 define the zone `config-ha-master` where the `icinga2a` and `icinga2b` endpoints
 247 are located. The `check-satellite` zone consists of `icinga2c` only, but more nodes could
 248 be added.
 249
 250 The `config-ha-master` zone acts as High-Availability setup - the Icinga 2 instances elect
 251 one instance running a check, notification or feature (DB IDO), for example `icinga2a`. In case of
 252 failure of the `icinga2a` instance, `icinga2b` will take over automatically.
 253
 254     object Zone "config-ha-master" {
 255       endpoints = [ "icinga2a", "icinga2b" ]
 256     }
 257
 258 The `check-satellite` zone is a separate location and only sends its check results back to
 259 the defined parent zone `config-ha-master`.
 260
 261     object Zone "check-satellite" {
 262       endpoints = [ "icinga2c" ]
 263       parent = "config-ha-master"
 264     }
 265
 266
 267 ## <a id="cluster-zone-config-sync"></a> Zone Configuration Synchronisation
 268
 269 By default all objects for specific zones should be organized in
 270
 271     /etc/icinga2/zones.d/<zonename>
 272
 273 on the configuration master.
 274
 275 Your child zones and endpoint members **must not** have their config copied to `zones.d`.
 276 The built-in configuration synchronisation takes care of that if your nodes accept
 277 configuration from the parent zone. You can define that in the
 278 [ApiListener](13-distributed-monitoring-ha.md#configure-apilistener-object) object by configuring the `accept_config`
 279 attribute accordingly.
 280
 281 You should remove the sample config included in `conf.d` by commenting out the `include_recursive`
 282 statement in [icinga2.conf](4-configuring-icinga-2.md#icinga2-conf):
 283
 284     //include_recursive "conf.d"
 285
 286 This applies to any other non-used configuration directories as well (e.g. `repository.d`
 287 if not used).
 288
 289 It is better to use a dedicated directory name for local configuration, such as `local`, and
 290 include that one if your nodes require local configuration that is not synced to other nodes. That's
 291 useful for local [health checks](13-distributed-monitoring-ha.md#cluster-health-check) for example.
 292
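A minimal sketch of such a setup in `icinga2.conf`, assuming the dedicated directory is called `local` and sits next to `icinga2.conf`:

```icinga2
/* icinga2.conf: keep the sample configuration disabled and load
 * node-local objects from a dedicated directory instead.
 * The directory name "local" is an assumption; any name other than
 * zones.d works the same way. */
//include_recursive "conf.d"
include_recursive "local"
```

Objects placed in that directory stay on the node and are not part of the zone configuration synchronisation.
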
 293 > **Note**
 294 >
 295 > In a [high availability](13-distributed-monitoring-ha.md#cluster-scenarios-high-availability)
 296 > setup only one assigned node can act as configuration master. All other zone
 297 > member nodes **must not** have the `/etc/icinga2/zones.d` directory populated.
 298
 299
 300 These zone packages are then distributed to all nodes in the same zone, and
 301 to their respective target zone instances.
 302
 303 Each configured zone must exist with the same directory name. The parent zone
 304 syncs the configuration to the child zones, if allowed using the `accept_config`
 305 attribute of the [ApiListener](13-distributed-monitoring-ha.md#configure-apilistener-object) object.
 306
 307 Config on node `icinga2a`:
 308
 309     object Zone "master" {
 310       endpoints = [ "icinga2a" ]
 311     }
 312
 313     object Zone "checker" {
 314       endpoints = [ "icinga2b" ]
 315       parent = "master"
 316     }
 317
 318     /etc/icinga2/zones.d
 319       master
 320         health.conf
 321       checker
 322         health.conf
 323         demo.conf
 324
 325 Config on node `icinga2b`:
 326
 327     object Zone "master" {
 328       endpoints = [ "icinga2a" ]
 329     }
 330
 331     object Zone "checker" {
 332       endpoints = [ "icinga2b" ]
 333       parent = "master"
 334     }
 335
 336     /etc/icinga2/zones.d
 337       EMPTY_IF_CONFIG_SYNC_ENABLED
 338
 339 If the local configuration is newer than the received update Icinga 2 will skip the synchronisation
 340 process.
 341
 342 > **Note**
 343 >
 344 > `zones.d` must not be included in [icinga2.conf](4-configuring-icinga-2.md#icinga2-conf). Icinga 2 automatically
 345 > determines the required include directory. This can be overridden using the
 346 > [global constant](20-language-reference.md#constants) `ZonesDir`.
 347
 348 ### <a id="zone-global-config-templates"></a> Global Configuration Zone for Templates
 349
 350 If your zone configuration setup shares the same templates, groups, commands, timeperiods, etc.,
 351 you would have to duplicate quite a lot of configuration objects and still keep the merged
 352 configuration on your configuration master unique.
 353
 354 > **Note**
 355 >
 356 > Only put templates, groups, etc. into this zone. DO NOT add checkable objects such as
 357 > hosts or services here. If they are checked by all instances globally, this will lead
 358 > to duplicated check results and an unclear state history, which is hard to troubleshoot.
 359 > You have been warned.
 360
 361 Defining a global zone that ships all those templates avoids this duplication. By setting
 362 `global = true` you ensure that this zone serving common configuration templates will be
 363 synchronized to all involved nodes (only if they accept configuration though).
 364
 365 Config on configuration master:
 366
 367     /etc/icinga2/zones.d
 368       global-templates/
 369         templates.conf
 370         groups.conf
 371       master
 372         health.conf
 373       checker
 374         health.conf
 375         demo.conf
 376
 377 In this example, the global zone is called `global-templates` and must be defined in
 378 your zone configuration visible to all nodes.
 379
 380     object Zone "global-templates" {
 381       global = true
 382     }
 383
 384 If a remote node accepts synchronized configuration but does not have this zone configured,
 385 it will ignore the configuration update.
 386
 387 If you do not require any global configuration, skip this setting.
 388
 389 ### <a id="zone-config-sync-permissions"></a> Zone Configuration Synchronisation Permissions
 390
 391 Each [ApiListener](6-object-types.md#objecttype-apilistener) object must have the `accept_config` attribute
 392 set to `true` to receive configuration from the parent `Zone` members. Default value is `false`.
 393
 394     object ApiListener "api" {
 395       cert_path = SysconfDir + "/icinga2/pki/" + NodeName + ".crt"
 396       key_path = SysconfDir + "/icinga2/pki/" + NodeName + ".key"
 397       ca_path = SysconfDir + "/icinga2/pki/ca.crt"
 398       accept_config = true
 399     }
 400
 401 If `accept_config` is set to `false`, this instance won't accept configuration from remote
 402 master instances anymore.
 403
 404 > **Tip**
 405 >
 406 > Look into the [troubleshooting guides](17-troubleshooting.md#troubleshooting-cluster-config-sync) for debugging
 407 > problems with the configuration synchronisation.
 408
 409
 410 ### <a id="zone-config-sync-best-practice"></a> Zone Configuration Synchronisation Best Practice
 411
 412 The configuration synchronisation works with multiple hierarchies. The following example
 413 illustrates a quite common setup where the master is responsible for configuration deployment:
 414
 415 * [High-Availability master zone](13-distributed-monitoring-ha.md#distributed-monitoring-high-availability)
 416 * [Distributed satellites](13-distributed-monitoring-ha.md#cluster-scenarios-distributed-zones)
 417 * [Remote clients](11-icinga2-client.md#icinga2-client-scenarios) connected to the satellite
 418
 419 While you could use the clients with local configuration and service discovery on the satellite/master
 420 (**bottom up**), syncing the configuration **top down** can be the more reasonable choice in a cascaded scenario.
 421
 422 Take pen and paper and draw your network scenario including the involved zone and endpoint names.
 423 Once you've added them to your zones.conf as connection and permission configuration, start over with
 424 the actual configuration organization:
 425
 426 * Ensure that `command` object definitions are globally available. That way you can use the
 427 `command_endpoint` configuration more easily on clients as [command execution bridge](11-icinga2-client.md#icinga2-client-configuration-command-bridge).
 428 * Generic `Templates`, `timeperiods`, `downtimes` should be synchronized in a global zone as well.
 429 * [Apply rules](3-monitoring-basics.md#using-apply) can be synchronized globally. Keep in mind that they are evaluated on each instance,
 430 and might require additional filters (e.g. `match("icinga2*", NodeName)` or similar based on the zone information).
 431 * [Apply rules](3-monitoring-basics.md#using-apply) specified inside zone directories will only affect endpoints in the same zone or below.
 432 * Host configuration must be put into the specific zone directory.
 433 * Duplicated host and service objects (also generated by faulty apply rules) will generate a configuration error.
 434 * Consider using custom constants in your host/service configuration. Each instance may set their local value, e.g. for `PluginDir`.
 435
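Two of the points above can be sketched as follows. The service name, the chosen check command and the plugin path are assumptions for illustration only; the `NodeName` filter is the one mentioned in the list:

```icinga2
/* Synced through the global zone (hypothetical file content).
 * The rule is evaluated on every instance that receives it, so it is
 * restricted to instances whose NodeName starts with "icinga2". */
apply Service "load" {
  check_command = "load"

  assign where host.address && match("icinga2*", NodeName)
}

/* constants.conf on each instance: a local value for a constant that
 * synced command definitions rely on. The path is an example value
 * and may differ per distribution. */
const PluginDir = "/usr/lib/nagios/plugins"
```

Because the constant is set locally, the same synced command definitions can resolve to different plugin paths on each instance.
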
 436 This example specifies the following hierarchy over three levels:
 437
 438 * `ha-master` zone with two child zones `dmz1-checker` and `dmz2-checker`
 439 * `dmz1-checker` has two client child zones `dmz1-client1` and `dmz1-client2`
 440 * `dmz2-checker` has one client child zone `dmz2-client9`
 441
 442 The configuration tree could look like this:
 443
 444     # tree /etc/icinga2/zones.d
 445     /etc/icinga2/zones.d
 446     ├── dmz1-checker
 447     │   └── health.conf
 448     ├── dmz1-client1
 449     │   └── hosts.conf
 450     ├── dmz1-client2
 451     │   └── hosts.conf
 452     ├── dmz2-checker
 453     │   └── health.conf
 454     ├── dmz2-client9
 455     │   └── hosts.conf
 456     ├── global-templates
 457     │   ├── apply_notifications.conf
 458     │   ├── apply_services.conf
 459     │   ├── commands.conf
 460     │   ├── groups.conf
 461     │   ├── templates.conf
 462     │   └── users.conf
 463     ├── ha-master
 464     │   └── health.conf
 465     └── README
 466
 467     7 directories, 13 files
 468
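The zone objects behind this three-level hierarchy could be sketched like this. All endpoint names are hypothetical, and the matching `Endpoint` objects are omitted for brevity:

```icinga2
/* Level 1: high-availability master zone */
object Zone "ha-master" {
  endpoints = [ "master1", "master2" ]
}

/* Level 2: satellite zones reporting to the master zone */
object Zone "dmz1-checker" {
  endpoints = [ "dmz1-checker1" ]
  parent = "ha-master"
}

object Zone "dmz2-checker" {
  endpoints = [ "dmz2-checker1" ]
  parent = "ha-master"
}

/* Level 3: client zones below their satellite zone */
object Zone "dmz1-client1" {
  endpoints = [ "dmz1-client1" ]
  parent = "dmz1-checker"
}

object Zone "dmz1-client2" {
  endpoints = [ "dmz1-client2" ]
  parent = "dmz1-checker"
}

object Zone "dmz2-client9" {
  endpoints = [ "dmz2-client9" ]
  parent = "dmz2-checker"
}

/* Shared templates, commands and apply rules */
object Zone "global-templates" {
  global = true
}
```

Each zone name matches a directory name in the tree, which the configuration synchronisation requires.
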
 469 If you prefer a different naming scheme for directory or file names, go for it. If you
 470 are unsure about the best method, join the [support channels](1-about.md#support) and discuss
 471 with the community.
 472
 473 If you are planning to synchronize local service health checks inside a zone, look into the
 474 [command endpoint](13-distributed-monitoring-ha.md#cluster-health-check-command-endpoint)
 475 explanations.
 476
 477
 478
 479 ## <a id="cluster-health-check"></a> Cluster Health Check
 480
 481 The Icinga 2 [ITL](7-icinga-template-library.md#icinga-template-library) provides
 482 an internal check command checking all configured `Endpoint` objects in the cluster setup.
 483 The check result will become critical if one or more configured nodes are not connected.
 484
 485 Example:
 486
 487     object Host "icinga2a" {
 488       display_name = "Health Checks on icinga2a"
 489
 490       address = "192.168.33.10"
 491       check_command = "hostalive"
 492     }
 493
 494     object Service "cluster" {
 495       check_command = "cluster"
 496       check_interval = 5s
 497       retry_interval = 1s
 498
 499       host_name = "icinga2a"
 500     }
 501
 502 Each cluster node should execute its own local cluster health check to
 503 get an idea about network related connection problems from different
 504 points of view.
 505
 506 Additionally you can monitor the connection from the local zone to the remote
 507 connected zones.
 508
 509 Example for the `checker` zone checking the connection to the `master` zone:
 510
 511     object Service "cluster-zone-master" {
 512       check_command = "cluster-zone"
 513       check_interval = 5s
 514       retry_interval = 1s
 515       vars.cluster_zone = "master"
 516
 517       host_name = "icinga2b"
 518     }
 519
 520 ## <a id="cluster-health-check-command-endpoint"></a> Cluster Health Check with Command Endpoints
 521
 522 If you are planning to sync the zone configuration inside a [High-Availability](13-distributed-monitoring-ha.md#cluster-scenarios-high-availability)
 523 cluster zone, you can also use the `command_endpoint` object attribute to
 524 pin host/service checks to a specific endpoint inside the same zone.
 525
 526 This requires the `accept_commands` setting inside the [ApiListener](13-distributed-monitoring-ha.md#configure-apilistener-object)
 527 object set to `true` similar to the [remote client command execution bridge](11-icinga2-client.md#icinga2-client-configuration-command-bridge)
 528 setup.
 529
 530 Make sure to set `command_endpoint` to the correct endpoint instance.
 531 The example below assumes that the endpoint name is the same as the
 532 host name configured for health checks. If it differs, define a host
 533 custom attribute providing [this information](11-icinga2-client.md#icinga2-client-configuration-command-bridge-master-config).
 534
 535     apply Service "cluster-ha" {
 536       check_command = "cluster"
 537       check_interval = 5s
 538       retry_interval = 1s
 539       /* make sure host.name is the same as endpoint name */
 540       command_endpoint = host.name
 541
 542       assign where regex("^icinga2[ab]", host.name)
 543     }
 544
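If the host name differs from the endpoint name, a host custom attribute can carry the endpoint name instead. The following is a sketch only: the host name `icinga2a-health` and the attribute name `vars.health_endpoint` are hypothetical and not taken from the client documentation:

```icinga2
/* Hypothetical host whose name differs from the endpoint name. */
object Host "icinga2a-health" {
  address = "192.168.33.10"
  check_command = "hostalive"

  /* Hypothetical custom attribute naming the endpoint that runs the check. */
  vars.health_endpoint = "icinga2a"
}

apply Service "cluster-ha" {
  check_command = "cluster"
  check_interval = 5s
  retry_interval = 1s

  /* Pin the check to the endpoint named in the custom attribute. */
  command_endpoint = host.vars.health_endpoint

  assign where host.vars.health_endpoint
}
```

Assigning on the custom attribute keeps the rule from matching hosts that do not name an endpoint.
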
 545
 546 ## <a id="cluster-scenarios"></a> Cluster Scenarios
 547
 548 All cluster nodes are full-featured Icinga 2 instances. You only need to enable
 549 the features for their role (for example, a `Checker` node only requires the `checker`
 550 feature enabled, but not `notification` or `ido-mysql` features).
 551
 552 > **Tip**
 553 >
 554 > There's a [Vagrant demo setup](https://github.com/Icinga/icinga-vagrant/tree/master/icinga2x-cluster)
 555 > available featuring a two node cluster showcasing several aspects (config sync,
 556 > remote command execution, etc).
 557
 558 ### <a id="cluster-scenarios-master-satellite-clients"></a> Cluster with Master, Satellites and Remote Clients
 559
 560 You can combine "classic" cluster scenarios from HA to Master-Checker with the
 561 Icinga 2 Remote Client modes. Each instance plays a certain role in that picture.
 562
 563 Imagine the following scenario:
 564
 565 * The master zone acts as High-Availability zone
 566 * Remote satellite zones execute local checks and report them to the master
 567 * All satellites query remote clients and receive check results (which they also relay to the master)
 568 * All involved nodes share the same configuration logic: zones, endpoints, apilisteners
 569
 570 You'll need to think about the following:
 571
 572 * Deploy the entire configuration from the master to satellites and cascading remote clients? ("top down")
 573 * Use local client configuration instead and report the inventory to satellites and cascading to the master? ("bottom up")
 574 * Combine that with command execution bridges on remote clients and also satellites
 575
 576
 577 ### <a id="cluster-scenarios-security"></a> Security in Cluster Scenarios
 578
 579 While there are certain capabilities to ensure the safe communication between all
 580 nodes (firewalls, policies, software hardening, etc) the Icinga 2 cluster also provides
 581 additional security itself:
 582
 583 * [SSL certificates](13-distributed-monitoring-ha.md#manual-certificate-generation) are mandatory for cluster communication.
 584 * Child zones only receive event updates (check results, commands, etc.) for their configured objects.
 585 * Zones cannot influence or interfere with other zones. Each checked object is assigned to only one zone.
 586 * All nodes in a zone trust each other.
 587 * [Configuration sync](13-distributed-monitoring-ha.md#zone-config-sync-permissions) is disabled by default.
 588
 589 ### <a id="cluster-scenarios-features"></a> Features in Cluster Zones
 590
 591 Each cluster zone may use all available features. If you have multiple locations
 592 or departments, they may write to their local database, or populate graphite.
 593 Even further all commands are distributed amongst connected nodes. For example, you could
 594 re-schedule a check or acknowledge a problem on the master, and it gets replicated to the
 595 actual slave checker node.
 596
 597 > **Note**
 598 >
 599 > All features must be the same on all endpoints inside an [HA zone](13-distributed-monitoring-ha.md#cluster-scenarios-high-availability).
 600 > There are additional [High-Availability-enabled features](13-distributed-monitoring-ha.md#high-availability-features) available.
 601
 602 ### <a id="cluster-scenarios-distributed-zones"></a> Distributed Zones
 603
 604 That scenario fits if your instances are spread over the globe and they all report
 605 to a master instance. Their network connection only works towards the master
 606 (or the master is able to connect, depending on firewall policies) which means
 607 remote instances won't see or connect to each other.
 608
 609 All events (check results, downtimes, comments, etc) are synced to the master node,
 610 but the remote nodes can still run local features such as a web interface, reporting,
 611 graphing, etc. in their own specified zone.
 612
 613 Imagine the following example with a master node in Nuremberg, and two remote DMZ
 614 based instances in Berlin and Vienna. Additionally you'll specify
 615 [global templates](13-distributed-monitoring-ha.md#zone-global-config-templates) available in all zones.
 616
 617 The configuration tree on the master instance `nuremberg` could look like this:
 618
 619     zones.d
 620       global-templates/
 621         templates.conf
 622         groups.conf
 623       nuremberg/
 624         local.conf
 625       berlin/
 626         hosts.conf
 627       vienna/
 628         hosts.conf
 629
 630 The configuration deployment will take care of automatically synchronising
 631 the child zone configuration:
 632
 633 * The master node sends `zones.d/berlin` to the `berlin` child zone.
 634 * The master node sends `zones.d/vienna` to the `vienna` child zone.
 635 * The master node sends `zones.d/global-templates` to the `vienna` and `berlin` child zones.
 636
 637 The endpoint configuration would look like:
 638
 639     object Endpoint "nuremberg-master" {
 640       host = "nuremberg.icinga.org"
 641     }
 642
 643     object Endpoint "berlin-satellite" {
 644       host = "berlin.icinga.org"
 645     }
 646
 647     object Endpoint "vienna-satellite" {
 648       host = "vienna.icinga.org"
 649     }
 650
 651 The zones would look like:
 652
 653     object Zone "nuremberg" {
 654       endpoints = [ "nuremberg-master" ]
 655     }
 656
 657     object Zone "berlin" {
 658       endpoints = [ "berlin-satellite" ]
 659       parent = "nuremberg"
 660     }
 661
 662     object Zone "vienna" {
 663       endpoints = [ "vienna-satellite" ]
 664       parent = "nuremberg"
 665     }
 666
 667     object Zone "global-templates" {
 668       global = true
 669     }
 670
 671 The `nuremberg-master` endpoint in the `nuremberg` zone will only execute local checks, and receive
 672 check results from the satellite nodes in the zones `berlin` and `vienna`.
 673
 674 > **Note**
 675 >
 676 > The child zones `berlin` and `vienna` will get their configuration synchronised
 677 > from the configuration master `nuremberg`. The endpoints in the child
 678 > zones **must not** have their `zones.d` directory populated if this endpoint
 679 > [accepts synced configuration](13-distributed-monitoring-ha.md#zone-config-sync-permissions).
 680
 681 ### <a id="cluster-scenarios-load-distribution"></a> Load Distribution
 682
 683 If you are planning to off-load the checks to a defined set of remote workers
 684 you can achieve that by:
 685
 686 * Deploying the configuration on all nodes.
 687 * Letting Icinga 2 distribute the load amongst all available nodes.
 688
 689 That way all remote check instances will receive the same configuration
 690 but only execute their part. The master instance located in the `master` zone
 691 can also execute checks, but you may also disable the `Checker` feature.
 692
 693 Configuration on the master node:
 694
 695     zones.d/
 696       global-templates/
 697       master/
 698       checker/
 699
 700 If you are planning to have some checks executed by a specific set of checker nodes
 701 you have to define additional zones and define these check objects there.
 702
 703 Endpoints:
 704
 705     object Endpoint "master-node" {
 706       host = "master.icinga.org"
 707     }
 708
 709     object Endpoint "checker1-node" {
 710       host = "checker1.icinga.org"
 711     }
 712
 713     object Endpoint "checker2-node" {
 714       host = "checker2.icinga.org"
 715     }
 716
 717
 718 Zones:
 719
 720     object Zone "master" {
 721       endpoints = [ "master-node" ]
 722     }
 723
 724     object Zone "checker" {
 725       endpoints = [ "checker1-node", "checker2-node" ]
 726       parent = "master"
 727     }
 728
 729     object Zone "global-templates" {
 730       global = true
 731     }
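
If some checks should only run on a specific set of checker nodes, the additional zone mentioned above could look like the following sketch. The endpoint `checker3-node`, its host name and the zone `checker-dmz` are made-up names for illustration only.

```
// Hypothetical additional endpoint and zone (all names are assumptions)
object Endpoint "checker3-node" {
  host = "checker3.icinga.org"
}

object Zone "checker-dmz" {
  endpoints = [ "checker3-node" ]
  parent = "master"
}
```

The check objects for that zone would then live in a matching `zones.d/checker-dmz/` directory on the configuration master.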
 732
 733 > **Note**
 734 >
 735 > The child zone `checker` will get its configuration synchronised
 736 > from the configuration master zone `master`. The endpoints in the child
 737 > zone **must not** have their `zones.d` directory populated if they
 738 > [accept synced configuration](13-distributed-monitoring-ha.md#zone-config-sync-permissions).
 739
 740 ### <a id="cluster-scenarios-high-availability"></a> Cluster High Availability
 741
 742 High availability with Icinga 2 is possible by putting multiple nodes into
 743 a dedicated [zone](13-distributed-monitoring-ha.md#configure-cluster-zones). All nodes will elect one
 744 active master, and retry an election once the current active master is down.
 745
 746 Selected features provide advanced [HA functionality](13-distributed-monitoring-ha.md#high-availability-features).
 747 Checks and notifications are load-balanced between nodes in the high availability
 748 zone.
 749
 750 Connections from other zones will be accepted by all active and passive nodes
 751 but all are forwarded to the current active master dealing with the check results,
 752 commands, etc.
 753
 754     object Zone "config-ha-master" {
 755       endpoints = [ "icinga2a", "icinga2b", "icinga2c" ]
 756     }
 757
 758 Two or more nodes in a high availability setup require an [initial cluster sync](13-distributed-monitoring-ha.md#initial-cluster-sync).
 759
 760 > **Note**
 761 >
 762 > Keep in mind that **only one node acts as configuration master** having the
 763 > configuration files in the `zones.d` directory. All other nodes **must not**
 764 > have that directory populated. Instead they are required to
 765 > [accept synced configuration](13-distributed-monitoring-ha.md#zone-config-sync-permissions).
 766 > Details in the [Configuration Sync Chapter](13-distributed-monitoring-ha.md#cluster-zone-config-sync).
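
On the nodes that are not the configuration master, accepting synced configuration is controlled by the `accept_config` attribute of the `ApiListener` object. A minimal sketch, assuming the object is defined in the `api` feature configuration file and that its existing certificate attributes are left as they are:

```
object ApiListener "api" {
  // keep the existing certificate and listener attributes of this object unchanged
  accept_config = true
}
```

The configuration master itself keeps its populated `zones.d` directory and does not need this setting for its own zone.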
 767
 768 ### <a id="cluster-scenarios-multiple-hierarchies"></a> Multiple Hierarchies
 769
 770 Your master zone collects all check results for reporting and graphing and also
 771 sends additional notifications.
 772 The customers have their own instances in their local DMZ zones. They are limited to reading and writing
 773 only their own services, but replicate all events back to the master instance.
 774 Within each DMZ there are additional check instances also serving interfaces for local
 775 departments. The customers' instances will collect all results, but also send them back to
 776 your master instance.
 777 Additionally the customer's instance on the second level in the middle prohibits you from
 778 sending commands to the subjacent department nodes. You're only allowed to receive the
 779 results, and a subset of each customer's configuration too.
 780
 781 Your master zone will generate global reports, aggregate alert notifications, and check
 782 additional dependencies (for example, the customer's internet uplink and bandwidth usage).
 783
 784 The customers' zone instances will only check a subset of local services and delegate the rest
 785 to each department. Each of them also acts as configuration master with a master dashboard
 786 for all departments, managing their configuration tree which is then deployed to all
 787 department instances. Furthermore the master NOC is able to see what's going on.
 788
 789 The instances in the departments will serve a local interface, and allow the administrators
 790 to reschedule checks or acknowledge problems for their services.
 791
 792
 793 ## <a id="high-availability-features"></a> High Availability for Icinga 2 features
 794
 795 All nodes in the same zone require the same features enabled for High Availability (HA)
 796 amongst them.
 797
 798 By default the following features provide advanced HA functionality:
 799
 800 * [Checks](13-distributed-monitoring-ha.md#high-availability-checks) (load balanced, automated failover)
 801 * [Notifications](13-distributed-monitoring-ha.md#high-availability-notifications) (load balanced, automated failover)
 802 * [DB IDO](13-distributed-monitoring-ha.md#high-availability-db-ido) (Run-Once, automated failover)
 803
 804 ### <a id="high-availability-checks"></a> High Availability with Checks
 805
 806 All instances within the same zone (e.g. the `master` zone as HA cluster) must
 807 have the `checker` feature enabled.
 808
 809 Example:
 810
 811     # icinga2 feature enable checker
 812
 813 All nodes in the same zone load-balance the check execution. When one instance shuts down
 814 the other nodes will automatically take over the remaining checks.
 815
 816 ### <a id="high-availability-notifications"></a> High Availability with Notifications
 817
 818 All instances within the same zone (e.g. the `master` zone as HA cluster) must
 819 have the `notification` feature enabled.
 820
 821 Example:
 822
 823     # icinga2 feature enable notification
 824
 825 Notifications are load balanced amongst all nodes in a zone. By default this functionality
 826 is enabled.
 827 If your nodes should notify independently of any other nodes (this will cause
 828 duplicated notifications if not properly handled!), you can set `enable_ha = false`
 829 in the [NotificationComponent](6-object-types.md#objecttype-notificationcomponent) feature.
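
Disabling HA for notifications could look like the following sketch. It assumes the default object name `notification` as shipped in the `notification` feature configuration file; adjust it if your setup uses a different name.

```
object NotificationComponent "notification" {
  // every node in the zone sends its own notifications,
  // so duplicates have to be handled elsewhere
  enable_ha = false
}
```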
 830
 831 ### <a id="high-availability-db-ido"></a> High Availability with DB IDO
 832
 833 All instances within the same zone (e.g. the `master` zone as HA cluster) must
 834 have the DB IDO feature enabled.
 835
 836 Example DB IDO MySQL:
 837
 838     # icinga2 feature enable ido-mysql
 839
 840 By default the DB IDO feature only runs on one node. All other nodes in the same zone disable
 841 the active IDO database connection at runtime. The node with the active DB IDO connection is
 842 not necessarily the zone master.
 843
 844 > **Note**
 845 >
 846 > The DB IDO HA feature can be disabled by setting the `enable_ha` attribute to `false`
 847 > for the [IdoMysqlConnection](6-object-types.md#objecttype-idomysqlconnection) or
 848 > [IdoPgsqlConnection](6-object-types.md#objecttype-idopgsqlconnection) object on **all** nodes in the
 849 > **same** zone.
 850 >
 851 > With HA disabled, all endpoints will enable the DB IDO feature and connect to the configured
 852 > database and dump configuration, status and historical data on their own.
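
As a rough illustration of the note above, the HA functionality for DB IDO MySQL might be disabled as shown below. The object name `ido-mysql` is assumed to be the one from the feature configuration file, and the existing connection attributes are meant to stay as configured.

```
object IdoMysqlConnection "ido-mysql" {
  // existing connection attributes (database host, user, password, ...) stay unchanged
  enable_ha = false
}
```

The same attribute would have to be set on every node in the zone, as the note points out.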
 853
 854 If the instance with the active DB IDO connection dies, the HA functionality will
 855 automatically elect a new DB IDO master.
 856
 857 The DB IDO feature will try to determine which cluster endpoint is currently writing
 858 to the database and bail out if another endpoint is active. You can manually verify that
 859 by running the following query:
 860
 861     icinga=> SELECT status_update_time, endpoint_name FROM icinga_programstatus;
 862        status_update_time   | endpoint_name
 863     ------------------------+---------------
 864      2014-08-15 15:52:26+02 | icinga2a
 865     (1 row)
 866
 867 This is useful when the cluster connection between endpoints breaks, and prevents
 868 data duplication in split-brain-scenarios. The failover timeout can be set with the
 869 `failover_timeout` attribute, but not lower than 60 seconds.
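
A sketch of a raised failover timeout, again assuming the object name `ido-mysql` from the feature configuration file. The value `120s` is an arbitrary example, chosen only because it is above the 60 second minimum.

```
object IdoMysqlConnection "ido-mysql" {
  // existing connection attributes stay unchanged
  failover_timeout = 120s   // example value, must not be lower than 60s
}
```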
 870
 871
 872 ## <a id="cluster-add-node"></a> Add a new cluster endpoint
 873
 874 These steps are required for integrating a new cluster endpoint:
 875
 876 * generate a new [SSL client certificate](13-distributed-monitoring-ha.md#manual-certificate-generation)
 877 * identify its location in the zones
 878 * update the `zones.conf` file on each involved node ([endpoint](13-distributed-monitoring-ha.md#configure-cluster-endpoints), [zones](13-distributed-monitoring-ha.md#configure-cluster-zones))
 879     * a new slave zone node requires updates for the master and slave zones
 880     * verify if this endpoint requires [configuration synchronisation](13-distributed-monitoring-ha.md#cluster-zone-config-sync) enabled
 881 * if the node requires the existing zone history: [initial cluster sync](13-distributed-monitoring-ha.md#initial-cluster-sync)
 882 * add a [cluster health check](13-distributed-monitoring-ha.md#cluster-health-check)
 883
 884 ### <a id="initial-cluster-sync"></a> Initial Cluster Sync
 885
 886 In order to make sure that all of your cluster nodes have the same state you will
 887 have to pick one of the nodes as your initial "master" and copy its state file
 888 to all the other nodes.
 889
 890 You can find the state file in `/var/lib/icinga2/icinga2.state`. Before copying
 891 the state file you should make sure that all your cluster nodes are properly shut
 892 down.
 893
 894
 895 ## <a id="host-multiple-cluster-nodes"></a> Host With Multiple Cluster Nodes
 896
 897 Special scenarios might require multiple cluster nodes running on a single host.
 898 Icinga 2 and its features will place their runtime data below the prefix
 899 `LocalStateDir`. By default packages will set that path to `/var`.
 900 You can either set that variable as constant configuration
 901 definition in [icinga2.conf](4-configuring-icinga-2.md#icinga2-conf) or pass it as runtime variable to
 902 the Icinga 2 daemon.
 903
 904     # icinga2 -c /etc/icinga2/node1/icinga2.conf -DLocalStateDir=/opt/node1/var