From 07790e456b6993a272a7e7a91b33b05823492959 Mon Sep 17 00:00:00 2001
From: Michael Friedrich
Date: Wed, 8 May 2019 17:48:13 +0200
Subject: [PATCH] Docs: Improve features chapter and add details on HA setups

refs #4855
---
 doc/13-addons.md   |   2 +-
 doc/14-features.md | 333 ++++++++++++++++++++++++++++++++-------
 2 files changed, 239 insertions(+), 96 deletions(-)

diff --git a/doc/13-addons.md b/doc/13-addons.md
index 760fe4ccd..01bdfee3f 100644
--- a/doc/13-addons.md
+++ b/doc/13-addons.md
@@ -60,7 +60,7 @@ Use your distribution's package manager to install the `pnp4nagios` package.

If you're planning to use it, configure it to use the
[bulk mode with npcd and npcdmod](https://docs.pnp4nagios.org/pnp-0.6/modes#bulk_mode_with_npcd_and_npcdmod)
-in combination with Icinga 2's [PerfdataWriter](14-features.md#performance-data). NPCD collects the performance
+in combination with Icinga 2's [PerfdataWriter](14-features.md#writing-performance-data-files). NPCD collects the performance
data files which Icinga 2 generates.

Enable performance data writer in icinga 2

diff --git a/doc/14-features.md b/doc/14-features.md
index 3bdbd6743..dff90cd9b 100644
--- a/doc/14-features.md
+++ b/doc/14-features.md
@@ -38,7 +38,13 @@ files then:

By default, log files will be rotated daily.

-## DB IDO
+## Core Backends
+
+### REST API
+
+The REST API is a core feature and is documented in [its own chapter](12-icinga2-api.md#icinga2-api).
+
+### IDO Database (DB IDO)

The IDO (Icinga Data Output) feature for Icinga 2 takes care of exporting all
configuration and status information into a database. The IDO database is used
@@ -49,10 +55,8 @@ chapter. Details on the configuration can be found in the
[IdoMysqlConnection](09-object-types.md#objecttype-idomysqlconnection) and
[IdoPgsqlConnection](09-object-types.md#objecttype-idopgsqlconnection)
object configuration documentation.
-The DB IDO feature supports [High Availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido) in
-the Icinga 2 cluster.

-### DB IDO Health
+#### DB IDO Health

If the monitoring health indicator is critical in Icinga Web 2,
you can use the following queries to manually check whether Icinga 2
@@ -100,7 +104,21 @@ status_update_time

A detailed list on the available table attributes can be found in the [DB IDO Schema documentation](24-appendix.md#schema-db-ido).

-### DB IDO Cleanup
+#### DB IDO in Cluster HA Zones
+
+The DB IDO feature supports [High Availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-db-ido) in
+the Icinga 2 cluster.
+
+By default, both endpoints in a zone calculate which
+endpoint activates the feature; the other endpoint
+automatically pauses it. If the cluster connection
+breaks at some point, the paused IDO feature automatically
+takes over (failover).
+
+You can disable this behaviour by setting `enable_ha = false`
+in both feature configuration files.
+
+#### DB IDO Cleanup

Objects get deactivated when they are deleted from the configuration.
This is visible with the `is_active` column in the `icinga_objects` table.
@@ -125,7 +143,7 @@ Example if you prefer to keep notification history for 30 days:

The historical tables are populated depending on the data `categories` specified. Some tables are empty by default.

-### DB IDO Tuning
+#### DB IDO Tuning

As with any application database, there are ways to optimize and tune the database performance.

@@ -171,98 +189,30 @@ VACUUM

> Don't use `VACUUM FULL` as this has a severe impact on performance.
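+
+To make this concrete, a minimal SQL sketch (the statements are standard
+PostgreSQL; when to run them is an assumption left to your maintenance window):
+
+```
+-- Plain VACUUM is non-blocking and safe while Icinga 2 keeps writing.
+VACUUM;
+
+-- VACUUM FULL takes exclusive table locks and rewrites tables on disk,
+-- which stalls the IDO feature. Avoid it on a live database:
+-- VACUUM FULL;
+```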
-## External Commands
-
-> **Note**
->
-> Please use the [REST API](12-icinga2-api.md#icinga2-api) as modern and secure alternative
-> for external actions.
-
-> **Note**
->
-> This feature is DEPRECATED and will be removed in future releases.
-> Check the [roadmap](https://github.com/Icinga/icinga2/milestones).
-
-Icinga 2 provides an external command pipe for processing commands
-triggering specific actions (for example rescheduling a service check
-through the web interface).
-
-In order to enable the `ExternalCommandListener` configuration use the
-following command and restart Icinga 2 afterwards:
+## Metrics

-```
-# icinga2 feature enable command
-```
+Whenever a host or service check is executed, or a check result is received
+via the REST API, best practice is to provide performance data.

-Icinga 2 creates the command pipe file as `/var/run/icinga2/cmd/icinga2.cmd`
-using the default configuration.
+This data is parsed by features which send metrics to time series databases (TSDBs):

-Web interfaces and other Icinga addons are able to send commands to
-Icinga 2 through the external command pipe, for example for rescheduling
-a forced service check:
+* [Graphite](14-features.md#graphite-writer)
+* [InfluxDB](14-features.md#influxdb-writer)
+* [OpenTSDB](14-features.md#opentsdb-writer)

-```
-# /bin/echo "[`date +%s`] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;`date +%s`" >> /var/run/icinga2/cmd/icinga2.cmd
+Metrics, state changes and notifications can be forwarded to the following integrations:

-# tail -f /var/log/messages
+* [Elastic Stack](14-features.md#elastic-stack-integration)
+* [Graylog](14-features.md#graylog-integration)

-Oct 17 15:01:25 icinga-server icinga2: Executing external command: [1382014885] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382014885
-Oct 17 15:01:25 icinga-server icinga2: Rescheduling next check for service 'ping4'
-```
-
-A list of currently supported external commands can be found [here](24-appendix.md#external-commands-list-detail).
-Detailed information on the commands and their required parameters can be found
-on the [Icinga 1.x documentation](https://docs.icinga.com/latest/en/extcommands2.html).
+### Graphite Writer

-## Performance Data
+[Graphite](13-addons.md#addons-graphing-graphite) is a tool stack for storing
+metrics and needs to be running prior to enabling the `graphite` feature.

-When a host or service check is executed plugins should provide so-called
-`performance data`. Next to that additional check performance data
-can be fetched using Icinga 2 runtime macros such as the check latency
-or the current service state (or additional custom attributes).
-
-The performance data can be passed to external applications which aggregate and
-store them in their backends. These tools usually generate graphs for historical
-reporting and trending.
-
-Well-known addons processing Icinga performance data are [PNP4Nagios](13-addons.md#addons-graphing-pnp),
-[Graphite](13-addons.md#addons-graphing-graphite) or [OpenTSDB](14-features.md#opentsdb-writer).
-
-### Writing Performance Data Files
-
-PNP4Nagios and Graphios use performance data collector daemons to fetch
-the current performance files for their backend updates.
-
-Therefore the Icinga 2 [PerfdataWriter](09-object-types.md#objecttype-perfdatawriter)
-feature allows you to define the output template format for host and services helped
-with Icinga 2 runtime vars.
-
-```
-host_format_template = "DATATYPE::HOSTPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tHOSTPERFDATA::$host.perfdata$\tHOSTCHECKCOMMAND::$host.check_command$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$"
-service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.check_command$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.state_type$"
-```
-
-The default templates are already provided with the Icinga 2 feature configuration
-which can be enabled using
-
-```
-# icinga2 feature enable perfdata
-```
-
-By default all performance data files are rotated in a 15 seconds interval into
-the `/var/spool/icinga2/perfdata/` directory as `host-perfdata.` and
-`service-perfdata.`.
-External collectors need to parse the rotated performance data files and then
-remove the processed files.
-
-### Graphite Carbon Cache Writer
-
-While there are some [Graphite](13-addons.md#addons-graphing-graphite)
-collector scripts and daemons like Graphios available for Icinga 1.x it's more
-reasonable to directly process the check and plugin performance
-in memory in Icinga 2. Once there are new metrics available, Icinga 2 will directly
-write them to the defined Graphite Carbon daemon tcp socket.
+Icinga 2 writes parsed metrics directly to Graphite's Carbon Cache
+TCP port, defaulting to `2003`.

You can enable the feature using

@@ -273,7 +223,7 @@

By default the [GraphiteWriter](09-object-types.md#objecttype-graphitewriter)
feature expects the Graphite Carbon Cache to listen at `127.0.0.1` on TCP port `2003`.

-#### Current Graphite Schema
+#### Graphite Schema

The current naming schema is defined as follows. The [Icinga Web 2 Graphite module](https://github.com/icinga/icingaweb2-module-graphite)
depends on this schema.

@@ -308,7 +258,8 @@ Metric values are stored like this:

.perfdata..value

-The following characters are escaped in perfdata labels:
+The following characters are escaped in performance labels
+parsed from plugin output:

Character     | Escaped character
--------------|--------------------------
whitespace    | _
\             | _
/             | _
::            | .

-Note that perfdata labels may contain dots (`.`) allowing to
+Note that labels may contain dots (`.`), allowing you to
add more subsequent levels inside the Graphite tree.
`::` adds support for [multi performance labels](http://my-plugin.de/wiki/projects/check_multi/configuration/performance)
and is therefore replaced by `.`.

@@ -369,6 +320,25 @@ pattern = ^icinga2\.
retentions = 1m:2d,5m:10d,30m:90d,360m:4y

+#### Graphite in Cluster HA Zones
+
+The Graphite feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing metrics to a Carbon Cache socket. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes metrics, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that metrics are written even if the cluster fails.
+
+The recommended way of running Graphite in this scenario is a dedicated server
+where Carbon Cache/Relay runs as the receiver.
+

### InfluxDB Writer

@@ -447,6 +417,25 @@ object InfluxdbWriter "influxdb" {
}
```

+#### InfluxDB in Cluster HA Zones
+
+The InfluxDB feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing metrics to the InfluxDB HTTP API. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes metrics, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that metrics are written even if the cluster fails.
+
+The recommended way of running InfluxDB in this scenario is a dedicated server
+where either the InfluxDB HTTP API or Telegraf as a proxy is running.
+
### Elastic Stack Integration

[Icingabeat](https://github.com/icinga/icingabeat) is an Elastic Beat that fetches data

@@ -524,6 +513,26 @@ check_result.perfdata..warn
check_result.perfdata..crit

+#### Elasticsearch in Cluster HA Zones
+
+The Elasticsearch feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing events to the Elasticsearch HTTP API. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes events, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that events are written even if the cluster fails.
+
+The recommended way of running Elasticsearch in this scenario is a dedicated server
+where you either have the Elasticsearch HTTP API listening, a TLS-secured HTTP
+proxy, or Logstash for additional filtering.
+
### Graylog Integration

#### GELF Writer

@@ -550,6 +559,24 @@ Currently these events are processed:

* State changes
* Notifications

+#### Graylog/GELF in Cluster HA Zones
+
+The GELF feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing events to the configured GELF TCP input. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes events, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that events are written even if the cluster fails.
+
+The recommended way of running Graylog in this scenario is a dedicated server
+where a Graylog GELF input is listening.
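+
+For reference, a minimal `GelfWriter` sketch for such an HA zone. The hostname
+is a placeholder and the attributes follow the [GelfWriter](09-object-types.md#objecttype-gelfwriter)
+object type; adjust both to your environment:
+
+```
+object GelfWriter "gelf" {
+  host = "graylog.example.com"  // placeholder: your Graylog node
+  port = 12201                  // default GELF TCP input port
+  enable_send_perfdata = true
+
+  enable_ha = true              // only one endpoint in the zone writes actively
+}
+```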
### OpenTSDB Writer

@@ -625,6 +652,75 @@ with the following tags

> You might want to set the tsd.core.auto_create_metrics setting to `true`
> in your opentsdb.conf configuration file.

+#### OpenTSDB in Cluster HA Zones
+
+The OpenTSDB feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing metrics to the OpenTSDB listener. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes metrics, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that metrics are written even if the cluster fails.
+
+The recommended way of running OpenTSDB in this scenario is a dedicated server
+where you have OpenTSDB running.
+
+
+### Writing Performance Data Files
+
+PNP and Graphios use performance data collector daemons to fetch
+the current performance data files for their backend updates.
+
+The Icinga 2 [PerfdataWriter](09-object-types.md#objecttype-perfdatawriter)
+feature therefore allows you to define the output template format for hosts
+and services, filled with Icinga 2 runtime macros.
+
+```
+host_format_template = "DATATYPE::HOSTPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tHOSTPERFDATA::$host.perfdata$\tHOSTCHECKCOMMAND::$host.check_command$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$"
+service_format_template = "DATATYPE::SERVICEPERFDATA\tTIMET::$icinga.timet$\tHOSTNAME::$host.name$\tSERVICEDESC::$service.name$\tSERVICEPERFDATA::$service.perfdata$\tSERVICECHECKCOMMAND::$service.check_command$\tHOSTSTATE::$host.state$\tHOSTSTATETYPE::$host.state_type$\tSERVICESTATE::$service.state$\tSERVICESTATETYPE::$service.state_type$"
+```
+
+The default templates are already provided with the Icinga 2 feature configuration,
+which can be enabled using
+
+```
+# icinga2 feature enable perfdata
+```
+
+By default, all performance data files are rotated in a 15 second interval into
+the `/var/spool/icinga2/perfdata/` directory as `host-perfdata.` and
+`service-perfdata.`.
+External collectors need to parse the rotated performance data files and then
+remove the processed files.
+
+#### Perfdata Files in Cluster HA Zones
+
+The Perfdata feature supports [high availability](06-distributed-monitoring.md#distributed-monitoring-high-availability-features)
+in cluster zones since 2.11.
+
+By default, all endpoints in a zone will activate the feature and start
+writing metrics to the local spool directory. In HA-enabled scenarios,
+it is possible to set `enable_ha = true` in all feature configuration
+files. This allows each endpoint to calculate the feature authority:
+only one endpoint actively writes metrics, while the other endpoints
+pause the feature.
+
+When the cluster connection breaks at some point, the remaining endpoint(s)
+in that zone will automatically resume the feature. This built-in failover
+mechanism ensures that metrics are written even if the cluster fails.
+
+The recommended way of running Perfdata is to mount the perfdata spool
+directory via NFS on a central server where PNP with the NPCD collector
+is running.
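+
+A matching `PerfdataWriter` sketch with HA enabled. The paths shown are the
+defaults described above; sharing the spool directory via NFS is an assumption
+of this setup:
+
+```
+object PerfdataWriter "perfdata" {
+  // Point these at the NFS-mounted spool if it lives elsewhere.
+  host_perfdata_path = "/var/spool/icinga2/perfdata/host-perfdata"
+  service_perfdata_path = "/var/spool/icinga2/perfdata/service-perfdata"
+  rotation_interval = 15s
+
+  enable_ha = true  // only one endpoint in the zone writes files actively
+}
+```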
+
+
## Livestatus

@@ -831,7 +927,9 @@ The `commands` table is populated with `CheckCommand`, `EventCommand` and `Notif

A detailed list on the available table attributes can be found in the [Livestatus Schema documentation](24-appendix.md#schema-livestatus).

-## Status Data Files
+## Deprecated Features
+
+### Status Data Files

> **Note**
>

status updates in a regular interval. If you are not using any web interface
or addon which uses these files, you can safely disable this feature.

-## Compat Log Files
+### Compat Log Files

> **Note**
>

By default, the Icinga 1.x log file called `icinga.log` is located
in `/var/log/icinga2/compat`. Rotated log files are moved into
`var/log/icinga2/compat/archives`.

-## Check Result Files
+### External Command Pipe
+
+> **Note**
+>
+> Please use the [REST API](12-icinga2-api.md#icinga2-api) as a modern and secure alternative
+> for external actions.
+
+> **Note**
+>
+> This feature is DEPRECATED and will be removed in future releases.
+> Check the [roadmap](https://github.com/Icinga/icinga2/milestones).
+
+Icinga 2 provides an external command pipe for processing commands
+triggering specific actions (for example rescheduling a service check
+through the web interface).
+
+To enable the `ExternalCommandListener` feature, use the following
+command and restart Icinga 2 afterwards:
+
+```
+# icinga2 feature enable command
+```
+
+Icinga 2 creates the command pipe file as `/var/run/icinga2/cmd/icinga2.cmd`
+using the default configuration.
+
+Web interfaces and other Icinga addons are able to send commands to
+Icinga 2 through the external command pipe, for example for rescheduling
+a forced service check:
+
+```
+# /bin/echo "[`date +%s`] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;`date +%s`" >> /var/run/icinga2/cmd/icinga2.cmd
+
+# tail -f /var/log/messages
+
+Oct 17 15:01:25 icinga-server icinga2: Executing external command: [1382014885] SCHEDULE_FORCED_SVC_CHECK;localhost;ping4;1382014885
+Oct 17 15:01:25 icinga-server icinga2: Rescheduling next check for service 'ping4'
+```
+
+A list of currently supported external commands can be found in the [appendix](24-appendix.md#external-commands-list-detail).
+
+Detailed information on the commands and their required parameters can be found
+in the [Icinga 1.x documentation](https://docs.icinga.com/latest/en/extcommands2.html).
+
+
+### Check Result Files

> **Note**
>
-- 
2.40.0