Update documentation: Remove TODO from advanced topics.

author Michael Friedrich <michael.friedrich@netways.de>

Thu, 17 Oct 2013 18:40:07 +0000 (20:40 +0200)

committer Michael Friedrich <michael.friedrich@netways.de>

Thu, 17 Oct 2013 18:40:50 +0000 (20:40 +0200)
author Michael Friedrich <michael.friedrich@netways.de>
Thu, 17 Oct 2013 18:40:07 +0000 (20:40 +0200)
committer Michael Friedrich <michael.friedrich@netways.de>
Thu, 17 Oct 2013 18:40:50 +0000 (20:40 +0200)
diff --git a/doc/6-advanced-topics.md b/doc/6-advanced-topics.md

index cdd20adbefa76433943d6561aff385aed3441001..14df767982e6483a98ec64bf59e9cce11cb02275 100644 (file)
--- a/doc/6-advanced-topics.md
+++ b/doc/6-advanced-topics.md
@@ -1,16 +1,151 @@
  # Advanced Topics
  
+## Soft and Hard State Types
+
+### Soft State Type
+
+A `SOFT` state type is being entered when a service checks ends up in
+a `NOT-OK` state and its check counter has not yet reached the
+configured `max_check_attempts` attribute.
+
+Additionally a `SOFT` state type happens when a service recovers
+from a soft error changing from `NOT-OK` state to `OK` state.
+
+`SOFT` states are logged as alerts and event commands are executed.
+
+> **Note**
+>
+> Event commands can filter the `SOFT` state type by using the
+> macro `$SERVICESTATETYPE$` and its value being `"SOFT"`.
+
+### Hard State Type
+
+A `HARD` state type happens when a service has been re-checked
+as many times as the `max_check_attempts` attribute defines
+being in a `NOT-OK` state during the entire re-check starting
+with the first `NOT-OK` transition.
+
+If a service state changes from one `NOT-OK` state into another
+`NOT-OK` state (for example from `WARNING` to `CRITICAL`) this
+is also considered a `HARD` state change.
+
+Additionally a `HARD` state type happens when a service recovers
+from a hard error changing from `NOT-OK` state to `OK` state.
+
+`HARD` states are logged, event commands are executed and notifications
+are triggered notifying all associated users.
+
+> **Note**
+>
+> Event commands can filter the `HARD` state type by using the
+> macro `$SERVICESTATETYPE$` and its value being `"HARD"`.
+
  ## Downtimes
  
-TODO (move to basics?)
+Downtimes can be scheduled for planned server maintenance or
+any other targetted service outage you are aware of in advance.
+
+Downtimes will suppress any notifications, and may trigger other
+downtimes too. If the downtime was set by accident, or the duration
+exceeds the maintenance, you can manually cancel the downtime.
+Planned downtimes will also be taken into account for SLA reporting
+tools calculating the SLAs based on the state and downtime history.
+
+> **Note**
+>
+> Downtimes may overlap with their start and end times. If there
+> are multiple downtimes triggered, the overall downtime depth
+> will be more than `1`. This is useful when you want to extend
+> your maintenance window taking longer than expected.
+
+### Fixed and Flexible Downtimes
+
+A `fixed` downtime will be activated at the defined start time, and
+removed at the end time. During this time window the service state
+will change to `NOT-OK` and then actually trigger the downtime.
+Notifications are suppressed and the downtime depth is incremented.
+
+Common scenarios are a planned distribution upgrade on your linux
+servers, or database updates in your warehouse. The customer knows
+about a fixed downtime window between 23:00 and 24:00. After 24:00
+all problems should be alerted again. Solution is simple -
+schedule a `fixed` downtime starting at 23:00 and ending at 24:00.
+
+Unlike a `fixed` downtime, a `flexible` downtime end does not necessarily
+happen at the provided end time. Instead the downtime will be triggered
+in the time span defined by start and end time, but then last a defined
+duration in minutes.
+
+Imagine the following scenario: Your service is frequently polled
+by users trying to grab free deleted domains for immediate registration.
+Between 07:30 and 08:00 the impact will hit for 15 minutes and generate
+a network outage visible to the monitoring. The service is still alive,
+but answering too slow to Icinga 2 service checks.
+For that reason, you may want to schedule a downtime between 07:30 and
+08:00 with a duration of 15 minutes. The downtime will then last from
+its trigger time until the duration is over. After that, the downtime
+is removed (may happen before or after the actual end time!).
+
+### Scheduling a downtime
+
+This can either happen through a web interface (Icinga 1.x Classic UI or Web)
+or by using the external command pipe provided by the `ExternalCommandListener`
+configuration.
+
+Fixed downtimes require a start and end time (a duration will be ignored).
+Flexible downtimes need a start and end time for the time span, and a duration
+independant of that.
+
+> **Note**
+>
+> Modern web interfaces treat services in a downtime as `handled`.
+
+### Triggered Downtimes
+
+This is optional when scheduling a downtime. If there is already a downtime
+scheduled for a future maintenance, the current downtime can be triggered by
+that downtime. This renders useful if you have scheduled a host downtime and
+are now scheduling a child host's downtime getting triggered by the parent
+downtime on NOT-OK state change.
  
  ## Comments
  
-TODO (move to basics?)
+Comments can be added at runtime and are persistent over restarts. You can
+add useful information for others on repeating incidents (for example
+"last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which
+is primarly accessible using web interfaces.
+
+Adding and deleting comment actions are possible through the external command pipe
+provided with the `ExternalCommandListener` configuration. The caller must
+pass the the comment id in case of manipulating an existing comment.
  
  ## Acknowledgements
  
-TODO (move to basics?)
+If a problem is alerted and notified you may signal the other notification
+receipients that you are aware of the problem and will handle it.
+
+By sending an acknowledgement to Icinga 2 (using the external command pipe
+provided with `ExternalCommandListener` configuration) all future notifications
+are suppressed, a new comment is added with the provided description and
+a notification with the type `NotificationFilterAcknowledgement` is sent
+to all notified users.
+
+> **Note**
+>
+> Modern web interfaces treat acknowledged problems as `handled`.
+
+### Expiring Acknowledgements
+
+Once a problem is acknowledged it may disappear from your `handled problems`
+dashboard and no-one ever looks at it again since it will suppress
+notifications too.
+
+This `fire-and-forget` action is quite common. If you're sure that a
+current problem should be resolved in the future at a defined time
+you can define an expiration time when acknowledging the problem.
+
+Icinga 2 will clear the acknowledgement when expired and start to
+re-notify if the problem persists.
  
  ## Cluster
  
@@ -24,7 +159,7 @@ The first step is the creation of CA using
  
         icinga2-build-ca 
  
-Please make sure to export a varialbe containing an empty folder for the created CA-files
+Please make sure to export a variable containing an empty folder for the created CA-files
  
         export ICINGA_CA="/root/icinga-ca"
  
@@ -111,7 +246,43 @@ If you update the configs on the configured file sender, it will force a restart
  
  ## Dependencies
  
-TODO
+Icinga 2 uses host and service dependencies as attribute directly on the host or
+service object or template. A service can depend on a host, and vice versa. A
+service has an implicit dependeny (parent) to its host. A host to host dependency acts
+implicit as host parent relation.
+
+A common scenario is the Icinga 2 server behind a router. Checking internet access
+by pinging the Google DNS server `google-dns` is a common method, but will fail in
+case the `dsl-router` host is down. Therefore the example below defines a host dependency
+which acts implicit as parent relation too.
+Furthermore the host may be reachable but ping samples are dropped by the router's iptables.
+By defining a service dependency the `google-dns ping4` re-check will be skipped in
+case the parent service `dsl-router ping4` is in a `NOT-OK` state.
+
+    object Host "dsl-router" {
+      services["ping4"] = {
+        templates = "generic-service",
+        check_command = "ping4"
+      }
+      
+      macros = {
+        address = "192.168.1.1",
+      },
+    }
+
+    object Host "google-dns" {
+      services["ping4"] = {
+        templates = "generic-service",
+        check_command = "ping4"
+        service_dependencies = { "dsl-router", "ping4" }
+      }
+      
+      macros = {
+        address = "8.8.8.8",
+      }, 
+      
+      host_dependencies = [ "dsl-router" ]
+    }
  
  ## Check Result Freshness
  
@@ -129,23 +300,47 @@ If the freshness checks are invalid, a new check is executed defined by the
  
  ## Check Flapping
  
-TODO
+The flapping algorithm used in Icinga 2 does not store the past states but
+calculcates the flapping threshold from a single value based on counters and
+half-life values. Icinga 2 compares the value with a single flapping threshold
+configuration attribute named `flapping_threshold`.
+
+> **Note**
+>
+> Flapping must be explicitely enabled seting the `Service` object attribute
+> `enable_flapping = 1`.
  
  ## Volatile Services
  
-TODO
+By default all services remain in a non-volatile state. Whe a problem
+occurs, the `SOFT` state applies and once `max_check_attempts` attribute
+is reached with the check counter, a `HARD` state transition happens.
+Notifications are only triggered by `HARD` state changes and are then
+re-sent defined by the `notification_interval` attribute.
+
+It may be reasonable to have a volatile service which stays in a `HARD`
+state type if the service stays in a `NOT-OK` state. That way each
+service recheck will automatically trigger a notification unless the
+service is acknowledged or in a scheduled downtime.
  
  ## Modified Attributes
  
-TODO
+Icinga 2 allows you to modify defined object attributes at runtime different to
+the local configuration object attributes. These modified attributes are
+stored as bit-shifted-value and made available in backends. Icinga 2 stores
+modified attributes in its state file and restores them on restart.
  
-## List of External Commands
+Modified Attributes can be reset using external commands.
  
-TODO
  
  ## Plugin API
  
-TODO
+Currently the native plugin api inherited from the `Nagios Plugins` project is available.
+Future specifications will be documented here.
+
+### Nagios Plugin API
+
+The `Nagios Plugin API` is defined the [Nagios Plugins Development Guidelines](https://www.nagios-plugins.org/doc/guidelines.html).
+
  
-### Nagios Plugins
author	Michael Friedrich <michael.friedrich@netways.de>
	Thu, 17 Oct 2013 18:40:07 +0000 (20:40 +0200)
committer	Michael Friedrich <michael.friedrich@netways.de>
	Thu, 17 Oct 2013 18:40:50 +0000 (20:40 +0200)