# Advanced Topics <a id="advanced-topics"></a>

This chapter covers a number of advanced topics. If you're new to Icinga, you
can safely skip over things you're not interested in.

## Downtimes <a id="downtimes"></a>

Downtimes can be scheduled for planned server maintenance or
any other targeted service outage you are aware of in advance.

Downtimes suppress notifications and can trigger other
downtimes too. If the downtime was set by accident, or the duration
exceeds the maintenance window, you can manually cancel the downtime.

### Scheduling a downtime <a id="scheduling-downtime"></a>

The most convenient way to schedule planned downtimes is to create
them in Icinga Web 2 inside the host/service detail view. Select
multiple hosts/services from the listing with the shift key to
schedule multiple downtimes.

![Downtime in Icinga Web 2](images/advanced-topics/icingaweb2_downtime_handled.png)

In addition to that, you can schedule a downtime by using the Icinga 2 API action
[schedule-downtime](12-icinga2-api.md#icinga2-api-actions-schedule-downtime).
This is especially useful to schedule a downtime on demand inside a (remote) backup
script, or to create maintenance downtimes from a cron job for specific dates and intervals.

Multiple downtimes for a single object may overlap. This is useful
when your maintenance takes longer than expected and you want to extend
the maintenance window. If multiple downtimes are triggered for one object,
the overall downtime depth will be greater than `1`.

If the downtime was scheduled after the problem changed to a critical hard
state triggering a problem notification, and the service recovers during
the downtime window, the recovery notification won't be suppressed.

Planned downtimes are also taken into account for SLA reporting
tools calculating the SLAs based on the state and downtime history.

### Fixed and Flexible Downtimes <a id="fixed-flexible-downtimes"></a>

A `fixed` downtime will be activated at the defined start time, and
removed at the end time. If the service state changes to `NOT-OK` during
this time window, the downtime is actually triggered: notifications are
suppressed and the downtime depth is incremented.

Common scenarios are a planned distribution upgrade on your Linux
servers, or database updates in your warehouse. The customer knows
about a fixed downtime window between 23:00 and 24:00. After 24:00
all problems should be alerted again. The solution is simple:
schedule a `fixed` downtime starting at 23:00 and ending at 24:00.

Unlike a `fixed` downtime, a `flexible` downtime will be triggered
by the state change in the time span defined by start and end time,
and then last for the specified duration in minutes.

Imagine the following scenario: Your service is frequently polled
by users trying to grab free deleted domains for immediate registration.
Between 07:30 and 08:00 the impact will hit for 15 minutes and generate
a network outage visible to the monitoring. The service is still alive,
but answers Icinga 2 service checks too slowly.
For that reason, you may want to schedule a downtime between 07:30 and
08:00 with a duration of 15 minutes. The downtime will then last from
its trigger time until the duration is over. After that, the downtime
is removed (this may happen before or after the defined end time).

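
Recurring windows like the one in this scenario can also be configured with a
[ScheduledDowntime](09-object-types.md#objecttype-scheduleddowntime) object by
setting `fixed = false` together with a `duration`. The following is a minimal
sketch; the service group `domain-registration` is a hypothetical name used for
illustration only. `fixed` defaults to `true`, and `duration` only has an effect
on flexible downtimes.

```
apply ScheduledDowntime "domain-grab-flexible" to Service {
  author = "icingaadmin"
  comment = "Flexible downtime for the morning domain grabbing rush"

  /* Flexible: only triggered by a NOT-OK state change inside
   * the ranges below, then lasts for the given duration. */
  fixed = false
  duration = 15m

  ranges = {
    monday = "07:30-08:00"
    tuesday = "07:30-08:00"
    wednesday = "07:30-08:00"
    thursday = "07:30-08:00"
    friday = "07:30-08:00"
  }

  /* "domain-registration" is a hypothetical service group. */
  assign where "domain-registration" in service.groups
}
```
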
#### Fixed Downtime <a id="fixed-downtime"></a>

If the host/service changes into a NOT-OK state between the start and
end time window, the downtime will be marked as `in effect` and
the downtime depth counter is increased.

```
   |       |         |
start      |        end
       trigger time
```

#### Flexible Downtime <a id="flexible-downtime"></a>

A flexible downtime defines a time window where the downtime may be
triggered from a host/service NOT-OK state change. It will then last
until the specified time duration is reached. That way the defined end
time may already have passed while the downtime is still active: it ends
at `trigger time + duration`.

```
   |       |         |
start      |        end               actual end time
           |--------------duration--------|
       trigger time
```

### Triggered Downtimes <a id="triggered-downtimes"></a>

Triggering is optional when scheduling a downtime. If there is already a downtime
scheduled for a future maintenance, the current downtime can be triggered by
that downtime. This is useful if you have scheduled a host downtime and
now want to schedule a downtime for a child host which is triggered by the parent
downtime on a `NOT-OK` state change.

### Recurring Downtimes <a id="recurring-downtimes"></a>

[ScheduledDowntime objects](09-object-types.md#objecttype-scheduleddowntime) can be used to set up
recurring downtimes for hosts and services.

Example:

```
apply ScheduledDowntime "backup-downtime" to Service {
  author = "icingaadmin"
  comment = "Scheduled downtime for backup"

  ranges = {
    monday = "02:00-03:00"
    tuesday = "02:00-03:00"
    wednesday = "02:00-03:00"
    thursday = "02:00-03:00"
    friday = "02:00-03:00"
    saturday = "02:00-03:00"
    sunday = "02:00-03:00"
  }

  assign where "backup" in service.groups
}
```

Icinga 2 attempts to find the next possible segment from a ScheduledDowntime object's
`ranges` attribute, and won't create multiple downtimes in the future. In case you need
all these downtimes planned and visible for the next days, weeks or months, schedule them
manually via the [REST API](12-icinga2-api.md#icinga2-api-actions-schedule-downtime) using
a script or cron job.

> **Note**
>
> If ScheduledDowntime objects are synced in a distributed high-availability setup,
> both endpoints will create the next possible downtime on their own. These runtime-generated
> downtimes are synced among both zone instances, and you may see what look like duplicate
> downtimes in Icinga Web 2.

## Comments <a id="comments-intro"></a>

Comments can be added at runtime and persist across restarts. You can
add useful information for others on repeating incidents (for example
"last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which
is primarily accessible using web interfaces.

You can add a comment either by using the Icinga 2 API action
[add-comment](12-icinga2-api.md#icinga2-api-actions-add-comment) or
by sending an [external command](14-features.md#external-commands).

## Acknowledgements <a id="acknowledgements"></a>

If a problem persists and notifications have been sent, you can
acknowledge the problem. That way other users will get
a notification that you're aware of the issue and are probably
already working on a fix.

Note: Acknowledgements also add a new [comment](08-advanced-topics.md#comments-intro)
which contains the author and text fields.

You can send an acknowledgement either by using the Icinga 2 API action
[acknowledge-problem](12-icinga2-api.md#icinga2-api-actions-acknowledge-problem) or
by sending an [external command](14-features.md#external-commands).

### Sticky Acknowledgements <a id="sticky-acknowledgements"></a>

The acknowledgement is removed if a state change occurs or if the host/service
recovers (OK/Up state).

If you acknowledge a problem once you've received a `Critical` notification,
the acknowledgement will be removed if there is a state transition to `Warning`:

```
OK -> WARNING -> CRITICAL -> WARNING -> OK
```

If you prefer to keep the acknowledgement until the problem is resolved (`OK`
recovery), you need to enable the `sticky` parameter.

### Expiring Acknowledgements <a id="expiring-acknowledgements"></a>

Once a problem is acknowledged it may disappear from your `handled problems`
dashboard and no one ever looks at it again, since it will suppress
notifications too.

This `fire-and-forget` action is quite common. If you're sure that a
current problem should be resolved in the future at a defined time,
you can define an expiration time when acknowledging the problem.

Icinga 2 will clear the acknowledgement when it expires and start to
re-notify if the problem persists.

## Time Periods <a id="timeperiods"></a>

[Time Periods](09-object-types.md#objecttype-timeperiod) define
time ranges in Icinga where event actions are triggered, for
example whether a service check is executed or not, controlled by
the `check_period` attribute, or whether a notification should be sent to
users or not, filtered by the `period` attribute of
`Notification` and `User` objects.

The `TimePeriod` attribute `ranges` may contain multiple directives,
including weekdays, days of the month, and calendar dates.
These types may overlap with or override each other in your ranges dictionary.

The descending order of precedence is as follows:

* Calendar date (2008-01-01)
* Specific month date (January 1st)
* Generic month date (Day 15)
* Offset weekday of specific month (2nd Tuesday in December)
* Offset weekday (3rd Monday)
* Normal weekday (Tuesday)

If you don't set any `check_period` or `period` attribute
on your configuration objects, Icinga 2 assumes `24x7` as time period
as shown below.

```
object TimePeriod "24x7" {
  display_name = "Icinga 2 24x7 TimePeriod"
  ranges = {
    "monday"    = "00:00-24:00"
    "tuesday"   = "00:00-24:00"
    "wednesday" = "00:00-24:00"
    "thursday"  = "00:00-24:00"
    "friday"    = "00:00-24:00"
    "saturday"  = "00:00-24:00"
    "sunday"    = "00:00-24:00"
  }
}
```

If your operations staff should only be notified during work hours,
create a new time period named `workhours` defining a work day from
09:00 to 17:00.

```
object TimePeriod "workhours" {
  display_name = "Icinga 2 8x5 TimePeriod"
  ranges = {
    "monday"    = "09:00-17:00"
    "tuesday"   = "09:00-17:00"
    "wednesday" = "09:00-17:00"
    "thursday"  = "09:00-17:00"
    "friday"    = "09:00-17:00"
  }
}
```

If you want to specify a notification period across midnight,
you can define it the following way:

```
object TimePeriod "across-midnight" {
  display_name = "Nightly Notification"
  ranges = {
    "saturday" = "22:00-24:00"
    "sunday" = "00:00-03:00"
  }
}
```

The following example configures a time period spanning several
days, weeks or months. This can be useful when taking components offline
for a distinct period of time.

```
object TimePeriod "standby" {
  display_name = "Standby"
  ranges = {
    "2016-09-30 - 2016-10-30" = "00:00-24:00"
  }
}
```

Please note that the spaces before and after the dash are mandatory.

Once your time period is configured, you can use the `period` attribute
to assign time periods to `Notification` and `Dependency` objects:

```
apply Notification "mail-icingaadmin" to Service {
  import "mail-service-notification"
  user_groups = host.vars.notification.mail.groups
  users = host.vars.notification.mail.users

  period = "workhours"

  assign where host.vars.notification.mail
}
```

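
Dependency objects accept the same `period` attribute; the dependency is then
only enabled during that time period. A minimal sketch, reusing the `workhours`
time period defined above and assuming a hypothetical parent host named
`core-router`:

```
apply Dependency "core-router-workhours" to Host {
  /* "core-router" is a hypothetical parent host. */
  parent_host_name = "core-router"

  /* Only enable this dependency within the "workhours" time period. */
  period = "workhours"

  /* Do not make the parent host depend on itself. */
  assign where host.name != "core-router"
}
```
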
### Time Periods Inclusion and Exclusion <a id="timeperiods-includes-excludes"></a>

Sometimes it is necessary to exclude certain time ranges from
your default time period definitions, for example if you don't
want to send out any notifications during the holiday season,
or if you only want to allow small time windows for executed checks.

The [TimePeriod object](09-object-types.md#objecttype-timeperiod)
provides the `includes` and `excludes` attributes to solve this issue.
`prefer_includes` defines whether included or excluded time periods are
preferred.

The following example defines a time period called `holidays` where
notifications should be suppressed:

```
object TimePeriod "holidays" {
  ranges = {
    "january 1" = "00:00-24:00"                 //new year's day
    "july 4" = "00:00-24:00"                    //independence day
    "december 25" = "00:00-24:00"               //christmas
    "december 31" = "18:00-24:00"               //new year's eve (6pm+)
    "2017-04-16" = "00:00-24:00"                //easter 2017
    "monday -1 may" = "00:00-24:00"             //memorial day (last monday in may)
    "monday 1 september" = "00:00-24:00"        //labor day (1st monday in september)
    "thursday 4 november" = "00:00-24:00"       //thanksgiving (4th thursday in november)
  }
}
```

In addition to that, the time period `weekends-excluded` defines an additional
time window which should be excluded from notifications:

```
object TimePeriod "weekends-excluded" {
  ranges = {
    "saturday"  = "00:00-09:00,18:00-24:00"
    "sunday"    = "00:00-09:00,18:00-24:00"
  }
}
```

The time period `prod-notification` defines the default time ranges
and adds the excluded time period names as an array.

```
object TimePeriod "prod-notification" {
  excludes = [ "holidays", "weekends-excluded" ]

  ranges = {
    "monday"    = "00:00-24:00"
    "tuesday"   = "00:00-24:00"
    "wednesday" = "00:00-24:00"
    "thursday"  = "00:00-24:00"
    "friday"    = "00:00-24:00"
    "saturday"  = "00:00-24:00"
    "sunday"    = "00:00-24:00"
  }
}
```

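
The `includes` attribute works the other way around and adds the ranges of
other time periods. A minimal sketch, assuming a hypothetical `holiday-oncall`
time period which re-opens a short window on a day excluded by `holidays`.
With `prefer_includes` set to `true` (its default), the included ranges are
expected to take precedence over the excluded ones:

```
/* Hypothetical time period: a short on-call window on a holiday. */
object TimePeriod "holiday-oncall" {
  ranges = {
    "december 25" = "10:00-12:00"
  }
}

object TimePeriod "prod-notification-oncall" {
  excludes = [ "holidays" ]
  includes = [ "holiday-oncall" ]

  /* Prefer included over excluded time ranges (default: true). */
  prefer_includes = true

  ranges = {
    "monday"    = "09:00-17:00"
    "tuesday"   = "09:00-17:00"
    "wednesday" = "09:00-17:00"
    "thursday"  = "09:00-17:00"
    "friday"    = "09:00-17:00"
  }
}
```
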
## External Check Results <a id="external-check-results"></a>

Hosts or services which do not actively execute a check plugin to receive
the state and output are called "passive checks" or "external check results".
In this scenario an external client or script sends in check results.

You can feed check results into Icinga 2 with the following transport methods:

* [process-check-result action](12-icinga2-api.md#icinga2-api-actions-process-check-result) available with the [REST API](12-icinga2-api.md#icinga2-api) (remote and local)
* External command sent via command pipe (local only)

Each time a new check result is received, the next expected check time
is updated. This means that if no check results are received from
the external source, Icinga 2 will execute [freshness checks](08-advanced-topics.md#check-result-freshness).

> **Note**
>
> The REST API action allows you to specify the `check_source` attribute
> which helps identify the external sender. This is also visible
> in Icinga Web 2 and the REST API queries.

## Check Result Freshness <a id="check-result-freshness"></a>

In Icinga 2 active check freshness is enabled by default. It is determined by the
`check_interval` attribute: a result becomes stale if no new check result arrives
within that period of time.

The threshold is calculated based on the last check execution time for actively executed checks:

    (last check execution time + check interval) > current time

If this host/service receives check results from an [external source](08-advanced-topics.md#external-check-results),
the threshold is based on the last time a check result was received:

    (last check result time + check interval) > current time

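
To make the comparison concrete, the following minimal sketch can be pasted
into the `icinga2 console`. The function name `is_result_fresh` is hypothetical
and only mirrors the formulas above; it is not how Icinga 2 implements freshness
checks internally. `get_time()` returns the current UNIX timestamp, and duration
literals such as `1m` evaluate to seconds.

```
/* Hypothetical helper, for illustration only: a result stays fresh while
 * (last result time + check interval) still lies in the future. */
globals.is_result_fresh = function(last_result_time, check_interval) {
  return (last_result_time + check_interval) > get_time()
}

/* Result received 30 seconds ago, check interval 1m: still fresh (true). */
is_result_fresh(get_time() - 30, 1m)

/* Result received 90 seconds ago, check interval 1m: stale (false),
 * so a freshness check would execute the check command. */
is_result_fresh(get_time() - 90, 1m)
```
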
> **Tip**
>
> The [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result) REST API
> action allows you to override the pre-defined check interval with a specified TTL in Icinga 2 v2.9+.

If the freshness checks fail, Icinga 2 will execute the defined check command.

Best practice is to define a [dummy](10-icinga-template-library.md#itl-dummy) `check_command` which gets
executed when freshness checks fail.

```
apply Service "external-check" {
  check_command = "dummy"
  check_interval = 1m

  /* Set the state to UNKNOWN (3) if freshness checks fail. */
  vars.dummy_state = 3

  /* Use a runtime function to retrieve the last check time and more details. */
  vars.dummy_text = {{
    var service = get_service(macro("$host.name$"), macro("$service.name$"))
    var lastCheck = DateTime(service.last_check).to_string()

    return "No check results received. Last result time: " + lastCheck
  }}

  assign where "external" in host.vars.services
}
```

References: [get_service](18-library-reference.md#objref-get_service), [macro](18-library-reference.md#scoped-functions-macro), [DateTime](18-library-reference.md#datetime-type).

Example output in Icinga Web 2:

![Icinga 2 Freshness Checks](images/advanced-topics/icinga2_external_checks_freshness_icingaweb2.png)

## Check Flapping <a id="check-flapping"></a>

Icinga 2 supports optional detection of hosts and services that are "flapping".

Flapping occurs when a service or host changes state too frequently, which would otherwise result in a storm of problem and
recovery notifications. With flapping detection enabled, a flapping notification is sent while other notifications are
suppressed until the object calms down, that is, until its checks return the same state a few times in a row. Flapping
detection can help detect configuration problems (wrong thresholds), troublesome services, or network problems.

Flapping detection can be enabled or disabled using the `enable_flapping` attribute.
The `flapping_threshold_high` and `flapping_threshold_low` attributes allow you to specify the thresholds that control
when a [host](09-object-types.md#objecttype-host) or [service](09-object-types.md#objecttype-service) is considered to be flapping.

The default thresholds are 30% for high and 25% for low. If the computed flapping value exceeds the high threshold, a
host or service is considered flapping until it drops below the low flapping threshold.

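
A minimal sketch of a service template which enables flapping detection and
sets both thresholds explicitly to the default values mentioned above. The
template name `flapping-detection` is hypothetical; import it into the services
which should be checked for flapping.

```
template Service "flapping-detection" {
  /* Enable the optional flapping detection. */
  enable_flapping = true

  /* Start flapping above 30%, stop flapping below 25%. */
  flapping_threshold_high = 30.0
  flapping_threshold_low = 25.0
}
```
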
`FlappingStart` and `FlappingEnd` notifications will be sent out accordingly, if configured. See the chapter on
[notifications](03-monitoring-basics.md#alert-notifications) for details.

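
For example, a notification apply rule can be limited to these two types with
the `types` attribute. A minimal sketch which reuses the
`mail-service-notification` template and the host custom attributes from the
notification example in the [Time Periods](08-advanced-topics.md#timeperiods)
section; the object name `mail-flapping` is illustrative only:

```
apply Notification "mail-flapping" to Service {
  import "mail-service-notification"
  user_groups = host.vars.notification.mail.groups
  users = host.vars.notification.mail.users

  /* Only notify about the start and the end of flapping. */
  types = [ FlappingStart, FlappingEnd ]

  assign where host.vars.notification.mail
}
```
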
> Note: There is no distinction between hard and soft states with flapping. All state changes count and notifications
> will be sent out regardless of the object's state.

### How it works <a id="check-flapping-how-it-works"></a>

Icinga 2 keeps track of the last 20 check results for every host and service, and whether each of them was a state
change. See the graphic below:

![Icinga 2 Flapping State Timeline](images/advanced-topics/flapping-state-graph.png)

All entries are weighted, with the most recent one being worth the most (1.18) and the 20th the least (0.8). The
weights in between increase evenly in steps of 0.02. The final flapping value is the sum of the weights of all state
changes, divided by the total count of 20 and expressed as a percentage.

In the example above, the weighted state changes add up to 7.82 (`0.84 + 0.86 + 0.88 + 0.9 + 0.98 + 1.06 + 1.12 + 1.18`).
This yields a flapping percentage of 39.1% (`7.82 / 20 * 100`). As the default upper flapping threshold is 30%, it would be
considered flapping.

If the next seven check results were not state changes, the flapping percentage would fall below the lower threshold
of 25% and therefore the host or service would recover from flapping.

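
The calculation can be reproduced in the `icinga2 console`. This is a minimal
sketch, not the actual Icinga 2 implementation: the function name
`flapping_value` is hypothetical, and the array is one possible history that is
consistent with the weights listed in the example above (index 0 is the oldest
result, index 19 the most recent, `true` marks a state change).

```
/* Hypothetical helper for illustration, not the Icinga 2 implementation. */
globals.flapping_value = function(changes) {
  var weighted = 0

  for (var i in range(changes.len())) {
    if (changes[i]) {
      /* Weights grow linearly from 0.8 (oldest) to 1.18 (most recent). */
      weighted += 0.8 + 0.02 * i
    }
  }

  /* Divide by the total count of 20 and express it as a percentage. */
  return weighted / 20 * 100
}

/* State changes at the positions weighted 0.84, 0.86, 0.88, 0.9,
 * 0.98, 1.06, 1.12 and 1.18 in the example above. */
var state_changes = [
  false, false, true, true, true, true, false, false, false, true,
  false, false, false, true, false, false, true, false, false, true
]

flapping_value(state_changes)
```

This should return roughly 39.1 (floating point rounding may add a few digits).
Shifting in seven results without a state change drops the four oldest state
changes and moves the remaining ones to the weights 0.84, 0.92, 0.98 and 1.04.
They add up to 3.78, or about 18.9%, which is below the 25% low threshold.
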
## Volatile Services and Hosts <a id="volatile-services-hosts"></a>

The `volatile` option, if enabled for a host or service, makes it treat every [state change](03-monitoring-basics.md#hard-soft-states)
as a `HARD` state change. It is comparable to `max_check_attempts = 1`. With this, any `NOT-OK` result will
ignore `max_check_attempts` and trigger notifications etc. It will further cause any additional `NOT-OK`
result to re-send notifications.

It may be reasonable to have a volatile service which stays in a `HARD` state if the service stays in a `NOT-OK`
state. That way each service recheck will automatically trigger a notification unless the service is acknowledged or
in a scheduled downtime.

A common example is security checks, where each `NOT-OK` check result should immediately trigger a notification.

The default for this option is `false`; only enable it when required.

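
A minimal sketch of such a volatile security check. The `file-integrity`
CheckCommand and the `vars.security_checks` host custom attribute are
hypothetical placeholders for your own check plugin and assign rule:

```
apply Service "file-integrity" {
  import "generic-service"

  /* Hypothetical CheckCommand wrapping your security check plugin. */
  check_command = "file-integrity"

  /* Treat every NOT-OK result as a HARD state and re-notify on each one. */
  volatile = true

  assign where host.vars.security_checks == true
}
```
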

## Monitoring Icinga 2 <a id="monitoring-icinga"></a>

Why should you do that? Icinga and its components run like any other
service application on your server. There are predictable issues
such as "disk space is running low", and your monitoring suffers from just
that.

You would also like to ensure that features and backends are running
and storing required data. This includes the database backend from which
Icinga Web 2 presents its dashboards, metrics forwarded to Graphite or
InfluxDB, and the entire distributed setup.

This list isn't complete but should help with your own setup.
Windows client specific checks are highlighted.

Type            | Description                   | Plugins and CheckCommands
----------------|-------------------------------|-----------------------------------------------------
System          | Filesystem                    | [disk](10-icinga-template-library.md#plugin-check-command-disk), [disk-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
System          | Memory, Swap                  | [mem](10-icinga-template-library.md#plugin-contrib-command-mem), [swap](10-icinga-template-library.md#plugin-check-command-swap), [memory](10-icinga-template-library.md#windows-plugins) (Windows Client)
System          | Hardware                      | [hpasm](10-icinga-template-library.md#plugin-contrib-command-hpasm), [ipmi-sensor](10-icinga-template-library.md#plugin-contrib-command-ipmi-sensor)
System          | Virtualization                | [VMware](10-icinga-template-library.md#plugin-contrib-vmware), [esxi_hardware](10-icinga-template-library.md#plugin-contrib-command-esxi-hardware)
System          | Processes                     | [procs](10-icinga-template-library.md#plugin-check-command-processes), [service-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
System          | System Activity Reports       | [check_sar_perf](https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_sar_perf.py)
System          | I/O                           | [iostat](10-icinga-template-library.md#plugin-contrib-command-iostat)
System          | Network interfaces            | [nwc_health](10-icinga-template-library.md#plugin-contrib-command-nwc_health), [interfaces](10-icinga-template-library.md#plugin-contrib-command-interfaces)
System          | Users                         | [users](10-icinga-template-library.md#plugin-check-command-users), [users-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
System          | Logs                          | Forward them to [Elastic Stack](14-features.md#elastic-stack-integration) or [Graylog](14-features.md#graylog-integration) and add your own alerts.
System          | NTP                           | [ntp_time](10-icinga-template-library.md#plugin-check-command-ntp-time)
System          | Updates                       | [apt](10-icinga-template-library.md#plugin-check-command-apt), [yum](10-icinga-template-library.md#plugin-contrib-command-yum)
Icinga          | Status & Stats                | [icinga](10-icinga-template-library.md#itl-icinga) (more below)
Icinga          | Cluster & Clients             | [health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks)
Database        | MySQL                         | [mysql_health](10-icinga-template-library.md#plugin-contrib-command-mysql_health)
Database        | PostgreSQL                    | [postgres](10-icinga-template-library.md#plugin-contrib-command-postgres)
Database        | Housekeeping                  | Check the database size and growth and analyse metrics to examine trends.
Database        | DB IDO                        | [ido](10-icinga-template-library.md#itl-icinga-ido) (more below)
Webserver       | Apache2, Nginx, etc.          | [http](10-icinga-template-library.md#plugin-check-command-http), [apache_status](10-icinga-template-library.md#plugin-contrib-command-apache_status), [nginx_status](10-icinga-template-library.md#plugin-contrib-command-nginx_status)
Webserver       | Certificates                  | [http](10-icinga-template-library.md#plugin-check-command-http)
Webserver       | Authorization                 | [http](10-icinga-template-library.md#plugin-check-command-http)
Notifications   | Mail (queue)                  | [smtp](10-icinga-template-library.md#plugin-check-command-smtp), [mailq](10-icinga-template-library.md#plugin-check-command-mailq)
Notifications   | SMS (GSM modem)               | [check_sms3_status](https://exchange.icinga.com/netways/check_sms3status)
Notifications   | Messengers, Cloud services    | XMPP, Twitter, IRC, Telegram, PagerDuty, VictorOps, etc.
Metrics         | PNP, RRDTool                  | [check_pnp_rrds](https://github.com/lingej/pnp4nagios/tree/master/scripts) checks for stale RRD files.
Metrics         | Graphite                      | [graphite](10-icinga-template-library.md#plugin-contrib-command-graphite)
Metrics         | InfluxDB                      | [check_influxdb](https://exchange.icinga.com/Mikanoshi/InfluxDB+data+monitoring+plugin)
Metrics         | Elastic Stack                 | [elasticsearch](10-icinga-template-library.md#plugin-contrib-command-elasticsearch), [Elastic Stack integration](14-features.md#elastic-stack-integration)
Metrics         | Graylog                       | [Graylog integration](14-features.md#graylog-integration)

The [icinga](10-icinga-template-library.md#itl-icinga) CheckCommand provides metrics for the runtime stats of
Icinga 2. You can forward them to your preferred graphing solution.
If you require more metrics, you can also query the [REST API](12-icinga2-api.md#icinga2-api) and write
your own custom check plugin, or keep using the built-in [object accessor functions](08-advanced-topics.md#access-object-attributes-at-runtime)
to calculate stats in-memory.

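
A minimal sketch which runs the `icinga` CheckCommand on the local Icinga 2
instance. It assumes that a Host object named after the local endpoint exists,
as in the default configuration; the `NodeName` constant holds this name.

```
apply Service "icinga" {
  /* Built-in check returning Icinga 2 runtime stats as performance data. */
  check_command = "icinga"

  /* Only apply to the host representing the local Icinga 2 instance. */
  assign where host.name == NodeName
}
```
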
There is a built-in [ido](10-icinga-template-library.md#itl-icinga-ido) check available for DB IDO MySQL/PostgreSQL
which provides additional metrics for the IDO database.

```
apply Service "ido-mysql" {
  check_command = "ido"

  vars.ido_type = "IdoMysqlConnection"
  vars.ido_name = "ido-mysql" //the name defined in /etc/icinga2/features-enabled/ido-mysql.conf

  assign where match("master*.localdomain", host.name)
}
```

More specific database queries can be found in the [DB IDO](14-features.md#db-ido) chapter.

Distributed setups should include specific [health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks).
You might also want to check for SSL certificate expiration.

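
Certificate expiration can be checked with the
[http](10-icinga-template-library.md#plugin-check-command-http) CheckCommand and
its `http_certificate` custom attribute, which sets the minimum number of days a
certificate has to be valid. A minimal sketch, assuming the Icinga 2 API listens
on its default port 5665 on the local host; the 30 day threshold is an arbitrary
example value:

```
apply Service "icinga2-api-certificate" {
  check_command = "http"

  /* Connect via TLS to the Icinga 2 API port. */
  vars.http_port = 5665
  vars.http_ssl = true

  /* Alert if the certificate is valid for less than 30 more days. */
  vars.http_certificate = "30"

  assign where host.name == NodeName
}
```
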

## Advanced Configuration Hints <a id="advanced-configuration-hints"></a>

### Advanced Use of Apply Rules <a id="advanced-use-of-apply-rules"></a>

[Apply rules](03-monitoring-basics.md#using-apply) can be used to create a rule set which is
entirely based on host objects and their attributes.
In addition to that, [apply for and custom attribute override](03-monitoring-basics.md#using-apply-for)
extend the possibilities.

The following example defines a dictionary on the host object which contains
configuration attributes for multiple web servers. This is then used to add three checks:

* A `ping4` check using the local IP `address` of the web server.
* A `tcp` check querying the TCP port the HTTP service is running on.
* If the `url` key is defined, the third apply for rule will create service objects using the `http` CheckCommand.
  In addition to that, you can optionally define the `ssl` attribute which enables HTTPS checks.

Host definition:

    object Host "webserver01" {
      import "generic-host"
      address = "192.168.56.200"
      vars.os = "Linux"

      vars.webserver = {
        instance["status"] = {
          address = "192.168.56.201"
          port = "80"
          url = "/status"
        }
        instance["tomcat"] = {
          address = "192.168.56.202"
          port = "8080"
        }
        instance["icingaweb2"] = {
          address = "192.168.56.210"
          port = "443"
          url = "/icingaweb2"
          ssl = true
        }
      }
    }

Service apply for definitions:

    apply Service "webserver_ping" for (instance => config in host.vars.webserver.instance) {
      display_name = "webserver_" + instance
      check_command = "ping4"

      vars.ping_address = config.address

      assign where host.vars.webserver.instance
    }

    apply Service "webserver_port" for (instance => config in host.vars.webserver.instance) {
      display_name = "webserver_" + instance + "_" + config.port
      check_command = "tcp"

      vars.tcp_address = config.address
      vars.tcp_port = config.port

      assign where host.vars.webserver.instance
    }

    apply Service "webserver_url" for (instance => config in host.vars.webserver.instance) {
      display_name = "webserver_" + instance + "_" + config.url
      check_command = "http"

      vars.http_address = config.address
      vars.http_port = config.port
      vars.http_uri = config.url

      if (config.ssl) {
        vars.http_ssl = config.ssl
      }

      assign where config.url != ""
    }

The variables defined in the host dictionary do not use the typical custom attribute
prefix recommended for CheckCommand parameters. Instead they are re-used for multiple
service checks in this example.
In addition to defining check parameters this way, you can also enrich the `display_name`
attribute with more details. This is shown in Icinga Web 2, for example.

### Use Functions in Object Configuration <a id="use-functions-object-config"></a>

Functions can be used as object attributes only in a limited scope, for example:

* As value for [Custom Attributes](03-monitoring-basics.md#custom-attributes-functions)
* Returning boolean expressions for [set_if](08-advanced-topics.md#use-functions-command-arguments-setif) inside command arguments
* Returning a [command](08-advanced-topics.md#use-functions-command-attribute) array inside command objects

Conversely, you can also create objects dynamically using your own global functions.

> **Note**
>
> Functions called inside command objects share the same global scope as runtime macros.
> Therefore you can access host custom attributes like `host.vars.os`, or any other
> object attribute, from inside the function definition used for [set_if](08-advanced-topics.md#use-functions-command-arguments-setif) or [command](08-advanced-topics.md#use-functions-command-attribute).
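
As an illustration of reading `host.vars.os` from a command function, the following minimal sketch passes a flag only to Linux hosts. The object name `example-os-aware`, the plugin `check_example` and its `--linux` flag are hypothetical; it also assumes your hosts set `vars.os`.

```
object CheckCommand "example-os-aware" {
  /* hypothetical plugin and flag, for illustration only */
  command = [ PluginDir + "/check_example" ]

  arguments = {
    "--linux" = {
      /* evaluated at runtime for the host being checked */
      set_if = {{
        host.vars.os == "Linux"
      }}
    }
  }
}
```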

Tips when implementing functions:

* Use [log()](18-library-reference.md#global-functions-log) to dump variables. You can see the output
inside the `icinga2.log` file, depending on your log severity.
* Use the `icinga2 console` to test basic functionality (e.g. iterating over a dictionary).
* Build them step by step. You can always refactor your code later on.

#### Register and Use Global Functions <a id="use-functions-global-register"></a>

[Functions](17-language-reference.md#functions) can be registered into the global scope. This makes custom functions available
in objects and other functions. Keep in mind that these functions are not marked
as side-effect-free and as such are not available via the REST API.

Add a new configuration file `functions.conf` and include it at the very beginning of the
[icinga2.conf](04-configuring-icinga-2.md#icinga2-conf) configuration file, e.g. right after `constants.conf`.
You can also manage global functions inside `constants.conf` if you prefer.
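
A minimal sketch of the corresponding `icinga2.conf` excerpt, assuming `functions.conf` lives in the same directory as `constants.conf` (`/etc/icinga2` on most Linux installations):

```
/* icinga2.conf (excerpt) */
include "constants.conf"

/* include the global functions early, right after the constants */
include "functions.conf"
```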

The following function converts a numeric state into its string representation. The important
bits for registering it into the global scope are:

* `globals.<unique_function_name>` adds a new globals entry.
* `function()` specifies that a call to `state_to_string()` executes a function.
* Function parameters are defined inside the `function()` definition.

```
globals.state_to_string = function(state) {
  if (state == 2) {
    return "Critical"
  } else if (state == 1) {
    return "Warning"
  } else if (state == 0) {
    return "OK"
  } else if (state == 3) {
    return "Unknown"
  } else {
    log(LogWarning, "state_to_string", "Unknown state " + state + " provided.")
  }
}
```

The else-condition allows for better error handling: the warning is written to the Icinga 2
log file whenever the function is called with an unexpected state.

> **Note**
>
> If these functions are used in a distributed environment, you must make sure to deploy them
> to every node where they are needed.

In order to test-drive the newly created function, restart Icinga 2 and use the [debug console](11-cli-commands.md#cli-command-console)
to connect to the REST API.

```
$ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/'
Icinga 2 (version: v2.8.1-373-g4bea6d25c)
<1> => globals.state_to_string(1)
"Warning"
<2> => state_to_string(2)
"Critical"
```

You can see that this function is now registered into the [global scope](17-language-reference.md#variable-scopes). The function call
`state_to_string()` can be used in any object at static config compile time or inside runtime
lambda functions.

The following service object example uses the service state and converts it to string output.
The lambda function is not optimized; it is written out step by step for better readability and includes a log message.

```
object Service "state-test" {
  check_command = "dummy"
  host_name = NodeName

  vars.dummy_state = 2

  vars.dummy_text = {{
    var h = macro("$host.name$")
    var s = macro("$service.name$")

    var state = get_service(h, s).state

    log(LogInformation, "dummy_state", "Host: " + h + " Service: " + s + " State: " + state)

    return state_to_string(state)
  }}
}
```

#### Use Custom Functions as Attribute <a id="custom-functions-as-attribute"></a>

To use custom functions as attributes, the function must be defined in a
slightly unexpected way: the global function returns another function, which
captures the passed arguments with `use` and is evaluated at runtime when the
custom attribute is resolved. The following example assigns values depending on
group membership. All hosts in the `slow-lan` host group use 300 as value for
`ping_wrta` and 500 for `ping_crta`; all other hosts use 100 and 200.

    globals.group_specific_value = function(group, group_value, non_group_value) {
        return function() use (group, group_value, non_group_value) {
            if (group in host.groups) {
                return group_value
            } else {
                return non_group_value
            }
        }
    }

    apply Service "ping4" {
        import "generic-service"
        check_command = "ping4"

        vars.ping_wrta = group_specific_value("slow-lan", 300, 100)
        vars.ping_crta = group_specific_value("slow-lan", 500, 200)

        assign where true
    }

#### Use Functions in Assign Where Expressions <a id="use-functions-assign-where"></a>

If a simple expression for matching a name or checking if an item
exists in an array or dictionary does not fit, you should consider
writing your own global [functions](17-language-reference.md#functions).
You can call them inside `assign where` and `ignore where` expressions
for [apply rules](03-monitoring-basics.md#using-apply-expressions) or
[group assignments](03-monitoring-basics.md#group-assign-intro) just like
any other global function, for example [match](18-library-reference.md#global-functions-match).

The following example requires the host `myprinter` to be added
to the host group `printers-lexmark`, but only if the host uses
a template matching the name `lexmark*`.

    template Host "lexmark-printer-host" {
      vars.printer_type = "Lexmark"
    }

    object Host "myprinter" {
      import "generic-host"
      import "lexmark-printer-host"

      address = "192.168.1.1"
    }

    /* register a global function for the assign where call */
    globals.check_host_templates = function(host, search) {
      /* iterate over all host templates and check if the search matches */
      for (tmpl in host.templates) {
        if (match(search, tmpl)) {
          return true
        }
      }

      /* nothing matched */
      return false
    }

    object HostGroup "printers-lexmark" {
      display_name = "Lexmark Printers"
      /* call the global function and pass the arguments */
      assign where check_host_templates(host, "lexmark*")
    }
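
The same function can also drive an apply rule. A minimal sketch, assuming you want a ping check only for hosts using a `lexmark*` template; the service name `printer-ping` is hypothetical:

```
apply Service "printer-ping" {
  import "generic-service"
  check_command = "ping4"

  /* reuse check_host_templates() from the example above */
  assign where check_host_templates(host, "lexmark*")
}
```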

Take a different, more complex example: all hosts with the
custom attribute `vars_app` as a nested dictionary should be
added to the host group `ABAP-app-server`, but only if at least
one of its entries has `app_type` set to `ABAP`.

Written as a wildcard match over the nested dictionaries, the condition
could read like this (such wildcard keys do not exist in the language):

    where host.vars.vars_app["*"].app_type == "ABAP"

The solution for this problem is to register a global
function which checks the `app_type` of all entries for hosts
with the `vars_app` dictionary.

    object Host "appserver01" {
      check_command = "dummy"
      vars.vars_app["ABC"] = { app_type = "ABAP" }
    }
    object Host "appserver02" {
      check_command = "dummy"
      vars.vars_app["DEF"] = { app_type = "ABAP" }
    }

    globals.check_app_type = function(host, type) {
      /* ensure that other hosts without the custom attribute do not match */
      if (typeof(host.vars.vars_app) != Dictionary) {
        return false
      }

      /* iterate over the vars_app dictionary */
      for (key => val in host.vars.vars_app) {
        /* match if the value is a dictionary and its app_type is the requested type */
        if (typeof(val) == Dictionary && val.app_type == type) {
          return true
        }
      }

      /* nothing matched */
      return false
    }

    object HostGroup "ABAP-app-server" {
      assign where check_app_type(host, "ABAP")
    }
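
`check_app_type()` returns `true` as soon as one entry matches. If every entry in `vars_app` must use the requested type, a variant could look like the following sketch; the function name `check_app_type_all` and the host group `ABAP-only-app-server` are hypothetical:

```
globals.check_app_type_all = function(host, type) {
  /* hosts without a non-empty vars_app dictionary never match */
  if (typeof(host.vars.vars_app) != Dictionary || len(host.vars.vars_app) == 0) {
    return false
  }

  for (key => val in host.vars.vars_app) {
    /* a single entry with a different or missing app_type rules the host out */
    if (typeof(val) != Dictionary || val.app_type != type) {
      return false
    }
  }

  /* every entry matched */
  return true
}

object HostGroup "ABAP-only-app-server" {
  assign where check_app_type_all(host, "ABAP")
}
```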

#### Use Functions in Command Arguments set_if <a id="use-functions-command-arguments-setif"></a>

The `set_if` attribute inside the command arguments definition in the
[CheckCommand object definition](09-object-types.md#objecttype-checkcommand) is primarily used to
evaluate whether the command parameter should be set or not.

By default you can evaluate runtime macros for their existence. If the result is not an empty
string, the command parameter is passed. This becomes fairly complicated when you want to evaluate
multiple conditions and attributes.

The following example was found on the community support channels. The user had defined a host
dictionary named `compellent` with the key `disks`. This was then used inside service apply for rules.

    object Host "dict-host" {
      check_command = "check_compellent"
      vars.compellent["disks"] = {
        file = "/var/lib/check_compellent/san_disks.0.json",
        checks = ["disks"]
      }
    }

The more significant problem was to only add the command parameter `--disks` to the plugin call
when the dictionary `compellent` contains the key `disks`, and omit it if not found.

By defining `set_if` as an [abbreviated lambda function](17-language-reference.md#nullary-lambdas)
which checks whether the host custom attribute `compellent` contains the key `disks`, this problem was
solved like this:

    object CheckCommand "check_compellent" {
      command = [ "/usr/bin/check_compellent" ]
      arguments = {
        "--disks" = {
          set_if = {{
            var host_vars = host.vars
            log(host_vars)
            var compel = host_vars.compellent
            log(compel)
            compel.contains("disks")
          }}
        }
      }
    }

This implementation uses the dictionary type method [contains](18-library-reference.md#dictionary-contains)
and will fail if `host.vars.compellent` is not of the type `Dictionary`.
You can guard against this by checking the type with the [typeof](17-language-reference.md#types) function first.

You can test the types using the `icinga2 console`:

    # icinga2 console
    Icinga (version: v2.3.0-193-g3eb55ad)
    <1> => srv_vars.compellent["check_a"] = { file="outfile_a.json", checks = [ "disks", "fans" ] }
    null
    <2> => srv_vars.compellent["check_b"] = { file="outfile_b.json", checks = [ "power", "voltages" ] }
    null
    <3> => typeof(srv_vars.compellent)
    type 'Dictionary'
    <4> =>

The more programmatic approach for `set_if` could look like this. This variant reads the
custom attributes of the service instead of the host and replaces the `check_compellent`
definition above:

    object CheckCommand "check_compellent" {
      command = [ "/usr/bin/check_compellent" ]
      arguments = {
        "--disks" = {
          set_if = {{
            var srv_vars = service.vars
            if (len(srv_vars) > 0) {
              if (typeof(srv_vars.compellent) == Dictionary) {
                return srv_vars.compellent.contains("disks")
              } else {
                log(LogInformation, "checkcommand set_if", "custom attribute compellent is not a dictionary, ignoring it.")
                return false
              }
            } else {
              log(LogWarning, "checkcommand set_if", "empty custom attributes")
              return false
            }
          }}
        }
      }
    }

#### Use Functions as Command Attribute <a id="use-functions-command-attribute"></a>

Functions as `command` attribute come in handy for [NotificationCommands](09-object-types.md#objecttype-notificationcommand)
or [EventCommands](09-object-types.md#objecttype-eventcommand), which do not require
a returned check result including state/output.

The following example was taken from the community support channels. The requirement was to
specify a custom attribute inside the notification apply rule and decide which notification
script to call based on that.

    object User "short-dummy" {
    }

    object UserGroup "short-dummy-group" {
      assign where user.name == "short-dummy"
    }

    apply Notification "mail-admins-short" to Host {
      import "mail-host-notification"
      command = "mail-host-notification-test"
      user_groups = [ "short-dummy-group" ]
      vars.short = true
      assign where host.vars.notification.mail
    }

The solution is fairly simple: the `command` attribute is implemented as a function returning
the command array that Icinga 2 expects.
The local variable `mailscript` sets the default value for the notification script location.
If the notification custom attribute `short` is set, it overrides the local variable `mailscript`
with a new value.
The `mailscript` variable is then used to compute the final notification command array being
returned.

You can omit the `log()` calls; they only help with debugging.

    object NotificationCommand "mail-host-notification-test" {
      command = {{
        log("command as function")
        var mailscript = "mail-host-notification-long.sh"
        if (notification.vars.short) {
          mailscript = "mail-host-notification-short.sh"
        }
        log("Running command")
        log(mailscript)

        var cmd = [ SysconfDir + "/icinga2/scripts/" + mailscript ]
        log(LogInformation, "mail-host-notification-test", cmd)
        return cmd
      }}

      env = {
      }
    }
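
Because the command function falls back to the long mail script whenever `vars.short` is not set, a second notification apply rule can reuse the same NotificationCommand. A minimal sketch, assuming a hypothetical user group `long-dummy-group` exists:

```
apply Notification "mail-admins-long" to Host {
  import "mail-host-notification"
  command = "mail-host-notification-test"
  /* hypothetical user group receiving the long notification */
  user_groups = [ "long-dummy-group" ]
  /* vars.short is not set, so the command function picks the long script */
  assign where host.vars.notification.mail
}
```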

### Access Object Attributes at Runtime <a id="access-object-attributes-at-runtime"></a>

The [Object Accessor Functions](18-library-reference.md#object-accessor-functions)
can be used to retrieve references to other objects by name.

This allows you to access configuration and runtime object attributes. A detailed
list of the available attributes can be found in the [object types](09-object-types.md#object-types) chapter.

#### Access Object Attributes at Runtime: Cluster Check <a id="access-object-attributes-at-runtime-cluster-check"></a>

This is a simple cluster example for accessing two host object states and calculating a virtual
cluster state and output:

```
object Host "cluster-host-01" {
  check_command = "dummy"
  vars.dummy_state = 2
  vars.dummy_text = "This host is down."
}

object Host "cluster-host-02" {
  check_command = "dummy"
  vars.dummy_state = 0
  vars.dummy_text = "This host is up."
}

object Host "cluster" {
  check_command = "dummy"
  vars.cluster_nodes = [ "cluster-host-01", "cluster-host-02" ]

  vars.dummy_state = {{
    var up_count = 0
    var down_count = 0
    var cluster_nodes = macro("$cluster_nodes$")

    for (node in cluster_nodes) {
      if (get_host(node).state > 0) {
        down_count += 1
      } else {
        up_count += 1
      }
    }

    if (up_count >= down_count) {
      return 0 // at least as many nodes up as down -> UP
    } else {
      return 2 // more nodes down than up -> something is broken
    }
  }}

  vars.dummy_text = {{
    var output = "Cluster hosts:\n"
    var cluster_nodes = macro("$cluster_nodes$")

    for (node in cluster_nodes) {
      output += node + ": " + get_host(node).last_check_result.output + "\n"
    }

    return output
  }}
}
```
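
`get_host()` returns `null` if no host with the given name exists, and a host's `last_check_result` is empty until it has been checked at least once. If you want the cluster output to state this explicitly, a more defensive variant of the `dummy_text` lambda could look like the following sketch. The host name `cluster-defensive` is hypothetical, and only the output part is shown; the state lambda from above can be added unchanged.

```
object Host "cluster-defensive" {
  check_command = "dummy"
  vars.cluster_nodes = [ "cluster-host-01", "cluster-host-02" ]

  vars.dummy_text = {{
    var output = "Cluster hosts:\n"
    var cluster_nodes = macro("$cluster_nodes$")

    for (node in cluster_nodes) {
      var node_host = get_host(node)

      if (!node_host) {
        /* no host object with this name */
        output += node + ": host object not found\n"
      } else if (!node_host.last_check_result) {
        /* the host has not been checked yet */
        output += node + ": no check result available yet\n"
      } else {
        output += node + ": " + node_host.last_check_result.output + "\n"
      }
    }

    return output
  }}
}
```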

#### Time Dependent Thresholds <a id="access-object-attributes-at-runtime-time-dependent-thresholds"></a>

The following example sets time dependent thresholds for the load check, based on whether the
current time of day is inside the defined time period.

```
object TimePeriod "backup" {
  ranges = {
    monday = "02:00-03:00"
    tuesday = "02:00-03:00"
    wednesday = "02:00-03:00"
    thursday = "02:00-03:00"
    friday = "02:00-03:00"
    saturday = "02:00-03:00"
    sunday = "02:00-03:00"
  }
}

object Host "webserver-with-backup" {
  check_command = "hostalive"
  address = "127.0.0.1"
}

object Service "webserver-backup-load" {
  check_command = "load"
  host_name = "webserver-with-backup"

  vars.load_wload1 = {{
    if (get_time_period("backup").is_inside) {
      return 20
    } else {
      return 5
    }
  }}
  vars.load_cload1 = {{
    if (get_time_period("backup").is_inside) {
      return 40
    } else {
      return 10
    }
  }}
}
```
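
Both lambdas repeat the same time period check. Combined with the closure technique from [Use Custom Functions as Attribute](08-advanced-topics.md#custom-functions-as-attribute), a global helper can return the matching value instead. A minimal sketch; the function name `time_period_value` and the service name are hypothetical, and the function would typically live in `functions.conf`:

```
/* returns a function that is evaluated when the custom attribute is resolved */
globals.time_period_value = function(period, inside_value, outside_value) {
  return function() use (period, inside_value, outside_value) {
    if (get_time_period(period).is_inside) {
      return inside_value
    } else {
      return outside_value
    }
  }
}

object Service "webserver-backup-load-helper" {
  check_command = "load"
  host_name = "webserver-with-backup"

  vars.load_wload1 = time_period_value("backup", 20, 5)
  vars.load_cload1 = time_period_value("backup", 40, 10)
}
```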

## Advanced Value Types <a id="advanced-value-types"></a>

In addition to the default value types, Icinga 2 also uses a few other types
to represent its internal state. The following types are exposed via the [API](12-icinga2-api.md#icinga2-api).

### CheckResult <a id="advanced-value-types-checkresult"></a>

  Name                      | Type                  | Description
  --------------------------|-----------------------|----------------------------------
  exit\_status              | Number                | The exit status returned by the check execution.
  output                    | String                | The check output.
  performance\_data         | Array                 | Array of [performance data values](08-advanced-topics.md#advanced-value-types-perfdatavalue).
  check\_source             | String                | Name of the node executing the check.
  state                     | Number                | The current state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
  command                   | Value                 | The executed command: either an array of the command and its shell-escaped arguments, or a command line string.
  execution\_start          | Timestamp             | Check execution start time (as a UNIX timestamp).
  execution\_end            | Timestamp             | Check execution end time (as a UNIX timestamp).
  schedule\_start           | Timestamp             | Scheduled check execution start time (as a UNIX timestamp).
  schedule\_end             | Timestamp             | Scheduled check execution end time (as a UNIX timestamp).
  active                    | Boolean               | Whether the result is from an active or passive check.
  vars\_before              | Dictionary            | Internal attribute used for calculations.
  vars\_after               | Dictionary            | Internal attribute used for calculations.
  ttl                       | Number                | Time-to-live duration in seconds for this check result. The next check result is expected at `now + ttl`, which is when freshness checks are executed.
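
Inside the configuration, these fields can be read through an object's `last_check_result` attribute. A minimal sketch that summarizes the last host check result of `webserver-with-backup` from the time dependent thresholds example; the service name `backup-host-summary` is hypothetical:

```
object Service "backup-host-summary" {
  check_command = "dummy"
  host_name = "webserver-with-backup"

  vars.dummy_text = {{
    var cr = get_host("webserver-with-backup").last_check_result

    /* last_check_result is empty until the host has been checked */
    if (!cr) {
      return "No check result available yet."
    }

    return "Exit status " + cr.exit_status + " reported by " + cr.check_source + ": " + cr.output
  }}
}
```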

### PerfdataValue <a id="advanced-value-types-perfdatavalue"></a>

Icinga 2 parses performance data strings returned by check plugins and makes the information available to external interfaces (e.g. [GraphiteWriter](09-object-types.md#objecttype-graphitewriter) or the [Icinga 2 API](12-icinga2-api.md#icinga2-api)).

  Name                      | Type                  | Description
  --------------------------|-----------------------|----------------------------------
  label                     | String                | Performance data label.
  value                     | Number                | Normalized performance data value without unit.
  counter                   | Boolean               | Enabled if the original value contains `c` as unit. Defaults to `false`.
  unit                      | String                | Unit of measurement (`seconds`, `bytes`, `percent`) according to the [plugin API](05-service-monitoring.md#service-monitoring-plugin-api).
  crit                      | Value                 | Critical threshold value.
  warn                      | Value                 | Warning threshold value.
  min                       | Value                 | Minimum value returned by the check.
  max                       | Value                 | Maximum value returned by the check.