
   1 # Advanced Topics <a id="advanced-topics"></a>
   2
   3 This chapter covers a number of advanced topics. If you're new to Icinga, you
   4 can safely skip over things you're not interested in.
   5
   6 ## Downtimes <a id="downtimes"></a>
   7
   8 Downtimes can be scheduled for planned server maintenance or
   9 any other targeted service outage you are aware of in advance.
  10
  11 Downtimes will suppress any notifications, and may trigger other
  12 downtimes too. If the downtime was set by accident, or the duration
  13 exceeds the maintenance, you can manually cancel the downtime.
  14 Planned downtimes will also be taken into account for SLA reporting
  15 tools calculating the SLAs based on the state and downtime history.
  16
Multiple downtimes for a single object may overlap. This is useful
when a maintenance window takes longer than expected and you want to extend it.
If multiple downtimes are triggered for one object, the overall downtime depth
is greater than `1`.
  21
  22
  23 If the downtime was scheduled after the problem changed to a critical hard
  24 state triggering a problem notification, and the service recovers during
  25 the downtime window, the recovery notification won't be suppressed.
  26
  27 ### Fixed and Flexible Downtimes <a id="fixed-flexible-downtimes"></a>
  28
A `fixed` downtime is activated at the defined start time and
removed at the end time. If the service state changes to `NOT-OK`
during this time window, the downtime is triggered:
notifications are suppressed and the downtime depth is incremented.
  33
Common scenarios are a planned distribution upgrade on your Linux
servers, or database updates in your warehouse. The customer knows
about a fixed downtime window between 23:00 and 24:00. After 24:00
all problems should be alerted again. The solution is simple:
schedule a `fixed` downtime starting at 23:00 and ending at 24:00.
  39
  40 Unlike a `fixed` downtime, a `flexible` downtime will be triggered
  41 by the state change in the time span defined by start and end time,
  42 and then last for the specified duration in minutes.
  43
Imagine the following scenario: Your service is frequently polled
by users trying to grab free, deleted domains for immediate registration.
Between 07:30 and 08:00 the impact will hit for 15 minutes and generate
a network outage visible to the monitoring. The service is still alive,
but answers Icinga 2 service checks too slowly.
For that reason, you may want to schedule a downtime between 07:30 and
08:00 with a duration of 15 minutes. The downtime will then last from
its trigger time until the duration is over. After that, the downtime
is removed, which may happen before or after the scheduled end time.
  53
  54 ### Scheduling a downtime <a id="scheduling-downtime"></a>
  55
  56 You can schedule a downtime either by using the Icinga 2 API action
  57 [schedule-downtime](12-icinga2-api.md#icinga2-api-actions-schedule-downtime) or
  58 by sending an [external command](14-features.md#external-commands).
  59
  60
  61 #### Fixed Downtime <a id="fixed-downtime"></a>
  62
If the host/service changes into a NOT-OK state between the start and
end time window, the downtime will be marked as `in effect` and
the downtime depth counter is increased.
  66
  67 ```
  68    |       |         |
  69 start      |        end
  70        trigger time
  71 ```
  72
  73 #### Flexible Downtime <a id="flexible-downtime"></a>
  74
A flexible downtime defines a time window where the downtime may be
triggered by a host/service NOT-OK state change. It will then last
until the specified time duration is reached. This means the scheduled
end time may already have passed while the downtime is still active,
because it ends at `trigger time + duration`.
  80
  81
  82 ```
  83    |       |         |
  84 start      |        end               actual end time
  85            |--------------duration--------|
  86        trigger time
  87 ```
  88
  89
  90 ### Triggered Downtimes <a id="triggered-downtimes"></a>
  91
Triggering is optional when scheduling a downtime. If there is already a downtime
scheduled for a future maintenance, the new downtime can be triggered by
that downtime. This is useful if you have scheduled a host downtime and
are now scheduling a child host's downtime, which is then triggered by the parent
downtime on a `NOT-OK` state change.
  97
  98 ### Recurring Downtimes <a id="recurring-downtimes"></a>
  99
 100 [ScheduledDowntime objects](09-object-types.md#objecttype-scheduleddowntime) can be used to set up
 101 recurring downtimes for services.
 102
 103 Example:
 104
 105     apply ScheduledDowntime "backup-downtime" to Service {
 106       author = "icingaadmin"
 107       comment = "Scheduled downtime for backup"
 108
 109       ranges = {
 110         monday = "02:00-03:00"
 111         tuesday = "02:00-03:00"
 112         wednesday = "02:00-03:00"
 113         thursday = "02:00-03:00"
 114         friday = "02:00-03:00"
 115         saturday = "02:00-03:00"
 116         sunday = "02:00-03:00"
 117       }
 118
 119       assign where "backup" in service.groups
 120     }
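
The flexible scenario described earlier can presumably be made recurring as well,
since the `ScheduledDowntime` object type also provides the `fixed` and `duration`
attributes. The following is a minimal sketch, not a definitive configuration: the
object name `domain-rush-downtime` and the service group `domain-registration` are
hypothetical and need to be adapted to your setup.

```
apply ScheduledDowntime "domain-rush-downtime" to Service {
  author = "icingaadmin"
  comment = "Flexible downtime for the morning domain rush"

  /* Flexible downtime: triggered by a NOT-OK state change within the ranges below. */
  fixed = false
  /* Once triggered, the downtime lasts for 15 minutes. */
  duration = 15m

  ranges = {
    monday = "07:30-08:00"
    tuesday = "07:30-08:00"
    wednesday = "07:30-08:00"
    thursday = "07:30-08:00"
    friday = "07:30-08:00"
    saturday = "07:30-08:00"
    sunday = "07:30-08:00"
  }

  /* Hypothetical service group, replace with your own. */
  assign where "domain-registration" in service.groups
}
```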
 121
 122
 123 ## Comments <a id="comments-intro"></a>
 124
 125 Comments can be added at runtime and are persistent over restarts. You can
 126 add useful information for others on repeating incidents (for example
 127 "last time syslog at 100% cpu on 17.10.2013 due to stale nfs mount") which
 128 is primarily accessible using web interfaces.
 129
 130 You can add a comment either by using the Icinga 2 API action
 131 [add-comment](12-icinga2-api.md#icinga2-api-actions-add-comment) or
 132 by sending an [external command](14-features.md#external-commands).
 133
 134 ## Acknowledgements <a id="acknowledgements"></a>
 135
 136 If a problem persists and notifications have been sent, you can
 137 acknowledge the problem. That way other users will get
 138 a notification that you're aware of the issue and probably are
 139 already working on a fix.
 140
 141 Note: Acknowledgements also add a new [comment](08-advanced-topics.md#comments-intro)
 142 which contains the author and text fields.
 143
 144 You can send an acknowledgement either by using the Icinga 2 API action
 145 [acknowledge-problem](12-icinga2-api.md#icinga2-api-actions-acknowledge-problem) or
 146 by sending an [external command](14-features.md#external-commands).
 147
 148
 149 ### Sticky Acknowledgements <a id="sticky-acknowledgements"></a>
 150
 151 The acknowledgement is removed if a state change occurs or if the host/service
 152 recovers (OK/Up state).
 153
 154 If you acknowledge a problem once you've received a `Critical` notification,
 155 the acknowledgement will be removed if there is a state transition to `Warning`.
 156 ```
 157 OK -> WARNING -> CRITICAL -> WARNING -> OK
 158 ```
 159
 160 If you prefer to keep the acknowledgement until the problem is resolved (`OK`
 161 recovery) you need to enable the `sticky` parameter.
 162
 163
 164 ### Expiring Acknowledgements <a id="expiring-acknowledgements"></a>
 165
 166 Once a problem is acknowledged it may disappear from your `handled problems`
 167 dashboard and no-one ever looks at it again since it will suppress
 168 notifications too.
 169
 170 This `fire-and-forget` action is quite common. If you're sure that a
 171 current problem should be resolved in the future at a defined time,
 172 you can define an expiration time when acknowledging the problem.
 173
 174 Icinga 2 will clear the acknowledgement when expired and start to
 175 re-notify, if the problem persists.
 176
 177
 178 ## Time Periods <a id="timeperiods"></a>
 179
[Time Periods](09-object-types.md#objecttype-timeperiod) define
time ranges in Icinga where event actions are triggered. For
example, the `check_period` attribute controls whether a service check
is executed or not. Whether a notification is sent to users or not
is filtered by the `period` and `notification_period`
configuration attributes for `Notification` and `User` objects.
 186
 187 > **Note**
 188 >
 189 > If you are familiar with Icinga 1.x, these time period definitions
 190 > are called `legacy timeperiods` in Icinga 2.
 191 >
> An Icinga 2 legacy timeperiod requires the `ITL` provided template
> `legacy-timeperiod`.
 194
 195 The `TimePeriod` attribute `ranges` may contain multiple directives,
 196 including weekdays, days of the month, and calendar dates.
 197 These types may overlap/override other types in your ranges dictionary.
 198
 199 The descending order of precedence is as follows:
 200
 201 * Calendar date (2008-01-01)
 202 * Specific month date (January 1st)
 203 * Generic month date (Day 15)
 204 * Offset weekday of specific month (2nd Tuesday in December)
 205 * Offset weekday (3rd Monday)
 206 * Normal weekday (Tuesday)
 207
 208 If you don't set any `check_period` or `notification_period` attribute
 209 on your configuration objects, Icinga 2 assumes `24x7` as time period
 210 as shown below.
 211
 212     object TimePeriod "24x7" {
 213       import "legacy-timeperiod"
 214
 215       display_name = "Icinga 2 24x7 TimePeriod"
 216       ranges = {
 217         "monday"    = "00:00-24:00"
 218         "tuesday"   = "00:00-24:00"
 219         "wednesday" = "00:00-24:00"
 220         "thursday"  = "00:00-24:00"
 221         "friday"    = "00:00-24:00"
 222         "saturday"  = "00:00-24:00"
 223         "sunday"    = "00:00-24:00"
 224       }
 225     }
 226
 227 If your operation staff should only be notified during workhours,
 228 create a new timeperiod named `workhours` defining a work day from
 229 09:00 to 17:00.
 230
 231     object TimePeriod "workhours" {
 232       import "legacy-timeperiod"
 233
 234       display_name = "Icinga 2 8x5 TimePeriod"
 235       ranges = {
 236         "monday"    = "09:00-17:00"
 237         "tuesday"   = "09:00-17:00"
 238         "wednesday" = "09:00-17:00"
 239         "thursday"  = "09:00-17:00"
 240         "friday"    = "09:00-17:00"
 241       }
 242     }
 243
 244 Furthermore if you wish to specify a notification period across midnight,
 245 you can define it the following way:
 246
    object TimePeriod "across-midnight" {
      import "legacy-timeperiod"

      display_name = "Nightly Notification"
      ranges = {
        "saturday" = "22:00-24:00"
        "sunday" = "00:00-03:00"
      }
    }
 256
 257 Below you can see another example for configuring timeperiods across several
 258 days, weeks or months. This can be useful when taking components offline
 259 for a distinct period of time.
 260
    object TimePeriod "standby" {
      import "legacy-timeperiod"

      display_name = "Standby"
      ranges = {
        "2016-09-30 - 2016-10-30" = "00:00-24:00"
      }
    }
 269
 270 Please note that the spaces before and after the dash are mandatory.
 271
Once your time period is configured you can use the `period` attribute
to assign time periods to `Notification` and `Dependency` objects:
 274
 275     object Notification "mail" {
 276       import "generic-notification"
 277
 278       host_name = "localhost"
 279
 280       command = "mail-notification"
 281       users = [ "icingaadmin" ]
 282       period = "workhours"
 283     }
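
The `period` attribute is used the same way on `Dependency` objects. The sketch
below is illustrative only and assumes a hypothetical parent host `dsl-router`
with a `ping4` service; the dependency would then only be in effect during
`workhours`.

```
apply Dependency "internet" to Service {
  /* Hypothetical parent objects, adjust to your environment. */
  parent_host_name = "dsl-router"
  parent_service_name = "ping4"

  disable_checks = true

  /* The dependency only applies inside this time period. */
  period = "workhours"

  assign where host.name != "dsl-router"
}
```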
 284
 285 ### Time Periods Inclusion and Exclusion <a id="timeperiods-includes-excludes"></a>
 286
 287 Sometimes it is necessary to exclude certain time ranges from
 288 your default time period definitions, for example, if you don't
 289 want to send out any notification during the holiday season,
 290 or if you only want to allow small time windows for executed checks.
 291
 292 The [TimePeriod object](09-object-types.md#objecttype-timeperiod)
 293 provides the `includes` and `excludes` attributes to solve this issue.
 294 `prefer_includes` defines whether included or excluded time periods are
 295 preferred.
 296
 297 The following example defines a time period called `holidays` where
 298 notifications should be suppressed:
 299
 300     object TimePeriod "holidays" {
 301       import "legacy-timeperiod"
 302
 303       ranges = {
 304         "january 1" = "00:00-24:00"                 //new year's day
 305         "july 4" = "00:00-24:00"                    //independence day
 306         "december 25" = "00:00-24:00"               //christmas
 307         "december 31" = "18:00-24:00"               //new year's eve (6pm+)
 308         "2017-04-16" = "00:00-24:00"                //easter 2017
 309         "monday -1 may" = "00:00-24:00"             //memorial day (last monday in may)
 310         "monday 1 september" = "00:00-24:00"        //labor day (1st monday in september)
 311         "thursday 4 november" = "00:00-24:00"       //thanksgiving (4th thursday in november)
 312       }
 313     }
 314
In addition to that the time period `weekends-excluded` defines an additional
time window which should be excluded from notifications:
 317
 318     object TimePeriod "weekends-excluded" {
 319       import "legacy-timeperiod"
 320
 321       ranges = {
 322         "saturday"  = "00:00-09:00,18:00-24:00"
 323         "sunday"    = "00:00-09:00,18:00-24:00"
 324       }
 325     }
 326
 327 The time period `prod-notification` defines the default time ranges
 328 and adds the excluded time period names as an array.
 329
 330     object TimePeriod "prod-notification" {
 331       import "legacy-timeperiod"
 332
 333       excludes = [ "holidays", "weekends-excluded" ]
 334
 335       ranges = {
 336         "monday"    = "00:00-24:00"
 337         "tuesday"   = "00:00-24:00"
 338         "wednesday" = "00:00-24:00"
 339         "thursday"  = "00:00-24:00"
 340         "friday"    = "00:00-24:00"
 341         "saturday"  = "00:00-24:00"
 342         "sunday"    = "00:00-24:00"
 343       }
 344     }
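
The `includes` and `prefer_includes` attributes mentioned above are not shown in
these examples. As a minimal sketch, assuming the `workhours` and `holidays` time
periods defined earlier in this chapter and a hypothetical object name
`support-hours`, they could be combined like this:

```
object TimePeriod "support-hours" {
  import "legacy-timeperiod"

  /* Add the ranges of "workhours", remove the ranges of "holidays". */
  includes = [ "workhours" ]
  excludes = [ "holidays" ]

  /* Where both overlap, let the excluded time period win. */
  prefer_includes = false

  ranges = {
    "saturday" = "10:00-14:00"
  }
}
```

With `prefer_includes = false`, a holiday that falls on a weekday should be
excluded even though `workhours` covers it. Verify the resulting behaviour in
your environment before relying on it.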
 345
 346 ## External Check Results <a id="external-check-results"></a>
 347
 348 Hosts or services which do not actively execute a check plugin to receive
 349 the state and output are called "passive checks" or "external check results".
 350 In this scenario an external client or script is sending in check results.
 351
 352 You can feed check results into Icinga 2 with the following transport methods:
 353
 354 * [process-check-result action](12-icinga2-api.md#icinga2-api-actions-process-check-result) available with the [REST API](12-icinga2-api.md#icinga2-api) (remote and local)
 355 * External command sent via command pipe (local only)
 356
Each time a new check result is received, the next expected check time
is updated. This means that if no check result is received from
the external source, Icinga 2 will execute [freshness checks](08-advanced-topics.md#check-result-freshness).
 360
 361 > **Note**
 362 >
> The REST API action allows you to specify the `check_source` attribute,
> which helps identify the external sender. This is also visible
> in Icinga Web 2 and the REST API queries.
 366
 367 ## Check Result Freshness <a id="check-result-freshness"></a>
 368
In Icinga 2, active check freshness is enabled by default. A check result is considered
stale when no new check result has arrived within the period defined by the `check_interval` attribute.
 371
 372 The threshold is calculated based on the last check execution time for actively executed checks:
 373
 374     (last check execution time + check interval) > current time
 375
 376 If this host/service receives check results from an [external source](08-advanced-topics.md#external-check-results),
 377 the threshold is based on the last time a check result was received:
 378
 379     (last check result time + check interval) > current time
 380
 381 > **Tip**
 382 >
> The [process-check-result](12-icinga2-api.md#icinga2-api-actions-process-check-result) REST API
> action allows you to overrule the pre-defined check interval with a specified TTL in Icinga 2 v2.9+.
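
To make the threshold formulas above more concrete, the following sketch reads
them as "a result counts as fresh while the expression is true". The function
`is_fresh` is a hypothetical helper written for illustration only; it is not
part of Icinga 2, and the timestamp is an arbitrary number of seconds.

```
/* Hypothetical helper, not part of Icinga 2: mirrors the threshold formula above. */
function is_fresh(last_result_time, check_interval, current_time) {
  return (last_result_time + check_interval) > current_time
}

/* Arbitrary example timestamp in seconds. */
var last_result = 1000

is_fresh(last_result, 1m, last_result + 30) /* true, still within the check interval */
is_fresh(last_result, 1m, last_result + 90) /* false, the result is stale */
```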
 385
 386 If the freshness checks fail, Icinga 2 will execute the defined check command.
 387
 388 Best practice is to define a [dummy](10-icinga-template-library.md#itl-dummy) `check_command` which gets
 389 executed when freshness checks fail.
 390
 391 ```
 392 apply Service "external-check" {
 393   check_command = "dummy"
 394   check_interval = 1m
 395
 396   /* Set the state to UNKNOWN (3) if freshness checks fail. */
 397   vars.dummy_state = 3
 398
 399   /* Use a runtime function to retrieve the last check time and more details. */
 400   vars.dummy_text = {{
 401     var service = get_service(macro("$host.name$"), macro("$service.name$"))
 402     var lastCheck = DateTime(service.last_check).to_string()
 403
 404     return "No check results received. Last result time: " + lastCheck
 405   }}
 406
 407   assign where "external" in host.vars.services
 408 }
 409 ```
 410
 411 References: [get_service](18-library-reference.md#objref-get_service), [macro](18-library-reference.md#scoped-functions-macro), [DateTime](18-library-reference.md#datetime-type).
 412
 413 Example output in Icinga Web 2:
 414
 415 ![Icinga 2 Freshness Checks](images/advanced-topics/icinga2_external_checks_freshness_icingaweb2.png)
 416
 417
 418 ## Check Flapping <a id="check-flapping"></a>
 419
 420 Icinga 2 supports optional detection of hosts and services that are "flapping".
 421
Flapping occurs when a service or host changes state too frequently, which would result in a storm of problem and
recovery notifications. With flapping detection enabled, a flapping notification is sent while other notifications are
suppressed until the object calms down after receiving the same status from checks a few times. Flapping detection can help detect
configuration problems (wrong thresholds), troublesome services, or network problems.
 427
Flapping detection can be enabled or disabled using the `enable_flapping` attribute.
The `flapping_threshold_high` and `flapping_threshold_low` attributes allow you to specify the thresholds that control
when a [host](09-object-types.md#objecttype-host) or [service](09-object-types.md#objecttype-service) is considered to be flapping.
 431
 432 The default thresholds are 30% for high and 25% for low. If the computed flapping value exceeds the high threshold a
 433 host or service is considered flapping until it drops below the low flapping threshold.
 434
`FlappingStart` and `FlappingEnd` notifications will be sent out accordingly, if configured. See the chapter on
[notifications](03-monitoring-basics.md#alert-notifications) for details.
 437
> Note: There is no distinction between hard and soft states with flapping. All state changes count and notifications
> will be sent out regardless of the object's state.
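
As a minimal sketch, flapping detection with the default thresholds written out
explicitly could be configured as follows. The `generic-service` template, the
`mail-service-notification` command and the `icingaadmin` user are assumed to
exist as in the sample configuration; the `assign` rules are only examples.

```
apply Service "http" {
  import "generic-service"
  check_command = "http"

  enable_flapping = true
  /* Thresholds in percent; these are the defaults mentioned above. */
  flapping_threshold_high = 30
  flapping_threshold_low = 25

  assign where host.address
}

apply Notification "flapping-mail" to Service {
  command = "mail-service-notification"
  users = [ "icingaadmin" ]

  /* Only send flapping start/end notifications from this rule. */
  types = [ FlappingStart, FlappingEnd ]

  assign where service.name == "http"
}
```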
 440
 441 ### How it works <a id="check-flapping-how-it-works"></a>
 442
 443 Icinga 2 saves the last 20 state changes for every host and service. See the graphic below:
 444
 445 ![Icinga 2 Flapping State Timeline](images/advanced-topics/flapping-state-graph.png)
 446
All the states are weighted, with the most recent one being worth the most (1.15) and the 20th the least (0.8). The
states in between are fairly distributed. The final flapping value is the sum of the weighted state changes divided by the total
count of 20.
 450
 451 In the example above, the added states would have a total value of 7.82 (`0.84 + 0.86 + 0.88 + 0.9 + 0.98 + 1.06 + 1.12 + 1.18`).
 452 This yields a flapping percentage of 39.1% (`7.82 / 20 * 100`). As the default upper flapping threshold is 30%, it would be
 453 considered flapping.
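
The arithmetic above can be reproduced with a few lines of DSL, for example in a
configuration file or the `icinga2 console`. This is only a sketch of the
calculation as described in this section, using the weights from the example,
and not the internal implementation:

```
/* Weights of the state changes taken from the example above. */
var weights = [ 0.84, 0.86, 0.88, 0.9, 0.98, 1.06, 1.12, 1.18 ]

var weighted_sum = 0
for (weight in weights) {
  weighted_sum += weight
}

/* Divide by the total count of 20 and convert to percent (about 39.1). */
var flapping_percent = weighted_sum / 20 * 100

/* Compare against the default upper threshold of 30%. */
flapping_percent > 30
```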
 454
If the next seven check results were not state changes, the flapping percentage would fall below the lower threshold
of 25% and therefore the host or service would recover from flapping.
 457
 458 ## Volatile Services and Hosts <a id="volatile-services-hosts"></a>
 459
 460 The `volatile` option, if enabled for a host or service, makes it treat every [state change](03-monitoring-basics.md#hard-soft-states)
 461 as a `HARD` state change. It is comparable to `max_check_attempts = 1`. With this any `NOT-OK` result will
 462 ignore `max_check_attempts` and trigger notifications etc. It will further cause any additional `NOT-OK`
 463 result to re-send notifications.
 464
 465 It may be reasonable to have a volatile service which stays in a `HARD` state if the service stays in a `NOT-OK`
 466 state. That way each service recheck will automatically trigger a notification unless the service is acknowledged or
 467 in a scheduled downtime.
 468
A common example is security checks, where each `NOT-OK` check result should immediately trigger a notification.

This option defaults to `false` and should only be enabled when required.
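
A minimal sketch of such a service is shown below. The CheckCommand
`security-audit` and the custom attribute `host.vars.security_audit` are
hypothetical placeholders:

```
apply Service "security-audit" {
  /* Hypothetical CheckCommand, replace with your own security check. */
  check_command = "security-audit"

  /* Treat every state change as a HARD state change. */
  volatile = true

  /* Hypothetical custom attribute marking the hosts to audit. */
  assign where host.vars.security_audit
}
```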
 472
 473
 474 ## Monitoring Icinga 2 <a id="monitoring-icinga"></a>
 475
 476 Why should you do that? Icinga and its components run like any other
 477 service application on your server. There are predictable issues
 478 such as "disk space is running low" and your monitoring suffers from just
 479 that.
 480
 481 You would also like to ensure that features and backends are running
 482 and storing required data. Be it the database backend where Icinga Web 2
 483 presents fancy dashboards, forwarded metrics to Graphite or InfluxDB or
 484 the entire distributed setup.
 485
 486 This list isn't complete but should help with your own setup.
 487 Windows client specific checks are highlighted.
 488
 489 Type            | Description                   | Plugins and CheckCommands
 490 ----------------|-------------------------------|-----------------------------------------------------
 491 System          | Filesystem                    | [disk](10-icinga-template-library.md#plugin-check-command-disk), [disk-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
 492 System          | Memory, Swap                  | [mem](10-icinga-template-library.md#plugin-contrib-command-mem), [swap](10-icinga-template-library.md#plugin-check-command-swap), [memory](10-icinga-template-library.md#windows-plugins) (Windows Client)
 493 System          | Hardware                      | [hpasm](10-icinga-template-library.md#plugin-contrib-command-hpasm), [ipmi-sensor](10-icinga-template-library.md#plugin-contrib-command-ipmi-sensor)
 494 System          | Virtualization                | [VMware](10-icinga-template-library.md#plugin-contrib-vmware), [esxi_hardware](10-icinga-template-library.md#plugin-contrib-command-esxi-hardware)
 495 System          | Processes                     | [procs](10-icinga-template-library.md#plugin-check-command-processes), [service-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
 496 System          | System Activity Reports       | [check_sar_perf](https://github.com/dnsmichi/icinga-plugins/blob/master/scripts/check_sar_perf.py)
 497 System          | I/O                           | [iostat](10-icinga-template-library.md#plugin-contrib-command-iostat)
 498 System          | Network interfaces            | [nwc_health](10-icinga-template-library.md#plugin-contrib-command-nwc_health), [interfaces](10-icinga-template-library.md#plugin-contrib-command-interfaces)
 499 System          | Users                         | [users](10-icinga-template-library.md#plugin-check-command-users), [users-windows](10-icinga-template-library.md#windows-plugins) (Windows Client)
 500 System          | Logs                          | Forward them to [Elastic Stack](14-features.md#elastic-stack-integration) or [Graylog](14-features.md#graylog-integration) and add your own alerts.
 501 System          | NTP                           | [ntp_time](10-icinga-template-library.md#plugin-check-command-ntp-time)
 502 System          | Updates                       | [apt](10-icinga-template-library.md#plugin-check-command-apt), [yum](10-icinga-template-library.md#plugin-contrib-command-yum)
 503 Icinga          | Status & Stats                | [icinga](10-icinga-template-library.md#itl-icinga) (more below)
 504 Icinga          | Cluster & Clients             | [health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks)
 505 Database        | MySQL                         | [mysql_health](10-icinga-template-library.md#plugin-contrib-command-mysql_health)
 506 Database        | PostgreSQL                    | [postgres](10-icinga-template-library.md#plugin-contrib-command-postgres)
 507 Database        | Housekeeping                  | Check the database size and growth and analyse metrics to examine trends.
 508 Database        | DB IDO                        | [ido](10-icinga-template-library.md#itl-icinga-ido) (more below)
 509 Webserver       | Apache2, Nginx, etc.          | [http](10-icinga-template-library.md#plugin-check-command-http), [apache_status](10-icinga-template-library.md#plugin-contrib-command-apache_status), [nginx_status](10-icinga-template-library.md#plugin-contrib-command-nginx_status)
 510 Webserver       | Certificates                  | [http](10-icinga-template-library.md#plugin-check-command-http)
 511 Webserver       | Authorization                 | [http](10-icinga-template-library.md#plugin-check-command-http)
 512 Notifications   | Mail (queue)                  | [smtp](10-icinga-template-library.md#plugin-check-command-smtp), [mailq](10-icinga-template-library.md#plugin-check-command-mailq)
 513 Notifications   | SMS (GSM modem)               | [check_sms3_status](https://exchange.icinga.com/netways/check_sms3status)
 514 Notifications   | Messengers, Cloud services    | XMPP, Twitter, IRC, Telegram, PagerDuty, VictorOps, etc.
 515 Metrics         | PNP, RRDTool                  | [check_pnp_rrds](https://github.com/lingej/pnp4nagios/tree/master/scripts) checks for stale RRD files.
 516 Metrics         | Graphite                      | [graphite](10-icinga-template-library.md#plugin-contrib-command-graphite)
 517 Metrics         | InfluxDB                      | [check_influxdb](https://exchange.icinga.com/Mikanoshi/InfluxDB+data+monitoring+plugin)
 518 Metrics         | Elastic Stack                 | [elasticsearch](10-icinga-template-library.md#plugin-contrib-command-elasticsearch), [Elastic Stack integration](14-features.md#elastic-stack-integration)
 519 Metrics         | Graylog                       | [Graylog integration](14-features.md#graylog-integration)
 520
 521
 522 The [icinga](10-icinga-template-library.md#itl-icinga) CheckCommand provides metrics for the runtime stats of
 523 Icinga 2. You can forward them to your preferred graphing solution.
 524 If you require more metrics you can also query the [REST API](12-icinga2-api.md#icinga2-api) and write
 525 your own custom check plugin. Or you keep using the built-in [object accessor functions](08-advanced-topics.md#access-object-attributes-at-runtime)
 526 to calculate stats in-memory.
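
A service using the `icinga` CheckCommand could look like the following sketch;
the `assign` rule matching `master*.localdomain` is only an example:

```
apply Service "icinga" {
  check_command = "icinga"

  assign where match("master*.localdomain", host.name)
}
```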
 527
 528 There is a built-in [ido](10-icinga-template-library.md#itl-icinga-ido) check available for DB IDO MySQL/PostgreSQL
 529 which provides additional metrics for the IDO database.
 530
 531 ```
 532 apply Service "ido-mysql" {
 533   check_command = "ido"
 534
 535   vars.ido_type = "IdoMysqlConnection"
 536   vars.ido_name = "ido-mysql" //the name defined in /etc/icinga2/features-enabled/ido-mysql.conf
 537
 538   assign where match("master*.localdomain", host.name)
 539 }
 540 ```
 541
 542 More specific database queries can be found in the [DB IDO](14-features.md#db-ido) chapter.
 543
 544 Distributed setups should include specific [health checks](06-distributed-monitoring.md#distributed-monitoring-health-checks).
 545 You might also want to add additional checks for SSL certificate expiration.
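
For certificate expiration, one option is the `http` CheckCommand with its
`http_certificate` parameter. The following sketch assumes a hypothetical custom
attribute `host.vars.https_vhost` and an example threshold of 30 days:

```
apply Service "https-certificate" {
  check_command = "http"

  vars.http_vhost = host.vars.https_vhost
  vars.http_ssl = true
  /* Minimum number of days the certificate has to be valid. */
  vars.http_certificate = 30

  /* Hypothetical custom attribute marking hosts that serve HTTPS. */
  assign where host.vars.https_vhost
}
```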
 546
 547
 548 ## Advanced Configuration Hints <a id="advanced-configuration-hints"></a>
 549
 550 ### Advanced Use of Apply Rules <a id="advanced-use-of-apply-rules"></a>
 551
 552 [Apply rules](03-monitoring-basics.md#using-apply) can be used to create a rule set which is
 553 entirely based on host objects and their attributes.
 554 In addition to that [apply for and custom attribute override](03-monitoring-basics.md#using-apply-for)
 555 extend the possibilities.
 556
The following example defines a dictionary on the host object which contains
configuration attributes for multiple web servers. This is then used to add three checks:

* A `ping4` check using the local IP `address` of the web server.
* A `tcp` check querying the TCP port the HTTP service is running on.
* If the `url` key is defined, the third apply for rule will create service objects using the `http` CheckCommand.
In addition to that you can optionally define the `ssl` attribute which enables HTTPS checks.
 564
 565 Host definition:
 566
 567     object Host "webserver01" {
 568       import "generic-host"
 569       address = "192.168.56.200"
 570       vars.os = "Linux"
 571
 572       vars.webserver = {
 573         instance["status"] = {
 574           address = "192.168.56.201"
 575           port = "80"
 576           url = "/status"
 577         }
 578         instance["tomcat"] = {
 579           address = "192.168.56.202"
 580           port = "8080"
 581         }
 582         instance["icingaweb2"] = {
 583           address = "192.168.56.210"
 584           port = "443"
 585           url = "/icingaweb2"
 586           ssl = true
 587         }
 588       }
 589     }
 590
 591 Service apply for definitions:
 592
 593     apply Service "webserver_ping" for (instance => config in host.vars.webserver.instance) {
 594       display_name = "webserver_" + instance
 595       check_command = "ping4"
 596
 597       vars.ping_address = config.address
 598
 599       assign where host.vars.webserver.instance
 600     }
 601
 602     apply Service "webserver_port" for (instance => config in host.vars.webserver.instance) {
 603       display_name = "webserver_" + instance + "_" + config.port
 604       check_command = "tcp"
 605
 606       vars.tcp_address = config.address
 607       vars.tcp_port = config.port
 608
 609       assign where host.vars.webserver.instance
 610     }
 611
 612     apply Service "webserver_url" for (instance => config in host.vars.webserver.instance) {
 613       display_name = "webserver_" + instance + "_" + config.url
 614       check_command = "http"
 615
 616       vars.http_address = config.address
 617       vars.http_port = config.port
 618       vars.http_uri = config.url
 619
 620       if (config.ssl) {
 621         vars.http_ssl = config.ssl
 622       }
 623
 624       assign where config.url != ""
 625     }
 626
 627 The variables defined in the host dictionary are not using the typical custom attribute
 628 prefix recommended for CheckCommand parameters. Instead they are re-used for multiple
 629 service checks in this example.
In addition to defining check parameters this way, you can also enrich the `display_name`
attribute with more details. This will be shown in Icinga Web 2, for example.
 632
 633 ### Use Functions in Object Configuration <a id="use-functions-object-config"></a>
 634
 635 There is a limited scope where functions can be used as object attributes such as:
 636
 637 * As value for [Custom Attributes](03-monitoring-basics.md#custom-attributes-functions)
 638 * Returning boolean expressions for [set_if](08-advanced-topics.md#use-functions-command-arguments-setif) inside command arguments
 639 * Returning a [command](08-advanced-topics.md#use-functions-command-attribute) array inside command objects
 640
 641 The other way around you can create objects dynamically using your own global functions.
 642
 643 > **Note**
 644 >
 645 > Functions called inside command objects share the same global scope as runtime macros.
 646 > Therefore you can access host custom attributes like `host.vars.os`, or any other
 647 > object attribute from inside the function definition used for [set_if](08-advanced-topics.md#use-functions-command-arguments-setif) or [command](08-advanced-topics.md#use-functions-command-attribute).
 648
 649 Tips when implementing functions:
 650
 651 * Use [log()](18-library-reference.md#global-functions-log) to dump variables. You can see the output
 652 inside the `icinga2.log` file depending on your log severity.
 653 * Use the `icinga2 console` to test basic functionality (e.g. iterating over a dictionary)
 654 * Build them step-by-step. You can always refactor your code later on.
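
A minimal sketch of the first two tips: the following lines iterate over a dictionary and log each entry. The dictionary `thresholds` and the facility name `my-debug` are made up for illustration.

```
/* hypothetical test data, not part of any shipped configuration */
var thresholds = { warn = 5, crit = 10 }

for (key => value in thresholds) {
  /* concatenating a string and a number converts the number to a string */
  log(LogInformation, "my-debug", key + " => " + value)
}
```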
 655
 656 #### Register and Use Global Functions <a id="use-functions-global-register"></a>
 657
 658 [Functions](17-language-reference.md#functions) can be registered into the global scope. This makes custom functions available
 659 in objects and other functions. Keep in mind that these functions are not marked
 660 as side-effect-free and as such are not available via the REST API.
 661
 662 Add a new configuration file `functions.conf` and include it into the [icinga2.conf](04-configuring-icinga-2.md#icinga2-conf)
 663 configuration file in the very beginning, e.g. after `constants.conf`. You can also manage global
 664 functions inside `constants.conf` if you prefer.
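
Assuming the default layout where `icinga2.conf` includes `constants.conf` from the same directory, the include order could look like this sketch:

```
/* icinga2.conf (excerpt) */
include "constants.conf"

/* load global functions early so that later configuration can call them */
include "functions.conf"
```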
 665
 666 The following function converts a given state parameter into a returned string value. The important
 667 bits for registering it into the global scope are:
 668
 669 * `globals.<unique_function_name>` adds a new globals entry.
 670 * `function()` specifies that a call to `state_to_string()` executes a function.
 671 * Function parameters are defined inside the `function()` definition.
 672
 673 ```
 674 globals.state_to_string = function(state) {
 675   if (state == 2) {
 676     return "Critical"
 677   } else if (state == 1) {
 678     return "Warning"
 679   } else if (state == 0) {
 680     return "OK"
 681   } else if (state == 3) {
 682     return "Unknown"
 683   } else {
 684     log(LogWarning, "state_to_string", "Unknown state " + state + " provided.")
 685   }
 686 }
 687 ```
 688
 689 The else-condition allows for better error handling. This warning will be shown in the Icinga 2
 690 log file once the function is called.
 691
 692 > **Note**
 693 >
 694 > If these functions are used in a distributed environment, you must ensure that they are deployed
 695 > everywhere they are needed.
 696
 697 In order to test-drive the newly created function, restart Icinga 2 and use the [debug console](11-cli-commands.md#cli-command-console)
 698 to connect to the REST API.
 699
 700 ```
 701 $ ICINGA2_API_PASSWORD=icinga icinga2 console --connect 'https://root@localhost:5665/'
 702 Icinga 2 (version: v2.8.1-373-g4bea6d25c)
 703 <1> => globals.state_to_string(1)
 704 "Warning"
 705 <2> => state_to_string(2)
 706 "Critical"
 707 ```
 708
 709 You can see that this function is now registered into the [global scope](17-language-reference.md#variable-scopes). The function call
 710 `state_to_string()` can be used in any object at static config compile time or inside runtime
 711 lambda functions.
 712
 713 The following service object example uses the service state and converts it to string output.
 714 The function definition is not optimized; it is written out in full for better readability and includes a log message.
 715
 716 ```
 717 object Service "state-test" {
 718   check_command = "dummy"
 719   host_name = NodeName
 720
 721   vars.dummy_state = 2
 722
 723   vars.dummy_text = {{
 724     var h = macro("$host.name$")
 725     var s = macro("$service.name$")
 726
 727     var state = get_service(h, s).state
 728
 729     log(LogInformation, "dummy_state", "Host: " + h + " Service: " + s + " State: " + state)
 730
 731     return state_to_string(state)
 732   }}
 733 }
 734 ```
 735
 736
 737 #### Use Custom Functions as Attribute <a id="custom-functions-as-attribute"></a>
 738
 739 To use custom functions as attributes, the global function must return another
 740 function, which Icinga 2 then evaluates at runtime. The following example shows how to assign values
 741 depending on group membership. All hosts in the `slow-lan` host group use 300
 742 as value for `ping_wrta`, all other hosts use 100.
 743
 744     globals.group_specific_value = function(group, group_value, non_group_value) {
 745         return function() use (group, group_value, non_group_value) {
 746             if (group in host.groups) {
 747                 return group_value
 748             } else {
 749                 return non_group_value
 750             }
 751         }
 752     }
 753
 754     apply Service "ping4" {
 755         import "generic-service"
 756         check_command = "ping4"
 757
 758         vars.ping_wrta = group_specific_value("slow-lan", 300, 100)
 759         vars.ping_crta = group_specific_value("slow-lan", 500, 200)
 760
 761         assign where true
 762     }
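
The returned inner function captures its arguments via `use` and can access `host` when it is evaluated. The same pattern could be reused for other lookups. The following sketch, with the hypothetical name `host_var_or_default`, returns a host custom attribute if it is set and a fallback value otherwise.

```
/* hypothetical helper, not part of Icinga 2 */
globals.host_var_or_default = function(key, default_value) {
  return function() use (key, default_value) {
    /* only look up the key if the host actually has a vars dictionary */
    if (typeof(host.vars) == Dictionary && host.vars.contains(key)) {
      return host.vars[key]
    }

    return default_value
  }
}

/* usage inside a service apply rule, assuming hosts may set vars.ping_wpl:
 *   vars.ping_wpl = host_var_or_default("ping_wpl", 5)
 */
```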
 763
 764 #### Use Functions in Assign Where Expressions <a id="use-functions-assign-where"></a>
 765
 766 If a simple expression for matching a name or checking if an item
 767 exists in an array or dictionary does not fit, you should consider
 768 writing your own global [functions](17-language-reference.md#functions).
 769 You can call them inside `assign where` and `ignore where` expressions
 770 for [apply rules](03-monitoring-basics.md#using-apply-expressions) or
 771 [group assignments](03-monitoring-basics.md#group-assign-intro) just like
 772 any other global function, for example [match](18-library-reference.md#global-functions-match).
 773
 774 The following example adds the host `myprinter`
 775 to the host group `printers-lexmark`, but only if the host uses
 776 a template matching the name `lexmark*`.
 777
 778     template Host "lexmark-printer-host" {
 779       vars.printer_type = "Lexmark"
 780     }
 781
 782     object Host "myprinter" {
 783       import "generic-host"
 784       import "lexmark-printer-host"
 785
 786       address = "192.168.1.1"
 787     }
 788
 789     /* register a global function for the assign where call */
 790     globals.check_host_templates = function(host, search) {
 791       /* iterate over all host templates and check if the search matches */
 792       for (tmpl in host.templates) {
 793         if (match(search, tmpl)) {
 794           return true
 795         }
 796       }
 797
 798       /* nothing matched */
 799       return false
 800     }
 801
 802     object HostGroup "printers-lexmark" {
 803       display_name = "Lexmark Printers"
 804       /* call the global function and pass the arguments */
 805       assign where check_host_templates(host, "lexmark*")
 806     }
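
The same function can also be called in apply rules, including `ignore where`. A sketch, assuming a hypothetical service `printer-status` and a hypothetical template name pattern `lexmark-legacy*`:

```
apply Service "printer-status" {
  check_command = "dummy"

  /* match every host importing a template named lexmark* ... */
  assign where check_host_templates(host, "lexmark*")
  /* ... except hosts which also import a lexmark-legacy* template */
  ignore where check_host_templates(host, "lexmark-legacy*")
}
```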
 807
 808
 809 Consider a different, more complex example: All hosts with the
 810 custom attribute `vars_app` as nested dictionary should be
 811 added to the host group `ABAP-app-server`, but only if the
 812 `app_type` of at least one entry is set to `ABAP`.
 813
 814 Intuitively, this could read as a wildcard match for nested dictionaries, which the configuration language does not support:
 815
 816     where host.vars.vars_app["*"].app_type == "ABAP"
 817
 818 The solution for this problem is to register a global
 819 function which checks the `app_type` for all hosts
 820 with the `vars_app` dictionary.
 821
 822     object Host "appserver01" {
 823       check_command = "dummy"
 824       vars.vars_app["ABC"] = { app_type = "ABAP" }
 825     }
 826     object Host "appserver02" {
 827       check_command = "dummy"
 828       vars.vars_app["DEF"] = { app_type = "ABAP" }
 829     }
 830
 831     globals.check_app_type = function(host, type) {
 832       /* ensure that other hosts without the custom attribute do not match */
 833       if (typeof(host.vars.vars_app) != Dictionary) {
 834         return false
 835       }
 836
 837       /* iterate over the vars_app dictionary */
 838       for (key => val in host.vars.vars_app) {
 839         /* if the value is a dictionary and it contains the requested app_type */
 840         if (typeof(val) == Dictionary && val.app_type == type) {
 841           return true
 842         }
 843       }
 844
 845       /* nothing matched */
 846       return false
 847     }
 848
 849     object HostGroup "ABAP-app-server" {
 850       assign where check_app_type(host, "ABAP")
 851     }
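
Note that `check_app_type` returns `true` as soon as one entry matches. If every entry must have the requested type, a stricter variant could look like the following sketch. The function name `check_all_app_types` is hypothetical.

```
/* hypothetical stricter variant: every entry must match */
globals.check_all_app_types = function(host, type) {
  /* hosts without the custom attribute do not match */
  if (typeof(host.vars.vars_app) != Dictionary) {
    return false
  }

  /* an empty dictionary should not match either */
  if (len(host.vars.vars_app) == 0) {
    return false
  }

  for (key => val in host.vars.vars_app) {
    /* bail out on the first entry which is not of the requested type */
    if (typeof(val) != Dictionary || val.app_type != type) {
      return false
    }
  }

  return true
}
```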
 852
 853
 854 #### Use Functions in Command Arguments set_if <a id="use-functions-command-arguments-setif"></a>
 855
 856 The `set_if` attribute inside the command arguments definition in the
 857 [CheckCommand object definition](09-object-types.md#objecttype-checkcommand) is primarily used to
 858 evaluate whether the command parameter should be set or not.
 859
 860 By default you can evaluate runtime macros for their existence. If the result is not an empty
 861 string, the command parameter is passed. This becomes fairly complicated when you want to evaluate
 862 multiple conditions and attributes.
 863
 864 The following example was found on the community support channels. The user had defined a host
 865 dictionary named `compellent` with the key `disks`. This was then used inside service `apply for` rules.
 866
 867     object Host "dict-host" {
 868       check_command = "check_compellent"
 869       vars.compellent["disks"] = {
 870         file = "/var/lib/check_compellent/san_disks.0.json",
 871         checks = ["disks"]
 872       }
 873     }
 874
 875 The more significant problem was to add the command parameter `--disks` to the plugin call
 876 only when the dictionary `compellent` contains the key `disks`, and to omit it otherwise.
 877
 878 By defining `set_if` as an [abbreviated lambda function](17-language-reference.md#nullary-lambdas)
 879 and evaluating whether the host custom attribute `compellent` contains the key `disks`, this problem was
 880 solved like this:
 881
 882     object CheckCommand "check_compellent" {
 883       command   = [ "/usr/bin/check_compellent" ]
 884       arguments   = {
 885         "--disks"  = {
 886           set_if = {{
 887             var host_vars = host.vars
 888             log(host_vars)
 889             var compel = host_vars.compellent
 890             log(compel)
 891             compel.contains("disks")
 892           }}
 893         }
 894       }
 895     }
 896
 897 This implementation uses the dictionary type method [contains](18-library-reference.md#dictionary-contains)
 898 and will fail if `host.vars.compellent` is not of the type `Dictionary`.
 899 Therefore you can extend the checks using the [typeof](17-language-reference.md#types) function.
 900
 901 You can test the types using the `icinga2 console`:
 902
 903     # icinga2 console
 904     Icinga (version: v2.3.0-193-g3eb55ad)
 905     <1> => srv_vars.compellent["check_a"] = { file="outfile_a.json", checks = [ "disks", "fans" ] }
 906     null
 907     <2> => srv_vars.compellent["check_b"] = { file="outfile_b.json", checks = [ "power", "voltages" ] }
 908     null
 909     <3> => typeof(srv_vars.compellent)
 910     type 'Dictionary'
 911     <4> =>
 912
 913 The more programmatic approach for `set_if` could look like this:
 914
 915         "--disks" = {
 916           set_if = {{
 917             var srv_vars = service.vars
 918             if (len(srv_vars) > 0) {
 919               if (typeof(srv_vars.compellent) == Dictionary) {
 920                 return srv_vars.compellent.contains("disks")
 921               } else {
 922                 log(LogInformation, "checkcommand set_if", "custom attribute compellent is not a dictionary, ignoring it.")
 923                 return false
 924               }
 925             } else {
 926               log(LogWarning, "checkcommand set_if", "empty custom attributes")
 927               return false
 928             }
 929           }}
 930         }
 931
 932
 933 #### Use Functions as Command Attribute <a id="use-functions-command-attribute"></a>
 934
 935 This comes in handy for [NotificationCommands](09-object-types.md#objecttype-notificationcommand)
 936 or [EventCommands](09-object-types.md#objecttype-eventcommand) which do not require
 937 a returned check result including state and output.
 938
 939 The following example was taken from the community support channels. The requirement was to
 940 specify a custom attribute inside the notification apply rule and decide which notification
 941 script to call based on that.
 942
 943     object User "short-dummy" {
 944     }
 945
 946     object UserGroup "short-dummy-group" {
 947       assign where user.name == "short-dummy"
 948     }
 949
 950     apply Notification "mail-admins-short" to Host {
 951        import "mail-host-notification"
 952        command = "mail-host-notification-test"
 953        user_groups = [ "short-dummy-group" ]
 954        vars.short = true
 955        assign where host.vars.notification.mail
 956     }
 957
 958 The solution is fairly simple: The `command` attribute is implemented as a function returning
 959 the array that Icinga 2 expects as the caller.
 960 The local variable `mailscript` sets the default value for the notification script location.
 961 If the notification custom attribute `short` is set, it overrides the local variable `mailscript`
 962 with a new value.
 963 The `mailscript` variable is then used to compute the final notification command array that is
 964 returned.
 965
 966 You can omit the `log()` calls, they only help debugging.
 967
 968     object NotificationCommand "mail-host-notification-test" {
 969       command = {{
 970         log("command as function")
 971         var mailscript = "mail-host-notification-long.sh"
 972         if (notification.vars.short) {
 973            mailscript = "mail-host-notification-short.sh"
 974         }
 975         log("Running command")
 976         log(mailscript)
 977
 978         var cmd = [ SysconfDir + "/icinga2/scripts/" + mailscript ]
 979         log(LogCritical, "me", cmd)
 980         return cmd
 981       }}
 982
 983       env = {
 984       }
 985     }
 986
 987
 988 ### Access Object Attributes at Runtime <a id="access-object-attributes-at-runtime"></a>
 989
 990 The [Object Accessor Functions](18-library-reference.md#object-accessor-functions)
 991 can be used to retrieve references to other objects by name.
 992
 993 This allows you to access configuration and runtime object attributes. A detailed
 994 list can be found [here](09-object-types.md#object-types).
 995
 996 #### Access Object Attributes at Runtime: Cluster Check <a id="access-object-attributes-at-runtime-cluster-check"></a>
 997
 998 This is a simple cluster example for accessing two host object states and calculating a virtual
 999 cluster state and output:
1000
1001 ```
1002 object Host "cluster-host-01" {
1003   check_command = "dummy"
1004   vars.dummy_state = 2
1005   vars.dummy_text = "This host is down."
1006 }
1007
1008 object Host "cluster-host-02" {
1009   check_command = "dummy"
1010   vars.dummy_state = 0
1011   vars.dummy_text = "This host is up."
1012 }
1013
1014 object Host "cluster" {
1015   check_command = "dummy"
1016   vars.cluster_nodes = [ "cluster-host-01", "cluster-host-02" ]
1017
1018   vars.dummy_state = {{
1019     var up_count = 0
1020     var down_count = 0
1021     var cluster_nodes = macro("$cluster_nodes$")
1022
1023     for (node in cluster_nodes) {
1024       if (get_host(node).state > 0) {
1025         down_count += 1
1026       } else {
1027         up_count += 1
1028       }
1029     }
1030
1031     if (up_count >= down_count) {
1032       return 0 // at least as many nodes up as down -> UP
1033     } else {
1034       return 2 //something is broken
1035     }
1036   }}
1037
1038   vars.dummy_text = {{
1039     var output = "Cluster hosts:\n"
1040     var cluster_nodes = macro("$cluster_nodes$")
1041
1042     for (node in cluster_nodes) {
1043       output += node + ": " + get_host(node).last_check_result.output + "\n"
1044     }
1045
1046     return output
1047   }}
1048 }
1049 ```
1050
1051 #### Time Dependent Thresholds <a id="access-object-attributes-at-runtime-time-dependent-thresholds"></a>
1052
1053 The following example sets time dependent thresholds for the load check based on the current
1054 time of the day compared to the defined time period.
1055
1056 ```
1057 object TimePeriod "backup" {
1058   import "legacy-timeperiod"
1059
1060   ranges = {
1061     monday = "02:00-03:00"
1062     tuesday = "02:00-03:00"
1063     wednesday = "02:00-03:00"
1064     thursday = "02:00-03:00"
1065     friday = "02:00-03:00"
1066     saturday = "02:00-03:00"
1067     sunday = "02:00-03:00"
1068   }
1069 }
1070
1071 object Host "webserver-with-backup" {
1072   check_command = "hostalive"
1073   address = "127.0.0.1"
1074 }
1075
1076 object Service "webserver-backup-load" {
1077   check_command = "load"
1078   host_name = "webserver-with-backup"
1079
1080   vars.load_wload1 = {{
1081     if (get_time_period("backup").is_inside) {
1082       return 20
1083     } else {
1084       return 5
1085     }
1086   }}
1087   vars.load_cload1 = {{
1088     if (get_time_period("backup").is_inside) {
1089       return 40
1090     } else {
1091       return 10
1092     }
1093   }}
1094 }
1095 ```
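
The two lambdas above repeat the same time period lookup. As a sketch, the lookup could be moved into a global function which returns a closure. The function name `period_specific_value` and the service name `webserver-backup-load-compact` are hypothetical.

```
/* hypothetical helper: pick a value depending on a time period */
globals.period_specific_value = function(period, inside_value, outside_value) {
  return function() use (period, inside_value, outside_value) {
    var tp = get_time_period(period)

    /* fall back to the outside value if the time period does not exist */
    if (tp && tp.is_inside) {
      return inside_value
    } else {
      return outside_value
    }
  }
}

object Service "webserver-backup-load-compact" {
  check_command = "load"
  host_name = "webserver-with-backup"

  vars.load_wload1 = period_specific_value("backup", 20, 5)
  vars.load_cload1 = period_specific_value("backup", 40, 10)
}
```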
1096
1097
1098 ## Advanced Value Types <a id="advanced-value-types"></a>
1099
1100 In addition to the default value types Icinga 2 also uses a few other types
1101 to represent its internal state. The following types are exposed via the [API](12-icinga2-api.md#icinga2-api).
1102
1103 ### CheckResult <a id="advanced-value-types-checkresult"></a>
1104
1105   Name                      | Type                  | Description
1106   --------------------------|-----------------------|----------------------------------
1107   exit\_status              | Number                | The exit status returned by the check execution.
1108   output                    | String                | The check output.
1109   performance\_data         | Array                 | Array of [performance data values](08-advanced-topics.md#advanced-value-types-perfdatavalue).
1110   check\_source             | String                | Name of the node executing the check.
1111   state                     | Number                | The current state (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN).
1112   command                   | Value                 | Array of command with shell-escaped arguments or command line string.
1113   execution\_start          | Timestamp             | Check execution start time (as a UNIX timestamp).
1114   execution\_end            | Timestamp             | Check execution end time (as a UNIX timestamp).
1115   schedule\_start           | Timestamp             | Scheduled check execution start time (as a UNIX timestamp).
1116   schedule\_end             | Timestamp             | Scheduled check execution end time (as a UNIX timestamp).
1117   active                    | Boolean               | Whether the result is from an active or passive check.
1118   vars\_before              | Dictionary            | Internal attribute used for calculations.
1119   vars\_after               | Dictionary            | Internal attribute used for calculations.
1120   ttl                       | Number                | Time-to-live duration in seconds for this check result. The next expected check result is `now + ttl` where freshness checks are executed.
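
As a sketch of how these attributes can be read inside the configuration language, the following hypothetical helper logs selected attributes of a host's last check result. The function name `log_last_check_result` and the facility name `checkresult-debug` are made up for illustration.

```
/* hypothetical helper, not part of Icinga 2 */
globals.log_last_check_result = function(host_name) {
  var h = get_host(host_name)

  /* the host may not exist, or may not have been checked yet */
  if (h && h.last_check_result) {
    var cr = h.last_check_result
    log(LogInformation, "checkresult-debug", host_name + ": state=" + cr.state + " exit_status=" + cr.exit_status + " output=" + cr.output)
  } else {
    log(LogWarning, "checkresult-debug", "No check result available for host " + host_name)
  }
}
```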
1121
1122 ### PerfdataValue <a id="advanced-value-types-perfdatavalue"></a>
1123
1124 Icinga 2 parses performance data strings returned by check plugins and makes the information available to external interfaces (e.g. [GraphiteWriter](09-object-types.md#objecttype-graphitewriter) or the [Icinga 2 API](12-icinga2-api.md#icinga2-api)).
1125
1126   Name                      | Type                  | Description
1127   --------------------------|-----------------------|----------------------------------
1128   label                     | String                | Performance data label.
1129   value                     | Number                | Normalized performance data value without unit.
1130   counter                   | Boolean               | Enabled if the original value contains `c` as unit. Defaults to `false`.
1131   unit                      | String                | Unit of measurement (`seconds`, `bytes`, `percent`) according to the [plugin API](05-service-monitoring.md#service-monitoring-plugin-api).
1132   crit                      | Value                 | Critical threshold value.
1133   warn                      | Value                 | Warning threshold value.
1134   min                       | Value                 | Minimum value returned by the check.
1135   max                       | Value                 | Maximum value returned by the check.