Overview

The Curity Identity Server comes with an alarm subsystem that monitors server functions and dependencies on other services for problems. The alarm subsystem is based on the IETF standard RFC 8632. When an undesired state in the system is detected, an alarm is raised; later, when the issue is resolved, the same alarm is cleared.

If the same issue is detected several times on the same node, the existing alarm changes state rather than a new alarm being raised. This means that an alarm is a state object in the Curity Identity Server with a status history describing its changes over time.

Note

An Alarm in the Curity Identity Server signifies an undesirable state in a subsystem or external resource that requires corrective action.

The primary goal of an alarm is to reduce the time it takes to identify which system is causing the undesired condition. Many alarms in the Curity Identity Server are raised when the server is unable to communicate with configured dependencies such as databases or HTTP services. Errors stemming from such sources are often hard to identify since they manifest during login and token issuance. During an incident, the time it takes for the operator to conclude that, for example, an HTTP service is down is valuable, and the alarms assist in this process.

Terminology

Term               Definition
Alarm              A faulty state detected by the system that requires administrative action
Alarm Type         A string that identifies the particular type of alarm, such as failed-connection
Notification       A message sent when an alarm has changed state
Cleared            true if the alarm is currently cleared, or false if it is raised
Alarming Resource  The configured component in the Curity Identity Server that raised the alarm
Impacted Resource  Other configured components that depend directly or indirectly on the alarming resource
Node               The runtime service that raised the alarm
Severity           The level of impact the alarm may have on the system

The Alarm Object

An alarm is not an event. The alarm is an object with a state that can be either raised or cleared. Whenever the state changes, the same alarm is updated with the new state. A notification is sent via the Alarm Handlers for each state change. The receiver of the notifications can correlate the incoming messages via the alarm id.

The Alarm Identifier

Internally, the Curity Identity Server maps state changes to the same alarm object, keyed as follows:

nodeId + resourcePath + alarmType

Where:

  • nodeId is the ID of the runtime service that raised the alarm
  • resourcePath is the path to the configured element that raised the alarm
  • alarmType is the type of alarm raised

It is possible that the same alarm is raised on multiple nodes. This is represented as separate alarms inside Curity, but external systems can of course correlate them by omitting the nodeId; note, however, that valuable information may be lost if the nodeId is not considered when acting on alarms.

Some alarms do not originate from a particular node, for example when the license is about to expire. In this case, the alarm identifier contains only the resourcePath and the alarmType.
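
To illustrate how this key composition enables correlation, here is a minimal Python sketch. The function names and node IDs are illustrative, not part of any Curity API; node-less alarms are modeled with a missing nodeId, matching the license example above.

    # Illustrative sketch only -- not Curity's internal implementation.
    # Shows how a key of nodeId + resourcePath + alarmType lets an external
    # system correlate the "same" alarm across nodes by dropping the nodeId.

    from typing import Optional

    def alarm_key(resource_path: str, alarm_type: str,
                  node_id: Optional[str] = None) -> tuple:
        """Build a correlation key; node_id is None for cluster-wide alarms
        such as license expiration."""
        return (node_id, resource_path, alarm_type)

    def cross_node_key(key: tuple) -> tuple:
        """Drop the nodeId so alarms raised on different nodes for the same
        resource and type collapse into one logical alarm."""
        _, resource_path, alarm_type = key
        return (resource_path, alarm_type)

    # Two nodes raising the same fault produce two distinct alarms...
    a = alarm_key("/base:facilities/client/http[id='my-http-client']",
                  "failed-connection", node_id="runtime-node-1")
    b = alarm_key("/base:facilities/client/http[id='my-http-client']",
                  "failed-connection", node_id="runtime-node-2")
    assert a != b
    # ...but correlate to the same logical issue when nodeId is omitted.
    assert cross_node_key(a) == cross_node_key(b)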

The Alarm Type

The alarm type defines the family of alarms that the alarm belongs to. It is predefined and has a specific meaning. The type of the alarm helps the operator identify the appropriate administrative actions to take in order to rectify the fault. It is recommended to read through the alarm type documentation to get a good idea of what types of errors the system will report on.

Alarm State

An alarm can have two meaningful states: raised or cleared. Once an alarm is raised, notifications are triggered. The alarm cannot be raised again until one of the following two conditions is met:

  1. The alarm has been cleared
  2. The node that raised the alarm has been restarted

All alarms in the Curity Identity Server clear themselves once the condition that caused the alarm no longer holds. For example, if a failed-connection alarm is raised for a data source, the alarm is set to cleared once a successful connection is established. Jitter handling is in place to avoid toggling an alarm too frequently for an alarming resource that quickly switches between healthy and faulty states, e.g. an HTTP client under heavy use that gets unexpected response status codes for some paths but not for others.
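
The raise-once semantics can be pictured with a small sketch. All names are assumptions for illustration, not Curity's actual implementation: a raise while already raised is ignored, and every real state change appends to the status history and emits a notification.

    import datetime

    class Alarm:
        def __init__(self, key):
            self.key = key
            self.raised = False
            self.history = []  # status history of (timestamp, state) entries

        def _record(self, state):
            entry = (datetime.datetime.now(datetime.timezone.utc), state)
            self.history.append(entry)
            print(f"notify handlers: {self.key} -> {state}")  # stand-in for alarm handlers

        def raise_alarm(self):
            if self.raised:      # already raised: no new notification
                return
            self.raised = True
            self._record("raised")

        def clear(self):
            if not self.raised:
                return
            self.raised = False
            self._record("cleared")

    alarm = Alarm(("node-1", "/base:facilities/client/http[id='my-http-client']",
                   "failed-connection"))
    alarm.raise_alarm()   # notification sent
    alarm.raise_alarm()   # ignored: must be cleared (or the node restarted) first
    alarm.clear()         # notification sent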

However, if a node that raised the alarm is taken out of service or if the configured resource that raised the alarm is removed, the alarming state will remain raised. If the administrator wishes to remove that alarm from the list, it can be purged using the Admin Web UI, RESTCONF or the CLI.

Severity

An alarm always has a severity as defined in RFC 8632. This is an indication of the potential system impact. The following severities apply:

Severity  Description
CLEARED   The alarm is currently cleared.
WARNING   The alarm indicates a potential service-impacting problem, such as a certificate that is about to expire.
MINOR     The alarm indicates a service-impacting fault that does not yet fully degrade operation.
MAJOR     The alarm indicates a service-impacting fault that severely degrades operation. Urgent corrective action is needed.
CRITICAL  The alarm indicates a service-impacting fault that has completely halted operation. Immediate corrective action is needed.

Note

Some alarms can increase in severity over time as the impact of the problem grows. However, in many cases the impact is deemed at least MAJOR if one or more profiles are affected by the raised alarm.

In the Admin UI, the severity will never be CLEARED. Instead, a separate indicator shows the clearance state, and the severity shows the last raised severity.
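
The raised severities form an ordered scale, which the following sketch captures. It is illustrative only; CLEARED is deliberately left out since, as noted above, clearance is tracked as a separate indicator rather than a severity level.

    from enum import IntEnum

    class Severity(IntEnum):
        """Raised severities in increasing order of impact."""
        WARNING = 1
        MINOR = 2
        MAJOR = 3
        CRITICAL = 4

    # An escalation (see the Note above) is simply a step up this scale:
    assert Severity.CRITICAL > Severity.MAJOR > Severity.MINOR > Severity.WARNING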

Alarming Resource

Note

Non-configured resources such as dynamic clients or user accounts are never alarming resources.

The alarming resource is always a configured item in Curity. However, if the configuration has changed since the alarm was raised, the resource may no longer exist. The alarm can then be purged by the admin if no further investigations will be made.

An alarming resource has a type and an ID. This is represented by the XPath expression pointing to the element in the configuration.

Example: /base:facilities/client/http[id='my-http-client']

Here, the type is http-client and the ID is my-http-client. Depending on the alarm-handler configured for notifications, this information may be conveyed as the raw XPath expression or in a more human-readable form.
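
A hypothetical helper, not part of any Curity API, shows how the raw element name and ID could be extracted from an expression of this form. Note that it yields the raw element name http; Curity presents it in the friendlier form http-client.

    import re

    RESOURCE_XPATH = re.compile(r"/(?P<type>[\w:-]+)\[id='(?P<id>[^']+)'\]$")

    def parse_alarming_resource(xpath: str):
        """Extract the raw element name and ID from the trailing path step."""
        match = RESOURCE_XPATH.search(xpath)
        if match is None:
            raise ValueError(f"unrecognized resource path: {xpath}")
        return match.group("type"), match.group("id")

    print(parse_alarming_resource("/base:facilities/client/http[id='my-http-client']"))
    # -> ('http', 'my-http-client')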

Impacted Resources

Impacted resources are configured elements that directly or indirectly use the alarming resource. For example, if the HTTP client used by the BankID authenticator raises an alarm, then the BankID authenticator is an impacted resource, as is any OAuth client using the BankID authenticator.

The impacted resources are presented as a tree structure in the Admin UI. They are passed as lists to all alarm handlers, which in turn create notifications containing the information for the recipient.
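
Conceptually, the impact analysis is a walk over the reversed dependency graph. The following sketch, with an assumed example configuration mirroring the BankID scenario above, shows the idea:

    from collections import defaultdict, deque

    # component -> components it depends on (assumed example configuration)
    depends_on = {
        "authenticator/bankid": ["http-client/my-http-client"],
        "oauth-client/web-app": ["authenticator/bankid"],
    }

    # Invert the edges: component -> components that depend on it.
    dependents = defaultdict(list)
    for component, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(component)

    def impacted(alarming: str) -> list:
        """Breadth-first walk over dependents = direct and indirect impact."""
        seen, queue, result = {alarming}, deque([alarming]), []
        while queue:
            for nxt in dependents[queue.popleft()]:
                if nxt not in seen:
                    seen.add(nxt)
                    result.append(nxt)
                    queue.append(nxt)
        return result

    print(impacted("http-client/my-http-client"))
    # -> ['authenticator/bankid', 'oauth-client/web-app']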

Fig. 1 An impact analysis of an HTTP client alarm

The alarming resource is at the top and all impacted resources are connected below in a hierarchy. The alarm is also present throughout the UI when visiting elements that are impacted by an active alarm.

Fig. 2 Impact visible on the Authenticator

The system is considered impacted even if the alarm is in a cleared state, since the alarm is still present. The impact indication is removed once the administrator removes the alarm from the list of alarms, which can be done with the purge option in the UI.

Status History

Whenever the state of an alarm changes, either because the severity changed or because the cleared status changed, a new entry is added to the status history for the alarm. Each state change is accompanied by a notification via the alarm handlers. The latest entry in the status history is also the current state of the alarm.

Fig. 3 Status history for an alarm

Sliding Window Alarms

Under some circumstances, it is not desirable for every fault to immediately raise an alarm. For instance, an HTTP service might give the occasional 5XX response status, or a single Data Source query might be slower than expected, without there necessarily being something wrong with the service, and therefore no cause for immediate alarm. For faults of this kind, a sliding window is employed before an alarm is raised: a given number of faults must occur within a given time span before the condition is considered severe enough to raise the alarm.

Sliding windows are employed only by certain Alarming Resources and Alarm Types; see the alarm type documentation for which ones.

All alarm types that employ a sliding window have two additional configurable parameters pertaining to the sliding window:

Parameter name           Default value  Description
faults-to-raise-alarm    2              The number of faults that must occur within the span of the sliding window before an alarm is raised. Setting this value to 1 effectively disables the sliding window, raising alarms immediately as faults occur.
sliding-window-duration  10 seconds     The size, in seconds, of the sliding window. Setting this value to 0 effectively disables the sliding window, raising alarms immediately as faults occur.
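
A minimal sketch of such a sliding-window policy, using the two parameters with their documented defaults (illustrative, not Curity's implementation):

    from collections import deque

    class SlidingWindow:
        def __init__(self, faults_to_raise_alarm: int = 2,
                     sliding_window_duration: float = 10.0):
            self.threshold = faults_to_raise_alarm
            self.duration = sliding_window_duration
            self.fault_times = deque()

        def record_fault(self, now: float) -> bool:
            """Record a fault at time `now` (seconds); return True when the
            faults inside the window warrant raising the alarm."""
            self.fault_times.append(now)
            # Evict faults that fell outside the window.
            while self.fault_times and now - self.fault_times[0] > self.duration:
                self.fault_times.popleft()
            return len(self.fault_times) >= self.threshold

    window = SlidingWindow()
    assert window.record_fault(0.0) is False   # one fault: below threshold
    assert window.record_fault(12.0) is False  # first fault expired from window
    assert window.record_fault(13.0) is True   # two faults within 10 seconds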

Managing Alarms

When a node raises an alarm, it creates notifications using all configured alarm-handlers and writes a node-local log entry to a file called $IDSVR_HOME/var/log/alarms.log; this file is never purged by the Curity Identity Server. The node also sends an internal update to the admin node, which maintains the cluster history of all alarms.

The cluster-wide alarm list can be accessed using the CLI, the RESTCONF API and the Admin Web UI. The UI provides the richest representation of the alarm, with the most complete impact analysis possible for all alarms.
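
As a hypothetical example, the alarm list could be fetched over RESTCONF as follows. The base URL, port, credentials, and exact resource path here are assumptions; adapt them to your deployment's admin API configuration.

    import base64
    import json
    import ssl
    import urllib.request

    BASE = "https://localhost:6749/admin/api/restconf/data"  # assumed base URL
    PATH = "/alarms/alarm-list"  # mirrors the CLI path shown below
    token = base64.b64encode(b"admin:password").decode()  # assumed credentials

    request = urllib.request.Request(
        BASE + PATH,
        headers={
            "Authorization": "Basic " + token,
            "Accept": "application/yang-data+json",
        },
    )

    # Lab setting only: accept the admin node's self-signed certificate.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE

    with urllib.request.urlopen(request, context=context) as response:
        print(json.dumps(json.load(response), indent=2))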

Alarm Overview

Fig. 4 Main dashboard alarm list

The main dashboard shows a list of currently active alarms. Clicking an entry in the list opens the alarm details. The list can be filtered to show only alarms that are not cleared, or filtered by node.

The same overview can be found in the CLI by executing a show command in view mode.

admin@localhost> show alarms alarm-list
Possible completions:
alarm            - The list of alarms.
last-changed     - A timestamp when the alarm list was last changed.
number-of-alarms - This object shows the total number of alarms in the system, i.e., the total number of entries in the alarm list.

Or to just view a summary:

show alarms alarm-list | display-level 1

Notifications

Notifications about alarms are sent using Alarm Handlers. Any state change of the alarm will result in a notification being sent by each alarm handler on the node that caused the alarm.

There is also a default handler that always runs and creates a log entry in $IDSVR_HOME/var/log/alarms.log on that node. This is a good place to look if a node has issues, since it is guaranteed to be written to when alarms are raised or cleared. The alarms.log file should always be empty on a system running without issues.
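
Because of this guarantee, a trivial external check can treat the file itself as a health signal. The log path comes from this documentation; the /opt/idsvr fallback directory and the rest of the sketch are assumptions for illustration.

    import os
    from pathlib import Path

    # On a node that has never raised an alarm, alarms.log stays empty,
    # so a non-empty file is itself worth surfacing.
    log_path = Path(os.environ.get("IDSVR_HOME", "/opt/idsvr")) / "var" / "log" / "alarms.log"

    if log_path.exists() and log_path.stat().st_size > 0:
        print(f"WARNING: {log_path} is non-empty; alarms have occurred on this node")
    else:
        print(f"OK: no alarm activity recorded at {log_path}")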

It is recommended to configure one or more alarm handlers, since they can provide richer information about the alarm than the log file.

Clusters

An alarm is local to a particular runtime node. This means that several nodes may raise alarms for the same issue, and it is up to the system handling the alarms to correlate between nodes. The Admin UI also provides correlation functions to help the administrator see the real issue quickly.

Fig. 5 Cluster notifications

When a node raises an alarm, it sends notifications to the configured handlers, and it also attempts to send a notification to the admin node in the cluster. This operation succeeds as long as the cluster is connected; even if it fails, the alarm is still sent to all receivers of notifications.

Note

The admin node exposes a port on localhost for alarm collection. This port is not public to other nodes but needs to be kept open for alarms to function in the cluster. By default, it is configured to listen on port 4464, but this can be changed via the ALARMS_PORT environment variable if necessary. It should not be accessible outside the admin node.

Some alarms are raised by the admin node only. These are not node-local, but instead are triggered by conditions that affect the entire cluster. An example of such an alarm is the license expiration alarm, which is raised when the license is about to expire.