Monitoring

This section of the admin guide describes how to monitor the Curity Identity Server.

Tip

🔥🔥🔥 If you just want to know how to determine if your instance of Curity is unhealthy and on fire, refer to the Common Alerts section below. 🔥🔥🔥

JMX

Java Management Extensions (JMX) is a commonly used interface for monitoring the internals of a Java-based application like the Curity Identity Server. This ability to peer inside the application, however, can be dangerous, which is why JMX is disabled by default. To enable it, set the ENABLE_JMX environment variable before starting the Curity Identity Server; the value itself is ignored and can be any non-empty value (e.g., true, 1, etc.). This can be done on the command line like this, for instance:

Listing 72 Example of how to enable JMX from the command line by setting the ENABLE_JMX environment variable
$ ENABLE_JMX=1 idsvr

Status Endpoint

The Curity Identity Server provides an HTTP endpoint that reports node status information. It is configured using the following environment variables.

Environment variable    Description                           Default value
STATUS_CMD_ENABLED      Whether the endpoint is enabled       true
STATUS_CMD_PORT         HTTP port to bind to                  4465
STATUS_CMD_HOST         Network host or address to listen on  0.0.0.0
STATUS_CMD_MAX_THREADS  Maximum number of threads             16

This status endpoint is enabled by default. It can be disabled by setting the STATUS_CMD_ENABLED environment variable to false or by starting idsvr with the --no-status parameter.
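As a rough illustration of how these variables and their defaults fit together, the following Python sketch resolves the effective settings from an environment mapping. The function name `status_settings` and the keys of the returned dictionary are hypothetical; only the variable names and default values come from the table above. How the server treats values other than false is not stated here, so the case-insensitive comparison is an assumption.

```python
import os

# Defaults taken from the environment variable table above.
DEFAULTS = {
    "STATUS_CMD_ENABLED": "true",
    "STATUS_CMD_PORT": "4465",
    "STATUS_CMD_HOST": "0.0.0.0",
    "STATUS_CMD_MAX_THREADS": "16",
}


def status_settings(env=None):
    """Return the effective status endpoint settings (hypothetical helper)."""
    env = os.environ if env is None else env
    raw = {name: env.get(name, default) for name, default in DEFAULTS.items()}
    return {
        # Assumption: any value other than "false" leaves the endpoint enabled.
        "enabled": raw["STATUS_CMD_ENABLED"].strip().lower() != "false",
        "port": int(raw["STATUS_CMD_PORT"]),
        "host": raw["STATUS_CMD_HOST"],
        "max_threads": int(raw["STATUS_CMD_MAX_THREADS"]),
    }
```

Calling `status_settings()` with no argument reads the current process environment; passing a plain dictionary makes the helper easy to try out without touching real variables.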

The status endpoint only supports HTTP GET requests to the / path. The response has one of the following status codes:

  • 200 if the node is ready to receive and process HTTP requests (even if it is temporarily disconnected from the admin node).
  • 503 if the node is not ready to receive requests (e.g. it is still booting or is shutting down).
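A minimal Python sketch of a readiness probe built on these two status codes might look as follows. The function names are hypothetical, and the URL assumes the default port on the local host. Treating any other status code as an error is a choice made for this sketch, not documented server behavior.

```python
import urllib.error
import urllib.request


def is_ready_status(status_code):
    """Map the two documented status codes to a readiness flag."""
    if status_code == 200:
        return True
    if status_code == 503:
        return False
    # Sketch-level choice: anything else is treated as unexpected.
    raise ValueError(f"unexpected status code: {status_code}")


def probe(url="http://localhost:4465/", timeout=2.0):
    """Send a GET request to the status endpoint; return (ready, body_bytes)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_ready_status(response.status), response.read()
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for a 503, but the body is still readable.
        return is_ready_status(error.code), error.read()
```

Connection failures (for example, when the endpoint is disabled) surface as `urllib.error.URLError` and are deliberately left to the caller.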

In both cases, the response body will contain a JSON representation of the node status, containing the following fields:

  • isReady
    • false - the node is not ready to process requests (e.g. it is still booting or is shutting down).
    • true - the node is ready to receive and process requests (but may be disconnected from the admin).
  • nodeState
    • BOOTING - the node is starting up and not ready to process requests.
    • WAITING - the node is ready to process requests with the latest configuration that it has; however, it is still waiting to connect to the admin, so configuration may be stale.
    • RUNNING - the node is ready to process requests and is connected to the admin node.
    • ERROR - the node is in an unrecoverable error state.
    • STOPPING - the node is shutting down and not able to process requests.
    [Figure: node_state.png]
  • clusterState
    • STANDALONE - the node has clustering disabled.
    • CONNECTING_TO_CLUSTER - the node is not an admin node and is trying to connect (for the first time) or reconnect to the admin node.
    • CONNECTED - the node is not an admin node and is connected to the admin node.
    • ADMIN - the node is an admin node.
    • ERROR - an unexpected error occurred when checking the cluster state.
    [Figure: cluster_state.png]
  • configurationState
    • UNINITIALIZED - the node is not configured and therefore unable to correctly process requests.
    • CONFIGURED - the node is fully configured.
    • RECONFIGURING - the node is currently consuming a new configuration. The previous configuration is still valid and is used for request processing in the meantime.
    [Figure: configuration_state.png]
  • transactionId - an opaque string identifier for the last committed transaction seen by the current node.
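To show how the fields listed above could be consumed, here is a small Python sketch that parses the response body into a typed object. The class name `NodeStatus`, the `may_be_stale` property, and the sample body are all hypothetical; the sample values are made up for illustration and are not captured from a real node.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class NodeStatus:
    is_ready: bool
    node_state: str
    cluster_state: str
    configuration_state: str
    transaction_id: Optional[str]

    @property
    def may_be_stale(self):
        """True when the node serves requests but is still waiting for the admin."""
        return self.is_ready and self.node_state == "WAITING"


def parse_status(body):
    """Parse the JSON response body of the status endpoint."""
    data = json.loads(body)
    return NodeStatus(
        is_ready=data["isReady"],
        node_state=data["nodeState"],
        cluster_state=data["clusterState"],
        configuration_state=data["configurationState"],
        # Assumption: tolerate a missing transaction identifier.
        transaction_id=data.get("transactionId"),
    )


# Hypothetical response body with made-up values.
SAMPLE = (
    '{"isReady": true, "nodeState": "WAITING", '
    '"clusterState": "CONNECTING_TO_CLUSTER", '
    '"configurationState": "CONFIGURED", "transactionId": "example-tx"}'
)
status = parse_status(SAMPLE)
```

With the sample above, `status.may_be_stale` is true because the node is ready but in the WAITING state, which the list describes as possibly running with stale configuration.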

Command line tool

The Curity Identity Server installation also contains the bin/status command line tool that can be used to probe the HTTP status endpoint. It uses the same environment variables as the server and accepts the following invocation parameters:

  • -j or --json - if present, the response written to the standard output is in the JSON format; otherwise it is written in plain text.
  • -h or --help - prints the synopsis of the tool.
  • -v - not used, but maintained for backward compatibility.

The status tool performs a request to the local node status endpoint and writes the response body to the standard output. The tool’s exit codes are described in the following table.

Exit code   Description
0           The probed node is ready.
1           The status endpoint is disabled and was not probed.
4           There was an IO error while communicating with the status endpoint.
103         A response with a 3xx status was received from the status endpoint.
104         A response with a 4xx status was received from the status endpoint.
105         A response with a 5xx status was received from the status endpoint.
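A wrapper script could translate these exit codes into messages along the following lines. This is only a sketch: the helper names are hypothetical, and the relative path `bin/status` assumes the script runs from the installation directory.

```python
import subprocess

# Exit codes and descriptions copied from the table above.
EXIT_CODES = {
    0: "The probed node is ready.",
    1: "The status endpoint is disabled and was not probed.",
    4: "There was an IO error while communicating with the status endpoint.",
    103: "A response with a 3xx status was received from the status endpoint.",
    104: "A response with a 4xx status was received from the status endpoint.",
    105: "A response with a 5xx status was received from the status endpoint.",
}


def describe_exit_code(code):
    """Return the documented meaning of an exit code, or a fallback message."""
    return EXIT_CODES.get(code, f"Undocumented exit code {code}.")


def run_status_tool(path="bin/status"):
    """Run the status tool with --json; return (exit_code, stdout, meaning)."""
    result = subprocess.run([path, "--json"], capture_output=True, text=True)
    return result.returncode, result.stdout, describe_exit_code(result.returncode)
```

Only `describe_exit_code` is pure; `run_status_tool` needs an actual installation to be present at the assumed path.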

Prometheus-compliant Metrics

Each run-time and admin node exposes an endpoint where certain information is published in a Prometheus-compliant format (i.e., Prometheus’ exposition format). This allows the Prometheus monitoring tool (or others that can process data in this format) to monitor certain metrics about the behavior of the node. This endpoint is exposed over HTTP and listens on the same interface as the status endpoint described above. Its port is one greater than the status endpoint’s port (4466 by default).
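The relationship between the two ports can be sketched in Python as follows. The helper name `metrics_bind_address` is hypothetical, and the sketch assumes the metrics endpoint follows STATUS_CMD_HOST and STATUS_CMD_PORT exactly as the paragraph above describes. The URL path of the metrics endpoint is not given here, so none is assumed.

```python
import os


def metrics_bind_address(env=None):
    """Return the (host, port) the metrics endpoint is expected to listen on."""
    env = os.environ if env is None else env
    host = env.get("STATUS_CMD_HOST", "0.0.0.0")
    status_port = int(env.get("STATUS_CMD_PORT", "4465"))
    # The metrics port is one greater than the status endpoint's port.
    return host, status_port + 1
```

Note that 0.0.0.0 is a bind address; a monitoring tool would connect to a concrete host name or IP address of the node instead.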

The metrics exposed and their meanings are described in the following table:

Metric Name                           Type     Labels                  Meaning
idsvr_authentication_login            Counter  acr                     The number of authentication events that have occurred
idsvr_authentication_sso              Counter  acr                     The number of Single Sign-on events that have occurred
idsvr_cpu_usage                       Gauge    (none)                  The amount of CPU used (0 <= x <= 1) by the Java process that the node started
idsvr_datasource_account_sum          Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all account data sources
idsvr_datasource_account_count        Counter  ds_id, ds_type          The number of operations measured across all account data sources
idsvr_datasource_attribute_sum        Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all attribute data sources
idsvr_datasource_attribute_count      Counter  ds_id, ds_type          The number of operations measured across all attribute data sources
idsvr_datasource_credential_sum       Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all credential data sources
idsvr_datasource_credential_count     Counter  ds_id, ds_type          The number of operations measured across all credential data sources
idsvr_datasource_dcr_sum              Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all dynamic client registration data sources
idsvr_datasource_dcr_count            Counter  ds_id, ds_type          The number of operations measured across all dynamic client registration data sources
idsvr_datasource_delegation_sum       Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all delegation data sources
idsvr_datasource_delegation_count     Counter  ds_id, ds_type          The number of operations measured across all delegation data sources
idsvr_datasource_device_sum           Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all device data sources
idsvr_datasource_device_count         Counter  ds_id, ds_type          The number of operations measured across all device data sources
idsvr_datasource_nonce_sum            Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all nonce data sources
idsvr_datasource_nonce_count          Counter  ds_id, ds_type          The number of operations measured across all nonce data sources
idsvr_datasource_session_sum          Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all session data sources
idsvr_datasource_session_count        Counter  ds_id, ds_type          The number of operations measured across all session data sources
idsvr_datasource_token_sum            Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all token data sources
idsvr_datasource_token_count          Counter  ds_id, ds_type          The number of operations measured across all token data sources
idsvr_datasource_bucket_sum           Counter  ds_id, ds_type          The total time (in seconds) taken by operations across all bucket data sources
idsvr_datasource_bucket_count         Counter  ds_id, ds_type          The number of operations measured across all bucket data sources
idsvr_http_server_request_time_sum    Counter  (none)                  The total time (in seconds) taken by all HTTP requests
idsvr_http_server_request_time_count  Counter  (none)                  The number of HTTP requests that have been made
idsvr_jvm_memory_used                 Gauge    memory_id, memory_area  The amount of memory used (in bytes) by the Java process that the node started
log4j2_appender_total                 Counter  level                   The number of log messages, by severity, written since startup
idsvr_oauth_delegation_issued         Counter  client_id               The number of delegations issued
idsvr_oauth_delegation_revoked        Counter  client_id               The number of delegations revoked
idsvr_oauth_token_issued              Counter  client_id, token_type   The number of OAuth tokens (access, ID, refresh) issued
idsvr_oauth_token_revoked             Counter  client_id, token_type   The number of OAuth tokens revoked

The labels in the previous table have the meanings described in the following table:

Label Name   Meaning
acr          The authentication context class reference (ACR) of the authenticator used for login or SSO (as applicable)
client_id    The identifier of the OAuth client to which the metric is related
ds_id        The identifier of the data source to which the metric is related
ds_type      The type of data source to which the metric is related (e.g., ldap, jdbc, etc.)
level        The level of the log message (e.g., error, warn, etc.)
memory_id    The identifier representing the pool of memory being measured (e.g., G1 Old Gen, etc.)
memory_area  The type of memory being measured (heap, non-heap, etc.)
token_type   The type of token to which the measurement is related (e.g., access_token, etc.)
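For readers who want to inspect these metrics and labels without a full Prometheus setup, the following Python sketch parses exposition-format text into samples. It is a simplified parser written for illustration: it ignores timestamps and does not handle escaped quotes inside label values. The function name `parse_metrics` and the sample text (including its values) are hypothetical.

```python
import re

# metric_name, optional {labels}, then the sample value.
_LINE = re.compile(r'^([A-Za-z_:][A-Za-z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'([A-Za-z_][A-Za-z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _LINE.match(line)
        if match is None:
            continue  # ignore lines this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


# Hypothetical metrics text with made-up values.
SAMPLE = """\
# TYPE idsvr_cpu_usage gauge
idsvr_cpu_usage 0.25
idsvr_jvm_memory_used{memory_id="G1 Old Gen",memory_area="heap"} 1048576
log4j2_appender_total{level="warn"} 3
"""
samples = parse_metrics(SAMPLE)
```

Each sample keeps its labels as a dictionary, so filtering on, say, `labels.get("level") == "error"` is straightforward. For production use, an established Prometheus client or scraper is the safer choice.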

Gathering of data can be disabled. If it is disabled when the node starts, no data will be published. To disable gathering of data, go to System ‣ General in the admin UI and toggle off Enable Reporting. Once that change is committed, all nodes will stop gathering data.

Common Alerts

If you want to set up alerts for when things go wrong in the Curity Identity Server, you can start with the following:

  • If idsvr_datasource_*_sum / idsvr_datasource_*_count >= 800 since the last poll to the metrics endpoint, your database is having issues. The result of this arithmetic is the average response time from the Curity Identity Server to the database (for the given period).
  • If log4j2_appender_total with a level label of error is > 0, call support!
  • If log4j2_appender_total with a level label of warn is greater than at the last poll, look into the issue immediately, and raise a support case if you can’t figure out the problem.
  • If idsvr_cpu_usage is >= 0.95 (95%) at an unexpected time or for a prolonged period of time, you should take action.
  • If idsvr_http_server_request_time_sum / idsvr_http_server_request_time_count >= 1000 since the last poll to the metrics endpoint, HTTP requests are being answered slowly and you should investigate. The result of this arithmetic is the average HTTP response time of the Curity Identity Server Web server (for the given period).
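The "since the last poll" arithmetic in these alerts works on the difference between two consecutive polls, not on the raw counter values. A minimal Python sketch of that calculation is shown below; the function names are hypothetical. The threshold is passed in by the caller and must be expressed in the same unit as the polled sums.

```python
def average_since_last_poll(prev_sum, prev_count, curr_sum, curr_count):
    """Average duration of the operations recorded between two polls.

    Returns None when no new operations were recorded.
    """
    delta_count = curr_count - prev_count
    if delta_count <= 0:
        # Nothing new, or the counters went backwards (e.g. after a restart).
        return None
    return (curr_sum - prev_sum) / delta_count


def exceeds(average, threshold):
    """True when an average is available and at or above the threshold."""
    return average is not None and average >= threshold


def counter_increased(previous, current):
    """True when a counter, such as a warn-level log count, grew since the last poll."""
    return current > previous
```

For example, if a sum went from 10.0 to 16.0 while its count went from 100 to 103, the three new operations averaged 2.0 each; `exceeds(2.0, 1.5)` would then fire. In a real Prometheus setup the same idea is usually expressed with rate-based alerting rules instead of hand-written polling code.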