Alerts Overview

Alerts are the mechanism through which the Oxide control plane notifies operators and external software systems of events that occur within the system. A typical use case for alerts is to take action on failures of hardware components, such as SSDs. Alerts may also be triggered by events that represent normal operation, such as an instance rebooting. They provide a mechanism for event-driven integration with external systems: while the Oxide HTTP API allows external software to make requests of the rack, alerts allow the rack to notify external software when an event occurs.

Alert Receivers

An alert receiver is an API resource representing a set of subscriptions to alerts, along with a mechanism for processing those alerts.

Alert receivers subscribe to alerts by specifying a set of alert classes. The alert receiver resource also specifies configuration for how alerts are sent to that receiver, which depends on the mechanism used to deliver alerts to that receiver. Presently, webhooks are the sole alert delivery mechanism.

See here for guidance on how to implement reliable alert receivers.

Identity and Access Management

Alert receivers may only be created or modified by users with the fleet.admin role, and may receive alerts relating to any resource, including hardware alerts.

See here for an overview of IAM policy in the Oxide rack.

Alert Classes

The events communicated by alerts are categorized into alert classes. These classes indicate the resource scope that an alert relates to and the nature of the event it represents. An alert class is a string consisting of one or more segments, separated by . characters. A segment consists of alphanumeric characters and underscores. Each segment describes the category of the alert with increasing specificity, with the first segment being the most general: alert classes form a hierarchy of event types, in which top-level categories represent broad categories of entities, followed by increasingly specific subcategories.

For example, the alert class instance.network_interface.delete consists of three segments. The first segment, instance, indicates the top-level API resource that the alert is associated with. The second segment, network_interface, indicates that the alert relates to a particular child resource of the instance (in this case, a virtual network interface). Finally, the last segment, delete, indicates what event occurred.
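
Because the segment rules are simple, an alert class can be validated mechanically. The following is a minimal Python sketch (illustrative only, not an Oxide tool) that splits a class into segments and checks each one against the rules above:

```python
import re

# A plain segment is made up of alphanumeric characters and underscores.
SEGMENT = re.compile(r"^[A-Za-z0-9_]+$")

def parse_alert_class(alert_class: str) -> list[str]:
    """Split an alert class into its segments, validating each one."""
    segments = alert_class.split(".")
    for segment in segments:
        if not SEGMENT.match(segment):
            raise ValueError(f"invalid segment {segment!r} in {alert_class!r}")
    return segments

# ["instance", "network_interface", "delete"]
print(parse_alert_class("instance.network_interface.delete"))
```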

Alert receivers subscribe to one or more alert classes to determine what events they will be notified for. Alert class subscriptions may be specified when the receiver is created, or can be added to an existing receiver using the alert_receiver_subscription_add API endpoint.

Globbing

A receiver’s alert subscriptions may include simple globs to subscribe to multiple categories of alert. Globbing is performed on a per-segment basis, so a segment may be either a string of alphanumeric characters and underscores, or a glob segment. A segment consisting of the * character is a single-segment glob, and will match any single segment at that position in the alert class string. A segment consisting of the string ** is a multi-segment glob, and will match any number of segments.

Note

Globs are evaluated at the segment level, not within a segment.

A segment in an alert class subscription may contain either alphanumeric characters and underscores or a glob, but not both. Subscription patterns such as instance.*-interface.delete are not accepted.

For example, a receiver that subscribes to instance.* will receive notifications for instance.create, instance.delete, instance.start, and any other alert class consisting of the segment instance followed by one additional segment. Similarly, a receiver that subscribes to **.delete will receive notifications for project.delete, instance.network_interface.delete, and any other alert class ending with the segment delete, regardless of the number of preceding segments.
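
The matching rules above can be expressed compactly in code. This is an illustrative Python sketch of segment-level glob matching, assuming the semantics described in this section (it is not the control plane's implementation, and it assumes ** may match zero segments):

```python
def class_matches(subscription: str, alert_class: str) -> bool:
    """Return True if an alert class matches a subscription pattern.

    `*` matches exactly one segment; `**` matches any number of segments.
    Globbing is evaluated per segment, never within a segment.
    """
    def matches(pattern: list[str], segments: list[str]) -> bool:
        if not pattern:
            return not segments
        head, rest = pattern[0], pattern[1:]
        if head == "**":
            # Let the multi-segment glob consume zero or more segments.
            return any(matches(rest, segments[i:]) for i in range(len(segments) + 1))
        if not segments:
            return False
        if head == "*" or head == segments[0]:
            return matches(rest, segments[1:])
        return False

    return matches(subscription.split("."), alert_class.split("."))

assert class_matches("instance.*", "instance.start")
assert class_matches("**.delete", "instance.network_interface.delete")
assert not class_matches("instance.*", "instance.network_interface.delete")
```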

Alert UUIDs

Each alert generated by the system is uniquely identified by a UUID. These UUIDs are included in the data payload sent to alert receivers. If an alert results in notifications being delivered to multiple receivers, the alert UUID sent to each receiver will be the same. This may be used to correlate events across multiple receivers, and to de-duplicate repeated deliveries of the same alert.

Important

Two alerts with the same UUID refer to the same event, rather than two distinct events with equivalent data. Conversely, two alert notifications with different UUIDs refer to two distinct events, even if other data in the alert payload is the same.

Alert Delivery

When an alert occurs that matches an alert receiver’s subscriptions, the alert is dispatched to that receiver, creating an API resource called an alert delivery. The alert delivery represents metadata associated with the control plane’s attempts to actually transmit the alert’s data to the receiver. Each alert delivery is assigned a UUID, and is considered a child of the alert receiver resource that the alert is being delivered to. The component of the Oxide control plane responsible for dispatching alerts to receivers is referred to as the alert dispatcher.

The alert dispatcher attempts to provide at-least-once reliable delivery to live receivers. Unsuccessful delivery attempts are retried, in order to ensure that the alert is observed by the receiver.

Important
At-least-once delivery is not guaranteed in the case of a prolonged receiver outage, as the maximum number of retries is limited to prevent infinite retry loops.

In some cases, this means that a receiver may be sent an alert representing the same event multiple times. Therefore, alert receiver implementations should use the alert UUIDs included in the alert’s data payload to de-duplicate alerts.

The details of the actual communication mechanism used to deliver an alert depend on the type of the alert receiver. For webhook alert receivers, a delivery represents one or more HTTP POST requests sent to the receiver endpoint. See the webhook documentation for details on webhook delivery requests.
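
As a concrete illustration of the de-duplication guidance above, the following is a minimal sketch of a webhook receiver endpoint in Python. The alert_id field name and the use of a 200 response to signal success are assumptions made for illustration; consult the webhook documentation for the actual payload schema and success criteria.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Alert UUIDs already processed. A real receiver would persist this durably.
seen_alert_ids: set[str] = set()

class AlertReceiver(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))

        # `alert_id` is an assumed field name for the alert UUID.
        alert_id = payload.get("alert_id")
        if alert_id in seen_alert_ids:
            # A repeated delivery of an alert we already handled: acknowledge
            # it without acting on it a second time.
            self.send_response(200)
            self.end_headers()
            return

        seen_alert_ids.add(alert_id)
        handle_alert(payload)       # application-specific processing
        self.send_response(200)     # signal success so no retry is needed
        self.end_headers()

def handle_alert(payload: dict) -> None:
    print("received alert:", payload)

if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), AlertReceiver).serve_forever()
```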

Delivery Failure and Retries

If an attempt to deliver an alert to a receiver is unsuccessful, the Oxide control plane will retry the delivery up to two additional times before considering it to have failed. The first retry is made one minute after the initial attempt. If that retry fails, a second retry is made five minutes after the first retry. If neither retry succeeds, the Oxide control plane considers the delivery failed and will not continue to attempt delivery of that alert to that receiver. Delivery of other alerts will still be attempted. Requesting that the alert dispatcher resend the alert will begin a new delivery.
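
A small sketch of that retry schedule, assuming the timings described above (an initial attempt, a retry one minute later, and a final retry five minutes after that):

```python
from datetime import datetime, timedelta

# Delays between attempts, per the schedule described above. If the initial
# attempt and both retries fail, the delivery is considered failed.
RETRY_DELAYS = [timedelta(minutes=1), timedelta(minutes=5)]

def attempt_times(first_attempt: datetime) -> list[datetime]:
    """Return the times of all delivery attempts, assuming each one fails."""
    times = [first_attempt]
    for delay in RETRY_DELAYS:
        times.append(times[-1] + delay)
    return times

# An initial attempt at 12:00 that fails is retried at 12:01 and 12:06; if
# the 12:06 attempt also fails, the delivery is marked failed.
print(attempt_times(datetime(2025, 1, 1, 12, 0)))
```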

Naturally, the details of what constitutes a successful or unsuccessful delivery attempt depend on the type of the alert receiver. For webhook receivers, the way in which deliveries are classified as successes or failures is discussed here.

Alert delivery flow

Delivery States

A delivery resource has one of three possible states. These states will be displayed when viewing alert deliveries using the alert_delivery_list API or the oxide alert receiver log CLI command. They may also be provided as arguments to filter the returned list of deliveries.

Alert delivery states

pending

The delivery is in progress, and has neither completed successfully nor failed.

A delivery is in the pending state if the first delivery attempt has not yet been performed, or if a delivery attempt has failed but the delivery has retries remaining.

delivered

The alert has been delivered successfully.

Once a delivery attempt to the receiver succeeds, the delivery advances to the delivered state. A delivery for which one or more attempts have failed can still succeed, if a retry attempt delivers the alert successfully.

No additional attempts will be performed for a delivery once it is considered to have been delivered successfully.

failed

The delivery has failed permanently.

A delivery fails permanently if the initial delivery attempt and two subsequent retries have all failed.

Once this occurs, no additional retries will be performed. A new delivery for the alert may be requested using the alert_delivery_resend API.

Resending Alerts

The alert_delivery_resend API requests a new delivery of an alert to a given alert receiver. Requesting that an alert be resent creates a new delivery resource for that alert, with a new delivery UUID. Retries for the new delivery are tracked separately from any previous delivery of the alert to that receiver. This API may be used to request that an alert for which the initial delivery failed permanently be delivered again, such as when the receiver is restored following an outage. However, there are no limitations on the use of the resend API: an alert may be resent any number of times, even if it has been delivered successfully.
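
One common use of this API is to sweep up permanently failed deliveries once a receiver is healthy again. The sketch below uses a hypothetical client object as a stand-in for whatever SDK or HTTP wrapper is used to call the alert_delivery_list and alert_delivery_resend endpoints; the method names and parameters are illustrative assumptions, not the actual Oxide SDK surface.

```python
def resend_failed_deliveries(client, receiver_name: str) -> None:
    """Request new deliveries for every permanently failed delivery.

    `client` is a hypothetical wrapper around the Oxide API; the method and
    field names below are placeholders, not the real SDK surface.
    """
    failed = client.alert_delivery_list(receiver=receiver_name, state="failed")
    for delivery in failed:
        # Each resend creates a brand-new delivery with its own UUID and its
        # own retry budget, tracked separately from the failed delivery.
        client.alert_delivery_resend(receiver=receiver_name, alert_id=delivery["alert_id"])
```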

Receiver Liveness Probes

Liveness probes are synthetic delivery attempts sent by the alert dispatcher to a receiver, to determine whether the receiver is currently capable of receiving an alert. A liveness probe for a receiver may be requested using the alert_receiver_probe API endpoint.

Currently, the Oxide control plane does not automatically send liveness probes to alert receivers. Liveness probes are only performed when explicitly requested using the alert_receiver_probe API endpoint.

Tip
If periodic liveness probes of alert receivers are required, an external health-checking system should periodically request a liveness probe from the Oxide API.

Deliveries for liveness probes do not represent notifications for actual alerts, and should not be handled as alerts by the receiver implementation. Instead, probe requests are used to determine whether the receiver is currently available and could receive an actual alert if one were to be sent. Optionally, when requesting that a probe be sent, the caller may indicate that if the probe succeeds, the alert dispatcher should create new deliveries for any alerts which were not delivered successfully to the probed receiver.

Liveness probes have a number of uses:

  • They may be triggered by the receiver itself to indicate that it has become available after an outage, and to resend any missed alerts.

  • They may be triggered by an external system in order to monitor the health of the receiver. See the guide on implementing reliable receivers for details.

  • They may be triggered manually by an operator when testing a receiver implementation.

Liveness probes are particularly valuable when an alert receiver subscribes to infrequent but high-priority alerts, such as faults. When the system is operating normally, no alerts will be generated. If probes are not sent to the receiver while the rack is operating normally and no faults have been detected, the receiver could become unavailable at any time without being detected. Should a fault then be detected in the rack, the dispatcher will attempt to deliver an alert to the receiver, but if the receiver is unavailable, the alert will be lost and operators may be unaware of the fault.

The delivery payload for a liveness probe contains data in the same format as that for an actual alert, but with the special "probe" alert class. All probes have the same alert UUID, 001de000-7768-4000-8000-000000000001. The alert-specific data for a liveness probe will be empty, as it does not represent an actual alert.
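
A receiver implementation can therefore recognize probe deliveries by their alert class (and, if desired, by the fixed probe UUID) and acknowledge them without running its normal alert handling. A minimal sketch, assuming the payload exposes the class and UUID in alert_class and alert_id fields (the field names are assumptions; see the webhook documentation for the actual schema):

```python
# Fixed UUID shared by every liveness probe delivery.
PROBE_ALERT_ID = "001de000-7768-4000-8000-000000000001"

def handle_delivery(payload: dict) -> None:
    """Process one delivery, treating liveness probes separately."""
    if payload.get("alert_class") == "probe":
        # A probe only checks that the receiver is reachable. Acknowledge it
        # (by returning success from the HTTP handler), but do not run the
        # normal alert-handling logic or notify anyone.
        assert payload.get("alert_id") == PROBE_ALERT_ID
        return
    process_real_alert(payload)

def process_real_alert(payload: dict) -> None:
    print("handling alert", payload.get("alert_id"), payload.get("alert_class"))
```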