When alerts are used to notify external systems of high-priority events, such as faults, it is important to ensure that alerts are delivered reliably. If an alert for an active problem is generated but never received by the alert receiver, operators will not be notified of the fault. The Oxide alerts API provides mechanisms to make alert delivery as reliable as possible, but receiver implementations must take care to use these mechanisms correctly. This guide describes steps receiver implementations should take to avoid missing alerts.
Resending Failed Deliveries
If a receiver is unavailable, alerts dispatched to that receiver may not be received. The alert dispatcher will retry failed delivery attempts up to two times, with a one-minute backoff before the first retry and a five-minute backoff before the second. However, if a receiver is unavailable for long enough to miss both retry attempts, the alert dispatcher will not attempt to deliver that alert again unless explicitly asked to by the receiver.
Therefore, it is recommended that receiver implementations request that any failed deliveries be resent when they start up.
This way, if the software implementing a receiver crashes and is restarted, it can trigger a new delivery attempt for any alerts that it may have missed due to the crash.
The simplest way to do so is for the receiver to call the alert_receiver_probe API endpoint with the ?resend=true query parameter to trigger a liveness probe to itself.
When such a probe succeeds, the alert dispatcher will then resend all alerts which have not been delivered successfully to that receiver.
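For example, a receiver might trigger such a probe as part of its startup sequence. The following is a minimal sketch in Python, assuming the API is reachable at an address in OXIDE_HOST with a bearer token in OXIDE_TOKEN; the URL path shown is inferred from the alert_receiver_probe operation name and should be checked against the API reference.

```python
# Minimal sketch: on startup, trigger a liveness probe with resend=true so
# the dispatcher resends any deliveries missed while this receiver was down.
# The URL path and auth scheme are assumptions; verify them against the
# Oxide API reference for the alert_receiver_probe operation.
import os

import requests

OXIDE_HOST = os.environ["OXIDE_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['OXIDE_TOKEN']}"}
RECEIVER = "my-receiver"  # hypothetical receiver name


def probe_and_resend() -> None:
    resp = requests.post(
        f"{OXIDE_HOST}/v1/alert-receivers/{RECEIVER}/probe",
        params={"resend": "true"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    # On success, the dispatcher resends every alert that has not yet been
    # delivered successfully to this receiver.


if __name__ == "__main__":
    probe_and_resend()
```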
Alternatively, if more precise control over which alerts are resent is required, the alert_delivery_list API endpoint provides a list of alert delivery attempts to a receiver.
To list only failed delivery attempts, the query parameters ?failed=true&pending=false&delivered=false can be added to requests to this endpoint.
The receiver may then request re-delivery of specific alerts using the alert_delivery_resend API endpoint.
This is more complex than triggering a liveness probe, but it may be useful in situations where the receiver only wishes to resend a subset of failed alerts.
For example, it may only request re-delivery of alerts that occurred within a particular time window, or it may only resend alerts that were not processed by a redundant receiver subscribed to the same alert classes.
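As a sketch of the time-window approach, the following lists failed deliveries and requests re-delivery only of alerts from the last 24 hours. The URL paths and the response field names (items, alert_id, time_created) are assumptions derived from the operation names, and pagination of the list endpoint is omitted for brevity; verify all of these against the API reference.

```python
# Sketch: resend only failed deliveries from a recent time window.
# Paths and response fields ("items", "alert_id", "time_created") are
# assumptions; pagination of the list endpoint is omitted for brevity.
import os
from datetime import datetime, timedelta, timezone

import requests

OXIDE_HOST = os.environ["OXIDE_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['OXIDE_TOKEN']}"}
RECEIVER = "my-receiver"  # hypothetical receiver name


def resend_recent_failures(window: timedelta) -> None:
    # List only failed delivery attempts for this receiver.
    resp = requests.get(
        f"{OXIDE_HOST}/v1/alert-receivers/{RECEIVER}/deliveries",
        params={"failed": "true", "pending": "false", "delivered": "false"},
        headers=HEADERS,
        timeout=30,
    )
    resp.raise_for_status()
    cutoff = datetime.now(timezone.utc) - window

    for delivery in resp.json()["items"]:
        # Parsing a trailing "Z" in the timestamp requires Python 3.11+.
        if datetime.fromisoformat(delivery["time_created"]) < cutoff:
            continue  # outside the window; skip
        requests.post(
            f"{OXIDE_HOST}/v1/alerts/{delivery['alert_id']}/resend",
            params={"receiver": RECEIVER},
            headers=HEADERS,
            timeout=30,
        ).raise_for_status()


resend_recent_failures(timedelta(hours=24))
```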
In addition to checking for failed delivery attempts on startup, receiver implementations may periodically attempt to resend failed deliveries while they are running. Even if a receiver has not crashed, event deliveries may have been missed due to network connectivity problems or other transient issues.
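One way to do this is a background loop that periodically re-triggers the probe-and-resend call shown earlier; the interval below is arbitrary and should reflect how much delivery delay is tolerable.

```python
# Sketch: periodically re-trigger probe-and-resend while running, so
# deliveries missed due to transient issues are eventually retried.
# Reuses probe_and_resend() from the startup sketch above.
import time

import requests

RESEND_INTERVAL_SECS = 15 * 60  # arbitrary; tune to your delay tolerance

while True:
    time.sleep(RESEND_INTERVAL_SECS)
    try:
        probe_and_resend()
    except requests.RequestException as exc:
        # Keep the loop alive across transient failures; the next
        # iteration will try again.
        print(f"resend attempt failed: {exc}")
```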
Monitoring Receiver Availability
Liveness probes may be used to monitor the health of a receiver. An external system which periodically triggers liveness probes to a receiver using the alert_receiver_probe API endpoint can detect situations where the receiver is unavailable and alert operators to the outage. This allows receiver failures to be detected and resolved proactively, reducing the likelihood of missing notifications for important events.
While sending requests directly to the receiver process can detect failures where the receiver is completely offline, it does not exercise communication between the Oxide control plane and the receiver. Outages may still occur when the receiver is online and capable of processing requests but is not reachable by the Oxide control plane. Calling the alert_receiver_probe API endpoint tests whether the control plane can actually deliver an alert to the receiver, and will fail if the Oxide control plane cannot communicate with the receiver.
It may be valuable for a monitoring system to both trigger liveness probe requests from the Oxide control plane dispatcher using the alert_receiver_probe endpoint and to send its own probes directly to the receiver. This provides separate signals for whether the receiver is running at all, and for whether it is reachable by the Oxide control plane. Operators can then reason about whether an outage is due to the alert receiver process not running at all, or due to a network partition between it and the Oxide control plane.
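A monitoring check along these lines might look like the following sketch. The receiver's direct health URL is a hypothetical endpoint exposed by the receiver implementation itself, and whether a failed probe surfaces as an HTTP error or as a field in the response body should be verified against the API reference.

```python
# Sketch: probe the receiver both directly and via the control plane, to
# distinguish "receiver down" from "receiver unreachable by the control
# plane". RECEIVER_HEALTH_URL is a hypothetical endpoint of the receiver.
import os

import requests

OXIDE_HOST = os.environ["OXIDE_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['OXIDE_TOKEN']}"}
RECEIVER = "my-receiver"  # hypothetical receiver name
RECEIVER_HEALTH_URL = "http://receiver.example.com/health"  # hypothetical


def page(msg: str) -> None:
    print(msg)  # stand-in for a real paging/alerting integration


def check_receiver() -> None:
    try:
        requests.get(RECEIVER_HEALTH_URL, timeout=5).raise_for_status()
        direct_ok = True
    except requests.RequestException:
        direct_ok = False

    try:
        # Whether a failed probe is reported via HTTP status or a field in
        # the response body is an assumption to verify against the API docs.
        requests.post(
            f"{OXIDE_HOST}/v1/alert-receivers/{RECEIVER}/probe",
            headers=HEADERS,
            timeout=30,
        ).raise_for_status()
        probe_ok = True
    except requests.RequestException:
        probe_ok = False

    if direct_ok and not probe_ok:
        page("receiver is up but unreachable from the Oxide control plane")
    elif not direct_ok:
        page("receiver process appears to be down")
```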
Redundant Receiver Endpoints
If a receiver endpoint is unavailable, alerts may be missed. Therefore, when a receiver is subscribed to critical alerts such as hardware faults, operators are encouraged to run multiple independent instances of that receiver subscribed to the same alert classes. This can be achieved by creating multiple alert receiver API resources with the same event subscriptions and distinct alert delivery endpoints. When multiple receivers are subscribed to the same alert classes, the same alerts will be delivered to all receivers.

Alert UUIDs should be used to de-duplicate events that are delivered to multiple receivers. If an alert is delivered to more than one receiver, the same alert UUID is used for each receiver that receives the alert. Note that alert deliveries are specific to an alert receiver; separate alert delivery API resources will be created for each receiver to which an alert is dispatched.
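As a sketch, a receiver might de-duplicate on the alert UUID carried in the delivery payload. The payload field name below is an assumption, and the in-memory set only de-duplicates within a single process; redundant receivers that must de-duplicate between each other need a shared store such as a database.

```python
# Sketch: de-duplicate alerts delivered to redundant receivers by UUID.
# The "alert_id" payload field is an assumption about the delivery schema;
# an in-memory set only works within one process, so redundant receivers
# de-duplicating across processes need a shared store instead.
seen_alert_ids: set[str] = set()


def handle_delivery(payload: dict) -> None:
    alert_id = payload["alert_id"]  # the same UUID at every receiver
    if alert_id in seen_alert_ids:
        return  # already handled via another receiver endpoint
    seen_alert_ids.add(alert_id)
    process_alert(payload)


def process_alert(payload: dict) -> None:
    ...  # receiver-specific handling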
Both redundant receivers and periodically checking for failed deliveries (as described in the previous section) are orthogonal techniques to protect against missed alerts due to receiver downtime.
Depending on the receiver’s reliability requirements, it may be preferable to use one or both of these mechanisms.
Tolerance for delays in alert delivery due to receiver downtime is an important tradeoff to consider when selecting these mechanisms. When only a single receiver endpoint is used, checking for failed deliveries when that receiver starts up ensures that deliveries missed during downtime are eventually processed, but no alerts will be processed while the receiver is unavailable. In contrast, with multiple redundant receivers, alerts will still be received promptly as long as at least one receiver is available. If receiving alerts in a timely manner is important, e.g., for high-urgency fault-management alerts, consider the use of redundant receivers.
Nonetheless, there are some failure scenarios in which even redundant receivers could miss alerts. An outage may impact all redundant receivers if they run in the same failure domain, or if a network partition cuts off all communication to and from the Oxide rack. Therefore, even when redundant receivers are used, checking for failed deliveries periodically and on startup is still recommended if these classes of failure are a concern.
Zero-Downtime Webhook Secret Rotation
The Oxide webhook API permits multiple shared secrets to be configured for a single webhook receiver.
If a webhook receiver is configured with more than one secret, webhook requests will include an x-oxide-signature header for each secret.
The value of these headers includes the secret’s ID in addition to the HMAC signature of the payload, allowing the receiver to determine which secret was used to generate that signature.
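A receiver's verification logic might therefore look like the following sketch. The header value format ("id=<secret-id>&sig=<hex>") and the use of HMAC-SHA256 are assumptions made for illustration; adapt the parsing to the documented x-oxide-signature format.

```python
# Sketch: verify a payload signed with multiple secrets. The header value
# format ("id=<secret-id>&sig=<hex>") and the use of HMAC-SHA256 are
# assumptions; adapt the parsing to the documented x-oxide-signature format.
import hashlib
import hmac

# Secrets this receiver currently trusts, keyed by secret ID.
SECRETS: dict[str, bytes] = {
    "old-secret-id": b"old-secret-value",  # hypothetical IDs and values
    "new-secret-id": b"new-secret-value",
}


def verify(body: bytes, signature_headers: list[str]) -> bool:
    # The request carries one x-oxide-signature header per configured
    # secret; the payload is authentic if any of them verifies with a
    # secret this receiver knows.
    for header in signature_headers:
        fields = dict(
            part.split("=", 1) for part in header.split("&") if "=" in part
        )
        secret = SECRETS.get(fields.get("id", ""))
        if secret is None:
            continue  # signed with a secret this receiver does not know
        expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, fields.get("sig", "")):
            return True
    return False
```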
Multiple secrets may be used during secret key rotations in order to ensure that the webhook payload is always signed with a secret that can be verified by the receiver. When rotating secrets for a webhook receiver, perform the following steps in this order:
1. Create the new secret using the webhook_secrets_add API endpoint, which returns the ID assigned to that secret.
2. Configure the webhook receiver to verify signatures with the new secret. During this time, any webhook alert payload will be signed with both the new secret and the old secret, allowing it to be verified by the receiver regardless of whether it has yet been configured to accept the new secret.
3. Once the receiver is ready to accept the new secret, delete the old secret using the webhook_secrets_remove API endpoint.
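Expressed as API calls, the sequence above might look like the following sketch; the URL paths and request/response shapes are assumptions inferred from the webhook_secrets_add and webhook_secrets_remove operation names and should be checked against the API reference.

```python
# Sketch of the rotation sequence as API calls. Paths and request/response
# shapes are assumptions inferred from the webhook_secrets_add and
# webhook_secrets_remove operation names; verify against the API docs.
import os

import requests

OXIDE_HOST = os.environ["OXIDE_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['OXIDE_TOKEN']}"}
RECEIVER = "my-receiver"         # hypothetical receiver name
OLD_SECRET_ID = "old-secret-id"  # hypothetical ID of the secret rotated out

# Step 1: add the new secret; the response includes its assigned ID.
resp = requests.post(
    f"{OXIDE_HOST}/v1/webhook-secrets",
    params={"receiver": RECEIVER},
    json={"secret": "new-shared-secret-value"},
    headers=HEADERS,
    timeout=30,
)
resp.raise_for_status()
new_secret_id = resp.json()["id"]

# Step 2 happens out of band: configure the webhook receiver to verify
# signatures made with the new secret (e.g., add it to a SECRETS map like
# the one in the verification sketch above).

# Step 3: once the receiver accepts the new secret, remove the old one.
requests.delete(
    f"{OXIDE_HOST}/v1/webhook-secrets/{OLD_SECRET_ID}",
    params={"receiver": RECEIVER},
    headers=HEADERS,
    timeout=30,
).raise_for_status()
```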