FAQ

Rack Maintenance

What happens when there is a power or network outage?

Power outage:

  • The rack hardware is designed to boot up on its own once the power supply is resumed. All user interfaces, i.e., API and Web Console, will be accessible again after the sleds and switches have come back up. No manual intervention or rack configuration data loss is expected.

  • Applications running on the rack instances may or may not fully recover automatically, depending on their running state prior to the power outage and the design of those applications. It is the responsibility of the instance owners to ensure instances are properly shut down prior to a scheduled outage to prevent data loss.

Network outage:

  • Network connectivity to and from the rack and the VM instances should resume once the uplink connection is restored. No manual intervention or data loss is expected.
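
After power or uplink connectivity is restored, you may want to confirm that the API is answering again before resuming work. The following is a minimal sketch, not an official tool: wait_until_ok and api_ping are hypothetical helper names, and the attempt count and delay are arbitrary. It assumes OXIDE_HOST is set to your API endpoint URL and uses the /v1/ping endpoint shown later in this FAQ.

```shell
# Retry a check command until it succeeds or the attempts run out.
# Usage: wait_until_ok ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
wait_until_ok() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    # Sleep only between attempts, not after the last one.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Hypothetical probe: succeeds when the ping endpoint answers with a
# successful HTTP status. Assumes OXIDE_HOST is set, e.g. https://<your API host>.
api_ping() {
  curl -sS --fail --max-time 5 "${OXIDE_HOST}/v1/ping"
}

# Example (not executed here): poll every 10 seconds, up to 30 times.
#   wait_until_ok 30 10 api_ping && echo "API is reachable again"
```

Keeping the retry loop separate from the probe means the same loop can wrap any other check, such as an application-level health check on an instance.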

How do I prepare the rack for scheduled power maintenance?

We recommend shutting down instances on the rack before the maintenance and bringing them back up afterward. Consider automating these operations with the instance_stop and instance_start APIs.
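
One possible way to script this is sketched below. It assumes the oxide CLI exposes the instance_stop and instance_start operations as `oxide instance stop` and `oxide instance start` with `--project` and `--instance` flags; confirm this with `oxide instance --help` before relying on it. The helper names, project name, and instance names are hypothetical. The sketch defaults to a dry run that only prints the commands.

```shell
# DRY_RUN=1 (the default in this sketch) prints each command instead of running it.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Stop or start every named instance in one project.
# Usage: set_instances_state stop|start PROJECT INSTANCE...
set_instances_state() {
  local action="$1" project="$2" name
  shift 2
  case "$action" in
    stop|start) ;;
    *) echo "unknown action: $action" >&2; return 2 ;;
  esac
  for name in "$@"; do
    # Assumed CLI form of the instance_stop / instance_start API operations.
    run oxide instance "$action" --project "$project" --instance "$name"
  done
}

# Dry-run example with hypothetical project and instance names.
set_instances_state stop demo-project web-1 db-1
```

Set DRY_RUN=0 to execute the commands for real. Run the same function with `start` after the maintenance window to bring the instances back up.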

The LED is off for one of the drives/sleds. Should I attempt to reseat it?

No. It is possible that the component connection is intact but the LED itself is faulty. Please contact Oxide Support for investigation and refrain from making any hot-plug/unplug or reboot attempts.

Network Configuration and IP Address Management

Why are users getting "No external IP addresses available" error when most instances have only private NICs?

External IP addresses are allocated to instances for both inbound and outbound access outside the rack. All instances make use of NAT for outbound access. The NAT service also consumes IP addresses from the rack IP pool. Unlike inbound external IP addresses that are assigned one per instance, each NAT IP is shared by up to four instances and the address is not exposed to users.

To rectify the "No external IP addresses available" error, the fleet administrator can add one or more IP address ranges to the default IP pool using ip_pool_range_add.
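
To size an additional range, a rough lower bound can be worked out from the sharing rule above: one address per instance that has an inbound external IP, plus one NAT address for every four instances. The sketch below only illustrates that arithmetic. The function names are hypothetical, and the rack's actual allocation may need more addresses than this estimate.

```shell
# Lower-bound estimate of NAT addresses: one per four instances, rounded up.
nat_ips_needed() {
  echo $(( ($1 + 3) / 4 ))
}

# Estimated pool usage: inbound external IPs (one per instance that has one)
# plus the shared NAT addresses for all instances.
# Usage: pool_ips_needed TOTAL_INSTANCES INSTANCES_WITH_EXTERNAL_IP
pool_ips_needed() {
  local instances="$1" with_external_ip="$2"
  echo $(( with_external_ip + $(nat_ips_needed "$instances") ))
}

# 10 instances, 2 of which have an inbound external IP: 2 + ceil(10/4) = 5
pool_ips_needed 10 2
```

This is why a pool can run out even when few instances have an external IP of their own: every instance still counts toward the NAT share.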

How do I diagnose and fix Oxide API/Console connectivity issues?

Here are some possible causes for Oxide endpoint connection time-out errors:

  1. The Oxide endpoints are blocked by the allow-list settings configured with the service allow-list API. (Note: The allow-list is set to "any" by default.)

  2. The Oxide endpoints are blocked by firewall rules external to the rack.

  3. Some or all of the Oxide endpoints are not serving requests.

Case 1 usually happens right after a change is made to the service allow-list or when clients outside the allow-list are used to access the Oxide endpoints. If you can still make API calls from a workstation within the allow-list IP ranges, reconfigure the allow-list from there to cover the appropriate ranges; otherwise, follow the unlock steps below to update the settings via the technician port interface.

If the above does not apply to your situation, check your network firewall logs for denied access attempts against the Oxide endpoints (case 2 above).

If both cases 1 and 2 have been ruled out, the Oxide service is likely the problem. The next step is to contact Oxide Support for assistance.
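
As a first triage step, it can help to see how a request fails from the affected client. The sketch below maps a few well-known curl exit codes to short labels. The function names and labels are hypothetical, and the result cannot tell cases 1 to 3 apart by itself; it only confirms whether you are seeing a time-out as opposed to, say, a DNS failure. It assumes OXIDE_HOST is set to the API endpoint URL.

```shell
# Map a curl exit code to a short label.
# 6 = could not resolve host, 7 = failed to connect, 28 = operation timed out.
classify_curl_exit() {
  case "$1" in
    0)  echo "reachable" ;;
    6)  echo "dns-failure" ;;
    7)  echo "connect-failed" ;;
    28) echo "timeout" ;;
    *)  echo "other" ;;
  esac
}

# Probe the ping endpoint and print the label for the outcome.
# Assumes OXIDE_HOST is set, e.g. https://<your API host>.
probe_endpoint() {
  local rc=0
  curl -sS -o /dev/null --connect-timeout 5 --max-time 10 \
    "${OXIDE_HOST}/v1/ping" || rc=$?
  classify_curl_exit "$rc"
}

# Example (not executed here):
#   probe_endpoint
```

Running the probe from both an allow-listed workstation and the affected client can help narrow the cause: if only the affected client times out, the allow-list or an intermediate firewall is the more likely culprit.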

How do I restore Oxide access when there is an accidental lockout?

If the service allow-list is misconfigured and the Oxide endpoints are inaccessible from anywhere on your internal network, you can make corrections through the technician port ("techport").

  1. Enable techport access by connecting a Linux host to one of the two RJ-45 ports on the front panel of the Oxide rack. The host may be a laptop or a secure jumpbox for out-of-band support. It must be configured for IPv6 autoconfiguration. You can download the oxide CLI Linux binaries from here or install the curl package to use the raw API.

  2. As someone with fleet administrator privileges who has previously authenticated using the CLI, locate your API credentials. By default, the credentials are stored in $HOME/.config/oxide/credentials.toml.

  3. This and the subsequent steps are done on the host connecting to the techport. First, set the following environment variables using the credentials obtained from Step 2:

    export OXIDE_HOST=${host URL}:12229
    export OXIDE_FQDN=${host name only}
    export OXIDE_TOKEN=${device token}

    For example:

    export OXIDE_HOST=https://recovery.sys.oxide.acme.com:12229
    export OXIDE_FQDN=recovery.sys.oxide.acme.com
    export OXIDE_TOKEN=oxide-token-asdfasdff4482b4cba338e350477cedf33552244

  4. Locate the techport bootstrap addresses by running:

    ip neighbor show | grep fdb1

    The output will look like this:

    fdb1:a840:2504:195::1 dev eno2 lladdr 02:08:20:36:5c:8d REACHABLE
    fdb1:a840:2504:352::1 dev eno1 lladdr 02:08:20:bb:26:4d REACHABLE

  5. Confirm that you can access the API endpoint by proxying through the techport address.

    • Set the techport address environment variable to any of the available techport addresses in your environment, e.g.,

      export TP_ADDR="fdb1:a840:2504:195::1"

    • Now let’s ping the Oxide endpoint through the techport:

      oxide --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] ping
      
      # using curl
      curl -sSq --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] ${OXIDE_HOST}/v1/ping

      The ping request should return an "ok" if the API endpoint is reachable:

      {
        "status":"ok"
      }

  6. Confirm that your fleet admin credentials are working as expected by viewing the allow-list settings:

    oxide --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] system networking allow-list view

    # using curl
    curl -sSq -H "Authorization: Bearer ${OXIDE_TOKEN}" \
    --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] \
    ${OXIDE_HOST}/v1/system/networking/allow-list

    Sample output:

    {
      "time_created": "2024-05-06T06:25:37.128096Z",
      "time_modified": "2024-07-04T21:36:22.580198Z",
      "allowed_ips": {
        "allow": "list",
        "ips": [
          "172.20.3.69/28",
          "172.20.16.0/23",
          "172.20.2.15/27",
          "172.20.26.0/24"
        ]
      }
    }

  7. Reconfigure the allow-list to include the appropriate IP ranges.

    oxide --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] \
    system networking allow-list update \
    --ip "${subnet1}" \
    --ip "${subnet2}" \
    ...

    # using curl
    curl -sSq -H "Authorization: Bearer ${OXIDE_TOKEN}" \
    --header "Content-Type: application/json" \
    --resolve ${OXIDE_FQDN}:12229:[${TP_ADDR}] \
    ${OXIDE_HOST}/v1/system/networking/allow-list -X PUT \
    --data @- <<EOF
    {
      "allowed_ips": {
        "allow": "list",
        "ips": [ "${subnet1}", "${subnet2}", ... ]
      }
    }
    EOF
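
The variable setup in Steps 3 to 5 can be partly scripted. The helpers below are a sketch under stated assumptions, not part of the oxide CLI: fqdn_from_url, techport_addrs, and resolve_arg are hypothetical names. They derive the host name from OXIDE_HOST, pick techport addresses out of `ip neighbor show` output, and build the `--resolve` argument with the port 12229 used above.

```shell
# Extract the host name from a URL such as https://host.example.com:12229
fqdn_from_url() {
  local rest="${1#*://}"   # drop the scheme
  rest="${rest%%/*}"       # drop any path
  echo "${rest%%:*}"       # drop the port
}

# Read `ip neighbor show` output on stdin and print the techport
# bootstrap addresses (first field of lines starting with fdb1:).
techport_addrs() {
  awk '$1 ~ /^fdb1:/ { print $1 }'
}

# Build the value for --resolve: FQDN:12229:[TECHPORT_ADDRESS]
resolve_arg() {
  echo "${1}:12229:[${2}]"
}

# Example using the sample values from the steps above.
resolve_arg "$(fqdn_from_url https://recovery.sys.oxide.acme.com:12229)" \
  "fdb1:a840:2504:195::1"

# On the techport host (not executed here):
#   TP_ADDR="$(ip neighbor show | techport_addrs | head -n 1)"
#   oxide --resolve "$(resolve_arg "$OXIDE_FQDN" "$TP_ADDR")" ping
```

Quoting the `--resolve` value keeps the square brackets around the IPv6 address from being treated as a glob pattern by the shell.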

How do I modify the uplink network settings?

Certain changes to existing links, e.g., BGP configurations and VLAN ID, can be done through the operator API. The reconfiguration steps will likely involve some careful orchestration such that the rack’s upstream connectivity is maintained throughout the process. Please contact Oxide Support if you would like assistance with planning any rack networking changes.

What do I do if our DNS or NTP server endpoints have changed?

No operator API is available to reconfigure DNS or NTP configurations at this time. Please contact Oxide Support for assistance.

Password and SSL Certificate Management

How do I rotate a silo TLS certificate?

Only users with the silo admin role can replace a TLS certificate, and only through the Oxide API.

Follow the steps here to replace TLS certificates.

How do I set/invalidate local user passwords?

At this time, only users with the fleet admin role can set or reset a password, using the local_idp_user_set_password API. The requester must provide the fleet administrator with the specific user_id, which can be obtained using user_list or current_user_view.

To change the password for a user, set the mode to password in the request.

To invalidate the password to revoke access, set the mode to login_disallowed in the request.
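
The two request bodies might look like the output of the sketch below. The mode values come from the text above. The `value` field name for the new password is an assumption, so check the API reference for local_idp_user_set_password before using it. The function name password_body is hypothetical.

```shell
# Print a JSON request body for local_idp_user_set_password.
# Usage: password_body password NEW_PASSWORD
#        password_body login_disallowed
password_body() {
  case "$1" in
    password)
      # Assumption: the new password goes in a "value" field.
      # This simple printf does no JSON escaping, so the password must not
      # contain double quotes, backslashes, or control characters.
      printf '{"mode":"password","value":"%s"}\n' "$2"
      ;;
    login_disallowed)
      printf '{"mode":"login_disallowed"}\n'
      ;;
    *)
      echo "unknown mode: $1" >&2
      return 2
      ;;
  esac
}

# Example: the body that would revoke a user's password-based login.
password_body login_disallowed
```

The printed body can be passed to curl with `--data`, in the same way as the allow-list update shown earlier in this FAQ.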

How do I update the "recovery" user password?

The recovery user is a built-in user created during rack setup. Although the account itself should not be altered, it can be managed like any regular local user. You can therefore update the recovery user’s password the same way as for other local users, using local_idp_user_set_password.

Access Management

How do I offboard a user from a silo?

Users managed in an identity provider:

  • In the identity provider system, remove the user account from the realm that is associated with the user’s silo.

Users managed locally on the rack:

  • As a user with the fleet admin role, use local_idp_user_delete to completely remove the user from the system.

  • Fleet administrators may also use local_idp_user_set_password to revoke the user’s login, but the change takes effect on the Web Console only and does not impact device token-based access.

Please note that there is no operator API for device token invalidation at this time (follow "Known Behavior and Limitations" in the Release Notes section for further updates). If the user being offboarded has been granted any fleet, silo or project roles directly in IAM, the corresponding silo or project administrators should also delete the user from the IAM role assignment to limit their access.

How do I offboard a user from a project?

If the user is given access to the project via identity provider group membership, you can simply remove the user from the groups that are granted the project IAM roles.

If the user is given access directly in IAM, the project administrator can remove them from the project IAM role assignment.

How do I modify identity provider configurations such as silo admin mapping?

Identity provider configurations cannot be modified; this restriction prevents silo membership from going out of sync with the identity provider system. The only way to modify the "silo admin" role mapping is to delete and re-create the silo.
