Set the host maintenance policy for an instance


This document describes how to set the host maintenance policy for a virtual machine (VM) or bare metal instance to control how the instance behaves when a host event occurs.

Before you begin

  • If you haven't already, then set up authentication. Authentication is the process by which your identity is verified for access to Google Cloud services and APIs. To run code or samples from a local development environment, you can authenticate to Compute Engine by selecting one of the following options:

    Console

    When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.

    gcloud

    1. Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init
    2. Set a default region and zone.

    REST

    To use the REST API samples on this page in a local development environment, you use the credentials you provide to the gcloud CLI.

      Install the Google Cloud CLI, then initialize it by running the following command:

      gcloud init

    For more information, see Authenticate for using REST in the Google Cloud authentication documentation.

Limitations

  • You can't change the maintenance behavior of a preemptible VM. When there is a maintenance event, the preemptible VM stops and it does not migrate. You must manually restart the preempted VM.
  • After you create a VM using an E2 machine type, you can't change the maintenance behavior for the VM from MIGRATE to TERMINATE or the other way around.
  • You can't change the maintenance behavior for bare metal instances like c3-standard-192-metal or x4-megamem-1920-metal, which are set to TERMINATE and automatically restart.

Available host maintenance properties

You can configure a compute instance's maintenance behavior, restart behavior, and host error wait behavior. Compute Engine configures each instance with the default values unless you specify otherwise.

During host events, depending on the configured host maintenance policy, instances that don't support live migration are terminated and automatically restarted.

  • onHostMaintenance: determines the behavior when a maintenance event occurs that might cause your instance to restart.

    • MIGRATE: causes Compute Engine to live migrate an instance when there is a maintenance event. This is the default for most VMs.
    • TERMINATE: stops the instance instead of using live migration. This is the default option for Z3 VMs, bare metal instances, and instances with accelerators such as GPUs and TPUs. For these instance types, you can't change the setting for onHostMaintenance.
  • automaticRestart: determines the behavior when an instance crashes or is stopped by the system.

    • true (Default): Compute Engine restarts an instance if the instance crashes or is stopped.
    • false: Compute Engine does not restart an instance if the instance crashes or is stopped.
  • localSsdRecoveryTimeout: sets the Local SSD recovery timeout, which is the maximum amount of time, in hours, that Compute Engine waits to recover Local SSD data after a host error. This setting applies only to VMs with attached Local SSD disks. If you configure it for an instance that doesn't have attached Local SSD disks, then the setting is ignored.

    • Unset (Default): Compute Engine waits up to 1 hour to recover the Local SSD data. For Z3 VMs, the default wait time is 6 hours.
    • An integer from 0 to 168: specifies the number of hours that Compute Engine waits to recover the Local SSD data. The maximum value is equivalent to 7 days. A value of 0 means that Compute Engine doesn't wait to recover the Local SSD data and restarts the instance immediately.
  • hostErrorTimeoutSeconds (Preview): sets the maximum amount of time, in seconds, that Compute Engine waits to restart or terminate a compute instance after detecting that the instance is unresponsive.

    • Unset (Default): Compute Engine waits up to 5.5 minutes (330 seconds) before restarting an unresponsive instance.
    • An integer from 90 to 330: the number of seconds, specified in increments of 30, that Compute Engine waits before restarting an unresponsive compute instance.
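
The value ranges listed above can be checked on the client side before a request is sent. The following is a minimal Python sketch under that assumption; the function names are hypothetical and are not part of any Compute Engine tooling, and the limits are copied from the list above.

```python
def validate_local_ssd_recovery_timeout(hours):
    """Return hours if it is an integer from 0 to 168, else raise ValueError."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(hours, bool) or not isinstance(hours, int):
        raise ValueError("localSsdRecoveryTimeout must be a whole number of hours")
    if not 0 <= hours <= 168:
        raise ValueError("localSsdRecoveryTimeout must be from 0 to 168 hours")
    return hours


def validate_host_error_timeout(seconds):
    """Return seconds if it is from 90 to 330 in steps of 30, else raise ValueError."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("hostErrorTimeoutSeconds must be a whole number of seconds")
    if not 90 <= seconds <= 330:
        raise ValueError("hostErrorTimeoutSeconds must be from 90 to 330 seconds")
    if (seconds - 90) % 30 != 0:
        raise ValueError("hostErrorTimeoutSeconds must be in increments of 30")
    return seconds


print(validate_local_ssd_recovery_timeout(168))  # upper bound, equivalent to 7 days
print(validate_host_error_timeout(120))
```

A value such as 100 falls inside the 90 to 330 range but is rejected because it is not on a 30-second step.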

Set host maintenance policy of an instance

You can change the host maintenance policy of an instance when you first create the instance or after the instance is created.

Set host maintenance policy during instance creation

The information in this section focuses on how to set the host maintenance policy when you create an instance. For more instance creation examples, see Create and start a Compute Engine instance.

You can set the host maintenance policy of a compute instance at creation time by using the Google Cloud console, the gcloud CLI, or REST.

Console

  1. In the Google Cloud console, go to the Create an instance page.

    Go to Create an instance

  2. Specify a Name for the instance.

  3. Select a Region and Zone for the instance.

  4. In the Machine configuration section, do the following:

    1. Specify the details of the machine type for the instance.
    2. Expand the VM provisioning model advanced settings menu.
    3. In the On host maintenance menu, select one of the following options:
      1. To migrate VMs during maintenance events, select Migrate VM instance.
      2. To stop instances during maintenance events, select Terminate VM instance.
  5. To create the instance, click Create.

gcloud

To set the host maintenance policy of a new instance, use the gcloud compute instances create command.

To set the --host-error-timeout-seconds property (Preview), you must use the gcloud beta compute instances create command.

You can set the host maintenance policy of a new instance with the following command. If you omit any of the flags, the default value for the flag is used.

  gcloud compute instances create INSTANCE_NAME \
      --zone=ZONE \
      --maintenance-policy=MAINTENANCE_BEHAVIOR \
      --RESTART_ON_FAILURE_BEHAVIOR \
      --local-ssd-recovery-timeout=SSD_RECOVERY_TIMEOUT \
      --host-error-timeout-seconds=ERROR_DETECTION_TIMEOUT

Replace the following:

  • INSTANCE_NAME: the instance name.
  • ZONE: the zone where you want to create the instance.
  • MAINTENANCE_BEHAVIOR: the maintenance event behavior of an instance, either TERMINATE or MIGRATE. For most machine types, the VM is migrated by default if you omit this property. Z3 and bare metal instances terminate.
  • RESTART_ON_FAILURE_BEHAVIOR: the restart behavior for terminated or unresponsive instances, either restart-on-failure (default) or no-restart-on-failure.
  • SSD_RECOVERY_TIMEOUT: the number of hours to spend recovering Local SSD disks attached to a terminated or unresponsive instance. Valid values are from 0 to 168, in increments of 1 hour.
  • ERROR_DETECTION_TIMEOUT: the number of seconds Compute Engine waits before restarting an unresponsive instance, from 90 to 330 seconds (5.5 minutes), in 30-second increments.
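
To illustrate how the placeholders map to flags, the following Python sketch assembles the command as an argument list and leaves out any flag you don't supply, so the default applies. The helper name `build_create_command` and the example values (`example-vm`, `us-central1-a`) are assumptions made for illustration; the sketch only builds the command and doesn't run it.

```python
import shlex


def build_create_command(instance_name, zone, maintenance_behavior=None,
                         restart_on_failure=None, ssd_recovery_timeout=None,
                         host_error_timeout=None):
    """Return the gcloud argument list; options left as None are omitted."""
    args = ["gcloud"]
    # --host-error-timeout-seconds is in Preview, so it needs the beta command.
    if host_error_timeout is not None:
        args.append("beta")
    args += ["compute", "instances", "create", instance_name, f"--zone={zone}"]
    if maintenance_behavior is not None:
        args.append(f"--maintenance-policy={maintenance_behavior}")
    if restart_on_failure is not None:
        args.append("--restart-on-failure" if restart_on_failure
                    else "--no-restart-on-failure")
    if ssd_recovery_timeout is not None:
        args.append(f"--local-ssd-recovery-timeout={ssd_recovery_timeout}")
    if host_error_timeout is not None:
        args.append(f"--host-error-timeout-seconds={host_error_timeout}")
    return args


command = build_create_command("example-vm", "us-central1-a",
                               maintenance_behavior="TERMINATE",
                               restart_on_failure=False,
                               ssd_recovery_timeout=2)
print(shlex.join(command))
```

Building a list rather than a single string keeps each argument separate, which is the form that `subprocess.run` expects if you later choose to execute the command.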

REST

To set the host maintenance policy of a new instance using REST, use the instances.insert method.

You can set the host maintenance policy of a new instance with the following request. If you omit any of the fields, the default value for the field is used.

      POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances

      {
        "name": "INSTANCE_NAME",

        "scheduling": {
          "onHostMaintenance": "MAINTENANCE_BEHAVIOR",
          "automaticRestart": RESTART_POLICY,
          "localSsdRecoveryTimeout": SSD_RECOVERY_TIMEOUT
        }
      }

Replace the following:

  • PROJECT_ID: the project for the instance.
  • ZONE: the zone where you want to create the instance.
  • INSTANCE_NAME: the instance name.
  • MAINTENANCE_BEHAVIOR: the maintenance event behavior of an instance, either TERMINATE or MIGRATE. For most machine types, the VM is migrated by default if you omit this field. Z3 and bare metal instances terminate.
  • RESTART_POLICY: whether the instance restarts automatically after a maintenance event or a host error, either true (default) or false.
  • SSD_RECOVERY_TIMEOUT: the number of hours Compute Engine spends recovering any Local SSD disks attached to an unresponsive or terminated instance. Valid values are from 0 to 168, in increments of 1 hour. The default value for Z3 is 6 hours, and for all other VMs the default is 1 hour.
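
As a rough illustration, the following Python sketch fills in the URL and the scheduling fields of the request body. The helper name `build_insert_request` and the example values are hypothetical. Like the template above, the body contains only the name and scheduling fields, and localSsdRecoveryTimeout is left out because the sketch doesn't assume how that value is encoded.

```python
import json


def build_insert_request(project_id, zone, instance_name,
                         maintenance_behavior=None, restart_policy=None):
    """Return (url, body) for the create request; None fields are omitted."""
    url = ("https://compute.googleapis.com/compute/v1/"
           f"projects/{project_id}/zones/{zone}/instances")
    scheduling = {}
    if maintenance_behavior is not None:
        scheduling["onHostMaintenance"] = maintenance_behavior
    if restart_policy is not None:
        # Stored as a Python bool so that it serializes as a JSON boolean.
        scheduling["automaticRestart"] = bool(restart_policy)
    return url, {"name": instance_name, "scheduling": scheduling}


url, body = build_insert_request("example-project", "us-central1-a",
                                 "example-vm",
                                 maintenance_behavior="TERMINATE",
                                 restart_policy=False)
print("POST", url)
print(json.dumps(body, indent=2))
```

Serializing with `json.dumps` writes automaticRestart as an unquoted `true` or `false`, which matches the values described for RESTART_POLICY.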

Set the host error detection timeout

To set the maximum amount of time Compute Engine waits to restart or terminate an unresponsive instance, use the beta instances.insert method because this option is in Preview.

Add the hostErrorTimeoutSeconds property to the scheduling object of the request body, where HOST_ERROR_TIMEOUT is the number of seconds that Compute Engine waits before restarting or terminating an unresponsive instance. Valid values are from 90 to 330 (5.5 minutes), in 30-second increments.


   POST https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances

   {
     "name": "INSTANCE_NAME",

     "scheduling": {
       "onHostMaintenance": "MAINTENANCE_BEHAVIOR",
       "automaticRestart": RESTART_POLICY,
       "localSsdRecoveryTimeout": SSD_RECOVERY_TIMEOUT,
       "hostErrorTimeoutSeconds": HOST_ERROR_TIMEOUT
     }
   }

Update the host maintenance policy of an existing instance

Console

  1. In the Google Cloud console, go to the VM instances page.

    Go to VM instances

  2. Click the name of the instance for which you want to change settings. The instance details page displays.

  3. With the Details tab selected, complete the following steps:

    1. Click the Edit button at the top of the page.
    2. Go to the Management section. In the Availability policies section, you can change the host maintenance options.
    3. Click Save.

gcloud

Update the host maintenance policy of an existing instance with the gcloud compute instances set-scheduling command. Use the same parameters as for the instance creation command in the preceding section.

To update the maximum amount of time Compute Engine waits to restart or terminate an unresponsive instance (Preview), use the gcloud beta compute instances set-scheduling command and include --host-error-timeout-seconds=NUMBER_OF_SECONDS.

    gcloud compute instances set-scheduling INSTANCE_NAME \
      --maintenance-policy=MAINTENANCE_BEHAVIOR \
      --RESTART_ON_FAILURE_BEHAVIOR \
      --local-ssd-recovery-timeout=SSD_RECOVERY_TIMEOUT

Replace the following:

  • INSTANCE_NAME: the instance name.
  • MAINTENANCE_BEHAVIOR: the maintenance event behavior of an instance, either TERMINATE or MIGRATE. For most machine types, the VM is migrated by default if you omit this property. Z3 and bare metal instances terminate.
  • RESTART_ON_FAILURE_BEHAVIOR: the restart behavior for terminated or unresponsive instances, either restart-on-failure (default) or no-restart-on-failure.
  • SSD_RECOVERY_TIMEOUT: the number of hours to spend recovering Local SSD disks attached to a terminated or unresponsive instance. Valid values are from 0 to 168, in increments of 1 hour.
  • NUMBER_OF_SECONDS: the number of seconds Compute Engine waits before restarting or terminating an unresponsive instance, from 90 to 330 (5.5 minutes), in 30-second increments.

REST

Update the host maintenance policy of an existing instance using a POST request to the instances.setScheduling method.

    POST https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME/setScheduling

    {
      "onHostMaintenance": "MAINTENANCE_BEHAVIOR",
      "automaticRestart": RESTART_POLICY,
      "localSsdRecoveryTimeout": SSD_RECOVERY_TIMEOUT
    }

Replace the following:

  • PROJECT_ID: the project for the instance.
  • ZONE: the zone where the instance is located.
  • INSTANCE_NAME: the instance name.
  • MAINTENANCE_BEHAVIOR: the maintenance event behavior of this instance, either TERMINATE or MIGRATE.
  • RESTART_POLICY: whether the instance is automatically restarted, either true or false.
  • SSD_RECOVERY_TIMEOUT: the number of hours to spend recovering Local SSD disks attached to the instance. Valid values are from 0 to 168, in increments of 1 hour.

Update the host error detection timeout

To update the maximum amount of time Compute Engine waits to restart or terminate an unresponsive VM, you must use the beta instances.setScheduling method because this feature is in Preview.

Add the hostErrorTimeoutSeconds property to the request body, where NUMBER_OF_SECONDS is the number of seconds that Compute Engine waits before restarting or terminating an unresponsive instance. Valid values are from 90 to 330 (5.5 minutes), in 30-second increments.

  POST https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME/setScheduling

  {
    ...
    "hostErrorTimeoutSeconds": NUMBER_OF_SECONDS
  }
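
The choice between the v1 and beta endpoints can be made from the fields you plan to send. The following Python sketch assumes a hypothetical helper, `build_set_scheduling_request`, that selects the beta path only when hostErrorTimeoutSeconds is present; the project, zone, and instance values are examples.

```python
def build_set_scheduling_request(project_id, zone, instance_name, scheduling):
    """Return (url, body) for setScheduling, using beta for the Preview field."""
    # hostErrorTimeoutSeconds is in Preview, so it requires the beta API.
    version = "beta" if "hostErrorTimeoutSeconds" in scheduling else "v1"
    url = (f"https://compute.googleapis.com/compute/{version}/"
           f"projects/{project_id}/zones/{zone}/"
           f"instances/{instance_name}/setScheduling")
    # Copy the fields so that the caller's dictionary isn't modified later.
    return url, dict(scheduling)


url, body = build_set_scheduling_request(
    "example-project", "us-central1-a", "example-vm",
    {"onHostMaintenance": "MIGRATE", "hostErrorTimeoutSeconds": 120})
print("POST", url)
print(body)
```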

View host maintenance policy settings for an instance

Console

  1. In the Google Cloud console, go to the VM instances page.

    Go to VM instances

  2. Click the Name of the instance for which you want to view settings. The instance details page opens.

  3. Go to the Management section. The Availability policies subsection shows your current settings for the following:

    • On host maintenance
    • Automatic restart
    • Host error timeout

gcloud

View the host maintenance option settings for an instance with the gcloud compute instances describe command.

To view the current value of the hostErrorTimeoutSeconds setting (Preview), use the gcloud beta compute instances describe command.

  gcloud compute instances describe INSTANCE_NAME \
      --zone=ZONE --format="yaml(scheduling)"

Replace the following:

  • INSTANCE_NAME: the name of the instance.
  • ZONE: the zone where the instance is located.

The output includes the current settings for the host maintenance policy, for example:

scheduling:
  automaticRestart: true
  hostErrorTimeoutSeconds: 120
  localSsdRecoveryTimeout:
    nanos: 0
    seconds: '10800'
  onHostMaintenance: MIGRATE
  preemptible: false
  provisioningModel: STANDARD

REST

To view the host maintenance settings for an instance, use the instances.get method:

  GET https://compute.googleapis.com/compute/v1/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME

Replace the following:

  • PROJECT_ID: the project where the instance is located.
  • ZONE: the zone where the instance is located.
  • INSTANCE_NAME: the instance name.

In the output, the scheduling object contains the settings for the instance's host maintenance policy, for example:

{
...
  "scheduling": {
    "onHostMaintenance": "MIGRATE",
    "automaticRestart": true,
    "preemptible": false,
    "provisioningModel": "STANDARD",
    "localSsdRecoveryTimeout": {
      "seconds": "10800",
      "nanos": 0
    }
  },
...
}
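
The example output reports localSsdRecoveryTimeout in seconds, while the setting itself is configured in hours. The following Python sketch converts between the two. It uses a trimmed copy of the example output with the elided fields removed, and `local_ssd_recovery_hours` is a hypothetical helper name.

```python
import json

# Trimmed copy of the example response; only the scheduling object is kept.
response_text = """
{
  "scheduling": {
    "onHostMaintenance": "MIGRATE",
    "automaticRestart": true,
    "preemptible": false,
    "provisioningModel": "STANDARD",
    "localSsdRecoveryTimeout": {
      "seconds": "10800",
      "nanos": 0
    }
  }
}
"""


def local_ssd_recovery_hours(instance):
    """Return the recovery timeout in whole hours, or None if it is unset."""
    timeout = instance.get("scheduling", {}).get("localSsdRecoveryTimeout")
    if timeout is None:
        return None
    # The example shows seconds as a string, so convert before dividing.
    return int(timeout["seconds"]) // 3600


instance = json.loads(response_text)
print(local_ssd_recovery_hours(instance))  # prints 3
```

In this example, 10800 seconds corresponds to a recovery timeout of 3 hours.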

View the host error timeout settings

View the current hostErrorTimeoutSeconds setting by constructing a GET request using the beta instances.get method.

 GET https://compute.googleapis.com/compute/beta/projects/PROJECT_ID/zones/ZONE/instances/INSTANCE_NAME

Replace the following:

  • PROJECT_ID: the project for the instance.
  • ZONE: the zone where the instance is located.
  • INSTANCE_NAME: the instance name.

In the output, the scheduling object includes the instance's host error detection timeout, for example:

{
...
  "scheduling": {
    "onHostMaintenance": "MIGRATE",
    "automaticRestart": true,
    "preemptible": false,
    "provisioningModel": "STANDARD",
    "hostErrorTimeoutSeconds": 120,
    "localSsdRecoveryTimeout": {
      "seconds": "10800",
      "nanos": 0
    }
  },
...
}

What's next