Troubleshooting VM shutdowns and reboots

This document describes the common causes of unexpected shutdowns and reboots of virtual machine (VM) instances and how to prevent them.

VM shutdowns and reboots can be caused by system events or admin activities. System event shutdowns and reboots are generated by Google systems or your VM's operating system. Admin activity shutdowns and reboots are generated by a user- or service account-generated API call. All shutdowns and reboots are logged, except for reboots that are initiated from within the VM.

Before you begin

If you haven't already, then set up authentication. Authentication is the process by which your identity is verified for access to Google Cloud services and APIs. To run code or samples from a local development environment, you can authenticate to Compute Engine by selecting one of the following options:

Select the tab for how you plan to use the samples on this page:
Console

When you use the Google Cloud console to access Google Cloud services and APIs, you don't need to set up authentication.
gcloud
1. Install the Google Cloud CLI, then initialize it by running the following command:
  gcloud init
  Note: If you installed the gcloud CLI previously, make sure you have the latest version by running gcloud components update.
2. Set a default region and zone.

Diagnosing VM shutdowns and reboots

To diagnose the cause of a VM's spontaneous shutdown or reboot, you must query your VM's logs. To quickly identify the cause of future VM shutdowns or reboots, build a dashboard that contains the logs. After you query the logs, review the method and principalEmail fields to determine what event and which user or service initiated the shutdown or reboot.

Querying Cloud Audit Logs

Query Cloud Audit Logs to display a list of system events and admin activities that might have caused the shutdown or reboot.

Permissions required for this task

To perform this task, you must have the following permissions:

The Logging/Logs Viewer role or the Project/Viewer role.

Console

In the Google Cloud console, go to the Logs Explorer page.

Go to Logs Explorer

Note: You might need to click Upgrade to use Logs Explorer instead of the Legacy Logs Viewer.

In the Query field, enter the following query:

resource.type="gce_instance"
"VM_NAME"
logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")

Replace VM_NAME with the name of the VM that shut down or rebooted.

If the event you're looking for happened more than an hour ago, set a custom time frame by clicking the clock symbol and entering a custom range.
Click Run query. The results are displayed in the Query results section.

Tip: To increase the size of the Query results section, click Enter fullscreen Query results.
Click the expander arrow next to each result to show detailed information.
See Reviewing Cloud Audit Logs to learn more about the method and principalEmail fields that are associated with shutdowns and reboots, and what you can do to prevent them.

gcloud

View Cloud Audit Logs using the gcloud logging read command:
```
gcloud logging read --freshness=TIME 'resource.type="gce_instance" "VM_NAME" logName:("logs/cloudaudit.googleapis.com%2Fsystem_event" OR "logs/cloudaudit.googleapis.com%2Factivity")'
```
Replace the following:
- TIME: the amount of time you want to query. For example, 1h queries log entries in the past hour. For information about date and time formats, see gcloud topic datetimes.
- VM_NAME: the name of the VM that shutdown or rebooted.
The results display.
See Reviewing Cloud Audit Logs to learn more about the method and principalEmail fields that are associated with shutdowns and reboots, and what you can do to prevent them.

Reviewing Cloud Audit Logs

Review the method and principalEmail fields of the Cloud Audit Logs to determine why your VM was shut down or rebooted.

Review the method fields of the Cloud Audit Logs and compare them with the methods listed in the following table.

Method	Shutdown type	Description
`compute.instances.repair.recreateInstance`	System event	If your VM belongs to a managed instance group (MIG), the MIG recreates the VM if the VM's state changes from `RUNNING` and the MIG did not initiate the change in state. Changes of instance state that are not initiated by the MIG include: Hardware failures. Terminating a preemptible instance. Infrastructure maintenance events when the VM instance is not set to live migrate. Deleting a MIG instance by using one of the following methods: The `instances.delete` API method The `gcloud compute instances delete` command Note: To ensure that your configuration changes aren't reverted by the MIG, it's important to use the group's methods. For example, to delete a managed instance, use one of the following methods: For a zonal MIG: `instanceGroupManagers.deleteInstances` For a regional MIG: `regionInstanceGroupManagers.deleteInstances` In gcloud: `gcloud compute instance-groups managed delete-instances`
`compute.instances.hostError`	System event	A host error (`compute.instances.hostError`) means that there was a hardware or software issue on the physical machine or the data center infrastructure hosting your compute instance that caused your instance to crash. A host error involving a total hardware failure or other hardware issues might prevent the live migration of your instance. If your instance is set to automatically restart, which is the default setting, Compute Engine restarts your instance, typically within three minutes from the time the error was detected. Depending on the issue, the restart might take up to 5.5 minutes. Occasionally, a compute instance might become unresponsive before a host error is signaled. You can reduce the amount of time Compute Engine waits to restart or terminate the instance by setting the host error recovery timeout (Preview). For more information, see Set availability policies. Physical hardware and software failures can happen occasionally but are rare occurrences. To protect your applications and services from these potentially disruptive system events, review the following resources: Designing robust systems Patterns for scalable and resilient apps Creating managed instance groups Google also offers managed services such as App Engine and the App Engine flexible environment.
`compute.instances.automaticRestart`	System event	This event occurs after a `hostError` event or a `terminateOnHostMaintenance` event if your VM's `automaticRestart` host maintenance policy is set to `true`. In the logs, a `hostError` or a `terminateOnHostMaintenance` log entry precedes this log. If you want to change your VM's host maintenance policy, see Updating options for an instance.
`compute.instances.guestTerminate`	System event	Your VM's operating system initiated the shutdown.
`compute.instances.terminateOnHostMaintenance`	System event	If you set your VM's `onHostMaintenance` host maintenance policy to `TERMINATE`, Compute Engine stops your VM when there is a maintenance event where Google must move your VM to another host. If you want to change your VM's `onHostMaintenance` policy, see Updating options for an instance.
`compute.instances.preempted`	System event	Compute Engine preempted your Spot VM or legacy preemptible VM: When Compute Engine preempts a Spot VM, Compute Engine either stops or deletes the Spot VM based on its termination action. Spot VMs do not have a maximum runtime. When Compute Engine preempts a preemptible VM, Compute Engine stops the VM after a maximum runtime of 24 hours. To avoid these limitations, use Spot VMs instead. Spot VMs and preemptible VMs are excess Compute Engine capacity, so Compute Engine might preempt them any time that capacity is needed elsewhere. You can help mitigate the effects of preemption by following the best practices. Alternatively, if you require VMs with user-controlled runtimes, create standard VMs instead.
`compute.instances.stop`	Admin activity	A user or service account stopped your VM. Continue to the next step to identify the user or service account that stopped your VM. For information about restarting your VM, see Restarting a stopped instance.
`compute.instances.delete`	Admin activity or system event	A user or service account deleted your VM, or the VM was configured to be automatically deleted. Important: Requests to delete your VM, which are indicated in Cloud Audit Logs by the `compute.instances.delete` method, might inconsistently override other requests for your VM that were made at a similar time. Even when those other requests were successful, the `compute.instances.delete` method might inconsistently prevent the methods from those other requests from appearing in the Cloud Audit Logs for your VM. Specifically, a log for the `compute.instances.delete` method might indicate any of the following requests for your VM: Requests from a user or service account to directly delete your VM are indicated only by a `compute.instances.delete` method from the user or service account. Requests that automatically delete your VM are indicated by a `compute.instances.delete` method from `system@google.com`, but the method that explains the cause of automatic deletion might or might not appear in Cloud Audit Logs. For example, if a Spot VM is configured to be automatically deleted during preemption and is preempted, you see a `compute.instances.delete` method from `system@google.com`, but you might or might not also see a `compute.instances.preempted` method. Requests to the VM that happened shortly before or after a `compute.instances.delete` method might or might not appear in Cloud Audit Logs. For example, if a VM is stopped due to host maintenance shortly before the VM is deleted, you see a `compute.instances.delete` method, but you might or might not also see a `compute.instances.terminateOnHostMaintenance` method. Continue to the next step to identify the user or service account that deleted your VM. For information about creating a new VM, see Creating and starting a VM.
`compute.instances.insert`	Admin activity	A user or service account created your VM. Continue to the next step to identify the user or service account that created your VM. For information about creating a new VM, see Creating and starting a VM.
`compute.instances.reset`	Admin activity	A user or service account reset your VM. Continue to the next step to identify the user or service account that stopped your VM.

Review the principalEmail fields of the Cloud Audit Logs to identify the user or service that initiated the shutdown or reboot. The following table include common Google managed services that initiate shutdowns or reboots.

Email Description

system@google.com A system event caused the shutdown or reboot.

Email	Description
`system@google.com`	A system event caused the shutdown or reboot.
`project-number@cloudservices.gserviceaccount.com`	A service agent initiated the shutdown. To determine which project the service initiated the shutdown from, review the service agent's `project-number`. To determine which Google service made the request, review the `protoPayload.requestMetadata.callerSuppliedUserAgent` field.

project-number@cloudservices.gserviceaccount.com

A service agent initiated the shutdown.

To determine which project the service initiated the shutdown from, review the service agent's project-number.

To determine which Google service made the request, review the protoPayload.requestMetadata.callerSuppliedUserAgent field.

If a user triggered the shutdown or reboot, their email address appears in the principalEmail field. For example, cloudysanfrancisco@gmail.com.

Administrators can prevent users from changing the state of project VMs by changing Identity and Access Management permissions on user accounts. For more information, see Granting, changing, and revoking access to resources.

Monitor VM lifecycle events

You can monitor VM lifecycle events (including shutdowns, reboots, and host errors) by building a Cloud Monitoring dashboard.

This dashboard lets you to visualize system events and admin activities that are described in further detail in the Reviewing Audit Logs section of this document.

VM Lifecycle Dashboard: Stop and Start events Figure 1. An example dashboard showing the availability of an instance and its lifecycle events such as a stopped instance.

Create log-based metric

To capture VM lifecycle events, create a user-defined log-based metric. This metric uses Audit Logs to keep count of the number of times a particular VM lifecycle event has occurred.

To get the permissions that you need to create the metric, ask your administrator to grant you the Logs Writer (roles/logging.logWriter) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Create a user-defined log-based metric by doing the following:

In the Google Cloud console, go to the Log-based Metrics page.

Go to Log-based Metrics
Click Create Metric.

In the Metric Type section, do the following:

Select Counter.
Leave Distribution at the default setting of unselected.

In the Details section, enter the following information:

Log-based metric name: vm-lifecycle-events. You must use this exact name for the dashboard to work correctly.
Description: Optional — Enter a description for this metric.
Units: 1

In the Filter selection section, specify the following:

From the Select project or log bucket menu, select: Project logs

In the Build filter enter:

resource.type = "gce_instance" AND
log_id("cloudaudit.googleapis.com/activity") OR
log_id("cloudaudit.googleapis.com/system_event")
operation.first="true"

In the Labels section, click Add label.

Specify the following:

Label name: method
Label type: STRING
Field name: protoPayload.methodName

Regular expression:

(recreateInstance|hostError|automaticRestart|guestTerminate|terminateOnHostMaintenance|preempted|insert|stop|delete|reset|start)

Click Done
Click Create metric.

Use the dashboard

No data appears on the dashboard until a VM experiences a system event or an admin activity. To test that the dashboard works, perform an admin activity, such as a stop and start operation:

Perform a stop and start operation on any existing VM, or create a new VM for testing purposes.

To get the permissions that you need to use the dashboard, ask your administrator to grant you the Monitoring Dashboard Viewer (roles/monitoring.dashboardViewer) IAM role on the project. For more information about granting roles, see Manage access to projects, folders, and organizations.

You might also be able to get the required permissions through custom roles or other predefined roles.

Open Dashboards in the Google Cloud console.

Go to Dashboards
From the Dashboard List tab open the GCE VM Lifecycle Events Monitoring dashboard.
Select the VM from the Name drop-down menu.
Narrow the time series to a relevant timeframe.

For more ways to filter the dashboard see Add a temporary filter.

The dashboard contains two charts that display a timeline of system events and admin activities that occur on a VM:

The VM Lifecycle Timeline chart displays the following:
- The compute.googleapis.com/instance/uptime metric that indicates whether the VM was running at a given point in time, where 1 is up and 0 is down. Note this metric reflects availability as a result of user activity and system events, and is not an indication of Compute Engine SLA.
- The vm-lifecycle-events log-based metric to count the number of lifecycle actions, such as stop or start that performed were performed against the VM at a given point in time
The Events chart shows the same vm-lifecycle-events log-based metric but in a magnified view for easier readability. Note that although the X-axes are aligned, the colors are not synchronized between the two charts.

Investigating mass VM shutdown across projects

Compute Engine might shut down multiple VMs that are connected to a Shared VPC host project, if the Shared VPC host project's billing is inactive or disabled.

To determine if your VMs have been shut down by a mass shutdown request, look for stop operations initiated by cloud-cluster-manager@prod.google.com.

Starting an affected instance returns an error similar to the following:

Starting instance(s) INSTANCE_NAME...failed.
ERROR: (gcloud.compute.instances.start) The default network interface [nic0] is frozen.

To resolve this issue, do the following:

Identify the Shared VPC used by the VMs, by using the gcloud compute instances describe command:

gcloud compute instances describe VM_NAME \
   --format="flattened(networkInterfaces[].network)"

The output is similar to the following:

networkInterfaces[0].network: https://www.googleapis.com/compute/v1/projects/SHARED_VPC_PROJECT/global/networks/FROZEN_NETWORK

Verify in the Shared VPC's host project if billing has been disabled.

resource.type="project"
protoPayload.request.@type="type.googleapis.com/google.internal.cloudbilling.billingaccount.v1.DisableResourceBillingRequest"
protoPayload.response.resourceBillingInfo.billingAccountAssignmentType="DISABLED"

If applicable, Enable billing on the host project.

To help prevent this issue from recurring, read Secure the link between a project and its billing account.

Investigating VM termination issues with gcpdiag

gcpdiag is an open source tool. It is not an officially supported Google Cloud product. You can use the gcpdiag tool to help you identify and fix Google Cloud project issues. For more information, see the gcpdiag project on GitHub.

This gcpdiag runbook investigates VM termination issues, examining the following areas:

System event-triggered shutdowns and reboot: Identifies terminations initiated by internal Google Cloud systems due to system maintenance events, normal hardware failures, resource constraints.
System admin activities-triggered shutdowns/reboots: Investigates terminations caused by direct actions, such as API calls made by users or service accounts. These actions may include manual shutdowns, restarts, or automated processes impacting VM states.
Unofficial RCA text generation: Provides a detailed Root Cause Analysis text, outlining the identified cause of termination, the involved systems or activities, and recommendations to prevent future occurrences where applicable.

Google Cloud console

Complete and then copy the following command.

gcpdiag runbook gce/vm-termination \
    --parameter project_id=PROJECT_ID \
    --parameter name=VM_NAME \
    --parameter zone=ZONE

Open the Google Cloud console and activate Cloud Shell.

Open Cloud console

Paste the copied command.
Run the gcpdiag command, which downloads the gcpdiag docker image, and then performs diagnostic checks. If applicable, follow the output instructions to fix failed checks.

Docker

You can run gcpdiag using a wrapper that starts gcpdiag in a Docker container. Docker or Podman must be installed.

Copy and run the following command on your local workstation.

curl https://gcpdiag.dev/gcpdiag.sh >gcpdiag && chmod +x gcpdiag

Execute the gcpdiag command.

./gcpdiag runbook gce/vm-termination \
    --parameter project_id=PROJECT_ID \
    --parameter name=VM_NAME \
    --parameter zone=ZONE

View available parameters for this runbook.

Replace the following:

PROJECT_ID: The ID of the project containing the resource
VM_NAME: The name of the target VM within your project.
ZONE: The zone in which your target VM is located.

Useful flags:

--universe-domain: If applicable, the Trusted Partner Sovereign Cloud domain hosting the resource
--parameter or -p: Runbook parameters

For a list and description of all gcpdiag tool flags, see the gcpdiag usage instructions.