A Cloud Storage import topic lets you continuously ingest data from Cloud Storage into Pub/Sub. Then you can stream the data into any of the destinations that Pub/Sub supports. Pub/Sub automatically detects new objects added to the Cloud Storage bucket and ingests them.
Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. Buckets can also contain managed folders, which you use to provide expanded access to groups of objects with a shared name prefix.
For more information about Cloud Storage, see the Cloud Storage documentation.
For more information about import topics, see About import topics.
Before you begin
A Cloud Storage bucket must already exist before you create a Cloud Storage import topic. If you are using the console to create the import topic, the workflow lets you create a Cloud Storage bucket. For other configuration methods, see Create buckets.
If applicable, ensure that the message storage policy of the Pub/Sub topic overlaps with the region where your Cloud Storage bucket is located. For more information, see Message storage policy is compliant with the bucket location.
Some Google Cloud services have Google Cloud-managed service accounts that let the services access your resources. These service accounts are known as service agents. Pub/Sub creates and maintains a service account for each project in the format service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com. Configure the required roles and permissions on the Pub/Sub service account to manage Cloud Storage import topics, including the following:
- Grant the Pub/Sub publisher role (roles/pubsub.publisher) to the Pub/Sub service account. This service account publishes to the import topic. To grant this role, you require a user account with the Pub/Sub Admin role (roles/pubsub.admin). For more information, see Add the Pub/Sub publisher role to the Pub/Sub service account.
- Grant Cloud Storage permissions to the Pub/Sub service account. To grant these permissions, you require a user account with the Storage Admin role (roles/storage.admin). For more information, see Assign Cloud Storage roles to the Pub/Sub service account.
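The service account name described above is derived mechanically from the project number. The following is a minimal sketch of that derivation; the helper names pubsub_service_agent and iam_member are hypothetical and not part of any Google Cloud library.

```python
import re


def pubsub_service_agent(project_number: str) -> str:
    """Return the Pub/Sub service account email for a project number."""
    # The format uses the numeric project number, not the project ID.
    if not re.fullmatch(r"\d+", project_number):
        raise ValueError(f"expected a numeric project number, got {project_number!r}")
    return f"service-{project_number}@gcp-sa-pubsub.iam.gserviceaccount.com"


def iam_member(project_number: str) -> str:
    """Return the principal string as it appears in IAM role bindings."""
    return f"serviceAccount:{pubsub_service_agent(project_number)}"


print(iam_member("112233445566"))
```

The "serviceAccount:" prefix is the form used for the --member flag and in IAM policy bindings.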
Required roles and permissions to manage Cloud Storage import topics
To get the permissions that you need to create and manage a Cloud Storage import topic, ask your administrator to grant you the Pub/Sub Editor (roles/pubsub.editor) IAM role on your topic or project.
For more information about granting roles, see Manage access to projects, folders, and organizations.
This predefined role contains the permissions required to create and manage a Cloud Storage import topic. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to create and manage a Cloud Storage import topic:
- Create an import topic: pubsub.topics.create
- Delete an import topic: pubsub.topics.delete
- Get an import topic: pubsub.topics.get
- List import topics: pubsub.topics.list
- Publish to an import topic: pubsub.topics.publish
- Update an import topic: pubsub.topics.update
- Get the IAM policy for an import topic: pubsub.topics.getIamPolicy
- Configure the IAM policy for an import topic: pubsub.topics.setIamPolicy
You might also be able to get these permissions with custom roles or other predefined roles.
You can configure access control at the project level and the individual resource level.
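If you already have a list of the permissions a principal holds, a small lookup can show which of the permissions above are missing for the operations you plan to run. This is only a sketch: the operation keys and the missing_permissions helper are hypothetical names, and how you obtain the held permissions is left open.

```python
# Operation -> permission, taken from the Required permissions list above.
REQUIRED_PERMISSIONS = {
    "create": "pubsub.topics.create",
    "delete": "pubsub.topics.delete",
    "get": "pubsub.topics.get",
    "list": "pubsub.topics.list",
    "publish": "pubsub.topics.publish",
    "update": "pubsub.topics.update",
    "get_iam_policy": "pubsub.topics.getIamPolicy",
    "set_iam_policy": "pubsub.topics.setIamPolicy",
}


def missing_permissions(held, operations=None):
    """Return the sorted permissions needed for `operations` that are not in `held`."""
    wanted = REQUIRED_PERMISSIONS.keys() if operations is None else operations
    return sorted(
        REQUIRED_PERMISSIONS[op] for op in wanted if REQUIRED_PERMISSIONS[op] not in held
    )


held = {"pubsub.topics.create", "pubsub.topics.get"}
print(missing_permissions(held, ["create", "update"]))
```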
Message storage policy is compliant with the bucket location
The message storage policy of the Pub/Sub topic must overlap with the regions where your Cloud Storage bucket is located. This policy dictates where Pub/Sub is allowed to store your message data.
- For buckets with location type as region: The policy must include that specific region. For example, if your bucket is in the us-central1 region, the message storage policy must also include us-central1.
- For buckets with location type as dual-region or multi-region: The policy must include at least one region within the dual-region or multi-region location. For example, if your bucket is in the US multi-region, the message storage policy could include us-central1, us-east1, or any other region within the US multi-region.
If the policy doesn't include the bucket's region, topic creation fails. For example, if your bucket is in europe-west1 and your message storage policy only includes asia-east1, you'll receive an error.
If the message storage policy includes only one region that overlaps with the bucket's location, multi-region redundancy might be compromised. This is because if that single region becomes unavailable, your data might not be accessible. To ensure full redundancy, it's recommended to include at least two regions within the message storage policy that are part of the bucket's multi-region or dual-region location.
For more information about bucket locations, see the Cloud Storage documentation.
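The overlap rules above can be expressed as a small set comparison. The sketch below is illustrative only: check_storage_policy is a hypothetical helper, and it assumes that you supply the set of regions that make up the bucket's location yourself (a single region for a regional bucket, several for a dual-region or multi-region bucket). It does not look up bucket locations or reproduce the service's own validation.

```python
def check_storage_policy(allowed_regions, bucket_regions):
    """Classify how a message storage policy overlaps with a bucket location.

    allowed_regions: regions listed in the topic's message storage policy.
    bucket_regions:  regions that make up the bucket's location (caller-supplied).
    """
    overlap = set(allowed_regions) & set(bucket_regions)
    if not overlap:
        # No shared region: topic creation is expected to fail.
        return "error"
    if len(set(bucket_regions)) > 1 and len(overlap) < 2:
        # Only one overlapping region for a dual-region or multi-region bucket.
        return "reduced-redundancy"
    return "ok"


print(check_storage_policy({"asia-east1"}, {"europe-west1"}))
print(check_storage_policy({"us-central1"}, {"us-central1", "us-east1"}))
```

A "reduced-redundancy" result corresponds to the case where adding a second overlapping region to the policy is recommended.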
Add the Pub/Sub publisher role to the Pub/Sub service account
You must assign the Pub/Sub publisher role to the Pub/Sub service account so that Pub/Sub is able to publish to the Cloud Storage import topic.
To enable publishing to all topics in a project, see Enable publishing to all Cloud Storage import topics. Use this method if you have not created any Cloud Storage import topics.
To enable publishing to a specific topic, see Enable publishing to a single Cloud Storage import topic. Use this method only if the Cloud Storage import topic already exists.
Enable publishing to all Cloud Storage import topics
Choose this option when you don't have a Cloud Storage import topic available in your project.
In the Google Cloud console, go to the IAM page.
Select the Include Google-provided role grants checkbox.
Look for the Pub/Sub service account that has the format:
service-{PROJECT_NUMBER}@gcp-sa-pubsub.iam.gserviceaccount.com
For this service account, click the Edit Principal button.
If required, click Add another role.
Search for and select the Pub/Sub publisher role (roles/pubsub.publisher).
Click Save.
Enable publishing to a single Cloud Storage import topic
If you want to grant Pub/Sub the permission to publish to a specific Cloud Storage import topic that already exists, follow these steps:
In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
Run the gcloud pubsub topics add-iam-policy-binding command:
gcloud pubsub topics add-iam-policy-binding TOPIC_ID \
    --member="serviceAccount:service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com" \
    --role="roles/pubsub.publisher"
Replace the following:
TOPIC_ID is the ID or name of the Cloud Storage import topic.
PROJECT_NUMBER is the project number. To view the project number, see Identifying projects.
Assign Cloud Storage roles to the Pub/Sub service account
To create a Cloud Storage import topic, the Pub/Sub service account must have permission to read from the specific Cloud Storage bucket. The following permissions are required:
storage.objects.list
storage.objects.get
storage.buckets.get
To assign these permissions to the Pub/Sub service account, choose one of the following procedures:
- Grant permissions at the bucket level. On the specific Cloud Storage bucket, grant the Storage Legacy Object Reader (roles/storage.legacyObjectReader) role and the Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) role to the Pub/Sub service account.
- If you must grant roles at the project level, you might instead grant the Storage Admin (roles/storage.admin) role on the project containing the Cloud Storage bucket. Grant this role to the Pub/Sub service account.
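In IAM policy terms, the bucket-level option amounts to adding the Pub/Sub service account as a member of two role bindings. The following sketch edits a policy represented as a plain dictionary in the standard IAM shape (a "bindings" list of role and members entries). The function name grant_bucket_read_roles is hypothetical, conditional bindings are ignored, and fetching or writing the policy back to the bucket is outside the scope of the sketch.

```python
BUCKET_LEVEL_ROLES = (
    "roles/storage.legacyObjectReader",
    "roles/storage.legacyBucketReader",
)


def grant_bucket_read_roles(policy: dict, project_number: str) -> dict:
    """Add the Pub/Sub service account to both bucket-level reader roles."""
    member = (
        f"serviceAccount:service-{project_number}"
        "@gcp-sa-pubsub.iam.gserviceaccount.com"
    )
    bindings = policy.setdefault("bindings", [])
    for role in BUCKET_LEVEL_ROLES:
        # Reuse an existing binding for the role, or create a new one.
        binding = next((b for b in bindings if b.get("role") == role), None)
        if binding is None:
            binding = {"role": role, "members": []}
            bindings.append(binding)
        if member not in binding["members"]:
            binding["members"].append(member)
    return policy


policy = grant_bucket_read_roles({"bindings": []}, "112233445566")
print([b["role"] for b in policy["bindings"]])
```

Running the function twice leaves the policy unchanged, which makes it safe to call from a setup script that may be re-run.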
Bucket permissions
Perform the following steps to grant the Storage Legacy Object Reader (roles/storage.legacyObjectReader) role and the Storage Legacy Bucket Reader (roles/storage.legacyBucketReader) role to the Pub/Sub service account at the bucket level:
In the Google Cloud console, go to the Cloud Storage page.
Click the Cloud Storage bucket whose objects you want to ingest into the Cloud Storage import topic.
The Bucket details page opens.
In the Bucket details page, click the Permissions tab.
In the Permissions > View by Principals tab, click Grant access.
The Grant access page opens.
In the Add Principals section, enter the name of your Pub/Sub service account.
The format of the service account is service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com. For example, for a project with PROJECT_NUMBER=112233445566, the service account is service-112233445566@gcp-sa-pubsub.iam.gserviceaccount.com.
In the Assign roles > Select a role drop-down, enter Object Reader and select the Storage Legacy Object Reader role.
Click Add another role.
In the Select a role drop-down, enter Bucket Reader, and select the Storage Legacy Bucket Reader role.
Click Save.
Project permissions
Perform the following steps to grant the Storage Admin (roles/storage.admin) role at the project level:
In the Google Cloud console, go to the IAM page.
In the Permissions > View by Principals tab, click Grant access.
The Grant access page opens.
In the Add Principals section, enter the name of your Pub/Sub service account.
The format of the service account is service-PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com. For example, for a project with PROJECT_NUMBER=112233445566, the service account is service-112233445566@gcp-sa-pubsub.iam.gserviceaccount.com.
In the Assign roles > Select a role drop-down, enter Storage Admin and select the Storage Admin role.
Click Save.
For more information about Cloud Storage IAM, see Cloud Storage Identity and Access Management.
Properties of Cloud Storage import topics
For more information about the common properties across all topics, see Properties of a topic.
Bucket name
This is the name of the Cloud Storage bucket from which Pub/Sub reads the data that is published to a Cloud Storage import topic.
Input format
When you create a Cloud Storage import topic, you can specify the format of the objects to be ingested as Text, Avro, or Pub/Sub Avro.
Text. Objects are assumed to hold data with plain text. This input format attempts to ingest all objects in the bucket as long as the object meets the minimum object creation time and matches the glob pattern criteria.
Delimiter. You can also specify a delimiter by which objects are split into messages. If unset, this defaults to the newline character (\n). The delimiter must be a single character.
Avro. Objects are in the Apache Avro binary format. Any object that is not in a valid Apache Avro format is not ingested. Here are the limitations regarding Avro:
- Avro versions 1.1.0 and 1.2.0 are not supported.
- The maximum size of an Avro block is 16 MB.
Pub/Sub Avro. Objects are in the Apache Avro binary format with a schema matching that of an object written to Cloud Storage using a Pub/Sub Cloud Storage subscription with the Avro file format. Here are some important guidelines for Pub/Sub Avro:
- The data field of the Avro record is used to populate the data field of the generated Pub/Sub message.
- If the write_metadata option is specified for the Cloud Storage subscription, any values in the attributes field are populated as the attributes of the generated Pub/Sub message.
- If an ordering key is specified in the original message written to Cloud Storage, this field is populated as an attribute with the name original_message_ordering_key in the generated Pub/Sub message.
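The Pub/Sub Avro guidelines above describe a field-by-field mapping from a stored record to a generated message. The sketch below illustrates that mapping on an already-decoded record represented as a dictionary. It is a mental model rather than the service's implementation: the function name is hypothetical, and the record key ordering_key is an assumption about how the original ordering key is stored.

```python
def avro_record_to_message(record: dict) -> dict:
    """Map a decoded Pub/Sub Avro record to the fields of the generated message."""
    message = {
        # The record's data field becomes the message data.
        "data": record["data"],
        # Attributes are present only if write_metadata was set when writing.
        "attributes": dict(record.get("attributes") or {}),
    }
    # Assumed key name for the original ordering key in the stored record.
    ordering_key = record.get("ordering_key")
    if ordering_key:
        message["attributes"]["original_message_ordering_key"] = ordering_key
    return message


record = {"data": b"hello", "attributes": {"source": "sensor-1"}, "ordering_key": "k1"}
print(avro_record_to_message(record))
```

Note that the ordering key is carried over as an attribute on the generated message, as the last guideline describes.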
Minimum object creation time
You can optionally specify a minimum object creation time when creating a Cloud Storage import topic. Only objects that were created at or after this timestamp are ingested. This timestamp must be provided in a format like YYYY-MM-DDThh:mm:ssZ. Any date, past or future, from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59Z inclusive, is valid.
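To check a timestamp before passing it to Pub/Sub, you can parse it locally. The following sketch uses only the Python standard library. The helper names are hypothetical, and the comparison mirrors the "at or after" rule described above rather than the service's actual implementation.

```python
from datetime import datetime, timezone

# YYYY-MM-DDThh:mm:ssZ, interpreted as UTC.
TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def parse_min_create_time(value: str) -> datetime:
    """Parse a minimum object creation time; raises ValueError if malformed."""
    return datetime.strptime(value, TIMESTAMP_FORMAT).replace(tzinfo=timezone.utc)


def is_ingestable(object_created: datetime, minimum: datetime) -> bool:
    """Objects created at or after the minimum qualify for ingestion."""
    return object_created >= minimum


minimum = parse_min_create_time("2024-10-14T08:30:30Z")
created = parse_min_create_time("2024-10-15T00:00:00Z")
print(is_ingestable(created, minimum))
```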
Match glob pattern
You can optionally specify a match glob pattern when creating a Cloud Storage import topic. Only objects with names that match this pattern are ingested. For example, to ingest all objects with suffix .txt, you can specify the glob pattern as **.txt.
For information about supported syntax for glob patterns, see the Cloud Storage documentation.
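To preview which object names a pattern such as **.txt would select, you can approximate the matching locally. The sketch below handles only a subset of the glob syntax, namely *, **, and ?, and it assumes that * and ? do not match the / character while ** does. Character classes and brace alternatives are not handled. The function names are hypothetical, and the Cloud Storage documentation remains the authority on the exact semantics.

```python
import re


def glob_to_regex(pattern: str) -> re.Pattern:
    """Translate a simplified match glob into a compiled regular expression."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")      # ** may cross "/" boundaries
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")   # * stays within one path segment
            i += 1
        elif pattern[i] == "?":
            parts.append("[^/]")    # assumed: one character other than "/"
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def matches(pattern: str, object_name: str) -> bool:
    """Return True if the whole object name matches the pattern."""
    return glob_to_regex(pattern).fullmatch(object_name) is not None


for name in ["a.txt", "logs/2024/a.txt", "a.csv"]:
    print(name, matches("**.txt", name), matches("*.txt", name))
```

The difference between * and ** matters for objects with / in their names: under these assumptions, *.txt selects only names without a / separator, while **.txt also selects names such as logs/2024/a.txt.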
Create a Cloud Storage import topic
Ensure that you have completed the following procedures:
Create a Cloud Storage bucket. If you are using the console to create the import topic, the workflow lets you create a Cloud Storage bucket as part of the workflow.
Add the Pub/Sub publisher role to the Pub/Sub service account for all topics.
Grant Cloud Storage permissions to the Pub/Sub service account.
Creating the topic and subscription separately, even if done in rapid succession, can lead to data loss. There's a short window where the topic exists without a subscription. If any data is sent to the topic during this time, it is lost. By creating the topic first, creating the subscription, and then converting the topic to an import topic, you guarantee that no messages are missed during the import process.
To create a Cloud Storage import topic, follow these steps:
Console
- In the Google Cloud console, go to the Topics page.
- Click Create topic.
The topic details page opens.
- In the Topic ID field, enter an ID for your Cloud Storage import topic.
For more information about naming topics, see the naming guidelines.
- Select Add a default subscription.
- Select Enable ingestion.
- For ingestion source, select Google Cloud Storage.
- For the Cloud Storage bucket, click Browse.
The Select bucket page opens. Select one of the following options:
  - Select an existing bucket from any appropriate project.
  - Click the create icon and follow the instructions on the screen to create a new bucket. After you create the bucket, select the bucket for the Cloud Storage import topic.
- When you specify the bucket, Pub/Sub checks for the appropriate permissions on the bucket for the Pub/Sub service account. If there are permissions issues, you see a message similar to the following:
Unable to verify if the Pub/Sub service agent has write permissions on this bucket. You may be lacking permissions to view or set permissions.
If you get permission issues, click Set permissions. For more information, see Grant Cloud Storage permissions to the Pub/Sub service account.
- For Object format, select Text, Avro, or Pub/Sub Avro.
If you select Text, you can optionally specify a Delimiter with which to split objects into messages.
For more information about these options, see Input format.
- Optional. You can specify a Minimum object creation time for your topic. If set, only objects created at or after the minimum object creation time are ingested.
For more information, see Minimum object creation time.
- You must specify a Glob pattern. To ingest all objects in the bucket, use ** as the glob pattern. If set, only objects that match the given pattern are ingested.
For more information, see Match glob pattern.
- Retain the other default settings.
- Click Create topic.
gcloud
- In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
- Run the gcloud pubsub topics create command:
gcloud pubsub topics create TOPIC_ID \
    --cloud-storage-ingestion-bucket=BUCKET_NAME \
    --cloud-storage-ingestion-input-format=INPUT_FORMAT \
    --cloud-storage-ingestion-text-delimiter=TEXT_DELIMITER \
    --cloud-storage-ingestion-minimum-object-create-time=MINIMUM_OBJECT_CREATE_TIME \
    --cloud-storage-ingestion-match-glob=MATCH_GLOB
In the command, only TOPIC_ID, the --cloud-storage-ingestion-bucket flag, and the --cloud-storage-ingestion-input-format flag are required. The remaining flags are optional and can be omitted.
Replace the following:
- TOPIC_ID: The name or ID of your topic.
- BUCKET_NAME: Specifies the name of an existing bucket. For example, prod_bucket. The bucket name must not include the project ID. To create a bucket, see Create buckets.
- INPUT_FORMAT: Specifies the format of the objects that are ingested. This can be text, avro, or pubsub_avro. For more information about these options, see Input format.
- TEXT_DELIMITER: Specifies the delimiter with which to split text objects into Pub/Sub messages. This should be a single character and should only be set when INPUT_FORMAT is text. It defaults to the newline character (\n).
When using gcloud CLI to specify the delimiter, pay close attention to the handling of special characters like newline \n. Use the format '\n' to ensure the delimiter is correctly interpreted. Simply using \n without quotes or escaping results in a delimiter of "n".
- MINIMUM_OBJECT_CREATE_TIME: Specifies the minimum time at which an object was created in order for it to be ingested. This should be in UTC in the format YYYY-MM-DDThh:mm:ssZ. For example, 2024-10-14T08:30:30Z.
Any date, past or future, from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59Z inclusive, is valid.
- MATCH_GLOB: Specifies the glob pattern to match in order for an object to be ingested. When you are using gcloud CLI, a match glob with * characters must have the * character formatted as escaped in the form \*\*.txt or the whole match glob must be in quotes "**.txt" or '**.txt'. For information about supported syntax for glob patterns, see the Cloud Storage documentation.
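The quoting pitfalls for the delimiter and the glob pattern come from the shell, so one way to sidestep them is to build the argument list in a script. The following is a minimal sketch that assembles the gcloud pubsub topics create command from the flags shown above. The function name build_create_command and the topic name my-topic are hypothetical. The sketch only prints the command and does not run it.

```python
import shlex
from typing import List, Optional

INPUT_FORMATS = ("text", "avro", "pubsub_avro")


def build_create_command(
    topic_id: str,
    bucket: str,
    input_format: str,
    text_delimiter: Optional[str] = None,
    minimum_object_create_time: Optional[str] = None,
    match_glob: Optional[str] = None,
) -> List[str]:
    """Build the argument list for creating a Cloud Storage import topic."""
    if input_format not in INPUT_FORMATS:
        raise ValueError(f"input_format must be one of {INPUT_FORMATS}")
    if text_delimiter is not None and input_format != "text":
        raise ValueError("text_delimiter applies only to the text input format")
    args = [
        "gcloud", "pubsub", "topics", "create", topic_id,
        f"--cloud-storage-ingestion-bucket={bucket}",
        f"--cloud-storage-ingestion-input-format={input_format}",
    ]
    # The remaining flags are optional and are added only when set.
    if text_delimiter is not None:
        args.append(f"--cloud-storage-ingestion-text-delimiter={text_delimiter}")
    if minimum_object_create_time is not None:
        args.append(
            "--cloud-storage-ingestion-minimum-object-create-time="
            f"{minimum_object_create_time}"
        )
    if match_glob is not None:
        args.append(f"--cloud-storage-ingestion-match-glob={match_glob}")
    return args


# r"\n" is a backslash followed by "n", the same text the shell passes for '\n'.
args = build_create_command(
    "my-topic", "prod_bucket", "text", text_delimiter=r"\n", match_glob="**.txt"
)
print(shlex.join(args))  # shell-safe rendering, with quotes added where needed
```

Passing the list directly to subprocess.run avoids the shell entirely, so no escaping of \n or * is needed in that case. shlex.join is useful when you want a string to paste into a terminal.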
Go, Java, Node.js, Python, C++, Node.js (TypeScript)
Before trying a sample in any of these languages, follow the setup instructions for that language in the Pub/Sub quickstart using client libraries. For more information, see the Pub/Sub API reference documentation for that language.
To authenticate to Pub/Sub, set up Application Default Credentials. For more information, see Set up authentication for a local development environment.
If you run into issues, see Troubleshooting a Cloud Storage import topic.
Edit a Cloud Storage import topic
You can edit a Cloud Storage import topic to update its properties.
For example, to restart ingestion, you can change the bucket or update the minimum object creation time.
To edit a Cloud Storage import topic, perform the following steps:
Console
- In the Google Cloud console, go to the Topics page.
- Click the Cloud Storage import topic.
- In the topic details page, click Edit.
- Update the fields that you want to change.
- Click Update.
gcloud
- In the Google Cloud console, activate Cloud Shell.
At the bottom of the Google Cloud console, a Cloud Shell session starts and displays a command-line prompt. Cloud Shell is a shell environment with the Google Cloud CLI already installed and with values already set for your current project. It can take a few seconds for the session to initialize.
To avoid losing your settings for the import topic, make sure to include all of them every time you update the topic. If you leave something out, Pub/Sub resets the setting to its original default value.
- Run the gcloud pubsub topics update command with all the flags mentioned in the following sample:
gcloud pubsub topics update TOPIC_ID \
    --cloud-storage-ingestion-bucket=BUCKET_NAME \
    --cloud-storage-ingestion-input-format=INPUT_FORMAT \
    --cloud-storage-ingestion-text-delimiter=TEXT_DELIMITER \
    --cloud-storage-ingestion-minimum-object-create-time=MINIMUM_OBJECT_CREATE_TIME \
    --cloud-storage-ingestion-match-glob=MATCH_GLOB
Replace the following:
- TOPIC_ID: The topic ID or name. This field cannot be updated.
- BUCKET_NAME: Specifies the name of an existing bucket. For example, prod_bucket. The bucket name must not include the project ID. To create a bucket, see Create buckets.
- INPUT_FORMAT: Specifies the format of the objects that are ingested. This can be text, avro, or pubsub_avro. For more information about these options, see Input format.
- TEXT_DELIMITER: Specifies the delimiter with which to split text objects into Pub/Sub messages. This should be a single character and should only be set when INPUT_FORMAT is text. It defaults to the newline character (\n).
When using gcloud CLI to specify the delimiter, pay close attention to the handling of special characters like newline \n. Use the format '\n' to ensure the delimiter is correctly interpreted. Simply using \n without quotes or escaping results in a delimiter of "n".
- MINIMUM_OBJECT_CREATE_TIME: Specifies the minimum time at which an object was created in order for it to be ingested. This should be in UTC in the format YYYY-MM-DDThh:mm:ssZ. For example, 2024-10-14T08:30:30Z.
Any date, past or future, from 0001-01-01T00:00:00Z to 9999-12-31T23:59:59Z inclusive, is valid.
- MATCH_GLOB: Specifies the glob pattern to match in order for an object to be ingested. When you are using gcloud CLI, a match glob with * characters must have the * character formatted as escaped in the form \*\*.txt or the whole match glob must be in quotes "**.txt" or '**.txt'. For information about supported syntax for glob patterns, see the Cloud Storage documentation.
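Because an update must carry every ingestion setting, not only the one that changes, it can help to merge the change into a full copy of the current settings before building the flags. The sketch below shows that merge. The dictionary keys and the function name full_update_flags are hypothetical, and the current settings are assumed to be known to the caller, for example from your own configuration.

```python
# Setting name -> gcloud flag, in the order used by the sample command above.
FLAG_NAMES = {
    "bucket": "cloud-storage-ingestion-bucket",
    "input_format": "cloud-storage-ingestion-input-format",
    "text_delimiter": "cloud-storage-ingestion-text-delimiter",
    "minimum_object_create_time": "cloud-storage-ingestion-minimum-object-create-time",
    "match_glob": "cloud-storage-ingestion-match-glob",
}


def full_update_flags(current: dict, changes: dict) -> list:
    """Merge `changes` over `current` and return flags for every setting that is set."""
    unknown = set(changes) - set(FLAG_NAMES)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    settings = {**current, **changes}
    return [
        f"--{flag}={settings[key]}"
        for key, flag in FLAG_NAMES.items()
        if settings.get(key) is not None
    ]


current = {
    "bucket": "prod_bucket",
    "input_format": "text",
    "text_delimiter": r"\n",
    "match_glob": "**.txt",
}
# Restart ingestion from a new point in time while keeping every other setting.
flags = full_update_flags(current, {"minimum_object_create_time": "2024-10-14T08:30:30Z"})
print(["gcloud", "pubsub", "topics", "update", "my-topic", *flags])
```

Keeping the current settings in one place and always deriving the full flag list from them reduces the chance of silently resetting a setting by omission.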
Quotas and limits for Cloud Storage import topics
The publisher throughput for import topics is bound by the publish quota of the topic. For more information, see Pub/Sub quotas and limits.
What's next
Enable platform logs for a Cloud Storage import topic.
Choose the type of subscription for your topic.
Learn how to publish a message to a topic.