Orchestrate workloads
BigQuery tasks are usually part of larger workloads, with external tasks that trigger BigQuery operations and are in turn triggered by them. Workload orchestration helps data administrators, analysts, and developers organize and optimize this chain of actions, creating a seamless connection across data resources and processes. Orchestration methods and tools assist in designing, building, implementing, and monitoring these complex data workloads.
Choose an orchestration method
To select an orchestration method, identify whether your workloads are event-driven, time-driven, or both. An event is a state change, such as a change to data in a database or a file added to a storage system. In event-driven orchestration, an action on a website might trigger a data activity, or an object landing in a certain bucket might need to be processed immediately on arrival. In time-driven orchestration, new data might need to be loaded once per day or frequently enough to produce hourly reports. You can combine event-driven and time-driven orchestration in scenarios where, for example, objects must be loaded into a data lake in real time but activity reports on the data lake are generated only daily.
Choose an orchestration tool
Orchestration tools assist with tasks that are involved in managing complex data workloads, such as combining multiple Google Cloud or third-party services with BigQuery jobs, or running multiple BigQuery jobs in parallel. Each workload has unique requirements for dependency and parameter management to ensure that tasks are executed in the correct order using the correct data. Google Cloud provides several orchestration options that are based on orchestration method and workload requirements.
We recommend using Dataform, Workflows, Cloud Composer, or Vertex AI Pipelines for most use cases. Consult the following chart for a side-by-side comparison:
| | Dataform | Workflows | Cloud Composer | Vertex AI Pipelines |
|---|---|---|---|---|
| Focus | Data transformation | Microservices | ETL or ELT | Machine learning |
| Complexity | * | ** | *** | ** |
| User profile | Data analyst or admin | Data architect | Data engineer | Data analyst |
| Code type | JavaScript and SQL | YAML or JSON | Python | Python |
| Serverless? | Yes | Yes | No (fully managed) | Yes |
| Not suitable for | Chains of external services | Data transformation and processing | Low-latency or event-driven pipelines | Infrastructure tasks |
The following sections detail these orchestration tools and several others.
Scheduled queries
The simplest form of workload orchestration is scheduling recurring queries directly in BigQuery. While this is the least complex approach to orchestration, we recommend it only for straightforward query chains with no external dependencies. Queries scheduled in this way must be written in GoogleSQL and can include data definition language (DDL) and data manipulation language (DML) statements.
Orchestration method: time-driven
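For example, a recurring scheduled query might use a GoogleSQL `MERGE` (DML) statement to upsert the previous day's activity into a reporting table. The dataset, table, and column names below are hypothetical; a sketch of what such a query could look like:

```sql
-- Hypothetical daily upsert, run as a scheduled query.
MERGE INTO mydataset.daily_summary AS target
USING (
  SELECT
    user_id,
    DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY) AS report_date,
    COUNT(*) AS event_count
  FROM mydataset.events
  WHERE DATE(event_timestamp) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
  GROUP BY user_id
) AS source
ON target.user_id = source.user_id
   AND target.report_date = source.report_date
WHEN MATCHED THEN
  UPDATE SET event_count = source.event_count
WHEN NOT MATCHED THEN
  INSERT (user_id, report_date, event_count)
  VALUES (source.user_id, source.report_date, source.event_count);
```

Because the statement is idempotent for a given day, rerunning the schedule does not duplicate rows.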
Dataform
Dataform is a free, SQL-based, opinionated transformation framework that orchestrates complex data transformation tasks in BigQuery. When raw data is loaded into BigQuery, Dataform helps you create an organized, tested, version-controlled collection of datasets and tables. To learn more about using Dataform with BigQuery, see Create and execute a SQL workflow.
Orchestration method: event-driven
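As a sketch, a Dataform SQLX file pairs a JavaScript `config` block with the SQL that defines a dataset; the `${ref()}` function declares a dependency on another table so that Dataform can order executions correctly. The file, dataset, and table names here are hypothetical:

```sql
-- definitions/daily_user_counts.sqlx (hypothetical file)
config {
  type: "table",
  schema: "reporting",
  description: "Daily event counts per user."
}

SELECT
  user_id,
  DATE(event_timestamp) AS event_date,
  COUNT(*) AS event_count
FROM ${ref("raw_events")}  -- dependency on another Dataform dataset
GROUP BY user_id, event_date
```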
Workflows
Workflows is a serverless tool that orchestrates HTTP-based services with very low latency. It is best for chaining microservices together, automating infrastructure tasks, integrating with external systems, or creating a sequence of operations in Google Cloud. To learn more about using Workflows with BigQuery, see Run multiple BigQuery jobs in parallel.
Orchestration method: event-driven and time-driven
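A minimal Workflows definition, written in YAML, can call BigQuery through the `googleapis.bigquery.v2.jobs.query` connector. This sketch assumes the workflow's service account has permission to run BigQuery jobs in the current project:

```yaml
main:
  steps:
    - init:
        assign:
          - project: ${sys.get_env("GOOGLE_CLOUD_PROJECT_ID")}
    - runQuery:
        call: googleapis.bigquery.v2.jobs.query
        args:
          projectId: ${project}
          body:
            useLegacySql: false
            query: "SELECT COUNT(*) AS n FROM `bigquery-public-data.samples.shakespeare`"
        result: queryResult
    - returnResult:
        return: ${queryResult}
```

Additional steps can fan out in parallel or chain further HTTP calls, which is where Workflows is strongest.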
Cloud Composer
Cloud Composer is a fully managed tool built on Apache Airflow. It is best for extract, transform, load (ETL) or extract, load, transform (ELT) workloads as it supports several operator types and patterns, as well as task execution across other Google Cloud products and external targets. To learn more about using Cloud Composer with BigQuery, see Run a data analytics DAG in Google Cloud.
Orchestration method: time-driven
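In Cloud Composer, a workload is expressed as an Airflow DAG. The following sketch uses the `BigQueryInsertJobOperator` from the Google provider package to run a query on a daily schedule; the DAG ID and query are hypothetical:

```python
# Hypothetical Airflow DAG for Cloud Composer that runs a BigQuery query daily.
from datetime import datetime

from airflow import DAG
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator

with DAG(
    dag_id="daily_bq_summary",
    schedule_interval="@daily",   # time-driven orchestration
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
    run_summary_query = BigQueryInsertJobOperator(
        task_id="run_summary_query",
        configuration={
            "query": {
                "query": "SELECT CURRENT_DATE() AS run_date",
                "useLegacySql": False,
            }
        },
    )
```

Downstream tasks, such as exports or notifications, can be chained with Airflow's dependency operators (for example, `>>`).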
Vertex AI Pipelines
Vertex AI Pipelines is a serverless tool based on Kubeflow Pipelines specially designed for orchestrating machine learning workloads. It automates and connects all tasks of your model development and deployment, from training data to code, giving you a complete view of how your models work. To learn more about using Vertex AI Pipelines with BigQuery, see Export and deploy a BigQuery machine learning model for prediction.
Orchestration method: event-driven
Apigee Integration
Apigee Integration is an extension of the Apigee platform that includes connectors and data transformation tools. It is best for integrating with external enterprise applications, like Salesforce. To learn more about using Apigee Integration with BigQuery, see Get started with Apigee Integration and a Salesforce trigger.
Orchestration method: event-driven and time-driven
Cloud Data Fusion
Cloud Data Fusion is a data integration tool that offers code-free ELT/ETL pipelines and over 150 preconfigured connectors and transformations. To learn more about using Cloud Data Fusion with BigQuery, see Replicating data from MySQL to BigQuery.
Orchestration method: event-driven and time-driven
Cloud Scheduler
Cloud Scheduler is a fully managed scheduler for jobs, such as batch or streaming operations and infrastructure tasks, that should run at defined time intervals. To learn more about using Cloud Scheduler with BigQuery, see Scheduling workflows with Cloud Scheduler.
Orchestration method: time-driven
Cloud Tasks
Cloud Tasks is a fully managed service for asynchronous task distribution of jobs that can execute independently, outside of your main workload. It is best for delegating slow background operations or managing API call rates. To learn more about using Cloud Tasks with BigQuery, see Add a task to a Cloud Tasks queue.
Orchestration method: event-driven
Third-party tools
You can also connect to BigQuery using a number of popular third-party tools such as CData and SnapLogic. The BigQuery Ready program offers a full list of validated partner solutions.
Messaging tools
Many data workloads require additional messaging connections between decoupled microservices that only need to be activated when certain events occur. Google Cloud provides two tools that are designed to integrate with BigQuery.
Pub/Sub
Pub/Sub is an asynchronous messaging tool for data integration pipelines. It is designed to ingest and distribute data like server events and user interactions. It can also be used for parallel processing and data streaming from IoT devices. To learn more about using Pub/Sub with BigQuery, see Stream from Pub/Sub to BigQuery.
Eventarc
Eventarc is an event-driven tool that lets you manage the flow of state changes throughout your data pipeline. This tool has a wide range of use cases including automated error remediation, resource labeling, image retouching, and more. To learn more about using Eventarc with BigQuery, see Build a BigQuery processing pipeline with Eventarc.
What's next
- Learn to schedule recurring queries directly in BigQuery.
- Get started with Dataform.
- Get started with Workflows.
- Get started with Cloud Composer.
- Get started with Vertex AI Pipelines.
- Get started with Apigee Integration.
- Get started with Cloud Data Fusion.
- Get started with Cloud Scheduler.
- Get started with Pub/Sub.
- Get started with Eventarc.