Discover and catalog Cloud Storage data
This document explains how to use automatic discovery of Cloud Storage data.
Dataplex automatic discovery is a feature in BigQuery that lets you scan data in Cloud Storage buckets to extract and then catalog metadata. As part of the scan, automatic discovery creates BigLake or external tables for structured data and object tables for unstructured data, which you can then use for analytics and AI. The tables are automatically cataloged in Dataplex Catalog, which you can search or browse.
To use automatic discovery of Cloud Storage data, you create and run a discovery scan.
Overview
A discovery scan does the following:
- Scans the data in the Cloud Storage bucket or path.
- Groups structured and semi-structured data into tables.
- Collects metadata, such as the table name, schema, and partition definition.
- Creates and updates BigLake, external, or object tables in BigQuery using the schema and partition definition.
For unstructured data, such as images and videos, the discovery scan
detects and registers groups of files that share the same media type as
BigLake object tables. For example, if gs://images/group1
contains GIF images, and gs://images/group2
contains JPEG images,
the discovery scan detects and registers two filesets.
For structured data, such as Avro, the discovery scan registers groups of files as BigLake external tables and detects files only if they're located in folders that contain the same data format and compatible schema.
The discovery scan supports the following structured and semi-structured data formats:
- Parquet
- Avro
- ORC
- JSON (only the newline-delimited format)
- CSV (but not CSV files that have comment rows)
The discovery scan supports the following compression formats for structured and semi-structured data:
Internal compression for the following formats:
Compression File extension sample Supported format gzip .gz.parquet
Parquet lz4 .lz4.parquet
Parquet Snappy .snappy.parquet
Parquet, ORC, Avro lzo .lzo.parquet
Parquet, ORC External compression for JSON and CSV files:
- gzip
- bzip2
The discovered tables are registered in BigQuery as BigLake external tables, BigLake object tables, or external tables. This makes their data available for analysis in BigQuery. Metadata caching for BigLake tables and object tables is also enabled. All the BigLake tables are automatically ingested into Dataplex Catalog for search and discovery.
Before you begin
Ensure that you have the required Identity and Access Management (IAM) permissions to perform the tasks in this document.
Required roles for the service account
Before you begin, assign the IAM permissions to the Dataplex service account in your project:
service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com
Replace PROJECT_NUMBER
with the project in which the
Dataplex API is enabled.
To ensure that the Dataplex service account has the necessary permissions to run a discovery scan, ask your administrator to grant the Dataplex service account the following IAM roles:
-
BigQuery User (
roles/bigquery.user
) on the data source project -
Storage Object Viewer (
roles/storage.objectViewer
) on data source bucket -
Provide a connection:
BigQuery Connection Admin (
roles/bigquery.connectionAdmin
)
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to run a discovery scan. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to run a discovery scan:
-
bigquery.datasets.create
on the data source project -
storage.buckets.get
on the data source bucket -
storage.objects.get
on the data source bucket -
storage.objects.list
on the data source bucket -
bigquery.datasets.get
on the data source project -
Provide a connection:
-
bigquery.connections.delegate
-
bigquery.connections.use
-
Your administrator might also be able to give the Dataplex service account these permissions with custom roles or other predefined roles.
Required roles for end users
To ensure that you has the necessary permissions to use the DataScan API, ask your administrator to grant you the following IAM roles:
-
Full access to DataScan resources:
Dataplex DataScan Administrator (
roles/dataplex.dataScanAdmin
) on your project -
Write access to DataScan resources:
Dataplex DataScan Editor (
roles/dataplex.dataScanEditor
) on your project -
Read access to DataScan resources, excluding the results:
Dataplex DataScan Viewer (
roles/dataplex.dataScanViewer
) on your project -
Read access to DataScan resources, including the results:
Dataplex DataScan DataViewer (
roles/dataplex.dataScanDataViewer
) on your project
For more information about granting roles, see Manage access to projects, folders, and organizations.
These predefined roles contain the permissions required to use the DataScan API. To see the exact permissions that are required, expand the Required permissions section:
Required permissions
The following permissions are required to use the DataScan API:
-
Create a DataScan:
dataplex.datascans.create
on your project -
Delete a DataScan:
dataplex.datascans.delete
on your project or a DataScanresource -
View DataScan details excluding results:
dataplex.datascans.get
on your projector a DataScan resource -
View DataScan details including results:
dataplex.datascans.getData
on your project or a DataScan resource -
List DataScans:
dataplex.datascans.list
on your project or a DataScan resource -
Run a DataScan:
dataplex.datascans.run
on your project or a DataScan resource -
Update the description of a DataScan:
dataplex.datascans.update
on your projector a DataScan resource -
View the IAM permissions of the DataScan:
dataplex.datascans.getIamPolicy
on your project or a DataScan resource -
Set the IAM permissions on the DataScan:
dataplex.datascans.setIamPolicy
on your project or a DataScan resource
Your administrator might also be able to give you these permissions with custom roles or other predefined roles.
Create a discovery scan
To discover data, you must create and run a discovery scan. You can
set a schedule for the scan or run the scan on demand. To create and run a
discovery scan, you must have the dataplex.datascans.create
permission.
When the discovery scan runs, it creates a new dataset in
BigQuery that corresponds to the Cloud Storage
bucket that was scanned. The BigQuery dataset name is the
same as the Cloud Storage bucket name. Invalid characters in the
bucket name are replaced by an underscore. If the dataset name isn't
available, a postfix is appended (for example, _discovered_001
). The
dataset contains the BigLake external or
non-BigLake external tables that were created by the
discovery scan for further analysis.
Console
In the Google Cloud console, go to the BigQuery page.
In Explorer, click
Add.On the Add pane in the Popular sources section, click Auto-create external and BigLake tables from GCS.
On the Create table pane in the Source section, configure the following details about the data to scan:
- Enter a name for the scan.
- In the Scan ID field, enter a unique ID that follows the resource naming convention. If you don't provide an ID, the discovery scan generates the scan ID.
- Optional: Provide a description of the scan.
- To specify the Cloud Storage bucket that contains the files to scan, in the Bucket field, browse to and select the bucket.
Optional: Define the data to include or exclude from the discovery scan by providing a list of glob patterns.
- Include: if only a subset of the data should be scanned, provide a list of glob patterns that match the objects to include.
- Exclude: provide a list of glob patterns that match the objects to exclude.
For example, if you want to exclude
gs://test_bucket/foo/..
from the discovery scan, enter**/foo/*
as the exclude path. Quotation marks cause errors. Make sure to enter**/foo/*
instead of"**/foo/*"
.If you provide both include patterns and exclude patterns, the exclude patterns are applied first.
To create BigLake tables from the scanned data, in the Connection ID field, provide your Google Cloud resource connection ID. For more information, see Google Cloud resource connections.
If you don't provide a resource connection ID, the discovery scan creates non-BigLake external tables.
In the Discovery frequency section, configure when you want the discovery scan to run:
Repeat: the scan runs on a predefined schedule. Provide the start time, days to run the scan, and the frequency, such as hourly.
On demand: the scan runs on demand.
Optional: In the JSON or CSV specifications section, specify how the scan should process JSON and CSV files. Click JSON or CSV specifications.
- To configure JSON options, select Enable JSON parsing options.
- Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for JSON data, all columns are registered as their primitive types, such as string, number, or boolean.
- Encoding format: the character encoding of the data, such as UTF-8, US-ASCII, or ISO-8859-1. If you don't specify a value, UTF-8 is used as the default.
- To configure CSV options, check Enable CSV parsing options.
- Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for CSV data, all columns are registered as strings.
- Header rows: the number of header rows, either
0
or1
. If you specify the value0
, the discovery scan infers headings and extracts the column names from the file. The default is0
. - Column delimiter character: the character that is used to
separate values. Provide a single character,
\r
(carriage return), or\n
(newline). The default is a comma (,
). - Encoding format: the character encoding of the data, such
as
UTF-8
,US-ASCII
, orISO-8859-1
. If you don't specify a value, UTF-8 is used as the default.
- To configure JSON options, select Enable JSON parsing options.
When you're finished configuring the data discovery scan, click Create (for a scheduled scan) or Run now (for an on-demand scan).
A scheduled scan is run according to the schedule that you set.
An on-demand scan is run once initially when you create it, and you can run the scan at any time. It can take several minutes for the scan to run.
gcloud
gcloud dataplex datascans create data-discovery --location=LOCATION --data-source-resource=GCS_BUCKET_PATH
Replace the following:
LOCATION
: the location in which you'd like to create your discovery scanGCS_BUCKET_PATH
: the Cloud Storage path of the bucket you want to scan
REST
To create a discovery scan, use the dataScans.create
method.
Monitor a discovery scan
To monitor the results of a discovery scan, you can query the logs created when a scan runs.
Console
In the Google Cloud console, go to the Logs explorer page.
In the Logs Explorer view, find the Query tab.
Click the Resource menu.
Select Cloud Dataplex DataScan. Click Apply.
Click the Log name menu.
In the Search log names field, enter
dataplex.googleapis.com%2Fdata_scan
. Select data_scan and click Apply.Optional: Filter the logs to a specific data scan ID or location by adding the following filters in the log query:
resource.type="dataplex.googleapis.com/DataScan" AND resource.labels.resource_container="projects/PROJECT_ID" AND resource.labels.datascan_id="DATA_SCAN_ID"
Replace the following:
PROJECT_ID
: your Google Cloud project IDDATA_SCAN_ID
: the DataScan ID
Click Run query.
gcloud
gcloud alpha dataplex datascans describe test-datascan --project=PROJECT_ID --location=LOCATION
Replace the following:
PROJECT_ID
: your Google Cloud project IDLOCATION
: the location of the discovery scan you want to monitor
REST
To monitor a discovery scan, use the
dataScans.get
method.
Query published BigLake tables
After you run the discovery scan, BigLake tables are published in a new dataset in BigQuery and are available for analysis in BigQuery using SQL, or in Dataproc using Apache Spark or Dataproc or HiveQL.
Query using SQL
You can view or query tables in BigQuery. For more information about how to run queries in BigQuery, see Run a query.
Query using Apache Spark
To query BigLake tables using Spark SQL on a Dataproc serverless job, follow these steps:
Create a PySpark script similar to the following sample script:
from pyspark.sql import SparkSession session = ( SparkSession.builder.appName("testing") .config("viewsEnabled","true") .config("materializationDataset", "DATASET_ID") .config("spark.hive.metastore.bigquery.project.id", "PROJECT_ID") .config("spark.hive.metastore.client.factory.class", "com.google.cloud.bigquery.metastore.client.BigQueryMetastoreClientFactory") .enableHiveSupport() .getOrCreate() ) session.sql("show databases").show() session.sql("use TABLE_NAME").show() session.sql("show tables").show() sql = "SELECT * FROM DATASET_ID.TABLE_ID LIMIT 10" df = session.read.format("bigquery").option("dataset", "DATASET_ID").load(sql) df.show()
Replace the following:
DATASET_ID
: ID of dataset for which users have create permissionPROJECT_ID
: ID of project with BigLake tableTABLE_NAME
: Name of BigLake tableTABLE_ID
: ID of BigLake table
Manage published BigLake tables
Published BigLake tables are created in BigQuery
by the discovery scan. Unless the metadata-managed-mode label is set to
user_managed
, the discovery scan manages the published BigLake
tables. The discovery scan handles new data discovery, schema inferences, and
schema evolution every time the scheduled or on-demand DataScans are run.
Update published BigLake tables
For BigLake tables published using the discovery scan jobs with the default configuration, the schema and other metadata is automatically updated with every datascan job run at the scheduled frequency.
To update a published BigLake table, follow these steps:
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project and dataset, then select the table.
In the Details pane, in the Labels section, make sure the metadata-managed-mode is set to
user_managed
. If it is set to a different value, follow these steps:Click
Edit details.Next to the metadata-managed-mode key, in the value field, enter
user_managed
.
Delete published BigLake tables
To delete a published BigLake table, follow these steps:
Delete the data files for the table in Cloud Storage bucket.
In the Google Cloud console, go to the BigQuery page.
In the Explorer pane, expand your project and dataset, then select the table.
In the Details pane, in the Labels section, make sure the metadata-managed-mode label isn't set to
user_managed
. If it's set touser_managed
, follow these steps:Click Edit details
.Next to the metadata-managed-mode key, in the value field, enter a value that isn't
user_managed
.
Click Run. The discovery scan runs on demand.
After the discovery scan runs, the BigLake table is deleted in BigQuery and isn't available to list or query through Spark.
Run a discovery scan on demand
To run a discovery scan on demand, use the
dataScans.run
method
in the Dataplex API.
List discovery scans
gcloud
gcloud dataplex datascans list --location=LOCATION --project=PROJECT_ID
Replace the following:
LOCATION
: the location of your projectPROJECT_ID
: your Google Cloud project ID
REST
To retrieve the list of scans in your project, use the
dataScans.list
method
in the Dataplex API.
Update a discovery scan
To change the schedule of a scan, for example changing the schedule from on-demand to recurring, you need to update the DataScan.
gcloud
gcloud dataplex datascans update data-discovery SCAN_ID --location=LOCATION
Replace the following:
SCAN_ID
: the ID of the discovery scan you want to updateLOCATION
: the location of the discovery scan you want to update
REST
To update a discovery scan, use the
dataScans.patch
method
in the Dataplex API.
Delete a discovery scan
gcloud
gcloud dataplex datascans delete SCAN_ID --location=LOCATION --async
Replace the following:
SCAN_ID
: the ID of the discovery scan you want to deleteLOCATION
: the location of the discovery scan you want to delete
REST
To delete a discovery scan, use the
dataScans.delete
method
in the Dataplex API.