Discover and catalog Cloud Storage data

This document explains how to use automatic discovery of Cloud Storage data.

Dataplex automatic discovery is a feature in BigQuery that lets you scan data in Cloud Storage buckets to extract and then catalog metadata. As part of the scan, automatic discovery creates BigLake or external tables for structured data and object tables for unstructured data, which you can then use for analytics and AI. The tables are automatically cataloged in Dataplex Catalog, which you can search or browse.

To use automatic discovery of Cloud Storage data, you create and run a discovery scan.

Overview

A discovery scan does the following:

  • Scans the data in the Cloud Storage bucket or path.
  • Groups structured and semi-structured data into tables.
  • Collects metadata, such as the table name, schema, and partition definition.
  • Creates and updates BigLake, external, or object tables in BigQuery using the schema and partition definition.

For unstructured data, such as images and videos, the discovery scan detects and registers groups of files that share the same media type as BigLake object tables. For example, if gs://images/group1 contains GIF images, and gs://images/group2 contains JPEG images, the discovery scan detects and registers two filesets.

For structured data, such as Avro, the discovery scan registers groups of files as BigLake external tables and detects files only if they're located in folders that contain the same data format and compatible schema.

The discovery scan supports the following structured and semi-structured data formats:

The discovery scan supports the following compression formats for structured and semi-structured data:

  • Internal compression for the following formats:

    Compression File extension sample Supported format
    gzip .gz.parquet Parquet
    lz4 .lz4.parquet Parquet
    Snappy .snappy.parquet Parquet, ORC, Avro
    lzo .lzo.parquet Parquet, ORC
  • External compression for JSON and CSV files:

    • gzip
    • bzip2

The discovered tables are registered in BigQuery as BigLake external tables, BigLake object tables, or external tables. This makes their data available for analysis in BigQuery. Metadata caching for BigLake tables and object tables is also enabled. All the BigLake tables are automatically ingested into Dataplex Catalog for search and discovery.

Before you begin

Ensure that you have the required Identity and Access Management (IAM) permissions to perform the tasks in this document.

Required roles for the service account

Before you begin, assign the IAM permissions to the Dataplex service account in your project:

  service-PROJECT_NUMBER@gcp-sa-dataplex.iam.gserviceaccount.com
  

Replace PROJECT_NUMBER with the project in which the Dataplex API is enabled.

To ensure that the Dataplex service account has the necessary permissions to run a discovery scan, ask your administrator to grant the Dataplex service account the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to run a discovery scan. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to run a discovery scan:

  • bigquery.datasets.create on the data source project
  • storage.buckets.get on the data source bucket
  • storage.objects.get on the data source bucket
  • storage.objects.list on the data source bucket
  • bigquery.datasets.get on the data source project
  • Provide a connection:
    • bigquery.connections.delegate
    • bigquery.connections.use

Your administrator might also be able to give the Dataplex service account these permissions with custom roles or other predefined roles.

Required roles for end users

To ensure that you has the necessary permissions to use the DataScan API, ask your administrator to grant you the following IAM roles:

For more information about granting roles, see Manage access to projects, folders, and organizations.

These predefined roles contain the permissions required to use the DataScan API. To see the exact permissions that are required, expand the Required permissions section:

Required permissions

The following permissions are required to use the DataScan API:

  • Create a DataScan: dataplex.datascans.create on your project
  • Delete a DataScan: dataplex.datascans.delete on your project or a DataScanresource
  • View DataScan details excluding results: dataplex.datascans.get on your projector a DataScan resource
  • View DataScan details including results: dataplex.datascans.getData on your project or a DataScan resource
  • List DataScans: dataplex.datascans.list on your project or a DataScan resource
  • Run a DataScan: dataplex.datascans.run on your project or a DataScan resource
  • Update the description of a DataScan: dataplex.datascans.update on your projector a DataScan resource
  • View the IAM permissions of the DataScan: dataplex.datascans.getIamPolicy on your project or a DataScan resource
  • Set the IAM permissions on the DataScan: dataplex.datascans.setIamPolicy on your project or a DataScan resource

Your administrator might also be able to give you these permissions with custom roles or other predefined roles.

Create a discovery scan

To discover data, you must create and run a discovery scan. You can set a schedule for the scan or run the scan on demand. To create and run a discovery scan, you must have the dataplex.datascans.create permission.

When the discovery scan runs, it creates a new dataset in BigQuery that corresponds to the Cloud Storage bucket that was scanned. The BigQuery dataset name is the same as the Cloud Storage bucket name. Invalid characters in the bucket name are replaced by an underscore. If the dataset name isn't available, a postfix is appended (for example, _discovered_001). The dataset contains the BigLake external or non-BigLake external tables that were created by the discovery scan for further analysis.

Console

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. In Explorer, click Add.

  3. On the Add pane in the Popular sources section, click Auto-create external and BigLake tables from GCS.

  4. On the Create table pane in the Source section, configure the following details about the data to scan:

    1. Enter a name for the scan.
    2. In the Scan ID field, enter a unique ID that follows the resource naming convention. If you don't provide an ID, the discovery scan generates the scan ID.
    3. Optional: Provide a description of the scan.
    4. To specify the Cloud Storage bucket that contains the files to scan, in the Bucket field, browse to and select the bucket.
    5. Optional: Define the data to include or exclude from the discovery scan by providing a list of glob patterns.

      • Include: if only a subset of the data should be scanned, provide a list of glob patterns that match the objects to include.
      • Exclude: provide a list of glob patterns that match the objects to exclude.

      For example, if you want to exclude gs://test_bucket/foo/.. from the discovery scan, enter **/foo/* as the exclude path. Quotation marks cause errors. Make sure to enter **/foo/* instead of "**/foo/*".

      If you provide both include patterns and exclude patterns, the exclude patterns are applied first.

    6. To create BigLake tables from the scanned data, in the Connection ID field, provide your Google Cloud resource connection ID. For more information, see Google Cloud resource connections.

      If you don't provide a resource connection ID, the discovery scan creates non-BigLake external tables.

  5. In the Discovery frequency section, configure when you want the discovery scan to run:

    • Repeat: the scan runs on a predefined schedule. Provide the start time, days to run the scan, and the frequency, such as hourly.

    • On demand: the scan runs on demand.

  6. Optional: In the JSON or CSV specifications section, specify how the scan should process JSON and CSV files. Click JSON or CSV specifications.

    1. To configure JSON options, select Enable JSON parsing options.
      • Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for JSON data, all columns are registered as their primitive types, such as string, number, or boolean.
      • Encoding format: the character encoding of the data, such as UTF-8, US-ASCII, or ISO-8859-1. If you don't specify a value, UTF-8 is used as the default.
    2. To configure CSV options, check Enable CSV parsing options.
      • Disable type inference: whether the discovery scan should infer data types when scanning data. If you disable type inference for CSV data, all columns are registered as strings.
      • Header rows: the number of header rows, either 0 or 1. If you specify the value 0, the discovery scan infers headings and extracts the column names from the file. The default is 0.
      • Column delimiter character: the character that is used to separate values. Provide a single character, \r (carriage return), or \n (newline). The default is a comma (,).
      • Encoding format: the character encoding of the data, such as UTF-8, US-ASCII, or ISO-8859-1. If you don't specify a value, UTF-8 is used as the default.
  7. When you're finished configuring the data discovery scan, click Create (for a scheduled scan) or Run now (for an on-demand scan).

    A scheduled scan is run according to the schedule that you set.

    An on-demand scan is run once initially when you create it, and you can run the scan at any time. It can take several minutes for the scan to run.

gcloud

gcloud dataplex datascans create data-discovery --location=LOCATION
--data-source-resource=GCS_BUCKET_PATH

Replace the following:

  • LOCATION: the location in which you'd like to create your discovery scan
  • GCS_BUCKET_PATH: the Cloud Storage path of the bucket you want to scan

REST

To create a discovery scan, use the dataScans.create method.

Monitor a discovery scan

To monitor the results of a discovery scan, you can query the logs created when a scan runs.

Console

  1. In the Google Cloud console, go to the Logs explorer page.

    Go to Logs explorer

  2. In the Logs Explorer view, find the Query tab.

  3. Click the Resource menu.

  4. Select Cloud Dataplex DataScan. Click Apply.

  5. Click the Log name menu.

  6. In the Search log names field, enter dataplex.googleapis.com%2Fdata_scan. Select data_scan and click Apply.

  7. Optional: Filter the logs to a specific data scan ID or location by adding the following filters in the log query:

    resource.type="dataplex.googleapis.com/DataScan"
    AND resource.labels.resource_container="projects/PROJECT_ID"
    AND resource.labels.datascan_id="DATA_SCAN_ID"
    

    Replace the following:

    • PROJECT_ID: your Google Cloud project ID
    • DATA_SCAN_ID: the DataScan ID
  8. Click Run query.

gcloud

gcloud alpha dataplex datascans describe test-datascan --project=PROJECT_ID --location=LOCATION

Replace the following:

  • PROJECT_ID: your Google Cloud project ID
  • LOCATION: the location of the discovery scan you want to monitor

REST

To monitor a discovery scan, use the dataScans.get method.

Query published BigLake tables

After you run the discovery scan, BigLake tables are published in a new dataset in BigQuery and are available for analysis in BigQuery using SQL, or in Dataproc using Apache Spark or Dataproc or HiveQL.

Query using SQL

You can view or query tables in BigQuery. For more information about how to run queries in BigQuery, see Run a query.

Query using Apache Spark

To query BigLake tables using Spark SQL on a Dataproc serverless job, follow these steps:

  1. Create a PySpark script similar to the following sample script:

      from pyspark.sql import SparkSession
      session = (
        SparkSession.builder.appName("testing")
          .config("viewsEnabled","true")
          .config("materializationDataset", "DATASET_ID")
          .config("spark.hive.metastore.bigquery.project.id", "PROJECT_ID")
          .config("spark.hive.metastore.client.factory.class", "com.google.cloud.bigquery.metastore.client.BigQueryMetastoreClientFactory")
          .enableHiveSupport()
          .getOrCreate()
      )
    
      session.sql("show databases").show()
      session.sql("use TABLE_NAME").show()
      session.sql("show tables").show()
    
      sql = "SELECT * FROM DATASET_ID.TABLE_ID LIMIT 10"
      df = session.read.format("bigquery").option("dataset", "DATASET_ID").load(sql)
      df.show()

    Replace the following:

    • DATASET_ID: ID of dataset for which users have create permission
    • PROJECT_ID: ID of project with BigLake table
    • TABLE_NAME: Name of BigLake table
    • TABLE_ID: ID of BigLake table
  2. Submit the batch job.

Manage published BigLake tables

Published BigLake tables are created in BigQuery by the discovery scan. Unless the metadata-managed-mode label is set to user_managed, the discovery scan manages the published BigLake tables. The discovery scan handles new data discovery, schema inferences, and schema evolution every time the scheduled or on-demand DataScans are run.

Update published BigLake tables

For BigLake tables published using the discovery scan jobs with the default configuration, the schema and other metadata is automatically updated with every datascan job run at the scheduled frequency.

To update a published BigLake table, follow these steps:

  1. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  2. Update one or more table properties.

  3. In the Explorer pane, expand your project and dataset, then select the table.

  4. In the Details pane, in the Labels section, make sure the metadata-managed-mode is set to user_managed. If it is set to a different value, follow these steps:

    1. Click Edit details.

    2. Next to the metadata-managed-mode key, in the value field, enter user_managed.

Delete published BigLake tables

To delete a published BigLake table, follow these steps:

  1. Delete the data files for the table in Cloud Storage bucket.

  2. In the Google Cloud console, go to the BigQuery page.

    Go to BigQuery

  3. In the Explorer pane, expand your project and dataset, then select the table.

  4. In the Details pane, in the Labels section, make sure the metadata-managed-mode label isn't set to user_managed. If it's set to user_managed, follow these steps:

    1. Click Edit details .

    2. Next to the metadata-managed-mode key, in the value field, enter a value that isn't user_managed.

  5. Click Run. The discovery scan runs on demand.

After the discovery scan runs, the BigLake table is deleted in BigQuery and isn't available to list or query through Spark.

Run a discovery scan on demand

To run a discovery scan on demand, use the dataScans.run method in the Dataplex API.

List discovery scans

gcloud

gcloud dataplex datascans list --location=LOCATION --project=PROJECT_ID

Replace the following:

  • LOCATION: the location of your project
  • PROJECT_ID: your Google Cloud project ID

REST

To retrieve the list of scans in your project, use the dataScans.list method in the Dataplex API.

Update a discovery scan

To change the schedule of a scan, for example changing the schedule from on-demand to recurring, you need to update the DataScan.

gcloud

gcloud dataplex datascans update data-discovery SCAN_ID --location=LOCATION

Replace the following:

  • SCAN_ID: the ID of the discovery scan you want to update
  • LOCATION: the location of the discovery scan you want to update

REST

To update a discovery scan, use the dataScans.patch method in the Dataplex API.

Delete a discovery scan

gcloud

gcloud dataplex datascans delete SCAN_ID --location=LOCATION --async

Replace the following:

  • SCAN_ID: the ID of the discovery scan you want to delete
  • LOCATION: the location of the discovery scan you want to delete

REST

To delete a discovery scan, use the dataScans.delete method in the Dataplex API.