Visualize BigQuery data in Jupyter notebooks


This tutorial describes how to explore and visualize data by using the BigQuery client library for Python and pandas in a managed Jupyter notebook instance on Vertex AI Workbench. Data visualization tools can help you to analyze your BigQuery data interactively, and to identify trends and communicate insights from your data. This tutorial uses data found in the Google Trends BigQuery public dataset.

Objectives

  • Create a managed Jupyter notebook instance using Vertex AI Workbench.
  • Query BigQuery data using magic commands in notebooks.
  • Query and visualize BigQuery data using the BigQuery Python client library and pandas.

Costs

BigQuery is a paid product, so you incur BigQuery usage costs when accessing BigQuery. The first 1 TB of query data processed each month is free. For more information, see the BigQuery pricing page.

Vertex AI Workbench is a paid product, and you incur compute, storage, and management costs when using Vertex AI Workbench instances. For more information, see the Vertex AI Workbench pricing page.

Before you begin

  1. In the Google Cloud console, on the project selector page, select or create a Google Cloud project.

    Go to project selector

  2. Make sure that billing is enabled for your Google Cloud project.

  3. Enable the BigQuery API.

    Enable the API

    For new projects, BigQuery is automatically enabled.

  4. Enable the Notebooks API.

    Enable the Notebook API

Overview: Jupyter notebooks

A notebook provides an environment in which to author and execute code. A notebook is essentially a source artifact, saved as an IPYNB file. It can contain descriptive text content, executable code blocks, and output rendered as interactive HTML.

Structurally, a notebook is a sequence of cells. A cell is a block of input text that is evaluated to produce results. Cells can be of three types:

  • Code cells contain code to evaluate. The output or results of executed code are rendered in line with the executed code.
  • Markdown cells contain Markdown text that is converted to HTML to produce headers, lists, and formatted text.
  • Raw cells can be used to render different code formats into HTML or LaTeX.

The following image shows a Markdown cell that's followed by a Python code cell, and then followed by the output:

Jupyter Markdown and code cells.

Each opened notebook is associated with a running session (also known as a kernel in Python). This session executes all the code in the notebook, and it manages the state. The state includes the variables with their values, functions and classes, and any existing Python modules that you load.

In Google Cloud, you can use a Vertex AI Workbench notebook-based environment to query and explore data, develop and train a model, and run your code as part of a pipeline. In this tutorial, you create a managed notebook instance on Vertex AI Workbench and then explore BigQuery data within the JupyterLab interface.

Create a managed notebooks instance

In this section, you set up a JupyterLab instance on Google Cloud so that you can to create managed notebooks.

  1. In the Google Cloud console, go to the Workbench page.

    Go to Workbench

  2. Click  New notebook.

  3. In the Notebook name field, enter a name for your instance.

  4. In the Region list, select a region for your instance.

  5. In the Permission section, select an option to define which users can access the managed notebooks instance:

    • Service account: This option gives access to all users who have access to the Compute Engine service account that you link to the runtime. To specify your own service account, clear the Use Compute Engine default service account checkbox, and then enter the service account email address that you want to use. For more information about service accounts, see Types of service accounts.
    • Single user only: This option gives access only to a specific user. In the User email field, enter the user account email address of the user who will use the managed notebooks instance.
  6. Optional: To modify your instance's advanced settings, click Advanced settings. For more information, see Create an instance by using advanced settings.

  7. Click Create.

    Allow a few minutes for the instance to be created. Vertex AI Workbench automatically starts the instance. When the instance is ready to use, Vertex AI Workbench activates an Open JupyterLab link.

Browse BigQuery resources in JupyterLab

In this section, you open JupyterLab and explore the BigQuery resources that are available in a managed notebooks instance.

  1. On the row for the managed notebooks instance that you created, click Open JupyterLab.

    If you're prompted, click Authenticate if you agree to the terms. Your managed notebooks instance opens JupyterLab in a new browser tab.

  2. In the JupyterLab navigation menu, click BigQuery BigQuery in Notebooks.

    The BigQuery pane lists available projects and datasets, where you can perform tasks as follows:

    • To view a description of a dataset, double-click the dataset name.
    • To show a dataset's tables, views, and models, expand the dataset.
    • To open a summary description as a tab in JupyterLab, double-click a table, view, or model.

    Note: On the summary description for a table, click the Preview tab to preview a table's data. The following image shows a preview of the international_top_terms table found in the google_trends dataset in the bigquery-public-data project:

    International top terms list.

Query notebook data using the %%bigquery magic command

In this section, you write SQL directly in notebook cells and read data from BigQuery into the Python notebook.

Magic commands that use a single or double percentage character (% or %%) let you use minimal syntax to interact with BigQuery within the notebook. The BigQuery client library for Python is automatically installed in a managed notebooks instance. Behind the scenes, the %%bigquery magic command uses the BigQuery client library for Python to run the given query, convert the results to a pandas DataFrame, optionally save the results to a variable, and then display the results.

Note: As of version 1.26.0 of the google-cloud-bigquery Python package, the BigQuery Storage API is used by default to download results from the %%bigquery magics.

  1. To open a notebook file, select File > New > Notebook.

  2. In the Select Kernel dialog, select Python (Local), and then click Select.

    Your new IPYNB file opens.

  3. To get the number of regions by country in the international_top_terms dataset, enter the following statement:

    %%bigquery
    SELECT
      country_code,
      country_name,
      COUNT(DISTINCT region_code) AS num_regions
    FROM
      `bigquery-public-data.google_trends.international_top_terms`
    WHERE
      refresh_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
    GROUP BY
      country_code,
      country_name
    ORDER BY
      num_regions DESC;
  4. Click  Run cell.

    The output is similar to the following:

    Query complete after 0.07s: 100%|██████████| 4/4 [00:00<00:00, 1440.60query/s]
    Downloading: 100%|██████████| 41/41 [00:02<00:00, 20.21rows/s]
    country_code      country_name    num_regions
    0   TR  Turkey         81
    1   TH  Thailand       77
    2   VN  Vietnam        63
    3   JP  Japan          47
    4   RO  Romania        42
    5   NG  Nigeria        37
    6   IN  India          36
    7   ID  Indonesia      34
    8   CO  Colombia       33
    9   MX  Mexico         32
    10  BR  Brazil         27
    11  EG  Egypt          27
    12  UA  Ukraine        27
    13  CH  Switzerland    26
    14  AR  Argentina      24
    15  FR  France         22
    16  SE  Sweden         21
    17  HU  Hungary        20
    18  IT  Italy          20
    19  PT  Portugal       20
    20  NO  Norway         19
    21  FI  Finland        18
    22  NZ  New Zealand    17
    23  PH  Philippines    17
    ...
    
  5. In the next cell (below the output from the previous cell), enter the following command to run the same query, but this time save the results to a new pandas DataFrame that's named regions_by_country. You provide that name by using an argument with the %%bigquery magic command.

    %%bigquery regions_by_country
    SELECT
      country_code,
      country_name,
      COUNT(DISTINCT region_code) AS num_regions
    FROM
      `bigquery-public-data.google_trends.international_top_terms`
    WHERE
      refresh_date = DATE_SUB(CURRENT_DATE, INTERVAL 1 DAY)
    GROUP BY
      country_code, country_name
    ORDER BY
      num_regions DESC;

    Note: For more information about available arguments for the %%bigquery command, see the client library magics documentation.

  6. Click  Run cell.

  7. In the next cell, enter the following command to look at the first few rows of the query results that you just read in:

    regions_by_country.head()
    
  8. Click  Run cell.

    The pandas DataFrame regions_by_country is ready to plot.

Query data in a notebook using the BigQuery client library directly

In this section, you use the BigQuery client library for Python directly to read data into the Python notebook.

The client library gives you more control over your queries and lets you use more complex configurations for queries and jobs. The library's integrations with pandas enable you to combine the power of declarative SQL with imperative code (Python) to help you analyze, visualize, and transform your data.

Note: You can use a number of Python data analysis, data wrangling, and visualization libraries, such as numpy, pandas, matplotlib, and many others. Several of these libraries are built on top of a DataFrame object.

  1. In the next cell, enter the following Python code to import the BigQuery client library for Python and initialize a client:

    from google.cloud import bigquery
    
    client = bigquery.Client()
    

    The BigQuery client is used to send and receive messages from the BigQuery API.

  2. Click  Run cell.

  3. In the next cell, enter the following code to retrieve the percentage of daily top terms in the US top_terms that overlap across time by number of days apart. The idea here is to look at each day's top terms and see what percentage of them overlap with the top terms from the day before, 2 days prior, 3 days prior, and so on (for all pairs of dates over about a month span).

    sql = """
    WITH
      TopTermsByDate AS (
        SELECT DISTINCT refresh_date AS date, term
        FROM `bigquery-public-data.google_trends.top_terms`
      ),
      DistinctDates AS (
        SELECT DISTINCT date
        FROM TopTermsByDate
      )
    SELECT
      DATE_DIFF(Dates2.date, Date1Terms.date, DAY)
        AS days_apart,
      COUNT(DISTINCT (Dates2.date || Date1Terms.date))
        AS num_date_pairs,
      COUNT(Date1Terms.term) AS num_date1_terms,
      SUM(IF(Date2Terms.term IS NOT NULL, 1, 0))
        AS overlap_terms,
      SAFE_DIVIDE(
        SUM(IF(Date2Terms.term IS NOT NULL, 1, 0)),
        COUNT(Date1Terms.term)
        ) AS pct_overlap_terms
    FROM
      TopTermsByDate AS Date1Terms
    CROSS JOIN
      DistinctDates AS Dates2
    LEFT JOIN
      TopTermsByDate AS Date2Terms
      ON
        Dates2.date = Date2Terms.date
        AND Date1Terms.term = Date2Terms.term
    WHERE
      Date1Terms.date <= Dates2.date
    GROUP BY
      days_apart
    
    ORDER BY
      days_apart;
    """
    pct_overlap_terms_by_days_apart = client.query(sql).to_dataframe()
    
    pct_overlap_terms_by_days_apart.head()

    The SQL being used is encapsulated in a Python string and then passed to the query() method to run a query. The to_dataframe method waits for the query to finish and downloads the results to a pandas DataFrame by using the BigQuery Storage API.

  4. Click  Run cell.

    The first few rows of query results appear below the code cell.

       days_apart   num_date_pairs  num_date1_terms overlap_terms   pct_overlap_terms
     0          0             32               800            800            1.000000
     1          1             31               775            203            0.261935
     2          2             30               750             73            0.097333
     3          3             29               725             31            0.042759
     4          4             28               700             23            0.032857
    

For more information about using BigQuery client libraries, see the quickstart Using client libraries.

Visualize BigQuery data

In this section, you use plotting capabilities to visualize the results from the queries that you previously ran in your Jupyter notebook.

  1. In the next cell, enter the following code to use the pandas DataFrame.plot() method to create a bar chart that visualizes the results of the query that returns the number of regions by country:

    regions_by_country.plot(kind="bar", x="country_name", y="num_regions", figsize=(15, 10))
    
  2. Click  Run cell.

    The chart is similar to the following:

    International top terms country results

  3. In the next cell, enter the following code to use the pandas DataFrame.plot() method to create a scatter plot that visualizes the results from the query for the percentage of overlap in the top search terms by days apart:

    pct_overlap_terms_by_days_apart.plot(
      kind="scatter",
      x="days_apart",
      y="pct_overlap_terms",
      s=len(pct_overlap_terms_by_days_apart["num_date_pairs"]) * 20,
      figsize=(15, 10)
      )
    
  4. Click  Run cell.

    The chart is similar to the following. The size of each point reflects the number of date pairs that are that many days apart in the data. For example, there are more pairs that are 1 day apart than 30 days apart because the top search terms are surfaced daily over about a month's time.

    International top terms days apart chart.

For more information about data visualization, see the pandas documentation.

Use the %bigquery_stats magic to get statistics and visualizations for all table columns

In this section, you use a notebook shortcut to get summary statistics and visualizations for all fields of a BigQuery table.

The BigQuery client library provides a magic command, %bigquery_stats, that you can call with a specific table name to provide an overview of the table and detailed statistics on each of the table's columns.

  1. In the next cell, enter the following code to run that analysis on the US top_terms table:

    %bigquery_stats bigquery-public-data.google_trends.top_terms
    
  2. Click  Run cell.

    After running for some time, an image appears with various statistics on each of the 7 variables in the top_terms table. The following image shows part of some example output:

    International top terms overview of statistics.

View your query history and reuse queries

To view your query history as a tab in JupyterLab, perform the following steps:

  1. In the JupyterLab navigation menu, click BigQuery BigQuery in Notebooks to open the BigQuery pane.

  2. In the BigQuery pane, scroll down and click Query history.

    Highlighted query history at bottom of left-nav

    A list of your queries opens in a new tab, where you can perform tasks such as the following:

    • To view the details of a query such as its Job ID, when the query was run, and how long it took, click the query.
    • To revise the query, run it again, or copy it into your notebook for future use, click Open query in editor.

Save and download your notebook

In this section, you save your notebook and download it if you want it for future use after cleaning up the resources used in this tutorial.

  1. Select File > Save Notebook.
  2. Select File > Download to download a local copy of your notebook as an IPYNB file on your computer.

Clean up

The easiest way to eliminate billing is to delete the Google Cloud project that you created for this tutorial.

  1. In the Google Cloud console, go to the Manage resources page.

    Go to Manage resources

  2. In the project list, select the project that you want to delete, and then click Delete.
  3. In the dialog, type the project ID, and then click Shut down to delete the project.

What's next