Querying Externally Partitioned Data in BigQuery

BigQuery with Google Cloud Storage

Data analysts within organisations often have a use case where they receive files (e.g. CSV, Parquet) at a scheduled frequency (daily, weekly, etc.) and are required to analyse them. They often face the challenge of building an ingestion pipeline to load the data into a data warehouse and make it available in a tabular format for further analysis.

BigQuery supports querying externally partitioned data in Google Cloud Storage without having to actually load the data into BigQuery. Let's see how easy this is to achieve.

Use case: Let's assume that a stock broker receives a daily file for each stock at the end of the day in CSV format. He wants to analyse the data in these files by running SQL queries.

Solution: Let's start by creating the desired folder structure in Google Cloud Storage so that the files are stored in a partitioned layout. To start with, we will create a bucket.

— gcs_federated_queries

Within this bucket, we will create folders at the day level so that we can easily identify files across days. In the example below, we have created folders to store data for two days: 2021-01-01 and 2021-01-02. The "dt=" prefix is very important and should be a logical name as per business needs. Ensure that the prefix is not a reserved keyword such as date or timestamp, otherwise we will receive an error in later steps.

— gcs_federated_queries

— — dt=2021-01-01

— — dt=2021-01-02

Within each of these date folders, we will create a folder for each stock, following the same pattern as above.

— gcs_federated_queries

— — dt=2021-01-01

— — — stock=AdaniPorts

— — — stock=AsianPaints

— — dt=2021-01-02

— — — stock=AdaniPorts

— — — stock=AsianPaints
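The layout above can also be prepared from the command line with the gsutil CLI. A minimal sketch (bucket names are globally unique, so gcs_federated_queries may already be taken and would need to be adjusted; the region is an assumption):

```shell
# Create the bucket (pick a region that suits you).
gsutil mb -l us-central1 gs://gcs_federated_queries

# Note: GCS "folders" are really just object-name prefixes; a
# prefix like dt=2021-01-01/stock=AdaniPorts/ comes into existence
# automatically when the first object is uploaded under it, so no
# separate "mkdir" step is needed when working from the CLI.
```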

Our folder structure is now ready. Data analysts will upload each file to the respective date and stock folder as it is received. Now let's create a table structure that can query any data uploaded to this folder structure.

Let's upload a few sample files to the folders created above. Remember, we upload files at the leaf level, i.e. inside the "stock=" folders. I have uploaded a few CSV files, and our structure now looks like this. The sample files are downloaded from the public dataset at https://www.kaggle.com/rohanrao/nifty50-stock-market-data

— gcs_federated_queries

— — dt=2021-01-01

— — — stock=AdaniPorts

— — — — ADANIPORTS.csv

— — — stock=AsianPaints

— — — — ASIANPAINTS.csv

— — dt=2021-01-02

— — — stock=AdaniPorts

— — — — ADANIPORTS.csv

— — — stock=AsianPaints

— — — — ASIANPAINTS.csv
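If you prefer the CLI over the console, the uploads can be done with gsutil cp. A sketch, assuming the CSV files are in the current directory (file names and dates follow the example above):

```shell
# Copy each stock's CSV into its date/stock partition prefix; the
# "folder" is created implicitly by the object name.
gsutil cp ADANIPORTS.csv  gs://gcs_federated_queries/dt=2021-01-01/stock=AdaniPorts/
gsutil cp ASIANPAINTS.csv gs://gcs_federated_queries/dt=2021-01-01/stock=AsianPaints/
gsutil cp ADANIPORTS.csv  gs://gcs_federated_queries/dt=2021-01-02/stock=AdaniPorts/
gsutil cp ASIANPAINTS.csv gs://gcs_federated_queries/dt=2021-01-02/stock=AsianPaints/
```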

Navigate to the Google Cloud Console -> BigQuery. Create a dataset if it doesn’t exist. Within the dataset, create a table and provide parameters as shown below:

Create External Table in BigQuery
  1. Select “Google Cloud Storage” as the option under “Create table from”. In “Select file from GCS bucket”, provide the bucket name with a *.csv suffix; in our example it is “gcs_federated_queries/*.csv”. Under “File format”, select “CSV”.
  2. Select the checkbox for “Source data partitioning”. Provide “gs://gcs_federated_queries/” (our bucket) as the “Select Source URI Prefix” and select “Automatically infer types” as the “Partition Inference Mode”.
  3. Select the “Table type” as “External table”.
  4. Provide an appropriate table name. I have named it stock_prices.
  5. Select the checkbox for “Auto detect Schema and Input parameters”. This will auto-detect the column names and data types. If your CSV files don’t have a header row with column names, you can define the schema manually instead. Under Advanced options, you can define the delimiter and choose to skip header rows.
  6. Finally, click on the create button.
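The same external table can also be created with a DDL statement in the BigQuery SQL editor instead of the console form. A sketch, assuming a dataset named stocks (the dataset name is a placeholder; skip_leading_rows = 1 assumes the CSVs have a header row):

```sql
CREATE EXTERNAL TABLE stocks.stock_prices
WITH PARTITION COLUMNS   -- infer dt and stock from the folder prefixes
OPTIONS (
  format = 'CSV',
  uris = ['gs://gcs_federated_queries/*.csv'],
  hive_partition_uri_prefix = 'gs://gcs_federated_queries/',
  skip_leading_rows = 1
);
```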

We now have a table called stock_prices, as shown below.

BigQuery External Table
BigQuery External Table Details

We can now run a query to test the data. Please replace the project ID and dataset name appropriately.

SELECT * FROM `<<projectid>>.<<dataset>>.stock_prices` LIMIT 1000
Output

If you look at the output, the prefixes have become column names. We can use these prefix columns to filter the data just like any other column.
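For example, to read only one stock for one day, we can filter on the inferred partition columns; BigQuery then scans only the matching files. A sketch (with automatic inference, dt is typically read as DATE and stock as STRING, so adjust the literals if your inferred types differ):

```sql
SELECT *
FROM `<<projectid>>.<<dataset>>.stock_prices`
WHERE dt = DATE '2021-01-01'
  AND stock = 'AdaniPorts'
LIMIT 1000
```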

Prefix as column names.

That’s it for this topic. You can read more about this here.

All Views are my own.