Query Terabytes of Data in Seconds Using BigQuery

Today, we will explore Google Cloud Platform’s managed data warehouse service, BigQuery, and see how we can quickly build a massive, enterprise-grade data warehouse.
Google describes BigQuery as “A serverless, highly scalable, and cost-effective cloud data warehouse designed to help you make informed decisions quickly, so you can transform your business with ease”. Let’s break this statement down and understand the power behind it.
- Serverless: It’s a completely managed service, so organisations don’t need to plan server capacity upfront or make any commitments on infrastructure. It also means an organisation can start small and grow as its data grows, without any impact on the business.
- Highly Scalable: BigQuery can query over petabytes of stored data. It can analyze gigabytes to petabytes of data using ANSI SQL at blazing-fast speeds, with zero operational overhead.
- Cost-Effective: As a managed service, organisations pay only for the storage and compute capacity they actually use. For enterprises, there is also a flat-rate pricing offering that enables unlimited querying.
Objective: Query TBs of data and observe the response times.
- Please create a Google Cloud account at https://console.cloud.google.com/getting-started . You will get $300 of free credit.
- Once you have created the account, log in to the Google Cloud Console at http://console.cloud.google.com/. You will land on a dashboard showing the overall status of your account.
- Click the hamburger icon at the top left and visit IAM & Admin -> Manage Resources as shown below:

- Click “Create Project” at the top. Create a billing account if you don’t already have one, then provide the required details for project creation:

- Once the project is created, visit Home -> Big Data -> BigQuery as shown below:

- The BigQuery Web UI shown below will open. You can see the current project at the top; if it’s not showing the newly created project, go ahead and switch to it.

- We have now reached the BigQuery Web UI. This interface lets us perform all operations for the BigQuery managed service.
- There are two important terms to remember while working with BigQuery: dataset and table. A dataset is a collection of tables, and a table is the physical entity that persists the data. The sketch below makes the two concrete.
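To illustrate, here is a minimal DDL sketch; the dataset, table, and column names below are hypothetical and only for illustration, not part of the public dataset we use later.
-- Create a dataset (BigQuery DDL calls it a schema); the name is illustrative.
CREATE SCHEMA IF NOT EXISTS demo_dataset;
-- Create a table inside the dataset; the columns are illustrative too.
CREATE TABLE IF NOT EXISTS demo_dataset.demo_sales (
  customerKey STRING,
  orderDate TIMESTAMP,
  totalSale FLOAT64
);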
- For this session, we will refer to a dataset whose tables store petabytes of data, and we will see the power of BigQuery.
- We will be using a petabyte-scale retail dataset for this article’s analysis. Please refer to the query below. In it, “google.com:bigquery-petabyte” is a publicly shared project that contains the “retail_petabyte” dataset, which in turn contains the “sales_partitioned_clustered” table.
SELECT *
FROM `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE customerKey = "1440806400000-262"
- Let’s paste the query and observe a few things. We have only pasted the query; nothing has been executed yet. Note the green tick icon at the far right of the image: BigQuery validates the query and tells us the estimated size of the data it will process. Here it’s 1.1 petabytes.

- The query we pasted above is definitely not a great one to run, because fetching a petabyte of data is not advisable. One quick improvement is shown below.
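Because BigQuery stores data column by column, a first easy improvement is to select only the columns you need instead of *; the scan estimate then drops in proportion to the columns read. A minimal variant of the query above (the two column names are taken from the later queries in this article):
SELECT
  orderDate,
  totalSale
FROM
  `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE
  customerKey = "1440806400000-262"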
- We will now use the same table and rewrite the query for a more meaningful analysis: sales data aggregated by day. Please refer to the query below.
SELECT
ROUND(SUM(totalSale),2) AS sale,
DATE(orderDate) day
FROM
`google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE
customerKey = '1297468800000-526'
GROUP BY
day
ORDER BY
day DESC
- Let’s see how much data this query will process in order to get the required output. It’s estimated to be 31.9 terabytes of data.

- We can further optimize by processing only the required days of data, as opposed to all of them. Please refer to the query below: we have added a time-range condition on orderDate, which ensures only a limited slice of data is processed (the sketch after the query explains why this works).
SELECT
CAST(SUM(totalSale) AS INT64) AS sale,
DATE(orderDate) day
FROM
`google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE
customerKey = "1104796800000-155"
AND orderDate BETWEEN TIMESTAMP("2005-01-04")
AND TIMESTAMP("2005-02-04")
GROUP BY
day
ORDER BY
day DESC
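The drop from terabytes to gigabytes comes from partition pruning: as its name suggests, the table is partitioned and clustered, most likely on orderDate and customerKey respectively, so a date-range filter lets BigQuery skip entire partitions. The exact definition of the public table is not visible to us, but a table like it could be declared roughly as follows (names and types assumed):
-- Hypothetical DDL; the real table's definition may differ.
CREATE TABLE demo_dataset.sales_partitioned_clustered (
  customerKey STRING,
  orderDate TIMESTAMP,
  totalSale FLOAT64
)
PARTITION BY DATE(orderDate) -- date filters prune whole partitions
CLUSTER BY customerKey;      -- customerKey filters skip storage blocks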
- Let’s see how much data this query will process in order to get the required output. It’s estimated to be 265.7 GB.

- As we can see, the data now being processed is considerably less for the same required output. Let’s run the query and see how long it takes. Click the Run button and wait for the query to complete.

- Wow! As we can see above, the query took 5.4 seconds to complete and processed only 360.9 MB. It’s very important to remember that 265.7 GB was an estimate, while 360.9 MB is the actual data processed: the estimate cannot account for the block pruning that clustering performs at run time.
- Please rerun the query and observe the response time.

- Wow! It took only 0.1 seconds to complete. Why did this happen? The results from the previous run were cached. When you rerun the query, BigQuery sees that the results are already cached and serves the output from the cache. By default, caching is enabled in the Web UI console; you can disable it by visiting More -> Query Settings and unselecting “Use Cached Results”. You can also confirm cache hits programmatically, as sketched below.
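A minimal sketch for inspecting your recent jobs, assuming they run in the US multi-region (adjust the region qualifier otherwise):
SELECT
  job_id,
  cache_hit,              -- TRUE when the results came from cache
  total_bytes_processed   -- 0 for a cache hit
FROM
  `region-us`.INFORMATION_SCHEMA.JOBS_BY_USER
ORDER BY
  creation_time DESC
LIMIT 5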


- Let’s run a slightly more complex query and see the response time. In this query, we calculate a day-wise running average of sales for each product.
SELECT
AVG(s.totalSale) OVER (PARTITION BY p.name ORDER BY orderDay ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS runningAverage,
p.name AS productName,
s.orderDay AS orderDay
FROM (
SELECT
SUM(lines.totalSale) AS totalSale,
DATE(orderDate) AS orderDay,
lines.productKey AS productKey
FROM
`google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`,
UNNEST(salesOrderLines) AS lines
WHERE
orderDate BETWEEN TIMESTAMP("2011-01-01")
AND TIMESTAMP("2011-01-07")
GROUP BY
orderDay,
productKey ) s
JOIN
`google.com:bigquery-petabyte.retail_petabyte.products` p
ON
s.productKey = p.productKey
WHERE
REGEXP_CONTAINS(s.productKey, r'^110453[0-9]{7}\-[0-9]{3}$')
ORDER BY
orderDay
- Let’s run the query and observe the output. It took 12 seconds to complete and processed 426.2 GB of data.

- I hope this has given you a good understanding of the power of BigQuery for building a massive data warehouse.
- BigQuery offers many other data-processing features. Please visit the documentation for more details: https://cloud.google.com/bigquery/docs
- It also provides machine learning via plain SQL with BigQuery ML; a tiny illustrative sketch follows below. Please refer to this article (https://medium.com/analytics-vidhya/introduction-to-machine-learning-in-google-bigquery-452194ec2bed) and the documentation (https://cloud.google.com/bigquery-ml/docs/bigqueryml-intro) for more details.
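As a taste of BigQuery ML, a model is created with plain SQL. A minimal, purely illustrative sketch; the dataset, table, and columns below are hypothetical:
-- Hypothetical example; all names are made up for illustration.
CREATE OR REPLACE MODEL demo_dataset.sales_model
OPTIONS (
  model_type = 'linear_reg',        -- simple linear regression
  input_label_cols = ['totalSale']  -- the column to predict
) AS
SELECT
  totalSale,
  EXTRACT(DAYOFWEEK FROM orderDate) AS dayOfWeek
FROM
  demo_dataset.demo_sales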
- Please comment if you have any questions and I will be happy to help.