Query TeraBytes of Data in Seconds using BigQuery

Sourabh Jain
6 min read · Apr 15, 2020


Google BigQuery

Today, we will explore Google Cloud Platform's managed data warehouse service, BigQuery, and see how quickly we can build a massive, enterprise-grade data warehouse.

Google describes BigQuery as "a serverless, highly scalable, and cost-effective cloud data warehouse designed to help you make informed decisions quickly, so you can transform your business with ease". Let's break this statement down and understand the power behind it.

  • Serverless: It's a completely managed service, so organisations do not need to plan server capacity upfront or make any infrastructure commitments. It also means an organisation can start small and grow as its data grows, without any impact on the business.
  • Highly Scalable: BigQuery can query petabytes of stored data. It can analyze gigabytes to petabytes of data using ANSI SQL at blazing-fast speeds, with zero operational overhead.
  • Cost-Effective: As a managed service, organisations pay only for the storage and compute capacity they actually use. For enterprises, there is also a flat-rate pricing offering that enables unlimited querying.

Objective: Query TBs of data and observe the response time.

Go to IAM & Admin -> Manage Resources.
  • Click on "Create Project" at the top. Create a billing account if you do not already have one, then provide the required details for project creation:
New Project Creation.
  • Once the project is created, go to Home -> Big Data -> BigQuery as shown below:
  • The BigQuery Web UI shown below will open. You can see the current project at the top; if it is not showing the newly created project, go ahead and switch to it.
BigQuery WebUI
  • Now we have reached the BigQuery Web UI. This interface provides the capability to perform all operations for the BigQuery managed service.
  • There are two important terms to remember while working with BigQuery: dataset and table. A dataset is a collection of tables, and a table is the physical entity that persists the data. (If you prefer to script this instead of clicking through the UI, a minimal client-library sketch follows.)
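A quick sketch of creating a dataset and a table with the official google-cloud-bigquery Python client. The project ID, dataset name, and schema here are hypothetical placeholders for illustration, not the article's public dataset.

```python
from google.cloud import bigquery

# Hypothetical project ID; replace with your own.
client = bigquery.Client(project="my-demo-project")

# A dataset is a container for tables; create it if it doesn't exist.
client.create_dataset("retail_demo", exists_ok=True)

# A table is the physical entity that persists the rows.
schema = [
    bigquery.SchemaField("customerKey", "STRING"),
    bigquery.SchemaField("orderDate", "TIMESTAMP"),
    bigquery.SchemaField("totalSale", "FLOAT64"),
]
table = bigquery.Table("my-demo-project.retail_demo.sales", schema=schema)
client.create_table(table, exists_ok=True)
```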
  • For this session, we will work with a dataset whose tables store petabytes of data, and see the power of BigQuery.
  • We will use a retail petabyte dataset for this article's analysis. Refer to the query below. Here, "google.com:bigquery-petabyte" is a project that has been made public; it contains the "retail_petabyte" dataset, which holds the "sales_partitioned_clustered" table.
SELECT *
FROM `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE customerKey = '1440806400000-262'
  • Let's paste the query and observe a few things. We have only pasted the query; nothing has been executed yet. Note the green tick icon at the far right of the screenshot: BigQuery validates the query and reports the estimated amount of data it will process. Here it is 1.1 PB of data. (The same estimate can be fetched programmatically with a dry run; see the sketch after the screenshot.)
Querying Data
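For reference, here is a minimal dry-run sketch with the Python client, assuming it is installed and authenticated against your own project. A dry run validates the query and returns the estimated bytes without executing anything or incurring cost.

```python
from google.cloud import bigquery

client = bigquery.Client()  # uses the default project from your environment

sql = """
SELECT *
FROM `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE customerKey = '1440806400000-262'
"""

# dry_run=True estimates bytes without running the query;
# use_query_cache=False keeps cached results out of the estimate.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)

tb = job.total_bytes_processed / 1024**4
print(f"Estimated scan: {tb:,.1f} TB")
# On-demand pricing was roughly $5 per TB at the time of writing, so a
# 1.1 PB scan would cost on the order of 1.1 * 1000 * 5 = ~$5,500 —
# a good reason to heed the estimate before clicking Run.
```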
  • The query we pasted above is clearly not a great one to run, because scanning a full petabyte of data is not advisable.
  • We will now use the same table and rewrite the query for a more meaningful analysis: sales by day. Refer to the query below.
SELECT
  ROUND(SUM(totalSale), 2) AS sale,
  DATE(orderDate) AS day
FROM
  `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE
  customerKey = '1297468800000-526'
GROUP BY
  day
ORDER BY
  day DESC
  • Let's see how much data this query will process to produce the required output. It's estimated at 31.9 TB.
Querying Day wise Sales
  • We can optimize further by processing only the required days of data rather than all of them. Refer to the query below: we have added a condition on orderDate, which ensures that only a limited slice of the data is processed (why this helps so much is sketched below).
SELECT
  CAST(SUM(totalSale) AS INT64) AS sale,
  DATE(orderDate) AS day
FROM
  `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`
WHERE
  customerKey = '1104796800000-155'
  AND orderDate BETWEEN TIMESTAMP('2005-01-04')
  AND TIMESTAMP('2005-02-04')
GROUP BY
  day
ORDER BY
  day DESC
  • Let's see how much data this query will process to produce the required output. It's estimated at 265.7 GB.
Querying Day wise Sales for a month
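Why does the orderDate condition shrink the scan so dramatically? The table's name suggests it is partitioned by date and clustered (likely on customerKey), so a date-range filter lets BigQuery skip whole partitions. As an illustration, here is how a table with that shape could be defined with the Python client; the project, dataset, and schema are assumptions for the sketch.

```python
from google.cloud import bigquery

client = bigquery.Client()

table = bigquery.Table("my-demo-project.retail_demo.sales_partitioned_clustered")
table.schema = [
    bigquery.SchemaField("customerKey", "STRING"),
    bigquery.SchemaField("orderDate", "TIMESTAMP"),
    bigquery.SchemaField("totalSale", "FLOAT64"),
]
# Daily partitions on orderDate: a date-range filter lets BigQuery
# skip every partition outside the range entirely.
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="orderDate",
)
# Clustering on customerKey co-locates rows with the same key, so
# equality filters on it can skip most storage blocks at run time.
table.clustering_fields = ["customerKey"]
client.create_table(table, exists_ok=True)
```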
  • As we can see, the amount of data processed is now considerably smaller for the required output. Let's run the query and see how long it takes. Click the Run button and wait for the query to complete.
Running the Query
  • Wow! As we can see above, the query took 5.4 seconds to complete and processed only 360.9 MB. It's important to remember that 265.7 GB was an estimate, while 360.9 MB is the actual data processed: the estimate cannot account for the block pruning that the table's clustering enables at execution time.
  • Now rerun the same query and observe the response time.
Query from Cached Results
  • Wow! It took 0.1 seconds to complete. Why did this happen? Because the result of the previous run was cached. If you rerun the same query, BigQuery recognizes that the result is already cached and serves the output from the cache. Caching is enabled by default in the Web UI console; you can disable it under More -> Query Settings by unchecking "Use Cached Results" (the same flag can also be set programmatically, as sketched below).
Query Settings
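A small sketch of controlling the cache with the Python client; the query string here is a placeholder for any of the queries above.

```python
from google.cloud import bigquery

client = bigquery.Client()

sql = "SELECT 1"  # placeholder for any of the queries above

# use_query_cache defaults to True; set it to False to force a fresh run.
no_cache = bigquery.QueryJobConfig(use_query_cache=False)
job = client.query(sql, job_config=no_cache)
job.result()  # wait for completion

# cache_hit reports whether the result was served from the cache.
print(f"cache hit: {job.cache_hit}")
```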
  • Let's run a somewhat more complex query and observe the response time. In this query, we calculate a day-wise running average of sales for each product.
SELECT
  AVG(s.totalSale) OVER (PARTITION BY p.name ORDER BY orderDay ROWS BETWEEN 7 PRECEDING AND CURRENT ROW) AS runningAverage,
  p.name AS productName,
  s.orderDay AS orderDay
FROM (
  SELECT
    SUM(lines.totalSale) AS totalSale,
    DATE(orderDate) AS orderDay,
    lines.productKey AS productKey
  FROM
    `google.com:bigquery-petabyte.retail_petabyte.sales_partitioned_clustered`,
    UNNEST(salesOrderLines) AS lines
  WHERE
    orderDate BETWEEN TIMESTAMP('2011-01-01')
    AND TIMESTAMP('2011-01-07')
  GROUP BY
    orderDay,
    productKey ) s
JOIN
  `google.com:bigquery-petabyte.retail_petabyte.products` p
ON
  s.productKey = p.productKey
WHERE
  REGEXP_CONTAINS(s.productKey, r'^110453[0-9]{7}\-[0-9]{3}$')
ORDER BY
  orderDay
  • Let's run the query and observe the output. It took 12 seconds to complete and processed 426.2 GB of data.
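If you drive these queries through the client rather than the console, the same statistics the UI displays are available on the completed job object. A minimal sketch, again with a placeholder query string:

```python
from google.cloud import bigquery

client = bigquery.Client()

job = client.query("SELECT 1")  # placeholder for the running-average query
job.result()  # block until the query finishes

# started/ended are datetimes set by the service for the job.
elapsed = (job.ended - job.started).total_seconds()
gb = job.total_bytes_processed / 1024**3
print(f"Completed in {elapsed:.1f} s, processed {gb:.1f} GB")
```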
