Processing semi-structured data with Google BigQuery

Sourabh Jain
5 min read · Jan 12, 2022


While structured data forms a major portion of an organisation's data storage, processing, and analytical requirements, semi-structured data such as JSON is increasingly being used for a variety of use cases. Often you will see an RDBMS (MySQL, PostgreSQL, SQL Server, etc.) being used for storing and processing structured data, while NoSQL databases (MongoDB, Cassandra, etc.) are recommended for semi-structured data. This often leads to multiple challenges for organisations:

  1. Managing multiple tools and platforms.
  2. Building teams with different skills.
  3. Data silos across technologies, plus additional workloads to combine these datasets.

BigQuery is Google Cloud's data warehouse offering that supports storage and processing of both structured and semi-structured data at scale. It even supports real-time ingestion at rates of millions of rows per second. In this story, we will look into the following areas:

  1. Load semi-structured data into BigQuery with constantly changing schemas.
  2. Update nested JSONs in BigQuery.
  3. Generate dynamic SQLs to do the updates in BigQuery.

The code repo can be found here: load_bigquery.

Let’s start!!

NOTE: BigQuery expects newline-delimited JSON files.
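If your source file arrives as a single JSON array instead of one object per line, a small Python sketch like this converts it (the file names here are hypothetical):

import json

# Convert a regular JSON array ("input.json") into newline-delimited
# JSON ("output.json") that bq load accepts.
with open("input.json") as src:
    records = json.load(src)  # expects a top-level JSON array

with open("output.json", "w") as dst:
    for record in records:
        dst.write(json.dumps(record) + "\n")  # one object per line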

We will first load a plain JSON file which doesn't have any nested structures.

{"id": 1,"name": "Sourabh","company": "company-A"}
{"id": 2,"name": "Vikram","company": "company-B"}
{"id": 3,"name": "Sahil","company": "company-C"}
{"id": 4,"name": "Gaurav","company": "company-D"}
{"id": 5,"name": "Hardik","company": "company-E"}

We have uploaded this file, "json_data_raw_1.json", into a GCS bucket; next we will load it into BigQuery. The URI is gs://bigquery_json_processing/json_data_raw_1.json

Now we need to load the data into BigQuery. Execute the command via Cloud Shell or any authenticated terminal (the leading ! is the notebook syntax for running a shell command).

!bq load --source_format NEWLINE_DELIMITED_JSON --autodetect \
--ignore_unknown_values sourabhjainceanalytics:demo.customer gs://bigquery_json_processing/json_data_raw_1.json
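If you prefer to stay in Python rather than shelling out, the same load can be expressed with the google-cloud-bigquery client library. This is an equivalent sketch of the command above, not what the story itself runs:

from google.cloud import bigquery

client = bigquery.Client(project="sourabhjainceanalytics")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,                # mirrors --autodetect
    ignore_unknown_values=True,     # mirrors --ignore_unknown_values
)

job = client.load_table_from_uri(
    "gs://bigquery_json_processing/json_data_raw_1.json",
    "sourabhjainceanalytics.demo.customer",
    job_config=job_config,
)
job.result()  # block until the load job completes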

A simple load. Let’s validate the data.

%%bigquery
SELECT * FROM `sourabhjainceanalytics.demo.customer`
(Screenshot: Loaded Data)

Now, let's say the source system has made changes to the source file structure and added contact details. The new file would look like the one below. You will observe that contact_details is a nested array.

{"id": 6,"name": "Rajesh","company": "company-A","contact_details":[{"phone":"111111111","type":"home"},{"phone":"2222222222","type":"mobile"}]}
{"id": 7,"name": "Mandar","company": "company-B","contact_details":[{"phone":"3333333333","type":"home"},{"phone":"4444444444","type":"mobile"}]}
{"id": 8,"name": "Nishit","company": "company-C","contact_details":[{"phone":"5555555555","type":"home"},{"phone":"6666666666","type":"mobile"}]}
{"id": 9,"name": "Prasanna","company": "company-D","contact_details":[{"phone":"7777777777","type":"home"},{"phone":"8888888888","type":"mobile"}]}
{"id": 10,"name": "Rakesh","company": "company-E","contact_details":[{"phone":"9999999999","type":"home"},{"phone":"0000000000","type":"mobile"}]}

The requirement is to keep adding this data to the BigQuery table. For the sake of this story, we will not do any pre-processing checks (datatype validations, JSON structure validations, etc.), but this needs to be done in production pipelines.

We have uploaded this file, "json_data_raw_2.json", into the GCS bucket; next we will load it into BigQuery. The URI is gs://bigquery_json_processing/json_data_raw_2.json

Again, we load the data into BigQuery via Cloud Shell or any authenticated terminal.

!bq load --source_format NEWLINE_DELIMITED_JSON --autodetect \
--ignore_unknown_values --schema_update_option=ALLOW_FIELD_ADDITION \
sourabhjainceanalytics:demo.customer gs://bigquery_json_processing/json_data_raw_2.json

NOTE: Observe the difference between the earlier load and this one. I have added the flag "--schema_update_option" with the value "ALLOW_FIELD_ADDITION", which allows loading data that contains new incoming fields. The flag "--ignore_unknown_values" tells BigQuery to ignore values it doesn't recognise.
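For completeness, here is the same schema-relaxing load sketched with the Python client library (again, an equivalent of the command above rather than the story's own code):

from google.cloud import bigquery

client = bigquery.Client(project="sourabhjainceanalytics")

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,
    ignore_unknown_values=True,
    # mirrors --schema_update_option=ALLOW_FIELD_ADDITION
    schema_update_options=[bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION],
)

job = client.load_table_from_uri(
    "gs://bigquery_json_processing/json_data_raw_2.json",
    "sourabhjainceanalytics.demo.customer",
    job_config=job_config,
)
job.result()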

Now let's validate the data in BigQuery.

%%bigquery
SELECT * FROM `sourabhjainceanalytics.demo.customer`
(Screenshot: Loaded semi-structured data)

You will observe above that the new records are added with the additional fields, while the old records have the newly added column empty.

Now, there may be an ask to update all the old records with contact details. In such a case, we may be provided with JSONs that contain only the required data. The data received would be as below:

{"id": 1,"contact_details":[{"phone":"111111111","type":"home"},{"phone":"2222222222","type":"mobile"}]}
{"id": 2,"contact_details":[{"phone":"3333333333","type":"home"},{"phone":"4444444444","type":"mobile"}]}
{"id": 3,"contact_details":[{"phone":"5555555555","type":"home"},{"phone":"6666666666","type":"mobile"}]}
{"id": 4,"contact_details":[{"phone":"7777777777","type":"home"},{"phone":"8888888888","type":"mobile"}]}
{"id": 5,"contact_details":[{"phone":"9999999999","type":"home"},{"phone":"0000000000","type":"mobile"}]}

We have uploaded this file, "json_data_raw_3.json", into the GCS bucket; next we will load it into BigQuery. The URI is gs://bigquery_json_processing/json_data_raw_3.json

Our approach would be to load this into an interim table and then update it into the main table. Let’s load this into an interim table first.

!bq load --source_format NEWLINE_DELIMITED_JSON --autodetect \
sourabhjainceanalytics:demo.customer_change gs://bigquery_json_processing/json_data_raw_3.json

Now let’s validate the loaded interim data:
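A quick check, using the same pattern as before:

%%bigquery
SELECT * FROM `sourabhjainceanalytics.demo.customer_change`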

(Screenshot: Incremental Data Received)

Now we will run the merge query below to add these records to the existing record ids. The query picks the existing values in the arrays and appends the new values to them. In our case, since the cells were empty, the new values simply fill them in. However, if there were existing values, the new values would have been appended via the ARRAY_CONCAT operation.

%%bigquery
MERGE demo.customer T
USING (
  SELECT
    id,
    ARRAY_AGG(STRUCT(cd.type, cd.phone)) AS contact_details
  FROM demo.customer_change,
    UNNEST(contact_details) cd
  GROUP BY 1) S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET T.contact_details = ARRAY_CONCAT(T.contact_details, S.contact_details);
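One caveat worth noting: ARRAY_CONCAT appends blindly, so re-running this merge (or receiving a phone number that already exists) would create duplicate entries in the array. An untested variant along these lines would deduplicate the combined array while merging, by rebuilding contact_details from the distinct (type, phone) pairs:

%%bigquery
MERGE demo.customer T
USING (
  SELECT
    id,
    ARRAY_AGG(STRUCT(cd.type, cd.phone)) AS contact_details
  FROM demo.customer_change,
    UNNEST(contact_details) cd
  GROUP BY 1) S
ON T.id = S.id
WHEN MATCHED THEN
  UPDATE SET T.contact_details = ARRAY(
    SELECT AS STRUCT x.type, x.phone
    FROM UNNEST(ARRAY_CONCAT(T.contact_details, S.contact_details)) x
    GROUP BY x.type, x.phone);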

Now let's validate the data in our BigQuery table.

(Screenshot: Table with appended data)

I hope this was helpful in understanding the power of BigQuery to support both structured and semi-structured data. You can learn more about JSON processing with BigQuery via the link below.

In the next story, we will see how to dynamically update a BigQuery table based on input JSON with varying columns. Stay tuned, and drop a comment about any feedback or issues you are facing with JSON data processing in BigQuery!

You can read the next story here.
