Processing AVRO data using Google Cloud DataProc

Sourabh Jain
2 min read · Jun 8, 2020
Cloud DataProc

In this story, we will see how Google Cloud Platform’s managed service Cloud DataProc can be leveraged to read and parse AVRO data files. As a simple use case, we will read an AVRO file available on Google Cloud Storage (GCS), convert it to Parquet format, and store it back on GCS.

Log in to the Google Cloud Console at https://console.cloud.google.com/

Start the Cloud Shell environment
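Before running any commands, it is worth confirming that Cloud Shell is pointing at the project you intend to use. The <<ProjectName>> placeholder below follows the same convention used in the rest of this story.

gcloud config set project <<ProjectName>>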

We will now create a bucket to host our AVRO file. Execute the command below to create a bucket on GCS, replacing the bucket name appropriately for your environment.

gsutil mb gs://dataproc-spark-convert
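If you want to confirm the bucket was created, an optional check is to list the buckets in the project.

gsutil ls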

We will now download a sample AVRO file available at https://github.com/miguno/avro-cli-examples/blob/master/twitter.avro and upload it to the bucket that we created above.
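One way to fetch the file into your Cloud Shell session is to pull it directly from GitHub with curl; the raw-download URL below is assumed to mirror the repository link above.

curl -L -o twitter.avro https://github.com/miguno/avro-cli-examples/raw/master/twitter.avro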

gsutil cp twitter.avro gs://dataproc-spark-convert

Now we will create a Spark cluster with the appropriate parameters. Once the statement below has completed execution, you will have a managed Hadoop and Spark cluster.

gcloud beta dataproc clusters create <<ClusterName>> --image-version=1.4 --enable-component-gateway --bucket <<BucketName>> --region <<RegionName>> --project <<ProjectName>>
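Cluster creation can take a few minutes. To confirm the cluster is up, you can list the clusters in the same region.

gcloud dataproc clusters list --region <<RegionName>>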

Now let’s create a script that will perform our core operation of reading the AVRO file and writing it out in Parquet format. Paste the code below into a test.py file. The code is self-explanatory; please replace the bucket name appropriately.

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session
spark = SparkSession.builder.appName('test').getOrCreate()

# Read the AVRO file from GCS into a DataFrame
df = spark.read.format("avro").load("gs://dataproc-spark-convert/twitter.avro")

# Write the DataFrame back to GCS in Parquet format
df.write.parquet("gs://dataproc-spark-convert/parquet/")

Now we will submit a Spark job to execute this script on the cluster we just created. Execute the command below to submit the job.

gcloud dataproc jobs submit pyspark test.py \
  --cluster <<ClusterName>> \
  --region <<RegionName>> \
  --properties spark.jars.packages='org.apache.spark:spark-avro_2.11:2.4.0'
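The driver output streams to your terminal while the job runs. The spark-avro package version is chosen to line up with the Spark 2.4 runtime shipped with the 1.4 image used when creating the cluster. If you want to check on the job afterwards, you can list the jobs submitted in the region.

gcloud dataproc jobs list --region <<RegionName>>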

Once the above job completes, you will see the generated files in the destination bucket.
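You can list the generated files directly from Cloud Shell; the path below matches the output location used in test.py.

gsutil ls gs://dataproc-spark-convert/parquet/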

Generated Parquet Files.

Once the required processing is over, we can delete the cluster.

gcloud dataproc clusters delete <<ClusterName>> --region <<RegionName>>
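If the bucket and its contents are no longer needed either, they can be removed as well. Note that this permanently deletes the data, so run it only once you are done.

gsutil rm -r gs://dataproc-spark-convert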

Thanks for reading the story.
