Processing Avro data using Google Cloud Dataproc

In this story, we will see how Google Cloud Platform's managed service, Cloud Dataproc, can be leveraged to read and parse an Avro data file. As a simple use case, we will read an Avro file stored on Google Cloud Storage (GCS), convert it to Parquet format, and write it back to GCS.
Log into the Google Cloud Console at https://console.cloud.google.com/
Start the Cloud Shell environment
We will now create a bucket to host our Avro file. Execute the below command to create a bucket on GCS, replacing the bucket name as appropriate for your environment.
gsutil mb gs://dataproc-spark-convert
We will now download a sample Avro file, available at https://github.com/miguno/avro-cli-examples/blob/master/twitter.avro, and upload it to the bucket that we created above.
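If you are working from Cloud Shell, you can fetch the file directly with curl. The raw-download URL below is an assumption inferred from the repository path above, so adjust it if it does not resolve for you.
# Download the sample Avro file into the current directory (raw URL assumed from the GitHub repo path)
curl -L -o twitter.avro https://raw.githubusercontent.com/miguno/avro-cli-examples/master/twitter.avro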
gsutil cp twitter.avro gs://dataproc-spark-convert
Now we will create a Spark cluster with the appropriate parameters. Once the command below completes, it will have created a managed Hadoop and Spark cluster.
gcloud beta dataproc clusters create <<ClusterName>> --image-version=1.4 --enable-component-gateway --bucket <<BucketName>> --region <<RegionName>> --project <<ProjectName>>
Now let’s create a script that will perform our core operation of reading the Avro file and storing it in Parquet format. Paste the below code into a test.py file. The code is self-explanatory. Please replace the bucket name appropriately.
from pyspark.sql import SparkSession

# Create a Spark session
spark = SparkSession.builder.appName('test').getOrCreate()

# Read the Avro file from GCS
df = spark.read.format("avro").load("gs://dataproc-spark-convert/twitter.avro")

# Write the data back to GCS in Parquet format
df.write.parquet("gs://dataproc-spark-convert/parquet/")
Now we will submit a Spark job to execute this script on the cluster we created. Execute the below command to submit the job.
gcloud dataproc jobs submit pyspark test.py \
  --cluster <<ClusterName>> \
  --region <<RegionName>> \
  --properties spark.jars.packages='org.apache.spark:spark-avro_2.11:2.4.0'
Once the above job completes, you will see the generated files in the destination bucket.
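You can confirm the output from Cloud Shell by listing the destination folder; the path below assumes the same bucket name and parquet/ output prefix used in the script above.
gsutil ls gs://dataproc-spark-convert/parquet/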

Once the required processing is over, we can delete the cluster.
gcloud dataproc clusters delete <<ClusterName>> --region <<RegionName>>
Thanks for reading the story.