Deployment Topologies for Data Fusion with Shared VPCs

Sourabh Jain
5 min read · Feb 15, 2022

What is Cloud Data Fusion?

Cloud Data Fusion is a fully managed, cloud-native, enterprise data integration service for quickly building and managing data pipelines.

The Cloud Data Fusion web UI allows you to build scalable data integration solutions to clean, prepare, blend, transfer, and transform data, without having to manage the infrastructure.

Cloud Data Fusion is powered by the open source project CDAP.

One of the key functions of any ETL tool is to connect to disparate source systems and extract data, transform the extracted data according to business rules, and finally load it into a target system (database, data warehouse, file system, etc.). However, we often see challenges in the way networks are set up in the cloud, and a common requirement is to connect to on-premise systems to perform data extraction. In this article we will focus on the various deployment topologies for Data Fusion that an organisation should consider, depending on its network topology.

Let’s deep dive and look into the setup of Data Fusion.

There are two modes of setting up a Data Fusion instance: Private and Public.

Public Instance:

The easiest way to provision a CDF instance is to create a public instance. It serves well as a starting point and provides access to external endpoints on the public internet.

This kind of CDF instance expects you to use the default VPC network in your project. Both the design-time and runtime environments will have external (public) IP addresses, firewalled so that only internal users can access them.

The default VPC network comes with auto-generated subnets for each region, routes, and firewall rules that allow communication among your compute resources.
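To illustrate the kind of restriction described above, here is a minimal, hypothetical Terraform sketch of a firewall rule that limits ingress to an assumed internal corporate range; the rule name, port, and CIDR are placeholders, not part of the original setup:

```
# Hypothetical rule: allow HTTPS ingress only from an assumed internal corporate range.
resource "google_compute_firewall" "allow_internal_users" {
  name    = "allow-internal-users"      # assumed name
  network = "default"                   # the default VPC used by a public CDF instance

  allow {
    protocol = "tcp"
    ports    = ["443"]
  }

  source_ranges = ["203.0.113.0/24"]    # replace with your organisation's ranges
}
```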

Private Instance:

Many organisations are required to keep all of their production systems off public IP addresses. CDF Private instances serve this purpose in all kinds of VPC network settings.

In a CDF private instance, both the design-time and runtime environments use private IP addresses; no external internet IP addresses are attached to any CDF-related GCE VMs. As a result, a CDF private IP instance (design time) cannot access data sources on the public internet today.

Working with enterprise organisations, we commonly see most customers use private instances, primarily driven by security requirements to avoid public IPs.

As Data Fusion is a managed service, the instance is deployed in a tenant project hosted and managed by Google. Since the instance is hosted in a Google-managed project, it needs to be peered with the organisation's VPC to get access to the network and the data sources. You can follow the steps here to deploy a private instance of Data Fusion.
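If you prefer infrastructure as code, the same private instance can be expressed with Terraform's google_data_fusion_instance resource. The sketch below is a minimal example only; the instance name, region, edition, host project, network name, and /22 IP allocation are assumptions you would replace with your own values:

```
# Minimal sketch of a private Data Fusion instance peered with a Shared VPC (all names are assumptions).
resource "google_data_fusion_instance" "private_cdf" {
  name             = "cdf-private"       # assumed instance name
  region           = "us-central1"       # assumed region
  type             = "ENTERPRISE"        # or BASIC / DEVELOPER
  private_instance = true                # no public IPs on the CDF environment

  network_config {
    # Shared VPC referenced via its host project (assumed project and network IDs).
    network       = "projects/host-project/global/networks/shared-vpc"
    ip_allocation = "10.89.48.0/22"      # /22 range reserved for the tenant project peering
  }
}
```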

Now let’s deep dive into various deployment topologies of Data Fusion:

  1. Shared VPC connected to on-premise/VPC Network over VPN
  2. Shared VPC connected to on-premise/VPC Network over VPC Peering
  3. Shared VPC peered with a Trust/Hub/Core VPC, and the Trust/Hub/Core VPC connected to on-premise over VPN

Now let’s understand each of these setups.

  • Shared VPC connected to on-premise/VPC Network over VPN

Your setup should look like the diagram below. This is the simplest setup, and most of the data flow happens without any additional configuration.

When you are designing and developing pipelines (design time), you should be able to connect to both on-premise and cloud systems.

When you are running or scheduling pipelines (runtime), you should likewise be able to connect to both on-premise and cloud systems.
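If you are building the VPN leg of this topology as code, an HA VPN from the Shared VPC to on-premise could be sketched in Terraform roughly as follows. This is only a sketch under assumptions: the gateway, router and tunnel names, region, ASN, peer IP, and shared secret are placeholders, and the Cloud Router interfaces and BGP peers a real setup needs are omitted for brevity:

```
# Sketch of an HA VPN between the Shared VPC and on-premise (all names/IPs/ASNs are assumptions).
resource "google_compute_ha_vpn_gateway" "shared_vpc_gw" {
  name    = "shared-vpc-ha-vpn"
  region  = "us-central1"
  network = "shared-vpc"
}

resource "google_compute_external_vpn_gateway" "on_prem_gw" {
  name            = "on-prem-gateway"
  redundancy_type = "SINGLE_IP_INTERNALLY_REDUNDANT"

  interface {
    id         = 0
    ip_address = "203.0.113.10"          # on-premise VPN device (placeholder)
  }
}

resource "google_compute_router" "shared_vpc_router" {
  name    = "shared-vpc-router"
  region  = "us-central1"
  network = "shared-vpc"

  bgp {
    asn = 64514                          # assumed private ASN
  }
}

resource "google_compute_vpn_tunnel" "tunnel0" {
  name                            = "on-prem-tunnel0"
  region                          = "us-central1"
  vpn_gateway                     = google_compute_ha_vpn_gateway.shared_vpc_gw.id
  vpn_gateway_interface           = 0
  peer_external_gateway           = google_compute_external_vpn_gateway.on_prem_gw.id
  peer_external_gateway_interface = 0
  router                          = google_compute_router.shared_vpc_router.id
  shared_secret                   = "replace-with-a-secret"
}
```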

  • Shared VPC connected to on-premise/VPC Network over VPC Peering

Your setup should look like the diagram below.

When you are designing and developing pipelines, you will find that you are not able to connect to on-premise systems. The reason is that “Only directly peered networks can communicate. Transitive peering is not supported. In other words, if VPC network N1 is peered with N2 and N3, but N2 and N3 are not directly connected, VPC network N2 cannot communicate with VPC network N3 over VPC Network Peering.” You can read more about it here. In our case, the first peering is between the Google-managed tenant project and the Shared VPC, and the second peering is between the Shared VPC and the Source VPC.

In order to make the Source VPC reachable at design time, we need to create a proxy in the Shared VPC. It could be any proxy, e.g. HAProxy; a sketch of such a proxy VM follows.
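Below is a minimal sketch of such a proxy, assuming HAProxy on a small VM in the Shared VPC that forwards a single TCP port to a source system it can reach. The VM name, zone, subnet, port, and target IP are all placeholders:

```
# Hypothetical HAProxy VM in the Shared VPC. Data Fusion design time connects to this VM's
# private IP, and HAProxy forwards the traffic to the source system it cannot reach directly.
resource "google_compute_instance" "cdf_proxy" {
  name         = "cdf-haproxy"           # assumed name
  machine_type = "e2-small"
  zone         = "us-central1-a"         # assumed zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-11"
    }
  }

  network_interface {
    subnetwork = "shared-vpc-subnet"     # subnet in the Shared VPC; no access_config => no public IP
  }

  metadata_startup_script = <<-EOT
    apt-get update && apt-get install -y haproxy
    # Forward TCP 5432 on this VM to the assumed source database at 10.10.0.5.
    printf 'listen source-db\n  bind *:5432\n  mode tcp\n  server db1 10.10.0.5:5432 check\n' >> /etc/haproxy/haproxy.cfg
    systemctl restart haproxy
  EOT
}
```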

When you are running or scheduling pipelines, you should be able to connect to the Source VPC, because the jobs actually run on a Dataproc cluster that gets its network from the Shared VPC, and there is only one peering between the Shared VPC and the Source VPC.
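For completeness, that single Shared VPC to Source VPC peering could be declared in Terraform roughly as follows (the project and network names are assumptions, and a peering has to be created from both sides to become active):

```
# Sketch of the Shared VPC <-> Source VPC peering (names are assumptions).
resource "google_compute_network_peering" "shared_to_source" {
  name         = "shared-to-source"
  network      = "projects/host-project/global/networks/shared-vpc"
  peer_network = "projects/source-project/global/networks/source-vpc"
}

resource "google_compute_network_peering" "source_to_shared" {
  name         = "source-to-shared"
  network      = "projects/source-project/global/networks/source-vpc"
  peer_network = "projects/host-project/global/networks/shared-vpc"
}
```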

  • Shared VPC peered with a Trust/Hub/Core VPC, and the Trust/Hub/Core VPC connected to on-premise over VPN

Your setup should look like the diagram below:

Similar to the previous setup, you should set up a proxy to enable communication between on-premise systems and the Data Fusion design-time environment.

When you are running or scheduling pipelines, you should be able to connect to the source environment, because the jobs actually run on a Dataproc cluster that gets its network from the Shared VPC, and there is only one peering between the Shared VPC and the source environment (the Trust/Hub/Core VPC); see the peering sketch below.
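Assuming the Trust/Hub/Core VPC terminates the VPN, the peering between the Shared VPC and the hub typically also needs to exchange custom routes so that the Dataproc VMs in the Shared VPC can reach the on-premise ranges learned by the hub's Cloud Router. A hedged Terraform sketch (project and network names are assumptions):

```
# Sketch: Shared VPC <-> Hub VPC peering with custom route exchange (names are assumptions).
resource "google_compute_network_peering" "shared_to_hub" {
  name                 = "shared-to-hub"
  network              = "projects/host-project/global/networks/shared-vpc"
  peer_network         = "projects/hub-project/global/networks/hub-vpc"
  import_custom_routes = true            # learn the on-premise routes advertised by the hub
}

resource "google_compute_network_peering" "hub_to_shared" {
  name                 = "hub-to-shared"
  network              = "projects/hub-project/global/networks/hub-vpc"
  peer_network         = "projects/host-project/global/networks/shared-vpc"
  export_custom_routes = true            # share the VPN-learned routes with the Shared VPC
}
```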

Hope this gives you a glimpse of how the various deployment topologies can be achieved. Please feel free to comment if you have any queries. In the next story, I will share a Terraform script to automate the complete setup. Ciao!
