What is Data Fusion in Google Cloud Platform (GCP)?
Last Updated : 09 Jul, 2024
Let's start with an introduction to Cloud Data Fusion. Cloud Data Fusion provides a graphical user interface and APIs that save development time and reduce complexity. Its user-friendly graphical interface lets you build data pipelines with no code.
- It supports parallel query execution, which significantly helps when processing large volumes of data.
- You can use existing templates and connectors for Google Cloud and other cloud service providers.
- A variety of built-in transformations help you get the data into the quality and format you need.
- Cloud Data Fusion is extensible: you can integrate it with Apache Airflow, a SQL engine, and more.
Benefits of Using Data Fusion
The following are the benefits of using Data Fusion:
- It reduces complexity by providing a simplified graphical user interface.
- It supports multiple triggers and extensions to integrate multiple sources.
- It supports multi-core processing, which speeds up query execution.
The following are the primary terms related to GCP Data Fusion:
- Transformations (Transform)
- Sink
- Source
- Error Handlers
- Wranglers
1. Transformations (Transform)
- When creating a Data Fusion pipeline, a transformation is the process of changing the source data by applying rules to produce the desired result.
Example: CSV Formatter, Compressor.
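To make the idea of a transformation concrete, here is a minimal sketch in plain Python, not Data Fusion's own plugin code. The function name `to_csv` and the sample records are made up for illustration; the sketch only shows the general shape of a "CSV Formatter" step, where records go in and reformatted output comes out.

```python
import csv
import io


def to_csv(records, fieldnames):
    """Format a list of dict records as CSV text (a toy 'CSV Formatter')."""
    buffer = io.StringIO()
    # lineterminator="\n" keeps the output identical across platforms
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buffer.getvalue()


# Hypothetical sample data, used only to exercise the function
records = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Linus"}]
csv_text = to_csv(records, ["id", "name"])
print(csv_text)
```

In Data Fusion itself you would pick the equivalent transform from the plugin palette and configure it in the UI instead of writing this code.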
2. Sink
- Sink is the term Data Fusion uses for target objects, the destinations a pipeline writes to. Target objects can be of different types.
Example: BigQuery, Cloud Storage (GCS)
3. Source
- Source is the term Data Fusion uses for source objects, the places a pipeline reads from. Source objects can be of different types.
Example: Excel, Bigtable
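The relationship between a source and a sink can be sketched in a few lines of plain Python. This is only an analogy under simple assumptions: `csv_source` and `json_lines_sink` are hypothetical names, the source reads from an in-memory CSV string rather than Excel or Bigtable, and the sink writes JSON lines to an in-memory buffer rather than BigQuery or GCS.

```python
import csv
import io
import json


def csv_source(text):
    """Source: yield one dict per CSV row read from the input text."""
    yield from csv.DictReader(io.StringIO(text))


def json_lines_sink(records, out):
    """Sink: write each record as one JSON line and return the row count."""
    count = 0
    for record in records:
        out.write(json.dumps(record) + "\n")
        count += 1
    return count


# Connect the source directly to the sink, as a pipeline edge would
out = io.StringIO()
written = json_lines_sink(csv_source("id,name\n1,Ada\n2,Linus\n"), out)
print(written, "records written")
print(out.getvalue())
```

The point of the split is that either end can be swapped without touching the other, which is what lets a pipeline mix different source and target types.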
4. Error Handlers
- Error handlers in Data Fusion deal with errors that occur in pipelines, which keeps data processing and query execution robust.
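One common way to handle errors in a pipeline is to route failing records to a separate error collection so that good records keep flowing. The sketch below shows that pattern in plain Python; it is an illustration of the idea, not Data Fusion's implementation, and `parse_amount` and `run_with_error_handler` are hypothetical names.

```python
def parse_amount(record):
    """Rule: the 'amount' field must exist and be convertible to float."""
    return {**record, "amount": float(record["amount"])}


def run_with_error_handler(records, transform):
    """Apply transform to each record; collect failures instead of stopping."""
    good, errors = [], []
    for record in records:
        try:
            good.append(transform(record))
        except (KeyError, ValueError) as exc:
            # Keep the original record next to the reason it failed
            errors.append({"record": record, "error": repr(exc)})
    return good, errors


# Hypothetical input: one valid row, one bad value, one missing field
rows = [{"amount": "10.5"}, {"amount": "n/a"}, {}]
good, errors = run_with_error_handler(rows, parse_amount)
print(len(good), "good record(s),", len(errors), "error record(s)")
```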
5. Wranglers
- Wrangling in Data Fusion provides tools for data preparation: cleaning, structuring, and enriching raw data so that it quickly reaches the desired format.
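The three wrangling activities named above (cleaning, structuring, enriching) can be illustrated with a small Python sketch. The field names, the `wrangle` function, and the sample rows are assumptions made up for this example; in Data Fusion the same steps would be applied interactively rather than coded by hand.

```python
def wrangle(raw_rows):
    """Clean, structure, and enrich raw contact rows."""
    cleaned = []
    for row in raw_rows:
        # Cleaning: trim whitespace and normalise case
        name = row.get("name", "").strip()
        email = row.get("email", "").strip().lower()
        if not name or not email:
            continue  # drop rows that are missing required fields
        cleaned.append({
            "name": name.title(),             # structuring: consistent casing
            "email": email,
            "domain": email.split("@")[-1],   # enriching: derived column
        })
    return cleaned


# Hypothetical raw input with stray whitespace and an incomplete row
raw = [
    {"name": "  ada lovelace ", "email": " ADA@Example.com"},
    {"name": "", "email": "x@y.z"},
]
cleaned = wrangle(raw)
print(cleaned)
```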
How to use Data Fusion in Google Cloud Console?
Step 1: In the Cloud console, from the Navigation menu select Data Fusion.
Step 2: Click the Create an Instance link at the top of the section to create a Cloud Data Fusion instance.
- The Create Data Fusion instance page loads, where you enter the details for the new instance.
Step 3: A visual representation of the pipeline appears in the Data Fusion user interface, a graphical environment for developing data integration pipelines.
Step 4: From the options in the top-right menu, click Deploy. This submits the pipeline to Cloud Data Fusion.
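Since Cloud Data Fusion also exposes APIs, the Deploy step can in principle be scripted. The sketch below only builds the HTTP request and does not send it. It assumes the CDAP-style REST path `/v3/namespaces/<namespace>/apps/<pipeline-name>` with a PUT of the exported pipeline JSON, which matches my understanding of the Cloud Data Fusion REST documentation; verify the path against the official reference before relying on it. The endpoint URL, pipeline name, token, and empty JSON body are all placeholders.

```python
import urllib.request


def build_deploy_request(api_endpoint, pipeline_name, pipeline_json, token,
                         namespace="default"):
    """Build (but do not send) a PUT request that would deploy a pipeline.

    Assumption: the instance's API endpoint accepts the CDAP-style path
    /v3/namespaces/<namespace>/apps/<pipeline_name>.
    """
    url = (f"{api_endpoint.rstrip('/')}/v3/namespaces/{namespace}"
           f"/apps/{pipeline_name}")
    return urllib.request.Request(
        url,
        data=pipeline_json,  # bytes of the exported pipeline definition
        method="PUT",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# All values below are placeholders, not real endpoints or credentials
request = build_deploy_request(
    "https://example.invalid/api", "my_pipeline", b"{}", "PLACEHOLDER_TOKEN"
)
print(request.get_method(), request.full_url)
```

To actually submit the request you would pass it to `urllib.request.urlopen` with a valid access token and the real API endpoint of your instance.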
What are the alternatives to Data Fusion in GCP?
The following services can be used as alternatives to Data Fusion:
- Dataproc
- Dataflow
1. Dataproc
Cloud Data Fusion lets you create ETL jobs through its graphical pipeline UI, whereas Dataproc runs Spark, Hadoop, or Hive jobs that you write yourself, depending on your requirements. If your focus is data transformation and wrangling with a low-code or no-code solution, Data Fusion is the better fit.
2. Dataflow
Dataflow is a Google Cloud service that provides unified stream and batch data processing at scale. If your systems depend on Hadoop, it is wise to choose Dataproc over Dataflow.