Types of Data Integration: ETL vs ELT and Batch vs Real-Time

Introduction

The world is drowning in data. More than 80% of respondents in an IBM study said that the sources, the volume, and the velocity of data they work with had increased. So it comes as no surprise that companies are eager to take advantage of these trends; the World Economic Forum’s report lists data analysts and scientists as the most in-demand job role across industries in 2020. And although companies are ramping up efforts in this field, there are major obstacles on the road ahead.

Not only are most analysts forced to work with unreliable and outdated data, but many also lack tools to quickly integrate data from different sources into a unified view. Traditional batch data integration is hardly up to this challenge.

That’s why a growing number of companies are looking for more effective and faster types of data integration. One solution is real-time data integration, a technology superior to batch methods because it enables rapid decision-making, breaks down data silos, future-proofs your business, and offers many other benefits.

Different types of data integration


Depending on their business needs and IT infrastructures, companies opt for different types of data integration. Some choose to ingest, process, and deliver data in real time, while others might use batch integration. Let’s quickly dive into each one of those.

Batch data integration

Batch data integration involves storing all the data in a single batch and moving it at scheduled periods of time or only once a certain amount is collected. This approach is useful if you can wait to receive and analyze data.

Batch data integration, for instance, can be used for maintaining an index of company files. You don’t necessarily need an index to be refreshed each time a document is added or modified; once or twice a day should be sufficient.

Electric bills are another relevant example. Your electric consumption is collected during a month and then processed and billed at the end of that period. Banks also use batch processing, which is why some card transactions might take time to be reflected in your online banking dashboard.
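The batching behavior described above, moving records only when a size threshold is hit or a schedule fires, can be sketched in a few lines of Python. All names and thresholds here are invented for the example:

```python
import time

class BatchBuffer:
    """Accumulate records and flush them downstream as one batch,
    either when a size threshold or a time threshold is reached."""

    def __init__(self, max_size=3, max_age_seconds=3600):
        self.max_size = max_size
        self.max_age = max_age_seconds
        self.records = []
        self.started = time.monotonic()
        self.flushed = []  # stands in for the downstream target

    def add(self, record):
        self.records.append(record)
        if len(self.records) >= self.max_size or \
           time.monotonic() - self.started >= self.max_age:
            self.flush()

    def flush(self):
        if self.records:
            self.flushed.append(list(self.records))  # move the whole batch at once
            self.records = []
            self.started = time.monotonic()

buf = BatchBuffer(max_size=3)
for doc in ["a.txt", "b.txt", "c.txt", "d.txt"]:
    buf.add(doc)
# The first three records moved as one batch; "d.txt" waits for the next flush.
```

The key trade-off is visible in the last comment: anything still sitting in the buffer is invisible to consumers until the next flush, which is exactly the latency that real-time integration eliminates.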

Real-time data integration with change data capture

Real-time data integration involves processing and transferring data as soon as it’s collected. The process isn’t literally instantaneous, though. It takes a fraction of a second to transfer, transform, and analyze data using change data capture (CDC), transform-in-flight, and other technologies.

Imagine Each Event as a Change to an Entry in a Database

CDC involves tracking the database’s change logs and then turning inserts, updates, and other events into a stream of data applied to a target database. In many situations, however, data needs to be delivered in a specific format. That’s where the transform-in-flight feature comes into play as it turns data that’s in motion into a required format and enriches it with inputs from other sources. Data is delivered to the master file in a consumable form and is ready for processing.
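The CDC flow described above can be sketched in a few lines. The event shape and the dict-based target below are illustrative assumptions, not any vendor's actual API:

```python
# Minimal sketch of applying a CDC event stream to a target table,
# assuming events shaped like {"op": ..., "key": ..., "row": ...}.
# Purely illustrative; real CDC tools read the database's change log.

def apply_cdc_events(events, target):
    """Replay insert/update/delete events onto a dict-based target table."""
    for event in events:
        op, key = event["op"], event["key"]
        if op in ("insert", "update"):
            # Upsert: an update simply overwrites the current row.
            target[key] = event["row"]
        elif op == "delete":
            target.pop(key, None)
    return target

# Example: three change-log events applied to an empty target.
events = [
    {"op": "insert", "key": 1, "row": {"name": "Ada", "balance": 100}},
    {"op": "update", "key": 1, "row": {"name": "Ada", "balance": 250}},
    {"op": "insert", "key": 2, "row": {"name": "Grace", "balance": 80}},
]
table = apply_cdc_events(events, {})
# table now holds only the latest state of each row.
```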

Real-time data integration can be deployed in a range of time-sensitive use cases. Take, for example, reservation systems: when you book a vacation at your favorite hotel, its master database is automatically updated to prevent others from booking the same room. ATMs rely on the same data-processing tech: as you enter your PIN and withdraw money from a terminal, your account is automatically updated to reflect the transaction.

ETL vs ELT

ETL (extract, transform, load) is another approach to data integration and has been standard for decades. It consists of three parts:

  • Extract, the first component, involves pulling data from the source systems using database queries (JDBC, SQL) or, in the case of real-time data integration, change data capture.
  • Transform, the second component, includes processing the data so it can be consumed properly in the target system. Examples of transformation include data type mapping, reformatting data (e.g., removing special characters), and deriving aggregated values from raw data.
  • Load, the third component, involves writing the data to the target platform. This can be as simple as writing to a delimited file, or as complex as creating schemas in a database or performing merge operations in a data warehouse.
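The three steps can be sketched end to end in Python. The in-memory CSV source and the specific transformations are invented for the example:

```python
import csv
import io

# Extract: read rows from a source (a CSV string stands in for a database query).
source = "id,amount\n1, 120.5 \n2, 79.9 \n1, 30.0 \n"
rows = list(csv.DictReader(io.StringIO(source)))

# Transform: map types, strip stray whitespace, and derive an aggregate per id.
totals = {}
for row in rows:
    key = int(row["id"])                          # data type mapping
    amount = float(row["amount"].strip())         # reformatting
    totals[key] = totals.get(key, 0.0) + amount   # deriving aggregated values

# Load: write the transformed result to a delimited target (in memory here).
target = io.StringIO()
writer = csv.writer(target)
writer.writerow(["id", "total_amount"])
for key in sorted(totals):
    writer.writerow([key, f"{totals[key]:.2f}"])
```

Note that the transformation happens inside the pipeline, before anything reaches the target; that ordering is exactly what ELT, discussed next, reverses.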

ELT (extract, load, transform) re-orders the equation: the target data platform handles transformation, while the integration platform simply collects and delivers the data.
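To illustrate the re-ordering, here is a sketch using SQLite purely as a stand-in for a cloud data platform: the pipeline only extracts and loads raw rows, and the transformation runs as SQL inside the target itself.

```python
import sqlite3

# SQLite stands in for a cloud data platform, purely for illustration.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE raw_events (user_id INTEGER, amount REAL)")

# Extract + Load: deliver raw rows as-is; no transformation in the pipeline.
raw = [(1, 120.5), (2, 79.9), (1, 30.0)]
db.executemany("INSERT INTO raw_events VALUES (?, ?)", raw)

# Transform: the target platform does the work, expressed as SQL the
# analyst runs (or schedules) inside the warehouse.
result = db.execute(
    "SELECT user_id, SUM(amount) FROM raw_events GROUP BY user_id ORDER BY user_id"
).fetchall()
```

The output is the same as the ETL version; what moves is where the compute happens and who controls the transformation logic.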

There are a few factors that have led to the recent popularity of ELT:

  • The cost of compute has been optimized over time with open source tools (Spark, Hadoop) and cloud infrastructure such as AWS, Microsoft Azure, and Google Cloud.
  • Modern cloud data platforms like Snowflake and Databricks provide analysts and cloud architects with a simple user experience to analyze disparate data sources in one platform. ELT tools load raw and unstructured data into these types of data platforms so analysts can join and correlate the data.
  • ETL has increasingly become synonymous with legacy, batch data integration workloads that integrate poorly with the modern data stack.

Andreessen Horowitz’s recent paper on modern data infrastructure highlighted ELT as a core component of next-generation data stacks while referring to ETL as ‘brittle’. It’s unclear why they categorize all ETL tools as brittle, but the perception is clear: ETL has become synonymous with legacy, outdated data management practices.

However, real-time data integration modernizes ETL by using the latest paradigms to transform and correlate streaming data in-flight so it’s ready for analysis the moment it’s written to the target platform. This allows analysts to avoid data transformation headaches, reduce their cloud resource usage, and simply start analyzing their data in their platform of choice.

And real-time data processing is evolving and growing in popularity because it helps solve many difficult challenges and offers a range of benefits.

Real-time data flows allow rapid decision-making

By 2023, there will be over 5 billion internet users and 29.3 billion networked devices, each producing ever-larger amounts of different types of data. Real-time integration allows companies to act quickly on this information.

Data from on-premises and cloud-based sources can easily be fed, in real time, into cloud-based analytics built on, for instance, Kafka or comparable managed streaming services (Google Pub/Sub, AWS Kinesis, Azure Event Hubs), Snowflake, or BigQuery, providing timely insights and allowing fast decision-making.

And speed is becoming a critical resource. Detecting and blocking fraudulent credit card usage requires matching payment details with a set of predefined parameters in real time. If, in this case, data processing took hours or even minutes, fraudsters could get away with stolen funds. But real-time data integration allows banks to collect and analyze information rapidly and cancel suspicious transactions.

Companies that ship their products also need to make decisions quickly. They require up-to-date information on inventory levels so that customers don’t order out-of-stock products. Real-time data integration prevents this problem because all departments have access to continuously updated information, and customers are notified about sold-out goods.

Real-time data integration breaks down data silos

When deciding which types of data integration to use, data silos are another obstacle companies have to account for. When data sets are scattered across ERP, CRM, and other systems, they’re isolated from each other. Engineers then find it hard to connect the dots, uncover insights, and make better decisions. Fortunately, real-time data integration helps businesses break down data silos.

From relational databases and data warehouses to IoT sensors and log files, real-time data integration delivers data with sub-second latency from various sources to a new environment. Organizations then have better visibility into their processes. Hospitals, for example, can integrate their radiology units with other departments and ensure that patient imaging data is shared with all stakeholders instead of being siloed.

Real-time data integration future-proofs your business

Speed is essential in a world that produces more and more data. Annual mobile traffic alone will reach almost a zettabyte by 2022, changing the existing technologies and giving rise to new ones. Thriving in this digital revolution requires handling an array of challenges and opportunities. It also requires navigating between different types of data integration options, with real-time tech capable of future-proofing your business in many different ways.

Avoid vendor lock-in with a multi-cloud strategy

According to IBM, 81% of all enterprises have a multi-cloud strategy already laid out or in the works.
Real-time data integration allows your team to get more value from the cloud by making it possible to experiment with or adopt different technologies. You’d be able to use a broader range of cloud services and, by extension, build better applications and improve machine-learning models. And these capabilities are critical to a resilient and flexible IT architecture that underpins innovation efforts across on-premises and cloud environments.

Improving customer service ops

Your support reps can better serve customers by having data from various sources readily available. Agents with real-time access to purchase history, inventory levels, or account balances will delight customers with an up-to-the-minute understanding of their problems. Rapid data flows also allow companies to be creative with customer engagement. They can program their order management system to inform a CRM system to immediately engage customers who purchased products or services.

Better customer experiences then translate into increased revenue, profits, and brand loyalty. Almost 75% of consumers say a good experience is critical for brand loyalty, while most businesses consider customer experience a competitive differentiator vital for their survival and growth.

Optimizing business productivity

Spotting inefficiencies and taking corrective actions is another important goal for today’s companies. Manufacturers, for instance, achieve this goal by deploying various improvement methodologies, such as Lean production, Six Sigma, or Kaizen.

Whichever of those or other productivity tactics they choose, companies need access to real-time data and continuously updated dashboards. Relying on periodically refreshed data can slow down progress. Instead of tackling problems in real time, managers take a lot of time to spot problems, causing unnecessary costs and increased waste.

Therefore, the key to optimizing business productivity is collecting, transferring, and analyzing data in real time. And many companies agree with this argument. According to an IBM study, businesses expect that fast data will allow them to “make better informed decisions using insights from analytics (44%), improved data quality and consistency (39%), increased revenue (39%), and reduced operational costs (39%).”

Harnessing the power of digital transformation

 

Among different types of data integration, real-time tech is the one that allows companies to truly take their data game to the next level. No longer constrained by batch processing, businesses can innovate more, build better products, and drive profits. Harnessing the power of data will provide them with a much-needed competitive edge. And that can make all the difference between growth and stagnation as the digital revolution reshapes the world.

Definitions

Batch data integration involves storing all the data in a single batch and moving it only once a certain amount is collected or at scheduled periods of time.

Real-time data integration involves processing and transferring data as soon as it’s collected using change data capture (CDC), transform-in-flight, and other technologies.

Benefits of real-time data integration

  • Enables rapid decision-making
  • Accelerates ELT with faster loads
  • Modernizes ETL with high-throughput transformations
  • Breaks down data silos
  • Prepares teams for a multi-cloud strategy that avoids vendor lock-in
  • Improves customer experiences

Striim Overview – Real-Time Enterprise Data Integration to the Cloud

A 3 minute overview of the Striim platform and common use-cases.

Striim’s real-time enterprise data integration platform is a next-generation, cloud-based platform that can ingest real-time data from a variety of sources, including change data from enterprise databases such as Oracle and Microsoft SQL Server, and rapidly deliver it to cloud systems such as Google Cloud, Azure, and AWS.

Video Transcription:

Cloud adoption is an essential part of your digital transformation journey.

Whether you are modernizing legacy applications and databases, or have a ‘cloud first’ strategy for all new applications and analytics, you have to move data from diverse sources, much of which may reside on-premises or in other clouds, to where it needs to go in a timely and non-disruptive fashion.

How do you collect, move, process, and deliver data from existing and legacy sources to your new cloud technologies in a continuous, scalable, and reliable way? How do you address this without interruption to your business applications? 

That’s why you need Striim’s real-time data integration platform, a next-generation cloud-based platform with built-in intelligence and AI.

Striim can ingest real-time data from a variety of sources, including change data from enterprise databases such as Oracle and Microsoft SQL Server, and rapidly deliver it to your cloud systems such as Google Cloud, Azure and AWS.

While the data is moving, it’s easy to filter, transform, enrich, and correlate it using simple SQL-based transformations, to get it into the correct form for the target.

Real-time data delivery validation, monitoring, and alerts give you visibility into your continuous data pipelines: enterprise-grade real-time integration in a scalable, reliable, and secure platform.

Financial organizations are using Striim to migrate legacy enterprise databases to the cloud, without taking any downtime, through our wizards and intuitive UI.

Global delivery companies and retailers are continuously feeding data to the cloud for real-time reporting and operational intelligence.

And large-scale cloud analytics driven by continuous data from Striim is powering real-time decision-making in education and healthcare, with continuous monitoring of data flows in real time as part of the solution.

Built with the cloud in mind, Striim can scale with your workloads, and provides the high availability, reliability, and security you would expect from a mission-critical piece of your cloud transformation.

You can try Striim from our website, find us in all major cloud marketplaces, or contact us for a demo tailored to your exact use case. 

Stream to the cloud, today, with Striim.

 

Striim Migration Service to Google Cloud Tutorials

Striim Migration Service for PostgreSQL to Cloud SQL for PostgreSQL

In this tutorial you will learn how you can use Striim to migrate an on-premises PostgreSQL database to Cloud SQL for PostgreSQL in Google Cloud, through Striim’s wizard-based UI and intuitive data pipelines, with zero database downtime.

Striim Migration Service for Oracle to Cloud SQL for PostgreSQL

Migrating from Oracle to Cloud SQL in Google Cloud opens up cloud services that offer a wealth of capabilities with low management overhead and cost. But moving your existing on-premises applications to the cloud can be a challenge, especially applications built on top of on-premises deployments of databases like Oracle. In this tutorial we step you through using a database technology called change data capture to synchronize data from Oracle into a Google Cloud SQL instance.

Striim Migration Service for MySQL to Cloud SQL for MySQL

This guide provides a really quick and easy way to synchronize an on-premises instance of MySQL to Cloud SQL using Striim. You could start using the cloud database to run additional applications or do data analysis — without affecting the performance and use of your existing system.

Striim Migration Service for SQL Server to Cloud SQL for SQL Server

This guide provides a really quick and easy way to synchronize an on-premises instance of SQL Server to Cloud SQL database in Google Cloud using Striim. You could start using the cloud database to run additional applications or do data analysis — without affecting the performance and use of your existing system.

Striim Migration Service for Oracle to Cloud Spanner

In this tutorial you will learn how you can use Striim to migrate an on-premises Oracle database to Cloud Spanner, through Striim’s wizard-based UI and intuitive data pipelines, with zero database downtime.

Striim Product Demo – Oracle To Cloud Spanner

In this demo, you are going to see how you can use Striim to continuously move data from Oracle to Google Cloud Spanner. We will show you how to use Striim’s wizards and intuitive UI to build data flows; run the data flows to collect data from Oracle using Change Data Capture, and deliver it in real-time to Cloud Spanner; and see continuous monitoring of your cloud migration solution.

Video Transcription:

In this demo, you are going to see how you can use Striim to continuously move data from Oracle to Google Cloud Spanner. We will show you how to use Striim’s wizards and intuitive UI to build data flows; run the data flows to collect data from Oracle using Change Data Capture, and deliver it in real-time to Cloud Spanner; and see continuous monitoring of your cloud migration solution.

Performing streaming data integration with Striim starts with our wizards. We will select Oracle as the source, and Cloud Spanner as the target. After clicking the wizard and entering a name for our data flow, you just need to complete a few simple steps.

First, you will configure the source. Enter the necessary information to connect to the source and click on next. Don’t worry, any secure information like passwords is encrypted. The wizard will check that the connection information is correct, and that the connection has the correct privileges and supports change data capture. 

Next you select the tables that you are interested in collecting real-time data from. You can change this selection afterwards, so start with a few tables initially. Finally you need to configure the target connection information, including how the source data is mapped to target tables. 

When you complete the wizard, a data flow is created from the information you entered. You can see the source and target configuration here. To start the data flow, first deploy it to get it ready to run, then start it to begin collecting data from Oracle and delivering it to Cloud Spanner.

Initially, there is no data flowing, because we are not generating any new data in Oracle. You can see from the UI for Cloud Spanner that there is no data present in any of the target tables.

Now we will run a data generator for Oracle that creates a set of inserts, updates and deletes. You can see the data in the data flow preview window, and view the rate of data collection and delivery in the UI. We can also look at the application progress here to see a summary view of your tables. After a number of operations have been generated, we can check back with the Cloud Spanner UI and see the data in the target tables.

Of course, Striim can perform initial loads as well through similar data flows. Here we are moving a million rows from tables in Oracle to Cloud Spanner using our smart delivery pipeline. You can monitor the progress through the Striim UI, and, if we switch to the Cloud Spanner UI, you can see the data in the target.

We can also use the Striim monitor UI to look at overall metrics, and drill down to see the application statistics, and detailed information for each of the application components.

This has been a quick demo of using Striim to deliver data continuously from Oracle to Cloud Spanner. Please go to our website to try Striim yourself, find Striim in the Google Cloud Marketplace, or contact us to learn more.

 

Online Enterprise Database Migration to Google Cloud

Migrate to cloud

Migrating existing workloads to the cloud is a formidable step in the enterprise digital transformation journey. Moving an enterprise application from on-premises to the cloud, or modernizing it to make the best use of cloud-native technologies, is only part of the challenge. A major part of the task is to move the existing enterprise databases while the business continues to operate at full speed.

Pause never

How the data is extracted and loaded into the new cloud environment plays a big role in keeping the business critical systems performant. Particularly for enterprise databases supporting mission-critical applications, avoiding downtime is a must-have requirement during migrations to minimize both the risk and operational disruption.

For business-critical applications, the acceptable downtime approaches zero. All the while, moving large amounts of data and performing essential testing of those applications can take days, weeks, or even months.

Keep running your business

The best practice in enterprise database migration, to minimize or even eliminate downtime, is to use online database migration, which keeps the application running.

In the online migration, changes from the enterprise source database are captured non-intrusively as real-time data streams using Change Data Capture (CDC) technology. This capability is available for most major databases, including Oracle, Microsoft SQL Server, HPE NonStop, MySQL, PostgreSQL, MongoDB, and Amazon RDS, but has to be harnessed in the correct way.

In online database migration, you first perform an initial load of the source database to the cloud. Then, any changes that have occurred in the source database since the initial load began are applied continuously to the target cloud database from the real-time data stream. The source and target databases remain up to date until you are ready to cut over completely. You also have the option to fall back to the source at any point, further minimizing risk.
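The two-phase flow just described, an initial bulk load followed by continuous change application, can be sketched as follows. The dict-based databases and event tuples are illustrative assumptions:

```python
# Illustrative sketch of online migration: an initial bulk load, then
# continuous application of changes captured via CDC since the load began.

def initial_load(source):
    """Phase 1: bulk-copy a snapshot of the source into the target."""
    return dict(source)

def apply_changes(target, change_stream):
    """Phase 2: replay CDC events that occurred during and after the load."""
    for op, key, value in change_stream:
        if op == "delete":
            target.pop(key, None)
        else:  # insert or update behave as an upsert
            target[key] = value
    return target

source = {"acct-1": 100, "acct-2": 250}
target = initial_load(source)

# Changes that happened while the bulk load was running:
changes = [("update", "acct-1", 175), ("insert", "acct-3", 40),
           ("delete", "acct-2", None)]
target = apply_changes(target, changes)
# Target is now in sync with the live source; cutover can happen at any time.
```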

Integrate continuously

Online database migration also provides essential data integration services for new application development in the cloud. The change delivery can be kept running while you develop and test the new cloud applications. You may even choose to keep the target and source databases in sync indefinitely, typically for continuous database replication in hybrid or multi-cloud use cases.

Keep fresh

Once the real-time streaming data pipelines to the cloud are set up, businesses can easily build new applications, and seamlessly adopt new cloud services to get the most operational value from the cloud environment. Real-time streaming is a crucial element in all such data movement use cases, and it can be widely applied to hybrid or multi-cloud architectures, operational machine learning, analytics offloading, large scale cloud analytics, or any other scenario where having up-to-the-second data is essential to the business.

Change Data Capture

Striim, in strategic partnership with Google Cloud, offers online database migrations and real-time hybrid cloud data integration to Google Cloud through non-intrusive Change Data Capture (CDC) technologies. Striim enables real-time continuous data integration from on-premises and other cloud data sources to BigQuery, Cloud Spanner, and Cloud SQL for PostgreSQL, MySQL, and SQL Server, as well as Cloud Pub/Sub, Cloud Storage, and other databases running in Google Cloud.

Replicate to Google Cloud

In addition to data migration, data replication is an important use case as well. In contrast to data migration, data replication continuously replicates data from a source system to a target system “forever” without the intent to shut down the source system.

An example target system in the context of data replication is BigQuery. It is the data analytics platform of choice in Google Cloud. Striim supports continuous data streaming (replication) from an on-premises database to BigQuery in Google Cloud in case the data has to remain on-premises and cannot be migrated. Striim bridges the two worlds and makes Google Cloud data analytics accessible by supporting the hybrid environment.

Transform in flight

In many cases, data migration and continuous streaming transport the data unmodified from the source to the target systems. However, many use cases require data to be transformed to match the target systems, or enriched and combined with data from other sources to complement and complete the target data set for increased value and expressiveness in a simple and robust architecture. This method is frequently referred to as extract, transform, load, or ETL.

Striim provides a very flexible and powerful in-flight transformation and augmentation functionality in order to support use cases that go beyond simple one-time data migration.
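As a rough illustration of in-flight transformation and enrichment (Striim expresses this with SQL-based transformations; the Python below is only a sketch with invented field names), each event is reshaped and joined with reference data while in motion, before it ever touches the target:

```python
# Illustrative sketch of transform-in-flight: each event is reshaped and
# enriched with reference data as it streams by, not after landing.

region_lookup = {"s-1": "EMEA", "s-2": "APAC"}  # reference data from another source

def transform_in_flight(events, lookup):
    """Generator: yields enriched events one at a time, as they stream through."""
    for event in events:
        yield {
            "store": event["store_id"],
            "region": lookup.get(event["store_id"], "UNKNOWN"),   # enrichment
            "amount_usd": round(event["amount_cents"] / 100, 2),  # reformatting
        }

stream = [
    {"store_id": "s-1", "amount_cents": 1999},
    {"store_id": "s-2", "amount_cents": 500},
]
enriched = list(transform_in_flight(stream, region_lookup))
# Each enriched event arrives at the target already in a consumable form.
```

Because the function is a generator, nothing is buffered: each event is transformed the moment it arrives, which is what distinguishes in-flight processing from a post-load batch job.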

More to migrate? Keep replicating!

Enterprises in general have several data migration and online streaming use cases at the same time. Often data migration takes place for some source databases, while data replication is ongoing for others.

A single Striim installation can support several use cases at the same time, reducing the need for management and operational supervision. The Striim platform supports high-volume, high-velocity data with built-in validation, security, high availability, reliability, and scalability, as well as backup-driven disaster recovery, addressing enterprise requirements and operational excellence.

The following architecture shows an example where migration and online streaming are implemented at the same time. On the left, the database in the cloud is migrated to the Cloud SQL database on the right; after a successful migration, the source database is removed. In addition, the two source databases on the left, in an on-premises data center, are continuously streamed (replicated) to BigQuery for analytics and to Cloud Spanner for in-cloud processing.

Keep going

In addition, Striim, as the data migration technology, is implemented in a high-availability configuration. The three servers on Compute Engine form a cluster, with each server executing in a different zone, making the cluster highly available and protecting the migration and online streaming from zone failures or outages.

Accelerate Cloud adoption

As organizations modernize their data infrastructure, integrating mission-critical databases is essential to ensure information is accessible, valuable, and actionable. Striim and Google Cloud’s partnership supports Google customers with smooth data movement and continuous integration solutions, accelerating Google Cloud adoption and driving business growth.

Learn more

To learn more about enterprise cloud data integration, feel free to reach out to Striim and check out these references:

Google Cloud Solution Architecture: Architecting database migration and replication using Striim

Blog: Zero downtime database migration and replication to and from Cloud Spanner

Tutorial: Migrating from MySQL to BigQuery for Real-Time Data Analytics

 
